data-engineering

GitHub Streamed Data

A data streaming project that analyzes GitHub repositories, leveraging Kafka, Spark, and MongoDB, with a real-time dashboard powered by Angular.

A real-time data pipeline that streams and analyzes GitHub repository data, providing insights into programming language usage and repository activity. The project uses Kafka for real-time data ingestion, Spark for processing, Spring Boot for API development, MongoDB for storage, and Angular for an interactive dashboard.

[Screenshot, 2023-06-27 01:32:18]

  • The dashboard automatically fetches new data every 30 seconds.
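
The 30-second refresh can be pictured as a simple polling loop. The sketch below is only an illustration in Python, not the Angular dashboard's code: `poll`, `fetch`, `on_data`, and `max_polls` are hypothetical names, and `fetch` stands in for whatever call retrieves the latest statistics from the API.

```python
import time


def poll(fetch, on_data, interval_s=30.0, max_polls=None, sleep=time.sleep):
    """Call fetch() repeatedly, waiting interval_s seconds between calls.

    Each result is handed to on_data. max_polls=None means poll forever;
    the sleep function is injectable so the loop can be exercised without
    actually waiting.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        on_data(fetch())
        polls += 1
        # Only wait if another poll is going to happen.
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return polls
```

Passing `sleep` as a parameter keeps the 30-second interval as the default while letting a caller substitute a no-op when trying the loop out.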

Data Pipeline

Spring Boot API

Project Description

This project aims to provide valuable insights into GitHub repositories by analyzing the usage of different programming languages and identifying the most active months in terms of repository creation. It serves as an opportunity to gain familiarity with various technologies such as Kafka, Spark, Spring Boot, MongoDB, and Angular.
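
The two analyses described above can be sketched in plain Python. This is a minimal illustration, not the project's Spark job: it assumes each record carries `language` and `created_at` fields shaped like the GitHub REST API's repository objects, and the function names and sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime


def language_counts(repos):
    """Count repositories per primary language, skipping repos with none."""
    return Counter(r["language"] for r in repos if r.get("language"))


def most_active_months(repos, top=3):
    """Return the `top` (YYYY-MM, count) pairs with the most repo creations."""
    months = Counter(
        datetime.strptime(r["created_at"], "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m")
        for r in repos
    )
    return months.most_common(top)


# Hypothetical sample records, for illustration only.
sample = [
    {"name": "a", "language": "Python", "created_at": "2023-05-02T10:00:00Z"},
    {"name": "b", "language": "Java", "created_at": "2023-05-20T08:30:00Z"},
    {"name": "c", "language": "Python", "created_at": "2023-06-01T12:00:00Z"},
    {"name": "d", "language": None, "created_at": "2023-05-11T09:15:00Z"},
]

print(language_counts(sample))
print(most_active_months(sample, top=1))
```

In a streaming setting the same counts would be maintained incrementally as new records arrive rather than recomputed over a full list.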

App Design

  • Front end using Angular.
  • Back end using Spring Boot to create a RESTful API.

Suggestions

  • Beyond the existing goals, the project could be extended with Natural Language Processing (NLP) to classify repositories by name and programming language. This would reveal the dominant themes within the GitHub ecosystem, such as NLP, Image Recognition, or General Software Development.
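
As a rough starting point for this suggestion, repositories could be bucketed with a keyword heuristic before any trained model is introduced. The sketch below is an assumption-laden illustration: it looks at repository names only, and the keyword lists, `THEME_KEYWORDS`, and `classify_repo` are hypothetical choices rather than part of the project.

```python
# Hypothetical keyword lists; a real classifier would likely be a trained model.
THEME_KEYWORDS = {
    "NLP": ("nlp", "bert", "text", "token", "translat"),
    "Image Recognition": ("image", "vision", "yolo", "ocr", "detect"),
}
DEFAULT_THEME = "General Software Development"


def classify_repo(name):
    """Return the first theme whose keywords appear in the repository name."""
    lowered = name.lower()
    for theme, keywords in THEME_KEYWORDS.items():
        # Naive substring match: cheap, but prone to false positives.
        if any(keyword in lowered for keyword in keywords):
            return theme
    return DEFAULT_THEME
```

Substring matching is crude (for example, "context" contains "text"), so this is best read as a baseline against which an actual NLP model could be compared.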

Key Features

  • Real-time GitHub data streaming using Kafka
  • Data processing and analytics with Apache Spark
  • RESTful API built with Spring Boot, backed by MongoDB for scalable data storage
  • Angular-based dashboard with automatic updates every 30 seconds

Achievements

  • Implemented a real-time data pipeline using Kafka and Spark
  • Developed a live-updating dashboard with Angular
  • Integrated MongoDB for efficient data storage