This project demonstrates a real-time data analysis pipeline for food delivery systems, leveraging AWS services for scalable and fault-tolerant data processing. The architecture ingests real-time food order data, processes it using Apache Spark Streaming on Amazon EMR, and loads it into Amazon Redshift for analytics. The processed data can then be visualized using Amazon QuickSight.
- Real-time Data Processing: Streams real-time order data through Amazon Kinesis and processes it with PySpark on an Amazon EMR cluster.
- Orchestration with Airflow: Two Amazon Managed Workflows for Apache Airflow (MWAA) DAGs manage the pipeline:
  - `create_and_load_dim` DAG: Initializes Redshift tables and loads dimensional data.
  - `submit_pyspark_streaming_job_to_emr` DAG: Triggers the Spark streaming job.
- AWS Services Integration: Utilizes a suite of AWS tools for efficient data processing:
- S3: Storage for raw and processed data.
- Redshift: Data warehousing for real-time and batch data analytics.
- Kinesis: Real-time data streaming.
- EMR: Spark cluster for processing streaming data.
- CodeBuild: CI/CD for Spark job deployments.
- QuickSight: Dashboard and analytics visualization.
- Custom Data Generator: Simulates user food orders for testing and demonstration purposes.
```
Food-Delivery-RealTime-Data-Analysis/
├── dags/
│   ├── redshift_create_tables_dag.py    # Creates Redshift tables and loads dimensional data.
│   └── spark_streaming_dag.py           # Triggers the PySpark job on EMR.
├── scripts/
│   └── orders_spark_streaming.py        # PySpark script for processing real-time data from Kinesis.
├── mock_data_generator.py               # Python script to simulate user food orders.
├── Food Delivery RealTime Data Analysis Architecture.png   # Architecture diagram.
└── README.md                            # Project documentation.
```
- Amazon S3: Stores raw and processed data.
- Amazon Redshift: Hosts dimensional and fact tables for analytics.
- Amazon Kinesis: Streams real-time order data to be processed.
- Amazon EMR: Runs Spark Streaming for processing Kinesis data.
- Amazon Managed Workflows for Apache Airflow (MWAA): Orchestrates workflows with custom DAGs.
- Amazon QuickSight: Provides insights with data visualization dashboards.
- AWS CodeBuild: Automates deployment and testing of PySpark scripts.
- Data Simulation:
  - A Python-based mock data generator simulates real-time food orders and publishes them to an Amazon Kinesis Data Stream.
- Orchestration with Airflow:
  - DAG 1 (`create_and_load_dim`):
    - Creates dimensional and fact tables in Amazon Redshift.
    - Loads static data into the dimensional tables.
  - DAG 2 (`submit_pyspark_streaming_job_to_emr`):
    - Triggers the Spark streaming job on Amazon EMR after the first DAG completes (see the DAG sketch after this list).
- Real-Time Data Processing:
  - The Spark streaming job processes micro-batches from Kinesis and loads the results into Redshift's fact table (see the streaming sketch after this list).
  - Redshift JAR Dependency: The PySpark script relies on the Redshift JDBC Driver JAR file to connect to the Redshift database. Ensure the JAR is available on the EMR cluster and referenced in the script; it can be downloaded from the Amazon Redshift JDBC Drivers page.
- Data Analytics:
  - Amazon QuickSight connects to Redshift for real-time analytics and visualization.
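For reference, here is a minimal sketch of what the second DAG might look like, assuming a recent Airflow Amazon provider package and an already-running EMR cluster. The cluster ID, bucket names, and JAR path are placeholders, not the project's actual values:

```python
# Hypothetical sketch of spark_streaming_dag.py; cluster ID, bucket, and paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

SPARK_STEPS = [{
    "Name": "orders_spark_streaming",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--jars", "s3://<your-bucket>/jars/redshift-jdbc42.jar",  # Redshift JDBC driver
            "s3://<your-bucket>/scripts/orders_spark_streaming.py",
        ],
    },
}]

with DAG(
    dag_id="submit_pyspark_streaming_job_to_emr",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered after create_and_load_dim completes
    catchup=False,
) as dag:
    submit_step = EmrAddStepsOperator(
        task_id="submit_spark_streaming_step",
        job_flow_id="<your-emr-cluster-id>",  # assumes an existing EMR cluster
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )
```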
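Likewise, a hedged sketch of the core of the streaming job, assuming the spark-sql-kinesis connector and a simplified order schema; the actual script's schema, stream name, and Redshift details will differ:

```python
# Hypothetical sketch of orders_spark_streaming.py; schema, names, and endpoints are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("orders-spark-streaming").getOrCreate()

order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("restaurant_id", StringType()),
    StructField("order_value", DoubleType()),
    StructField("order_ts", TimestampType()),
])

# Read micro-batches from Kinesis (option names vary by connector version).
raw = (spark.readStream.format("kinesis")
       .option("streamName", "food-order-stream")   # assumed stream name
       .option("region", "us-east-1")
       .option("startingPosition", "LATEST")
       .load())

# Kinesis records arrive as binary; parse the JSON payload into columns.
orders = (raw.selectExpr("CAST(data AS STRING) AS json")
          .select(from_json(col("json"), order_schema).alias("o"))
          .select("o.*"))

def write_to_redshift(batch_df, batch_id):
    """Append each micro-batch to the fact table over JDBC (driver JAR must be on the classpath)."""
    (batch_df.write.format("jdbc")
     .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev")  # assumed endpoint/database
     .option("dbtable", "fact_orders")                              # assumed fact table name
     .option("user", "<user>")
     .option("password", "<password>")
     .option("driver", "com.amazon.redshift.jdbc42.Driver")
     .mode("append")
     .save())

(orders.writeStream
 .foreachBatch(write_to_redshift)
 .option("checkpointLocation", "s3://<your-bucket>/checkpoints/orders/")  # assumed path
 .start()
 .awaitTermination())
```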
- `redshift_create_tables_dag.py`:
  - Creates Redshift tables and loads dimensional data.
- `spark_streaming_dag.py`:
  - Triggers the Spark Streaming job on EMR.
- `orders_spark_streaming.py`:
  - Reads data from Kinesis, processes it, and ingests it into Redshift in real time.
  - Redshift Connection JAR: Ensure the Redshift JDBC Driver is added to the classpath when running this script.
- `mock_data_generator.py`:
  - Generates random food order data and publishes it to the Kinesis stream (see the sketch after this list).
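As an illustration, a minimal sketch of what the generator might do with boto3; the field names, stream name, and emit rate are assumptions:

```python
# Hypothetical sketch of mock_data_generator.py; fields, stream name, and rate are assumptions.
import json
import random
import time
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

while True:
    order = {
        "order_id": str(uuid.uuid4()),
        "customer_id": f"C{random.randint(1, 500):04d}",
        "restaurant_id": f"R{random.randint(1, 50):03d}",
        "order_value": round(random.uniform(5.0, 80.0), 2),
        "order_ts": datetime.now(timezone.utc).isoformat(),
    }
    # Publish one order per record; partition by order_id to spread load across shards.
    kinesis.put_record(
        StreamName="food-order-stream",
        Data=json.dumps(order),
        PartitionKey=order["order_id"],
    )
    time.sleep(1)  # roughly one order per second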
The processed data stored in Redshift can be visualized using Amazon QuickSight. Insights such as order trends, popular items, and delivery times can be displayed in an interactive dashboard.
- VPC Endpoints: Ensure private communication between services using VPC endpoints (a boto3 sketch follows this list).
- IAM Permissions: Assign appropriate roles and permissions for Kinesis, Redshift, EMR, and S3 access.
- Security Groups: Configure security groups for controlled access across services.
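For example, an interface endpoint for Kinesis lets the EMR cluster reach the stream without traversing the public internet. A minimal sketch with boto3, where every ID and the region are placeholders for your own values:

```python
# Hypothetical sketch: interface VPC endpoint for Kinesis Data Streams; all IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.kinesis-streams",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```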
- Clone the repository:

  ```bash
  git clone https://github.com/desininja/Food-Delivery-RealTime-Data-Analysis.git
  ```
- Configure the AWS CLI with appropriate permissions.
- Deploy the Airflow DAGs:
  - Upload `redshift_create_tables_dag.py` and `spark_streaming_dag.py` to your MWAA DAGs folder.
- Download the Redshift JDBC Driver and place it on the EMR cluster classpath, or pass the S3 location of the JAR file via `--jars` when triggering the Spark job.
- Create the required AWS resources (the Kinesis stream can be created with boto3, as sketched below):
  - Kinesis Data Stream.
  - Redshift cluster and tables.
  - S3 buckets for raw and processed data.
  - EMR cluster.
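A minimal sketch for the stream, assuming a single shard and the stream name used in the examples above:

```python
# Hypothetical sketch: create the Kinesis stream; name and shard count are assumptions.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.create_stream(StreamName="food-order-stream", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="food-order-stream")
```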
- Run the mock data generator:

  ```bash
  python mock_data_generator.py
  ```
- Trigger the DAGs in Airflow to initialize the pipeline.
- Visualize the data in QuickSight by connecting it to Redshift.
- Integrate monitoring tools like CloudWatch for pipeline performance.
- Implement error handling and retry mechanisms for fault tolerance.
- Extend the pipeline to support additional data sources.
- Automate infrastructure provisioning using tools like Terraform or AWS CloudFormation.
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request for any improvements.
