StockIO is a real-time data streaming solution designed to process and analyze stock market data using Apache Kafka and AWS services.
The application simulates stock market data, streams it into Kafka, and relies on the following AWS infrastructure:
- Hosted on AWS EC2
- Processed data is stored in Amazon S3
- Analyzed using AWS Glue and Amazon Athena
Ensure the stock market dataset is available:
- Make sure you have access to the Kaggle Stock Market Dataset with the following features:
- Columns: Index, Date, Open, High, Low, Close, Adj Close, Volume, CloseUSD
- The dataset is included in the /data folder.
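The snippet below is a minimal sketch for loading and inspecting the dataset with pandas; the CSV file name is an assumption, so adjust it to whichever file is shipped in /data.

```python
import pandas as pd

# Hypothetical file name -- use whichever CSV sits in the /data folder.
df = pd.read_csv("data/indexProcessed.csv")

# Expected columns: Index, Date, Open, High, Low, Close, Adj Close, Volume, CloseUSD
print(df.columns.tolist())
print(df.head())
```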
Implement the sleep function:
- To simulate real-time data flow into Kafka, the producer script includes a sleep call that introduces a delay between sending each data entry, mimicking live market streaming. A sketch of this producer loop follows the next step.
Execute the producer script:
- Run the script that sends data to the Kafka topic, with the sleep function applied.
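A minimal sketch of such a producer, using kafka-python with the sleep call applied, is shown below; the broker address, topic name, and CSV path are assumptions to adjust for your setup.

```python
import json
import time

import pandas as pd
from kafka import KafkaProducer

# Broker address, topic name, and CSV path below are placeholders.
producer = KafkaProducer(
    bootstrap_servers=["<EC2-PUBLIC-IP>:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("data/indexProcessed.csv")

# Send one random row at a time; the sleep call simulates a live market feed.
while True:
    record = df.sample(1).to_dict(orient="records")[0]
    producer.send("stock_prices", value=record)
    time.sleep(1)
```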
Execute the consumer script:
- Run the script that reads data from the Kafka topic and stores it in S3.
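A minimal consumer sketch using kafka-python and s3fs is shown below; the topic name, broker address, and bucket name are placeholders.

```python
import json

import s3fs
from kafka import KafkaConsumer

# Topic name, broker address, and bucket name below are placeholders.
consumer = KafkaConsumer(
    "stock_prices",
    bootstrap_servers=["<EC2-PUBLIC-IP>:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# s3fs picks up AWS credentials from the environment or the instance profile.
s3 = s3fs.S3FileSystem()

# Write each consumed record to S3 as an individual JSON object.
for count, message in enumerate(consumer):
    with s3.open(f"s3://<your-bucket>/stock_market_{count}.json", "w") as f:
        json.dump(message.value, f)
```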
The project architecture is designed to handle real-time stock market data and process it efficiently using the following components:
- Producer: Simulates stock market data and sends it to a Kafka topic.
- Kafka: Acts as the message broker to handle the stream of data.
- Consumer: Reads data from the Kafka topic and stores it in Amazon S3.
- AWS S3: Stores the processed stock market data.
- AWS Glue: Crawls the data in S3 to create a metadata catalog.
- Amazon Athena: Queries and analyzes the data stored in S3.
Launch an EC2 instance and install Kafka:
- Follow the instructions provided by Kafka to install it on your EC2 instance.
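If you prefer to script the instance launch instead of using the AWS console, a rough boto3 sketch is shown below; the region, AMI ID, key pair, and security group are all placeholders, and Kafka still has to be installed on the instance afterwards.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

# All IDs below are placeholders; substitute your own AMI, key pair, and security group.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",
    InstanceType="t2.micro",
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-xxxxxxxx"],
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```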
Start the Kafka server:
- Use a command similar to `bin/kafka-server-start.sh config/server.properties` from the Kafka installation directory.
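The producer and consumer steps assume a topic already exists; if it is not auto-created, a rough sketch of creating one with kafka-python's admin client is shown below (the broker address and topic name are placeholders).

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Broker address and topic name are placeholders; match them to your setup.
admin = KafkaAdminClient(bootstrap_servers=["<EC2-PUBLIC-IP>:9092"])
admin.create_topics([NewTopic(name="stock_prices", num_partitions=1, replication_factor=1)])
```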
Ensure the stock market dataset is available:
- Make sure you have access to the dataset required for the producer script.
Execute the producer script:
- Run the script that sends data to the Kafka topic.
Execute the consumer script:
- Run the script that reads data from the Kafka topic and stores it in S3.
Create a Glue crawler:
- Configure the Glue crawler to crawl the S3 bucket and create a metadata catalog.
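Crawlers are usually set up in the AWS console, but a rough boto3 sketch is shown below; the crawler name, IAM role, database name, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

# Crawler name, IAM role, database name, and S3 path below are placeholders.
glue.create_crawler(
    Name="stockio-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-role>",
    DatabaseName="stockio_db",
    Targets={"S3Targets": [{"Path": "s3://<your-bucket>/"}]},
)
glue.start_crawler(Name="stockio-crawler")
```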
Use Athena to query the data:
- Utilize Amazon Athena to run queries on the data stored in S3.
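Queries can be run directly from the Athena console; the boto3 sketch below shows the same idea programmatically, with the database, table, and output location as placeholders that should match your Glue catalog and S3 bucket.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is a placeholder

# Database, table, and output location below are placeholders.
response = athena.start_query_execution(
    QueryString="SELECT * FROM stockio_db.stock_data LIMIT 10;",
    QueryExecutionContext={"Database": "stockio_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)
print(response["QueryExecutionId"])
```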
Required Python packages:
- pandas
- kafka-python
- s3fs
- boto3


