This project was developed with the goal of applying concepts learned in the Parallel Computing course to a real Big Data scenario, using the U.S. traffic accidents dataset and the Apache Spark framework.
- Project Overview
- Objective
- Repository Structure
- Dataset Used
- Technologies
- Analyses Performed
- Results and Conclusions
- Interactive Dashboard
- How to Use This Project
- Authors
- License
To evaluate the efficiency of distributed parallel processing with Spark by comparing different levels of parallelism, data sizes, and types of analysis. An interactive Power BI dashboard was also developed to visualize results and compare sequential and parallel executions.
| File | Description |
|---|---|
| `mainParallel.py` | Analysis script using Apache Spark (parallel) |
| `mainSequential.ipynb` | Notebook with sequential analysis |
| `Parallel_Results.csv` | Results from parallel execution |
| `Sequential_Results.csv` | Results from sequential execution |
| `Resultados.xlsx` | Spreadsheet organizing data for cross-analysis |
| `Dashboard.pbix` | Power BI dashboard with comparative visualizations |
- Source: Kaggle – US Accidents (March 2023)
- Over 7 million accident records in the U.S. (2016–2023)
- Attributes include location, weather, severity, and road conditions
- Apache Spark (PySpark)
- Python 3 (Pandas, Matplotlib)
- Power BI
- Amazon EMR + S3 (for cluster execution)
- Jupyter Notebook
- Accidents by state and time of day
- Relationship between weather and severity
- Frequency of intersections and traffic signals
- Extraction of accident types using regex
- Execution time, speedup, and scalability evaluation
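The regex-based extraction of accident types listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes the free-text `Description` column of the dataset is the input, and the patterns, labels, and sample sentences below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; the regexes used in the project may differ.
ACCIDENT_PATTERNS = {
    "rollover": re.compile(r"\b(rollover|overturned)\b", re.IGNORECASE),
    "multi_vehicle": re.compile(r"\bmulti[- ]vehicle\b", re.IGNORECASE),
    "lane_blocked": re.compile(r"\blanes? (is |are )?blocked\b", re.IGNORECASE),
}


def classify_description(description):
    """Return the first matching label, or 'other' when nothing matches."""
    for label, pattern in ACCIDENT_PATTERNS.items():
        if pattern.search(description):
            return label
    return "other"


def count_accident_types(descriptions):
    """Count how many descriptions fall under each label."""
    return Counter(classify_description(d) for d in descriptions)


# Made-up sample descriptions, for illustration only.
sample = [
    "Right lane blocked due to accident on I-70 Eastbound.",
    "Multi-vehicle accident on US-101 Northbound.",
    "Overturned truck on I-5 Southbound.",
    "Accident on Main St at 5th Ave.",
]
counts = count_accident_types(sample)
print(counts)
```

In a Spark job the same patterns could be applied column-wise instead of row by row; the function above only shows the classification logic.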
- Significant speedup in parallel executions compared to sequential ones
- Best performance achieved with 4 to 8 executors, depending on data fraction
- Increased overhead with too many threads for small datasets
- Visualizations demonstrate the efficiency and scalability of parallel computing
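For readers who want to recompute the comparison, the sketch below shows the standard definitions of speedup (sequential time divided by parallel time) and efficiency (speedup divided by the number of executors). The function names are hypothetical and the timings are placeholders, not measurements from this project; the real values are in `Parallel_Results.csv` and `Sequential_Results.csv`.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Speedup S = T_sequential / T_parallel."""
    if sequential_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return sequential_seconds / parallel_seconds


def efficiency(sequential_seconds, parallel_seconds, executors):
    """Efficiency E = S / p, where p is the number of executors."""
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return speedup(sequential_seconds, parallel_seconds) / executors


# Placeholder timings in seconds, NOT measurements from this project.
SEQUENTIAL_SECONDS = 120.0
PARALLEL_SECONDS_BY_EXECUTORS = {2: 80.0, 4: 40.0, 8: 30.0}

for executors, seconds in PARALLEL_SECONDS_BY_EXECUTORS.items():
    s = speedup(SEQUENTIAL_SECONDS, seconds)
    e = efficiency(SEQUENTIAL_SECONDS, seconds, executors)
    print(f"{executors} executors: speedup={s:.2f}, efficiency={e:.2f}")
```

Efficiency that falls as executors are added is the usual sign of the coordination overhead mentioned above, which tends to dominate on small data fractions.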
The Dashboard.pbix (Power BI) file includes:
- Visual comparison between sequential and parallel executions
- Heatmaps, line charts, bar charts, and radar charts
- Analysis by dataset fraction and number of executors
To run this project locally, first download the dataset from Kaggle:
- Go to the dataset page: US Accidents (Kaggle)
- Click on Download All
- Extract the file (usually named `US_Accidents_Dec23_updated.csv`) into the same directory as the Python scripts or notebooks in this repository
📌 Make sure the dataset file is in the same folder as `mainParallel.py` or `mainSequential.ipynb`.
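A quick way to confirm the placement before starting a long run is a small check like the one below. It is only a sketch: `locate_dataset` is a hypothetical helper that is not part of the repository, and the file name is the one mentioned above, so adjust it if your download is named differently.

```python
from pathlib import Path

# File name as mentioned in this README; change it if your download differs.
DATASET_NAME = "US_Accidents_Dec23_updated.csv"


def locate_dataset(script_dir, filename=DATASET_NAME):
    """Return the dataset path inside script_dir, or raise if it is missing."""
    path = Path(script_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} not found in {Path(script_dir).resolve()}")
    return path


try:
    # Checks the current working directory, assumed to hold the scripts.
    print("Dataset found at", locate_dataset(Path.cwd()))
except FileNotFoundError as error:
    print("Dataset missing:", error)
```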
You can also explore the results via Power BI:
- Download the `Dashboard.pbix` file from this repository
- Open it using Power BI Desktop
- You can refresh the visuals or explore filters and interactions as needed
✅ All analysis files and the dashboard are ready to use once the dataset is correctly placed in the working directory.

🗣️ The dashboard interface and labels are written in Brazilian Portuguese, as it was created for academic use in Brazil.
Breno Machado Barros, Vinícius de Freitas Castro, Giordanna Santos e Souza, Gabriel Ferreira Silva, and Lauane Mateus Oliveira de Moraes.
– Project developed for the Parallel Computing course – UFG
Forked from the original repository by gabrielsfg
This project is intended for educational purposes only. The use of the dataset follows the terms of the Kaggle Dataset License.