Parallel Computing – Comparative Performance Analysis Using Apache Spark

This project applies concepts from the Parallel Computing course to a real Big Data scenario, using the U.S. traffic accidents dataset and the Apache Spark framework.

Summary

Project Overview
Objective
Repository Structure
Dataset Used
Technologies
Analyses Performed
Results and Conclusions
Interactive Dashboard
How to Use This Project
Authors
License

📌 Objective

To evaluate the efficiency of distributed parallel processing with Spark by comparing different levels of parallelism, data sizes, and types of analysis. An interactive Power BI dashboard was also developed to visualize results and compare sequential and parallel executions.

🗂️ Repository Structure

| File | Description |
| --- | --- |
| `mainParallel.py` | Analysis script using Apache Spark (parallel) |
| `mainSequential.ipynb` | Notebook with sequential analysis |
| `Parallel_Results.csv` | Results from parallel execution |
| `Sequential_Results.csv` | Results from sequential execution |
| `Resultados.xlsx` | Spreadsheet organizing data for cross-analysis |
| `Dashboard.pbix` | Power BI dashboard with comparative visualizations |

📊 Dataset Used

Source: Kaggle – US Accidents (March 2023)
Over 7 million accident records in the U.S. (2016–2023)
Attributes include location, weather, severity, and road conditions

⚙️ Technologies

Apache Spark (PySpark)
Python 3 (Pandas, Matplotlib)
Power BI
Amazon EMR + S3 (for cluster execution)
Jupyter Notebook

📈 Analyses Performed

Accidents by state and time of day
Relationship between weather and severity
Frequency of intersections and traffic signals
Extraction of accident types using regex
Execution time, speedup, and scalability evaluation
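
The regex-based extraction of accident types can be illustrated with a minimal sketch in plain Python. The labels and patterns below are hypothetical and chosen only for illustration; the expressions actually used in `mainParallel.py` and `mainSequential.ipynb` are not shown in this README. The sketch assumes each record carries a free-text description.

```python
import re
from collections import Counter

# Hypothetical labels and patterns, for illustration only.
# The project's real expressions may differ.
ACCIDENT_PATTERNS = {
    "rollover": re.compile(r"\b(roll(ed)?\s?over|overturned)\b", re.IGNORECASE),
    "multi_vehicle": re.compile(r"\bmulti[- ]vehicle\b", re.IGNORECASE),
    "lane_blocked": re.compile(r"\blanes?\s+blocked\b", re.IGNORECASE),
}


def classify(description):
    """Return the sorted labels whose pattern matches the description."""
    return sorted(
        label
        for label, pattern in ACCIDENT_PATTERNS.items()
        if pattern.search(description)
    )


def count_types(descriptions):
    """Count how many descriptions match each label."""
    counts = Counter()
    for text in descriptions:
        counts.update(classify(text))
    return counts
```

In a Spark job, the same patterns could be applied per row over the description column, so the matching logic stays identical between the sequential and parallel versions.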

🧪 Results and Conclusions

Significant speedup in parallel executions compared to sequential ones
Best performance achieved with 4 to 8 executors, depending on data fraction
For small datasets, using too many threads increases overhead and reduces the gains from parallelism
Visualizations demonstrate the efficiency and scalability of parallel computing
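
Speedup and efficiency, the metrics behind these conclusions, follow the usual definitions: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of workers. The sketch below shows the arithmetic. The timings are placeholder values invented for illustration, not measurements from `Parallel_Results.csv` or `Sequential_Results.csv`.

```python
def speedup(t_sequential, t_parallel):
    """Speedup S = T_sequential / T_parallel."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, workers):
    """Efficiency E = S / workers (1.0 means perfect linear scaling)."""
    return speedup(t_sequential, t_parallel) / workers


# Placeholder timings in seconds (NOT the project's measurements).
sequential_time = 120.0
parallel_times = {2: 70.0, 4: 40.0, 8: 30.0}  # executors -> elapsed time

for executors, elapsed in parallel_times.items():
    s = speedup(sequential_time, elapsed)
    e = efficiency(sequential_time, elapsed, executors)
    print(f"{executors} executors: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency that drops as executors are added is the typical sign of the overhead described above: each extra worker contributes less than the one before it.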

📊 Interactive Dashboard

The Dashboard.pbix (Power BI) file includes:

Visual comparison between sequential and parallel executions
Heatmaps, line charts, bar charts, and radar charts
Analysis by dataset fraction and number of executors

🚀 How to Use This Project

1. Download the Dataset

To run this project locally, first download the dataset from Kaggle:

Go to the dataset page: US Accidents (Kaggle)
Click on Download All
Extract the file (usually named US_Accidents_Dec23_updated.csv) into the same directory as the Python scripts or notebooks in this repository

📌 Make sure the dataset file is in the same folder as mainParallel.py or mainSequential.ipynb
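
Before running the analysis, it may help to confirm that the dataset is where the scripts expect it. The helper below is a hypothetical convenience, not part of the repository; the function name `find_dataset` and the glob pattern `US_Accidents*.csv` are assumptions based on the file name mentioned above.

```python
from pathlib import Path


def find_dataset(directory=".", pattern="US_Accidents*.csv"):
    """Return the first CSV in `directory` matching `pattern`.

    Hypothetical helper: the pattern is an assumption based on the
    file name mentioned in this README.
    """
    base = Path(directory)
    matches = sorted(base.glob(pattern))
    if not matches:
        raise FileNotFoundError(
            f"No file matching {pattern!r} found in {base.resolve()}"
        )
    return matches[0]
```

Calling `find_dataset()` from the repository folder returns the path to the CSV, or raises `FileNotFoundError` with the directory that was searched.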

2. Open and Run the Dashboard

You can also explore the results via Power BI:

Download the Dashboard.pbix file from this repository
Open it using Power BI Desktop
You can refresh the visuals or explore filters and interactions as needed

✅ All analysis files and the dashboard are ready to use once the dataset is correctly placed in the working directory.

🗣️ The dashboard interface and labels are written in Brazilian Portuguese, as it was created for academic use in Brazil.

👨‍💻 Authors

Breno Machado Barros, Vinícius de Freitas Castro, Giordanna Santos e Souza, Gabriel Ferreira Silva, and Lauane Mateus Oliveira de Moraes.

– Project developed for the Parallel Computing course – UFG

Forked from the original repository by gabrielsfg

📝 License

This project is intended for educational purposes only. The use of the dataset follows the terms of the Kaggle Dataset License.
