IMDB-DataProcessing

Title: IMDB Data Cleaning

Objective: Demonstrate the ability to ingest, clean, transform, and engineer features from the IMDB dataset using PySpark.

Skills: CSV data extraction, data cleaning, transformation and feature engineering.

Tools: PySpark, Spark DataFrames

Topics covered:

Initial data analysis
Cleaning and transformation
Feature Engineering

Challenges:

Handling columns with empty values when reading the file with an explicit Spark schema.
Highly inconsistent 'score' values that needed multiple formatting steps.
Multiple name formats for the same country that had to be standardized.
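
The score and country challenges above can be illustrated with a minimal plain-Python sketch. It is not the notebook's actual code: the function names `parse_score` and `standardize_country`, the alias table, the 0 to 10 score range, and the example malformations (comma decimals, "/10" suffixes, stray punctuation) are all assumptions made for illustration. In PySpark the same rules could be applied through column expressions or a UDF.

```python
import re
from typing import Optional

# Hypothetical alias table: the mapping used in the notebook is not shown here.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
}


def parse_score(raw: Optional[str]) -> Optional[float]:
    """Extract a rating from a messy string; return None if nothing usable is found."""
    if raw is None:
        return None
    # Treat a comma as a decimal separator (assumed malformation, e.g. "8,9").
    text = raw.strip().replace(",", ".")
    # Take the first number in the string, ignoring surrounding debris.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    # Assumed valid range for a rating: 0 to 10 inclusive.
    return value if 0.0 <= value <= 10.0 else None


def standardize_country(raw: Optional[str]) -> Optional[str]:
    """Map known spelling variants to one canonical country name."""
    if raw is None:
        return None
    # Build a lookup key: lowercase, letters and spaces only, single spaces.
    key = re.sub(r"[^a-z ]", "", raw.strip().lower())
    key = re.sub(r"\s+", " ", key).strip()
    if not key:
        return None
    # Unknown names fall back to the trimmed input in title case.
    return COUNTRY_ALIASES.get(key, raw.strip().title())


if __name__ == "__main__":
    for sample in ["8,9", " 7.5/10 ", "++9.2", "n/a"]:
        print(repr(sample), "->", parse_score(sample))
    for sample in [" U.S.A. ", "united states of america", "italy"]:
        print(repr(sample), "->", standardize_country(sample))
```

Returning `None` for unparseable values keeps bad rows visible as nulls, so they can be counted or filtered later instead of being silently coerced.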

Conclusion:

Loaded the CSV data file
Cleaned the data
Engineered derived features

By automating data cleaning, transformation, and feature engineering, the project produces structured outputs that are ready for downstream analytics and modeling.
