Cnair02/IMDB-DataProcessing

IMDB-DataProcessing

Title: IMDB Data Cleaning

Objective: Demonstrate the ability to ingest, clean, transform, and engineer features from the IMDB dataset using PySpark.

Skills: CSV data extraction, data cleaning, transformation and feature engineering.

Tools: PySpark, Spark DataFrames

Topics covered:

  1. Initial data analysis
  2. Cleaning and transformation
  3. Feature Engineering

Challenges:

  1. Handling columns with empty values when reading the file with an explicit Spark schema.
  2. Highly inconsistent 'score' values that needed multiple formatting steps to clean.
  3. Multiple name formats for the same country that had to be standardized.
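
The challenges above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names, the alias table, the 'country' column name, the assumption that the CSV has a header row, and the example inputs such as "8,9" or "US." are all hypothetical. The sketch reads every column as a string first (one possible way to sidestep empty values colliding with a typed schema), then normalizes scores and country names with plain Python functions wrapped as Spark UDFs.

```python
import re
from typing import Optional

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Hypothetical alias table; a real one would be built from the values
# actually observed in the dataset.
COUNTRY_ALIASES = {
    "us": "USA",
    "usa": "USA",
    "united states": "USA",
    "uk": "UK",
    "united kingdom": "UK",
}


def normalize_score(raw: Optional[str]) -> Optional[float]:
    """Turn a messy score string into a float in [0, 10], else None."""
    if raw is None:
        return None
    # Treat ',' and ':' as decimal separators (an assumption about the data).
    text = raw.strip().replace(",", ".").replace(":", ".")
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 10.0 else None


def standardize_country(raw: Optional[str]) -> Optional[str]:
    """Map known spelling variants of a country to one canonical name."""
    if raw is None:
        return None
    cleaned = raw.strip()
    key = cleaned.rstrip(".").lower()
    if not key:
        return None
    # Unknown names pass through unchanged (only trimmed).
    return COUNTRY_ALIASES.get(key, cleaned)


def read_as_strings(spark, path):
    """Read the CSV with every column as a string (no schema inference)."""
    return spark.read.option("header", True).csv(path)


def add_clean_columns(df):
    """Add 'score_clean' and 'country_clean' columns to a Spark DataFrame."""
    # Imported lazily so the pure-Python helpers above can be used and
    # tested without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType

    score_udf = F.udf(normalize_score, DoubleType())
    country_udf = F.udf(standardize_country, StringType())
    return (
        df.withColumn("score_clean", score_udf(F.col("score")))
          .withColumn("country_clean", country_udf(F.col("country")))
    )
```

Keeping the parsing logic in plain functions makes it easy to unit-test each rule before running it on the full dataset. For larger data, the same rules could likely be expressed with built-in Spark column functions instead of UDFs, which usually perform better.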

Conclusion:

  1. Loaded the CSV data file
  2. Cleaned data
  3. Engineered derived features

Automating the data cleaning, transformation, and feature engineering steps produced structured outputs ready for advanced analytics and modeling, demonstrating data engineering practices suited to real-world big data environments.
