Title: IMDB Data Cleaning
Objective: Show the ability to ingest, clean, transform, and engineer features using PySpark on the IMDB dataset.
Skills: CSV data extraction, data cleaning, transformation, and feature engineering.
Tools: PySpark, Spark DataFrames
Topics covered:
- Initial data analysis
- Cleaning and transformation
- Feature Engineering
Challenges:
- Handling columns with empty values when reading the CSV with an explicit Spark schema.
- Highly inconsistent 'score' values that needed multiple formatting steps.
- Multiple name formats for the same country that had to be standardized.
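The 'score' and country challenges above can be sketched as two small normalization functions wrapped in Spark UDFs. This is only an illustration under stated assumptions, not the project's actual code: the function names, the alias table, the column name 'country', the output column names, and the 0 to 10 score range are all hypothetical, and the real dataset may contain formats not handled here.

```python
import re
from typing import Optional

# Hypothetical alias table; the project's real mapping is not shown in this summary.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def parse_score(raw: Optional[str]) -> Optional[float]:
    """Extract the first number from a messy score string.

    Decimal commas are treated as decimal points, surrounding characters
    (including signs) are ignored, and values outside the assumed 0-10
    range are returned as None.
    """
    if raw is None:
        return None
    text = raw.strip().replace(",", ".")
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 10.0 else None


def normalize_country(raw: Optional[str]) -> Optional[str]:
    """Map known spelling variants to one canonical country name.

    Unknown names are passed through with whitespace collapsed;
    blank input becomes None.
    """
    if raw is None:
        return None
    cleaned = " ".join(raw.split())  # trim and collapse runs of whitespace
    if not cleaned:
        return None
    key = cleaned.lower().replace(".", "")  # "U.S.A." -> "usa"
    return COUNTRY_ALIASES.get(key, cleaned)


def add_clean_columns(df):
    """Attach cleaned columns to a Spark DataFrame.

    Assumes string columns named 'score' and 'country' (assumed names).
    PySpark is imported lazily so the helpers above stay usable without it.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType

    score_udf = F.udf(parse_score, DoubleType())
    country_udf = F.udf(normalize_country, StringType())
    return (
        df.withColumn("score_clean", score_udf(F.col("score")))
          .withColumn("country_clean", country_udf(F.col("country")))
    )
```

Keeping the parsing logic in plain Python functions makes it easy to unit test without a Spark session. For large inputs, built-in column functions such as regexp_extract would likely be faster than Python UDFs, so the UDF wrapper is a readability choice rather than a performance recommendation.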
Conclusion:
- Loaded the CSV data file
- Cleaned data
- Engineered derived features
By automating data cleaning, transformation, and feature engineering, the project produced structured outputs ready for advanced analytics and modeling. It demonstrates data engineering practices suited to real-world big data environments.