This project focuses on cleaning and analyzing a dataset of employee records using Python, Pandas, NumPy, and Matplotlib.
- The dataset contains employee details such as:
  - Name
  - Age
  - City
  - Salary
  - Join Date
  - Employee ID
- Missing Data Handling
  - Filled missing values in `Age`, `Salary`, `City`, and `Name`.
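
The missing-value step could look roughly like the sketch below. The fill strategy is an assumption, since this README does not say which values were used: numeric columns get the column median and text columns get the placeholder `"Unknown"`. The sample frame and the `fill_missing` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real dataset is not shown in this README.
df = pd.DataFrame({
    "Name": ["Alice", None, "Carol"],
    "Age": [30.0, np.nan, 40.0],
    "City": ["New York", "Boston", None],
    "Salary": [50000.0, 60000.0, np.nan],
})


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gaps filled (assumed strategy, not the project's confirmed one)."""
    out = frame.copy()
    # Numeric columns: use the median, which is robust to extreme values.
    for col in ("Age", "Salary"):
        out[col] = out[col].fillna(out[col].median())
    # Text columns: use an explicit placeholder so the gap stays visible.
    for col in ("City", "Name"):
        out[col] = out[col].fillna("Unknown")
    return out


filled = fill_missing(df)
```

The median is used here rather than the mean because a few extreme salaries would otherwise skew the fill value.
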
- Standardization
  - Standardized city names (`NY` → `New York`, etc.).
  - Corrected name formatting and removed special characters.
  - Converted `Join Date` to datetime.
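
As an illustration of the standardization step, here is a minimal sketch. The `CITY_MAP` dictionary, the sample rows, and the `standardize` helper are hypothetical; only the `NY` to `New York` mapping comes from this README.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
df = pd.DataFrame({
    "Name": ["  alice smith!! ", "BOB  jones#"],
    "City": ["NY", "new york"],
    "Join Date": ["2021-03-15", "not a date"],
})

# Assumed alias table; extend it with whatever variants the data contains.
CITY_MAP = {"ny": "New York", "new york": "New York"}


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Look cities up case-insensitively; keep the trimmed original if unmapped.
    key = out["City"].str.strip().str.lower()
    out["City"] = key.map(CITY_MAP).fillna(out["City"].str.strip())
    # Drop everything except ASCII letters and whitespace, collapse runs of
    # spaces, then title-case. Note: this also strips accented letters.
    out["Name"] = (
        out["Name"]
        .str.replace(r"[^A-Za-z\s]", "", regex=True)
        .str.split()
        .str.join(" ")
        .str.title()
    )
    # Unparseable dates become NaT instead of raising an error.
    out["Join Date"] = pd.to_datetime(out["Join Date"], errors="coerce")
    return out


clean = standardize(df)
```
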
- Outliers and Validation
  - Removed unrealistic age and salary values.
  - Identified and flagged invalid emails.
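
One way to sketch the outlier and validation step is shown below. The thresholds in `AGE_RANGE` and `SALARY_RANGE`, the `Email` column name, the `Email_Valid` column, and the simplified email pattern are all assumptions; this README does not state the actual rules.

```python
import pandas as pd

# Hypothetical sample rows; an "Email" column is assumed to exist.
df = pd.DataFrame({
    "Age": [25, 150, 40, 30],
    "Salary": [50000, 60000, -10, 70000],
    "Email": ["a@example.com", "b@example.com", "c@example.com", "bad-email"],
})

AGE_RANGE = (18, 70)            # assumed plausible working ages
SALARY_RANGE = (1, 1_000_000)   # assumed plausible salary bounds
# Deliberately simple pattern; it does not cover every valid address.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def validate(frame: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows whose age and salary both fall inside the assumed ranges
    # (between() is inclusive at both ends by default).
    plausible = frame["Age"].between(*AGE_RANGE) & frame["Salary"].between(*SALARY_RANGE)
    out = frame[plausible].copy()
    # Flag rather than drop rows with malformed emails; missing emails count as invalid.
    out["Email_Valid"] = out["Email"].str.fullmatch(EMAIL_PATTERN, na=False)
    return out


validated = validate(df)
```

Flagging invalid emails instead of deleting the rows keeps the other fields available for analysis.
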
- Duplicates
  - Removed duplicate entries based on `Name` or `Employee ID`.
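
The duplicate rule can be read in more than one way. The sketch below assumes a row counts as a duplicate if either its `Name` or its `Employee ID` has already appeared, and keeps the first occurrence. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
df = pd.DataFrame({
    "Employee ID": [1, 2, 2, 4],
    "Name": ["Alice", "Bob", "Robert", "Alice"],
})

# duplicated() marks every repeat after the first occurrence.
# Combining the two masks with | treats a repeat in EITHER column as a duplicate.
is_duplicate = df.duplicated(subset=["Name"]) | df.duplicated(subset=["Employee ID"])
deduped = df[~is_duplicate]
```

If the intent is instead to drop only rows where both fields repeat together, `df.drop_duplicates(subset=["Name", "Employee ID"])` would be the stricter alternative.
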
- Feature Engineering
  - Extracted `Join_Year` from `Join Date`.
  - Created valid email formats using employee names.
  - Added a `Data_Quality_Flag` column for rows with issues.
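
The feature-engineering step might be sketched as follows. The `first.last@example.com` email scheme and the definition of an "issue" as any missing value are assumptions; `example.com` is a placeholder domain, and the `add_features` helper is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; "Join Date" is assumed to be datetime already.
df = pd.DataFrame({
    "Name": ["Alice Smith", "Bob Jones"],
    "Age": [30.0, None],
    "Join Date": pd.to_datetime(["2021-03-15", "2019-07-01"]),
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Year component of the join date.
    out["Join_Year"] = out["Join Date"].dt.year
    # Assumed scheme: lowercase name with whitespace replaced by dots.
    local_part = out["Name"].str.lower().str.replace(r"\s+", ".", regex=True)
    out["Email"] = local_part + "@example.com"
    # Assumed rule: flag a row if any source column is missing.
    out["Data_Quality_Flag"] = frame[["Name", "Age", "Join Date"]].isna().any(axis=1)
    return out


featured = add_features(df)
```
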
- Bar chart showing average age and salary.
- Join trends per year with color-coded bars.
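
The two charts listed above could be produced along these lines. This is a sketch with hypothetical sample data; the log scale, the colors, and the `viridis` colormap are illustrative choices, not the project's confirmed styling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical cleaned data for illustration only.
df = pd.DataFrame({
    "Age": [25, 35, 45],
    "Salary": [40000, 50000, 60000],
    "Join_Year": [2019, 2020, 2020],
})

fig, (ax_avg, ax_join) = plt.subplots(1, 2, figsize=(10, 4))

# Chart 1: average age and salary side by side.
averages = df[["Age", "Salary"]].mean()
ax_avg.bar(list(averages.index), list(averages.values), color=["tab:blue", "tab:orange"])
ax_avg.set_yscale("log")  # the two averages differ by orders of magnitude
ax_avg.set_title("Average age and salary")

# Chart 2: number of joiners per year, one color per bar.
joins = df["Join_Year"].value_counts().sort_index()
colors = plt.cm.viridis(np.linspace(0, 1, len(joins)))
ax_join.bar([str(year) for year in joins.index], list(joins.values), color=colors)
ax_join.set_title("Joins per year")
ax_join.set_xlabel("Join year")
ax_join.set_ylabel("Employees")

fig.tight_layout()
```

Calling `fig.savefig("employee_summary.png")` afterwards would write the figure to disk; the file name is only an example.
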
- Python
- Pandas
- NumPy
- Matplotlib
Alok Bhateshwar
GitHub: @alokbhateshwar
This project is open-source and available under the MIT License.