This project aims to predict an individual's salary from a dataset of job-related and demographic features. Using factors such as age, gender, education level, job title, and years of experience, it develops a regression model that estimates salary. Such a model is useful for career planning, salary negotiation, and labor market analysis.
- Dataset: `Salary Data.csv`, which contains salary information and various employee attributes.
- Size: 375 entries, 6 columns.
- Key Features: `Age`, `Gender`, `Education Level`, `Job Title`, `Years of Experience`.
- Approach:
  - Data Cleaning: Rows with `NaN` values and duplicate entries are dropped. The `Age`, `Gender`, and `Job Title` columns are also dropped, which is a significant reduction of features.
  - Exploratory Data Analysis: Histograms and box plots visualize the distribution of numerical features and their relationship with salary. Count plots cover the categorical features.
  - Label Encoding: Applied to the `Education Level` column to convert it into a numerical format.
  - Regression Task: The target variable is `Salary`.
  - Models Used:
    - A suite of regression models was trained, including Linear Regression, Ridge, XGBoost, Random Forest, AdaBoost, Gradient Boosting, and Bagging.
- Best R² Score:
  - 0.901 with Linear Regression and Ridge Regressor.
  - 0.892 with Gradient Boosting Regressor.
  - Scores near 0.90 indicate that these models explain roughly 90% of the variance in salary from the retained features alone (`Education Level` and `Years of Experience`).
- Accurate Salary Forecasting: Enables individuals to estimate their potential earnings based on their experience and education.
- Recruitment and Compensation: Assists companies in setting competitive salary ranges for different roles.
- Career Planning: Provides insights into how education and experience levels correlate with income.
- Labor Market Analysis: Supports data-driven research on salary trends and economic factors.
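
The cleaning, encoding, and model-comparison steps described in the approach can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `prepare` and `compare_models`, the 80/20 split, and the `random_state` value are assumptions, and XGBoost is left out so the sketch depends only on scikit-learn (an `XGBRegressor` could be added to the `models` dictionary in the same way).

```python
import pandas as pd
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing rows, duplicates, and the three excluded columns, then encode education."""
    df = df.dropna().drop_duplicates()
    df = df.drop(columns=["Age", "Gender", "Job Title"]).copy()
    # LabelEncoder assigns integer codes in sorted order of the category names.
    df["Education Level"] = LabelEncoder().fit_transform(df["Education Level"])
    return df


def compare_models(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42) -> dict:
    """Fit each regressor on a train split and return its R² on the held-out split."""
    X = df.drop(columns=["Salary"])
    y = df["Salary"]
    # Assumed split: the README does not state how the data was partitioned.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge": Ridge(),
        "Random Forest": RandomForestRegressor(random_state=random_state),
        "AdaBoost": AdaBoostRegressor(random_state=random_state),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
        "Bagging": BaggingRegressor(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores
```

With the dataset in the working directory, a call such as `compare_models(prepare(pd.read_csv("Salary Data.csv")))` would return a dictionary of R² scores keyed by model name. The exact values depend on the split and are not guaranteed to match the scores reported above.
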
Install the necessary libraries:

    pip install pandas numpy seaborn matplotlib scikit-learn xgboost

We welcome contributions to improve the project. You can help by:
- Re-evaluating the feature selection process, as dropping `Age`, `Gender`, and `Job Title` may remove valuable predictive information.
- Exploring more robust methods for handling missing values and duplicates.
- Performing comprehensive hyperparameter tuning and cross-validation for all regression models to maximize predictive performance.
- Adding explainability (e.g., SHAP or LIME) to understand which factors are the most significant drivers of salary.
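
As a possible starting point for the first and third suggestions, the sketch below keeps `Age`, `Gender`, and `Job Title`, one-hot encodes the categorical columns, and tunes a Ridge model with cross-validated grid search. The function name `tune_ridge`, the `alpha` grid, and the 5-fold setting are illustrative assumptions rather than part of the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["Gender", "Education Level", "Job Title"]
NUMERIC = ["Age", "Years of Experience"]


def tune_ridge(df: pd.DataFrame, cv: int = 5) -> GridSearchCV:
    """Cross-validated search over Ridge's alpha using all five features."""
    df = df.dropna().drop_duplicates()
    X = df[CATEGORICAL + NUMERIC]
    y = df["Salary"]

    preprocess = ColumnTransformer([
        # Categories unseen in a training fold are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", StandardScaler(), NUMERIC),
    ])
    pipeline = Pipeline([("preprocess", preprocess), ("model", Ridge())])

    # Hypothetical grid: the values are placeholders to be adjusted.
    search = GridSearchCV(
        pipeline,
        param_grid={"model__alpha": [0.1, 1.0, 10.0, 100.0]},
        cv=cv,
        scoring="r2",
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected `alpha` and `search.best_score_` the mean cross-validated R². Putting the encoder inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.
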