The first step is to load the Uncleaned_DS_jobs.csv file into a pandas DataFrame. This allows us to begin the cleaning and transformation process.
import pandas as pd
# Load the dataset
df = pd.read_csv("Uncleaned_DS_jobs.csv")The following steps were performed to clean and standardize the raw data.
The Salary Estimate column contained non-numeric text and symbols (e.g., "$100K-$150K (Glassdoor est.)"), making it unusable for calculations. The following cleaning steps were applied to prepare it for numeric conversion.
Actions:
- Removed the
"(Glassdoor est.)"substring. - Stripped common currency symbols (
$) and abbreviations (K).
# Chain string replacement methods for efficiency
df["Salary Estimate"] = df["Salary Estimate"].str.replace("(Glassdoor est.)", "", regex=False) \
.str.replace("K", "", regex=False) \
.str.replace("$", "", regex=False)The Job Description column included common boilerplate headers and formatting characters that add noise to the text. These were removed to improve the quality of the data for future natural language processing (NLP) tasks.
Actions:
- Iterated through a list of common boilerplate terms (e.g.,
"Description","Overview","\\n") and removed them from each job description.
# List of boilerplate terms to remove
terms_to_remove = ["Description", "Overview", "JOB DESCRIPTION", "Summary", "\\n", "About Us", "JOB SUMMARY"]
# Loop through the list and remove each term
for term in terms_to_remove:
df["Job Description"] = df["Job Description"].str.replace(term, "", regex=False)To further enhance this dataset, the following steps are recommended:
-
Convert Salary to a Numeric Format: After cleaning, the salary column is still a string (e.g.,
"100-150"). It should be converted to one or more numeric columns (e.g.,min_salary,max_salary,avg_salary).# Example for creating min/max salary columns df[['min_salary', 'max_salary']] = df['Salary Estimate'].str.split('-', expand=True) df['min_salary'] = pd.to_numeric(df['min_salary'], errors='coerce') df['max_salary'] = pd.to_numeric(df['max_salary'], errors='coerce')
-
Handle Missing Values: Now that the data is cleaned, a strategy for handling missing values should be implemented.
# Example: Drop rows where key information is missing df.dropna(subset=["Salary Estimate", "Job Description"], inplace=True)
-
Add Richer Documentation: Use Markdown cells within the notebook to document the purpose and outcome of each cleaning step. This improves readability and reproducibility for anyone else viewing the analysis.