Skip to content

This project focuses on cleaning and preparing a dataset of data science job postings for further analysis and machine learning tasks. The raw dataset (Uncleaned_DS_jobs.csv) contains messy and unstructured information such as salary estimates with symbols/text and job descriptions with boilerplate content.

Notifications You must be signed in to change notification settings

NoahMustafa/Data-Cleaning-Using-Python-V2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

🧹 Data Cleaning: Uncleaned Data Science Jobs

📁 Dataset Loading

The first step is to load the Uncleaned_DS_jobs.csv file into a pandas DataFrame. This allows us to begin the cleaning and transformation process.

import pandas as pd

# Load the dataset
df = pd.read_csv("Uncleaned_DS_jobs.csv")

🧽 Data Cleaning Steps

The following steps were performed to clean and standardize the raw data.

1. Salary Estimate Cleaning

The Salary Estimate column contained non-numeric text and symbols (e.g., "$100K-$150K (Glassdoor est.)"), making it unusable for calculations. The following cleaning steps were applied to prepare it for numeric conversion.

Actions:

  • Removed the "(Glassdoor est.)" substring.
  • Stripped common currency symbols ($) and abbreviations (K).
# Chain string replacement methods for efficiency
df["Salary Estimate"] = df["Salary Estimate"].str.replace("(Glassdoor est.)", "", regex=False) \
                                             .str.replace("K", "", regex=False) \
                                             .str.replace("$", "", regex=False)

2. Job Description Cleaning

The Job Description column included common boilerplate headers and formatting characters that add noise to the text. These were removed to improve the quality of the data for future natural language processing (NLP) tasks.

Actions:

  • Iterated through a list of common boilerplate terms (e.g., "Description", "Overview", "\\n") and removed them from each job description.
# List of boilerplate terms to remove
terms_to_remove = ["Description", "Overview", "JOB DESCRIPTION", "Summary", "\\n", "About Us", "JOB SUMMARY"]

# Loop through the list and remove each term
for term in terms_to_remove:
  df["Job Description"] = df["Job Description"].str.replace(term, "", regex=False)

✅ Next Steps & Improvements

To further enhance this dataset, the following steps are recommended:

  1. Convert Salary to a Numeric Format: After cleaning, the salary column is still a string (e.g., "100-150"). It should be converted to one or more numeric columns (e.g., min_salary, max_salary, avg_salary).

    # Example for creating min/max salary columns
    df[['min_salary', 'max_salary']] = df['Salary Estimate'].str.split('-', expand=True)
    df['min_salary'] = pd.to_numeric(df['min_salary'], errors='coerce')
    df['max_salary'] = pd.to_numeric(df['max_salary'], errors='coerce')
  2. Handle Missing Values: Now that the data is cleaned, a strategy for handling missing values should be implemented.

    # Example: Drop rows where key information is missing
    df.dropna(subset=["Salary Estimate", "Job Description"], inplace=True)
  3. Add Richer Documentation: Use Markdown cells within the notebook to document the purpose and outcome of each cleaning step. This improves readability and reproducibility for anyone else viewing the analysis.

About

This project focuses on cleaning and preparing a dataset of data science job postings for further analysis and machine learning tasks. The raw dataset (Uncleaned_DS_jobs.csv) contains messy and unstructured information such as salary estimates with symbols/text and job descriptions with boilerplate content.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published