🧹 Data Cleaning: Uncleaned Data Science Jobs

📁 Dataset Loading

The first step is to load the Uncleaned_DS_jobs.csv file into a pandas DataFrame. This allows us to begin the cleaning and transformation process.

import pandas as pd

# Load the dataset
df = pd.read_csv("Uncleaned_DS_jobs.csv")

🧽 Data Cleaning Steps

The following steps were performed to clean and standardize the raw data.

1. Salary Estimate Cleaning

The Salary Estimate column contained non-numeric text and symbols (e.g., "$100K-$150K (Glassdoor est.)"), making it unusable for calculations. The following cleaning steps were applied to prepare it for numeric conversion.

Actions:

Removed the "(Glassdoor est.)" substring.
Stripped common currency symbols ($) and abbreviations (K).

# Chain string replacement methods for efficiency
df["Salary Estimate"] = df["Salary Estimate"].str.replace("(Glassdoor est.)", "", regex=False) \
                                             .str.replace("K", "", regex=False) \
                                             .str.replace("$", "", regex=False)

2. Job Description Cleaning

The Job Description column included common boilerplate headers and formatting characters that add noise to the text. These were removed to improve the quality of the data for future natural language processing (NLP) tasks.

Actions:

Iterated through a list of common boilerplate terms (e.g., "Description", "Overview", "\\n") and removed them from each job description.

# List of boilerplate terms to remove
terms_to_remove = ["Description", "Overview", "JOB DESCRIPTION", "Summary", "\\n", "About Us", "JOB SUMMARY"]

# Loop through the list and remove each term
for term in terms_to_remove:
  df["Job Description"] = df["Job Description"].str.replace(term, "", regex=False)

✅ Next Steps & Improvements

To further enhance this dataset, the following steps are recommended:

Convert Salary to a Numeric Format: After cleaning, the salary column is still a string (e.g., "100-150"). It should be converted to one or more numeric columns (e.g., min_salary, max_salary, avg_salary).

# Example for creating min/max salary columns
df[['min_salary', 'max_salary']] = df['Salary Estimate'].str.split('-', expand=True)
df['min_salary'] = pd.to_numeric(df['min_salary'], errors='coerce')
df['max_salary'] = pd.to_numeric(df['max_salary'], errors='coerce')

Handle Missing Values: Now that the data is cleaned, a strategy for handling missing values should be implemented.

# Example: Drop rows where key information is missing
df.dropna(subset=["Salary Estimate", "Job Description"], inplace=True)

Add Richer Documentation: Use Markdown cells within the notebook to document the purpose and outcome of each cleaning step. This improves readability and reproducibility for anyone else viewing the analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
Uncleaned_DS_jobs.csv		Uncleaned_DS_jobs.csv
clean.ipynb		clean.ipynb
cleaned.csv		cleaned.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🧹 Data Cleaning: Uncleaned Data Science Jobs

📁 Dataset Loading

🧽 Data Cleaning Steps

1. Salary Estimate Cleaning

2. Job Description Cleaning

✅ Next Steps & Improvements

About

Uh oh!

Releases

Packages

Languages

NoahMustafa/Data-Cleaning-Using-Python-V2

Folders and files

Latest commit

History

Repository files navigation

🧹 Data Cleaning: Uncleaned Data Science Jobs

📁 Dataset Loading

🧽 Data Cleaning Steps

1. Salary Estimate Cleaning

2. Job Description Cleaning

✅ Next Steps & Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages