This project cleans a messy retail transaction dataset from a cafe, `dirty_cafe_sales.csv`. The primary goal was to process the raw data by handling invalid entries, standardizing formats, and correcting data types so it is ready for accurate analysis and modeling.
- Filename: `dirty_cafe_sales.csv`
- Shape: 10,000 rows × 8 columns
- Transaction ID: A unique identifier for each transaction.
- Item: The name of the purchased product (e.g., Coffee, Cake).
- Quantity: The number of units purchased per item.
- Price Per Unit: The price of a single unit of the item.
- Total Spent: The total cost, calculated as Quantity × Price Per Unit (used as a consistency check in the sketch after this list).
- Payment Method: The mode of payment used.
- Location: The location where the purchase was made (In-store, Takeaway, or unknown).
- Transaction Date: The date of the transaction.
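Because Total Spent should equal Quantity × Price Per Unit, that relationship doubles as a handy sanity check. A minimal sketch, assuming the cleaned numeric columns produced by the steps further below:

```python
# Sanity check (run after the cleaning steps below): Total Spent should match
# Quantity * Price Per Unit on rows where all three values are present.
complete = df.dropna(subset=["Quantity", "Price Per Unit", "Total Spent"])
expected_total = complete["Quantity"] * complete["Price Per Unit"]
mismatches = complete[(expected_total - complete["Total Spent"]).abs() > 0.01]
print(f"Rows violating Total Spent = Quantity x Price Per Unit: {len(mismatches)}")
```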
The dataset was imported using the pandas library. An initial inspection revealed several data quality issues:
- Invalid Entries: Non-numeric strings like `"ERROR"` and `"UNKNOWN"` were present in numerical columns (quantified in the sketch after this list).
- Missing Values: Standard missing values (`NaN`) and non-standard placeholders (`"UNKNOWN"`) were found across various columns.
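The scale of these placeholders can be gauged right after loading the file (the load itself is shown in the next block). A minimal sketch, assuming the DataFrame `df` defined there:

```python
# Count occurrences of the known placeholder strings in each column
placeholders = ["ERROR", "UNKNOWN"]
for col in df.columns:
    count = df[col].astype(str).isin(placeholders).sum()
    print(f"{col}: {count} placeholder value(s)")
```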
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("dirty_cafe_sales.csv")

# Initial view of the data
print(df.head())
print(df.info())
```

To clean the data, invalid string placeholders in numerical columns were replaced with pandas' standard null marker, `pd.NA`. Assigning each cleaned column back to the DataFrame operates on the original object directly and avoids a `SettingWithCopyWarning`.
```python
# Define placeholder values to be treated as null
invalid_values = {"ERROR": pd.NA, "UNKNOWN": pd.NA}

# Numerical columns to clean
columns_to_clean = ["Total Spent", "Quantity", "Price Per Unit"]

# Replace invalid values in the specified columns
for col in columns_to_clean:
    df[col] = df[col].replace(invalid_values)

# Also clean the 'Location' column
df["Location"] = df["Location"].replace("UNKNOWN", pd.NA)
```

After removing the non-numeric strings, the columns were converted to their appropriate data types to enable mathematical operations.
```python
# Convert columns to numeric types, coercing any remaining invalid values to NaN
df["Total Spent"] = pd.to_numeric(df["Total Spent"], errors="coerce")
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Price Per Unit"] = pd.to_numeric(df["Price Per Unit"], errors="coerce")

# Convert transaction dates to datetime objects, coercing invalid dates to NaT
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
```

The following table summarizes the key actions taken to clean the dataset.
| Action | Status | Rationale |
|---|---|---|
| Loaded Data | ✅ | Imported the CSV file into a pandas DataFrame. |
| Standardized Nulls | ✅ | Replaced "ERROR" & "UNKNOWN" with pd.NA for consistency. |
| Corrected Data Types | ✅ | Converted columns to numeric and datetime types. |
| Handled Missing Data | ⏳ | Nulls are now standardized but not yet imputed or removed. |
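As the last table row notes, the standardized nulls still need a treatment decision. A minimal sketch of inspecting the remaining gaps, plus one possible way to handle them (dropping incomplete rows is an assumption here, not a step the project has committed to):

```python
# Count remaining missing values per column after standardization
print(df.isna().sum())

# One possible strategy: drop rows missing any of the core numeric fields
# (imputation is an alternative, depending on the analysis goals)
df_complete = df.dropna(subset=["Quantity", "Price Per Unit", "Total Spent"])
print(f"Rows retained: {len(df_complete)} of {len(df)}")
```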