
# Data Cleaning: Dirty Cafe Sales Data

## Project Overview

This project cleans a messy retail transaction dataset from a cafe, dirty_cafe_sales.csv. The primary goal was to process the raw data by handling invalid entries, standardizing formats, and correcting data types to prepare it for accurate analysis and modeling.


## Dataset Description

- **Filename**: `dirty_cafe_sales.csv`
- **Shape**: 10,000 rows × 8 columns

### Columns

- **Transaction ID**: A unique identifier for each transaction.
- **Item**: The name of the purchased product (e.g., Coffee, Cake).
- **Quantity**: The number of units purchased per item.
- **Price Per Unit**: The price of a single unit of the item.
- **Total Spent**: The total cost, calculated as Quantity × Price Per Unit.
- **Payment Method**: The mode of payment used.
- **Location**: The location where the purchase was made (In-store, Takeaway, or unknown).
- **Transaction Date**: The date of the transaction.
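Because Total Spent is defined as Quantity × Price Per Unit, that relationship can double as a consistency check on the cleaned data. A minimal sketch on hypothetical sample values (not the project's actual code):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mirroring the column layout described above
sample = pd.DataFrame({
    "Quantity": [2, 3, 1],
    "Price Per Unit": [2.5, 4.0, 3.0],
    "Total Spent": [5.0, 12.0, 99.0],  # last row is deliberately inconsistent
})

# Flag rows where Total Spent disagrees with Quantity * Price Per Unit
mismatch = ~np.isclose(sample["Quantity"] * sample["Price Per Unit"],
                       sample["Total Spent"])
print(sample[mismatch])  # only the inconsistent row
```

Rows flagged this way could then be recomputed or reviewed individually.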

## Data Cleaning Steps

### 1. Importing and Initial Inspection

The dataset was imported using the pandas library. An initial inspection revealed several data quality issues:

- **Invalid Entries**: Non-numeric strings like "ERROR" and "UNKNOWN" were present in numerical columns.
- **Missing Values**: Standard missing values (NaN) and non-standard placeholders ("UNKNOWN") were found across various columns.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("dirty_cafe_sales.csv")

# Initial view of the data
print(df.head())
print(df.info())
```

### 2. Standardizing Invalid & Missing Values

To clean the data, invalid string placeholders in numerical columns were replaced with pandas' standard null marker, `pd.NA`. Assigning the result of `replace` back to each column keeps the operation on the DataFrame itself, which avoids chained-assignment pitfalls such as `SettingWithCopyWarning`.

```python
# Define values to be treated as null
invalid_values = {"ERROR": pd.NA, "UNKNOWN": pd.NA}

# List of columns to clean
columns_to_clean = ["Total Spent", "Quantity", "Price Per Unit"]

# Replace invalid values in specified columns
for col in columns_to_clean:
    df[col] = df[col].replace(invalid_values)

# Also clean the 'Location' column
df["Location"] = df["Location"].replace("UNKNOWN", pd.NA)
```
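A quick way to confirm the replacement worked is to count the resulting nulls per column. A minimal sketch on a hypothetical two-column sample (not the full dataset):

```python
import pandas as pd

# Hypothetical sample containing the placeholder strings described above
df = pd.DataFrame({
    "Quantity": ["2", "ERROR", "3"],
    "Total Spent": ["4.0", "UNKNOWN", "6.0"],
})

invalid_values = {"ERROR": pd.NA, "UNKNOWN": pd.NA}
for col in ["Quantity", "Total Spent"]:
    df[col] = df[col].replace(invalid_values)

# Each column should now report exactly one null
print(df.isna().sum())
```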

### 3. Converting Data Types

After removing non-numeric strings, the columns were converted to their appropriate data types to enable mathematical operations.

```python
# Convert columns to numeric types, coercing invalid entries to NaN
df["Total Spent"] = pd.to_numeric(df["Total Spent"], errors="coerce")
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Price Per Unit"] = pd.to_numeric(df["Price Per Unit"], errors="coerce")

# Convert transaction date to datetime objects, coercing invalid dates to NaT
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
```
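After conversion, the resulting dtypes can be verified directly. A small sketch on hypothetical data, assuming the same `errors="coerce"` approach:

```python
import pandas as pd

# Hypothetical sample with one bad value in each column
df = pd.DataFrame({
    "Quantity": ["2", None, "3"],
    "Transaction Date": ["2023-01-05", "ERROR", "2023-02-10"],
})

df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Quantity is now numeric; the unparseable date became NaT
print(df.dtypes)
```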

## Summary of Cleaning Actions

The following table summarizes the key actions taken to clean the dataset.

| Action | Status | Rationale |
| --- | --- | --- |
| Loaded Data | ✅ | Imported the CSV file into a pandas DataFrame. |
| Standardized Nulls | ✅ | Replaced "ERROR" & "UNKNOWN" with `pd.NA` for consistency. |
| Corrected Data Types | ✅ | Converted columns to numeric and datetime types. |
| Handled Missing Data | ⚠️ | Nulls are now standardized but not yet imputed or removed. |
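Since the nulls are standardized but not yet handled, one possible follow-up (an assumption about next steps, not part of the project as described) is simple median imputation for the numeric columns:

```python
import pandas as pd

# Hypothetical sample with standardized nulls in numeric columns
df = pd.DataFrame({
    "Quantity": [2.0, None, 3.0],
    "Price Per Unit": [2.5, 4.0, None],
})

# One option among several (dropping rows is another): fill each numeric
# column's nulls with that column's median
for col in ["Quantity", "Price Per Unit"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```

Median imputation is robust to outliers, but the right strategy depends on how much data is missing and why.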
