This project cleans a messy retail transaction dataset from a cafe, `dirty_cafe_sales.csv`. The primary goal was to process the raw data by handling invalid entries, standardizing formats, and correcting data types so it is ready for accurate analysis and modeling.
- Filename: `dirty_cafe_sales.csv`
- Shape: 10,000 rows × 8 columns
- Transaction ID: A unique identifier for each transaction.
- Item: The name of the purchased product (e.g., Coffee, Cake).
- Quantity: The number of units purchased per item.
- Price Per Unit: The price of a single unit of the item.
- Total Spent: The total cost, calculated as Quantity × Price Per Unit (used as a consistency check in the sketch after this list).
- Payment Method: The mode of payment used.
- Location: The location where the purchase was made (In-store, Takeaway, or unknown).
- Transaction Date: The date of the transaction.
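Because Total Spent should equal Quantity × Price Per Unit, that relationship doubles as a handy sanity check. A minimal sketch, assuming the cleaned numeric columns produced by the steps further below:

```python
# Sanity check (run after the cleaning steps below): Total Spent should match
# Quantity * Price Per Unit on rows where all three values are present.
complete = df.dropna(subset=["Quantity", "Price Per Unit", "Total Spent"])
expected_total = complete["Quantity"] * complete["Price Per Unit"]
mismatches = complete[(expected_total - complete["Total Spent"]).abs() > 0.01]
print(f"Rows violating Total Spent = Quantity x Price Per Unit: {len(mismatches)}")
```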
The dataset was imported using the pandas library. An initial inspection revealed several data quality issues:
- Invalid Entries: Non-numeric strings like `"ERROR"` and `"UNKNOWN"` were present in numerical columns (quantified in the sketch after this list).
- Missing Values: Standard missing values (`NaN`) and non-standard placeholders (`"UNKNOWN"`) were found across various columns.
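The scale of these placeholders can be gauged right after loading the file (the load itself is shown in the next block). A minimal sketch, assuming the DataFrame `df` defined there:

```python
# Count occurrences of the known placeholder strings in each column
placeholders = ["ERROR", "UNKNOWN"]
for col in df.columns:
    count = df[col].astype(str).isin(placeholders).sum()
    print(f"{col}: {count} placeholder value(s)")
```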
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("dirty_cafe_sales.csv")

# Initial view of the data
print(df.head())
print(df.info())
```

To clean the data, invalid string placeholders in numerical columns were replaced with pandas' standard null marker, `pd.NA`. Assigning each cleaned column back to the DataFrame operates on the original object directly and avoids a `SettingWithCopyWarning`.
```python
# Define placeholder values to be treated as null
invalid_values = {"ERROR": pd.NA, "UNKNOWN": pd.NA}

# Numerical columns to clean
columns_to_clean = ["Total Spent", "Quantity", "Price Per Unit"]

# Replace invalid values in the specified columns
for col in columns_to_clean:
    df[col] = df[col].replace(invalid_values)

# Also clean the 'Location' column
df["Location"] = df["Location"].replace("UNKNOWN", pd.NA)
```

After removing the non-numeric strings, the columns were converted to their appropriate data types to enable mathematical operations.
```python
# Convert columns to numeric types, coercing any remaining invalid values to NaN
df["Total Spent"] = pd.to_numeric(df["Total Spent"], errors="coerce")
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Price Per Unit"] = pd.to_numeric(df["Price Per Unit"], errors="coerce")

# Convert transaction dates to datetime objects, coercing invalid dates to NaT
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
```

The following table summarizes the key actions taken to clean the dataset.
| Action | Status | Rationale |
|---|---|---|
| Loaded Data | ✅ | Imported the CSV file into a pandas DataFrame. |
| Standardized Nulls | ✅ | Replaced "ERROR" & "UNKNOWN" with pd.NA for consistency. |
| Corrected Data Types | ✅ | Converted columns to numeric and datetime types. |
| Handled Missing Data | ⏳ | Nulls are now standardized but not yet imputed or removed. |
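As the last table row notes, the standardized nulls still need a treatment decision. A minimal sketch of inspecting the remaining gaps, plus one possible way to handle them (dropping incomplete rows is an assumption here, not a step the project has committed to):

```python
# Count remaining missing values per column after standardization
print(df.isna().sum())

# One possible strategy: drop rows missing any of the core numeric fields
# (imputation is an alternative, depending on the analysis goals)
df_complete = df.dropna(subset=["Quantity", "Price Per Unit", "Total Spent"])
print(f"Rows retained: {len(df_complete)} of {len(df)}")
```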