This project performs customer segmentation based on online transaction data using clustering techniques, dimensionality reduction, and visualization.
The dataset used is `int_online_tx.csv`, which contains transaction records including `CustomerID`, `InvoiceNo`, `StockCode`, `Quantity`, and `UnitPrice`.
- Clean and preprocess the transaction data.
- Engineer features representing customer behavior.
- Standardize the data for modeling.
- Apply PCA to reduce dimensionality.
- Perform clustering using KMeans.
- Visualize the customer segments and key statistics.
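
The cleaning and feature-engineering steps can be sketched on a tiny made-up sample instead of the real `int_online_tx.csv`. This is an illustration under assumptions, not the project's actual code: it assumes `Sales` is computed as `Quantity * UnitPrice`, and the feature names (`total_transactions`, `total_products`, `total_sales`, `total_quantity`) are hypothetical.

```python
import pandas as pd

# Tiny made-up sample using the columns named in this README (not real data).
tx = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, None, 2.0, 3.0],
    "InvoiceNo": ["A1", "A1", "B1", "C1", "B2", "D1"],
    "StockCode": ["p1", "p2", "p1", "p3", "p1", "p2"],
    "Quantity": [2, 1, 5, 3, 1, 4],
    "UnitPrice": [3.0, 10.0, 3.0, 2.0, 3.0, 10.0],
})

# Cleaning: drop rows without a CustomerID, then derive a Sales column.
# Assumption: Sales = Quantity * UnitPrice.
clean = tx.dropna(subset=["CustomerID"]).copy()
clean["Sales"] = clean["Quantity"] * clean["UnitPrice"]

# Feature engineering: aggregate to one row per customer.
# The feature names below are hypothetical.
features = clean.groupby("CustomerID").agg(
    total_transactions=("InvoiceNo", "nunique"),  # distinct invoices
    total_products=("StockCode", "nunique"),      # distinct products bought
    total_sales=("Sales", "sum"),
    total_quantity=("Quantity", "sum"),
)
print(features)
```

With the real data, `tx` would instead come from `pd.read_csv("int_online_tx.csv")`.
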
- Data Cleaning: Remove rows with missing `CustomerID` values and create a `Sales` column.
- Feature Engineering: Aggregate per-customer features such as total transactions, products bought, total sales, and quantities.
- Standardization: Standardize features using `StandardScaler`.
- Dimensionality Reduction: Apply PCA to reduce to 2 principal components.
- Clustering: Use KMeans with 4 clusters.
- Visualization:
  - PCA scatterplot of clusters.
  - Boxplot of total sales by cluster.
  - Count plot of customers in each cluster.
  - Correlation heatmap of engineered features.
Requirements:
- Python 3.x
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
Install requirements using:

    pip install -r requirements.txt