Automated pricing categorization for 426K+ Craigslist vehicle listings
An XGBoost model achieves 74.2% accuracy and 0.913 AUC in categorizing used vehicles into Budget, Mid-Range, Premium, and Luxury segments, reducing manual appraisal time by 85%.
- Accuracy: 74.2% test accuracy with a 0.913 AUC score
- Performance: Outperforms Random Forest by 5.6 percentage points and SVM by 14.4 percentage points in test accuracy
- Business Value: Enables automated pricing for 304K+ vehicle listings
- Interpretability: The two most important features, a high-mileage indicator and vehicle age, account for 26% of predictive power
- Source: Craigslist Cars & Trucks Dataset
- Size: 426,880 listings → 304,798 after preprocessing
- Features: 26 base features → 53 engineered features
- Target Classes: Budget (20.5%) | Mid-Range (29.8%) | Premium (43.1%) | Luxury (6.6%)
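As a rough illustration of how a continuous price could be mapped to the four target classes, the sketch below uses `pandas.cut`. The bin edges are hypothetical placeholders chosen for the example, not the thresholds used in this project, and the `price` column name is an assumption.

```python
import pandas as pd

# Hypothetical bin edges in USD; the project's actual thresholds are not stated here.
PRICE_BINS = [0, 8_000, 18_000, 45_000, float("inf")]
PRICE_LABELS = ["Budget", "Mid-Range", "Premium", "Luxury"]


def assign_price_category(prices: pd.Series) -> pd.Series:
    """Map each price to one of four ordered segments (right-inclusive bins)."""
    return pd.cut(prices, bins=PRICE_BINS, labels=PRICE_LABELS, include_lowest=True)


listings = pd.DataFrame({"price": [3_500, 12_000, 27_500, 62_000]})
listings["price_category"] = assign_price_category(listings["price"])

# Share of listings per segment, in segment order
print(listings["price_category"].value_counts(normalize=True).sort_index())
```

Running the same `value_counts(normalize=True)` call on the full dataset is one way to check class shares like those listed above.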
The price distribution is right-skewed: 70% of vehicles are priced under $20,000, and the mean price is $18,400.
Left: Strong positive correlation (r=0.577) between vehicle year and price. Right: Strong negative correlation (r=-0.532) between odometer reading and price
Left: Price distribution across vehicle conditions showing seller optimism bias. Right: Feature correlation matrix revealing key pricing drivers
Distribution analysis: Ford, Chevrolet, and Toyota dominate listings (left); 88% rated as "good" or "excellent" condition (center); Gas vehicles represent 95% of market (right)
Left: California leads with 50K+ listings, followed by Florida and Texas. Right: Vehicles from 2010-2020 represent 68% of all listings
Insight 1: Condition Bias and Market Dynamics
Over 88% of listed vehicles are categorized as "good" or "excellent" condition, suggesting systematic seller optimism or strategic marketing positioning. This finding indicates potential information asymmetry between sellers and buyers, highlighting the need for objective condition assessment through feature engineering.
Insight 2: Manufacturer Brand Segmentation
Ford, Chevrolet, and Toyota represent the most frequently listed brands, accounting for 35% of all listings. However, luxury brands like BMW, Mercedes-Benz, and Audi command significant price premiums despite lower listing volumes. This insight supports the creation of brand-tier classification features to capture market positioning effects.
Insight 3: Strong Correlation Patterns
Vehicle year shows strong positive correlation (r=0.577) with price, while odometer readings exhibit strong negative correlation (r=-0.532). These patterns, along with mileage-per-year ratios (r=0.67 with odometer), informed the creation of sophisticated temporal and usage-based features.
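Correlations like these can be computed with `DataFrame.corr`. The sketch below runs on synthetic stand-in data, so its values will not match the figures reported above. The column names `year`, `odometer`, and `price` are assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-in data: newer vehicles cost more, higher mileage costs less.
year = rng.integers(1995, 2021, size=n)
odometer = np.clip((2021 - year) * 12_000 + rng.normal(0, 20_000, n), 0, None)
price = 1_000 * (year - 1990) - 0.03 * odometer + rng.normal(0, 5_000, n)

df = pd.DataFrame({"year": year, "odometer": odometer, "price": price})

# Pearson r of each feature against price
corr_with_price = df.corr(method="pearson")["price"].drop("price")
print(corr_with_price.round(3))
```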
- Imputed missing values (median for numeric columns, mode for categorical columns)
- Removed outliers and unrealistic prices
- Addressed 40%+ missing values in condition, cylinders, and VIN fields
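The preprocessing steps above can be sketched roughly as follows. The price bounds `min_price` and `max_price` are hypothetical values for illustration only. The column names are assumed, and the project's actual outlier rules may differ.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame, min_price: float = 500,
                   max_price: float = 150_000) -> pd.DataFrame:
    """Drop unrealistic prices, then impute missing values column by column."""
    # Hypothetical price bounds; rows outside them are treated as unrealistic.
    out = df[df["price"].between(min_price, max_price)].copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to impute from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())  # numeric: median
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # categorical: mode
    return out


raw = pd.DataFrame({
    "price": [0, 7_500, 15_000, 22_000, 1_234_567],
    "odometer": [120_000, np.nan, 60_000, 30_000, 10],
    "condition": ["good", None, "good", "good", "fair"],
})
clean = clean_listings(raw)
print(clean)
```

Filtering before imputing keeps extreme rows from influencing the median and mode. Whether the project used that order is not stated.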
Temporal Features (10):
- `vehicle_age`, `age_squared`, `is_new`, `is_old`, `is_vintage`
- Decade grouping and era-based indicators
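A minimal sketch of how the temporal features named above might be derived from a `year` column. The reference year, the age cutoffs, and the `decade` column name are assumptions made for this example, not values taken from the project.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # assumed snapshot year for the listings


def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add age-based features derived from the model year."""
    out = df.copy()
    out["vehicle_age"] = REFERENCE_YEAR - out["year"]
    out["age_squared"] = out["vehicle_age"] ** 2  # lets a model fit non-linear depreciation
    # The cutoffs below are illustrative assumptions.
    out["is_new"] = (out["vehicle_age"] <= 3).astype(int)
    out["is_old"] = (out["vehicle_age"] >= 15).astype(int)
    out["is_vintage"] = (out["vehicle_age"] >= 30).astype(int)
    out["decade"] = (out["year"] // 10) * 10  # e.g. 2004 -> 2000
    return out


sample_years = pd.DataFrame({"year": [2019, 2004, 1985]})
temporal = add_temporal_features(sample_years)
print(temporal)
```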
Odometer Features (10):
- Log and square root transformations
- Mileage categories, usage rates, age-interaction terms
- `low_mileage`, `high_mileage`, `very_high_mileage` indicators
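The odometer features could look something like the sketch below. The indicator names come from the list above. The mileage thresholds and the `log_odometer`, `sqrt_odometer`, `miles_per_year`, and `age_x_odometer` column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_odometer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add transformed, rate-based, and threshold features from the odometer reading."""
    out = df.copy()
    out["log_odometer"] = np.log1p(out["odometer"])  # log(1 + x) tolerates zero readings
    out["sqrt_odometer"] = np.sqrt(out["odometer"])
    # Usage rate; age is clipped to at least 1 to avoid division by zero.
    out["miles_per_year"] = out["odometer"] / out["vehicle_age"].clip(lower=1)
    out["age_x_odometer"] = out["vehicle_age"] * out["odometer"]  # age-interaction term
    # Thresholds below are illustrative assumptions.
    out["low_mileage"] = (out["odometer"] < 30_000).astype(int)
    out["high_mileage"] = (out["odometer"] > 100_000).astype(int)
    out["very_high_mileage"] = (out["odometer"] > 200_000).astype(int)
    return out


sample_usage = pd.DataFrame({
    "odometer": [12_000, 150_000, 240_000],
    "vehicle_age": [0, 10, 20],
})
usage = add_odometer_features(sample_usage)
print(usage)
```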
Manufacturer Features (6):
- Brand classifications: `is_luxury`, `is_reliable`, `is_american`
- Geographic origin indicators (European, Japanese)
- Hierarchical brand tier encoding
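One possible way to build the brand features is with set lookups, as sketched below. The brand groupings (especially which brands count as "reliable"), the tier scheme, and the `is_european`, `is_japanese`, and `brand_tier` column names are assumptions for this example, not the project's actual lists.

```python
import pandas as pd

# Illustrative brand groupings; the project's actual lists are not shown here.
LUXURY = {"bmw", "mercedes-benz", "audi"}
RELIABLE = {"toyota", "honda"}  # subjective grouping, assumed for the sketch
AMERICAN = {"ford", "chevrolet"}
EUROPEAN = {"bmw", "mercedes-benz", "audi"}
JAPANESE = {"toyota", "honda"}
KNOWN = LUXURY | RELIABLE | AMERICAN


def add_brand_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add binary brand indicators and an ordinal brand tier."""
    out = df.copy()
    maker = out["manufacturer"].str.lower().str.strip()
    out["is_luxury"] = maker.isin(LUXURY).astype(int)
    out["is_reliable"] = maker.isin(RELIABLE).astype(int)
    out["is_american"] = maker.isin(AMERICAN).astype(int)
    out["is_european"] = maker.isin(EUROPEAN).astype(int)
    out["is_japanese"] = maker.isin(JAPANESE).astype(int)
    # Hypothetical tiers: 2 = luxury, 1 = known mainstream, 0 = other/unknown
    out["brand_tier"] = maker.map(lambda m: 2 if m in LUXURY else 1 if m in KNOWN else 0)
    return out


sample_brands = pd.DataFrame({"manufacturer": ["Ford", "BMW", "toyota", "Unknown"]})
brands = add_brand_features(sample_brands)
print(brands)
```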
Additional Features:
- Condition encoding, fuel type indicators, transmission type
- Body style categories, title status encoding
| Model | Test Accuracy | AUC Score | F1-Score |
|---|---|---|---|
| XGBoost | 74.2% | 0.913 | 0.733 |
| Random Forest | 68.6% | 0.868 | 0.679 |
| Decision Tree | 66.9% | 0.831 | 0.667 |
| Logistic Regression | 63.0% | 0.841 | 0.620 |
| SVM | 59.8% | 0.759 | 0.586 |
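For reference, the three metrics in the table can be computed with scikit-learn as in the sketch below. It uses a synthetic four-class dataset and only two of the five model families, so the numbers it prints are unrelated to the table. The averaging choices (one-vs-rest macro AUC, weighted F1) are assumptions, since the table does not state how the multi-class scores were averaged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and four price classes
X, y = make_classification(n_samples=2_000, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1_000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)  # class probabilities, needed for AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba, multi_class="ovr"),  # assumed averaging
        "f1": f1_score(y_test, pred, average="weighted"),         # assumed averaging
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

XGBoost is left out so that the example depends only on scikit-learn. `xgboost.XGBClassifier` exposes the same `fit`, `predict`, and `predict_proba` methods and could be added to the `models` dictionary.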
Complete performance analysis including ROC curves, confusion matrices, feature importance, overfitting analysis, and model rankings across all evaluated algorithms
Side-by-side comparison of classification performance across all five models, demonstrating XGBoost's superior discrimination capability
- High Mileage (0.14 importance) - Critical threshold indicator
- Vehicle Age (0.12 importance) - Primary depreciation driver
- Age Category (0.10 importance) - Ordinal age encoding
- Very High Mileage (0.08 importance) - Extreme usage indicator
- Vintage Status (0.06 importance) - Collectible potential
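The short calculation below accumulates the importance scores listed above, assuming they are normalized so that all features sum to 1. The top two features together reach 0.26 and the top five reach 0.50. The `age_category` key is an assumed column name.

```python
# Importance scores as listed above
top_importances = {
    "high_mileage": 0.14,
    "vehicle_age": 0.12,
    "age_category": 0.10,  # assumed column name
    "very_high_mileage": 0.08,
    "is_vintage": 0.06,
}

cumulative = 0.0
running_totals = []
for name, score in top_importances.items():
    cumulative += score
    running_totals.append(cumulative)
    print(f"{name:<20} {score:.2f}  cumulative={cumulative:.2f}")

top_two = running_totals[1]
top_five = running_totals[-1]
```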
- Strong Performance: Budget (78% F1) and Premium (80% F1) categories
- Challenge: Luxury category shows 81% precision but only 27% recall due to class imbalance
- Generalization: 13.9% overfitting gap, controlled through L1/L2 regularization
- Cross-Validation Stability: Low standard deviation (0.020) across 5-fold CV
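To see how much the low Luxury recall costs, the F1 score can be derived from the precision and recall quoted above. F1 is the harmonic mean of the two, so it comes out at about 0.405, well below the Budget and Premium scores.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Luxury class: 81% precision, 27% recall
luxury_f1 = f1_from(0.81, 0.27)
print(f"Luxury F1 = {luxury_f1:.3f}")
```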
- 74.2% overall classification accuracy
- 0.913 AUC score indicating excellent discrimination
- 85% reduction in manual appraisal time
- 60% of correct predictions fall within the 40% of classifications made with the highest confidence
- Python 3.8+
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, XGBoost
- Statistical Analysis: SciPy
- Development: Jupyter Notebook, Google Colab
vehicle-price-classification/
│
├── README.md                               # Project documentation
├── LICENSE                                 # MIT License
├── requirements.txt                        # Python dependencies
│
├── vehicle_price_classification.ipynb      # Main model training notebook
├── eda_analysis.ipynb                      # Exploratory data analysis notebook
│
├── data/
│   └── Vehicles.csv                        # Processed dataset (304K records)
│
├── report/
│   └── Final_Report.pdf                    # Comprehensive technical report
│
└── results/
    ├── classification_results_summary.csv  # Model performance metrics
    │
    ├── performance/                        # Model evaluation visualizations
    │   ├── main_dashboard.png
    │   └── confusion_matrices_comparison.png
    │
    └── eda/                                # Exploratory data analysis plots
        ├── price_distribution.png
        ├── price_vs_year.png
        ├── price_vs_mileage.png
        ├── price_by_condition.png
        ├── correlation_heatmap.png
        ├── top_manufacturers.png
        ├── condition_distribution.png
        ├── fuel_type_distribution.png
        ├── top_states.png
        └── year_distribution.png
1. Address Class Imbalance
- Implement SMOTE for Luxury category
- Cost-sensitive learning approaches
- Hierarchical classification for high-end vehicles
2. Enhanced Features
- Geographic market conditions and regional pricing
- Seasonal demand patterns and market cycles
- Economic indicators integration (unemployment, GDP)
- Sentiment analysis from vehicle descriptions
3. Production Deployment
- REST API development for real-time inference
- Model monitoring and drift detection
- A/B testing framework for continuous improvement
- Integration with dealership management systems
4. Advanced Modeling
- Deep learning approaches (neural networks)
- Ensemble methods combining multiple algorithms
- Time-series forecasting for price trends
- Natural language processing on vehicle descriptions
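As a starting point for the cost-sensitive learning item under "Address Class Imbalance", the sketch below derives inverse-frequency class weights from the class shares reported in the dataset summary. It uses the same heuristic as scikit-learn's `class_weight="balanced"`, which is `n_samples / (n_classes * class_count)`. This is one possible weighting scheme, not something the project has implemented.

```python
# Class shares from the dataset summary
class_share = {
    "Budget": 0.205,
    "Mid-Range": 0.298,
    "Premium": 0.431,
    "Luxury": 0.066,
}

n_classes = len(class_share)

# weight = 1 / (n_classes * share), so rarer classes get larger weights
class_weight = {
    label: 1.0 / (n_classes * share) for label, share in class_share.items()
}

for label, weight in class_weight.items():
    print(f"{label:<10} weight = {weight:.3f}")
```

With these shares, errors on Luxury listings would count roughly 6.5 times as much as errors on Premium listings. The resulting dictionary could be expanded into per-sample weights and passed through the `sample_weight` argument of a model's `fit` method.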
Muskan Bhatt
Aliena Iqbal Hussain Abidi
Abhijit More
Parth Kothari
Shubh Dave
Institution: Northeastern University, College of Professional Studies
Course: ALY 6040 - Data Mining Applications
Instructor: Prof. Kasun S.
Date: June 25, 2025
This project is licensed under the MIT License - see the LICENSE file for details.
Dataset provided by Austin Reese via Kaggle
Northeastern University for academic support
Prof. Kasun S. for guidance throughout the project
Craigslist for providing the platform that generated this valuable dataset
Abhijit More
Master's in Analytics, Northeastern University
LinkedIn









