Skip to content

ML classification system achieving 74.2% accuracy on 426K+ Craigslist vehicle listings. XGBoost model with 0.913 AUC.

License

Notifications You must be signed in to change notification settings

Abhijit1407/vehicle-price-classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

11 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Vehicle Price Classification using Machine Learning

Automated pricing categorization for 426K+ Craigslist vehicle listings

Python XGBoost License

🎯 Project Impact

XGBoost model achieves 74.2% accuracy and 0.913 AUC in categorizing used vehicles into Budget, Mid-Range, Premium, and Luxury segmentsβ€”reducing manual appraisal time by 85%.

Key Results

  • πŸ“Š Accuracy: 74.2% test accuracy with 0.913 AUC score
  • ⚑ Performance: Outperforms Random Forest by 5.6% and SVM by 14.4%
  • 🎯 Business Value: Enables automated pricing for 304K+ vehicle listings
  • πŸ” Interpretability: Mileage and age account for 26% of predictive power

πŸ“Š Dataset Overview

  • Source: Craigslist Cars & Trucks Dataset
  • Size: 426,880 listings β†’ 304,798 after preprocessing
  • Features: 26 base features β†’ 53 engineered features
  • Target Classes: Budget (20.5%) | Mid-Range (29.8%) | Premium (43.1%) | Luxury (6.6%)

πŸ” Exploratory Insights

Price Distribution

Price Distribution

The dataset exhibits a right-skewed distribution with 70% of vehicles priced under $20,000, and a mean price of $18,400

Key Relationships

Left: Strong positive correlation (r=0.577) between vehicle year and price. Right: Strong negative correlation (r=-0.532) between odometer reading and price

Market Insights

Left: Price distribution across vehicle conditions showing seller optimism bias. Right: Feature correlation matrix revealing key pricing drivers

Vehicle Characteristics

Distribution analysis: Ford, Chevrolet, and Toyota dominate listings (left); 88% rated as "good" or "excellent" condition (center); Gas vehicles represent 95% of market (right)

Geographic and Temporal Patterns

Left: California leads with 50K+ listings, followed by Florida and Texas. Right: Vehicles from 2010-2020 represent 68% of all listings

Three Key EDA Insights

Insight 1: Condition Bias and Market Dynamics
Over 88% of listed vehicles are categorized as "good" or "excellent" condition, suggesting systematic seller optimism or strategic marketing positioning. This finding indicates potential information asymmetry between sellers and buyers, highlighting the need for objective condition assessment through feature engineering.

Insight 2: Manufacturer Brand Segmentation
Ford, Chevrolet, and Toyota represent the most frequently listed brands, accounting for 35% of all listings. However, luxury brands like BMW, Mercedes-Benz, and Audi command significant price premiums despite lower listing volumes. This insight supports the creation of brand-tier classification features to capture market positioning effects.

Insight 3: Strong Correlation Patterns
Vehicle year shows strong positive correlation (r=0.577) with price, while odometer readings exhibit strong negative correlation (r=-0.532). These patterns, along with mileage-per-year ratios (r=0.67 with odometer), informed the creation of sophisticated temporal and usage-based features.


πŸ”§ Methodology

1. Data Preprocessing

  • Handled missing values using strategic imputation (median for numeric, mode for categorical)
  • Removed outliers and unrealistic prices
  • Addressed 40%+ missing values in condition, cylinders, and VIN fields

2. Feature Engineering (53 features created)

Temporal Features (10):

  • vehicle_age, age_squared, is_new, is_old, is_vintage
  • Decade grouping and era-based indicators

Odometer Features (10):

  • Log and square root transformations
  • Mileage categories, usage rates, age-interaction terms
  • low_mileage, high_mileage, very_high_mileage indicators

Manufacturer Features (6):

  • Brand classifications: is_luxury, is_reliable, is_american
  • Geographic origin indicators (European, Japanese)
  • Hierarchical brand tier encoding

Additional Features:

  • Condition encoding, fuel type indicators, transmission type
  • Body style categories, title status encoding

3. Models Evaluated

Model Test Accuracy AUC Score F1-Score
XGBoost 74.2% 0.913 0.733
Random Forest 68.6% 0.868 0.679
Decision Tree 66.9% 0.831 0.667
Logistic Regression 63.0% 0.841 0.620
SVM 59.8% 0.759 0.586

πŸ“ˆ Model Performance & Results

Comprehensive Analysis Dashboard

Main Dashboard Complete performance analysis including ROC curves, confusion matrices, feature importance, overfitting analysis, and model rankings across all evaluated algorithms

Confusion Matrices Comparison

Confusion Matrices Side-by-side comparison of classification performance across all five models, demonstrating XGBoost's superior discrimination capability


πŸ† Key Findings

Top Predictive Features

  1. High Mileage (0.14 importance) - Critical threshold indicator
  2. Vehicle Age (0.12 importance) - Primary depreciation driver
  3. Age Category (0.10 importance) - Ordinal age encoding
  4. Very High Mileage (0.08 importance) - Extreme usage indicator
  5. Vintage Status (0.06 importance) - Collectible potential

Model Performance Insights

  • Strong Performance: Budget (78% F1) and Premium (80% F1) categories
  • Challenge: Luxury category shows 81% precision but only 27% recall due to class imbalance
  • Generalization: 13.9% overfitting gap, controlled through L1/L2 regularization
  • Cross-Validation Stability: Low standard deviation (0.020) across 5-fold CV

Business Value Metrics

  • 74.2% overall classification accuracy
  • 0.913 AUC score indicating excellent discrimination
  • 85% reduction in manual appraisal time
  • 60% of correct predictions achieved in top 40% confident classifications

πŸš€ Technologies Used

  • Python 3.8+
  • Data Processing: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn
  • Machine Learning: Scikit-learn, XGBoost
  • Statistical Analysis: SciPy
  • Development: Jupyter Notebook, Google Colab

πŸ“ Project Structure

vehicle-price-classification/
β”‚
β”œβ”€β”€ README.md                              # Project documentation
β”œβ”€β”€ LICENSE                                # MIT License
β”œβ”€β”€ requirements.txt                       # Python dependencies
β”‚
β”œβ”€β”€ vehicle_price_classification.ipynb     # Main model training notebook
β”œβ”€β”€ eda_analysis.ipynb                     # Exploratory data analysis notebook
β”‚
β”œβ”€β”€ data/
β”‚   └── Vehicles.csv                       # Processed dataset (304K records)
β”‚
β”œβ”€β”€ report/
β”‚   └── Final_Report.pdf                   # Comprehensive technical report
β”‚
└── results/
β”œβ”€β”€ classification_results_summary.csv # Model performance metrics
β”‚
β”œβ”€β”€ performance/                       # Model evaluation visualizations
β”‚   β”œβ”€β”€ main_dashboard.png
β”‚   └── confusion_matrices_comparison.png
β”‚
└── eda/                               # Exploratory data analysis plots
β”œβ”€β”€ price_distribution.png
β”œβ”€β”€ price_vs_year.png
β”œβ”€β”€ price_vs_mileage.png
β”œβ”€β”€ price_by_condition.png
β”œβ”€β”€ correlation_heatmap.png
β”œβ”€β”€ top_manufacturers.png
β”œβ”€β”€ condition_distribution.png
β”œβ”€β”€ fuel_type_distribution.png
β”œβ”€β”€ top_states.png
└── year_distribution.png

πŸ’‘ Future Improvements

1. Address Class Imbalance

  • Implement SMOTE for Luxury category
  • Cost-sensitive learning approaches
  • Hierarchical classification for high-end vehicles

2. Enhanced Features

  • Geographic market conditions and regional pricing
  • Seasonal demand patterns and market cycles
  • Economic indicators integration (unemployment, GDP)
  • Sentiment analysis from vehicle descriptions

3. Production Deployment

  • REST API development for real-time inference
  • Model monitoring and drift detection
  • A/B testing framework for continuous improvement
  • Integration with dealership management systems

4. Advanced Modeling

  • Deep learning approaches (neural networks)
  • Ensemble methods combining multiple algorithms
  • Time-series forecasting for price trends
  • Natural language processing on vehicle descriptions

πŸ‘₯ Team

Group 1 - ALY 6040 Data Mining Applications

Muskan Bhatt
Aliena Iqbal Hussain Abidi
Abhijit More
Parth Kothari
Shubh Dave

Institution: Northeastern University, College of Professional Studies
Course: ALY 6040 - Data Mining Applications
Instructor: Prof. Kasun S.
Date: June 25, 2025

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

Dataset provided by Austin Reese via Kaggle
Northeastern University for academic support
Prof. Kasun S. for guidance throughout the project
Craigslist for providing the platform that generated this valuable dataset

πŸ“§ Contact

Abhijit More
Master's in Analytics, Northeastern University
LinkedIn

⭐ Star this repo if you find it helpful! Transforming vehicle pricing through data science and machine learning