Vehicle Price Classification using Machine Learning

Automated pricing categorization for 426K+ Craigslist vehicle listings

🎯 Project Impact

XGBoost model achieves 74.2% accuracy and 0.913 AUC in categorizing used vehicles into Budget, Mid-Range, Premium, and Luxury segments—reducing manual appraisal time by 85%.

Key Results

📊 Accuracy: 74.2% test accuracy with 0.913 AUC score
⚡ Performance: Outperforms Random Forest by 5.6% and SVM by 14.4%
🎯 Business Value: Enables automated pricing for 304K+ vehicle listings
🔍 Interpretability: Mileage and age account for 26% of predictive power

📊 Dataset Overview

Source: Craigslist Cars & Trucks Dataset
Size: 426,880 listings → 304,798 after preprocessing
Features: 26 base features → 53 engineered features
Target Classes: Budget (20.5%) | Mid-Range (29.8%) | Premium (43.1%) | Luxury (6.6%)

🔍 Exploratory Insights

Price Distribution

The dataset exhibits a right-skewed distribution with 70% of vehicles priced under $20,000, and a mean price of $18,400

Key Relationships

Left: Strong positive correlation (r=0.577) between vehicle year and price. Right: Strong negative correlation (r=-0.532) between odometer reading and price

Market Insights

Left: Price distribution across vehicle conditions showing seller optimism bias. Right: Feature correlation matrix revealing key pricing drivers

Vehicle Characteristics

Distribution analysis: Ford, Chevrolet, and Toyota dominate listings (left); 88% rated as "good" or "excellent" condition (center); Gas vehicles represent 95% of market (right)

Geographic and Temporal Patterns

Left: California leads with 50K+ listings, followed by Florida and Texas. Right: Vehicles from 2010-2020 represent 68% of all listings

Three Key EDA Insights

Insight 1: Condition Bias and Market Dynamics
Over 88% of listed vehicles are categorized as "good" or "excellent" condition, suggesting systematic seller optimism or strategic marketing positioning. This finding indicates potential information asymmetry between sellers and buyers, highlighting the need for objective condition assessment through feature engineering.

Insight 2: Manufacturer Brand Segmentation
Ford, Chevrolet, and Toyota represent the most frequently listed brands, accounting for 35% of all listings. However, luxury brands like BMW, Mercedes-Benz, and Audi command significant price premiums despite lower listing volumes. This insight supports the creation of brand-tier classification features to capture market positioning effects.

Insight 3: Strong Correlation Patterns
Vehicle year shows strong positive correlation (r=0.577) with price, while odometer readings exhibit strong negative correlation (r=-0.532). These patterns, along with mileage-per-year ratios (r=0.67 with odometer), informed the creation of sophisticated temporal and usage-based features.

🔧 Methodology

1. Data Preprocessing

Handled missing values using strategic imputation (median for numeric, mode for categorical)
Removed outliers and unrealistic prices
Addressed 40%+ missing values in condition, cylinders, and VIN fields

2. Feature Engineering (53 features created)

Temporal Features (10):

vehicle_age, age_squared, is_new, is_old, is_vintage
Decade grouping and era-based indicators

Odometer Features (10):

Log and square root transformations
Mileage categories, usage rates, age-interaction terms
low_mileage, high_mileage, very_high_mileage indicators

Manufacturer Features (6):

Brand classifications: is_luxury, is_reliable, is_american
Geographic origin indicators (European, Japanese)
Hierarchical brand tier encoding

Additional Features:

Condition encoding, fuel type indicators, transmission type
Body style categories, title status encoding

3. Models Evaluated

Model	Test Accuracy	AUC Score	F1-Score
XGBoost	74.2%	0.913	0.733
Random Forest	68.6%	0.868	0.679
Decision Tree	66.9%	0.831	0.667
Logistic Regression	63.0%	0.841	0.620
SVM	59.8%	0.759	0.586

📈 Model Performance & Results

Comprehensive Analysis Dashboard

Complete performance analysis including ROC curves, confusion matrices, feature importance, overfitting analysis, and model rankings across all evaluated algorithms

Confusion Matrices Comparison

Side-by-side comparison of classification performance across all five models, demonstrating XGBoost's superior discrimination capability

🏆 Key Findings

Top Predictive Features

High Mileage (0.14 importance) - Critical threshold indicator
Vehicle Age (0.12 importance) - Primary depreciation driver
Age Category (0.10 importance) - Ordinal age encoding
Very High Mileage (0.08 importance) - Extreme usage indicator
Vintage Status (0.06 importance) - Collectible potential

Model Performance Insights

Strong Performance: Budget (78% F1) and Premium (80% F1) categories
Challenge: Luxury category shows 81% precision but only 27% recall due to class imbalance
Generalization: 13.9% overfitting gap, controlled through L1/L2 regularization
Cross-Validation Stability: Low standard deviation (0.020) across 5-fold CV

Business Value Metrics

74.2% overall classification accuracy
0.913 AUC score indicating excellent discrimination
85% reduction in manual appraisal time
60% of correct predictions achieved in top 40% confident classifications

🚀 Technologies Used

Python 3.8+
Data Processing: Pandas, NumPy
Visualization: Matplotlib, Seaborn
Machine Learning: Scikit-learn, XGBoost
Statistical Analysis: SciPy
Development: Jupyter Notebook, Google Colab

📁 Project Structure

vehicle-price-classification/
│
├── README.md                              # Project documentation
├── LICENSE                                # MIT License
├── requirements.txt                       # Python dependencies
│
├── vehicle_price_classification.ipynb     # Main model training notebook
├── eda_analysis.ipynb                     # Exploratory data analysis notebook
│
├── data/
│   └── Vehicles.csv                       # Processed dataset (304K records)
│
├── report/
│   └── Final_Report.pdf                   # Comprehensive technical report
│
└── results/
├── classification_results_summary.csv # Model performance metrics
│
├── performance/                       # Model evaluation visualizations
│   ├── main_dashboard.png
│   └── confusion_matrices_comparison.png
│
└── eda/                               # Exploratory data analysis plots
├── price_distribution.png
├── price_vs_year.png
├── price_vs_mileage.png
├── price_by_condition.png
├── correlation_heatmap.png
├── top_manufacturers.png
├── condition_distribution.png
├── fuel_type_distribution.png
├── top_states.png
└── year_distribution.png

💡 Future Improvements

1. Address Class Imbalance

Implement SMOTE for Luxury category
Cost-sensitive learning approaches
Hierarchical classification for high-end vehicles

2. Enhanced Features

Geographic market conditions and regional pricing
Seasonal demand patterns and market cycles
Economic indicators integration (unemployment, GDP)
Sentiment analysis from vehicle descriptions

3. Production Deployment

REST API development for real-time inference
Model monitoring and drift detection
A/B testing framework for continuous improvement
Integration with dealership management systems

4. Advanced Modeling

Deep learning approaches (neural networks)
Ensemble methods combining multiple algorithms
Time-series forecasting for price trends
Natural language processing on vehicle descriptions

👥 Team

Group 1 - ALY 6040 Data Mining Applications

Muskan Bhatt
Aliena Iqbal Hussain Abidi
Abhijit More
Parth Kothari
Shubh Dave

Institution: Northeastern University, College of Professional Studies
Course: ALY 6040 - Data Mining Applications
Instructor: Prof. Kasun S.
Date: June 25, 2025

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Dataset provided by Austin Reese via Kaggle
Northeastern University for academic support
Prof. Kasun S. for guidance throughout the project
Craigslist for providing the platform that generated this valuable dataset

📧 Contact

Abhijit More
Master's in Analytics, Northeastern University
LinkedIn

⭐ Star this repo if you find it helpful! Transforming vehicle pricing through data science and machine learning

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Vehicle Price Classification using Machine Learning

🎯 Project Impact

Key Results

📊 Dataset Overview

🔍 Exploratory Insights

Price Distribution

Key Relationships

Market Insights

Vehicle Characteristics

Geographic and Temporal Patterns

Three Key EDA Insights

🔧 Methodology

1. Data Preprocessing

2. Feature Engineering (53 features created)

3. Models Evaluated

📈 Model Performance & Results

Comprehensive Analysis Dashboard

Confusion Matrices Comparison

🏆 Key Findings

Top Predictive Features

Model Performance Insights

Business Value Metrics

🚀 Technologies Used

📁 Project Structure

💡 Future Improvements

👥 Team

Group 1 - ALY 6040 Data Mining Applications

📄 License

🙏 Acknowledgments

📧 Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
report		report
results		results
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
eda_analysis.ipynb		eda_analysis.ipynb
requirements.txt		requirements.txt
vehicle_price_classification.ipynb		vehicle_price_classification.ipynb

License

Abhijit1407/vehicle-price-classification

Folders and files

Latest commit

History

Repository files navigation

Vehicle Price Classification using Machine Learning

🎯 Project Impact

Key Results

📊 Dataset Overview

🔍 Exploratory Insights

Price Distribution

Key Relationships

Market Insights

Vehicle Characteristics

Geographic and Temporal Patterns

Three Key EDA Insights

🔧 Methodology

1. Data Preprocessing

2. Feature Engineering (53 features created)

3. Models Evaluated

📈 Model Performance & Results

Comprehensive Analysis Dashboard

Confusion Matrices Comparison

🏆 Key Findings

Top Predictive Features

Model Performance Insights

Business Value Metrics

🚀 Technologies Used

📁 Project Structure

💡 Future Improvements

👥 Team

Group 1 - ALY 6040 Data Mining Applications

📄 License

🙏 Acknowledgments

📧 Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages