A machine learning project that predicts Formula 1 race winners using historical data from 1950 to 2024. The model uses 52+ engineered features drawn from multiple data sources, including race results, qualifying times, pit stops, lap times, sprint races, and championship standings.
This project builds and compares 10 machine learning models for predicting Formula 1 race winners. The system automatically selects the best-performing model based on its evaluation metrics: accuracy, F1-score, precision, recall, and ROC-AUC.
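As a rough illustration of why several metrics are reported rather than accuracy alone, here is a minimal pure-Python sketch of accuracy, precision, recall, and F1 for the "winner" class. The function name `binary_metrics` and the toy labels are assumptions made for this example; the project itself may compute these with a library such as scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (1 = race winner)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for a single 10-driver race: exactly one winner.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that never predicts a winner still scores high accuracy but zero recall and F1.
never_wins = [0] * 10
metrics = binary_metrics(y_true, never_wins)
```

Because only one driver wins each race, the classes are heavily imbalanced, so precision, recall, and F1 are usually more informative than accuracy when comparing models.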
- Comparison of 10 Machine Learning Models: Random Forest, XGBoost, Gradient Boosting, Extra Trees, AdaBoost, Decision Tree, Logistic Regression, KNN, Naive Bayes, and SVM
- 52+ Engineered Features: Comprehensive feature engineering from 12 different data sources
- Time-Based Validation: Proper temporal split to prevent data leakage
- Automatic Best Model Selection: Identifies the optimal model based on test performance
- Real-World Application: Can predict winners for future races given driver and race data
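The time-based split and the automatic model selection listed above can be sketched as follows. This is a simplified illustration, not the project's actual code: the helper names `temporal_split` and `select_best_model`, the `year` field, the cutoff year, and the scores are all hypothetical placeholders.

```python
def temporal_split(rows, cutoff_year):
    """Train on races before cutoff_year, test on races from cutoff_year onward.

    Splitting by time (instead of shuffling) keeps future races out of training.
    """
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test


def select_best_model(scores, metric="f1"):
    """Return the name of the model with the highest value for `metric`."""
    return max(scores, key=lambda name: scores[name][metric])


# Hypothetical rows: one per season, just to show the split boundaries.
rows = [{"year": y} for y in (2019, 2020, 2021, 2022, 2023, 2024)]
train, test = temporal_split(rows, cutoff_year=2023)

# Made-up placeholder scores, not results from this project.
scores = {
    "model_a": {"f1": 0.50, "accuracy": 0.95},
    "model_b": {"f1": 0.60, "accuracy": 0.94},
}
best = select_best_model(scores, metric="f1")
```

Which metric drives the selection is a design choice; this sketch assumes F1, but the same helper works for any metric present in the score dictionaries.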
The dataset is the Kaggle "Formula 1 World Championship (1950-2024)" dataset, which is compiled from the Ergast API.
The model integrates the following CSV files:
- circuits.csv - Circuit information (location, country, coordinates)
- constructor_results.csv - Constructor race results
- constructor_standings.csv - Constructor championship standings
- constructors.csv - Constructor/team details
- driver_standings.csv - Driver championship standings
- drivers.csv - Driver information (nationality, DOB)
- lap_times.csv - Lap-by-lap timing data
- pit_stops.csv - Pit stop strategy and duration
- qualifying.csv - Qualifying session times (Q1, Q2, Q3)
- races.csv - Race schedule and metadata
- results.csv - Final race results and positions
- sprint_results.csv - Sprint race results
- status.csv - Race finish status (completed, DNF, etc.)
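To show how these files might fit together, the sketch below joins a tiny stand-in for `results.csv` with a stand-in for `races.csv` and derives a binary `winner` label. It assumes the column names `raceId`, `driverId`, `constructorId`, `grid`, `positionOrder`, `year`, and `circuitId`; the inline rows are invented for illustration, and the real pipeline may well use pandas rather than the `csv` module.

```python
import csv
import io

# Tiny inline stand-ins for races.csv and results.csv (invented rows).
RACES_CSV = """raceId,year,circuitId
1,2023,10
2,2024,10
"""

RESULTS_CSV = """raceId,driverId,constructorId,grid,positionOrder
1,100,5,1,1
1,101,6,2,2
2,100,5,3,2
2,101,6,1,1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def join_results_with_races(results, races):
    """Attach race metadata to each result row and add a 0/1 winner label."""
    races_by_id = {race["raceId"]: race for race in races}
    joined = []
    for res in results:
        race = races_by_id.get(res["raceId"])
        if race is None:
            continue  # skip results that reference an unknown race
        row = {**race, **res}
        row["winner"] = 1 if res["positionOrder"] == "1" else 0
        joined.append(row)
    return joined


table = join_results_with_races(load_rows(RESULTS_CSV), load_rows(RACES_CSV))
```

The other files (qualifying, pit stops, lap times, standings, and so on) would presumably be merged in the same way, keyed on the race and driver or constructor identifiers.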
- Starting grid position
- Qualifying position (final)
- Q1, Q2, Q3 lap times
- Best qualifying time
- Sprint race position
- Sprint points earned
- Sprint grid position
- Sprint race participation flag
- Average lap time
- Lap time standard deviation (consistency)
- Fastest/slowest lap times
- Total laps completed
- Lap time consistency metric
- Number of pit stops
- Average pit stop duration
- Total time lost in pits
- Fastest/slowest pit stop
- Previous wins, podiums, top 5, top 10 finishes
- Total career points
- Average finishing position
- Recent form (last 5 and 10 races)
- Win rate and podium rate
- DNF (Did Not Finish) rate
- Average starting grid position
- Average qualifying position
- Team's previous wins and podiums
- Team's total points
- Team's average finishing position
- Team's recent form
- Team win rate and podium rate
- Driver's races at specific circuit
- Driver's wins at specific circuit
- Driver's podiums at specific circuit
- Driver's average position at circuit
- Circuit-specific win rate
- Championship points before race
- Championship position before race
- Championship wins before race
- Driver ID (encoded)
- Constructor ID (encoded)
- Circuit ID (encoded)
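Several of the features above (previous wins, win rate, recent form) describe a driver's record before a given race. One way to compute them without leaking the current race's result is to walk through the races in chronological order and update the running totals only after the features for that race have been recorded. The sketch below is an assumption about how this could be done, not the project's implementation; the field names `race_order`, `driver_id`, and `position` and both helper functions are hypothetical.

```python
from collections import defaultdict, deque


def add_driver_history_features(rows, form_window=5):
    """Add prev_wins, win_rate and recent_form using only earlier races."""
    wins = defaultdict(int)
    starts = defaultdict(int)
    recent = defaultdict(lambda: deque(maxlen=form_window))
    out = []
    for row in sorted(rows, key=lambda r: r["race_order"]):
        d = row["driver_id"]
        history = recent[d]
        out.append({
            **row,
            "prev_wins": wins[d],
            "win_rate": wins[d] / starts[d] if starts[d] else 0.0,
            # Mean finishing position over the last `form_window` races, if any.
            "recent_form": sum(history) / len(history) if history else None,
        })
        # Update running totals only after the features are recorded.
        starts[d] += 1
        if row["position"] == 1:
            wins[d] += 1
        history.append(row["position"])
    return out


def encode_ids(values):
    """Map each distinct ID to a small integer (simple label encoding)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Invented toy data: two drivers over three races.
rows = [
    {"race_order": 1, "driver_id": "A", "position": 1},
    {"race_order": 1, "driver_id": "B", "position": 2},
    {"race_order": 2, "driver_id": "A", "position": 3},
    {"race_order": 2, "driver_id": "B", "position": 1},
    {"race_order": 3, "driver_id": "A", "position": 1},
    {"race_order": 3, "driver_id": "B", "position": 2},
]
featured = add_driver_history_features(rows)
codes, mapping = encode_ids([r["driver_id"] for r in rows])
```

The same running-total pattern could be keyed on the constructor, or on the (driver, circuit) pair, to produce the team and circuit-specific features.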