Author: Mohammad Taha — October 2025
This project aims to predict student academic success (pass/fail) using machine learning models trained on the UCI Student Performance dataset. The workflow includes data preprocessing, feature encoding, scaling, model comparison, and evaluation.
The best-performing model was the Random Forest Classifier, which reached 89% accuracy on the validation set with only minor overfitting. Its feature importances highlight the academic factors that most influence performance, such as previous grades and study time.
- Analyze academic and social attributes affecting student performance.
- Train multiple classification models and select the best one.
- Interpret model results to identify the most influential features.
- Provide actionable insights for educational decision-making.
- Source: UCI Machine Learning Repository (Dataset ID 320)
- Samples: 649 students
- Target: Final grade (G3) converted into binary classification: Pass (G3 ≥ 10) or Fail (G3 < 10)
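The target conversion above can be sketched in a few lines of pandas. The tiny DataFrame and the `passed` column name are illustrative assumptions; only the `G3` column and the threshold of 10 come from the dataset description.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the real 649-row dataset.
df = pd.DataFrame({"G3": [0, 8, 9, 10, 11, 15, 19]})

PASS_THRESHOLD = 10  # Pass if G3 >= 10, Fail otherwise

# Binary target: 1 = Pass, 0 = Fail ("passed" is an illustrative column name)
df["passed"] = (df["G3"] >= PASS_THRESHOLD).astype(int)

print(df)
print(df["passed"].value_counts().to_dict())
```

Checking the class counts right after this step is a cheap way to see how imbalanced the Pass/Fail target is before splitting.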
Key preprocessing steps:
- Label Encoding: Applied to binary categorical features (e.g., `sex`, `address`, `schoolsup`).
- One-Hot Encoding: Used for multi-class categorical variables (e.g., `Mjob`, `Fjob`, `reason`).
- Standard Scaling: Numeric features (`G1`, `G2`, `absences`, `age`, `studytime`) scaled using `StandardScaler`.
- Stratified Split: Data split into Train (60%), Validation (20%), and Test (20%) sets while maintaining class balance.
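The preprocessing steps listed above could look roughly like the following. This is a sketch on synthetic data, not the project's actual `train_model.py`: the random values, the seed, and the choice to fit the scaler on the training split only are assumptions made here for illustration. The column names are taken from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200  # synthetic rows, NOT the real dataset

df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], n),
    "address": rng.choice(["U", "R"], n),
    "schoolsup": rng.choice(["yes", "no"], n),
    "Mjob": rng.choice(["teacher", "health", "services", "at_home", "other"], n),
    "reason": rng.choice(["home", "reputation", "course", "other"], n),
    "G1": rng.integers(0, 21, n),
    "G2": rng.integers(0, 21, n),
    "absences": rng.integers(0, 30, n),
    "age": rng.integers(15, 23, n),
    "studytime": rng.integers(1, 5, n),
    "G3": rng.integers(0, 21, n),
})
y = (df["G3"] >= 10).astype(int)
X = df.drop(columns="G3")

binary_cols = ["sex", "address", "schoolsup"]
multi_cols = ["Mjob", "reason"]
numeric_cols = ["G1", "G2", "absences", "age", "studytime"]

# Label-encode binary features as 0/1 integer codes
for col in binary_cols:
    X[col] = X[col].astype("category").cat.codes

# One-hot encode multi-class features
X = pd.get_dummies(X, columns=multi_cols, dtype=int)
X = X.astype({c: float for c in numeric_cols})

# Stratified 60/20/20 split: first 60/40, then halve the 40
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()

# Fit the scaler on the training split only (assumption: avoids leaking val/test statistics)
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(len(X_train), len(X_val), len(X_test))
```

Splitting twice with `stratify` keeps the Pass/Fail ratio roughly equal across all three sets.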
The following models were trained and compared:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
- Random Forest Classifier
The Random Forest model was optimized using RandomizedSearchCV with 5-fold cross-validation.
| Parameter | Range |
|---|---|
| n_estimators | 100 → 500 |
| max_depth | None, 10, 20, 30 |
| min_samples_split | 2, 5, 10, 20 |
| min_samples_leaf | 1, 2, 4, 6 |
| max_features | sqrt, log2 |
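A search over the parameter table above could be set up roughly as follows. The synthetic data, `n_iter=3`, the `accuracy` scoring, and sampling `n_estimators` as integers between 100 and 500 are assumptions of this sketch; the source only states the ranges and the 5-fold cross-validation.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=42)

param_distributions = {
    "n_estimators": randint(100, 501),      # integers 100..500 inclusive
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=3,            # assumption: kept small so the sketch runs quickly
    cv=5,                # 5-fold cross-validation, as stated above
    scoring="accuracy",  # assumption: the source does not name the metric
    random_state=42,
)
search.fit(X, y)

best_rf = search.best_estimator_
print(search.best_params_)
print(round(search.best_score_, 3))
```

A real run would likely use a larger `n_iter`, since a handful of random draws covers only a small part of this search space.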
| Model | Accuracy (val) | F1 Score | ROC AUC |
|---|---|---|---|
| Logistic Regression | 0.82 | 0.80 | 0.84 |
| KNN | 0.79 | 0.77 | 0.81 |
| Decision Tree | 0.83 | 0.82 | 0.85 |
| Random Forest (Best) | 0.89 | 0.87 | 0.91 |
The Random Forest model performed best on all three validation metrics. Averaging many decorrelated trees lets it capture nonlinear relationships while reducing the variance of a single decision tree.
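A comparison table like the one above can be produced with a short loop. The sketch below uses synthetic data and default hyperparameters (both assumptions), so its numbers will not match the reported results. It only illustrates how accuracy, F1, and ROC AUC might be computed on a validation split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed student data
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)                # hard labels for accuracy / F1
    proba = model.predict_proba(X_val)[:, 1]   # positive-class scores for ROC AUC
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "f1": f1_score(y_val, pred),
        "roc_auc": roc_auc_score(y_val, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

ROC AUC is computed from predicted probabilities rather than hard labels, which is why `predict_proba` is called separately.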
Top predictors of student success:
- `G2` (Second period grade)
- `G1` (First period grade)
- `studytime` (Weekly study hours)
- `failures` (Number of past class failures)
- `schoolsup` (Extra educational support)
Academic features dominate over social variables, confirming that study effort and prior performance are strong indicators of success.
- Ambiguous boundary cases: Most misclassifications occur for students with grades between 9 and 11. Suggests potential for multi-class classification (Weak / Medium / Strong).
- Underrepresented categories: Some categorical values (e.g., `Fjob=teacher`) have few samples. Possible solution: SMOTE for class balancing.
- Outliers: High absences (>70) distort results. Use clipping or winsorization.
- Minor Overfitting: Train accuracy ≈ 0.97 vs validation ≈ 0.89. Control depth and minimum samples per split.
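The outlier handling suggested above could be sketched as follows. The sample values, the fixed cap of 30, and the 95th-percentile cutoff are hypothetical choices for illustration. The winsorization shown here caps only the upper tail.

```python
import pandas as pd

# Hypothetical absences column with one extreme value
absences = pd.Series([0, 2, 4, 6, 8, 10, 12, 16, 30, 75], name="absences")

# Option 1: clip at a fixed, domain-chosen cap (30 is an illustrative value)
FIXED_CAP = 30
clipped = absences.clip(upper=FIXED_CAP)

# Option 2: upper-tail winsorization, capping at a percentile estimated from the data
upper = absences.quantile(0.95)
winsorized = absences.clip(upper=upper)

print(pd.DataFrame({"raw": absences, "clipped": clipped, "winsorized": winsorized}))
```

In a train/validation/test setup, the cap or percentile would normally be derived from the training split and then reused on the other splits.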
- Extend analysis to other subjects or semesters.
- Explore advanced models: XGBoost, LightGBM, Neural Networks.
- Add longitudinal tracking for performance trends over time.
- Apply SHAP or LIME for local model explainability.
- Build an early-warning dashboard for teachers.
```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# best_rf, X_train_final, X_test_final, and y_test are expected to come
# from the training step (tuned Random Forest and the preprocessed splits).

# Feature importance plot: rank features by importance, highest first
importances = best_rf.feature_importances_
features = X_train_final.columns
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices][:10], y=features[indices][:10], palette='coolwarm')
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

# Confusion matrix on the held-out test set (rows = actual, columns = predicted)
cm = confusion_matrix(y_test, best_rf.predict(X_test_final))
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Test Data')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```

```mermaid
flowchart TD
    A[Data Loading from UCI Repository] --> B[Data Cleaning]
    B --> C[Encoding Categorical Features]
    C --> D[Scaling Numerical Features]
    D --> E[Train / Validation / Test Split]
    E --> F[Model Training]
    F --> G["Hyperparameter Tuning (RandomizedSearchCV)"]
    G --> H[Model Evaluation]
    H --> I[Feature Importance Analysis]
    I --> J[Educational Interpretation and Insights]
```
The analysis indicates that academic performance is predicted primarily by measurable academic factors, such as prior grades and study time, rather than by social conditions. Schools can apply this model to:
- Identify students at risk before final exams.
- Provide tailored academic support programs.
- Allocate resources more efficiently.
- Evaluate and adjust educational strategies based on data.
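As one illustration of the first use case, a trained classifier's predicted probabilities can be turned into a ranked at-risk list. Everything in this sketch is an assumption: the synthetic data, the 0.5 threshold, and the variable names are hypothetical, and a real deployment would choose the threshold from validation data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: class 1 = Pass, class 0 = Fail
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_new, y_train, _ = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

model = RandomForestClassifier(n_estimators=200, random_state=1)
model.fit(X_train, y_train)

RISK_THRESHOLD = 0.5  # hypothetical cutoff on the predicted pass probability

# Probability of passing for each not-yet-graded student
p_pass = model.predict_proba(X_new)[:, 1]

# Flag students below the threshold and rank them from lowest pass probability up
at_risk_idx = np.flatnonzero(p_pass < RISK_THRESHOLD)
ranked = at_risk_idx[np.argsort(p_pass[at_risk_idx])]

print(f"{len(ranked)} of {len(p_pass)} students flagged")
print(ranked[:5], np.round(p_pass[ranked[:5]], 2))
```

Ranking by probability rather than using only the hard Pass/Fail label lets staff prioritize the students the model is least confident will pass.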
- Data quality is crucial for reliable ML performance.
- Ensemble models improve stability, and their feature importances keep the results interpretable.
- Visual and statistical analysis together enhance insights.
- Combining analytics with education yields meaningful impact.
- Clone the repository:

  ```bash
  git clone https://github.com/tahamohmadf19-dev/PyCharmMiscProject2.git
  cd PyCharmMiscProject2
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the machine learning model training:

  ```bash
  python train_model.py
  ```

- Run the student management CLI:

  ```bash
  python student_manager_cli.py
  ```

- UCI Student Performance Dataset (ID 320)
- Scikit-Learn Documentation
- Aurélien Géron, Hands-On Machine Learning
- Romero & Ventura (2020), Educational Data Mining