Student Performance Prediction

Author: Mohammad Taha — October 2025

Project Overview

This project aims to predict student academic success (pass/fail) using machine learning models trained on the UCI Student Performance dataset. The workflow includes data preprocessing, feature encoding, scaling, model comparison, and evaluation.

The best-performing model was the Random Forest Classifier, which reached 89% accuracy on the validation set. Its feature importances highlight key academic factors that influence performance, such as previous grades and study time.

Objectives

Analyze academic and social attributes affecting student performance.
Train multiple classification models and select the best one.
Interpret model results to identify the most influential features.
Provide actionable insights for educational decision-making.

Dataset

Source: UCI Machine Learning Repository (Dataset ID 320)
Samples: 649 students
Target: Final grade (G3) converted into binary classification: Pass (G3 ≥ 10) or Fail (G3 < 10)
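
The pass/fail rule above can be sketched in a few lines of pandas. This is a minimal illustration on made-up grades, not the project's actual loading code; the column name `passed` and the constant `PASS_THRESHOLD` are hypothetical names chosen for the example.

```python
import pandas as pd

# Toy stand-in for the real dataset; only the G3 column matters here.
grades = pd.DataFrame({"G3": [0, 8, 9, 10, 11, 15, 19]})

PASS_THRESHOLD = 10  # from the rule above: Pass if G3 >= 10

# Hypothetical target column: 1 = Pass, 0 = Fail.
grades["passed"] = (grades["G3"] >= PASS_THRESHOLD).astype(int)

print(grades)
print(grades["passed"].value_counts(normalize=True))
```

Checking the class proportions right after binarizing is useful because the later split is stratified on this column.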

Data Preprocessing

Key preprocessing steps:

Label Encoding: Applied to binary categorical features (e.g., sex, address, schoolsup).
One-Hot Encoding: Used for multi-class categorical variables (e.g., Mjob, Fjob, reason).
Standard Scaling: Numeric features (G1, G2, absences, age, studytime) scaled using StandardScaler.
Stratified Split: Data split into Train (60%), Validation (20%), and Test (20%) sets while maintaining class balance.
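
The steps above could be wired together roughly as follows. This is a sketch on synthetic data that borrows a few of the dataset's column names, not the code in `train_model.py`. Two assumptions are worth flagging: `OrdinalEncoder` stands in for label encoding of binary features (scikit-learn's `LabelEncoder` is intended for targets and does not fit into a `ColumnTransformer`), and the split is done before fitting the encoders and scaler so that their statistics come from the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in data (assumption: real data would come from the UCI files).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "Mjob": rng.choice(["teacher", "health", "services", "at_home", "other"], size=n),
    "G1": rng.integers(0, 21, size=n),
    "G2": rng.integers(0, 21, size=n),
    "absences": rng.integers(0, 30, size=n),
})
y = (rng.random(n) < 0.8).astype(int)  # imbalanced toy target, 1 = Pass

# 60/20/20: hold out 20% for test, then 25% of the remaining 80% for validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

preprocess = ColumnTransformer([
    ("binary", OrdinalEncoder(), ["sex"]),                         # two-valued feature -> 0/1
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Mjob"]),  # multi-class feature
    ("scale", StandardScaler(), ["G1", "G2", "absences"]),         # numeric features
])

# Fit on the training set only, then reuse the fitted transformer elsewhere.
X_train_p = preprocess.fit_transform(X_train)
X_val_p = preprocess.transform(X_val)
X_test_p = preprocess.transform(X_test)

print(X_train_p.shape, X_val_p.shape, X_test_p.shape)
```

Passing `stratify` to both calls keeps the pass/fail ratio close to the original in all three subsets.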

Model Training and Tuning

The following models were trained and compared:

Logistic Regression
K-Nearest Neighbors (KNN)
Decision Tree Classifier
Random Forest Classifier
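
A comparison of these four models might look like the sketch below. It runs on synthetic data from `make_classification` and uses default-like hyperparameters (`max_iter=1000`, `n_neighbors=5`, `n_estimators=100`), all of which are assumptions for illustration rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed student features.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the validation split.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))

# Print models from best to worst validation accuracy.
for name, acc in sorted(val_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```

Selecting the model on the validation split leaves the test split untouched for a final, unbiased check.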

Hyperparameter Tuning (RandomizedSearchCV)

The Random Forest model was optimized using RandomizedSearchCV with 5-fold cross-validation.

| Parameter | Range |
|---|---|
| n_estimators | 100 → 500 |
| max_depth | None, 10, 20, 30 |
| min_samples_split | 2, 5, 10, 20 |
| min_samples_leaf | 1, 2, 4, 6 |
| max_features | sqrt, log2 |
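
The search space above maps onto `RandomizedSearchCV` roughly as shown here. The ranges and the 5-fold setting come from this section; `n_iter=5`, `scoring="accuracy"`, the random seeds, and the synthetic training data are assumptions made to keep the sketch small and reproducible.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training split.
X_train, y_train = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

param_distributions = {
    "n_estimators": randint(100, 501),  # integers from 100 to 500 inclusive
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # assumption: kept small so the example runs quickly
    cv=5,                # 5-fold cross-validation, as stated above
    scoring="accuracy",  # assumption: the selection metric is not stated
    random_state=42,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_  # refit on the full training split by default
print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search cross-validates inside the training split, the validation and test splits stay out of the tuning loop.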

Evaluation Results

| Model | Accuracy (val) | F1 Score | ROC AUC |
|---|---|---|---|
| Logistic Regression | 0.82 | 0.80 | 0.84 |
| KNN | 0.79 | 0.77 | 0.81 |
| Decision Tree | 0.83 | 0.82 | 0.85 |
| Random Forest (Best) | 0.89 | 0.87 | 0.91 |

The Random Forest model scored highest on all three validation metrics. It captures nonlinear relationships, and averaging over many trees reduces variance relative to a single decision tree.
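
For reference, the three metrics in the table can be computed as in the sketch below. It uses synthetic data and a default Random Forest, so the numbers it prints will not match the table. Note that ROC AUC is computed from predicted probabilities, while accuracy and F1 use hard class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and pass/fail target.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)               # hard labels for accuracy and F1
y_score = model.predict_proba(X_val)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
    "roc_auc": roc_auc_score(y_val, y_score),
}
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")
```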

Feature Importance Analysis

Top predictors of student success:

G2 (Second period grade)
G1 (First period grade)
studytime (Weekly study time, coded 1 to 4 in the dataset)
failures (Number of past class failures)
schoolsup (Extra educational support)

Academic features dominate over social variables, confirming that study effort and prior performance are strong indicators of success.

Error Analysis

Ambiguous boundary cases: Most misclassifications occur for students with grades between 9 and 11. Suggests potential for multi-class classification (Weak / Medium / Strong).
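Underrepresented categories: Some categorical values (e.g., Fjob=teacher) have few samples. Possible solution: merge rare values into a broader category. SMOTE is an option if the Pass/Fail classes themselves are imbalanced, since it oversamples the minority target class rather than rare feature values.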
Outliers: High absences (>70) distort results. Use clipping or winsorization.
Minor Overfitting: Train accuracy ≈ 0.97 vs validation ≈ 0.89. Control depth and minimum samples per split.
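
As one illustration of the outlier fix mentioned above, the sketch below caps a toy `absences` column at an upper percentile. The 95th percentile cutoff and the sample values are assumptions for the example; a suitable cutoff would have to be chosen from the real distribution.

```python
import pandas as pd

# Toy absences column with one extreme value.
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 12, 16, 75], name="absences")

# Assumed cutoff: cap everything above the 95th percentile.
upper = absences.quantile(0.95)
absences_clipped = absences.clip(upper=upper)

print(f"cap = {upper:.2f}")
print(absences_clipped.tolist())
```

To avoid leaking information, the cap would normally be computed on the training split and then applied unchanged to the validation and test splits.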

Future Work

Extend analysis to other subjects or semesters.
Explore advanced models: XGBoost, LightGBM, Neural Networks.
Add longitudinal tracking for performance trends over time.
Apply SHAP or LIME for local model explainability.
Build an early-warning dashboard for teachers.

Visualization Section

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# best_rf, X_train_final, X_test_final and y_test are expected to exist already
# (the tuned Random Forest and the preprocessed data splits).

# Feature Importance Plot
importances = best_rf.feature_importances_
features = X_train_final.columns
indices = np.argsort(importances)[::-1]  # most important first

plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices][:10], y=features[indices][:10], palette='coolwarm')
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

# Confusion Matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_test, best_rf.predict(X_test_final))
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Test Data')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
```

Project Flowchart (Mermaid)

```mermaid
flowchart TD
    A[Data Loading from UCI Repository] --> B[Data Cleaning]
    B --> C[Encoding Categorical Features]
    C --> D[Scaling Numerical Features]
    D --> E[Train / Validation / Test Split]
    E --> F[Model Training]
    F --> G["Hyperparameter Tuning (RandomizedSearchCV)"]
    G --> H[Model Evaluation]
    H --> I[Feature Importance Analysis]
    I --> J[Educational Interpretation and Insights]
```

Socio-Educational Impact

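The analysis suggests that, in this dataset, pass/fail outcomes are predicted mainly by measurable academic factors (prior grades, study time, past failures) rather than by social conditions. Feature importance shows predictive association, not causation. With that caveat, schools can apply this model to: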

Identify students at risk before final exams.
Provide tailored academic support programs.
Allocate resources more efficiently.
Evaluate and adjust educational strategies based on data.

Lessons Learned

Data quality is crucial for reliable ML performance.
Ensemble models improve stability, and their feature importances help make the results interpretable.
Visual and statistical analysis together enhance insights.
Combining analytics with education yields meaningful impact.

How to Install and Run

Clone the repository:

```
git clone https://github.com/tahamohmadf19-dev/PyCharmMiscProject2.git
cd PyCharmMiscProject2
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the machine learning model training:
```
python train_model.py
```
Run the student management CLI:
```
python student_manager_cli.py
```

References

UCI Student Performance Dataset (ID 320)
Scikit-Learn Documentation
Aurélien Géron, Hands-On Machine Learning
Romero & Ventura (2020), Educational Data Mining
