Student Performance Prediction

Author: Mohammad Taha — October 2025

Project Overview

This project aims to predict student academic success (pass/fail) using machine learning models trained on the UCI Student Performance dataset. The workflow includes data preprocessing, feature encoding, scaling, model comparison, and evaluation.

The best-performing model was the Random Forest Classifier, achieving 89% accuracy and demonstrating strong generalization. The model highlights key academic factors that influence performance, such as previous grades and study time.

Objectives

  1. Analyze academic and social attributes affecting student performance.
  2. Train multiple classification models and select the best one.
  3. Interpret model results to identify the most influential features.
  4. Provide actionable insights for educational decision-making.

Dataset

  • Source: UCI Machine Learning Repository (Dataset ID 320)
  • Samples: 649 students
  • Target: Final grade (G3), converted into a binary classification: Pass (G3 ≥ 10) or Fail (G3 < 10). A short loading sketch follows this list.
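
As a minimal loading sketch (assuming the ucimlrepo helper package; the "pass" column name is illustrative, and a local CSV export would work equally well):

import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset ID 320 (Student Performance) from the UCI repository
student = fetch_ucirepo(id=320)
df = pd.concat([student.data.features, student.data.targets], axis=1)

# Binarize the target: Pass if the final grade G3 is at least 10, otherwise Fail
df["pass"] = (df["G3"] >= 10).astype(int)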

Data Preprocessing

Key preprocessing steps (a minimal code sketch follows the list):

  1. Label Encoding: Applied to binary categorical features (e.g., sex, address, schoolsup).
  2. One-Hot Encoding: Used for multi-class categorical variables (e.g., Mjob, Fjob, reason).
  3. Standard Scaling: Numeric features (G1, G2, absences, age, studytime) scaled using StandardScaler.
  4. Stratified Split: Data split into Train (60%), Validation (20%), and Test (20%) sets while maintaining class balance.
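
A sketch of these steps, continuing from the loading snippet above; the column selections and the random seed are illustrative rather than the project's exact code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# 1-2. Encode categoricals: label-encode binary columns (sex, address, schoolsup, ...)
#      and one-hot encode multi-class columns (Mjob, Fjob, reason, ...)
binary_cols = [c for c in df.columns if df[c].dtype == object and df[c].nunique() == 2]
multi_cols = [c for c in df.columns if df[c].dtype == object and df[c].nunique() > 2]
for col in binary_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
df = pd.get_dummies(df, columns=multi_cols)

# 4. Stratified 60/20/20 split: hold out 40%, then halve it into validation/test
X, y = df.drop(columns=["G3", "pass"]), df["pass"]
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# 3. Scale the listed numeric features, fitting the scaler on the training split only
numeric_cols = ["G1", "G2", "absences", "age", "studytime"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])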

Model Training and Tuning

The following models were trained and compared (a baseline comparison sketch follows the list):

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Decision Tree Classifier
  • Random Forest Classifier
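
A baseline comparison sketch, assuming the splits from the preprocessing sketch above; default hyperparameters are used before any tuning:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and report validation accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: validation accuracy = {accuracy_score(y_val, model.predict(X_val)):.2f}")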

Hyperparameter Tuning (RandomizedSearchCV)

The Random Forest model was optimized using RandomizedSearchCV with 5-fold cross-validation over the parameter ranges below; a code sketch follows the table.

Parameter           Range
n_estimators        100 → 500
max_depth           None, 10, 20, 30
min_samples_split   2, 5, 10, 20
min_samples_leaf    1, 2, 4, 6
max_features        sqrt, log2
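
A sketch of the search, assuming the training split from above; the number of sampled configurations (n_iter) and the scoring metric are not stated in this write-up and are illustrative:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 501),       # 100 -> 500
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,            # illustrative; not stated in the write-up
    cv=5,                 # 5-fold cross-validation as described above
    scoring="f1",         # illustrative choice of metric
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_   # search.best_params_ reports the chosen configuration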

Evaluation Results

Model                  Accuracy (val)   F1 Score   ROC AUC
Logistic Regression    0.82             0.80       0.84
KNN                    0.79             0.77       0.81
Decision Tree          0.83             0.82       0.85
Random Forest (Best)   0.89             0.87       0.91

The Random Forest model provided the best performance, capturing nonlinear relationships while keeping variance low through ensemble averaging.
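
A minimal sketch of how the validation metrics above can be computed, assuming best_rf and the validation split from the earlier sketches:

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Validation-set metrics for the tuned forest; the same pattern applies
# to the other candidate models.
y_val_pred = best_rf.predict(X_val)
y_val_proba = best_rf.predict_proba(X_val)[:, 1]

print("Accuracy:", round(accuracy_score(y_val, y_val_pred), 2))
print("F1 score:", round(f1_score(y_val, y_val_pred), 2))
print("ROC AUC :", round(roc_auc_score(y_val, y_val_proba), 2))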

Feature Importance Analysis

Top predictors of student success:

  1. G2 (Second period grade)
  2. G1 (First period grade)
  3. studytime (Weekly study hours)
  4. failures (Number of past class failures)
  5. schoolsup (Extra educational support)

Academic features dominate over social variables, confirming that study effort and prior performance are strong indicators of success.

Error Analysis

  1. Ambiguous boundary cases: Most misclassifications occur for students with grades between 9 and 11, which suggests reframing the task as a multi-class problem (Weak / Medium / Strong).
  2. Underrepresented categories: Some categorical values (e.g., Fjob=teacher) have few samples. A possible mitigation is oversampling the training split with SMOTE.
  3. Outliers: Very high absence counts (>70) distort the scaled features; clipping or winsorization helps. A short sketch of SMOTE and clipping follows this list.
  4. Minor overfitting: Train accuracy ≈ 0.97 vs. validation ≈ 0.89; this can be controlled by limiting tree depth and raising the minimum samples per split.
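
A short sketch of the clipping and SMOTE fixes mentioned above; it assumes the imbalanced-learn package and the variables from the earlier sketches, and the clipping threshold mirrors the observation about absences:

from imblearn.over_sampling import SMOTE

# Cap extreme absence counts during preprocessing, before scaling;
# a percentile-based winsorization is an alternative to a fixed threshold.
df["absences"] = df["absences"].clip(upper=70)

# Balance rare classes by oversampling the training split only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)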

Future Work

  1. Extend analysis to other subjects or semesters.
  2. Explore advanced models: XGBoost, LightGBM, Neural Networks.
  3. Add longitudinal tracking for performance trends over time.
  4. Apply SHAP or LIME for local model explainability (a SHAP sketch follows this list).
  5. Build an early-warning dashboard for teachers.
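
A minimal SHAP sketch for the tuned forest, assuming the shap package; the return shape of shap_values varies across shap versions, which the snippet accounts for:

import shap

# TreeExplainer works directly with tree ensembles such as the tuned forest
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_val)

# Binary classifiers may yield one array per class (older shap) or a single
# 3-D array (newer shap); select the values for the positive ("pass") class.
values = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(values, X_val, max_display=10)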

Visualization Section

# Assumes `best_rf` (the tuned RandomForestClassifier) and the preprocessed
# splits `X_train_final`, `X_test_final`, `y_test` from the training pipeline.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import numpy as np

# Feature Importance Plot: rank features by the fitted forest's importances
importances = best_rf.feature_importances_
features = X_train_final.columns
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10,6))
sns.barplot(x=importances[indices][:10], y=features[indices][:10], palette='coolwarm')
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

# Confusion Matrix
cm = confusion_matrix(y_test, best_rf.predict(X_test_final))
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Test Data')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Project Flowchart (Mermaid)

flowchart TD
    A[Data Loading from UCI Repository] --> B[Data Cleaning]
    B --> C[Encoding Categorical Features]
    C --> D[Scaling Numerical Features]
    D --> E[Train / Validation / Test Split]
    E --> F[Model Training]
    F --> G["Hyperparameter Tuning (RandomizedSearchCV)"]
    G --> H[Model Evaluation]
    H --> I[Feature Importance Analysis]
    I --> J[Educational Interpretation and Insights]

Socio-Educational Impact

The analysis confirms that academic performance is primarily driven by measurable study factors rather than social conditions. Schools can apply this model to:

  • Identify students at risk before final exams.
  • Provide tailored academic support programs.
  • Allocate resources more efficiently.
  • Evaluate and adjust educational strategies based on data.

Lessons Learned

  • Data quality is crucial for reliable ML performance.
  • Ensemble models improve stability, and their feature importances keep the results interpretable.
  • Visual and statistical analysis together enhance insights.
  • Combining analytics with education yields meaningful impact.

How to Install and Run

  1. Clone the repository:

    git clone https://github.com/tahamohmadf19-dev/PyCharmMiscProject2.git
    cd PyCharmMiscProject2
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run the machine learning model training:

    python train_model.py
  4. Run the student management CLI:

    python student_manager_cli.py

References

  • UCI Student Performance Dataset (ID 320)
  • Scikit-Learn Documentation
  • Aurélien Géron, Hands-On Machine Learning
  • Romero & Ventura (2020), Educational Data Mining
