Student Performance Analysis and Predictive Modeling

Project: Student Performance Analysis and Performance Prediction
Author / Maintainer: Odunayomide Yakubu [Data Scientist and Analyst | Data Storyteller]

Overview

This repository contains a reproducible analysis and predictive modeling pipeline for a student performance dataset. The project explores student-level features, performs EDA, preprocesses data, trains machine learning models, and serializes the best model for later use. The primary deliverable is the Jupyter notebook Student Performance1.ipynb and a set of helper scripts for reproducibility.

Goal: improve student academic performance and well-being by identifying key drivers of success and building predictive models to flag at-risk students.

📌 Introduction

The Student Performance Dataset provides insights into academic achievements and extracurricular activities of students. This dataset is valuable for analyzing factors that impact student success, study habits, and parental influence.

📂 Dataset Overview

  • Total Records: 6,055
  • Total Columns: 15

🔑 Key Features

  • Demographics: StudentID, Age, Gender, Ethnicity
  • Academic Performance: GPA, GradeClass, Absences, StudyTimeWeekly
  • Parental Influence: ParentalEducation, ParentalSupport
  • Extracurricular Activities: ClubInvolvement, Sports, Music, Volunteering
  • Additional Support: Tutoring

🗂 Data dictionary (column descriptions)

Short, human-readable descriptions for each column - update as needed to match your dataset precisely (a quick schema check follows the list).

  • StudentID - Student unique identifier (integer).
  • Age - Student age in years (integer).
  • Gender - Student gender (e.g., Male, Female, Other).
  • Ethnicity - Self-reported ethnicity or category.
  • GPA - Grade Point Average (continuous: 0.0-4.0 or dataset scale).
  • GradeClass - Categorical grade bracket (e.g., A, B, C) or year group.
  • Absences - Number of class days missed.
  • StudyTimeWeekly - Reported weekly study hours.
  • ParentalEducation - Highest education level of parents/guardians.
  • ParentalSupport - Indicator of parental support (binary/categorical/scale).
  • ClubInvolvement - Participation in clubs (Yes/No or list).
  • Sports - Participation in sports (Yes/No or frequency).
  • Music - Participation in music programs (Yes/No or frequency).
  • Volunteering - Volunteering involvement (Yes/No or hours).
  • Tutoring - Whether the student receives tutoring (Yes/No or hours).
  • EnrollmentStatus - Current enrollment status (e.g., Active, Transferred, Dropped); replace if different.
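
A quick way to check that a local copy matches this dictionary is to load it with pandas and inspect the schema. This is only a sketch; the file path (taken from the preprocessing checklist below) may need adjusting:

import pandas as pd

# Load the raw file and compare its columns against the dictionary above
df = pd.read_csv("data/student_performance_dataset.csv")   # adjust the path/name if needed

print(df.shape)       # expected: (6055, 15)
print(df.dtypes)      # numeric vs. categorical columns
print(df.nunique())   # cardinality of each column (IDs, categories, scales)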

🎯 Target variable(s)

Choose a target depending on your problem formulation:

  • Regression: GPA (predict continuous performance scores).
  • Classification: GradeClass (predict grade bucket or AtRisk label derived from GPA thresholds).
  • You can also define derived targets such as AtRisk (GPA < threshold) or Improved (GPA increase vs previous term); a small pandas sketch follows this list.
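
As a minimal sketch, the targets above could be derived with pandas as follows; the 2.0 GPA cutoff for AtRisk is purely illustrative, not a threshold taken from the analysis:

import pandas as pd

df = pd.read_csv("data/student_performance_dataset.csv")   # path from the preprocessing checklist

# Regression target: continuous GPA
y_regression = df["GPA"]

# Classification targets: the existing grade bracket, or a derived at-risk flag
y_classification = df["GradeClass"]
AT_RISK_CUTOFF = 2.0                                        # illustrative threshold only
df["AtRisk"] = (df["GPA"] < AT_RISK_CUTOFF).astype(int)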

📊 Potential Use Cases

  • Predicting student performance based on study habits.
  • Analyzing the impact of extracurricular activities on GPA.
  • Examining the role of parental support in academic success.
  • Identifying trends in absenteeism and its effects on grades.
  • Building an early-warning system to flag at-risk students for intervention.

🧰 Preprocessing Checklist (recommended)

Before training models, perform these reproducible steps (a pipeline sketch follows the list):

  1. Data ingestion - load from data/student_performance_dataset.csv.
  2. Missing values - impute:
    • Categorical: most frequent or a dedicated Missing category.
    • Numerical: median (robust) or mean where appropriate.
  3. Type conversions - convert categorical columns to category dtype.
  4. Feature engineering - create derived features (e.g., StudyTimeCategory, AbsenceRate, ExtracurricularScore).
  5. Encoding - one-hot / ordinal encoding for categorical variables (choose a consistent mapping).
  6. Train/test split - use a stratified split for classification; keep a holdout test set (e.g., 20%).
  7. Imbalanced classes - handle via resampling, class weights, or targeted metrics.
  8. Versioning - save processed datasets and the preprocessing pipeline (e.g., joblib).
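
The sketch below shows one way steps 1-8 could be wired together with scikit-learn and joblib; the column groupings and file names are assumptions based on the data dictionary, not the exact choices made in the notebook:

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Ingestion
df = pd.read_csv("data/student_performance_dataset.csv")

# Assumed column groupings - adjust to the real dtypes in your copy
numeric_cols = ["Age", "StudyTimeWeekly", "Absences"]
categorical_cols = ["Gender", "Ethnicity", "ParentalEducation", "ParentalSupport",
                    "ClubInvolvement", "Sports", "Music", "Volunteering", "Tutoring"]

# 2-5. Imputation, scaling and encoding in a single ColumnTransformer
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# 6. Stratified holdout split on a classification target (e.g., GradeClass)
X = df[numeric_cols + categorical_cols]
y = df["GradeClass"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 8. Version the fitted preprocessing pipeline alongside the processed data
preprocess.fit(X_train)
joblib.dump(preprocess, "preprocessing_pipeline.joblib")   # illustrative file name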

🧪 Modeling & evaluation

Suggested approaches and metrics (a cross-validation sketch follows these lists):

Model types

  • Logistic Regression (classification).
  • Random Forest Regression.

Evaluation metrics

  • Regression: RMSE, MAE, R² for GPA prediction.
  • Classification: Accuracy, Precision, Recall, F1-score.
  • Business-relevant: Precision@k, Recall@k, confusion matrix for AtRisk detection.

Model validation

  • Cross-validation (k-fold) for robust estimates.
  • Use a holdout test set for final evaluation.
  • Report confidence intervals where possible.
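
A hedged sketch of cross-validating the two suggested model types; it uses only a few assumed numeric columns for brevity, whereas a real run would wrap the models in the preprocessing pipeline from the checklist above:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed numeric predictors; swap in the full preprocessing pipeline for real runs
features = ["Age", "StudyTimeWeekly", "Absences"]
df = pd.read_csv("data/student_performance_dataset.csv").dropna(subset=features + ["GPA", "GradeClass"])

# Regression: predict GPA, 5-fold cross-validated R-squared
reg_scores = cross_val_score(RandomForestRegressor(random_state=42),
                             df[features], df["GPA"], cv=5, scoring="r2")
print(f"GPA regression R2: {reg_scores.mean():.3f} +/- {reg_scores.std():.3f}")

# Classification: predict GradeClass, 5-fold cross-validated macro F1
clf_scores = cross_val_score(LogisticRegression(max_iter=1000),
                             df[features], df["GradeClass"], cv=5, scoring="f1_macro")
print(f"GradeClass macro F1: {clf_scores.mean():.3f} +/- {clf_scores.std():.3f}")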

Notebook summary

notebooks/Student Performance1.ipynb contains:

  • Data loading & initial inspection (pandas, df.info(), df.head(), df.nunique()).
  • Exploratory Data Analysis using matplotlib and seaborn (charts and embedded figures).
  • Preprocessing and missing-value handling (uses SimpleImputer, encoders and ColumnTransformer).
  • Train/test split and model training with scikit-learn (pipelines and model serialization via joblib; a serialization sketch follows this list).
  • Outputs: plots, summary tables, and saved model artifacts (if run).
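
The serialization step mentioned above could look roughly like this; the file name and feature list are illustrative, not the artifacts the notebook actually writes:

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a model on assumed numeric columns, then persist it with joblib
features = ["Age", "StudyTimeWeekly", "Absences"]
df = pd.read_csv("data/student_performance_dataset.csv").dropna(subset=features + ["GPA"])
model = RandomForestRegressor(random_state=42).fit(df[features], df["GPA"])
joblib.dump(model, "best_random_forest.joblib")      # illustrative artifact name

# Reload later and score new records without retraining
reloaded = joblib.load("best_random_forest.joblib")
print(reloaded.predict(df[features].head()))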

Results (summary)

  • Model performance - RandomForestRegressor

    • Tuned Model Test MSE: 0.0579
    • Tuned Model Test R-squared: 0.9288
    • The results indicate that the RandomForest model performed well on both the validation and test sets. With a Mean Squared Error (MSE) of 0.063 on validation and 0.0579 on test, the model's predictions are close to the actual values. The R² of 0.9288 indicates that the model explains roughly 93% of the variance in the target variable, a strong though not perfect fit. Overall, these metrics suggest the model is well suited to this prediction task.
  • Model performance - LogisticRegression

    • Accuracy: 99.61%
    • Precision / Recall / F1: 1.00 / 1.00 / 1.00
  • Key drivers identified: GPA, Absences, StudyTimeWeekly, Tutoring

Installation

Clone the repository (see the Clone the repo section below) and install the Python libraries used in the notebooks: pandas, scikit-learn, matplotlib, seaborn, and joblib.

Usage

All analysis and modeling are performed within Jupyter Notebooks. Open the relevant .ipynb files to follow the workflow:

  • EDA.ipynb: Data exploration and visualization
  • Preprocessing.ipynb: Data cleaning and feature engineering
  • Modeling.ipynb: Building and evaluating predictive models

Project Structure

.
├── data/
│   └── student_performance.csv        # Raw dataset
├── notebooks/
│   ├── EDA.ipynb                      # Exploratory Data Analysis
│   ├── Preprocessing.ipynb            # Data preprocessing steps
│   └── Modeling.ipynb                 # Predictive modeling
└── README.md

Machine Learning Models

  • Logistic Regression
  • Random Forests

Results

Major takeaways:
  • Top predictors of student performance: StudyTimeWeekly, Absences, PastGrades (or prior term performance), and ParentalSupport are the strongest predictors of final academic outcome based on feature importance analysis (a small extraction sketch follows this list).
  • Attendance matters: Higher Absences is consistently associated with lower GPA and a higher probability of being flagged AtRisk.
  • Study habits drive gains: Students reporting more weekly hours of focused study show substantially higher average GPA than peers.
  • Extracurricular involvement shows mixed effects: Participation in clubs, sports, and music correlates with improved engagement and modest positive GPA lift after controlling for study time; the effect varies by activity type.
  • Tutoring helps but is clustered: Students receiving tutoring often start with lower baseline performance (selection effect) yet demonstrate better relative improvement; targeted tutoring appears more effective than unfocused programs.
  • Model performance: Best model = RandomForestRegressor, with a test R² of 0.9288 (about 92.9% of GPA variance explained); the classification model reaches 99.6% recall for AtRisk detection.
  • Fairness & bias check: Performance gaps were observed across demographic groups (e.g., Ethnicity, ParentalEducation). Without mitigation, models may reproduce existing disparities.
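
The feature-importance claim above could be reproduced along these lines; the predictor list is an assumption, and in practice the full preprocessing pipeline should feed the model rather than raw numeric columns:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed numeric predictors; extend with encoded categoricals for a full analysis
features = ["StudyTimeWeekly", "Absences", "Age"]
df = pd.read_csv("data/student_performance_dataset.csv").dropna(subset=features + ["GPA"])

# Fit a forest on GPA and rank features by impurity-based importance
forest = RandomForestRegressor(random_state=42).fit(df[features], df["GPA"])
importances = pd.Series(forest.feature_importances_, index=features).sort_values(ascending=False)
print(importances)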

Practical recommendations

  1. Early-warning dashboard: Implement a real-time rule to flag students with low recent grades, high absences, or low study hours for counselor outreach and support (a rule sketch follows this list).
  2. Targeted tutoring & monitoring: Prioritize tutoring for flagged students and measure short-term GPA changes to evaluate impact.
  3. Parental engagement programs: Run low-cost workshops or weekly progress summaries for students with low ParentalSupport scores.
  4. Promote structured study time: Offer study-skill sessions and track StudyTimeWeekly as a KPI; marginal increases in study time correlate with measurable GPA gains.
  5. Audit model fairness: Perform subgroup evaluation and calibration checks before deployment; consider class balancing, reweighting, or fairness-aware algorithms for mitigation.
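
A minimal sketch of the rule-based flag from recommendation 1; every threshold here is a hypothetical placeholder to be set with counselors and validated against outcomes, not a value produced by the models:

import pandas as pd

# Hypothetical cutoffs - set these with domain experts; they are not model outputs
GPA_CUTOFF = 2.0
ABSENCE_CUTOFF = 10
STUDY_HOURS_CUTOFF = 5

def flag_at_risk(students: pd.DataFrame) -> pd.Series:
    """Return True for students meeting any early-warning condition."""
    return ((students["GPA"] < GPA_CUTOFF)
            | (students["Absences"] > ABSENCE_CUTOFF)
            | (students["StudyTimeWeekly"] < STUDY_HOURS_CUTOFF))

df = pd.read_csv("data/student_performance_dataset.csv")
df["EarlyWarningFlag"] = flag_at_risk(df)
print(df.loc[df["EarlyWarningFlag"], ["StudentID", "GPA", "Absences", "StudyTimeWeekly"]].head())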

Contributing

Contributions are welcome! Please open issues or submit pull requests for enhancements, bug fixes, or new features.

License

MIT License

Contact

For questions or collaborations, reach out to the repository owner via GitHub profile.

Clone the repo

To get a local copy of the project, run:

git clone https://github.com/ODUNAYOMIDE-YAKUBU/Student-Performance-Analysis-and-Predictive-Modeling.git
cd Student-Performance-Analysis-and-Predictive-Modeling
