Project: Student Performance Analysis And Performance Prediction
Author / Maintainer: Odunayomide Yakubu [Data Scientist and Analyst | Data Storyteller]
This repository contains a reproducible analysis and predictive modeling pipeline for a student performance dataset. The project explores student-level features, performs EDA, preprocesses data, trains machine learning models, and serializes the best model for later use. The primary deliverable is the Jupyter notebook Student Performance1.ipynb and a set of helper scripts for reproducibility.
Goal: improve student academic performance and well-being by identifying key drivers of success and building predictive models to flag at-risk students.
The Student Performance Dataset provides insights into academic achievements and extracurricular activities of students. This dataset is valuable for analyzing factors that impact student success, study habits, and parental influence.
- Total Records: 6,055
- Total Columns: 15
- Demographics: StudentID, Age, Gender, Ethnicity
- Academic Performance: GPA, GradeClass, Absences, StudyTimeWeekly
- Parental Influence: ParentalEducation, ParentalSupport
- Extracurricular Activities: ClubInvolvement, Sports, Music, Volunteering
- Additional Support: Tutoring
Short, human-readable descriptions for each column; update as needed to match your dataset precisely.
- StudentID: Unique student identifier (integer).
- Age: Student age in years (integer).
- Gender: Student gender (e.g., Male, Female, Other).
- Ethnicity: Self-reported ethnicity or category.
- GPA: Grade Point Average (continuous: 0.0 to 4.0, or the dataset's scale).
- GradeClass: Categorical grade bracket (e.g., A, B, C) or year group.
- Absences: Number of class days missed.
- StudyTimeWeekly: Reported weekly study hours.
- ParentalEducation: Highest education level of parents/guardians.
- ParentalSupport: Indicator of parental support (binary/categorical/scale).
- ClubInvolvement: Participation in clubs (Yes/No or list).
- Sports: Participation in sports (Yes/No or frequency).
- Music: Participation in music programs (Yes/No or frequency).
- Volunteering: Volunteering involvement (Yes/No or hours).
- Tutoring: Whether the student receives tutoring (Yes/No or hours).
- EnrollmentStatus: Current enrollment status (e.g., Active, Transferred, Dropped); replace if different.
Choose the target depending on your problem formulation:
- Regression: GPA (predict continuous performance scores).
- Classification: GradeClass (predict the grade bucket, or an AtRisk label derived from GPA thresholds).
- You can also define derived targets such as AtRisk (GPA < threshold) or Improved (GPA increase vs. previous term).
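A derived AtRisk target can be built with a single comparison. The sketch below is illustrative only: the threshold of 2.0 and the helper name `add_at_risk_label` are assumptions, since the source does not state which GPA cut-off defines "at risk".

```python
import pandas as pd

# Hypothetical cut-off: the README does not specify the GPA threshold.
AT_RISK_THRESHOLD = 2.0


def add_at_risk_label(df: pd.DataFrame, threshold: float = AT_RISK_THRESHOLD) -> pd.DataFrame:
    """Return a copy of df with a binary AtRisk column (1 when GPA < threshold)."""
    out = df.copy()
    out["AtRisk"] = (out["GPA"] < threshold).astype(int)
    return out


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame({"StudentID": [1, 2, 3], "GPA": [1.4, 2.0, 3.6]})
labelled = add_at_risk_label(sample)
print(labelled)
```

Because the comparison is strict, a student exactly at the threshold is not flagged; switch to `<=` if your definition differs.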
- Predicting student performance based on study habits.
- Analyzing the impact of extracurricular activities on GPA.
- Examining the role of parental support in academic success.
- Identifying trends in absenteeism and its effects on grades.
- Building an early-warning system to flag at-risk students for intervention.
Before training models, perform these reproducible steps:
- Data ingestion: load from data/student_performance_dataset.csv.
- Missing values: impute as follows.
  - Categorical: most frequent value or a dedicated Missing category.
  - Numerical: median (robust) or mean where appropriate.
- Type conversions: convert categorical columns to category dtype.
- Feature engineering: create derived features (e.g., StudyTimeCategory, AbsenceRate, ExtracurricularScore).
- Encoding: one-hot / ordinal encoding for categorical variables (choose a consistent mapping).
- Train/test split: use a stratified split for classification; keep a holdout test set (e.g., 20%).
- Imbalanced classes: handle via resampling, class weights, or targeted metrics.
- Versioning: save processed datasets and the preprocessing pipeline (e.g., with joblib).
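The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's actual code: the split of columns into numeric and categorical, the synthetic data, and the file name `preprocessor.joblib` are all assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed column roles; adjust to the real dtypes in the dataset.
numeric_cols = ["Age", "Absences", "StudyTimeWeekly"]
categorical_cols = ["Gender", "ParentalSupport", "Tutoring"]

preprocessor = ColumnTransformer([
    # Median imputation for numeric features.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Most-frequent imputation followed by one-hot encoding for categoricals.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Synthetic stand-in for data/student_performance_dataset.csv.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "Age": rng.integers(15, 19, n).astype(float),
    "Absences": rng.integers(0, 30, n).astype(float),
    "StudyTimeWeekly": rng.uniform(0, 20, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "ParentalSupport": rng.choice(["Low", "Medium", "High"], n),
    "Tutoring": rng.choice(["Yes", "No"], n),
    "GradeClass": np.tile(["A", "B", "C", "D"], n // 4),
})
df.loc[0, "Absences"] = np.nan  # inject a missing value to exercise the imputer

X = df.drop(columns="GradeClass")
y = df["GradeClass"]

# Stratified 80/20 split keeps class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on training data only, then apply the same mapping to the test data.
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)

# Persist the fitted preprocessing so later runs reuse the same mapping.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.joblib")
    joblib.dump(preprocessor, path)
    restored = joblib.load(path)
    restored_shape = restored.transform(X_test).shape
```

Fitting the transformer on the training split only avoids leaking test-set statistics into the imputation and encoding.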
Suggested approaches and metrics:
- Logistic Regression (classification).
- Random Forest Regression.
- Regression: RMSE, MAE, R² for GPA prediction.
- Classification: Accuracy, Precision, Recall, F1-score.
- Business-relevant: Precision@k, Recall@k, confusion matrix for AtRisk detection.
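For reference, the regression metrics and the top-k metrics listed above can be computed in a few lines of plain Python. The function names and the sample numbers are illustrative assumptions; in practice `sklearn.metrics` offers equivalents for RMSE, MAE, and R².

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def precision_recall_at_k(labels, scores, k):
    """Precision and recall among the k highest-scored students (label 1 = AtRisk)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    positives = sum(labels)
    return hits / k, (hits / positives if positives else 0.0)


# Made-up values, for illustration only.
gpa_true = [3.0, 2.0, 4.0, 1.0]
gpa_pred = [2.5, 2.0, 3.5, 1.5]
print(rmse(gpa_true, gpa_pred), mae(gpa_true, gpa_pred), r2(gpa_true, gpa_pred))

at_risk = [1, 0, 1, 0, 0, 1]
risk_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
p_at_3, r_at_3 = precision_recall_at_k(at_risk, risk_scores, k=3)
print(p_at_3, r_at_3)
```

Precision@k and Recall@k suit the AtRisk use case because counselors can usually reach only a fixed number of students, so what matters is how many truly at-risk students land in that top-k list.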
- Cross-validation (k-fold) for robust estimates.
- Use a holdout test set for final evaluation.
- Report confidence intervals where possible.
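A minimal sketch of this validation scheme, assuming scikit-learn and synthetic data in place of the real dataset: cross-validate on a development split, report a rough interval around the mean score, and score the untouched holdout once at the end. The normal-approximation interval is only indicative with five folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data standing in for the student features and GPA.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# Keep 20% aside as the final holdout; it is not used during cross-validation.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="r2")

# Rough 95% interval for the mean fold score (normal approximation).
mean_r2 = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
ci_low, ci_high = mean_r2 - half_width, mean_r2 + half_width

# Final evaluation: fit on all development data, score once on the holdout.
model.fit(X_dev, y_dev)
holdout_r2 = model.score(X_hold, y_hold)

print(f"CV R^2: {mean_r2:.3f} (approx. 95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"Holdout R^2: {holdout_r2:.3f}")
```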
notebooks/Student Performance1.ipynb contains:
- Data loading & initial inspection (pandas, df.info(), df.head(), df.nunique()).
- Exploratory Data Analysis using matplotlib and seaborn (charts and embedded figures).
- Preprocessing and missing-value handling (uses SimpleImputer, encoders, and ColumnTransformer).
- Train/test split and model training with scikit-learn (pipelines and model serialization via joblib).
- Outputs: plots, summary tables, and saved model artifacts (if run).
- Model Performance (RandomForestRegressor):
  - Tuned Model Test MSE: 0.0579
  - Tuned Model Test R-squared: 0.9288

  The RandomForest model performed well on both the validation and test sets. With a Mean Squared Error (MSE) of about 0.063 on validation and 0.058 on test, its predictions are close to the actual values. The test R² of 0.9288 means the model explains roughly 93% of the variance in the target; this is a strong fit, not a perfect one. Overall, these metrics suggest that the model is effective for this prediction task.
- Performance (LogisticRegression):
  - Accuracy: 99.61%
  - Precision / Recall / F1: 1 / 1 / 1 (rounded; with accuracy below 100%, at least one of them is slightly under 1)
- Key drivers identified: ['GPA', 'Absences', 'StudyTimeWeekly', 'Tutoring']
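The reported random-forest test metrics can be translated into more intuitive quantities with a little arithmetic. The sketch below uses only the two figures stated above and assumes both were computed on the same test set.

```python
import math

# Values reported in this README for the tuned RandomForestRegressor (test set).
test_mse = 0.0579
test_r2 = 0.9288

# RMSE is expressed in the same units as the target (GPA points).
test_rmse = math.sqrt(test_mse)

# Share of target variance the model leaves unexplained.
unexplained_share = 1.0 - test_r2

# Since R^2 = 1 - MSE / Var(y), the implied variance of the test-set target
# is MSE / (1 - R^2). This holds only if both metrics use the same test set.
implied_target_variance = test_mse / unexplained_share
implied_target_std = math.sqrt(implied_target_variance)

print(f"RMSE: {test_rmse:.3f}")
print(f"Unexplained variance share: {unexplained_share:.2%}")
print(f"Implied target std: {implied_target_std:.3f}")
```

Read this way, a typical prediction error is roughly a quarter of a GPA point, and about 7% of the variance remains unexplained.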
All analysis and modeling are performed within Jupyter Notebooks. Open the relevant .ipynb files to follow the workflow:
- EDA.ipynb: Data exploration and visualization
- Preprocessing.ipynb: Data cleaning and feature engineering
- Modeling.ipynb: Building and evaluating predictive models
.
├── data/
│   └── student_performance.csv   # Raw dataset
├── notebooks/
│   ├── EDA.ipynb                 # Exploratory Data Analysis
│   ├── Preprocessing.ipynb       # Data preprocessing steps
│   └── Modeling.ipynb            # Predictive modeling
└── README.md
- Logistic Regression
- Random Forests
Major takeaways:
- Top predictors of student performance: StudyTimeWeekly, Absences, PastGrades (or prior term performance), and ParentalSupport are the strongest predictors of final academic outcome based on feature importance analysis.
- Attendance matters: Higher Absences is consistently associated with lower GPA and a higher probability of being flagged AtRisk.
- Study habits drive gains: Students reporting ≥ 71% hours/week of focused study show substantially higher average GPA than peers.
- Extracurricular involvement shows mixed effects: Participation in clubs, sports, and music correlates with improved engagement and a modest positive GPA lift after controlling for study time; the effect varies by activity type.
- Tutoring helps but is clustered: Students receiving tutoring often start with lower baseline performance (selection effect) yet demonstrate better relative improvement; targeted tutoring appears more effective than unfocused programs.
- Model performance: The tuned RandomForestRegressor reached a test R² of 0.9288 (MSE 0.0579) for GPA prediction; the LogisticRegression classifier reached 99.61% accuracy.
- Fairness & bias check: Performance gaps were observed across demographic groups (e.g., Ethnicity, ParentalEducation). Without mitigation, models may reproduce existing disparities.
- Early-warning dashboard: Implement a real-time rule to flag students with low recent grades, high absences, or low study hours for counselor outreach and support.
- Targeted tutoring & monitoring: Prioritize tutoring for flagged students and measure short-term GPA changes to evaluate impact.
- Parental engagement programs: Run low-cost workshops or weekly progress summaries for students with low ParentalSupport scores.
- Promote structured study time: Offer study-skill sessions and track StudyTimeWeekly as a KPI; marginal increases in study time correlate with measurable GPA gains.
- Audit model fairness: Perform subgroup evaluation and calibration checks before deployment; consider class balancing, reweighting, or fairness-aware algorithms for mitigation.
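One simple form of the subgroup evaluation mentioned above is to compare AtRisk recall across groups. The sketch below is a plain-Python illustration under stated assumptions: the helper name `recall_by_group`, the group labels, and the sample predictions are all made up.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Recall for the positive class (1 = AtRisk), computed separately per group."""
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            if pred == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    # Groups without any positive example are omitted (recall is undefined there).
    return {
        g: true_pos[g] / (true_pos[g] + false_neg[g])
        for g in set(true_pos) | set(false_neg)
    }


# Made-up labels and predictions for two hypothetical groups.
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = recall_by_group(y_true, y_pred, groups)
recall_gap = max(per_group.values()) - min(per_group.values())
print(per_group, recall_gap)
```

A large gap means at-risk students in one group are missed more often than in another, which is a signal to investigate reweighting or threshold adjustments before deployment.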
Contributions are welcome! Please open issues or submit pull requests for enhancements, bug fixes, or new features.
For questions or collaborations, reach out to the repository owner via GitHub profile.
Clone the repository:
git clone https://github.com/ODUNAYOMIDE-YAKUBU/Student-Performance-Analysis-and-Predictive-Modeling.git
cd Student-Performance-Analysis-and-Predictive-Modeling