With the increasing use of AI tools like ChatGPT in academia, distinguishing between human- and AI-generated responses is essential for maintaining academic integrity. This project explores a machine learning pipeline with Logistic Regression, SVM, MLP, and BERT models to classify text as human- or AI-generated based on linguistic and semantic features.
*This is a course project for STT 811 Applied Statistical Modeling for Data Scientists at MSU. The contributors are Mahnoor Sheikh, Andrew John J, Roshni Bhowmik, and Ab Basit Syed Rafi.*
🌐 Access the Streamlit web app to delve into the detailed steps of data cleaning, preprocessing, and modeling, and to explore the insights derived from the analysis.
## Table of Contents

- Dataset
- Preprocessing and Feature Engineering
- Exploratory Data Analysis
- Modeling
- Streamlit App Features
- Key Takeaways
- References
- Installation and Usage
## Dataset

- Source: Custom dataset of 2,239 rows (from Mendeley)
- Contents:
  - Question: The original statistics question
  - Human Response: Text response from a student
  - AI Response: Text generated using a language model
- Post-cleaning: 1,993 usable examples
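To make the layout concrete, here is a minimal loading sketch. It assumes a pandas-readable CSV with the three columns above; the filename and exact column names are placeholders, not necessarily what the repo uses.

```python
import pandas as pd

# Hypothetical filename; adjust to the actual dataset file.
df = pd.read_csv("dataset.csv")  # columns: Question, Human Response, AI Response

# Reshape each row into two labeled examples: human = 0, AI = 1.
human = df[["Question", "Human Response"]].rename(columns={"Human Response": "Response"})
human["label"] = 0
ai = df[["Question", "AI Response"]].rename(columns={"AI Response": "Response"})
ai["label"] = 1

data = pd.concat([human, ai], ignore_index=True)
print(data.shape)  # two examples per original row, before cleaning
```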
## Preprocessing and Feature Engineering

- Cleaning: Lowercasing, punctuation removal, tokenization, stopword removal
- Feature Creation:
  - Text length, special character counts
  - Flesch Reading Ease, Gunning Fog Index
  - Cosine similarity to question
  - Sentiment scores and sentiment gaps
- Vectorization: `CountVectorizer` followed by PCA (95% of variance retained in 482 components)
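As a rough illustration, the sketch below strings these steps together with NLTK, textstat, VADER, and scikit-learn. The library choices are assumptions (the repo's requirements.txt is authoritative), and `data` refers to the loading sketch above.

```python
import string

import nltk
import textstat
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

for pkg in ("punkt", "stopwords", "vader_lexicon"):
    nltk.download(pkg)

STOPWORDS = set(stopwords.words("english"))
sia = SentimentIntensityAnalyzer()

def clean(text: str) -> str:
    """Lowercase, strip punctuation, tokenize, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in nltk.word_tokenize(text) if t not in STOPWORDS)

def handcrafted_features(question: str, response: str) -> dict:
    """Length, readability, question-similarity, and sentiment-gap features."""
    counts = CountVectorizer().fit_transform([question, response])
    return {
        "length": len(response),
        "special_chars": sum(not c.isalnum() and not c.isspace() for c in response),
        "flesch": textstat.flesch_reading_ease(response),
        "gunning_fog": textstat.gunning_fog(response),
        "question_similarity": cosine_similarity(counts[0], counts[1])[0, 0],
        "sentiment_gap": abs(
            sia.polarity_scores(question)["compound"]
            - sia.polarity_scores(response)["compound"]
        ),
    }

# Bag-of-words on the cleaned responses, then PCA keeping 95% of variance
# (482 components on this dataset).
bow = CountVectorizer().fit_transform(data["Response"].map(clean)).toarray()
reduced = PCA(n_components=0.95).fit_transform(bow)
```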
## Exploratory Data Analysis

Key visuals and insights:
- Top Trigrams and Common Words in AI vs. Human responses
- Word Clouds and Text Length Distribution
- Sentiment Gap Analysis and KDE Estimation
- Readability Scores: AI responses are longer and more formulaic
- Text Similarity: AI more aligned with original questions
- Pairplots & Correlation Heatmaps reveal subtle response patterns
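The trigram comparison, for instance, can be reproduced with scikit-learn's n-gram support. This is an illustrative sketch rather than the project's exact EDA code; `data` comes from the loading sketch above.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, n=10):
    """Return the n most frequent trigrams across `texts`."""
    vec = CountVectorizer(ngram_range=(3, 3))
    counts = vec.fit_transform(texts).sum(axis=0).A1
    return sorted(zip(vec.get_feature_names_out(), counts), key=lambda x: -x[1])[:n]

print(top_trigrams(data.loc[data["label"] == 1, "Response"]))  # AI responses
print(top_trigrams(data.loc[data["label"] == 0, "Response"]))  # human responses
```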
## Modeling

### Traditional Models

- Classifiers: Logistic Regression, Linear SVM, Decision Tree, Random Forest, KNN, Gradient Boosting, MLP
- Best accuracy: ~85% (Logistic Regression, SVM, MLP)
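A minimal training loop for two of the strongest models might look like this. The 80/20 split and hyperparameters are assumptions; `reduced` and `data` come from the earlier sketches.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    reduced, data["label"], test_size=0.2, random_state=42, stratify=data["label"]
)

for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```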
### BERT

- Model: `bert-base-uncased` via Hugging Face
- Training:
  - Tokenization (WordPiece)
  - 30 epochs with cross-entropy loss
  - AdamW optimizer
- Performance: Comparable to the traditional models, with potential for further gains
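A condensed fine-tuning sketch, assuming Hugging Face transformers with PyTorch; the batch size and learning rate are assumptions, while the 30 epochs match the setup above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece under the hood
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# `data` as in the loading sketch: raw responses plus 0/1 labels.
enc = tokenizer(data["Response"].tolist(), truncation=True, padding=True, return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(data["label"].tolist())),
    batch_size=16,  # assumption
    shuffle=True,
)

optimizer = AdamW(model.parameters(), lr=2e-5)  # assumption
model.train()
for epoch in range(30):  # 30 epochs, per the training setup above
    for input_ids, attention_mask, labels in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()  # cross-entropy computed internally when labels are passed
        optimizer.step()
        optimizer.zero_grad()
```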
## Streamlit App Features

- Upload new questions and responses
- Evaluate text using trained models
- Visual analytics: word clouds, trigrams, readability, sentiment
- Compare AI vs. human characteristics interactively
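The core prediction flow could be as small as the sketch below. The artifact filenames are hypothetical placeholders; the full app lives in streamlit_code.py.

```python
import joblib
import streamlit as st

# Hypothetical filenames for the trained artifacts.
vectorizer = joblib.load("count_vectorizer.joblib")
pca = joblib.load("pca.joblib")
model = joblib.load("logreg.joblib")

st.title("Human vs. AI Response Classifier")
response = st.text_area("Paste a response to classify")

if st.button("Classify") and response:
    features = pca.transform(vectorizer.transform([response]).toarray())
    label = model.predict(features)[0]
    st.write("Prediction:", "AI-generated" if label == 1 else "Human-written")
```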
## Key Takeaways

- Human responses were simpler and less verbose, and showed more variability
- AI responses were longer, sentimentally aligned with questions, and structurally consistent
- Readability, sentiment gap, and cosine similarity are strong distinguishing features
- The system offers a foundational step toward detecting AI-generated content in education
## Installation and Usage

```bash
# Clone repo
git clone https://github.com/andrew-jxhn/STT811_StatsProject.git
cd STT811_StatsProject

# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Run Streamlit app
streamlit run streamlit_code.py
```