With the increasing use of AI tools like ChatGPT in academia, distinguishing between human- and AI-generated responses is essential for maintaining academic integrity. This project explores a machine learning pipeline with Logistic Regression, SVM, MLP, and BERT models to classify text as human- or AI-generated based on linguistic and semantic features.
*This is a course project for STT 811 Applied Statistical Modeling for Data Scientists at MSU. The contributors are Mahnoor Sheikh, Andrew John J, Roshni Bhowmik, and Ab Basit Syed Rafi.*
🌐 Access the Streamlit web app to delve into the detailed steps of data cleaning, preprocessing, and modeling, and to explore the insights derived from the analysis.
## Table of Contents

- Dataset
- Preprocessing and Feature Engineering
- Exploratory Data Analysis
- Modeling
- Streamlit App Features
- Key Takeaways
- References
- Installation and Usage
## Dataset

- Source: Custom dataset of 2,239 rows (from Mendeley)
- Contents:
  - Question: The original statistics question
  - Human Response: Text response from a student
  - AI Response: Text generated using a language model
- Post-cleaning: 1,993 usable examples
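To make the layout concrete, here is a minimal loading sketch. It assumes a pandas-readable CSV with the three columns above; the filename and exact column names are placeholders, not necessarily what the repo uses.

```python
import pandas as pd

# Hypothetical filename; adjust to the actual dataset file.
df = pd.read_csv("dataset.csv")  # columns: Question, Human Response, AI Response

# Reshape each row into two labeled examples: human = 0, AI = 1.
human = df[["Question", "Human Response"]].rename(columns={"Human Response": "Response"})
human["label"] = 0
ai = df[["Question", "AI Response"]].rename(columns={"AI Response": "Response"})
ai["label"] = 1

data = pd.concat([human, ai], ignore_index=True)
print(data.shape)  # two examples per original row, before cleaning
```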
## Preprocessing and Feature Engineering

- Cleaning: Lowercasing, punctuation removal, tokenization, stopword removal
- Feature Creation:
  - Text length, special character counts
  - Flesch Reading Ease, Gunning Fog Index
  - Cosine similarity to question
  - Sentiment scores and sentiment gaps
- Vectorization: `CountVectorizer` followed by PCA (95% of variance retained in 482 components)
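As a rough illustration, the sketch below strings these steps together with NLTK, textstat, VADER, and scikit-learn. The library choices are assumptions (the repo's requirements.txt is authoritative), and `data` refers to the loading sketch above.

```python
import string

import nltk
import textstat
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

for pkg in ("punkt", "stopwords", "vader_lexicon"):
    nltk.download(pkg)

STOPWORDS = set(stopwords.words("english"))
sia = SentimentIntensityAnalyzer()

def clean(text: str) -> str:
    """Lowercase, strip punctuation, tokenize, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in nltk.word_tokenize(text) if t not in STOPWORDS)

def handcrafted_features(question: str, response: str) -> dict:
    """Length, readability, question-similarity, and sentiment-gap features."""
    counts = CountVectorizer().fit_transform([question, response])
    return {
        "length": len(response),
        "special_chars": sum(not c.isalnum() and not c.isspace() for c in response),
        "flesch": textstat.flesch_reading_ease(response),
        "gunning_fog": textstat.gunning_fog(response),
        "question_similarity": cosine_similarity(counts[0], counts[1])[0, 0],
        "sentiment_gap": abs(
            sia.polarity_scores(question)["compound"]
            - sia.polarity_scores(response)["compound"]
        ),
    }

# Bag-of-words on the cleaned responses, then PCA keeping 95% of variance
# (482 components on this dataset).
bow = CountVectorizer().fit_transform(data["Response"].map(clean)).toarray()
reduced = PCA(n_components=0.95).fit_transform(bow)
```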
## Exploratory Data Analysis

Key visuals and insights:
- Top Trigrams and Common Words in AI vs. Human responses
- Word Clouds and Text Length Distribution
- Sentiment Gap Analysis and KDE Estimation
- Readability Scores: AI responses are longer and more formulaic
- Text Similarity: AI more aligned with original questions
- Pairplots & Correlation Heatmaps reveal subtle response patterns
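The trigram comparison, for instance, can be reproduced with scikit-learn's n-gram support. This is an illustrative sketch rather than the project's exact EDA code; `data` comes from the loading sketch above.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, n=10):
    """Return the n most frequent trigrams across `texts`."""
    vec = CountVectorizer(ngram_range=(3, 3))
    counts = vec.fit_transform(texts).sum(axis=0).A1
    return sorted(zip(vec.get_feature_names_out(), counts), key=lambda x: -x[1])[:n]

print(top_trigrams(data.loc[data["label"] == 1, "Response"]))  # AI responses
print(top_trigrams(data.loc[data["label"] == 0, "Response"]))  # human responses
```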
## Modeling

### Traditional Models

- Classifiers: Logistic Regression, Linear SVM, Decision Tree, Random Forest, KNN, Gradient Boosting, MLP
- Best accuracy: ~85% (Logistic Regression, SVM, MLP)
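A minimal training loop for two of the strongest models might look like this. The 80/20 split and hyperparameters are assumptions; `reduced` and `data` come from the earlier sketches.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    reduced, data["label"], test_size=0.2, random_state=42, stratify=data["label"]
)

for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```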
### BERT

- Model: `bert-base-uncased` via Hugging Face
- Training:
  - Tokenization (WordPiece)
  - 30 epochs with cross-entropy loss
  - AdamW optimizer
- Performance: Comparable to the traditional models, with potential for further gains
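A condensed fine-tuning sketch, assuming Hugging Face transformers with PyTorch; the batch size and learning rate are assumptions, while the 30 epochs match the setup above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece under the hood
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# `data` as in the loading sketch: raw responses plus 0/1 labels.
enc = tokenizer(data["Response"].tolist(), truncation=True, padding=True, return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(data["label"].tolist())),
    batch_size=16,  # assumption
    shuffle=True,
)

optimizer = AdamW(model.parameters(), lr=2e-5)  # assumption
model.train()
for epoch in range(30):  # 30 epochs, per the training setup above
    for input_ids, attention_mask, labels in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()  # cross-entropy computed internally when labels are passed
        optimizer.step()
        optimizer.zero_grad()
```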
## Streamlit App Features

- Upload new questions and responses
- Evaluate text using trained models
- Visual analytics: word clouds, trigrams, readability, sentiment
- Compare AI vs. human characteristics interactively
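The core prediction flow could be as small as the sketch below. The artifact filenames are hypothetical placeholders; the full app lives in streamlit_code.py.

```python
import joblib
import streamlit as st

# Hypothetical filenames for the trained artifacts.
vectorizer = joblib.load("count_vectorizer.joblib")
pca = joblib.load("pca.joblib")
model = joblib.load("logreg.joblib")

st.title("Human vs. AI Response Classifier")
response = st.text_area("Paste a response to classify")

if st.button("Classify") and response:
    features = pca.transform(vectorizer.transform([response]).toarray())
    label = model.predict(features)[0]
    st.write("Prediction:", "AI-generated" if label == 1 else "Human-written")
```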
## Key Takeaways

- Human responses were simpler and less verbose, and showed more variability
- AI responses were longer, sentimentally aligned with questions, and structurally consistent
- Readability, sentiment gap, and cosine similarity are strong distinguishing features
- The system offers a foundational step toward detecting AI-generated content in education
## Installation and Usage

```bash
# Clone repo
git clone https://github.com/andrew-jxhn/STT811_StatsProject.git
cd STT811_StatsProject

# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Run Streamlit app
streamlit run streamlit_code.py
```