Random Forest Model Pipeline

1. Model Pipeline Components

The pipeline consists of three main components:

a) Text Preprocessor

Custom transformer class compatible with sklearn
Implements the preprocessing steps from Exercise 02:
1. Noise removal (URLs, code blocks, special characters)
2. Text normalization (lowercase, whitespace)
3. Tokenization
4. Stop-word removal
5. Lemmatization
Maintains preprocessing consistency between training and prediction
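
The steps listed above could be wrapped in a scikit-learn transformer roughly as follows. This is a minimal sketch, not the project's actual class: the stop-word list is a tiny hypothetical stand-in, and lemmatization is left as an optional callable (`lemmatize`, an assumed parameter) so the example runs without downloading NLTK or spaCy data.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical, deliberately tiny stop-word list; a real project would
# more likely use the list shipped with NLTK or spaCy.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "this"}


class TextPreprocessor(BaseEstimator, TransformerMixin):
    """Sketch of an sklearn-compatible text cleaner (assumed design)."""

    def __init__(self, lemmatize=None):
        # Optional callable mapping a token to its lemma; None keeps tokens as-is.
        self.lemmatize = lemmatize

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data.
        return self

    def transform(self, X):
        return [self._clean(text) for text in X]

    def _clean(self, text):
        text = re.sub(r"`{3}.*?`{3}", " ", text, flags=re.DOTALL)  # fenced code blocks
        text = re.sub(r"https?://\S+", " ", text)                  # URLs
        text = re.sub(r"[^A-Za-z\s]", " ", text)                   # special characters
        tokens = text.lower().split()                              # normalize + tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]        # stop-word removal
        if self.lemmatize is not None:
            tokens = [self.lemmatize(t) for t in tokens]           # lemmatization hook
        return " ".join(tokens)


cleaned = TextPreprocessor().transform(
    ["See https://example.com: the Button is BROKEN!"]
)
print(cleaned)  # ['see button broken']
```

Because the same object runs inside the pipeline at both fit and predict time, training and prediction inputs pass through identical cleaning.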

b) TF-IDF Vectorizer

Converts preprocessed text into numerical features
Parameters:
- max_features=5000 (limits vocabulary size)
Benefits:
- Captures word importance in documents
- Handles varying document lengths
- Reduces impact of common words
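
As a small illustration of the points above, the sketch below fits `TfidfVectorizer` with `max_features=5000` on a toy corpus (invented for this example) and compares the IDF weight of a term that appears in every document with one that appears only once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "app crashes when saving file",
    "app freezes when opening file",
    "add dark mode to app",
]

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_      # term -> column index
idf = vectorizer.idf_               # inverse document frequency per column

# "app" occurs in every document and "dark" in only one,
# so "dark" receives the larger IDF weight.
print(X.shape)
print(idf[vocab["app"]] < idf[vocab["dark"]])  # True
```

By default each row is L2-normalized, which is what keeps long and short documents on a comparable scale.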

c) Random Forest Classifier

Final classification model
Parameters:
- n_estimators=100 (number of trees)
- max_depth=None (allows full tree growth)
- min_samples_split=2 (minimum samples for split)
- random_state=42 (reproducibility)
Benefits:
- Handles high-dimensional data well
- Less prone to overfitting than a single decision tree, since predictions are averaged over many trees
- Provides feature importance

2. Model Modularization for Integration

Modularization Approach:

Separate Preprocessing Class
- TextPreprocessor class is self-contained
- Can be imported and used independently
- Maintains consistent preprocessing across applications
Serialized Model Pipeline
- Complete pipeline saved using joblib
- Includes preprocessor, vectorizer, and classifier
- Can be loaded and used in Flask application
Verification System
- Preprocessing examples stored in JSON
- Allows verification of preprocessing consistency
- Useful for testing and debugging
Future Integration
- Flask app can import TextPreprocessor class
- Load saved pipeline using joblib
- Use for real-time predictions
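
The serialization and verification ideas above might look like the sketch below. It is written under assumptions: `normalize` is a stand-in for the real preprocessing, `preprocessing_examples.json` is a hypothetical file name, and the pipeline omits the custom preprocessor so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def normalize(text):
    # Stand-in for the real preprocessing step (hypothetical).
    return " ".join(text.lower().split())


# Toy data, invented for illustration only.
texts = ["App  crashes on SAVE", "Add dark mode", "How do I install this"]
labels = ["bug", "enhancement", "question"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(texts, labels)

workdir = Path(tempfile.mkdtemp())
model_path = workdir / "model_pipeline.joblib"
examples_path = workdir / "preprocessing_examples.json"  # hypothetical name

# Training side: persist the fitted pipeline and some reference examples.
joblib.dump(pipeline, model_path)
examples = [{"original": t, "processed": normalize(t)} for t in texts]
examples_path.write_text(json.dumps(examples, indent=2))

# Serving side (for instance at Flask startup): reload and verify.
loaded = joblib.load(model_path)
stored = json.loads(examples_path.read_text())
mismatches = [e for e in stored if normalize(e["original"]) != e["processed"]]
same_predictions = list(loaded.predict(texts)) == list(pipeline.predict(texts))
print(len(mismatches), same_predictions)
```

joblib relies on pickle, so a pipeline that contains a custom class can only be loaded if that class is importable under the same module path in the loading application.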

3. Random Forest and Concept Drift

Drawbacks of Random Forest

Static Model Structure
- Trees are fixed after training
- Cannot adapt to new patterns without retraining
- May become outdated as issue patterns change
Memory Intensive
- Stores many decision trees
- Difficult to update incrementally
- Requires full retraining for updates
Feature Space Limitations
- Fixed vocabulary from training data
- Terms unseen during training are ignored by the fitted TF-IDF vectorizer
- May miss emerging topics
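
The fixed-vocabulary limitation can be seen directly: a fitted `TfidfVectorizer` encodes a document made only of unseen terms as an all-zero row. The corpus below is invented for the demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(["login button broken", "add export feature"])

# Every token here was absent from the training corpus.
row = vectorizer.transform(["kubernetes operator timeout"])
print(row.nnz)  # 0: no feature is set, so the classifier sees an empty document
```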

Alternative Model: Online Learning with SGDClassifier

Benefits for Concept Drift:

Incremental Learning
- Can update with new data points
- Adapts to changing patterns
- Supports partial_fit method
Memory Efficient
- Stores only a linear weight matrix rather than many trees
- Lighter memory footprint
- Easier to deploy and update
Adaptive Learning Rate
- Learning-rate schedule is configurable (learning_rate and eta0 parameters)
- Step size controls how strongly new samples override older knowledge
- Better handles concept drift
Implementation Strategy
- Regular model updates with new data
- Monitoring of prediction confidence
- Sliding window for recent patterns
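
One way the strategy above could be sketched is shown below. The class name `OnlineIssueClassifier`, the window size, and the choice of `HashingVectorizer` (swapped in for TF-IDF so that no fixed vocabulary is needed) are all assumptions made for illustration, not part of the original project.

```python
from collections import deque

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = np.array(["bug", "enhancement", "question"])


class OnlineIssueClassifier:
    """Hypothetical wrapper: incremental updates plus a sliding window."""

    def __init__(self, window_size=500):
        # Stateless hashing: previously unseen terms still map to a feature.
        self.vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
        self.model = SGDClassifier(random_state=42)
        self.window = deque(maxlen=window_size)  # keeps only recent samples

    def update(self, texts, labels):
        """Remember the batch and take one incremental training step."""
        self.window.extend(zip(texts, labels))
        X = self.vectorizer.transform(texts)
        self.model.partial_fit(X, labels, classes=CLASSES)

    def refit_on_window(self):
        """Rebuild the model from recent samples only, dropping old patterns."""
        texts, labels = zip(*self.window)
        self.model = SGDClassifier(random_state=42)
        X = self.vectorizer.transform(texts)
        self.model.partial_fit(X, list(labels), classes=CLASSES)

    def predict(self, texts):
        return self.model.predict(self.vectorizer.transform(texts))

    def confidence(self, texts):
        """Largest decision score per text, a rough confidence proxy."""
        scores = self.model.decision_function(self.vectorizer.transform(texts))
        return scores.max(axis=1)


clf = OnlineIssueClassifier(window_size=4)
clf.update(
    ["app crashes on start", "please add dark mode", "how do i configure logging"],
    ["bug", "enhancement", "question"],
)
clf.update(["crash when saving", "how to install"], ["bug", "question"])
clf.refit_on_window()
print(clf.predict(["app crashes"]), clf.confidence(["app crashes"]))
```

Tracking the confidence values over time is one possible drift signal: a sustained drop suggests the incoming issues no longer resemble what the model was trained on.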

After the execution of app.py:

### 1. Preprocessing Verification
First, the program verified the preprocessing pipeline using the examples saved from Exercise 02:

Example 1: GitHub issue about Entities and fields
- Original: Technical issue about '__tileSrcRect' fields
- Processed: Cleaned, tokenized, and lemmatized version without URLs and special characters

Example 2: Bug report about blog link
- Original: Markdown-formatted bug report about updating website links
- Processed: Clean text with key terms preserved but formatting removed

Example 3: Technical discussion about expressions
- Original: Code example with markdown formatting
- Processed: Plain text with code-related terms preserved

### 2. Model Training and Evaluation

The program then trained and evaluated the Random Forest model:

Model Evaluation Results:
- Bug issues:
  - Precision: 0.76 (76% of predicted bugs were actual bugs)
  - Recall: 0.79 (79% of actual bugs were correctly identified)
  - F1-score: 0.77 (harmonic mean of precision and recall)

- Enhancement issues:
  - Precision: 0.70
  - Recall: 0.80
  - F1-score: 0.75

- Question issues:
  - Precision: 0.61
  - Recall: 0.09 (very low - model struggles with questions)
  - F1-score: 0.16 (poor performance on questions)

Overall:
- Accuracy: 0.73 (73% of all predictions were correct)
- The model performs well on bugs and enhancements
- Struggles with questions (likely due to class imbalance)
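
As a quick consistency check, the reported F1 scores can be recomputed from the precision and recall values above. The helper name `f1_from_pr` is introduced here only for illustration.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation results.
reported = {
    "bug": (0.76, 0.79),
    "enhancement": (0.70, 0.80),
    "question": (0.61, 0.09),
}

for label, (p, r) in reported.items():
    print(f"{label}: F1 = {f1_from_pr(p, r):.2f}")
# bug: F1 = 0.77
# enhancement: F1 = 0.75
# question: F1 = 0.16
```

The question class shows why F1 is reported alongside precision: a recall of 0.09 drags the harmonic mean down even though precision is moderate.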

### 3. Model Serialization

The trained pipeline is saved to model_pipeline.joblib. The file contains the complete pipeline:
1. TextPreprocessor
2. TF-IDF Vectorizer
3. Random Forest Classifier

It can be loaded later with joblib (the TextPreprocessor class must be importable when loading):

    import joblib

    loaded_model = joblib.load('model_pipeline.joblib')
    prediction = loaded_model.predict(['new issue text'])

Key Observations:

- Preprocessing works consistently across different types of issues
- The model performs well on the majority classes (bugs and enhancements)
- Poor performance on the minority class (questions) suggests a need for:
  - Class balancing techniques
  - More training data for questions
  - Possibly a different model architecture for better minority-class handling
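
One of the class balancing techniques mentioned above is class weighting, which scikit-learn exposes through `class_weight="balanced"`. The sketch below uses invented, deliberately imbalanced toy data; whether it would actually improve recall on questions in this project is untested.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight

# Invented toy data: "question" is the minority class.
texts = ["app crashes with error"] * 6 + ["add new feature"] * 5 + ["how do i use this"]
labels = np.array(["bug"] * 6 + ["enhancement"] * 5 + ["question"])

# "balanced" weights are n_samples / (n_classes * class_count),
# so rare classes count for more during training.
classes = np.unique(labels)
weights = compute_class_weight("balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)

balanced_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        class_weight="balanced",  # apply the same weighting inside the forest
    )),
])
balanced_pipeline.fit(texts, labels)
```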

The saved model can now be used in a Flask application for real-time predictions on new GitHub issues.

Syed007Hassan/MlPipeline-RandomForest
