Stack-Overflow-Tag-Prediction

Objective:

To predict as many tags as possible with high precision and recall.

Description:

The dataset was obtained from Kaggle. The task is a multi-label classification problem. The dataset contains the features Id, Title, Body and Tags. Data preprocessing and cleaning were done to remove HTML tags and hyperlinks. Micro-averaged F1 score was used as the performance metric, as specified on Kaggle.
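To make the metric concrete, here is a minimal sketch of micro-averaged F1 for multi-label predictions. The function name `micro_f1` and the toy tag sets are illustrative assumptions, not code from this repository. Micro-averaging pools true positives, false positives and false negatives over all questions and tags before computing precision and recall.

```python
def micro_f1(true_tags, predicted_tags):
    """Micro-averaged F1 over a list of questions.

    true_tags / predicted_tags: one iterable of tags per question
    (hypothetical input format chosen for this sketch).
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # tags missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 correct tags, 1 spurious tag, 1 missed tag.
truth = [["java", "json"], ["php"]]
preds = [["java"], ["php", "mysql"]]
score = micro_f1(truth, preds)
```

Because the counts are pooled, frequent tags influence the score more than rare ones.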

Data: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data

Features:

As part of feature engineering, a new feature named question was created by combining title + body.
Code, HTML tags and stopwords were removed from the body as part of data cleaning.
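The cleaning and feature-building steps above could look roughly like the following sketch. The regular expressions, the tiny `STOPWORDS` set and the function names `clean_body` and `build_question` are assumptions made for illustration; the original project may have used different tools, such as a full stopword list from an NLP library.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list for the sketch.
STOPWORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "how", "do", "i", "this"}

CODE_RE = re.compile(r"<pre\b[^>]*>.*?</pre>|<code\b[^>]*>.*?</code>",
                     re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
# Keep '#', '+' and '.' so tokens such as c#, c++ and asp.net survive.
TOKEN_RE = re.compile(r"[a-z0-9#+.]+")


def clean_body(body):
    """Remove code blocks, HTML tags, hyperlinks and stopwords from a body."""
    text = CODE_RE.sub(" ", body)   # drop code snippets first
    text = TAG_RE.sub(" ", text)    # drop remaining HTML tags
    text = html.unescape(text)      # decode entities such as &amp;
    text = URL_RE.sub(" ", text)    # drop bare hyperlinks
    tokens = (t.strip(".") for t in TOKEN_RE.findall(text.lower()))
    return " ".join(t for t in tokens if t and t not in STOPWORDS)


def build_question(title, body):
    """Combine the title with the cleaned body into one 'question' text."""
    return f"{title} {clean_body(body)}"


sample_body = ('<p>How do I read a file in C#? See '
               '<a href="http://example.com/x">this</a>.</p>'
               '<pre><code>var x = 1;</code></pre>')
cleaned = clean_body(sample_body)
question = build_question("Read file", sample_body)
```

Code is stripped before the generic tag pattern runs, so the contents of `<pre>` and `<code>` blocks do not leak into the text.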

Case Study Flow:

The objective of this case study was to suggest tags based on the content of a question posted on Stack Overflow.
The training set contains about 6M data points with Id, Title, Body and Tags as features.
EDA on the tags showed that "c#", "java", "php", "asp.net", "javascript" and "c++" are among the most frequent tags.
On average, each question has 2.88 tags.
Only 5,500 tags are considered, which together cover 99.04% of the questions.
Several machine learning models were tried with a One-vs-Rest (OvR) classifier to find the best results.
Logistic regression with TF-IDF features gave the best accuracy of 0.236 when trained on 1M data points.
Model accuracy degraded as the number of training data points was reduced, as expected.
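The tag selection and modeling steps can be sketched with scikit-learn as below. This is an illustrative outline on toy data, not the project's actual code. It assumes that the retained tags are the most frequent ones and that "coverage" means the share of questions keeping at least one retained tag. The helper names `top_tags` and `coverage` and all hyperparameters are hypothetical.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer


def top_tags(tag_lists, k):
    """Return the k most frequent tags (assumed selection rule)."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]


def coverage(tag_lists, kept):
    """Fraction of questions that keep at least one of the retained tags."""
    kept = set(kept)
    return sum(1 for tags in tag_lists if kept & set(tags)) / len(tag_lists)


# Toy stand-in for the cleaned "question" texts and their tags.
questions = [
    "parse json string in java",
    "java string to int conversion",
    "php connect to mysql database",
    "php sort array by value",
    "javascript click event listener",
    "javascript promise async await",
    "call java servlet from javascript",
    "haskell monad transformer",
]
tag_lists = [
    ["java", "json"], ["java"], ["php", "mysql"], ["php"],
    ["javascript"], ["javascript"], ["java", "javascript"], ["haskell"],
]

kept = top_tags(tag_lists, k=3)
covered = coverage(tag_lists, kept)

# Keep only questions with at least one retained tag, and only retained tags.
rows = [(q, [t for t in tags if t in kept])
        for q, tags in zip(questions, tag_lists) if set(tags) & set(kept)]
texts = [q for q, _ in rows]

X = TfidfVectorizer().fit_transform(texts)                              # TF-IDF features
Y = MultiLabelBinarizer(classes=kept).fit_transform([t for _, t in rows])  # one column per tag

# One binary logistic regression per tag (One-vs-Rest).
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X, Y)
predictions = model.predict(X)

# Scored on the training data only to show the API; use a held-out split in practice.
score = f1_score(Y, predictions, average="micro", zero_division=0)
```

One-vs-Rest trains an independent binary classifier per tag, so training cost grows with the number of retained tags. This is one reason to restrict the tag vocabulary before fitting.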
