Data Preparation #17

@Technocolabs100

Description

Remember that we have 1 million questions with 42k tags, and training on this much data would be slow and difficult. So I decided to work with a smaller subset of tags. Let C be the full set of 42k tags and C1 a subset of C. To find the smallest suitable C1, we can use the tag counts that we plotted earlier. We know how many times each tag occurs, so by keeping only the most frequently occurring tags we can cover the maximum number of questions. After trying many values, I found that the top 5,500 most frequently occurring tags cover almost 99% of the questions.
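The selection described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names `top_k_tags` and `coverage`, the toy data, and the input shape (one list of tags per question) are all assumptions. It also assumes that a question counts as "covered" when at least one of its tags is in the kept subset, which the description does not state explicitly.

```python
from collections import Counter


def top_k_tags(question_tags, k):
    """Return the set of the k most frequent tags (hypothetical helper)."""
    counts = Counter(tag for tags in question_tags for tag in tags)
    return {tag for tag, _ in counts.most_common(k)}


def coverage(question_tags, k):
    """Fraction of questions that keep at least one tag among the top k.

    Assumption: "covered" means at least one tag of the question survives.
    """
    if not question_tags:
        return 0.0
    keep = top_k_tags(question_tags, k)
    covered = sum(1 for tags in question_tags if any(t in keep for t in tags))
    return covered / len(question_tags)


# Toy data standing in for the real question/tag dataset.
questions = [
    ["python", "pandas"],
    ["python"],
    ["java"],
    ["c++", "templates"],
    ["python", "java"],
]

for k in (1, 2, 5):
    print(f"top {k} tags cover {coverage(questions, k):.0%} of questions")
```

On the real data, the same loop could be run over candidate values of k (for example 500, 1000, 5000, 5500) to see where the coverage curve flattens out.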

Labels

enhancement (New feature or request)
