🧠 YouTube Topic Modeling
(Topic analysis of YouTube video descriptions with LDA, BERTopic, Top2Vec, and CombinedTM)
This project analyzes Vietnamese YouTube video descriptions to discover latent semantic topics using modern topic modeling algorithms:
LDA, BERTopic, Top2Vec, and a hybrid CombinedTM (LDA + BERTopic).
Why does this matter?
- YouTube descriptions reflect content trends & user behavior
- Topic modeling helps understand large-scale unstructured data
- Vietnamese text is noisy → requires deep preprocessing
- Useful for recommendation systems, content moderation, trend analysis
Dataset:
- 7,653 cleaned Vietnamese video descriptions
- Collected via YouTube Data API v3
- Includes title, tags, category_id, description, engagement metrics
Applied steps:
- Lowercase text
- Remove emoji, URLs, emails, hashtags
- Remove advertisement keywords (“đăng ký”, “like”, “share”)
- Normalize Vietnamese characters
- Tokenize using Underthesea
- Remove stopwords
- Filter out short descriptions (< 5 tokens)
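The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: whitespace splitting stands in for the Underthesea tokenizer, "normalize Vietnamese characters" is assumed to mean Unicode NFC normalization, and the stopword list is a tiny hypothetical sample. Normalization runs first here so that the advertisement phrases match regardless of how the accents are encoded.

```python
import re
import unicodedata


def nfc(s):
    """Compose accented characters into their canonical (NFC) form."""
    return unicodedata.normalize("NFC", s)


# Advertisement keywords named in the list above.
AD_PHRASES = [nfc(p) for p in ("đăng ký", "like", "share")]
# Hypothetical sample; the project's real stopword list is not given.
STOPWORDS = {nfc(w) for w in ("và", "là", "của", "các", "những")}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is neither a word character nor whitespace (emoji, punctuation).
NON_WORD_RE = re.compile(r"[^\w\s]")


def preprocess(text, min_tokens=5):
    """Return a list of cleaned tokens, or None if the description is too short."""
    text = nfc(text).lower()
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    for phrase in AD_PHRASES:
        text = re.sub(rf"\b{re.escape(phrase)}\b", " ", text)
    text = NON_WORD_RE.sub(" ", text)
    # Assumption: whitespace split stands in for Underthesea word tokenization.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return tokens if len(tokens) >= min_tokens else None
```

Calling `preprocess(description)` on each raw description and dropping the `None` results gives the filtered, tokenized documents. A real Vietnamese tokenizer would also join multi-syllable words, which whitespace splitting cannot do.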
Outputs:
- clean_description
- tokenized list for LDA
- corpus + dictionary
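To make the "corpus + dictionary" output concrete, here is a small pure-Python sketch of a bag-of-words representation. The source does not say which library builds these objects; libraries such as gensim offer equivalents (`Dictionary` and `doc2bow`), and the helper names below are illustrative only.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy tokenized documents (hypothetical, not taken from the dataset).
docs = [["review", "laptop", "gaming", "laptop"], ["nấu", "phở", "review"]]
dictionary = build_dictionary(docs)
corpus = [doc2bow(doc, dictionary) for doc in docs]
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the Bag-of-Words input that LDA consumes.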
LDA (statistical model of topic probability distributions)
- Input: Bag-of-Words
- Optimized number of topics: 5
- Best Coherence: 0.549
BERTopic (Embedding + UMAP + HDBSCAN + c-TF-IDF)
- Multilingual Transformer
- Best Coherence: 0.714 (the comparison table reports C_v = 0.660)
- Best semantic quality across models
- Produces clear topic boundaries
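The c-TF-IDF step is what turns clusters into readable topics. As a rough sketch, assuming the formulation given in the BERTopic documentation (term frequency within a cluster, weighted by `log(1 + A / f)` where `A` is the average number of words per cluster and `f` is the term's frequency across all clusters), it can be written in plain Python. The function names are illustrative, raw counts are used without the per-cluster normalization a library may apply, and the embedding, UMAP, and HDBSCAN stages are assumed to have already produced the clusters.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters: dict of cluster_id -> list of tokens from all documents in that cluster."""
    tf = {cid: Counter(tokens) for cid, tokens in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(clusters)
    return {
        cid: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for cid, counts in tf.items()
    }


def top_words(scores, k=3):
    """Highest-scoring words for one cluster; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy clusters (hypothetical tokens, not taken from the dataset).
clusters = {
    0: ["laptop", "gaming", "laptop", "review"],
    1: ["pho", "recipe", "pho", "review"],
}
scores = c_tf_idf(clusters)
```

In this toy case `review` appears in both clusters, so it scores lower than the words specific to one cluster, which is the behavior that gives each topic distinctive keywords.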
Top2Vec (joint embedding of documents, words, and topics)
- Automatically finds number of topics
- High Topic Diversity
- Lowest Coherence among the three main models
CombinedTM (hybrid model combining LDA topic distributions with BERTopic embeddings)
Results:
- Coherence (C_v): 0.663 (highest in the comparison table)
- Stable & interpretable
- Strong semantic grouping
Well balanced across:
✔ Coherence
✔ Stability
✔ Purity
✔ Topic Diversity
| Model | C_v | NPMI | NMI | ARI | Diversity | Remarks |
|---|---|---|---|---|---|---|
| LDA | 0.549 | 0.039 | 0.193 | 0.092 | 0.800 | Easy to interpret, weak topic boundaries |
| BERTopic | 0.660 | 0.077 | 0.339 | 0.176 | 0.741 | Strongest semantics, clear separation |
| Top2Vec | 0.421 | 0.000 | 0.148 | 0.043 | 0.800 | High diversity, weak semantics |
| CombinedTM | 0.663 | 0.076 | 0.339 | 0.068 | 0.748 | Best overall, well balanced |
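The source does not state how the Diversity column was computed. One common definition is the fraction of unique words among the top-k words of all topics, so that 1.0 means no topic shares a keyword with another. A minimal sketch under that assumption (the function name and `top_k` default are illustrative):

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of word lists, each ordered from most to least relevant.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics (hypothetical): one keyword is shared between the two topics.
example_topics = [["a", "b", "c", "d"], ["a", "e", "f", "g"]]
diversity = topic_diversity(example_topics, top_k=4)  # 7 unique words out of 8
```

Under this definition a high score only says that topics use different keywords; it says nothing about whether those keywords belong together, which is why it is read alongside the coherence columns.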
Repository includes:
- WordCloud
- Bubble Chart
- Topic Cluster Maps
- Coherence plots
- Embedding visualizations (UMAP)
- BERTopic → strongest semantic model
- LDA → interpretable but overlapping
- Top2Vec → diverse but weak coherence
- CombinedTM → hybrid model with the best overall balance and the highest C_v coherence
- Vietnamese descriptions require strong preprocessing
- Short-text topic modeling is challenging but solvable with embeddings
- YouTube content trend analysis
- Automatic content tagging
- Emerging hot-topic detection
- Recommendation algorithm improvement
- Digital media research
- Hà Thế Anh
- Nguyễn Nhật Nam
- Hoàng Quang Minh
Supervisor:
- Lê Nhật Tùng – HUTECH University