Handling out-of-vocabulary (OOV) words in Bahasa Rojak (Malaysian code-mixed language) and performing sentiment analysis with a fine-tuned cross-lingual model, XLM-RoBERTa (XLM-T)
The Jupyter notebooks cover:
- Text Preprocessing
- Model Fine Tuning
- New Data Inference Pipeline
Further project resources are available here: https://drive.google.com/drive/folders/12Uir9KE4B1VL6oQWdj2BWvCUZOC0vWa2
| Preprocessing Method | Model 1 (V1) | Model 2 (V2) | Model 3 (V3) | Model 4 (V4) |
|---|---|---|---|---|
| Remove URLs | ✔ | ✔ | ✔ | ✔ |
| Convert to Lowercase | ✔ | ✔ | ✔ | - |
| Remove Punctuation | ✔ | ✔ | ✔ | - |
| Remove Irregular Spaces | ✔ | ✔ | ✔ | ✔ |
| Handle OOV | ✔ | ✔ | ✔ | ✔ |
| Remove Stopwords | ✔ | ✔ | - | - |
| Chinese Character Segmentation | - | ✔ | ✔ | - |
| Remove Rare Words | - | - | ✔ | - |
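
The table above can be read as a per-model configuration of preprocessing steps. The following is a minimal sketch of that idea, not the notebooks' actual code: `OOV_MAP`, `STOPWORDS`, `VARIANTS`, and `preprocess` are hypothetical names, the lookup tables are tiny placeholders, and the Chinese segmentation step is a naive character split standing in for whatever segmenter the project uses.

```python
import re
import string

# Hypothetical placeholder tables; the project's real OOV dictionary
# and stopword list are assumptions here.
OOV_MAP = {"sgt": "sangat", "mkn": "makan", "xde": "tiada"}
STOPWORDS = {"yang", "dan", "the", "is"}

# Steps enabled per model variant, transcribed from the table above.
VARIANTS = {
    "V1": {"urls", "lower", "punct", "spaces", "oov", "stopwords"},
    "V2": {"urls", "lower", "punct", "spaces", "oov", "stopwords", "segment_zh"},
    "V3": {"urls", "lower", "punct", "spaces", "oov", "segment_zh", "rare"},
    "V4": {"urls", "spaces", "oov"},
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
CJK_RE = re.compile(r"([\u4e00-\u9fff])")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, variant, rare_words=frozenset()):
    """Apply the steps enabled for `variant` and return the cleaned text."""
    steps = VARIANTS[variant]
    if "urls" in steps:
        text = URL_RE.sub(" ", text)
    if "lower" in steps:
        text = text.lower()
    if "punct" in steps:
        text = text.translate(PUNCT_TABLE)  # ASCII punctuation only
    if "segment_zh" in steps:
        # Naive stand-in: put each CJK character in its own token.
        text = CJK_RE.sub(r" \1 ", text)
    # Splitting on whitespace also removes irregular spaces,
    # a step that every variant enables.
    tokens = text.split()
    if "oov" in steps:
        tokens = [OOV_MAP.get(t.lower(), t) for t in tokens]
    if "stopwords" in steps:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if "rare" in steps:
        tokens = [t for t in tokens if t not in rare_words]
    return " ".join(tokens)


sample = "Makanan ni SGT sedap!!  https://t.co/abc"
print(preprocess(sample, "V1"))
print(preprocess(sample, "V4"))
```

Keeping the steps in one table-like dictionary makes it easy to compare variants: V4, for instance, leaves casing and punctuation intact and only strips URLs, normalizes spaces, and maps OOV tokens.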

Evaluation metrics for each model, reported per class (0 and 1):

| Model | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-Score (0) | F1-Score (1) | Accuracy |
|---|---|---|---|---|---|---|---|
| Model V1 | 0.716 | 0.830 | 0.840 | 0.702 | 0.773 | 0.760 | 0.767 |
| Model V2 | 0.768 | 0.771 | 0.735 | 0.801 | 0.751 | 0.786 | 0.770 |
| Model V3 | 0.794 | 0.703 | 0.691 | 0.802 | 0.739 | 0.749 | 0.744 |
| Model V4 | 0.861 | 0.833 | 0.802 | 0.884 | 0.831 | 0.858 | 0.845 |
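
As a quick sanity check on the results table, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported values, assuming the seven numbers in each row are ordered as precision (class 0, class 1), recall (class 0, class 1), F1 (class 0, class 1), and accuracy. The names `f1_score`, `REPORTED`, and `max_gap` are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per class, copied from the table above.
REPORTED = {
    "V1": [(0.716, 0.840, 0.773), (0.830, 0.702, 0.760)],
    "V2": [(0.768, 0.735, 0.751), (0.771, 0.801, 0.786)],
    "V3": [(0.794, 0.691, 0.739), (0.703, 0.802, 0.749)],
    "V4": [(0.861, 0.802, 0.831), (0.833, 0.884, 0.858)],
}


def max_gap(rows):
    """Largest absolute difference between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows)


for model, rows in REPORTED.items():
    # Small gaps are expected because the inputs are rounded to 3 decimals.
    print(model, round(max_gap(rows), 4))
```

Under that column reading, every recomputed F1 agrees with the reported value to within rounding, which supports the interpretation of the columns.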


