Dataset Collection for Thesis.
- Raw dataset (captions & comments) for Balinese text classification: instagram.json
- Some of the source Instagram accounts:
- Example scraped data (captions & comments) from various social media platforms: dataset.json
- Notebook for scraping data: notebook.ipynb
- Notebook for pre-processing dataset: preprocessing.ipynb
- Dataset to be annotated by Penyuluh Bahasa Bali: dataset.xlsx
- Dataset annotation result: thesis-ml/dataset
This project mainly uses Selenium and BeautifulSoup4 to scrape data from various social media platforms, so you need:
- Webdriver for Selenium (Chrome in this project)
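As a rough illustration of how the two libraries fit together, the sketch below lets Selenium render a page in Chrome and hands the resulting HTML to BeautifulSoup4 for parsing. It is not the code from notebook.ipynb: the function names `fetch_rendered_html` and `extract_text`, and any CSS selector you pass in, are hypothetical and only serve the example. It assumes the dependencies from the installation steps are already installed.

```python
from bs4 import BeautifulSoup


def extract_text(html: str, selector: str) -> list[str]:
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(selector)]


def fetch_rendered_html(url: str) -> str:
    """Open a URL in Chrome via Selenium and return the rendered page source.

    Selenium is imported here so the parsing helper above can be used
    (and tested) without a browser or WebDriver installed.
    """
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

A typical call would be `extract_text(fetch_rendered_html(url), selector)`, where the selector depends on the markup of the site being scraped and has to be found by inspecting the page.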
- Clone the repository:
git clone https://github.com/putuwaw/thesis-dataset.git
- Install dependencies:
pip install -r requirements.txt
- If installing from requirements.txt fails, try installing the packages directly:
pip install ipykernel selenium beautifulsoup4 pydantic pandas linggapy openpyxl requests
- You are now ready to scrape and collect data.
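To confirm that the installation worked before opening the notebooks, a small check like the following can help. It is only a sketch: the helper `missing_packages` is hypothetical, and the import names are assumptions derived from the pip package names above (beautifulsoup4 is imported as `bs4`; `linggapy` is assumed to be importable under its package name).

```python
from importlib.util import find_spec

# Import names assumed from the pip package names listed above.
REQUIRED_MODULES = [
    "ipykernel",
    "selenium",
    "bs4",  # installed by the beautifulsoup4 package
    "pydantic",
    "pandas",
    "linggapy",  # assumption: import name matches the package name
    "openpyxl",
    "requests",
]


def missing_packages(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any module reported as missing can then be installed individually with pip.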