This project compiles Chamorro-language text from multiple sources into clean, structured datasets to support language preservation, analysis, and the development of educational tools. Using Python, it extracts raw text from PDF files, scraped web content, dictionary datasets, and OCR-scanned documents, then organizes the material into two primary datasets: one of unique Chamorro words and one of full sentences.
By converting these documents into searchable, machine-readable formats, the project makes historical and contemporary Chamorro texts more accessible to researchers, language learners, and digital applications. This work lays the foundation for future projects, including language analysis, natural language processing (NLP), dictionary building, and digital learning tools.
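
The split into a word dataset and a sentence dataset described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function names `split_sentences` and `unique_words`, the regular expressions, and the sample text are all illustrative assumptions. In particular, it assumes that sentences end in `.`, `!`, or `?`, and that an apostrophe inside or at the end of a word marks a glottal stop and belongs to the word.

```python
import re
import unicodedata

# Assumed tokenizer: runs of letters, optionally joined by an apostrophe or
# hyphen, with an optional trailing apostrophe (treated here as a glottal stop).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*['’]?")

# Assumed sentence boundary: whitespace that follows ., !, or ?
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return unique sentences from raw text, in first-seen order."""
    # NFC keeps characters such as "å" as a single code point.
    text = unicodedata.normalize("NFC", text)
    # Collapse line breaks and repeated spaces left over from extraction.
    text = " ".join(text.split())
    sentences = [s for s in SENTENCE_END_RE.split(text) if s]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(sentences))


def unique_words(sentences):
    """Return a sorted list of unique lowercase words found in the sentences."""
    words = set()
    for sentence in sentences:
        for word in WORD_RE.findall(sentence):
            # Normalize curly apostrophes so variants count as one word.
            words.add(word.lower().replace("’", "'"))
    return sorted(words)


# Usage with a made-up sample standing in for extracted text.
sample = "Håfa adai!  Si Yu'os ma'åse'.\nHåfa adai!"
sentences = split_sentences(sample)
print(sentences)
print(unique_words(sentences))
```

Treating a trailing apostrophe as part of the word means a closing single quotation mark would also be kept, so real OCR or web text would likely need an extra cleaning step before this one.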