Chamorro Corpus Builder

About This Project

This project compiles Chamorro-language text from a range of sources into clean, structured datasets that support language preservation, analysis, and the development of educational tools. Python scripts extract raw text from PDF files, scraped web content, dictionary datasets, and OCR-scanned documents, then organize the results into two primary datasets: one of unique Chamorro words and one of full sentences.
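
As a rough illustration of the last step, the sketch below turns already-extracted raw text into the two datasets (unique words and full sentences). It is not taken from the repository's scripts: the function names, the tokenizer pattern, and the sentence-splitting rule are all assumptions chosen for the example, and it uses only the Python standard library.

```python
import re
import unicodedata

# Hypothetical tokenizer (an assumption, not the project's actual rule):
# runs of letters, including å and ñ, with internal apostrophes or hyphens
# kept inside the word, plus an optional trailing apostrophe for a
# word-final glottal stop. A closing quote mark right after a word would
# also be captured, which a real pipeline would need to handle.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*['’]?")

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def normalize(text):
    """Compose characters to NFC and collapse whitespace runs to one space."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Return sentences in first-seen order, with exact duplicates dropped."""
    parts = [s for s in SENTENCE_END_RE.split(normalize(text)) if s]
    return list(dict.fromkeys(parts))


def unique_words(text):
    """Return the sorted set of lowercased word tokens."""
    return sorted({w.lower() for w in WORD_RE.findall(normalize(text))})


if __name__ == "__main__":
    raw = "Håfa adai!  Si Yu'os ma'åse'.\nHåfa adai!"
    print(split_sentences(raw))
    print(unique_words(raw))
```

NFC normalization is applied first so that a letter such as å is always stored as a single code point, whichever way the PDF or OCR source encoded it; without that step the same word could appear twice in the unique-word list.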

By converting these documents into searchable, machine-readable formats, the project makes historical and contemporary Chamorro texts more accessible to researchers, language learners, and digital applications. This work lays the foundation for future projects, including language analysis, natural language processing (NLP), dictionary building, and digital learning tools.

Folders and files

inputs
scripts/text-extraction
README.md

schyuler/Chamorro-Corpus-Builder