This project compiles Chamorro-language text data from various sources into clean, structured datasets to support language preservation, analysis, and educational tool development. (WIP)

schyuler/Chamorro-Corpus-Builder

Chamorro Corpus Builder

About This Project

This project compiles Chamorro-language text data from various sources into clean, structured datasets to support language preservation, analysis, and educational tool development. Using Python, it extracts raw text from PDF files, scraped web content, dictionary datasets, and OCR-scanned documents, then organizes the material into two primary datasets: one of unique Chamorro words and one of full sentences.
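As a rough illustration of the last step, the sketch below shows one way already-extracted raw text could be split into a sentence list and a unique-word list. It is not taken from this repository: the function names `split_sentences` and `unique_words`, the regular expressions, and the sample text are all assumptions made for the example, and real PDF or OCR output would likely need more cleanup than this.

```python
import re
import unicodedata

# Hypothetical helpers for illustration only; not the project's actual code.

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A word is a run of letters, optionally joined by apostrophes or hyphens,
# with an optional trailing apostrophe so a word-final glottal stop
# (as in "ma'åse'") is kept. Assumption: a closing single quote right after
# a word would also be captured by this simple pattern.
_WORD = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*['’]?")


def split_sentences(raw_text):
    """Normalize raw text and return its sentences in order."""
    # NFC keeps letters such as "å" as single code points.
    text = unicodedata.normalize("NFC", raw_text)
    # Collapse line breaks and repeated spaces left over from extraction.
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in _SENTENCE_END.split(text) if s]


def unique_words(sentences):
    """Return the sorted set of lowercased words found in the sentences."""
    words = set()
    for sentence in sentences:
        for match in _WORD.finditer(sentence):
            # Treat the curly apostrophe and the ASCII one as the same mark.
            words.add(match.group(0).lower().replace("’", "'"))
    return sorted(words)


if __name__ == "__main__":
    sample = "Håfa adai!  Si Yu'os ma'åse'.\nHåfa adai!"
    sentences = split_sentences(sample)
    print(sentences)
    print(unique_words(sentences))
```

Normalizing to NFC before tokenizing is a design choice in this sketch: text from different sources may encode "å" either as one character or as "a" plus a combining mark, and without normalization the same word could be counted twice.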

By processing these documents into searchable, machine-readable formats, the project helps make historical and contemporary Chamorro texts more accessible to researchers, language learners, and digital applications. This work lays the foundation for future projects, including language analysis, natural language processing (NLP), dictionary building, and other digital learning tools.
