I've implemented a probabilistic word segmentation algorithm for the Khmer language using a Viterbi approach (finding the shortest path in a graph of possible segments) weighted by word probabilities derived from a text corpus.
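The core idea can be sketched as a dynamic program over prefix costs. The following is a minimal illustration only, using a toy Latin-alphabet dictionary; the cost values and the `UNKNOWN_COST` penalty are made up for the example and are not the project's actual weights:

```python
import math

def viterbi_segment(text, word_costs, max_word_len=8):
    """Shortest-path segmentation: best[i] = minimal cost to segment text[:i].
    word_costs maps word -> cost (e.g. -log of relative frequency)."""
    n = len(text)
    UNKNOWN_COST = 100.0  # hypothetical penalty so single unknown chars still segment
    best = [0.0] + [math.inf] * n   # best[i]: cheapest segmentation of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that path
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            cost = word_costs.get(word, UNKNOWN_COST if i - j == 1 else math.inf)
            if best[j] + cost < best[i]:
                best[i] = best[j] + cost
                back[i] = j
    # Recover the word sequence by walking the backpointers from the end.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Toy dictionary: lower cost = more frequent word.
costs = {"the": 1.0, "them": 2.5, "there": 2.0, "re": 4.0, "a": 1.5}
print(viterbi_segment("there", costs))  # → ['there']
```

Because the dictionary and costs fully determine the shortest path, the same input always yields the same segmentation.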
Real-time Segmentation & Editing
- Native Typing Experience: Type quickly with zero lag.
- Instant Feedback: Unknown words are underlined in red, updating in real-time (~0.3ms latency) as you type.
- Smart Paste: Automatically re-segments pasted text while preserving formatting signals.
Visual Analysis & Navigation
- Segmentation Blocks: See exactly how the algorithm splits each word.
- Navigation Tools: Jump instantly between unknown words to verify potential errors or new vocabulary.
- Analytics: View word count and unknown word statistics.
- khmernltk: Used for benchmarking.
- sovichet: For providing the Khmer Folktales Corpus and Dictionary resources.
- phylypo: For providing the kh_data_10000b dataset used for frequency analysis.
Important
Disclaimer: My dictionary still lacks many curated sources of technical words. If you can contribute curated Khmer words with credible sources, the algorithm will improve significantly. I highly appreciate contributions that improve the data quality!
TL;DR: Most ML-based Khmer segmentation relies on inconsistent Zero-Width Space (ZWS) annotations in training data. Since 100 people annotate ZWS differently, models learn "average human inconsistency" rather than linguistic correctness. I ignore ZWS entirely, using curated dictionaries and frequency-based Viterbi search for deterministic, dictionary-accurate segmentation.
✅ 100% Deterministic – Same input always produces identical output
✅ Explainable – Update dictionary/frequency = instant improvement
✅ Ultra-Fast – ~0.34ms/call (C port) vs. ~5-50ms for transformers
✅ Portable – Runs on embedded systems, mobile, web (WASM), and cloud
✅ No Training – Works immediately without GPU/annotation costs
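As a minimal illustration of the "ignore ZWS entirely" step, stripping the inconsistent invisible separators before segmentation might look like this (the function name is hypothetical, not the project's actual API):

```python
# U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER
ZWS, ZWNJ = "\u200b", "\u200c"

def strip_invisible(text: str) -> str:
    """Remove invisible separators so segmentation depends only on the
    dictionary, not on each annotator's ZWS habits."""
    for ch in (ZWS, ZWNJ):
        text = text.replace(ch, "")
    return text

print(strip_invisible("ក្រុម\u200bហ៊ុន"))  # → ក្រុមហ៊ុន (ZWS removed)
```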
I've built this as foundational infrastructure for:
- ML Research: Clean training data for Khmer LLMs by providing dictionary-accurate baselines
- Spellcheck/Grammar Tools: Essential word boundary detection for text editors, IDEs, browsers, and mobile keyboards
- Production Systems: Reliable segmentation for technical documentation, legal/medical applications, and embedded devices
📖 Read the full design philosophy and use cases ▸
To install the required dependencies, run:
```shell
pip install -r requirements.txt
```

```python
from khmer_segmenter import KhmerSegmenter

# Initialize (loads dictionary from default path)
seg = KhmerSegmenter()

# Segment text
text = "ក្រុមហ៊ុនទទួលបានប្រាក់ចំណូល"
result = seg.segment(text)
print(result)
# Output: ['ក្រុមហ៊ុន', 'ទទួល', 'បាន', 'ប្រាក់ចំណូល']
```

For users requiring maximum performance or embedding in C/C++/Zig applications, a native port is available in the `port/c/` directory. All ports share common linguistic data found in `port/common/`.
- Speed: ~16x faster than the Python implementation (single-thread), and more than 30x faster (multi-thread), measured in WSL.
- Architecture: Zero-dependency, Regex-Free Rule Engine (Hardcoded logic) for consistent O(n) performance.
- Documentation: See port/c/README.md.
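To illustrate what a regex-free rule engine can look like, here is a Python sketch of a branch-based character classifier. The code-point ranges come from the Unicode Khmer block, but the class names and grouping are illustrative, not the C port's actual tables:

```python
def char_class(ch: str) -> str:
    """Classify one character by code-point range: O(1) per character,
    hence O(n) per string, with no regex engine involved."""
    cp = ord(ch)
    if 0x1780 <= cp <= 0x17A2:   # Khmer consonants ក..អ
        return "CONSONANT"
    if 0x17B6 <= cp <= 0x17D3:   # dependent vowels and diacritic signs
        return "SIGN"
    if 0x17E0 <= cp <= 0x17E9:   # Khmer digits ០..៩
        return "KHMER_DIGIT"
    return "OTHER"

print(char_class("ក"), char_class("០"), char_class("x"))
```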
I've built a consolidated data pipeline to normalize text, generate frequencies, and compile binary dictionaries for all ports.
- Normalize: Strips ZWS, ZWNJ, and fixes composite vowels/clusters in the corpus.
- Generate Frequencies (Iterative): Uses the C-based `khmer_segmenter` to iteratively segment the corpus and refine word frequencies (self-supervised). Robustness: this method starts with a clean slate (dictionary only) and converges to a stable frequency distribution in minimal iterations (typically 3 passes). It is robust against initial noise and effectively learns the corpus's natural word usage without external dependencies.
- Compile Binaries:
  - `khmer_dictionary.kdict`: Optimized hash table for the C/Zig port (saved to `port/common/`).
  - `khmer_frequencies.bin`: Legacy binary format (saved to `port/common/`).
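The iterative frequency step can be sketched as a simple fixed-point loop. Here `segment` stands in for the real C-based segmenter, and the convergence check and pass count mirror the description above; this is a sketch, not the pipeline's actual code:

```python
from collections import Counter

def refine_frequencies(corpus_lines, segment, passes=3):
    """Self-supervised loop: segment the corpus with the current
    frequencies, re-count the resulting words, repeat until stable.
    `segment(line, freqs)` returns a list of words for one line."""
    freqs = {}  # clean slate: dictionary only, no prior counts
    for _ in range(passes):
        counts = Counter()
        for line in corpus_lines:
            counts.update(segment(line, freqs))
        new = dict(counts)
        if new == freqs:   # fixed point reached: stop early
            break
        freqs = new
    return freqs
```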
Note
Compressed Dataset: The extended frequency dataset is stored as dataset/kh_data_10000b_200b.zip to reduce repository size. If you wish to use this data for frequency generation, please extract it manually before running the scripts. The prepare_data.py script will automatically find the extracted .txt files if they are placed in the experiments/ or dataset/ folder.
Run the pipeline to rebuild all data assets:
# Run with default settings (uses internal segmenter)
python scripts/prepare_data.py
# Specify custom corpus or dictionary
python scripts/prepare_data.py --corpus dataset/my_corpus.txt --dict khmer_segmenter/dictionary_data/my_dict.txt
To add new words:
- Add the word to `khmer_segmenter/dictionary_data/khmer_dictionary_words.txt`.
- Run `python scripts/prepare_data.py`.
- The new word is now compiled into `khmer_dictionary.kdict` and ready for the C port.
If you are adding new words and want to assign them frequencies based on your existing corpus (using the unknown word frequencies captured during prepare_data.py), use the incremental update script. This is faster than re-running the full pipeline.
- Ensure you have unknown word data: Run `prepare_data.py` at least once to generate `khmer_segmenter/dictionary_data/unknown_word_frequencies.json`.
- Add new words: Add your new words to `khmer_segmenter/dictionary_data/khmer_dictionary_words.txt`.
- Run the update:

```shell
python scripts/incremental_update.py \
  --dict khmer_segmenter/dictionary_data/khmer_dictionary_words.txt \
  --freq khmer_segmenter/dictionary_data/khmer_word_frequencies.json \
  --unknown-freq khmer_segmenter/dictionary_data/unknown_word_frequencies.json
```

This script will:
- Find words in your dictionary that are missing from the frequency list.
- Look up their count in `unknown_word_frequencies.json`.
- If not found, try to derive frequency from component words (for compounds).
- Otherwise, assign a default floor frequency.
- Update `khmer_word_frequencies.json`.
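The fallback chain above might be sketched like this. It is a hypothetical simplification: the real script's compound derivation and floor value may differ.

```python
def assign_frequency(word, known_freqs, unknown_freqs, floor=1):
    """Fallback chain for a new dictionary word:
    1) observed count among unknown words,
    2) estimate from component words (for compounds),
    3) default floor frequency."""
    if word in unknown_freqs:
        return unknown_freqs[word]
    # Conservative compound estimate: for each split into two known parts,
    # a compound can be at most as frequent as its rarer part.
    estimates = [
        min(known_freqs[word[:i]], known_freqs[word[i:]])
        for i in range(1, len(word))
        if word[:i] in known_freqs and word[i:] in known_freqs
    ]
    if estimates:
        return max(estimates)
    return floor

known = {"ab": 10, "cd": 4}
unknown = {"xy": 7}
print(assign_frequency("abcd", known, unknown))  # → 4 (derived from ab + cd)
```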
For a detailed step-by-step explanation of the Viterbi algorithm, Normalization logic, and Rules I use in this project, please refer to the Porting Guide & Algorithm Reference.
I compared the performance and output of KhmerSegmenter against khmernltk using a complex sentence from a folktale.
You can analyze the segmentation results to find words that were not in the dictionary (potential new words or names):
```shell
python scripts/find_unknown_words.py --input segmentation_results.txt
```

This will generate `output/unknown_words_from_results.txt`, showing each unknown word, its frequency, and its context (two words before and after) to help you decide whether it should be added to the dictionary.
I provide two benchmarks: (1) a Micro-Benchmark (Latency/Real-Time) simulating single-sentence requests, and (2) a Macro-Benchmark (Throughput) processing a large 50k-line corpus.
| Scenario | Metric | khmernltk | Python (v3) | C Port | Rust Port | Notes |
|---|---|---|---|---|---|---|
| Micro-Benchmark | Latency (Seq) | ~2.90 ms | ~2.21 ms | ~0.36 ms | ~0.34 ms | Lower is better |
| Micro-Benchmark | Throughput (4-Thread) | ~330 calls/s | ~503 calls/s | ~10,970 calls/s | ~10,909 calls/s | Higher is better |
| Macro-Benchmark | Throughput (4-Thread) | ~378 lines/s | ~585 lines/s | ~30,838 lines/s | ~31,250 lines/s | File I/O + Process |
| Memory Usage | Baseline (Init) | ~113 MB | ~36 MB | ~4.8 MB | ~2.2 MB | Dict Load |
| Memory Usage | Overhead (Conc) | ~12 MB | ~18 MB | ~0.2 MB | ~0.0 MB | Efficiency |
> Note: Benchmarks run on standard consumer hardware (Linux/Ryzen 7) with 4 threads.
With file-based benchmarking on Linux, the C port processes ~30,838 lines per second using 4 threads, compared to ~553 lines/sec in Python. This ~55x speedup validates the linear scaling and SIMD optimizations of the C implementation on multi-core systems, making it suitable for high-volume ETL pipelines.
Python-based segmenters (both ours and khmernltk) scale negatively (~0.9x, i.e., slightly slower) when threads are added, due to the Global Interpreter Lock (GIL) and threading overhead. For this workload, Python is effectively limited to single-core CPU speeds.
KhmerSegmenter initializes in ~0.28s, whereas khmernltk takes ~1.8s+ to load its model. This makes KhmerSegmenter ideal for "Serverless" functions where startup latency is a primary billing and UX concern.
Input:
"ក្រុមហ៊ុនទទួលបានប្រាក់ចំណូល ១ ០០០ ០០០ ដុល្លារក្នុងឆ្នាំនេះ ខណៈដែលតម្លៃភាគហ៊ុនកើនឡើង ៥% ស្មើនឹង 50.00$។ លោក ទេព សុវិចិត្រ នាយកប្រតិបត្តិដែលបញ្ចប់ការសិក្សាពីសាកលវិទ្យាល័យភូមិន្ទភ្នំពេញ (ស.ភ.ភ.ព.) បានថ្លែងថា ភាពជោគជ័យផ្នែកហិរញ្ញវត្ថុនាឆ្នាំនេះ គឺជាសក្ខីភាពនៃកិច្ចខិតខំប្រឹងប្រែងរបស់ក្រុមការងារទាំងមូល និងការជឿទុកចិត្តពីសំណាក់វិនិយោគិន។"
khmernltk Result (v1.5):
ក្រុមហ៊ុន|ទទួលបាន|ប្រាក់|ចំណូល||១||០០០||០០០||ដុល្លារ|ក្នុង|ឆ្នាំ|នេះ||ខណៈ|ដែល|តម្លៃ|ភាគហ៊ុន|កើនឡើង||៥%||ស្មើនឹង||50.|00$|។||លោក||ទេព||សុវិចិត្រ||នាយក|ប្រតិបត្តិ|ដែល|បញ្ចប់|ការសិក្សា|ពី|សាកលវិទ្យាល័យ|ភូមិន្ទ|ភ្នំពេញ||(|ស.|ភ.|ភ.|ព.|)||បាន|ថ្លែង|ថា||ភាពជោគជ័យ|ផ្នែក|ហិរញ្ញវត្ថុ|នា|ឆ្នាំ|នេះ||គឺជា|សក្ខីភាព|នៃ|កិច្ច|ខិតខំ|ប្រឹងប្រែង|របស់|ក្រុមការងារ|ទាំងមូល||និង|ការជឿទុកចិត្ត|ពីសំណាក់|វិនិយោគិន|។
KhmerSegmenter Result:
ក្រុមហ៊ុន|ទទួល|បាន|ប្រាក់ចំណូល||១||០០០||០០០||ដុល្លារ|ក្នុង|ឆ្នាំ|នេះ||ខណៈ|ដែល|តម្លៃ|ភាគហ៊ុន|កើនឡើង||៥|%||ស្មើនឹង||50.00|$|។||លោក||ទេព||សុវិចិត្រ||នាយក|ប្រតិបត្តិ|ដែល|បញ្ចប់|ការសិក្សា|ពី|សាកលវិទ្យាល័យ|ភូមិន្ទ|ភ្នំពេញ||(|ស.ភ.ភ.ព.|)||បាន|ថ្លែង|ថា||ភាព|ជោគជ័យ|ផ្នែក|ហិរញ្ញវត្ថុ|នា|ឆ្នាំ|នេះ||គឺជា|សក្ខីភាព|នៃ|កិច្ច|ខិតខំ|ប្រឹងប្រែង|របស់|ក្រុមការងារ|ទាំងមូល||និង|ការ|ជឿ|ទុកចិត្ត|ពីសំណាក់|វិនិយោគិន|។
Key Differences:
- Numbers: `khmernltk` splits `១ ០០០ ០០០` into 5 tokens. `KhmerSegmenter` also splits them, to support granular processing.
- Acronyms: `khmernltk` breaks `(ស.ភ.ភ.ព.)` into multiple tokens. `KhmerSegmenter` keeps it as one.
- Dictionary Adherence: `KhmerSegmenter` strictly adheres to the dictionary. For example, it correctly splits `ភាគហ៊ុន` into `ភាគ|ហ៊ុន` if `ភាគហ៊ុន` isn't in the loaded dictionary but the parts are (or vice versa, depending on dictionary state). Note: results reflect the current state of `khmer_dictionary_words.txt`. As you add words like `ភាគហ៊ុន`, the segmenter will automatically group them.
KhmerSegmenter is designed with a "Run Anywhere" philosophy. Unlike modern ML models that are tied to heavy runtimes (Python/PyTorch/TensorFlow) and specific hardware, this algorithm is pure mathematical logic.
The implementation is verified and optimized for all major operating systems and architectures:
- ✅ Windows, Linux, macOS, BSD (x86_64, ARM64, Apple Silicon)
- ✅ Cloud & Serverless: Minimal cold start (<50ms in C) for AWS Lambda, Google Cloud Functions, and Edge Computing.
- ✅ Embedded & IoT: Runs on bare-metal C, RTOS, and microcontrollers (Arduino, ESP32, STM32) with as little as 9MB of RAM.
The core Viterbi logic and Normalization rules rely only on standard data structures. It can be ported to any language:
- High-Level: Python (Official), JS/TS, Go, Java, C#.
- Native: C (Official), C++, Rust, Zig.
- Web: Compile to WASM for blazingly fast client-side performance.
- Zero External Libs: Works without `numpy`, `scikit-learn`, or `khmernltk` at runtime.
- No GPU Required: Strictly CPU-based, deterministic, and 100% reproducible across different hardware.
You can verify the segmentation logic using the scripts/test_viterbi.py script. This script supports both single-case regression testing and batch processing of a corpus.
```shell
python scripts/test_viterbi.py
```

To test against a file and see the output:

```shell
python scripts/test_viterbi.py --source dataset/khmer_folktales_extracted.txt --limit 500
```

This will generate `output/segmentation_results.txt`.
MIT License
Copyright (c) 2026 Sovichea Tep
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This project is licensed under the MIT License - see the LICENSE file for details.
You are free to use, modify, and distribute this software, but you must acknowledge usage by retaining the copyright notice and license in your copies.

