An interactive web application for multi-language NLP analysis supporting Hebrew, Klingon, and English. Input text in any of these languages, select analysis models, and receive detailed tokenization and morphological breakdowns with confidence scores, visualizations, and export capabilities.
- 🇮🇱 Hebrew - Full morphological analysis with gender, number, person, tense, and binyanim
- 🖖 Klingon (tlhIngan Hol) - Advanced tokenization with agglutinative morphology support
- 🇬🇧 English - Part-of-speech tagging, lemmatization, and dependency parsing
- 🌐 Multi-Language Support - Process Hebrew, Klingon, and English text
- 🔤 Advanced Preprocessing - Normalization, diacritic handling, and language separation
- ✂️ Language-Aware Tokenization - Specialized segmentation for each language's unique structure
- 📊 Morphological Analysis - Detailed breakdown adapted to each language's grammar
- Hebrew: gender, number, person, tense, binyanim
- Klingon: noun/verb prefixes, suffixes, aspect markers
- English: POS tags, lemmas, morphological features
- 🎯 Confidence Scoring - Transparent confidence metrics for every analysis
- 🎨 Interactive UI - Built with Streamlit, supporting RTL text (Hebrew) and LTR text (Klingon, English)
- 💾 Export Options - Download results in JSON or CSV formats
- 📈 Visualizations - Charts and graphs for morphological distribution across languages
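To make the visualization feature above concrete, here is a minimal sketch of how a morphological distribution chart could be built with Plotly Express; the DataFrame columns and counts are illustrative assumptions, not the app's actual internals.

```python
# Hypothetical sketch: charting part-of-speech distribution per language with Plotly.
# The column names and counts below are assumed for illustration only.
import pandas as pd
import plotly.express as px

counts = pd.DataFrame([
    {"language": "Hebrew",  "pos": "NOUN", "count": 12},
    {"language": "Hebrew",  "pos": "VERB", "count": 7},
    {"language": "Klingon", "pos": "VERB", "count": 9},
    {"language": "English", "pos": "NOUN", "count": 10},
])

fig = px.bar(counts, x="pos", y="count", color="language", barmode="group",
             title="Morphological distribution across languages")
fig.show()
```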
- Python 3.8+
- Streamlit - Web application framework
- Pandas - Data manipulation and analysis
- Plotly - Interactive visualizations
- spaCy - English NLP processing
- Custom NLP Modules - Hebrew and Klingon tokenization and morphology
```
Tokenizer-Morphology-Explorer/
├── app.py                    # Main Streamlit application
├── verify_app.py             # Application verification script
├── modules/                  # Core NLP processing modules
│   ├── preprocessor.py       # Multi-language text preprocessing
│   ├── tokenizer.py          # Hebrew, Klingon, English tokenization
│   ├── morphology.py         # Morphological analysis for all languages
│   ├── hebrew_analyzer.py    # Hebrew-specific analysis
│   ├── klingon_analyzer.py   # Klingon-specific analysis
│   └── english_analyzer.py   # English-specific analysis
├── data/                     # Dictionaries and rule sets
│   ├── hebrew_lexicon.json   # Hebrew word lexicon
│   ├── klingon_lexicon.json  # Klingon word database
│   ├── english_lexicon.json  # English word database
│   └── rules.json            # Language-specific morphological rules
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation
```
- Python 3.8 or later
- pip (Python package manager)
- Virtual environment (recommended)
```bash
git clone https://github.com/the3y3-code/Tokenizer-Morphology-Explorer.git
cd Tokenizer-Morphology-Explorer
python -m venv venv
```

On Windows:

```bash
venv\Scripts\activate
```

On macOS/Linux:

```bash
source venv/bin/activate
```

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
streamlit run app.py
```

The application will open in your default browser at http://localhost:8501.
- Select Language - Choose Hebrew, Klingon, or English
- Input Text - Enter text in the selected language
- Select Model - Choose from available analysis models
- Analyze - Click the analyze button to process the text
- View Results - Explore tokenization, morphology, and confidence scores
- Export - Download results in JSON or CSV format
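As a rough illustration of the View Results and Export steps, the sketch below flattens per-token analyses into a table and writes CSV/JSON with pandas; the record fields (token, pos, confidence, and so on) are assumed for illustration and may not match the app's real export schema.

```python
# Hypothetical export sketch: the field names are illustrative assumptions.
import pandas as pd

results = [
    {"token": "שלום", "language": "hebrew", "pos": "NOUN", "gender": "masc", "confidence": 0.93},
    {"token": "עולם", "language": "hebrew", "pos": "NOUN", "gender": "masc", "confidence": 0.89},
]

df = pd.DataFrame(results)
df.to_csv("analysis.csv", index=False)                              # CSV export
df.to_json("analysis.json", orient="records", force_ascii=False)    # JSON export
```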
Hebrew:
שלום עולם
Klingon:
nuqneH, qaleghpu'
English:
Hello world, how are you?
Run the verification script to ensure everything is set up correctly:
```bash
python verify_app.py
```

You can also use the modules programmatically:
```python
from modules.tokenizer import MultiLanguageTokenizer
from modules.morphology import MorphologyAnalyzer

# Initialize
tokenizer = MultiLanguageTokenizer()
analyzer = MorphologyAnalyzer()

# Hebrew example
hebrew_text = "שלום עולם"
hebrew_tokens = tokenizer.tokenize(hebrew_text, language="hebrew")
for token in hebrew_tokens:
    morphology = analyzer.analyze(token, language="hebrew")
    print(f"{token}: {morphology}")

# Klingon example
klingon_text = "nuqneH"
klingon_tokens = tokenizer.tokenize(klingon_text, language="klingon")
for token in klingon_tokens:
    morphology = analyzer.analyze(token, language="klingon")
    print(f"{token}: {morphology}")

# English example
english_text = "Hello world"
english_tokens = tokenizer.tokenize(english_text, language="english")
for token in english_tokens:
    morphology = analyzer.analyze(token, language="english")
    print(f"{token}: {morphology}")
```

To run tests (if implemented):
```bash
pytest tests/
```

Hebrew:
- Root extraction (shoresh)
- Binyan identification
- Gender and number agreement
- Prefix and suffix handling
- Nikud (diacritics) processing
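As a toy illustration of the prefix handling listed above (not the project's actual algorithm), a rule-based analyzer might peel common one-letter clitic prefixes off a token before lexicon lookup:

```python
# Hypothetical sketch of Hebrew clitic-prefix handling; the project's real analyzer
# also consults its lexicon and rules.json, so treat this as illustration only.
HEBREW_PREFIXES = set("והבכלמש")  # ve-, ha-, be-, ke-, le-, mi-, she-

def candidate_segmentations(token):
    """Return (prefix_chain, remainder) candidates, longest prefix chain first."""
    candidates = [("", token)]  # the unsegmented form is always a candidate
    prefix, rest = "", token
    while rest and rest[0] in HEBREW_PREFIXES:
        prefix, rest = prefix + rest[0], rest[1:]
        candidates.append((prefix, rest))
    # A lexicon lookup would normally filter out implausible remainders.
    return sorted(candidates, key=lambda c: len(c[0]), reverse=True)

print(candidate_segmentations("ולבית"))  # "and to the house": peels ו-, then ל-
```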
Klingon:
- Verb prefix recognition (pronominal prefixes)
- Noun suffix types (1-5)
- Verb suffix types (1-9)
- Rover suffixes
- Aspect and mood markers
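The sketch below shows one way such agglutinative forms could be segmented into pronominal prefix, stem, and aspect suffix; the prefix/suffix tables and stem list are small assumed samples, not the module's real data:

```python
# Hypothetical Klingon verb segmentation; the prefix/suffix lists are tiny samples.
VERB_PREFIXES = ["qa", "vI", "Da", "jI", "bI", ""]   # e.g. qa- = I (subject), you (object)
ASPECT_SUFFIXES = ["pu'", "ta'", "taH", "lI'", ""]   # type 7 verb suffixes (aspect)

def segment_verb(word, stems):
    """Try prefix/suffix combinations and keep splits whose core is a known stem."""
    splits = []
    for pre in VERB_PREFIXES:
        if pre and not word.startswith(pre):
            continue
        rest = word[len(pre):]
        for suf in ASPECT_SUFFIXES:
            if suf and not rest.endswith(suf):
                continue
            stem = rest[: len(rest) - len(suf)] if suf else rest
            if stem in stems:
                splits.append((pre, stem, suf))
    return splits

# qaleghpu' ≈ qa- (I, you object) + legh (see) + -pu' (perfective) with this toy stem list
print(segment_verb("qaleghpu'", stems={"legh", "Sop", "qIp"}))
```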
English:
- Part-of-speech tagging
- Lemmatization
- Named entity recognition
- Dependency parsing
- Morphological features (tense, number, person)
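Since the English pipeline is backed by spaCy (see the tech stack above), the standard spaCy API gives a feel for these features; whether the app exposes them in exactly this form is an assumption.

```python
# Standard spaCy usage covering POS tags, lemmas, morphology, dependencies, and NER.
import spacy

nlp = spacy.load("en_core_web_sm")  # installed via `python -m spacy download en_core_web_sm`
doc = nlp("Hello world, how are you?")

for token in doc:
    # POS tag, lemma, morphological features, and dependency relation per token
    print(token.text, token.pos_, token.lemma_, str(token.morph), token.dep_, token.head.text)

for ent in doc.ents:
    # Named entities (spans with labels such as PERSON, ORG, GPE)
    print(ent.text, ent.label_)
```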
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please read CONTRIBUTING.md for details on our code of conduct and development process.
- Add support for additional languages (Arabic, Esperanto)
- Implement sentiment analysis for all languages
- Add named entity recognition (NER) for Hebrew and Klingon
- Cross-language comparison tools
- Translation suggestions between supported languages
- RESTful API endpoint
- Docker containerization
- Batch processing mode
- Mobile-responsive UI improvements
This project is licensed under the MIT License - see the LICENSE file for details.
- Hebrew NLP research community
- Klingon Language Institute (KLI)
- spaCy and English NLP communities
- Streamlit framework developers
- Contributors and testers
the3y3-code - GitHub Profile
Project Link: https://github.com/the3y3-code/Tokenizer-Morphology-Explorer
- Hebrew Morphology Reference
- Klingon Language Institute
- The Klingon Dictionary
- English NLP with spaCy
⭐ If you find this project useful, please consider giving it a star!