This language parser is built on the Text-To-Text Transfer Transformer (T5), a pre-trained encoder-decoder model that handles all NLP tasks in a unified text-to-text format, where the input and output are always text strings. T5-Small is the checkpoint with 60 million parameters.
This project is an experiment aimed at training a Machine Learning (ML) model to accurately parse new words into their phonetic representations. The model is trained on a curated and verified dataset of words and their corresponding phonetic outputs; over time, it learns to generalize and correctly represent new words in a phonetic format. Examples (a code sketch of this framing follows them):
यूट्यूब : y U ट y U b
ब्लास्ट : b l A s ट
अट्टालिका : a ट ट A l i k A
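In the text-to-text framing, both the word and its phonetic form are plain strings. Here is a minimal sketch of that framing with Hugging Face Transformers; note that the stock `t5-small` tokenizer does not know Devanagari characters, which is exactly why the tokenizer-extension step below exists.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the 60M-parameter T5-Small checkpoint.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Text-to-text framing: the word is the input string and the
# space-separated phonetic symbols are the target string.
inputs = tokenizer("ब्लास्ट", return_tensors="pt")
targets = tokenizer("b l A s ट", return_tensors="pt")

# During training, the target token IDs serve as the labels and the
# model returns a cross-entropy loss over the output tokens.
loss = model(input_ids=inputs.input_ids, labels=targets.input_ids).loss
print(loss)
```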
Ensure you have the following installed:
- Python 3.7+
- PyTorch
- Hugging Face Transformers
- Other dependencies listed in `requirements.txt`
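Dependencies can typically be installed in one step:

`pip install -r requirements.txt`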
- Create Dataset

  The dataset is prepared using the `create_dataset.py` script. This script shuffles the data and divides it into training and validation sets.

  `python create_dataset.py --input_data path_to_data/csv_file`
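  The internals of `create_dataset.py` are not shown here, but a shuffle-and-split step usually looks like the sketch below; the column layout, output file names, and the 90/10 split ratio are assumptions, not the script's confirmed behavior.

  ```python
  import argparse

  import pandas as pd

  # Hypothetical reconstruction of the shuffle-and-split step; the real
  # create_dataset.py may use different column names, file names, and ratios.
  parser = argparse.ArgumentParser()
  parser.add_argument("--input_data", required=True, help="CSV of word/phonetic pairs")
  args = parser.parse_args()

  df = pd.read_csv(args.input_data)
  df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle

  split = int(0.9 * len(df))  # assumed 90/10 train/validation split
  df.iloc[:split].to_csv("train.csv", index=False)
  df.iloc[split:].to_csv("val.csv", index=False)
  ```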
- Extend Tokenizer

  To handle new UTF-8 tokens (including non-English characters), the tokenizer needs to be extended. This is done using the `extend_tokenizer.py` script.

  `python extend_tokenizer.py`
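  A common way to extend a T5 tokenizer with new UTF-8 characters is `add_tokens` followed by an embedding resize. The sketch below illustrates the idea; the character list and save path are illustrative placeholders, as the actual logic lives in `extend_tokenizer.py`.

  ```python
  from transformers import T5ForConditionalGeneration, T5Tokenizer

  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

  # Illustrative set of Devanagari characters; extend_tokenizer.py would
  # collect the actual character inventory from the dataset.
  new_tokens = ["ट", "अ", "ब", "ल", "य"]
  num_added = tokenizer.add_tokens(new_tokens)

  # The embedding matrix must grow to cover the new vocabulary entries.
  model.resize_token_embeddings(len(tokenizer))

  tokenizer.save_pretrained("extended_tokenizer")
  print(f"Added {num_added} tokens")
  ```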
- Model Training

  Once the dataset is ready and the tokenizer is extended, you can train the model from a pretrained T5-Small checkpoint. This is done using the `train.py` script.

  `python train.py`
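  `train.py` contains the actual training logic; as a rough sketch, fine-tuning T5-Small on word/phonetic pairs with the Hugging Face `Seq2SeqTrainer` could look like this. The CSV file names, the column names `word` and `phonetic`, and the hyperparameters are assumptions.

  ```python
  from datasets import load_dataset
  from transformers import (
      DataCollatorForSeq2Seq,
      Seq2SeqTrainer,
      Seq2SeqTrainingArguments,
      T5ForConditionalGeneration,
      T5Tokenizer,
  )

  # Assumes the extended tokenizer and the train/val CSVs from earlier steps.
  tokenizer = T5Tokenizer.from_pretrained("extended_tokenizer")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")
  model.resize_token_embeddings(len(tokenizer))

  data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

  def preprocess(batch):
      # Hypothetical column names: "word" is the input, "phonetic" the target.
      model_inputs = tokenizer(batch["word"], truncation=True)
      labels = tokenizer(text_target=batch["phonetic"], truncation=True)
      model_inputs["labels"] = labels["input_ids"]
      return model_inputs

  tokenized = data.map(preprocess, batched=True)

  trainer = Seq2SeqTrainer(
      model=model,
      args=Seq2SeqTrainingArguments(output_dir="checkpoints", num_train_epochs=10),
      train_dataset=tokenized["train"],
      eval_dataset=tokenized["validation"],
      data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
  )
  trainer.train()
  ```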
- Inference

  After training, the model can be used for inference on new words. Use the `inference_using_checkpoint.py` script to perform inference.

  `python inference_using_checkpoint.py --model_checkpoint path_to_trained_model`
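  Checkpoint inference generally amounts to loading the fine-tuned weights and calling `generate`. A minimal sketch, assuming the checkpoint directory contains both the model and the extended tokenizer:

  ```python
  from transformers import T5ForConditionalGeneration, T5Tokenizer

  # Placeholder path: pass the directory produced by training.
  checkpoint = "path_to_trained_model"
  tokenizer = T5Tokenizer.from_pretrained(checkpoint)
  model = T5ForConditionalGeneration.from_pretrained(checkpoint)

  word = "यूट्यूब"
  inputs = tokenizer(word, return_tensors="pt")
  output_ids = model.generate(**inputs, max_new_tokens=32)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
  # A well-trained model should print: y U ट y U b
  ```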
- Contributions

  Contributions are welcome! Please fork the repository and submit a pull request with your changes. Feel free to reach out with any questions or suggestions. Happy training!