Stratified Random Sampling for Parallel Corpus

Description

When developing a corpus, data is usually drawn from several domains and sources, and these rarely contribute data in equal proportions. So instead of just collecting and dumping the data together, it is better to use a sampling technique so that every domain/source is adequately represented in the train, test and dev sets. This Python script can be used to split your data (collected from different sources) into train, test and dev/validation sets using stratified random sampling.
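The idea of stratified random sampling can be sketched in a few lines. This is a minimal illustration and not the code of splitter-script.py: the function name `stratified_split`, the `strata` dictionary (one entry per source, each holding aligned sentence pairs) and the default ratios are all assumptions made for the example.

```python
import random

def stratified_split(strata, train=0.8, test=0.1, seed=0):
    """Shuffle and split every source (stratum) separately, then merge.

    `strata` maps a source name to a list of (source, target) sentence
    pairs. Whatever is left after the train and test shares of a stratum
    goes to dev. All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "test": [], "dev": []}
    for name, pairs in strata.items():
        pairs = list(pairs)    # copy so the caller's list is not shuffled
        rng.shuffle(pairs)
        n_train = int(len(pairs) * train)
        n_test = int(len(pairs) * test)
        splits["train"].extend(pairs[:n_train])
        splits["test"].extend(pairs[n_train:n_train + n_test])
        splits["dev"].extend(pairs[n_train + n_test:])
    return splits

# Two sources of very different size (made-up placeholder sentences).
strata = {
    "news": [(f"news en {i}", f"news ur {i}") for i in range(100)],
    "religious": [(f"rel en {i}", f"rel ur {i}") for i in range(20)],
}
splits = stratified_split(strata)
print({name: len(pairs) for name, pairs in splits.items()})
```

Because each source is split on its own, the smaller source contributes the same share of its sentences to every set as the larger one, which a single random split over the pooled data would not guarantee.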

Author:

Moodser Hussain
COMSATS University Islamabad, Lahore Campus
Email: [email protected]

Usage

This script supports both types of split: (1) percentage and (2) number.
Place all parallel files for English/Lang1 in the 'en' directory and those for Urdu/Lang2 in the 'ur' directory.
Place this script in the directory that contains the 'en' and 'ur' folders.
File names should end with the language extension (e.g. all files placed in the 'en' directory should have the extension .en).
During execution the program will ask for the split type (percentage/number) and the values, so provide these according to your requirements.
It will automatically fetch the files from both directories and generate parallel files for the train, test and validation sets.
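The directory layout described above can be read into memory roughly as follows. This is only a sketch under stated assumptions, not the script's actual code: the helper `load_parallel` is hypothetical, and it assumes that the two files of a pair share a base name (for example `news.en` and `news.ur`) and are aligned line by line.

```python
import tempfile
from pathlib import Path

def load_parallel(root, src="en", tgt="ur"):
    """Return {source name: [(src_line, tgt_line), ...]} for every file pair.

    Assumes <root>/<src>/<name>.<src> is paired with <root>/<tgt>/<name>.<tgt>.
    """
    root = Path(root)
    strata = {}
    for src_file in sorted((root / src).glob(f"*.{src}")):
        tgt_file = root / tgt / f"{src_file.stem}.{tgt}"
        if not tgt_file.exists():
            raise FileNotFoundError(f"no parallel file for {src_file.name}")
        src_lines = src_file.read_text(encoding="utf-8").splitlines()
        tgt_lines = tgt_file.read_text(encoding="utf-8").splitlines()
        if len(src_lines) != len(tgt_lines):
            raise ValueError(f"{src_file.stem}: files are not line-aligned")
        strata[src_file.stem] = list(zip(src_lines, tgt_lines))
    return strata

# Small demonstration in a temporary directory with placeholder text.
with tempfile.TemporaryDirectory() as tmp:
    for lang in ("en", "ur"):
        (Path(tmp) / lang).mkdir()
        (Path(tmp) / lang / f"news.{lang}").write_text(
            f"{lang} line 1\n{lang} line 2\n", encoding="utf-8"
        )
    strata = load_parallel(tmp)

print(strata)
```

Keeping one entry per file is what makes each source a separate stratum when the data is split afterwards.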

Acknowledgements

Dr. Rao Muhammad Adeel Nawab
Dr. Muhammad Sharjeel

Note

This script was used for the English-Urdu language pair. If you intend to use it for any other language pair, you can change the extensions accordingly.
