The Rababa models today are trained on the Tashkeela corpus.
In Tashkeela, 98% of the content comes from Shamela.
There are additional datasets that are either already pointed or can be turned into pointed datasets.
Pointed datasets:
- Shamela offers a full download of its books (6.8 GB), most if not all of which are pointed Arabic
- The University of Leeds uses this: https://corpus.quran.com/java/uthmaniscript.jsp
  - It uses the Tanzil distribution of the Quran, which includes pointed text: https://tanzil.net/download/
  - This is a pointed dataset and can be used immediately by supplementing the old Tashkeela with it (see the merge sketch after this list)
- K. Aissa, Maqola, a collection of the best Arabic citations, 2016. (Online). Available: http://maqola.org
  - This is one of the sources of Tashkeela and is a pointed dataset.
- AlJazeera Learning https://learning.aljazeera.net/ar
  - This is one of the sources of Tashkeela; it is pointed, but needs crawling to obtain the text.
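As a rough idea of the supplementing step mentioned above, here is a minimal Python sketch that appends a Tanzil plain-text dump to the existing Tashkeela training data. The file names (`quran-uthmani.txt`, `tashkeela_train.txt`) are assumptions, not names from this repo; Tanzil ships several plain-text variants, and depending on the download options the lines may also carry sura/aya prefixes that would need stripping.

```python
# Minimal sketch, assuming the Tanzil download and the Tashkeela training
# file are plain UTF-8 text with one line per verse/sentence.
from pathlib import Path

TANZIL_FILE = Path("quran-uthmani.txt")       # assumed name of the Tanzil download
TASHKEELA_FILE = Path("tashkeela_train.txt")  # assumed name of the current training corpus
MERGED_FILE = Path("tashkeela_plus_tanzil.txt")

with MERGED_FILE.open("w", encoding="utf-8") as out:
    # Copy the existing Tashkeela text first.
    out.write(TASHKEELA_FILE.read_text(encoding="utf-8"))
    # Append the Quran text, dropping Tanzil's '#'-prefixed metadata lines and blanks.
    for line in TANZIL_FILE.read_text(encoding="utf-8").splitlines():
        if line.startswith("#") or not line.strip():
            continue
        out.write(line + "\n")
```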
AlJazeera Learning also offers an Arabic diacritizer that we can test against. The request goes to this endpoint (a Python sketch of the same call follows the curl command):
curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
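For convenience, here is a minimal Python sketch of the same request. The endpoint URL and the `text` form field are taken from the curl command above; the shape of the response (plain text vs. JSON) is an assumption and should be checked against an actual reply before relying on it.

```python
# Minimal sketch: call the Farasa diacritization endpoint captured above.
import requests

FARASA_URL = "https://farasa-api.qcri.org/msa/webapi/diacritizeV2"

def diacritize(text: str) -> str:
    resp = requests.post(
        FARASA_URL,
        data={"text": text},  # form-encoded, as in the curl command
        headers={"Origin": "https://quiz.aljazeera.net"},  # mirrors the captured request
        timeout=30,
    )
    resp.raise_for_status()
    # Response format is an assumption; inspect resp.text / resp.json() on first use.
    return resp.text

if __name__ == "__main__":
    # The curl payload above URL-decodes to "صفحة التشكيل" ("tashkeel page").
    print(diacritize("صفحة التشكيل"))
```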
Apparently they also have two diacritization modules that can be downloaded (Java JARs) or used via the web.
Datasets that could potentially be pointed:
- The OSIAN Corpus
  - It contains lemmatized words but is apparently only about 95% accurate, so it might not even be useful
- Bibliotheca Alexandrina has the International Corpus of Arabic http://www.bibalex.org/ica