The Rababa models today are trained on the Tashkeela corpus.
In Tashkeela, 98% of the content comes from Shamela.
There are additional datasets that are either already pointed or can be turned into pointed datasets.
Pointed datasets:
- Shamela offers a full download of its books (6.8 GB), most if not all of which are pointed Arabic
- The University of Leeds uses this: https://corpus.quran.com/java/uthmaniscript.jsp
  - It uses the Tanzil distribution of the Quran, which includes pointed text: https://tanzil.net/download/
  - This is a pointed dataset and can be used immediately by supplementing the old Tashkeela with it (see the merge sketch after this list)
- K. Aissa, Maqola, a collection of the best Arabic citations, 2016. (Online). Available: http://maqola.org
  - This is one of the sources of Tashkeela and is a pointed dataset.
- AlJazeera Learning https://learning.aljazeera.net/ar
  - This is one of the sources of Tashkeela; it is pointed, but needs crawling to obtain the text.
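As a rough idea of the supplementing step mentioned above, here is a minimal Python sketch that appends a Tanzil plain-text dump to the existing Tashkeela training data. The file names (`quran-uthmani.txt`, `tashkeela_train.txt`) are assumptions, not names from this repo; Tanzil ships several plain-text variants, and depending on the download options the lines may also carry sura/aya prefixes that would need stripping.

```python
# Minimal sketch, assuming the Tanzil download and the Tashkeela training
# file are plain UTF-8 text with one line per verse/sentence.
from pathlib import Path

TANZIL_FILE = Path("quran-uthmani.txt")       # assumed name of the Tanzil download
TASHKEELA_FILE = Path("tashkeela_train.txt")  # assumed name of the current training corpus
MERGED_FILE = Path("tashkeela_plus_tanzil.txt")

with MERGED_FILE.open("w", encoding="utf-8") as out:
    # Copy the existing Tashkeela text first.
    out.write(TASHKEELA_FILE.read_text(encoding="utf-8"))
    # Append the Quran text, dropping Tanzil's '#'-prefixed metadata lines and blanks.
    for line in TANZIL_FILE.read_text(encoding="utf-8").splitlines():
        if line.startswith("#") or not line.strip():
            continue
        out.write(line + "\n")
```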
AlJazeera Learning also offers an Arabic diacritizer that we can test against. The request goes to this endpoint (a Python sketch of the same call follows the curl command):
curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
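For convenience, here is a minimal Python sketch of the same request. The endpoint URL and the `text` form field are taken from the curl command above; the shape of the response (plain text vs. JSON) is an assumption and should be checked against an actual reply before relying on it.

```python
# Minimal sketch: call the Farasa diacritization endpoint captured above.
import requests

FARASA_URL = "https://farasa-api.qcri.org/msa/webapi/diacritizeV2"

def diacritize(text: str) -> str:
    resp = requests.post(
        FARASA_URL,
        data={"text": text},  # form-encoded, as in the curl command
        headers={"Origin": "https://quiz.aljazeera.net"},  # mirrors the captured request
        timeout=30,
    )
    resp.raise_for_status()
    # Response format is an assumption; inspect resp.text / resp.json() on first use.
    return resp.text

if __name__ == "__main__":
    # The curl payload above URL-decodes to "صفحة التشكيل" ("tashkeel page").
    print(diacritize("صفحة التشكيل"))
```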
Apparently they also have two diacritization modules that can be downloaded (Java JARs) or used via the web.
Datasets that could potentially be pointed:
- The OSIAN Corpus
  - It contains lemmatized words but is apparently only about 95% accurate, so it might not even be useful
- Bibliotheca Alexandrina has the International Corpus of Arabic http://www.bibalex.org/ica