The dataset is based on Russian Web Tables (RWT), a corpus of Russian-language tables from Wikipedia.
Only relational tables whose headers match one of 170 selected DBpedia semantic types were taken from RWT.
The dataset contains 1,441,349 columns and has a fixed train/test split:

| Split | Columns | Tables | Avg. columns per table |
|---|---|---|---|
| Test | 115 448 | 55 080 | 2.096 |
| Train | 1 325 901 | 633 426 | 2.093 |
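
The figures in the split table can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the table, and the helper name `avg_columns_per_table` is made up for illustration.

```python
# Split sizes copied from the table above.
splits = {
    "Test": {"columns": 115_448, "tables": 55_080},
    "Train": {"columns": 1_325_901, "tables": 633_426},
}


def avg_columns_per_table(columns: int, tables: int) -> float:
    """Average number of columns per table, rounded to three decimals."""
    return round(columns / tables, 3)


# Total number of columns across both splits.
total_columns = sum(split["columns"] for split in splits.values())

# Per-split averages, keyed by split name.
averages = {name: avg_columns_per_table(**split) for name, split in splits.items()}

print(total_columns)  # 1441349
print(averages)       # {'Test': 2.096, 'Train': 2.093}
```

The total matches the column count stated above, and both averages match the last column of the table.
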

Train split, 10 most frequent table sizes (number of columns per table):

| Column size | Occurrences |
|---|---|
| 1 | 257890 |
| 2 | 172414 |
| 3 | 124635 |
| 4 | 54886 |
| 5 | 18532 |
| 6 | 3404 |
| 7 | 733 |
| 8 | 254 |
| 9 | 234 |
| 18 | 221 |

Train split, 10 least frequent table sizes (number of columns per table):

| Column size | Occurrences |
|---|---|
| 19 | 6 |
| 40 | 6 |
| 16 | 5 |
| 38 | 5 |
| 29 | 4 |
| 20 | 4 |
| 21 | 4 |
| 37 | 2 |
| 39 | 2 |
| 17 | 2 |

Train split, 10 most frequent labels:

| Label | Occurrences |
|---|---|
| год | 230016 |
| название | 170812 |
| место | 103986 |
| дата | 97228 |
| команда | 75032 |
| результат | 52730 |
| примечание | 48635 |
| актер | 38959 |
| страна | 36754 |
| турнир | 33175 |

Train split, 10 least frequent labels:

| Label | Occurrences |
|---|---|
| континент | 92 |
| роман | 89 |
| закон | 89 |
| борец | 88 |
| колледж | 87 |
| музей | 86 |
| фирма | 85 |
| дорога | 83 |
| префектура | 83 |
| цитата | 76 |

Test split, 10 most frequent table sizes (number of columns per table):

| Column size | Occurrences |
|---|---|
| 1 | 22491 |
| 2 | 14923 |
| 3 | 10798 |
| 4 | 4801 |
| 5 | 1614 |
| 6 | 299 |
| 7 | 69 |
| 18 | 21 |
| 8 | 19 |
| 9 | 18 |

Test split, 10 least frequent table sizes (number of columns per table):

| Column size | Occurrences |
|---|---|
| 13 | 3 |
| 36 | 2 |
| 20 | 1 |
| 16 | 1 |
| 21 | 1 |
| 14 | 1 |
| 39 | 1 |
| 37 | 1 |
| 38 | 1 |
| 11 | 1 |

Test split, 10 most frequent labels:

| Label | Occurrences |
|---|---|
| год | 19854 |
| название | 14748 |
| место | 9004 |
| дата | 8408 |
| команда | 6653 |
| результат | 4653 |
| примечание | 4203 |
| актер | 3435 |
| страна | 3217 |
| турнир | 2911 |

Test split, 10 least frequent labels:

| Label | Occurrences |
|---|---|
| цитата | 7 |
| дорога | 6 |
| статья | 6 |
| фирма | 6 |
| сообщество | 5 |
| колледж | 5 |
| борец | 5 |
| музей | 4 |
| банк | 4 |
| камера | 4 |
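
Frequency tables like the ones above can be produced with `collections.Counter`. The sketch below is not the code used to build these statistics. The function name `frequency_tables` and the toy input are assumptions for illustration, and the real label lists would come from the train/test files.

```python
from collections import Counter


def frequency_tables(values, k=10):
    """Return (k most frequent, k least frequent) as lists of (value, count)."""
    counts = Counter(values)
    ranked = counts.most_common()  # all items, sorted by count in descending order
    return ranked[:k], ranked[-k:]


# Toy input standing in for the real column labels (hypothetical data).
labels = ["год"] * 5 + ["название"] * 3 + ["место"] * 2 + ["цитата"]

top, bottom = frequency_tables(labels, k=2)
print(top)     # [('год', 5), ('название', 3)]
print(bottom)  # [('место', 2), ('цитата', 1)]
```

The same function works for table sizes if `values` holds the number of columns of each table instead of labels.
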
Make sure your PC satisfies these requirements:
- Download and decompress ru-wiki-tables-dataset into the `./dataset/` directory.
- Run the `make` command from the `./dataset/collecting/` directory to compile the collecting files.
- Run `./dataset/collecting/collect_columns_from_dataset` to collect column headers from the dataset. Output will be in `./dataset/collecting/columns_headers/`.
- Run all cells in `./dataset/research/research.ipynb`.
- Run all cells in `./dataset/labelling/labelling.ipynb`.
- Run `./dataset/collecting/collect_columns_data` to collect column data from the dataset. Output will be in `./dataset/collecting/columns_data/`.
- Run all cells in `./dataset/cta_dataset/create_cta_dataset.ipynb`. Output train/test splits will be in the `./dataset/cta_dataset/train[test]` directories.
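
The steps after the manual download could be chained from a small Python script, sketched below under several assumptions: the script is started from the repository root, the notebooks are executed headlessly with `jupyter nbconvert --execute` (the steps above only say "run all cells"), and the helper names `build_steps` and `run_steps` are hypothetical. By default it only prints the commands.

```python
import subprocess


def build_steps(root="./dataset"):
    """Return the build steps as (command, working directory) pairs.

    The download/decompress step is manual and therefore not included.
    """

    def notebook(path):
        # Assumption: notebooks are executed in place via nbconvert.
        return ["jupyter", "nbconvert", "--to", "notebook",
                "--execute", "--inplace", path]

    return [
        # Compile the collecting tools from inside their directory.
        (["make"], f"{root}/collecting"),
        # Collect column headers (cwd None means the current directory).
        ([f"{root}/collecting/collect_columns_from_dataset"], None),
        (notebook(f"{root}/research/research.ipynb"), None),
        (notebook(f"{root}/labelling/labelling.ipynb"), None),
        # Collect column data.
        ([f"{root}/collecting/collect_columns_data"], None),
        (notebook(f"{root}/cta_dataset/create_cta_dataset.ipynb"), None),
    ]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it."""
    for command, cwd in steps:
        print(f"[{cwd or '.'}] {' '.join(command)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    run_steps(build_steps())  # dry run: only prints the commands
```

Calling `run_steps(build_steps(), dry_run=False)` would actually execute the commands. Running the notebooks interactively instead, as the steps describe, is equally valid.
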