|
| 1 | +- [Embedding Model Summary](#embedding-模型汇总) |
| 2 | + - [Chinese Word Vectors](#中文词向量) |
| 3 | + - [English Word Vectors](#英文词向量) |
| 4 | + - [GloVe](#glove) |
| 5 | + - [FastText](#fasttext) |
| 6 | + - [Model Information](#模型信息) |
| 7 | + - [Acknowledgements](#致谢) |
| 8 | + - [References](#参考论文) |
| 9 | + |
1 | 10 | # Embedding Model Summary |
2 | 11 |
|
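3 | 12 | PaddleNLP provides several open-source pretrained embedding models. To load one, specify its name when using `paddlenlp.embeddings.TokenEmbedding`. The pretrained embedding models supported by PaddleNLP are listed below; each name is passed as a parameter to `paddlenlp.embeddings.TokenEmbedding`. Names follow the pattern \${model}.\${corpus}.\${vector type}.\${co-occurrence type}.dim\${dimension}. Three training models are available: Word2Vec (w2v, trained with the skip-gram model), GloVe (glove), and FastText (fasttext). |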
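The naming pattern described above can be illustrated with a small parser. This is only a sketch: `EmbeddingName` and `parse_embedding_name` are hypothetical helpers, not part of PaddleNLP. It assumes, based on the names listed in this document, that any extra field between the co-occurrence type and `dim…` (such as `char1-1` or `1-2`) belongs to the co-occurrence type, and that a trailing field such as `en` is a language tag.

```python
import re
from typing import NamedTuple, Optional


class EmbeddingName(NamedTuple):
    """Fields of a pretrained embedding name (hypothetical helper type)."""
    model: str               # w2v, glove, or fasttext
    corpus: str              # e.g. baidu_encyclopedia, wiki2014-gigaword
    vector_type: str         # target or context
    cooccurrence: str        # e.g. word-word, word-character.char1-1
    dim: int                 # embedding dimension
    language: Optional[str]  # trailing tag such as "en", if present


def parse_embedding_name(name: str) -> EmbeddingName:
    """Split a name of the form model.corpus.type.cooccurrence.dimN[.lang]."""
    parts = name.split(".")
    # Locate the "dimN" field; everything between index 3 and it is
    # treated as the co-occurrence type (an assumption of this sketch).
    dim_idx = next(
        (i for i, p in enumerate(parts) if re.fullmatch(r"dim\d+", p)), None
    )
    if dim_idx is None or dim_idx < 4:
        raise ValueError(f"unrecognized embedding name: {name!r}")
    model, corpus, vector_type = parts[:3]
    cooccurrence = ".".join(parts[3:dim_idx])
    language = parts[dim_idx + 1] if len(parts) > dim_idx + 1 else None
    dim = int(parts[dim_idx][3:])
    return EmbeddingName(model, corpus, vector_type, cooccurrence, dim, language)


if __name__ == "__main__":
    for n in (
        "w2v.baidu_encyclopedia.target.word-character.char1-1.dim300",
        "glove.wiki2014-gigaword.target.word-word.dim50.en",
    ):
        print(parse_embedding_name(n))
```

The parser only inspects the string; loading a model still means passing the full, unmodified name to `paddlenlp.embeddings.TokenEmbedding` as described above.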
@@ -42,11 +51,91 @@ PaddleNLP提供多个开源的预训练Embedding模型,用户仅需在使用`p |
42 | 51 |
|
43 | 52 | ## English Word Vectors |
44 | 53 |
|
45 | | -To be updated. |
| 54 | +### GloVe |
| 55 | + |
| 56 | +| Corpus | 25-dim | 50-dim | 100-dim | 200-dim | 300-dim |
| 57 | +| ----------------- | ------ | ------ | ------ | ------ | ------ |
| 58 | +| Wiki2014 + GigaWord | N/A | glove.wiki2014-gigaword.target.word-word.dim50.en | glove.wiki2014-gigaword.target.word-word.dim100.en | glove.wiki2014-gigaword.target.word-word.dim200.en | glove.wiki2014-gigaword.target.word-word.dim300.en |
| 59 | +| Twitter | glove.twitter.target.word-word.dim25.en | glove.twitter.target.word-word.dim50.en | glove.twitter.target.word-word.dim100.en | glove.twitter.target.word-word.dim200.en | N/A |
| 60 | + |
| 61 | +### FastText |
| 62 | + |
| 63 | +| Corpus | Name |
| 64 | +|------|------| |
| 65 | +| Wiki2017 | fasttext.wiki-news.target.word-word.dim300.en | |
| 66 | +| Crawl | fasttext.crawl.target.word-word.dim300.en | |
| 67 | + |
| 68 | +## Model Information |
| 69 | + |
| 70 | +| Model | File Size | Vocabulary Size |
| 71 | +|-----|---------|---------| |
| 72 | +| w2v.baidu_encyclopedia.target.word-word.dim300 | 678.21 MB | 635965 | |
| 73 | +| w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | 679.15 MB | 636038 | |
| 74 | +| w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | 679.30 MB | 636038 | |
| 75 | +| w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | 679.51 MB | 636038 | |
| 76 | +| w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | 679.48 MB | 635977 | |
| 77 | +| w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | 671.27 MB | 628669 | |
| 78 | +| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | 7.28 GB | 6969069 | |
| 79 | +| w2v.baidu_encyclopedia.target.word-wordLR.dim300 | 678.22 MB | 635958 | |
| 80 | +| w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | 679.32 MB | 636038 | |
| 81 | +| w2v.baidu_encyclopedia.target.bigram-char.dim300 | 679.29 MB | 635976 | |
| 82 | +| w2v.baidu_encyclopedia.context.word-word.dim300 | 677.74 MB | 635952 | |
| 83 | +| w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 | 678.65 MB | 636200 | |
| 84 | +| w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 | 844.23 MB | 792631 | |
| 85 | +| w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 | 1.16 GB | 1117461 | |
| 86 | +| w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 | 7.25 GB | 6967598 | |
| 87 | +| w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 | 5.21 GB | 5000001 | |
| 88 | +| w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300 | 7.26 GB | 6968998 | |
| 89 | +| w2v.baidu_encyclopedia.context.word-wordLR.dim300 | 1.32 GB | 1271031 | |
| 90 | +| w2v.baidu_encyclopedia.context.word-wordPosition.dim300 | 6.47 GB | 6293920 | |
| 91 | +| w2v.wiki.target.bigram-char.dim300 | 375.98 MB | 352274 | |
| 92 | +| w2v.wiki.target.word-char.dim300 | 375.52 MB | 352223 | |
| 93 | +| w2v.wiki.target.word-word.dim300 | 374.95 MB | 352219 | |
| 94 | +| w2v.wiki.target.word-bigram.dim300 | 375.72 MB | 352219 | |
| 95 | +| w2v.people_daily.target.bigram-char.dim300 | 379.96 MB | 356055 | |
| 96 | +| w2v.people_daily.target.word-char.dim300 | 379.45 MB | 355998 | |
| 97 | +| w2v.people_daily.target.word-word.dim300 | 378.93 MB | 355989 | |
| 98 | +| w2v.people_daily.target.word-bigram.dim300 | 379.68 MB | 355991 | |
| 99 | +| w2v.weibo.target.bigram-char.dim300 | 208.24 MB | 195199 | |
| 100 | +| w2v.weibo.target.word-char.dim300 | 208.03 MB | 195204 | |
| 101 | +| w2v.weibo.target.word-word.dim300 | 207.94 MB | 195204 | |
| 102 | +| w2v.weibo.target.word-bigram.dim300 | 208.19 MB | 195204 | |
| 103 | +| w2v.sogou.target.bigram-char.dim300 | 389.81 MB | 365112 | |
| 104 | +| w2v.sogou.target.word-char.dim300 | 389.89 MB | 365078 | |
| 105 | +| w2v.sogou.target.word-word.dim300 | 388.66 MB | 364992 | |
| 106 | +| w2v.sogou.target.word-bigram.dim300 | 388.66 MB | 364994 | |
| 107 | +| w2v.zhihu.target.bigram-char.dim300 | 277.35 MB | 259755 | |
| 108 | +| w2v.zhihu.target.word-char.dim300 | 277.40 MB | 259940 | |
| 109 | +| w2v.zhihu.target.word-word.dim300 | 276.98 MB | 259871 | |
| 110 | +| w2v.zhihu.target.word-bigram.dim300 | 277.53 MB | 259885 | |
| 111 | +| w2v.financial.target.bigram-char.dim300 | 499.52 MB | 467163 | |
| 112 | +| w2v.financial.target.word-char.dim300 | 499.17 MB | 467343 | |
| 113 | +| w2v.financial.target.word-word.dim300 | 498.94 MB | 467324 | |
| 114 | +| w2v.financial.target.word-bigram.dim300 | 499.54 MB | 467331 | |
| 115 | +| w2v.literature.target.bigram-char.dim300 | 200.69 MB | 187975 | |
| 116 | +| w2v.literature.target.word-char.dim300 | 200.44 MB | 187980 | |
| 117 | +| w2v.literature.target.word-word.dim300 | 200.28 MB | 187961 | |
| 118 | +| w2v.literature.target.word-bigram.dim300 | 200.59 MB | 187962 | |
| 119 | +| w2v.sikuquanshu.target.word-word.dim300 | 20.70 MB | 19529 | |
| 120 | +| w2v.sikuquanshu.target.word-bigram.dim300 | 20.77 MB | 19529 | |
| 121 | +| w2v.mixed-large.target.word-char.dim300 | 1.35 GB | 1292552 | |
| 122 | +| w2v.mixed-large.target.word-word.dim300 | 1.35 GB | 1292483 | |
| 123 | +| glove.wiki2014-gigaword.target.word-word.dim50.en | 73.45 MB | 400002 | |
| 124 | +| glove.wiki2014-gigaword.target.word-word.dim100.en | 143.30 MB | 400002 | |
| 125 | +| glove.wiki2014-gigaword.target.word-word.dim200.en | 282.97 MB | 400002 | |
| 126 | +| glove.wiki2014-gigaword.target.word-word.dim300.en | 422.83 MB | 400002 | |
| 127 | +| glove.twitter.target.word-word.dim25.en | 116.92 MB | 1193516 | |
| 128 | +| glove.twitter.target.word-word.dim50.en | 221.64 MB | 1193516 | |
| 129 | +| glove.twitter.target.word-word.dim100.en | 431.08 MB | 1193516 | |
| 130 | +| glove.twitter.target.word-word.dim200.en | 848.56 MB | 1193516 | |
| 131 | +| fasttext.wiki-news.target.word-word.dim300.en | 541.63 MB | 999996 | |
| 132 | +| fasttext.crawl.target.word-word.dim300.en | 1.19 GB | 2000002 | |
46 | 133 |
|
47 | 134 | ## Acknowledgements |
48 | | -- Thanks to [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) for providing the source of the Word2Vec Chinese embeddings. |
| 135 | +- Thanks to [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) for providing the pretrained Word2Vec Chinese embedding models, the [GloVe Project](https://nlp.stanford.edu/projects/glove) for providing the pretrained GloVe English embedding models, and the [FastText Project](https://fasttext.cc/docs/en/english-vectors.html) for providing the pretrained fastText English models. |
49 | 136 |
|
50 | 137 | ## References |
51 | 138 | - Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018). |
52 | 139 | - Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. |
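| 140 | +- Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. |
| 141 | +- Mikolov, Tomas, et al. "Advances in pre-training distributed word representations." Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018). 2018. |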