
conceptnet-numberbatch's Introduction

ConceptNet Numberbatch

The best pre-computed word embeddings you can use

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.

ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet is a knowledge graph that provides lots of ways to compute with word meanings, one of which is word embeddings, while ConceptNet Numberbatch is a snapshot of just the word embeddings.

These embeddings benefit from the fact that they have semi-structured, common sense knowledge from ConceptNet, giving them a way to learn about words that isn't just observing them in context.

Numberbatch is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. It is described in the paper ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, presented at AAAI 2017.
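For intuition, here is a minimal sketch of the standard retrofitting update (Faruqui et al., 2015) that this kind of ensemble builds on. This is an illustration under my own naming, not the actual Numberbatch build code:

import numpy as np

def retrofit(orig_vecs, neighbors, beta=1.0, iterations=10):
    # Iteratively pull each vector toward its graph neighbors while
    # keeping it anchored to its original (distributional) embedding.
    vecs = {term: vec.copy() for term, vec in orig_vecs.items()}
    for _ in range(iterations):
        for term, adjacent in neighbors.items():
            if term not in vecs:
                continue
            known = [vecs[a] for a in adjacent if a in vecs]
            if not known:
                continue
            # Closed-form update: weighted average of the original
            # vector and the current neighbor vectors.
            vecs[term] = (orig_vecs[term] + beta * sum(known)) / (1 + beta * len(known))
    return vecs

Numberbatch's actual method varies this update and applies it to the ConceptNet graph at scale.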

Unlike most embeddings, ConceptNet Numberbatch is multilingual from the ground up. Words in different languages share a common semantic space, and that semantic space is informed by all of the languages.

Evaluation and publications

ConceptNet Numberbatch can be seen as a replacement for other precomputed embeddings, such as word2vec and GloVe, that do not include the graph-style knowledge in ConceptNet. Numberbatch outperforms these datasets on benchmarks of word similarity.

ConceptNet Numberbatch took first place in both subtasks at SemEval 2017 task 2, "Multilingual and Cross-lingual Semantic Word Similarity". Within that task, it was also the first-place system in each of English, German, Italian, and Spanish. The result is described in our ACL 2017 SemEval paper, "Extending Word Embeddings with Multilingual Relational Knowledge".

The code and papers were created as a research project of Luminoso Technologies, Inc., by Robyn Speer, Joshua Chin, Catherine Havasi, and Joanna Lowry-Duda.

Graph of performance on English evaluations

Now with more fairness

Word embeddings are prone to learn human-like stereotypes and prejudices. ConceptNet Numberbatch 17.04 and later counteract this as part of the build process, leading to word vectors that are less prejudiced than competitors such as word2vec and GloVe. See our blog post on reducing bias.

Graph of biases

A paper by Chris Sweeney and Maryam Najafian, "A Transparent Framework for Evaluating Unintended Demographic Bias in Word Embeddings", independently evaluates bias in precomputed word embeddings, and finds that ConceptNet Numberbatch is less likely than GloVe to inherently lead to demographic discrimination.

Code

Since 2016, the code for building ConceptNet Numberbatch has been part of the ConceptNet code base, in the conceptnet5.vectors package.

The only code contained in this repository is text_to_uri.py, which normalizes natural-language text into the ConceptNet URI representation, allowing you to look up rows in these tables without requiring the entire ConceptNet codebase. For all other purposes, please refer to the ConceptNet code.
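For example, a quick sketch of typical usage (standardized_uri is the normalization function that text_to_uri.py provides):

from text_to_uri import standardized_uri

print(standardized_uri('en', 'New York'))  # -> '/c/en/new_york'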

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows (a rough code sketch appears after the list):

  • Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).
  • Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.
  • If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
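A rough Python sketch of that strategy, for illustration only (the real implementation is in conceptnet5.vectors; "vectors" here stands for any mapping from ConceptNet URIs to embedding arrays):

import numpy as np

def oov_vector(vectors, term, language):
    uri = '/c/%s/%s' % (language, term)
    if uri in vectors:
        return vectors[uri]
    # Step 1: try the identically-spelled English word.
    english_uri = '/c/en/%s' % term
    if language != 'en' and english_uri in vectors:
        return vectors[english_uri]
    # Steps 2-3: drop letters from the end until some known words start
    # with the remaining prefix, then average their embeddings.
    # Give up when only a single character would remain.
    for end in range(len(term) - 1, 1, -1):
        prefix = '/c/%s/%s' % (language, term[:end])
        matches = [vec for key, vec in vectors.items() if key.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
    raise KeyError(term)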

Downloads

ConceptNet Numberbatch 19.08 is the current recommended download.

This table lists the downloads and formats available for multiple recent versions:

Version   Multilingual                 English-only                     HDF5
19.08     numberbatch-19.08.txt.gz     numberbatch-en-19.08.txt.gz      19.08/mini.h5
17.06     numberbatch-17.06.txt.gz     numberbatch-en-17.06.txt.gz      17.06/mini.h5
17.04     numberbatch-17.04.txt.gz     numberbatch-en-17.04b.txt.gz     17.05/mini.h5
17.02     numberbatch-17.02.txt.gz     numberbatch-en-17.02.txt.gz      —
16.09     —                            —                                16.09/numberbatch.h5

The 16.09 version was the version published at AAAI 2017. You can reproduce its results using a Docker snapshot of the conceptnet5 repository. See the instructions on the ConceptNet wiki.

The .txt.gz files of term vectors are in the text format used by word2vec, GloVe, and fastText.

The first line of the file contains the dimensions of the matrix:

9161912 300

Each line contains a term label followed by 300 floating-point numbers, separated by spaces:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...
/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...
/c/en/absoluteless 0.2740 0.0718 0.1548 0.1118 -0.1669 -0.0216 -0.0508...
/c/en/absolutely 0.0065 -0.1813 0.0335 0.0991 -0.1123 0.0060 -0.0009 0...
/c/en/absolutely_convergent 0.3752 0.1087 -0.1299 -0.0796 -0.2753 -0.1...
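Because the files include the dimensions header, they can be loaded directly with, for example, gensim (a sketch; gensim reads the gzipped file as-is, and note that the multilingual files label terms with full URIs such as /c/en/coffee, while the English-only files use bare words):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('numberbatch-19.08.txt.gz', binary=False)
print(vectors.similarity('/c/en/coffee', '/c/en/tea'))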

The HDF5 files are the format that ConceptNet uses internally. They are data tables that can be loaded into Python using a library such as pandas or pytables.

The "mini.h5" files trade off a little bit of accuracy for a lot of memory savings, taking up less than 150 MB in RAM, and are used to power the ConceptNet API.

License and attribution

These vectors are distributed under the CC-By-SA 4.0 license. In short, if you distribute a transformed or modified version of these vectors, you must release them under a compatible Share-Alike license and give due credit to Luminoso.

Some suggested text:

This data contains semantic vectors from ConceptNet Numberbatch, by
Luminoso Technologies, Inc. You may redistribute or modify the
data under the terms of the CC-By-SA 4.0 license.

If you build on this data, you should cite it. Here is a straightforward citation:

Robyn Speer, Joshua Chin, and Catherine Havasi (2017). "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." In proceedings of AAAI 2017.

In BibTeX form, the citation is:

@inproceedings{speer2017conceptnet,
    title = {{ConceptNet} 5.5: An Open Multilingual Graph of General Knowledge},
    url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972},
    author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine},
    year = {2017},
    pages = {4444--4451}
}

This data is itself built on:

  • ConceptNet 5.7, which contains data from Wiktionary, WordNet, and many contributors to Open Mind Common Sense projects, edited by Robyn Speer

  • GloVe, by Jeffrey Pennington, Richard Socher, and Christopher Manning

  • word2vec, by Tomas Mikolov and Google Research

  • Parallel text from OpenSubtitles 2016, by Pierre Lison and Jörg Tiedemann, analyzed using fastText, by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov

Language statistics

The multilingual data in ConceptNet Numberbatch represents 78 different language codes, though some have vocabularies with much more coverage than others. The following table lists the languages and their vocabulary size.

You may notice a focus on even the smaller and historical languages of Europe, and under-representation of widely-spoken languages from outside Europe, which is an effect of the availability of linguistic resources for these languages. We would like to change this, but it requires finding good source data for ConceptNet in these under-represented languages.

Because Numberbatch contains word forms, inflected languages end up with larger vocabularies.

These vocabulary sizes were updated for ConceptNet Numberbatch 19.08.

code language vocab size
fr French 1388686
la Latin 855294
es Spanish 651859
de German 594456
it Italian 557743
en English 516782
ru Russian 455325
zh Chinese 307441
fi Finnish 267307
pt Portuguese 262904
ja Japanese 256648
nl Dutch 190221
bg Bulgarian 178508
sv Swedish 167321
pl Polish 152949
no Norwegian Bokmål 105689
eo Esperanto 96255
th Thai 95342
sl Slovenian 91134
ms Malay 90554
cs Czech 88613
ca Catalan 87508
ar Arabic 85325
hu Hungarian 74384
se Northern Sami 67601
sh Serbian 66746
el Greek 65905
gl Galician 59006
da Danish 57119
fa Persian 53984
ro Romanian 51437
tr Turkish 51308
is Icelandic 48639
eu Basque 44151
ko Korean 42106
vi Vietnamese 39802
ga Irish 36988
grc Ancient Greek 36977
uk Ukrainian 36851
lv Latvian 36333
he Hebrew 33435
mk Macedonian 33370
ka Georgian 32338
hy Armenian 29844
sk Slovak 29376
lt Lithuanian 28826
ast Asturian 28401
mg Malagasy 26865
et Estonian 26525
oc Occitan 26095
fil Filipino 25088
io Ido 25004
hsb Upper Sorbian 24852
hi Hindi 23538
te Telugu 22173
be Belarusian 22117
fro Old French 21249
sq Albanian 20493
mul (Multilingual, such as emoji) 19376
cy Welsh 18721
xcl Classical Armenian 18420
az Azerbaijani 17184
kk Kazakh 16979
gd Scottish Gaelic 16827
af Afrikaans 16132
fo Faroese 15973
ang Old English 15700
ku Kurdish 13804
vo Volapük 12731
ta Tamil 12690
ur Urdu 12006
sw Swahili 11150
sa Sanskrit 11081
nrf Norman French 10048
non Old Norse 8536
gv Manx 8425
nv Navajo 8232
rup Aromanian 5107

Referred here from an old version?

An unpublished paper of ours described the "ConceptNet Vector Ensemble"; it referred to a repository that now redirects here and to an attached store of data that is no longer hosted. We apologize, but we are not supporting the unpublished paper. Please use a newer version of Numberbatch and the currently supported ConceptNet build process.

Image credit

The otter logo was designed by Christy Presler for The Noun Project, and is used under a Creative Commons Attribution license.

conceptnet-numberbatch's People

Contributors

joshua-chin, pawnyy, rspeer, thatandromeda


conceptnet-numberbatch's Issues

Cannot get annex files

After everything else is successful, including a Git Annex installation, when I type:

cd code/data
git annex get

Nothing occurs. No downloads start. I tried this with cd code/source-data, and it does not work either. How can I download the data files?

Alternatively, is there a way to obtain the files without git annex?

Vector Ensembler Code

Hello

Here we see the dataset, but where can I find the Numberbatch vector ensembler code?

Thank you very much.

Accuracy issues

Hi

I tried the ConceptNet Numberbatch pre-trained embeddings on a CNN classification task and compared the results with GloVe and word2vec.

The word2vec and GloVe results are still better than the ConceptNet embeddings. I was expecting better accuracy from Numberbatch.

Any advice? Am I doing anything wrong?

KeyError: "word 'coffee_pot' not in vocabulary"

Hello,

I am trying to get the similarity between two words. I am using the multilingual (numberbatch-17.06.txt) and the smaller English-only (numberbatch-en-17.06.txt) word vectors.

What I have achieved so far:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
KeyError: "word 'coffee_pot' not in vocabulary"

No matter the word pairs, it never finds any similarities.

Interestingly, when I do the exact same thing with the smaller English-only word vector file, everything works just fine:

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-en-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
0.5312845

For testing purposes when I iterate over every line of these files, I get the following results:

  1. numberbatch-17.06.txt -> 1,917,248
  2. numberbatch-en-17.06.txt -> 417,195

This shows us that the files are just fine and contain data.

Example content of file numberbatch-en-17.06.txt:

417194 300
tea_kettle 0.0387 -0.0292 0.2034 0.0983 -0.0785 -0.0051 -0.0116 -0.1310 0.1573 0.0358 -0.1409 -0.0158 -0.0262 -0.0663 -0.0684 0.1487 0.0211 0.0157 0.0348 -0.1160 -0.0701 -0.0608 -0.0211 0.0731 0.1092 -0.0442 0.0256 0.0136 0.0202 0.0671 0.0546 -0.0398 0.0347 0.1572 0.0104 0.0684 0.0615 0.0011 0.0769 -0.0849 0.1121 -0.0146 0.0206 0.0890 0.0034 0.0998 -0.1155 -0.0272 0.1015 0.0245 -0.0029 0.0695 0.0315 0.0344 -0.1253 -0.0065 0.0318 0.0381 0.0714 0.1117 0.0643 0.0176 -0.0146 0.0323 -0.0121 0.0828 0.1397 0.0657 0.0341 -0.0022 -0.0808 -0.0102 -0.0376 -0.0665 0.0470 -0.0740 0.0475 -0.0439 -0.1397 -0.0080 -0.0162 -0.0080 -0.0090 0.0758 0.0810 0.0960 0.0251 0.0324 0.0364 -0.0174 0.0730 0.0455 0.0726 -0.0408 0.1600 -0.0330 0.0497 0.0386 0.0575 0.0502 0.0282 0.0694 0.0284 0.0106 0.0604 -0.0308 0.1479 0.0419 0.0148 -0.0838 0.0076 0.0850 -0.0081 0.0001 -0.0346 0.0440 0.0194 -0.0662 -0.0037 -0.0127 0.0501 -0.0037 -0.0433 0.0840 0.0849 -0.0227 -0.0348 -0.0678 0.0064 0.0069 -0.0961 0.0382 -0.0234 -0.0157 0.0476 0.0230 0.0274 -0.0948 -0.0189 -0.0320 0.0148 0.0048 0.0111 0.0164 -0.0060 0.0528 -0.0438 -0.0374 0.0483 -0.0509 -0.0621 -0.0944 0.0287 -0.0347 0.0426 0.0072 0.0636 -0.0269 0.0194 0.0125 0.0522 -0.0145 -0.0429 -0.0658 0.0550 -0.0563 0.0634 -0.0271 0.0067 0.0529 0.0446 0.0477 -0.0389 -0.0156 -0.0803 0.0096 -0.0045 0.0738 0.0082 0.1149 0.0426 0.0435 0.1527 0.0145 0.0287 0.0157 0.0240 -0.0163 0.0111 -0.1571 -0.0086 0.0315 0.1189 -0.0286 0.0136 -0.0009 -0.0022 -0.0620 -0.0087 -0.0087 0.0451 -0.0221 0.0440 0.0300 0.0246 -0.0211 0.0015 -0.0988 0.0207 0.0209 -0.0194 0.0085 0.0048 -0.0461 -0.0463 0.0118 0.0319 0.0644 0.0314 -0.0716 0.0013 0.0189 0.0017 -0.0892 -0.0420 -0.0389 0.0255 -0.0115 -0.0180 -0.0208 -0.0679 -0.0670 -0.0114 0.0184 0.0075 -0.0079 0.0893 0.1186 -0.0519 0.0240 0.0709 -0.0012 -0.0427 0.0180 -0.0194 0.0077 0.0242 0.0327 0.0736 -0.1041 0.0360 -0.0107 0.1080 -0.0048 0.0447 -0.0109 -0.0357 0.0029 0.0464 0.0288 0.0930 0.0280 -0.0380 -0.0303 0.0239 -0.0361 0.1058 0.0381 0.0397 0.0503 0.0488 -0.0014 -0.0189 0.0218 0.0538 0.0643 -0.0117 -0.0569 -0.0072 -0.0235 -0.0106 -0.0155 0.0249 0.0790 0.0974 -0.0126 -0.0214 -0.0303 -0.0031 -0.0403 -0.1275 0.0454 -0.0159 -0.0287 -0.0092 -0.0471 -0.0019 0.0183 -0.0509 -0.0412
coffee_pot -0.0230 0.0046 0.0981 0.1118 -0.0274 -0.0430 0.0668 -0.1377 0.1417 -0.0054 -0.1251 0.0249 -0.0319 -0.0386 -0.0870 0.1135 0.0580 0.0420 -0.0394 -0.0855 -0.1048 -0.0423 -0.0198 0.0363 0.0809 -0.0504 -0.0459 0.0026 -0.1134 -0.0098 0.0396 0.0257 0.0578 0.0409 0.1037 0.0127 0.0631 0.0111 0.0341 -0.0565 0.0457 -0.0754 0.0174 0.0017 0.0379 0.0919 0.0048 -0.0303 0.1128 -0.0517 -0.0679 0.0375 0.0068 0.0612 -0.0367 -0.0346 0.0093 0.0608 0.0587 0.0321 0.0465 -0.0551 -0.0880 -0.0569 -0.0324 0.0402 0.0586 0.0173 -0.0797 -0.0163 -0.0103 -0.0142 -0.0537 -0.0697 0.1746 -0.0507 0.0150 -0.0284 -0.1064 -0.0054 -0.0395 -0.0012 0.0224 -0.0276 -0.0227 0.0777 0.0406 0.0460 0.0104 -0.0124 -0.0179 -0.0581 0.0546 0.0230 0.1200 -0.0507 0.1206 0.0995 0.1138 0.1081 0.1309 0.1133 0.0837 0.0106 0.1533 -0.0413 0.0384 0.0320 -0.0448 0.0390 -0.0273 -0.0037 0.0100 0.1070 0.1078 -0.0111 -0.0051 -0.1064 -0.0507 -0.0184 -0.0077 -0.0425 -0.0462 0.0528 0.0964 -0.0050 0.0147 -0.0723 -0.0232 0.0427 -0.1352 0.0433 -0.0277 -0.0064 0.0547 -0.0011 0.0105 0.0018 -0.0281 -0.0369 0.0138 -0.0069 0.0185 0.0368 0.0152 0.0851 -0.0760 0.0149 0.0127 -0.0212 0.0215 -0.0758 -0.0211 -0.0327 0.0059 0.0646 0.0738 -0.0097 0.0307 -0.0074 -0.0192 0.0750 0.0092 -0.0525 0.0939 0.0345 0.0386 -0.0119 -0.0113 0.0230 0.0050 0.0099 0.0856 0.0425 -0.0634 -0.0230 0.0607 -0.0060 -0.0486 0.1053 0.0487 -0.0081 0.0836 -0.0040 0.0138 -0.1171 0.0372 0.0944 0.0219 -0.0437 0.0506 0.0204 0.1172 0.0622 -0.0056 0.0303 -0.0120 -0.0067 0.0493 -0.0059 -0.0535 -0.0646 0.0731 0.0510 -0.0589 0.0143 -0.0261 -0.1250 0.0329 -0.0203 -0.0688 -0.0065 0.0075 0.0406 -0.0259 0.0218 0.0851 0.1140 0.0471 -0.0155 -0.0035 0.0228 0.0486 -0.0672 -0.0486 -0.0427 0.0194 0.1313 -0.0559 0.1879 0.0610 0.0066 -0.0540 0.0240 0.0789 0.0820 -0.0753 0.0255 -0.0801 -0.0039 0.0454 -0.0655 0.0078 -0.0493 -0.0665 -0.0217 0.0398 0.0206 0.0275 -0.1553 0.0141 -0.0150 -0.0216 -0.0092 0.0282 0.0306 0.0238 0.0245 -0.0251 -0.0183 0.0438 0.0267 -0.0379 0.0549 0.0149 -0.0172 -0.0228 0.0316 0.0067 0.0254 0.0174 -0.0269 -0.0616 0.0822 0.0304 -0.0101 0.0323 -0.0698 0.0373 0.0479 -0.0292 0.0060 0.0129 -0.0062 -0.0005 0.0549 -0.0928 0.0237 0.0139 -0.0256 -0.0110 -0.0107 0.0545 -0.0719 -0.0023 -0.0257 -0.0343 0.0371 -0.0116 -0.1188
...etc

I assume the file numberbatch-17.06.txt has even more data inside (I cannot open the txt file directly, as it is too massive).

What might be the issue here? Why can't I get similarities between words? Am I running out of memory?

Versions

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.4
SciPy 1.1.0
gensim 3.8.0
FAST_VERSION 1
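A likely explanation, going by the format described earlier in this README: the multilingual file labels every term with its full ConceptNet URI, while the English-only file strips the /c/en/ prefix. If so, the lookup against the multilingual file should be (untested sketch):

print(model.similarity('/c/en/coffee_pot', '/c/en/tea_kettle'))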

Format different from Word2Vec's format?

It looks like GloVe and Word2Vec have slightly different formats for their files, so I think it's a bit confusing to say that the models here are in the same format?

I noticed this when trying to load these embeddings into gensim. Apparently the same problem exists with GloVe, and this repository offers a solution that also works for the ConceptNet embeddings: https://github.com/manasRK/glove-gensim

Basically, the first line needs to indicate the number of word embeddings in the file and the number of dimensions of the vectors. I think it'd be a good idea to at least mention this in the README.
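For files that lack that header line (such as GloVe's own downloads), gensim ships a converter that prepends it; a sketch, with illustrative filenames:

from gensim.scripts.glove2word2vec import glove2word2vec

# Writes a copy of the input with the "<vocab_size> <dimensions>" first
# line that load_word2vec_format expects.
glove2word2vec('vectors.glove.txt', 'vectors.word2vec.txt')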

training script for embedding

This is really great work! Now I want to add more knowledge to ConceptNet and train another embedding. I wonder if you can share the training script? Thanks in advance for any help.

.

How do I delete this?

List of removed stop words

Section 3.1 of the paper (second paragraph) says some stop words were removed during pre-processing. Is there a list of the words that were removed? Some very common stop words still appear to be present, so I wanted to be sure which ones had been knowingly removed.

@paper is not recognized while importing citation

Hello,
The citation you provided in the readme threw an error while I was importing it into Zotero.
I couldn't find anything related to a @paper entry; I believe the following is the correct (updated?) form.

@inproceedings{speer2017conceptnet,
  title = {{{ConceptNet}} 5.5: {{An Open Multilingual Graph}} of {{General Knowledge}}},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972},
  eventtitle = {{{AAAI Conference}} on {{Artificial Intelligence}}},
  date = {2017},
  pages = {4444-4451},
  author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine}
}

Converting sentence into a list of concepts?

When using the following code from Readme:

>>> from conceptnet5.nodes import standardized_concept_uri

>>> standardized_concept_uri('en', 'this is an example')
'/c/en/be_example'

What I get instead is

{
  "uri": "/c/en/this_be_example"
}

Which is of course not found in conceptnet-ensemble-201603-labels.txt

Is there a "standard" way to turn an arbitrary sentence into a list of concepts?

Quick questions. Thanks.

Thanks for the great open-source release and embeddings!

I just have a few quick questions:

  1. The GloVe embeddings (the 840B version) have a vocabulary size of 2,196,017, while your embeddings have only 1,453,347 words. Since I am under the impression that your approach combines lots of resources (word2vec/GloVe/PPDB/ConceptNet), could you please clarify why yours has a much smaller vocabulary (~66% of GloVe's)?

  2. Is this because you combine words together into phrases? I found there are lots of phrase words in your vocabulary, like "supreme_court", "washington_dc", "san_francisco" or "natural_gas", etc., while GloVe does not have these phrase words.

  3. BTW, is there any possibility of releasing your embeddings as plain text files (zipped, just like GloVe's format) instead of numpy matrices?

Thanks again!

Sorting by occurrence count

Hi guys! Do you think you could provide the conceptnet-numberbatch embeddings sorted by some kind of word frequency, similar to what GloVe and fastText provide? In my research I limit the vocabulary to the most frequent K words so as not to eat all the GPU memory with embedding lookups when using pretrained embeddings in my models, and the sort order used by the other embeddings makes this much easier.

meaning of number of # characters in subwords?

For continuation words, there are varying numbers of # signs. For example, the first 5 words include the following:

  • /c/de/####er
  • /c/de/###er
  • /c/de/##er

For example, if I have a word ending with "er", which one should I use?

Thanks..

Embedding for other dimensions: 50, 100 and 200

It's really nice to see concept-enriched embeddings!

It would be nicer to have embeddings in other dimensions, e.g., 50, 100 and 200 because there are many models that use these smaller dimensions to prevent overfitting.

error while getting datasets with "git annex get"

I have done all the setup and tried to get the dataset with the git annex command, but it says:

get code/source-data/conceptnet5.5.csv
Remote origin not usable by git-annex; setting annex-ignore
(not available)
No other repository is known to contain the file.
failed
get code/source-data/conceptnet5.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.300d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.600d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.labels (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove.42B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove12.840B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/ppdb-xl-lexical.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/w2v-google-news.bin.gz (not available)
No other repository is known to contain the file.
failed
git-annex: get: 9 failed

Can I use embeddings in a closed-source game (through a REST server)?

I'm a little confused about how the license of your project applies to my case.

I would like to use the embeddings for my game. The embedding data would be used by a simple REST server, which would provide one method: calculating the similarity between two words.

That method would be used by my game. So in my game I won't use the embeddings directly, nor will I distribute them. My game will just use the calculated similarity from the REST server.

So, my question is: how does the Share-Alike license apply to my case? Can my game be closed source? What about my server? Do I need to release it under the CC-By-SA 4.0 license?

Large number of zero vectors

Hello,

This could be an issue with my processing, but it appears that 617,129 out of the 665,494 English vectors are zero vectors: they are defined in the label, but contain all zeros (i.e., there are only 48,365 non-zero vectors for English). I discovered this with the 300-dimensional dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you can confirm this is not an issue on your side using the dataset available for download, I can work on fixing it on my side.

For reference, this is the code I used to count empty vectors:

import numpy as np

empty = np.zeros(300)
count = 0
for each in englishVectors:
    if np.array_equal(each, empty):
        count += 1
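A vectorized equivalent, assuming the vectors are stacked into a single numpy matrix (illustrative, not the reporter's code):

zero_rows = int((matrix == 0).all(axis=1).sum())  # count of all-zero rows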

I discovered this while trying to figure out the words closest to semi-common words.

For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following:

['adddresse', 'rudat', 'barhydt', 'weeked', 'inovonics', 'alleppey', 'katten', 'georgievski', 'kopinski', 'waxwing', 'irin_plusnews']

Error when running ninja: shape mismatch in assignment

Hi,

When I launch ninja, I get this error:

[21/127] python3 -m conceptnet_retrofitting.builders.self_loops build...uild-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
FAILED: build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
python3 -m conceptnet_retrofitting.builders.self_loops build-data/glove.840B.300d.ppdb-xl-lexical-standardized.npz build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
Traceback (most recent call last):
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 22, in <module>
    main(*sys.argv[1:])
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 16, in main
    assoc = self_loops(assoc)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 8, in self_loops
    assoc[diagonal] = assoc.sum(axis=1).T[0]
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/site-packages/scipy/sparse/_index.py", line 124, in __setitem__
    raise ValueError("shape mismatch in assignment")
ValueError: shape mismatch in assignment

I used the original datasets, and when I debug, I cannot see a mismatch in shape...
Has anybody else had this issue?

Spelling Error in README

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning.

SHOULD BE:

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.

Happens to the best of us!

Common word subset

Is it possible to download a subset of Numberbatch sorted by common words? In my application, it is computationally infeasible to load all of the words into memory.

However, a 20% subset of the most common words would solve my problems and fit into memory as well.

Please let me know if this is possible!
Nick

Lemmatization for SNLI

Hi,

I would like to use your embeddings on the SNLI dataset. However, due to lemmatization, almost half of the words have no embeddings. Therefore I'd like to lemmatize the SNLI dataset.

I am wondering which lemmatization algorithm would be best to get a dataset similar to ConceptNet Numberbatch.

Do all versions occupy the same vector space?

Do all versions occupy the same vector space, so that the same words in different versions have similar coordinates?

This would be extremely useful when upgrading to new versions, as one wouldn't need to vectorize the entire corpus again. It becomes even more important in cases where the original source text isn't available anymore.

Thanks

downloads and term vectors not available

Your links are not working. I'm trying to download any of the following:
conceptnet-numberbatch-201609_uris_main.txt.gz (1928481 × 300) contains terms from many languages, specified by their complete ConceptNet URI (the strings starting with /c/en/ in the example above).

conceptnet-numberbatch-201609_en_main.txt.gz (426572 × 300) contains only English terms, and strips the /c/en/ prefix to provide just the term text. This form is the most compatible with other systems, as long as you only want English.

conceptnet-numberbatch-201609_en_extra.txt.gz (233488 × 300) contains additional single words of English whose vectors could be inferred as the average of their neighbors in ConceptNet.

conceptnet-numberbatch-201609.h5 contains the data in its native HDF5 format, which can be loaded with the Python library pandas.

http://conceptnet5.media.mit.edu/downloads/ is also not working.

Using the pretrained term vectors

This is my first time using the pretrained term vectors, and I noticed they come as a text file. The word2vec GoogleNews pretrained vectors can be loaded as a numpy array, which can optionally be read from disk with mmap_mode: given a term, you look up a dictionary or hashtable to get an index for the term, then extract the term vector from the numpy array using that index. I've used this successfully.

Can Numberbatch be used in a similar way, and if so, how?
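One way to get the same setup (a hand-rolled sketch with illustrative filenames, not an official tool): convert the text file once into a numpy matrix plus a term-to-row index, then open the matrix with mmap_mode on later runs.

import gzip
import numpy as np

def convert(txt_gz_path, npy_path):
    # One-time conversion: word2vec text format -> numpy matrix + term list.
    with gzip.open(txt_gz_path, 'rt', encoding='utf-8') as f:
        n_rows, n_dims = map(int, f.readline().split())
        matrix = np.zeros((n_rows, n_dims), dtype=np.float32)
        terms = []
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            terms.append(parts[0])
            matrix[i] = np.asarray(parts[1:], dtype=np.float32)
    np.save(npy_path, matrix)
    return terms

terms = convert('numberbatch-en-19.08.txt.gz', 'numberbatch-en.npy')
index = {term: i for i, term in enumerate(terms)}

# Later runs: memory-map the matrix and index it by term.
matrix = np.load('numberbatch-en.npy', mmap_mode='r')
vector = matrix[index['coffee']]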

Wrong link in readme

in README.md

numberbatch-17.04.txt.gz contains this data in 77 languages.
numberbatch-en-17.04.txt.gz contains just the English subset of the data, with the /c/en/ prefix removed.

Both link to 17.02 instead of 17.04

number batch code

Hi,

I've run the code over the whole weekend, but it still has not completed. You had mentioned it would take about a day to run... do you think something went wrong?

Thanks,

Megh
