
conceptnet-numberbatch's Introduction

ConceptNet Numberbatch

The best pre-computed word embeddings you can use

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.

ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet is a knowledge graph that provides lots of ways to compute with word meanings, one of which is word embeddings, while ConceptNet Numberbatch is a snapshot of just the word embeddings.

These embeddings benefit from the fact that they have semi-structured, common sense knowledge from ConceptNet, giving them a way to learn about words that isn't just observing them in context.

Numberbatch is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. It is described in the paper ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, presented at AAAI 2017.
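For intuition, here is a minimal sketch of the standard retrofitting update (Faruqui et al., 2015) that this kind of ensemble builds on. This is an illustration under my own naming, not the actual Numberbatch build code:

import numpy as np

def retrofit(orig_vecs, neighbors, beta=1.0, iterations=10):
    # Iteratively pull each vector toward its graph neighbors while
    # keeping it anchored to its original (distributional) embedding.
    vecs = {term: vec.copy() for term, vec in orig_vecs.items()}
    for _ in range(iterations):
        for term, adjacent in neighbors.items():
            if term not in vecs:
                continue
            known = [vecs[a] for a in adjacent if a in vecs]
            if not known:
                continue
            # Closed-form update: weighted average of the original
            # vector and the current neighbor vectors.
            vecs[term] = (orig_vecs[term] + beta * sum(known)) / (1 + beta * len(known))
    return vecs

Numberbatch's actual method varies this update and applies it to the ConceptNet graph at scale.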

Unlike most embeddings, ConceptNet Numberbatch is multilingual from the ground up. Words in different languages share a common semantic space, and that semantic space is informed by all of the languages.

Evaluation and publications

ConceptNet Numberbatch can be seen as a replacement for other precomputed embeddings, such as word2vec and GloVe, that do not include the graph-style knowledge in ConceptNet. Numberbatch outperforms these datasets on benchmarks of word similarity.

ConceptNet Numberbatch took first place in both subtasks at SemEval 2017 task 2, "Multilingual and Cross-lingual Semantic Word Similarity". Within that task, it was also the first-place system in each of English, German, Italian, and Spanish. The result is described in our ACL 2017 SemEval paper, "Extending Word Embeddings with Multilingual Relational Knowledge".

The code and papers were created as a research project of Luminoso Technologies, Inc., by Robyn Speer, Joshua Chin, Catherine Havasi, and Joanna Lowry-Duda.

Graph of performance on English evaluations

Now with more fairness

Word embeddings are prone to learn human-like stereotypes and prejudices. ConceptNet Numberbatch 17.04 and later counteract this as part of the build process, leading to word vectors that are less prejudiced than competitors such as word2vec and GloVe. See our blog post on reducing bias.

Graph of biases

A paper by Chris Sweeney and Maryam Najafian, "A Transparent Framework for Evaluating Unintended Demographic Bias in Word Embeddings", independently evaluates bias in precomputed word embeddings, and finds that ConceptNet Numberbatch is less likely than GloVe to inherently lead to demographic discrimination.

Code

Since 2016, the code for building ConceptNet Numberbatch has been part of the ConceptNet code base, in the conceptnet5.vectors package.

The only code contained in this repository is text_to_uri.py, which normalizes natural-language text into the ConceptNet URI representation, allowing you to look up rows in these tables without requiring the entire ConceptNet codebase. For all other purposes, please refer to the ConceptNet code.
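For example, a quick sketch of typical usage (standardized_uri is the normalization function that text_to_uri.py provides):

from text_to_uri import standardized_uri

print(standardized_uri('en', 'New York'))  # -> '/c/en/new_york'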

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows (a rough code sketch appears after the list):

  • Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).
  • Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.
  • If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
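A rough Python sketch of that strategy, for illustration only (the real implementation is in conceptnet5.vectors; "vectors" here stands for any mapping from ConceptNet URIs to embedding arrays):

import numpy as np

def oov_vector(vectors, term, language):
    uri = '/c/%s/%s' % (language, term)
    if uri in vectors:
        return vectors[uri]
    # Step 1: try the identically-spelled English word.
    english_uri = '/c/en/%s' % term
    if language != 'en' and english_uri in vectors:
        return vectors[english_uri]
    # Steps 2-3: drop letters from the end until some known words start
    # with the remaining prefix, then average their embeddings.
    # Give up when only a single character would remain.
    for end in range(len(term) - 1, 1, -1):
        prefix = '/c/%s/%s' % (language, term[:end])
        matches = [vec for key, vec in vectors.items() if key.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
    raise KeyError(term)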

Downloads

ConceptNet Numberbatch 19.08 is the current recommended download.

This table lists the downloads and formats available for multiple recent versions:

Version   Multilingual                 English-only                     HDF5
19.08     numberbatch-19.08.txt.gz     numberbatch-en-19.08.txt.gz      19.08/mini.h5
17.06     numberbatch-17.06.txt.gz     numberbatch-en-17.06.txt.gz      17.06/mini.h5
17.04     numberbatch-17.04.txt.gz     numberbatch-en-17.04b.txt.gz     17.05/mini.h5
17.02     numberbatch-17.02.txt.gz     numberbatch-en-17.02.txt.gz      —
16.09     —                            —                                16.09/numberbatch.h5

The 16.09 version was the version published at AAAI 2017. You can reproduce its results using a Docker snapshot of the conceptnet5 repository. See the instructions on the ConceptNet wiki.

The .txt.gz files of term vectors are in the text format used by word2vec, GloVe, and fastText.

The first line of the file contains the dimensions of the matrix:

9161912 300

Each line contains a term label followed by 300 floating-point numbers, separated by spaces:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...
/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...
/c/en/absoluteless 0.2740 0.0718 0.1548 0.1118 -0.1669 -0.0216 -0.0508...
/c/en/absolutely 0.0065 -0.1813 0.0335 0.0991 -0.1123 0.0060 -0.0009 0...
/c/en/absolutely_convergent 0.3752 0.1087 -0.1299 -0.0796 -0.2753 -0.1...
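Because the files include the dimensions header, they can be loaded directly with, for example, gensim (a sketch; gensim reads the gzipped file as-is, and note that the multilingual files label terms with full URIs such as /c/en/coffee, while the English-only files use bare words):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('numberbatch-19.08.txt.gz', binary=False)
print(vectors.similarity('/c/en/coffee', '/c/en/tea'))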

The HDF5 files are the format that ConceptNet uses internally. They are data tables that can be loaded into Python using a library such as pandas or pytables.

The "mini.h5" files trade off a little bit of accuracy for a lot of memory savings, taking up less than 150 MB in RAM, and are used to power the ConceptNet API.

License and attribution

These vectors are distributed under the CC-By-SA 4.0 license. In short, if you distribute a transformed or modified version of these vectors, you must release them under a compatible Share-Alike license and give due credit to Luminoso.

Some suggested text:

This data contains semantic vectors from ConceptNet Numberbatch, by
Luminoso Technologies, Inc. You may redistribute or modify the
data under the terms of the CC-By-SA 4.0 license.

If you build on this data, you should cite it. Here is a straightforward citation:

Robyn Speer, Joshua Chin, and Catherine Havasi (2017). "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." In proceedings of AAAI 2017.

In BibTeX form, the citation is:

@inproceedings{speer2017conceptnet,
    title = {{ConceptNet} 5.5: An Open Multilingual Graph of General Knowledge},
    url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972},
    author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine},
    year = {2017},
    pages = {4444--4451}
}

This data is itself built on:

  • ConceptNet 5.7, which contains data from Wiktionary, WordNet, and many contributors to Open Mind Common Sense projects, edited by Robyn Speer

  • GloVe, by Jeffrey Pennington, Richard Socher, and Christopher Manning

  • word2vec, by Tomas Mikolov and Google Research

  • Parallel text from OpenSubtitles 2016, by Pierre Lison and Jörg Tiedemann, analyzed using fastText, by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov

Language statistics

The multilingual data in ConceptNet Numberbatch represents 78 different language codes, though some have vocabularies with much more coverage than others. The following table lists the languages and their vocabulary size.

You may notice a focus on even the smaller and historical languages of Europe, and under-representation of widely-spoken languages from outside Europe, which is an effect of the availability of linguistic resources for these languages. We would like to change this, but it requires finding good source data for ConceptNet in these under-represented languages.

Because Numberbatch contains word forms, inflected languages end up with larger vocabularies.

These vocabulary sizes were updated for ConceptNet Numberbatch 19.08.

code language vocab size
fr French 1388686
la Latin 855294
es Spanish 651859
de German 594456
it Italian 557743
en English 516782
ru Russian 455325
zh Chinese 307441
fi Finnish 267307
pt Portuguese 262904
ja Japanese 256648
nl Dutch 190221
bg Bulgarian 178508
sv Swedish 167321
pl Polish 152949
no Norwegian Bokmål 105689
eo Esperanto 96255
th Thai 95342
sl Slovenian 91134
ms Malay 90554
cs Czech 88613
ca Catalan 87508
ar Arabic 85325
hu Hungarian 74384
se Northern Sami 67601
sh Serbian 66746
el Greek 65905
gl Galician 59006
da Danish 57119
fa Persian 53984
ro Romanian 51437
tr Turkish 51308
is Icelandic 48639
eu Basque 44151
ko Korean 42106
vi Vietnamese 39802
ga Irish 36988
grc Ancient Greek 36977
uk Ukrainian 36851
lv Latvian 36333
he Hebrew 33435
mk Macedonian 33370
ka Georgian 32338
hy Armenian 29844
sk Slovak 29376
lt Lithuanian 28826
ast Asturian 28401
mg Malagasy 26865
et Estonian 26525
oc Occitan 26095
fil Filipino 25088
io Ido 25004
hsb Upper Sorbian 24852
hi Hindi 23538
te Telugu 22173
be Belarusian 22117
fro Old French 21249
sq Albanian 20493
mul (Multilingual, such as emoji) 19376
cy Welsh 18721
xcl Classical Armenian 18420
az Azerbaijani 17184
kk Kazakh 16979
gd Scottish Gaelic 16827
af Afrikaans 16132
fo Faroese 15973
ang Old English 15700
ku Kurdish 13804
vo Volapük 12731
ta Tamil 12690
ur Urdu 12006
sw Swahili 11150
sa Sanskrit 11081
nrf Norman French 10048
non Old Norse 8536
gv Manx 8425
nv Navajo 8232
rup Aromanian 5107

Referred here from an old version?

An unpublished paper of ours described the "ConceptNet Vector Ensemble"; it referred to a repository that now redirects here and to an attached store of data that is no longer hosted. We apologize, but we are not supporting the unpublished paper. Please use a newer version of Numberbatch and the currently supported ConceptNet build process.

Image credit

The otter logo was designed by Christy Presler for The Noun Project, and is used under a Creative Commons Attribution license.

conceptnet-numberbatch's People

Contributors

joshua-chin, pawnyy, rspeer, thatandromeda


conceptnet-numberbatch's Issues

Cannot get annex files

After everything else is successful, including a Git Annex installation, when I type:

cd code/data
git annex get

Nothing occurs. No downloads start. I tried this with cd code/source-data, and it does not work either. How can I download the data files?

Alternatively, is there a way to obtain the files without git annex?

Vector Ensembler Code

Hello

Here we see the dataset, but where can I find the Numberbatch vector ensembler code?

Thank you very much.

Accuracy issues

Hi

I tried the ConceptNet Numberbatch pre-trained embeddings on a CNN classification task and compared the results with GloVe and word2vec.

The word2vec and GloVe results are still better than the ConceptNet embeddings. I was expecting better accuracy from Numberbatch.

Any advice? Am I doing anything wrong?

KeyError: "word 'coffee_pot' not in vocabulary"

Hello,

I am trying to get the similarity between two words. I am using the multilingual (numberbatch-17.06.txt) and the smaller English-only (numberbatch-en-17.06.txt) word vectors.

What I have achieved so far:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
KeyError: "word 'coffee_pot' not in vocabulary"

No matter the word pairs, it never finds any similarities.

Interestingly, when I do the exact same thing with the smaller English-only word vector file, everything works just fine:

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-en-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
0.5312845

For testing purposes when I iterate over every line of these files, I get the following results:

  1. numberbatch-17.06.txt -> 1,917,248
  2. numberbatch-en-17.06.txt -> 417,195

This shows us that the files are just fine and contain data.

Example content of file numberbatch-en-17.06.txt:

417194 300
tea_kettle 0.0387 -0.0292 0.2034 0.0983 -0.0785 -0.0051 -0.0116 -0.1310 0.1573 0.0358 -0.1409 -0.0158 -0.0262 -0.0663 -0.0684 0.1487 0.0211 0.0157 0.0348 -0.1160 -0.0701 -0.0608 -0.0211 0.0731 0.1092 -0.0442 0.0256 0.0136 0.0202 0.0671 0.0546 -0.0398 0.0347 0.1572 0.0104 0.0684 0.0615 0.0011 0.0769 -0.0849 0.1121 -0.0146 0.0206 0.0890 0.0034 0.0998 -0.1155 -0.0272 0.1015 0.0245 -0.0029 0.0695 0.0315 0.0344 -0.1253 -0.0065 0.0318 0.0381 0.0714 0.1117 0.0643 0.0176 -0.0146 0.0323 -0.0121 0.0828 0.1397 0.0657 0.0341 -0.0022 -0.0808 -0.0102 -0.0376 -0.0665 0.0470 -0.0740 0.0475 -0.0439 -0.1397 -0.0080 -0.0162 -0.0080 -0.0090 0.0758 0.0810 0.0960 0.0251 0.0324 0.0364 -0.0174 0.0730 0.0455 0.0726 -0.0408 0.1600 -0.0330 0.0497 0.0386 0.0575 0.0502 0.0282 0.0694 0.0284 0.0106 0.0604 -0.0308 0.1479 0.0419 0.0148 -0.0838 0.0076 0.0850 -0.0081 0.0001 -0.0346 0.0440 0.0194 -0.0662 -0.0037 -0.0127 0.0501 -0.0037 -0.0433 0.0840 0.0849 -0.0227 -0.0348 -0.0678 0.0064 0.0069 -0.0961 0.0382 -0.0234 -0.0157 0.0476 0.0230 0.0274 -0.0948 -0.0189 -0.0320 0.0148 0.0048 0.0111 0.0164 -0.0060 0.0528 -0.0438 -0.0374 0.0483 -0.0509 -0.0621 -0.0944 0.0287 -0.0347 0.0426 0.0072 0.0636 -0.0269 0.0194 0.0125 0.0522 -0.0145 -0.0429 -0.0658 0.0550 -0.0563 0.0634 -0.0271 0.0067 0.0529 0.0446 0.0477 -0.0389 -0.0156 -0.0803 0.0096 -0.0045 0.0738 0.0082 0.1149 0.0426 0.0435 0.1527 0.0145 0.0287 0.0157 0.0240 -0.0163 0.0111 -0.1571 -0.0086 0.0315 0.1189 -0.0286 0.0136 -0.0009 -0.0022 -0.0620 -0.0087 -0.0087 0.0451 -0.0221 0.0440 0.0300 0.0246 -0.0211 0.0015 -0.0988 0.0207 0.0209 -0.0194 0.0085 0.0048 -0.0461 -0.0463 0.0118 0.0319 0.0644 0.0314 -0.0716 0.0013 0.0189 0.0017 -0.0892 -0.0420 -0.0389 0.0255 -0.0115 -0.0180 -0.0208 -0.0679 -0.0670 -0.0114 0.0184 0.0075 -0.0079 0.0893 0.1186 -0.0519 0.0240 0.0709 -0.0012 -0.0427 0.0180 -0.0194 0.0077 0.0242 0.0327 0.0736 -0.1041 0.0360 -0.0107 0.1080 -0.0048 0.0447 -0.0109 -0.0357 0.0029 0.0464 0.0288 0.0930 0.0280 -0.0380 -0.0303 0.0239 -0.0361 0.1058 0.0381 0.0397 0.0503 0.0488 -0.0014 -0.0189 0.0218 0.0538 0.0643 -0.0117 -0.0569 -0.0072 -0.0235 -0.0106 -0.0155 0.0249 0.0790 0.0974 -0.0126 -0.0214 -0.0303 -0.0031 -0.0403 -0.1275 0.0454 -0.0159 -0.0287 -0.0092 -0.0471 -0.0019 0.0183 -0.0509 -0.0412
coffee_pot -0.0230 0.0046 0.0981 0.1118 -0.0274 -0.0430 0.0668 -0.1377 0.1417 -0.0054 -0.1251 0.0249 -0.0319 -0.0386 -0.0870 0.1135 0.0580 0.0420 -0.0394 -0.0855 -0.1048 -0.0423 -0.0198 0.0363 0.0809 -0.0504 -0.0459 0.0026 -0.1134 -0.0098 0.0396 0.0257 0.0578 0.0409 0.1037 0.0127 0.0631 0.0111 0.0341 -0.0565 0.0457 -0.0754 0.0174 0.0017 0.0379 0.0919 0.0048 -0.0303 0.1128 -0.0517 -0.0679 0.0375 0.0068 0.0612 -0.0367 -0.0346 0.0093 0.0608 0.0587 0.0321 0.0465 -0.0551 -0.0880 -0.0569 -0.0324 0.0402 0.0586 0.0173 -0.0797 -0.0163 -0.0103 -0.0142 -0.0537 -0.0697 0.1746 -0.0507 0.0150 -0.0284 -0.1064 -0.0054 -0.0395 -0.0012 0.0224 -0.0276 -0.0227 0.0777 0.0406 0.0460 0.0104 -0.0124 -0.0179 -0.0581 0.0546 0.0230 0.1200 -0.0507 0.1206 0.0995 0.1138 0.1081 0.1309 0.1133 0.0837 0.0106 0.1533 -0.0413 0.0384 0.0320 -0.0448 0.0390 -0.0273 -0.0037 0.0100 0.1070 0.1078 -0.0111 -0.0051 -0.1064 -0.0507 -0.0184 -0.0077 -0.0425 -0.0462 0.0528 0.0964 -0.0050 0.0147 -0.0723 -0.0232 0.0427 -0.1352 0.0433 -0.0277 -0.0064 0.0547 -0.0011 0.0105 0.0018 -0.0281 -0.0369 0.0138 -0.0069 0.0185 0.0368 0.0152 0.0851 -0.0760 0.0149 0.0127 -0.0212 0.0215 -0.0758 -0.0211 -0.0327 0.0059 0.0646 0.0738 -0.0097 0.0307 -0.0074 -0.0192 0.0750 0.0092 -0.0525 0.0939 0.0345 0.0386 -0.0119 -0.0113 0.0230 0.0050 0.0099 0.0856 0.0425 -0.0634 -0.0230 0.0607 -0.0060 -0.0486 0.1053 0.0487 -0.0081 0.0836 -0.0040 0.0138 -0.1171 0.0372 0.0944 0.0219 -0.0437 0.0506 0.0204 0.1172 0.0622 -0.0056 0.0303 -0.0120 -0.0067 0.0493 -0.0059 -0.0535 -0.0646 0.0731 0.0510 -0.0589 0.0143 -0.0261 -0.1250 0.0329 -0.0203 -0.0688 -0.0065 0.0075 0.0406 -0.0259 0.0218 0.0851 0.1140 0.0471 -0.0155 -0.0035 0.0228 0.0486 -0.0672 -0.0486 -0.0427 0.0194 0.1313 -0.0559 0.1879 0.0610 0.0066 -0.0540 0.0240 0.0789 0.0820 -0.0753 0.0255 -0.0801 -0.0039 0.0454 -0.0655 0.0078 -0.0493 -0.0665 -0.0217 0.0398 0.0206 0.0275 -0.1553 0.0141 -0.0150 -0.0216 -0.0092 0.0282 0.0306 0.0238 0.0245 -0.0251 -0.0183 0.0438 0.0267 -0.0379 0.0549 0.0149 -0.0172 -0.0228 0.0316 0.0067 0.0254 0.0174 -0.0269 -0.0616 0.0822 0.0304 -0.0101 0.0323 -0.0698 0.0373 0.0479 -0.0292 0.0060 0.0129 -0.0062 -0.0005 0.0549 -0.0928 0.0237 0.0139 -0.0256 -0.0110 -0.0107 0.0545 -0.0719 -0.0023 -0.0257 -0.0343 0.0371 -0.0116 -0.1188
...etc

I assume the file numberbatch-17.06.txt has even more data inside (I cannot open the txt file directly, as it is too massive).

What might be the issue here? Why can't I get similarities between words? Am I running out of memory?

Versions

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.4
SciPy 1.1.0
gensim 3.8.0
FAST_VERSION 1
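A likely explanation, going by the format described earlier in this README: the multilingual file labels every term with its full ConceptNet URI, while the English-only file strips the /c/en/ prefix. If so, the lookup against the multilingual file should be (untested sketch):

print(model.similarity('/c/en/coffee_pot', '/c/en/tea_kettle'))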

Format different from Word2Vec's format?

It looks like GloVe and Word2Vec have slightly different formats for their files, so I think it's a bit confusing to say that the models here are in the same format?

I noticed this when trying to load these embeddings into gensim. Apparently the same problem exists with GloVe, and this repository offers a solution that also works for the ConceptNet embeddings: https://github.com/manasRK/glove-gensim

Basically, the first line needs to indicate the number of word embeddings in the file and the number of dimensions of the vectors. I think it'd be a good idea to at least mention this in the README.
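For files that lack that header line (such as GloVe's own downloads), gensim ships a converter that prepends it; a sketch, with illustrative filenames:

from gensim.scripts.glove2word2vec import glove2word2vec

# Writes a copy of the input with the "<vocab_size> <dimensions>" first
# line that load_word2vec_format expects.
glove2word2vec('vectors.glove.txt', 'vectors.word2vec.txt')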

training script for embedding

This is really great work! Now I want to add more knowledge to ConceptNet and train another embedding. I wonder if you can share the training script? Thanks in advance for any help.

.

How do I delete this?

List of removed stop words

Section 3.1 of the paper (second paragraph) says some stop words were removed during pre-processing. Is there a list of the words that were removed? Some very common stop words still appear to be present, so I wanted to be sure which ones had been knowingly removed.

@paper is not recognized while importing citation

Hello,
The citation you provided in the readme threw an error while I was importing it into Zotero.
I couldn't find anything related to a @paper entry; I believe the following is the correct (updated?) form.

@inproceedings{speer2017conceptnet,
  title = {{{ConceptNet}} 5.5: {{An Open Multilingual Graph}} of {{General Knowledge}}},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972},
  eventtitle = {{{AAAI Conference}} on {{Artificial Intelligence}}},
  date = {2017},
  pages = {4444-4451},
  author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine}
}

Converting sentence into a list of concepts?

When using the following code from Readme:

>>> from conceptnet5.nodes import standardized_concept_uri

>>> standardized_concept_uri('en', 'this is an example')
'/c/en/be_example'

What I get instead is

{
  "uri": "/c/en/this_be_example"
}

Which is of course not found in conceptnet-ensemble-201603-labels.txt

Is there a "standard" way to turn an arbitrary sentence into a list of concepts?

Quick questions. Thanks.

Thanks for the great open-source release and embeddings!

I just have a few quick questions:

  1. The GloVe embeddings (the 840B version) have a vocabulary size of 2,196,017, while your embeddings have only 1,453,347 words. Since I am under the impression that your approach combines lots of resources (word2vec/GloVe/PPDB/ConceptNet), could you please clarify why yours has a much smaller vocabulary (~66% of GloVe's)?

  2. Is this because you combine words together into phrases? I found there are lots of phrase words in your vocabulary, like "supreme_court", "washington_dc", "san_francisco" or "natural_gas", etc., while GloVe does not have these phrase words.

  3. BTW, is there any possibility of releasing your embeddings as plain text files (zipped, just like GloVe's format) instead of numpy matrices?

Thanks again!

Sorting by occurrence count

Hi guys! Do you think you could provide the conceptnet-numberbatch embeddings sorted by some kind of word frequency, similar to what GloVe and fastText provide? In my research I limit the vocabulary to the most frequent K words so as not to eat all the GPU memory with embedding lookups when using pretrained embeddings in my models, and the sort order used by the other embeddings makes this much easier.

meaning of number of # characters in subwords?

For continuation words, there are varying numbers of # signs. For example, the first 5 words include the following:

  • /c/de/####er
  • /c/de/###er
  • /c/de/##er

For example, if I have a word ending with "er", which one should I use?

Thanks..

Embedding for other dimensions: 50, 100 and 200

It's really nice to see concept-enriched embeddings!

It would be nicer to have embeddings in other dimensions, e.g., 50, 100 and 200 because there are many models that use these smaller dimensions to prevent overfitting.

error while getting datasets with "git annex get"

I have done all the setup and tried to get the dataset with the git annex command, but it says:

get code/source-data/conceptnet5.5.csv
Remote origin not usable by git-annex; setting annex-ignore
(not available)
No other repository is known to contain the file.
failed
get code/source-data/conceptnet5.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.300d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.600d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.labels (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove.42B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove12.840B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/ppdb-xl-lexical.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/w2v-google-news.bin.gz (not available)
No other repository is known to contain the file.
failed
git-annex: get: 9 failed

Can I use embeddings in a closed-source game (through a REST server)?

I'm a little confused about how the license of your project applies to my case.

I would like to use the embeddings for my game. The embedding data would be used by a simple REST server, which would provide one method: calculating the similarity between two words.

That method would be used by my game. So in my game I won't use the embeddings directly, nor will I distribute them. My game will just use the calculated similarity from the REST server.

So, my question is: how does the Share-Alike license apply to my case? Can my game be closed source? What about my server? Do I need to release it under the CC-By-SA 4.0 license?

Large number of zero vectors

Hello,

This could be an issue with my processing, but it appears that 617,129 out of the 665,494 English vectors are zero vectors: they are defined in the label, but contain all zeros (i.e., there are only 48,365 non-zero vectors for English). I discovered this with the 300-dimensional dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you can confirm this is not an issue on your side using the dataset available for download, I can work on fixing it on my side.

For reference, this is the code I used to count empty vectors:

import numpy as np

empty = np.zeros(300)
count = 0
for each in englishVectors:
    if np.array_equal(each, empty):
        count += 1
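A vectorized equivalent, assuming the vectors are stacked into a single numpy matrix (illustrative, not the reporter's code):

zero_rows = int((matrix == 0).all(axis=1).sum())  # count of all-zero rows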

I discovered this while trying to figure out the words closest to semi-common words.

For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following:

['adddresse', 'rudat', 'barhydt', 'weeked', 'inovonics', 'alleppey', 'katten', 'georgievski', 'kopinski', 'waxwing', 'irin_plusnews']

Error when running ninja: shape mismatch in assignment

Hi,

When I launch ninja, I get this error:

[21/127] python3 -m conceptnet_retrofitting.builders.self_loops build...uild-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
FAILED: build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
python3 -m conceptnet_retrofitting.builders.self_loops build-data/glove.840B.300d.ppdb-xl-lexical-standardized.npz build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
Traceback (most recent call last):
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 22, in <module>
    main(*sys.argv[1:])
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 16, in main
    assoc = self_loops(assoc)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 8, in self_loops
    assoc[diagonal] = assoc.sum(axis=1).T[0]
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/site-packages/scipy/sparse/_index.py", line 124, in __setitem__
    raise ValueError("shape mismatch in assignment")
ValueError: shape mismatch in assignment

I used the original datasets, and when I debug, I cannot see a mismatch in shape...
Has anybody else had this issue?

Spelling Error in README

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning.

SHOULD BE:

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.

Happens to the best of us!

Common word subset

Is it possible to download a subset of Numberbatch sorted by common words? In my application, it is computationally infeasible to load all of the words into memory.

However, a 20% subset of the most common words would solve my problems and fit into memory as well.

Please let me know if this is possible!
Nick

Lemmatization for SNLI

Hi,

I would like to use your embeddings on the SNLI dataset. However, due to lemmatization, almost half of the words have no embeddings. Therefore I'd like to lemmatize the SNLI dataset.

I am wondering which lemmatization algorithm would be best to get a dataset similar to ConceptNet Numberbatch.

Do all versions occupy the same vector space?

Do all versions occupy the same vector space, so that the same words in different versions have similar coordinates?

This would be extremely useful when upgrading to new versions, as one wouldn't need to vectorize the entire corpus again. It becomes even more important in cases where the original source text isn't available anymore.

Thanks

downloads and term vectors not available

Your links are not working. I'm trying to download any of the following:
conceptnet-numberbatch-201609_uris_main.txt.gz (1928481 × 300) contains terms from many languages, specified by their complete ConceptNet URI (the strings starting with /c/en/ in the example above).

conceptnet-numberbatch-201609_en_main.txt.gz (426572 × 300) contains only English terms, and strips the /c/en/ prefix to provide just the term text. This form is the most compatible with other systems, as long as you only want English.

conceptnet-numberbatch-201609_en_extra.txt.gz (233488 × 300) contains additional single words of English whose vectors could be inferred as the average of their neighbors in ConceptNet.

conceptnet-numberbatch-201609.h5 contains the data in its native HDF5 format, which can be loaded with the Python library pandas.

http://conceptnet5.media.mit.edu/downloads/ is also not working.

Using the pretrained term vectors

This is my first time using the pretrained term vectors, and I noticed they come as a text file. The word2vec GoogleNews pretrained vectors can be loaded as a numpy array, which can optionally be read from disk with mmap_mode: given a term, you look up a dictionary or hashtable to get an index for the term, then extract the term vector from the numpy array using that index. I've used this successfully.

Can Numberbatch be used in a similar way, and if so, how?
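One way to get the same setup (a hand-rolled sketch with illustrative filenames, not an official tool): convert the text file once into a numpy matrix plus a term-to-row index, then open the matrix with mmap_mode on later runs.

import gzip
import numpy as np

def convert(txt_gz_path, npy_path):
    # One-time conversion: word2vec text format -> numpy matrix + term list.
    with gzip.open(txt_gz_path, 'rt', encoding='utf-8') as f:
        n_rows, n_dims = map(int, f.readline().split())
        matrix = np.zeros((n_rows, n_dims), dtype=np.float32)
        terms = []
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            terms.append(parts[0])
            matrix[i] = np.asarray(parts[1:], dtype=np.float32)
    np.save(npy_path, matrix)
    return terms

terms = convert('numberbatch-en-19.08.txt.gz', 'numberbatch-en.npy')
index = {term: i for i, term in enumerate(terms)}

# Later runs: memory-map the matrix and index it by term.
matrix = np.load('numberbatch-en.npy', mmap_mode='r')
vector = matrix[index['coffee']]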

Wrong link in readme

in README.md

numberbatch-17.04.txt.gz contains this data in 77 languages.
numberbatch-en-17.04.txt.gz contains just the English subset of the data, with the /c/en/ prefix removed.

Both link to 17.02 instead of 17.04

number batch code

Hi,

I've run the code over the whole weekend, but it still has not completed. You had mentioned it would take about a day to run... do you think something went wrong?

Thanks,

Megh
