
Comments (16)

tukeyclothespin commented on August 10, 2024

Hi @rspeer thanks for responding.

I was attempting to build your original ConceptNet-Numberbatch from your paper. Your main readme (https://github.com/commonsense/conceptnet-numberbatch/blob/master/README.md) states to use branch 16.04 to recreate your 2016 paper. I encountered the git annex issue following the readme in branch 16.04 (https://github.com/commonsense/conceptnet-numberbatch/blob/16.04/README.md).

I wasn't able to resolve the git annex issue so I tried to download Glove, Word2Vec and PPDB into a folder myself and while 'python ninja.py' doesn't complain, ninja segfaults immediately. Do I understand you correctly that the conceptnet-raw-data-5.6.zip and conceptnet-assertions-5.5.5.csv.gz files from your links above contain the data files that the git annex step previously pulled down?

I am trying to recreate your 2016 paper as I have two word embedding models (Word2Vec and FastText) that I have trained on a specialized vocabulary and ConceptNet-Numberbatch looks very appealing as a way to fuse the generalized vocabulary from Glove and Google News Word2Vec with my models.

from conceptnet-numberbatch.

tukeyclothespin commented on August 10, 2024

Yes, I went with the 2016 paper and branch 16.04 because the instructions were more approachable, I don't need the web api features from Conceptnet, and I can run docker but not docker compose on my infrastructure. I am going to try the raw build instructions at https://github.com/commonsense/conceptnet5/wiki/Build-process.

Beyond that, I am open to building and running any version of ConceptNet-Numberbatch. My goal is to add my pretrained word embedding models of specialized vocabulary into the ConceptNet-Numberbatch build and evaluate the word embedding results via our own use case metric that we already have defined. I just need access to the terms and vectors output by ConceptNet-Numberbatch to see how they score on our metric. That's why the simplicity of the branch 16.04 instructions was appealing.

Can you explain how I build Conceptnet Numberbatch using the conceptnet5.vectors package per the readme? I have the conceptnet5 package installed and can import conceptnet5.vectors at the python3 interpreter but help(conceptnet5.vectors) shows functions related to vector comparisons.

"Since 2016, the code for building ConceptNet Numberbatch is part of the ConceptNet code base, in the conceptnet5.vectors package."

rspeer commented on August 10, 2024

I'll work on updating the documentation. I've put the data files you need up on Zenodo: https://zenodo.org/record/1208722

Download these files and put them in your source-data directory, and you should be able to run the 16.04 build.

Note that this is for reproducing the paper, and the distinct feature of this paper compared to the others is that we tried various combinations of data sources and parameters. The build process, as described, will build all of them.

rspeer commented on August 10, 2024

Sorry! I was cleaning up old branches, and thought git-annex was a feature branch from when I introduced git-annex, but actually it's the branch where git-annex keeps all its information about where to find files.

I've restored the branch and it should work now. But man, I am not using this inscrutable, haphazardly-documented utility again.

mmallad commented on August 10, 2024

Thank you for the reply. I downloaded all the datasets myself, fixed some format issues, and now it's all running okay. I would like to ask some questions about it; can you please provide your email address?
Thank you.

jatin270 commented on August 10, 2024

Hey, can you tell me how you downloaded the datasets?

rspeer commented on August 10, 2024

@jatin270 Can you be more specific about what you're looking for?

Currently, all parts of ConceptNet, including Numberbatch, are built using the code in https://github.com/commonsense/conceptnet5. Its build script will download the input data from Zenodo: https://zenodo.org/record/998169

jatin270 commented on August 10, 2024

@rspeer Can you tell me the format of the conceptnet5.csv file? In what way is the data stored?

rspeer commented on August 10, 2024

I'm really going to need you to be more specific about what you're asking, but is this what you're looking for? https://github.com/commonsense/conceptnet5/wiki/Downloads
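
For readers with the same question about the file format: to my understanding, the assertions file linked from that page is tab-separated with five fields per line (edge URI, relation, start node, end node, and a JSON blob of extra information). Treat that layout as an assumption to check against the Downloads page. Under that assumption, a minimal sketch of reading one line could look like this; the example line is made up for illustration.

```python
import json


def parse_assertion(line):
    """Split one tab-separated assertion line (assumed 5-column layout) into a dict."""
    uri, rel, start, end, info = line.rstrip("\n").split("\t", 4)
    return {
        "uri": uri,
        "rel": rel,
        "start": start,
        "end": end,
        "info": json.loads(info),  # the last column is assumed to be JSON
    }


# Hypothetical example line, not copied from the real data.
example = "\t".join([
    "/a/[/r/IsA/,/c/en/cat/,/c/en/animal/]",
    "/r/IsA",
    "/c/en/cat",
    "/c/en/animal",
    json.dumps({"weight": 1.0}),
])

edge = parse_assertion(example)
print(edge["rel"], edge["start"], edge["end"], edge["info"]["weight"])
```

To walk the real file, the same function can be applied to each line of `gzip.open(path, "rt", encoding="utf-8")`.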

jatin270 commented on August 10, 2024

@rspeer
get source-data/w2v-google-news.bin.gz (from web...)

Unable to access these remotes: web

Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
2feaff51-0a4a-4afc-8b56-b9a553161e49 -- rspeer@buffy:/conceptnet-retrofitting-paper
54f3fff2-290c-42f1-908f-a6b2c9785668 -- media-lab-rsync
7ffdd42d-dce8-4cb8-904e-d09097500dfa -- rspeer@ip-10-23-1-47:~/conceptnet-retrofitting-paper
91510204-049b-4033-a6cf-0fe419754978 -- mungojerrie 2.7TB HD
dd2f35a5-4cde-4d3b-a6a1-69167174aea0 -- rspeer@buffy:/home/rspeer/conceptnet-retrofitting-paper

(Note that these git remotes have annex-ignore set: origin)
failed

I am getting this error.

How can I fix it?

tukeyclothespin commented on August 10, 2024

I have the same 'Unable to access these remotes: web' response to 'git annex get' as @jatin270 for all of the datasets.

'git annex whereis' shows references to http://conceptnet-api-1.media.mit.edu, which does not resolve to an IP.

```
whereis source-data/w2v-google-news.bin.gz (5 copies)
00000000-0000-0000-0000-000000000001 -- web
2feaff51-0a4a-4afc-8b56-b9a553161e49 -- rspeer@buffy:/conceptnet-retrofitting-paper
54f3fff2-290c-42f1-908f-a6b2c9785668 -- media-lab-rsync
7ffdd42d-dce8-4cb8-904e-d09097500dfa -- rspeer@ip-10-23-1-47:~/conceptnet-retrofitting-paper
91510204-049b-4033-a6cf-0fe419754978 -- mungojerrie 2.7TB HD

The following untrusted locations may also have copies:
dd2f35a5-4cde-4d3b-a6a1-69167174aea0 -- rspeer@buffy:/home/rspeer/conceptnet-retrofitting-paper

web: http://conceptnet-api-1.media.mit.edu/downloads/annex/vector-ensemble/033/71c/SHA256E-s1647046227--21c05ae916a67a4da59b1d006903355cced7de7da1e42bff9f0504198c748da8.bin.gz/SHA256E-s1647046227--21c05ae916a67a4da59b1d006903355cced7de7da1e42bff9f0504198c748da8.bin.gz
ok
```
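
In case it helps anyone else diagnose this, here is a rough sketch of how the check could be scripted: pull the `web:` URLs out of saved `git annex whereis` output and test whether each hostname still resolves. The function names are mine, and the sample text is a shortened stand-in for the real output (the path in it is hypothetical).

```python
import socket
from urllib.parse import urlparse


def web_urls(whereis_output):
    """Return the URLs found on 'web:' lines of `git annex whereis` output."""
    urls = []
    for line in whereis_output.splitlines():
        line = line.strip()
        if line.startswith("web: "):
            urls.append(line[len("web: "):].strip())
    return urls


def host_resolves(url):
    """Return True if the URL's hostname can currently be resolved."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


# Shortened sample; the real 'web:' line carries a much longer annex key.
sample = """whereis source-data/w2v-google-news.bin.gz (5 copies)
00000000-0000-0000-0000-000000000001 -- web
web: http://conceptnet-api-1.media.mit.edu/downloads/annex/example.bin.gz
ok
"""

urls = web_urls(sample)
hosts = [urlparse(u).hostname for u in urls]
print(hosts)
```

Calling `host_resolves(u)` for each entry in `urls` then tells you whether the web remote is reachable at the DNS level; it says nothing about whether the file is still served there.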

It looks like some of them are in https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/ :

https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/glove12.840B.300d.txt.gz
https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/GoogleNews-vectors-negative300.bin.gz

You can peruse the list at https://s3.amazonaws.com/conceptnet/
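
A minimal sketch of fetching those two files into a local folder, using only the standard library. The `source-data` default and the helper names are my own choices, and the build may expect different local file names (the annex output above refers to `source-data/w2v-google-news.bin.gz`, for instance), so any renaming is something to verify against the build scripts.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

URLS = [
    "https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/glove12.840B.300d.txt.gz",
    "https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/GoogleNews-vectors-negative300.bin.gz",
]


def local_name(url):
    """Use the last path component of the URL as the local file name."""
    return Path(urlparse(url).path).name


def fetch(url, dest_dir="source-data"):
    """Stream one URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / local_name(url)
    if not target.exists():
        # Download to a temporary name so an interrupted transfer
        # is not mistaken for a complete file on the next run.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    return target


# Show which local file names the downloads would produce.
print([local_name(u) for u in URLS])
```

Running `fetch(u)` for each `u` in `URLS` performs the actual downloads; the files are large, so expect it to take a while.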

rspeer commented on August 10, 2024

@tukeyclothespin, could you tell me where you encountered the directions to use git annex? Those directions are very old and I need to get rid of things that point to them.

The built data in .csv format is available from https://github.com/commonsense/conceptnet5/wiki/Downloads, and the raw input data is hosted at https://zenodo.org/record/1165009.

Unlike git-annex, Zenodo is very well suited for long-term data hosting.

rspeer commented on August 10, 2024

Ah, no, the .csv.gz file was in response to what @jatin270 was looking for.

I can look for the files that are inputs to the 2016 paper, but probably nothing I do will save git-annex from the bit rot that's seemingly designed into it.

I'm supposing you went with the 2016 paper because the instructions for reproducing the AAAI 2017 paper using Docker were too daunting? I'm working on the process, on making it possible to build ConceptNet Numberbatch without all the sysadminnery, but it's going to be a newer (better) version, not an exact replication.

rspeer commented on August 10, 2024

The modern build process, using the conceptnet5 package, is described at: https://github.com/commonsense/conceptnet5/wiki/Build-process

tukeyclothespin commented on August 10, 2024

Thanks for your patience, I really appreciate it. I downloaded the data files from your link, put them into the source-data directory of my 16.04 version, ran python ninja.py and then sudo ninja but get a segfault nearly immediately. I am going to take a break from looking at that because I was separately able to work through the build process using the current conceptnet5 package.

I noticed your comment in the Google group to install the dependencies and run
snakemake data/vectors/numberbatch.h5
to build the Numberbatch vectors. That worked and I was able to load the Numberbatch vectors in gensim!

Now I want to add my own pretrained word embeddings to Numberbatch, so I put a gzipped file of my model's terms and vectors, in word2vec text format, in data/raw/vectors as you suggested in the Google group thread. During snakemake I saw:

rule convert_word2vec:
    input: data/raw/vectors/GoogleNews-vectors-negative300.bin.gz
    output: data/vectors/w2v-google-news.h5
    jobid: 14
    resources: ram=24

But I never saw snakemake run convert_word2vec on the word2vec gz file I had in data/raw/vectors. The initial job list also stated that there was only one job for convert_word2vec. Do I need to add my new file name in a conceptnet or snakemake configuration script? I noticed the four other input embeddings are mentioned in the Snakefile:
INPUT_EMBEDDINGS = [ 'crawl-300d-2M', 'w2v-google-news', 'glove12-840B', 'fasttext-opensubtitles' ]
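
For reference, this is roughly how a gzipped word2vec text file like the one I dropped into data/raw/vectors can be produced: a header line with the term count and dimensionality, then one term per line followed by its values. It is a sketch with made-up toy vectors and a hypothetical file name, not the code I actually ran.

```python
import gzip
import os
import tempfile


def write_word2vec_text(vectors, path):
    """Write {term: [floats]} as gzipped word2vec text format."""
    terms = list(vectors)
    dims = len(vectors[terms[0]]) if terms else 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        # Header: number of terms, then number of dimensions.
        out.write(f"{len(terms)} {dims}\n")
        for term in terms:
            row = vectors[term]
            if len(row) != dims:
                raise ValueError(f"{term!r} has {len(row)} dimensions, expected {dims}")
            out.write(term + " " + " ".join(f"{x:.6f}" for x in row) + "\n")


# Toy vectors, made up for illustration.
toy = {"widget": [0.1, 0.2, 0.3], "gadget": [0.0, -0.5, 1.0]}
path = os.path.join(tempfile.mkdtemp(), "my-model.txt.gz")  # hypothetical name
write_word2vec_text(toy, path)

with gzip.open(path, "rt", encoding="utf-8") as f:
    header = f.readline().strip()
    first_row = f.readline().strip()
print(header)
print(first_row)
```

If the model is already loaded in gensim, `KeyedVectors.save_word2vec_format` should produce the same kind of text file without hand-rolling the writer.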

tukeyclothespin commented on August 10, 2024

I think I have it:

  1. Put my pretrained term/vector output file in data/raw/vectors/
  2. Make a new rule in the Snakefile with my file in data/raw/vectors/ as the input and an h5 file name in data/vectors as the output, using either convert_word2vec or convert_fasttext as the template depending on whether the file is binary or text.
  3. Add the name of the h5 file output by the new rule to the INPUT_EMBEDDINGS list in the Snakefile
  4. snakemake clean
  5. snakemake data/vectors/numberbatch.h5
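
A small sanity check that can be run after the build to confirm that every embedding named in INPUT_EMBEDDINGS actually produced an h5 file. It assumes each name maps to `data/vectors/<name>.h5`, which matches the w2v-google-news rule output I saw but which I have not confirmed for the other entries; `my-model` is a hypothetical name standing in for the new embedding.

```python
from pathlib import Path

# The four names from the Snakefile plus a hypothetical new entry.
INPUT_EMBEDDINGS = [
    "crawl-300d-2M",
    "w2v-google-news",
    "glove12-840B",
    "fasttext-opensubtitles",
    "my-model",  # hypothetical: the embedding added in step 3
]


def expected_h5(name, data_dir="data"):
    """Assumed location of the converted embedding for a given name."""
    return Path(data_dir) / "vectors" / f"{name}.h5"


def missing_embeddings(names, data_dir="data"):
    """Return the names whose converted h5 file does not exist yet."""
    return [name for name in names if not expected_h5(name, data_dir).exists()]


print(missing_embeddings(INPUT_EMBEDDINGS))
```

An empty list means every listed embedding was converted; any name that shows up points at a rule that never ran, which is the symptom I hit before adding the new rule.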
