
glove-python's Introduction

glove-python


A toy python implementation of GloVe.

GloVe produces dense vector embeddings of words, where words that occur together are close in the resulting vector space.

While this produces embeddings which are similar to word2vec (which has a great python implementation in gensim), the method is different: GloVe produces embeddings by factorizing the logarithm of the corpus word co-occurrence matrix.
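For reference, the weighted least-squares objective from the original GloVe paper can be written as below. The notation is taken from the paper rather than from this README: X is the co-occurrence matrix, w and \tilde{w} are word and context vectors, b and \tilde{b} are biases, and f is a weighting function with cutoff x_max and exponent alpha. This implementation may differ in details.

```latex
J = \sum_{i,j \,:\, X_{ij} > 0} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```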

The code uses asynchronous stochastic gradient descent, and is implemented in Cython. Most likely, it contains a tremendous number of bugs.

Installation

Install from pypi using pip: pip install glove_python.

Note for OSX users: due to its use of OpenMP, glove-python does not compile under Clang. To install it, you will need a reasonably recent version of gcc (from Homebrew for instance). This should be picked up by setup.py; if it is not, please open an issue.

Building with the default Python distribution included in OSX is also not supported; please try the version from Homebrew or Anaconda.

Usage

Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API).
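To make the first step concrete, here is a minimal pure-Python sketch of building a co-occurrence table from tokenized sentences. It is not the library's Cython code: the function name build_cooccurrence, the symmetric window, and the 1/distance weighting are all assumptions made for illustration.

```python
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Return (dictionary, counts) for an iterable of token lists.

    Hypothetical helper, not part of glove-python. Pairs are stored once,
    keyed as (smaller_id, larger_id), and weighted by 1/distance (an assumption).
    """
    dictionary = {}              # word -> integer id, assigned in order of first appearance
    counts = defaultdict(float)  # (id_i, id_j) -> weighted co-occurrence count
    for tokens in sentences:
        ids = [dictionary.setdefault(tok, len(dictionary)) for tok in tokens]
        for pos, left in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                right = ids[pos + offset]
                if left == right:
                    continue  # skip a word co-occurring with itself
                key = (min(left, right), max(left, right))
                counts[key] += 1.0 / offset
    return dictionary, dict(counts)


sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
dictionary, counts = build_cooccurrence(sentences, window=2)
```

The resulting dictionary and counts play the role of the word-to-id mapping and the sparse co-occurrence matrix that the second step would then factorize.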

There is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space in such a way that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model.

Examples

example.py has some example code for running simple training scripts: ipython -i -- examples/example.py -c my_corpus.txt -t 10 should process your corpus, run 10 training epochs of GloVe, and drop you into an ipython shell where glove.most_similar('physics') should produce a list of similar words.

If you want to process a Wikipedia corpus, you can pass a Wikipedia dump file into the example.py script using the -w flag. Running make all-wiki should download a small Wikipedia dump file, process it, and train the embeddings. Building the co-occurrence matrix will take some time; training the vectors can be sped up by increasing the training parallelism to match the number of physical CPU cores available.

Running this on my machine yields roughly the following results:

In [1]: glove.most_similar('physics')
Out[1]:
[('biology', 0.89425889335342257),
 ('chemistry', 0.88913708236100086),
 ('quantum', 0.88859617025616333),
 ('mechanics', 0.88821824562025431)]

In [4]: glove.most_similar('north')
Out[4]:
[('west', 0.99047203572917908),
 ('south', 0.98655786905501008),
 ('east', 0.97914140138065575),
 ('coast', 0.97680427897282185)]

In [6]: glove.most_similar('queen')
Out[6]:
[('anne', 0.88284931171714842),
 ('mary', 0.87615260138308615),
 ('elizabeth', 0.87362497374226267),
 ('prince', 0.87011034923161801)]

In [19]: glove.most_similar('car')
Out[19]:
[('race', 0.89549347066796814),
 ('driver', 0.89350343749207217),
 ('cars', 0.83601334715106568),
 ('racing', 0.83157724991920212)]

Development

Pull requests are welcome.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize in order to produce the extension .c and .cpp files before running pip install -e ..

glove-python's People

Contributors

joshloyal, maciejkula, nsaphra, ogrisel


glove-python's Issues

Problem running the example script

Hi there, I would like to try your example.py, but I have no idea which corpus to use. I have just started learning Python and machine learning and I am really confused. Your usage guide gives this example: ipython -i -- examples/example.py -c my_corpus.txt -t 10

I tried using the link that you have provided (http://www-nlp.stanford.edu/projects/glove) under "Download pre-trained word vectors" and chose Wikipedia 2014 + Gigaword 5 (glove.6B.zip). This glove.6B.zip file contains 4 files (glove.6B.50d, glove.6B.100d, glove.6B.200d and glove.6B.300d).

In the Python command I tried running it using -i -- examples/example.py -c my_corpus.txt -t 10, having renamed one of the files (glove.6B.50d to my_corpus.txt).

I get an error message that says: No module named corpus_cython. Did I do any of the steps wrongly?

I was wondering if you could provide a link to the "my_corpus.txt" that produces this result:

In [1]: glove.most_similar('physics')
Out[1]:
[('biology', 0.89425889335342257),
('chemistry', 0.88913708236100086),
('quantum', 0.88859617025616333),
('mechanics', 0.88821824562025431)]

Thank you.

Pickle error when dealing with large corpora

While this is not an issue with glove-python, it's worth noting that pickling large corpora/models causes the following error:

SystemError: error return without exception set 

According to this numpy issue, it is a bug in pickle that has been fixed in Python 3.3.
I think it would be worth pointing that out on the README for future reference.

changes in corpus_cython have no effect

Hi there, I wanted to modify the corpus_cython.pyx script to take into account the left context of each word. The problem is that after I run python setup.py cythonize and pip install -e . those changes have no effect. What might be the cause?

I'm running it on Windows, using Anaconda.
Cheers.

import glove error: Symbol not found: _GOMP_parallel

After successfully installing glove_python with pip install on Mac Sierra 10.12.6, I get the following import error when trying to import the package:
from glove import Glove
from glove import Corpus

ImportError: dlopen(/Users/thomas/anaconda/lib/python3.6/site-packages/glove/glove_cython.cpython-36m-darwin.so, 2): Symbol not found: _GOMP_parallel
Referenced from: /Users/thomas/anaconda/lib/python3.6/site-packages/glove/glove_cython.cpython-36m-darwin.so
Expected in: flat namespace
in /Users/thomas/anaconda/lib/python3.6/site-packages/glove/glove_cython.cpython-36m-darwin.so

Does anyone know how to resolve this issue?

get phrase vector

What if I want to return the word/paragraph vector instead of the word similarities?

fails to install with pip on osx

On a mac (osx 10.12) both clang and gcc seem to fail. output below ...

Building wheels for collected packages: glove-python
Running setup.py bdist_wheel for glove-python ... error
Complete output from command /Users/barry/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/31/yd2dv9h54m7_9fp95r8llkk80000gn/T/pip-build-i_roohw3/glove-python/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /var/folders/31/yd2dv9h54m7_9fp95r8llkk80000gn/T/tmpj98_zqpmpip-wheel- --python-tag cp35:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.9-x86_64-3.5
creating build/lib.macosx-10.9-x86_64-3.5/glove
copying glove/__init__.py -> build/lib.macosx-10.9-x86_64-3.5/glove
copying glove/corpus.py -> build/lib.macosx-10.9-x86_64-3.5/glove
copying glove/glove.py -> build/lib.macosx-10.9-x86_64-3.5/glove
running build_ext
building 'glove.glove_cython' extension
creating build/temp.macosx-10.9-x86_64-3.5
creating build/temp.macosx-10.9-x86_64-3.5/glove
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/barry/miniconda3/include -arch x86_64 -I/Users/barry/miniconda3/include/python3.5m -c glove/glove_cython.c -o build/temp.macosx-10.9-x86_64-3.5/glove/glove_cython.o -fopenmp -ffast-math -march=native
clang: error: unsupported option '-fopenmp'
error: command 'gcc' failed with exit status 1

single precision

Current implementation uses double (float64) everywhere.

This is probably overkill: single precision (float32) may be enough, and would cut memory use considerably, both for the C++ and scipy.sparse matrices.

Is there a specific reason behind using double? Are there numerical problems with single?

(word2vec uses single precision everywhere, for example)

unable to install it on OSX

I am consistently getting the following error:

gcc-4.9: error: unrecognized command line option '-Wshorten-64-to-32'

Any help ?

__contains__ and __getitem__ support

Hey there!

Have you thought about implementing __contains__ for "man" in model or __getitem__ for model["man"] support? That would be really neat. I tried monkey patching them, but it did not work (not sure why to be honest). The functions could be something like this (I think):

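def __getitem__(self, word):
    # Map the word to its row index, then return that row of the embedding matrix.
    try:
        word_idx = self.dictionary[word]
    except KeyError:
        raise KeyError('Word not in dictionary: %r' % (word,))

    return self.word_vectors[word_idx]

def __contains__(self, word):
    # True if the word has an entry in the model's dictionary.
    return word in self.dictionary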

I would have done a pull request but I am too uncertain about cython and stuff.
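One common reason monkey patching special methods appears to do nothing is that Python looks them up on the type, not on the instance. The sketch below shows this with a hypothetical stand-in class Model that has only the dictionary and word_vectors attributes used in the snippet above. It is not the real Glove class, and whether this was the cause in this case is an assumption.

```python
class Model:
    """Hypothetical stand-in exposing only `dictionary` and `word_vectors`."""

    def __init__(self):
        self.dictionary = {"man": 0, "woman": 1}
        self.word_vectors = [[0.1, 0.2], [0.3, 0.4]]


def getitem(self, word):
    return self.word_vectors[self.dictionary[word]]


def contains(self, word):
    return word in self.dictionary


model = Model()

# Patching the instance has no effect on model["man"]:
# implicit special-method lookup bypasses the instance dictionary.
model.__getitem__ = lambda word: getitem(model, word)
try:
    model["man"]
    instance_patch_works = True
except TypeError:
    instance_patch_works = False

# Patching the class is what makes subscripting and `in` work.
Model.__getitem__ = getitem
Model.__contains__ = contains
vector = model["man"]
has_man = "man" in model
has_dog = "dog" in model
```

Under that assumption, assigning the two functions to the class (rather than to a model instance) would be enough, with no changes to the Cython code.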

Finetune model

I have trained a model using glove-python, and I want to finetune it using other data. Is it possible?
If I load a trained glove model and train it with fit(), it starts to train from scratch.

STS benchmark reproducibility?

Hello,

I was wondering if someone managed to reproduce the results of the sentence similarity scores on the STS benchmark dataset (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark). I tried to do that using your function transform_paragraph together with tokenizing the sentences with StanfordTokenizer from the NLTK library, but I managed to get to the Pearson coef. of only a bit over 0.3 on the testing set (the STS shows around 0.4).

I know the transform_paragraph function is only experimental, but I was wondering whether you implemented it completely yourself or you used an official GloVe sentence embedding (I myself do not know how exactly they weighted the individual words to get the sentence vector).

Thanks :)

What is the meaning of the hyperparameters?

I've found no documentation about usage of this package, so I don't understand how to correctly tune this model. The example just mentions the code below:

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)

But it doesn't explain anything. I don't understand what no_components and learning_rate mean. And what effect does the number of epochs have on the result? Thank you.

Error in glove.py transform_paragraph in python 3

In Python 3, dict.keys() returns a dict_keys object and dict.values() returns a dict_values object instead of a list.

So in glove.py line 165 should be changed from:
word_ids = np.array(cooccurrence.keys(), dtype=np.int32)
to
word_ids = np.array(list(cooccurrence), dtype=np.int32)

and line 166 should be changed from:
values = np.array(cooccurrence.values(), dtype=np.float64)
to
values = np.array(list(cooccurrence.values()), dtype=np.float64)

Btw, I'm using python 3.5 because pickle gives problems when working with big models in python 2.7.

Error loading nlp.stanford.edu vectors

I'm getting the following error when trying to load http://nlp.stanford.edu/data/glove.840B.300d.zip

In [1]: import glove

In [2]: %time glv = glove.Glove.load_stanford("glove.840B.300d.txt")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-5e84d129b242> in <module>()
----> 1 get_ipython().magic(u'time glv = glove.Glove.load_stanford("glove.840B.300d.txt")')

virtualEnv/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2161         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2162         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2163         return self.run_line_magic(magic_name, magic_arg_s)
   2164 
   2165     #-------------------------------------------------------------------------

virtualEnv/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2082                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2083             with self.builtin_trap:
-> 2084                 result = fn(*args,**kwargs)
   2085             return result
   2086 

<decorator-gen-60> in time(self, line, cell, local_ns)

virtualEnv/Executionr/local/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

virtualEnv/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1175         else:
   1176             st = clock2()
-> 1177             exec(code, glob, local_ns)
   1178             end = clock2()
   1179             out = None

<timed exec> in <module>()

virtualEnv/local/lib/python2.7/site-packages/glove/glove.pyc in load_stanford(cls, filename)
    265         instance.word_vectors = (np.array(vectors)
    266                                  .reshape(no_vectors,
--> 267                                           no_components))
    268         instance.word_biases = np.zeros(no_vectors)
    269         instance.add_dictionary(dct)

ValueError: total size of new array must be unchanged

Any suggestions on how to load the vectors?
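A reshape failure like this is consistent with some lines yielding more fields than expected, for example if a token itself contains spaces and each line is split on whitespace. That cause is an assumption here and has not been checked against load_stanford. Under that assumption, a workaround is to parse the file yourself and split each line from the right, as in this sketch (parse_vector_line and the sample lines are hypothetical):

```python
def parse_vector_line(line, no_components):
    """Split one 'token v1 v2 ... vN' line into (token, vector).

    Splitting from the right keeps a token that contains spaces in one piece.
    Hypothetical helper, not part of glove-python.
    """
    parts = line.rstrip("\n").rsplit(" ", no_components)
    if len(parts) != no_components + 1:
        raise ValueError("expected a token followed by %d values" % no_components)
    token = parts[0]
    vector = [float(value) for value in parts[1:]]
    return token, vector


# Made-up sample lines with 3 components; the second token contains spaces.
lines = ["cat 0.1 0.2 0.3\n", ". . . 0.4 0.5 0.6\n"]
parsed = [parse_vector_line(line, 3) for line in lines]

# A plain whitespace split gives a different field count for the second line.
naive_lengths = [len(line.split()) for line in lines]
```

For the 300-dimensional file, no_components would be 300.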

python 3 incompatibility in glove.py

In lines 166 and 167
word_ids = np.array(cooccurrence.keys(), dtype=np.int32)
values = np.array(cooccurrence.values(), dtype=np.float64)

gives

TypeError: int() argument must be a string, a bytes-like object or a number, not 'dict_keys'

Changing them to the following solved the issue:
word_ids = np.array(list(cooccurrence.keys()), dtype=np.int32)
values = np.array(list(cooccurrence.values()), dtype=np.float64)

I can't install it on MAC

I have a Mac and I am running Python 3.5.2 and gcc version 7.1.0.
When I ran the pip command I got the following error:

    /usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/hsethi/anaconda/include -arch x86_64 -I/Users/xxxxx/anaconda/include/python3.5m -c glove/glove_cython.c -o build/temp.macosx-10.6-x86_64-3.5/glove/glove_cython.o -fopenmp -ffast-math
    clang: error: unsupported option '-fopenmp'
    error: command '/usr/bin/clang' failed with exit status 1
    
    ----------------------------------------
Command "/Users/xxxx/anaconda/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/2y/442qhylj3bs2mln_h3lh4zvrbrhh_y/T/pip-build-1wyf65em/glove-python/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/2y/442qhylj3bs2mln_h3lh4zvrbrhh_y/T/pip-bgxjtwil-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/2y/442qhylj3bs2mln_h3lh4zvrbrhh_y/T/pip-build-1wyf65em/glove-python/

Issue in installing glove_python on Windows 10

Hi,
I am using Python 2.7. When I tried to install glove_python using the pip install glove_python command, I was asked to download the Microsoft Visual C++ compiler. I downloaded it, but I am still facing an installation issue. I am attaching the logs. Please help me with the installation.
glove_python_error.txt

Training a GloVe model on a custom corpus

Hi
I have a corpus in Hindi and want to train the Glove model on this dataset.
My corpus is in the form of a folder of text documents.
How can I do this? Please provide appropriate code. Thanks.
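As a rough starting point, a folder of text files can be turned into an iterable of token lists, which is the kind of input the README's Usage section describes for the Corpus class. The sketch below is only the reading step. The name read_corpus, the .txt extension, UTF-8 encoding, and whitespace tokenization are all assumptions, and Hindi text may need a proper tokenizer.

```python
import pathlib
import tempfile


def read_corpus(folder, encoding="utf-8"):
    """Yield one list of tokens per non-empty line of every .txt file in `folder`.

    Hypothetical helper; tokenization is a plain whitespace split.
    """
    for path in sorted(pathlib.Path(folder).glob("*.txt")):
        with path.open(encoding=encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Demonstration on a temporary folder with two small made-up documents.
with tempfile.TemporaryDirectory() as folder:
    pathlib.Path(folder, "a.txt").write_text("यह एक वाक्य है\n\n", encoding="utf-8")
    pathlib.Path(folder, "b.txt").write_text("दूसरा वाक्य\n", encoding="utf-8")
    sentences = list(read_corpus(folder))
```

The generator can be consumed lazily, so the whole folder never has to be held in memory at once.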

Loss function is not squared in glove_cython?

Not sure if I am missing something here but thought I'd ask for clarification - the loss function is not squared.

loss = entry_weight * (prediction - c_log(count))

Also, does this implementation not generate separate vectors for when a word is used as context?
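One possible reading of the loss line quoted in this issue, not confirmed against the source, is that the variable holds the derivative of a squared loss with respect to the prediction rather than the loss value itself. The sketch below checks numerically that entry_weight * (prediction - log(count)) is the gradient of 0.5 * entry_weight * (prediction - log(count))^2. The function names and sample numbers are made up for illustration.

```python
import math


def squared_loss(prediction, count, entry_weight):
    # Weighted squared error, with a 1/2 factor so the derivative has no extra 2.
    return 0.5 * entry_weight * (prediction - math.log(count)) ** 2


def loss_gradient(prediction, count, entry_weight):
    # Same form as the `loss` line quoted in the issue.
    return entry_weight * (prediction - math.log(count))


# Arbitrary sample values, chosen only for the check.
prediction, count, entry_weight = 1.7, 12.0, 0.8
eps = 1e-6

# Central finite difference of the squared loss with respect to the prediction.
numeric = (squared_loss(prediction + eps, count, entry_weight)
           - squared_loss(prediction - eps, count, entry_weight)) / (2 * eps)
analytic = loss_gradient(prediction, count, entry_weight)
```

If that reading is right, the squaring happens implicitly: the update uses the gradient, and the squared value is never needed during training.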

Outputting context vectors

I'd like to get the context vectors out too (not just word vectors). Is this possible with current implementation?

The idea is to try to decouple words and contexts completely, ala Levy&Goldberg's "dependency based embeddings", to experiment with functional similarities.

OSX install error "gcc-4.9: error: unrecognized command line option '-Wshorten-64-to-32'"

Here's the output of this one,

> python setup.py install
running install
running bdist_egg
running egg_info
writing pbr to glove.egg-info/pbr.json
writing requirements to glove.egg-info/requires.txt
writing glove.egg-info/PKG-INFO
writing top-level names to glove.egg-info/top_level.txt
writing dependency_links to glove.egg-info/dependency_links.txt
reading manifest file 'glove.egg-info/SOURCES.txt'
writing manifest file 'glove.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.10-intel/egg
running install_lib
running build_py
running build_ext
building 'glove.glove_cython' extension
gcc-4.9 -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c glove/glove_cython.c -o build/temp.macosx-10.10-intel-2.7/glove/glove_cython.o -fopenmp
gcc-4.9: error: unrecognized command line option '-Wshorten-64-to-32'
error: command 'gcc-4.9' failed with exit status 1

Trying to use llvm-gcc results in missing a -lgomp library which I assume is related to the -fopenmp flag in the setup file.

I think cython introduces the shorten-64-to-32 flag based on my python version, I may have to use a different version of that.

gcc: error trying to exec 'cc1plus': execvp: No such file or directory

I am trying to install glove-python on RHEL 7.
I am using Python 2.7.5
I have GCC version 4.8.5 installed on my machine.

When I run pip install glove-python==0.1.0, I get the following error:

  Running setup.py install for glove-python ... error
    Complete output from command /home/vaibhavtulsyan/mar_12/my-dir/.pyenv/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-9Uc3Dm/glove-python/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-pSuN67/install-record.txt --single-version-externally-managed --compile --install-headers /home/vaibhavtulsyan/mar_12/my-dir/.pyenv/include/site/python2.7/glove-python:
    running install
    running build
    running build_py
    running build_ext
    building 'glove.corpus_cython' extension
    gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -c glove/corpus_cython.cpp -o build/temp.linux-x86_64-2.7/glove/corpus_cython.o -fopenmp -ffast-math -march=native
    gcc: error trying to exec 'cc1plus': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1

Output of gcc -v:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
  1. How can I install glove-python on RHEL 7 now?
  2. Is a shared object (binary) of glove-python available that I can directly import?

[Discussion] Motivation

Hi Maciej!

Thanks for sharing this project; I've been pursuing the idea of wrapping the reference stanfordnlp/GloVe source code as a Cython extension. Your library has been very useful in learning more about Cython!

I just wanted to ask what your motivation was for writing the algorithm from scratch, and what you would think of trying to thinly-wrap the original distribution. Thanks!

unable to install glove-python on windows 7

when i am trying to install glove_python on windows 7 i am getting this error.

C:\Users\BINVI01>pip install glove_python
Collecting glove_python
Using cached glove_python-0.1.0.tar.gz
Requirement already satisfied: numpy in c:\python27\lib\site-packages (from glov
e_python)
Requirement already satisfied: scipy in c:\python27\lib\site-packages (from glov
e_python)
Installing collected packages: glove-python
Running setup.py install for glove-python ... error
Complete output from command c:\python27\python.exe -u -c "import setuptools
, tokenize;file='c:\users\binvi01\appdata\local\temp\pip-build-lfd7z
\glove-python\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read
().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" instal
l --record c:\users\binvi01\appdata\local\temp\pip-sozxqa-record\install-record.
txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win32-2.7
creating build\lib.win32-2.7\glove
copying glove\corpus.py -> build\lib.win32-2.7\glove
copying glove\glove.py -> build\lib.win32-2.7\glove
copying glove\__init__.py -> build\lib.win32-2.7\glove
running build_ext
building 'glove.glove_cython' extension
creating build\temp.win32-2.7
creating build\temp.win32-2.7\Release
creating build\temp.win32-2.7\Release\glove
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Ic:\python27\include
-Ic:\python27\PC /Tcglove/glove_cython.c /Fobuild\temp.win32-2.7\Release\glove/g
love_cython.obj -fopenmp -ffast-math -march=native
cl : Command line warning D9002 : ignoring unknown option '-fopenmp'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
glove_cython.c
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:c:\python27\libs /L
IBPATH:c:\python27\PCbuild /LIBPATH:c:\python27\PC\VS9.0 /EXPORT:initglove_cytho
n build\temp.win32-2.7\Release\glove/glove_cython.obj /OUT:build\lib.win32-2.7\g
love\glove_cython.pyd /IMPLIB:build\temp.win32-2.7\Release\glove\glove_cython.li
b /MANIFESTFILE:build\temp.win32-2.7\Release\glove\glove_cython.pyd.manifest -fo
penmp
LINK : warning LNK4044: unrecognized option '/fopenmp'; ignored
Creating library build\temp.win32-2.7\Release\glove\glove_cython.lib and
object build\temp.win32-2.7\Release\glove\glove_cython.exp
building 'glove.metrics.accuracy_cython' extension
creating build\temp.win32-2.7\Release\glove\metrics
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Ic:\python27\include
-Ic:\python27\PC /Tcglove/metrics/accuracy_cython.c /Fobuild\temp.win32-2.7\Rele
ase\glove/metrics/accuracy_cython.obj -fopenmp -ffast-math -march=native
cl : Command line warning D9002 : ignoring unknown option '-fopenmp'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
accuracy_cython.c
creating build\lib.win32-2.7\glove\metrics
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:c:\python27\libs /L
IBPATH:c:\python27\PCbuild /LIBPATH:c:\python27\PC\VS9.0 /EXPORT:initaccuracy_cy
thon build\temp.win32-2.7\Release\glove/metrics/accuracy_cython.obj /OUT:build\l
ib.win32-2.7\glove\metrics\accuracy_cython.pyd /IMPLIB:build\temp.win32-2.7\Rele
ase\glove/metrics\accuracy_cython.lib /MANIFESTFILE:build\temp.win32-2.7\Release
\glove/metrics\accuracy_cython.pyd.manifest -fopenmp
LINK : warning LNK4044: unrecognized option '/fopenmp'; ignored
Creating library build\temp.win32-2.7\Release\glove/metrics\accuracy_cyth
on.lib and object build\temp.win32-2.7\Release\glove/metrics\accuracy_cython.exp

building 'glove.corpus_cython' extension
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Ic:\python27\include
-Ic:\python27\PC /Tpglove/corpus_cython.cpp /Fobuild\temp.win32-2.7\Release\glov
e/corpus_cython.obj -fopenmp -ffast-math -march=native
cl : Command line warning D9002 : ignoring unknown option '-fopenmp'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
corpus_cython.cpp
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but
unwind semantics are not enabled. Specify /EHsc
glove/corpus_cython.cpp(1894) : warning C4018: '>=' : signed/unsigned mismat
ch
glove/corpus_cython.cpp(2225) : warning C4018: '<' : signed/unsigned mismatc
h
glove/corpus_cython.cpp(2496) : warning C4018: '<' : signed/unsigned mismatc
h
glove/corpus_cython.cpp(3403) : warning C4244: 'argument' : conversion from
'double' to 'float', possible loss of data
glove/corpus_cython.cpp(3431) : warning C4244: 'argument' : conversion from
'double' to 'float', possible loss of data
C:\Users\BINVI01\AppData\Local\Programs\Common\Microsoft\Visual C++ for Pyth
on\9.0\VC\Bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:c:\python27\libs /L
IBPATH:c:\python27\PCbuild /LIBPATH:c:\python27\PC\VS9.0 stdc++.lib /EXPORT:init
corpus_cython build\temp.win32-2.7\Release\glove/corpus_cython.obj /OUT:build\li
b.win32-2.7\glove\corpus_cython.pyd /IMPLIB:build\temp.win32-2.7\Release\glove\c
orpus_cython.lib /MANIFESTFILE:build\temp.win32-2.7\Release\glove\corpus_cython.
pyd.manifest -fopenmp -ffast-math -march=native
LINK : warning LNK4044: unrecognized option '/fopenmp'; ignored
LINK : warning LNK4044: unrecognized option '/ffast-math'; ignored
LINK : warning LNK4044: unrecognized option '/march=native'; ignored
LINK : fatal error LNK1181: cannot open input file 'stdc++.lib'
error: command 'C:\Users\BINVI01\AppData\Local\Programs\Common\Micros
oft\Visual C++ for Python\9.0\VC\Bin\link.exe' failed with exit status 1181

----------------------------------------

Command "c:\python27\python.exe -u -c "import setuptools, tokenize;file='c:
\users\binvi01\appdata\local\temp\pip-build-_lfd7z\glove-python\setup.py'
;f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n')
;f.close();exec(compile(code, file, 'exec'))" install --record c:\users\binv
i01\appdata\local\temp\pip-sozxqa-record\install-record.txt --single-version-ext
ernally-managed --compile" failed with error code 1 in c:\users\binvi01\appdata
local\temp\pip-build-_lfd7z\glove-python\

Memory Error when running example.py

Hi, I'm getting a Memory Error when I try to run an example script (probably during creation of the COO matrix). Is there a way to save intermediate results to a file (or some other method) to decrease memory usage?

Get loss of word vector fitting

Hello.

I would like to see the loss of the vector optimization procedure to see if my glove model has relatively converged (and add a tolerance to make an early stopping). All I would need is the loss from the call to the cython module/function fit_vectors. I guess you would need to modify the C code of glove and somehow return the loss. Is there any easy way to achieve this?

Thanks in advance

Minimum word count parameter

I think this implementation is missing a parameter to discard words that appeared less than a given number of times, at least I couldn't find such a parameter in the code.
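In the absence of such a parameter, rare words can be dropped in a preprocessing pass before the corpus is built. The sketch below is a workaround, not a library feature: drop_rare_words and min_count are hypothetical names. It materializes the sentences in memory so it can count them first.

```python
from collections import Counter


def drop_rare_words(sentences, min_count=2):
    """Remove tokens that occur fewer than `min_count` times in the whole corpus.

    Hypothetical preprocessing helper, not part of glove-python.
    """
    sentences = [list(tokens) for tokens in sentences]
    frequencies = Counter(tok for tokens in sentences for tok in tokens)
    return [[tok for tok in tokens if frequencies[tok] >= min_count]
            for tokens in sentences]


filtered = drop_rare_words([["a", "b", "a"], ["b", "c"]], min_count=2)
```

Note that removing tokens before windowing brings the remaining words closer together, which changes which pairs fall inside the context window.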

Are there new examples of applications on glove-python?

Hi, I am new to NLP and interested in exploring the hype around word2vec. I want to carry out some intrinsic evaluation such as "man-woman=father-mother". In the gensim package, we can do so directly with the most_similar function. I do not know how to do that in glove-python. In addition, I wonder whether I can use glove-python for document classification, and how I can use functions like similar_paragraph and transform_paragraph. I look forward to your help, thanks!
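The analogy test itself is plain vector arithmetic plus a cosine-similarity search, so it can be done on any word-to-vector mapping. The sketch below uses a made-up two-dimensional toy dictionary, not trained embeddings, and the analogy helper is hypothetical. With a trained model, the vectors would have to be looked up from the model instead.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Toy vectors invented for the example; they are not real embeddings.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "father": [2.0, 0.1],
    "mother": [2.0, 1.1],
    "car": [-1.0, 0.2],
}
result = analogy(toy, "man", "woman", "father")
```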

Index overflow with large matrix? (ValueError: negative dimensions are not allowed)

For a cooc matrix with dimensionality 1.5 million * 1.5 million I get the following error:

  File "build/bdist.linux-x86_64/egg/glove/corpus.py", line 70, in fit
    max_map_size)
  File "glove/corpus_cython.pyx", line 279, in glove.corpus_cython.construct_cooccurrence_matrix (glove/corpus_cython.cpp:3465)
  File "glove/corpus_cython.pyx", line 145, in glove.corpus_cython.Matrix.to_coo (glove/corpus_cython.cpp:2290)
ValueError: negative dimensions are not allowed

It's a bit weird cause the dimensions use integer and int.max >> (1.5 mio)^2.
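For what it is worth, the inequality above does not hold for a signed 32-bit integer: 1.5 million squared is far larger than 2^31 - 1, and a value that large wraps to a negative number when stored in 32 bits. Whether the failing index in corpus_cython is actually a 32-bit int is an assumption that has not been checked. The sketch only shows the arithmetic.

```python
import ctypes

n = 1500000
cells = n * n            # number of cells in a dense n x n matrix
INT32_MAX = 2 ** 31 - 1  # largest value a signed 32-bit integer can hold

exceeds_int32 = cells > INT32_MAX

# ctypes integer types do no overflow checking, so this shows the wrapped value.
wrapped = ctypes.c_int32(cells).value
```

A negative wrapped value would be consistent with a "negative dimensions are not allowed" error, but the same symptom could also come from a large number of stored entries rather than from the matrix shape.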

glove/glove_cython.c:262:10: fatal error: omp.h: No such file or directory

I was trying to install glove on an ec2 instance. I have python3.6 and have already installed gcc. Even then, it is failing to install. The displayed message is:

glove/glove_cython.c:262:10: fatal error: omp.h: No such file or directory
#include <omp.h>
^~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1

unable to install on Win 7

whether using python setup.py install or pip install glove-python commands from Anaconda prompt, any installation attempt ends in failure with the error 2 below

c:\Anaconda>pip install glove-python
Collecting glove-python
  Downloading glove_python-0.1.0.tar.gz (263kB)
    100% |################################| 266kB 744kB/s
Requirement already satisfied (use --upgrade to upgrade): numpy in c:\anaconda\l
ib\site-packages (from glove-python)
Requirement already satisfied (use --upgrade to upgrade): scipy in c:\anaconda\l
ib\site-packages (from glove-python)
Building wheels for collected packages: glove-python
  Running setup.py bdist_wheel for glove-python ... error
  Complete output from command c:\anaconda\python.exe -u -c "import setuptools,
tokenize;__file__='c:\\users\\captain\\appdata\\local\\temp\\pip-build-jiom9g\\g
love-python\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).re
ad().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\captain\
appdata\local\temp\tmpbo71fdpip-wheel- --python-tag cp27:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-2.7
  creating build\lib.win-amd64-2.7\glove
  copying glove\corpus.py -> build\lib.win-amd64-2.7\glove
  copying glove\glove.py -> build\lib.win-amd64-2.7\glove
  copying glove\__init__.py -> build\lib.win-amd64-2.7\glove
  running build_ext

  building 'glove.glove_cython' extension
  error: [Error 2] The system cannot find the file specified

unable to find Corpus, Glove on OSX

Macbook Air
El Capitan 10.11.3
Python 2.7.11

Was able to (apparently) successfully install glove-python. However, when attempting to:
from glove import Corpus, Glove
I receive the following error:

Traceback (most recent call last):
  File "build_glove.py", line 15, in <module>
    from glove import Glove
  File "/Users/tyler/gitlab/parsing/quero/venv/lib/python2.7/site-packages/glove/__init__.py", line 1, in <module>
    from .corpus import Corpus
  File "/Users/tyler/gitlab/parsing/quero/venv/lib/python2.7/site-packages/glove/corpus.py", line 10, in <module>
    from .corpus_cython import construct_cooccurrence_matrix
ImportError: dlopen(/Users/tyler/gitlab/parsing/quero/venv/lib/python2.7/site-packages/glove/corpus_cython.so, 2): no suitable image found. Did find:
  /Users/tyler/gitlab/parsing/quero/venv/lib/python2.7/site-packages/glove/corpus_cython.so: mach-o, but wrong architecture

Any thoughts or comments on the matter? I have yet to find a solution.

install issue on windows 10

I am running the command pip install glove_python

C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\bin\HostX86\x86\link.exe /nologo /INCREMENTAL:NO /LTCG /nodefaultlib:libucrt.lib ucrt.lib /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:c:\users\hi5an\appdata\local\programs\python\python35-32\libs /LIBPATH:c:\users\hi5an\appdata\local\programs\python\python35-32\PCbuild\win32 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\Lib\x86" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\um\x86" /LIBPATH:C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319 "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\ucrt\x86" stdc++.lib /EXPORT:PyInit_corpus_cython build\temp.win32-3.5\Release\glove/corpus_cython.obj /OUT:build\lib.win32-3.5\glove\corpus_cython.cp35-win32.pyd /IMPLIB:build\temp.win32-3.5\Release\glove\corpus_cython.cp35-win32.lib -fopenmp -ffast-math -march=native
LINK : warning LNK4044: unrecognized option '/fopenmp'; ignored
LINK : warning LNK4044: unrecognized option '/ffast-math'; ignored
LINK : warning LNK4044: unrecognized option '/march=native'; ignored
LINK : fatal error LNK1181: cannot open input file 'stdc++.lib'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\bin\HostX86\x86\link.exe' failed with exit status 1181

Complete logs in the attached file
log.txt

Help: sample Python code that uses a pretrained model to give sentence vectors

I'm new to GloVe. I was able to install glove-python using pip. Can someone please point me to sample code that uses the glove.6B.zip pre-trained model to get vectors for new sentences? I tried converting it to word2vec format but ran into issues. Could someone help by providing sample code?
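
As a starting point, here is a minimal sketch that does not depend on glove-python at all. It relies on the Stanford text format (one word per line followed by space-separated floats) and uses a plain average of word vectors as the sentence vector. Averaging is a simple baseline swapped in for illustration, not the library's transform_paragraph method, and the helper names `load_glove_text` and `sentence_vector` are hypothetical.

```python
import io

import numpy as np


def load_glove_text(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors


def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)


# Tiny in-memory stand-in for a real vectors file (values are made up).
sample = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
vectors = load_glove_text(sample)
vec = sentence_vector(["the", "cat", "unicorn"], vectors)
```

For the real data, you would unzip glove.6B.zip and pass an open file handle (for example one of the extracted .txt files, opened with UTF-8 encoding) to `load_glove_text` in place of the in-memory sample.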

Min word count while creating vocab

I was trying to train on the latest Wikipedia dump (15 GB), which obviously has a large corpus and token count (approx. 360m). Since the co-occurrence matrix needs to live in memory, I want to set a minimum frequency count for words when creating the vocab, which in turn determines the co-occurrence matrix. I could not find any parameter for that. Also, the code is in Cython, so it's hard to understand for a noob like me. Any idea how I can create the vocab and the co-occurrence matrix in a memory-efficient way?
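
One possible workaround is to filter rare words in a preprocessing pass before the corpus ever reaches the library. The sketch below is plain Python and makes two passes over the data: one to count, one to stream the filtered documents. The names `build_vocab`, `filtered_documents`, and `make_docs` are hypothetical; `make_docs` is assumed to be a callable that returns a fresh iterator over tokenized documents each time, so nothing has to be held in memory beyond the counts.

```python
from collections import Counter


def build_vocab(make_docs, min_count):
    """First pass: count tokens and keep those seen at least min_count times."""
    counts = Counter(tok for doc in make_docs() for tok in doc)
    kept = (word for word, count in counts.items() if count >= min_count)
    return {word: idx for idx, word in enumerate(kept)}


def filtered_documents(make_docs, vocab):
    """Second pass: lazily yield each document with rare tokens removed."""
    for doc in make_docs():
        yield [tok for tok in doc if tok in vocab]


# Toy corpus; a real make_docs would re-open and tokenize a file on each call.
def make_docs():
    return iter([["a", "b", "a"], ["a", "c", "b"]])


vocab = build_vocab(make_docs, min_count=2)
docs = list(filtered_documents(make_docs, vocab))
```

The filtered stream can then be handed to the Corpus class like any other iterable of tokens. Note that dropping tokens moves the remaining words closer together within the context window, which slightly changes the co-occurrence counts.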

metrics is not in top-level module namespace

examples/analogy_tasks_evaluation.py only works if metrics is in the top level module namespace, as in from glove import Glove, metrics. At least in my most recent pip installed version, metrics is not in the module's exported namespace.

I'm running Python 2.7.6 on Ubuntu.

What is the purpose of having the transform_paragraph in transform_paragraph?

I see that in glove.py there is a transform_paragraph function, which, as the name suggests, transforms a paragraph into a vector. However, at the end of the function I see it calling another transform_paragraph, this time from glove_cython.pyx. What is the purpose of this last call? It seems to work just as well without it.

Problem with the _model.add_dictionary(corpus.dictionary)_

Hi all,

Thanks for this nice implementation. However, the dict constructed from the input data by the Corpus() method has a shape:

{
word_1: ID_1,
word_2: ID_2 ... }

So what about a word that appears often in the corpus? Is the last ID simply replaced? Shouldn't it be a list?

One more thing: when training on multiple chunks of documents, the add_dictionary() method simply replaces the old dict created from chunk 1 with the one created from chunk 2. Would you like me to submit a pull request that merges the two dicts instead?

It is more RAM-friendly to iterate through a generator when the input corpus is huge as hell ...

Thanks !
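
To illustrate both points, here is a small sketch of how a word-to-ID dictionary of this shape typically behaves. It is an illustration in plain Python, not glove-python's actual code, and the helper `update_dictionary` is hypothetical. Each distinct word gets exactly one ID (usable as a row or column index), so a frequent word simply reuses its ID; how often it occurs would be recorded elsewhere, for example in the co-occurrence counts. Extending the same dict across chunks keeps earlier IDs stable instead of replacing them.

```python
def update_dictionary(dictionary, documents):
    """Assign the next free ID to each unseen token; seen tokens keep their ID."""
    for doc in documents:
        for token in doc:
            dictionary.setdefault(token, len(dictionary))
    return dictionary


chunk_1 = [["the", "cat", "the"]]
chunk_2 = [["the", "dog"]]

dictionary = update_dictionary({}, chunk_1)
ids_after_chunk_1 = dict(dictionary)  # snapshot before merging the next chunk
dictionary = update_dictionary(dictionary, chunk_2)
```

Because `documents` can be any iterable, the same function works with a generator over a large corpus.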

transform_paragraph input

What is the input data structure of the transform_paragraph function? Is it a list of words? I got a KeyError when I gave a list of words as input.

I hope this is a simple question, because I'm new to NLP.

Thanks. This is a great project.

UPDATE: Did I misunderstand? I'm using a pretrained model and just use this function to convert a short sentence/paragraph to a vector. But reading the code, it seems this function performs gradient updates, which looks like a training process...

How to load Stanford pre-trained?

Hi, I am new to GloVe in Python. I was wondering how to load the pre-trained Stanford GloVe word vectors.
Request your help.

Regards
SBS
