
lda2vec's Introduction

lda2vec: Tools for interpreting natural language

[Badges: MIT license · Read the Docs · Travis CI · @chrisemoody on Twitter]

[animated figure: lda2vec_network_publish_text.gif]

The lda2vec model tries to mix the best parts of word2vec and LDA into a single framework. word2vec captures powerful relationships between words, but the resulting vectors are largely uninterpretable and don't represent documents. LDA, on the other hand, is quite interpretable by humans, but doesn't model local word relationships the way word2vec does. We build a model that learns both word and document topics, makes them interpretable, builds topics over clients, times, and documents, and lets those topics be supervised.

Warning: this code is a big series of experiments. It's research software, and we've tried to make it simple to modify lda2vec and to play around with your own custom topic models. However, it's still research software: I wouldn't run it in production or on Windows, and I'd only use it after you've decided both word2vec and LDA are inadequate and you'd like to tinker with your own cool models :) That said, I don't want to discourage experimentation: there's some limited documentation, a modicum of unit tests, and some interactive examples to get you started.

Resources

See the research paper Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec

See this Jupyter Notebook for an example of an end-to-end demonstration.

See this slide deck or this YouTube video for a presentation focused on the benefits of word2vec, LDA, and lda2vec.

See the API reference docs

About

images/img00_word2vec.png

Word2vec tries to model word-to-word relationships.

images/img01_lda.png

LDA models document-to-word relationships.

images/img02_lda_topics.png

LDA yields topics over each document.

images/img03_lda2vec_topics01.png

lda2vec yields topics not just over documents, but also over regions.

images/img04_lda2vec_topics02.png

lda2vec also yields topics over clients.

images/img05_lda2vec_topics03_supervised.png

With lda2vec, topics can be 'supervised' and forced to predict another target.

lda2vec also includes more contexts and features than LDA. LDA dictates that words are generated by a document vector; but we might have all kinds of 'side-information' that should influence our topics. For example, a single client comment is about a particular item ID, written at a particular time and in a particular region. In this case, lda2vec gives you topics over all items (separating jeans from shirts, for example), times (winter versus summer), regions (desert versus coastal), and clients (sporty versus professional attire). A sketch of wiring up such contexts appears below.
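As a sketch only: the snippet below shows how several categorical contexts might be attached to one model. It reuses the LDA2Vec constructor and the add_categorical_feature call that appear in the example code quoted in the issues further down this page; n_clients, n_regions, and the *_topics counts are hypothetical placeholders, not part of the package.

# Hedged sketch, not the package's documented API: each categorical
# feature gets its own set of topics alongside the per-document topics.
from lda2vec_model import LDA2Vec

model = LDA2Vec(n_words, n_hidden, counts, dropout_ratio=0.2)
model.add_categorical_feature(n_docs, n_topics, name='document_id')
model.add_categorical_feature(n_clients, n_client_topics, name='client_id')    # hypothetical
model.add_categorical_feature(n_regions, n_region_topics, name='region_id')    # hypothetical
model.finalize()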

Ultimately, the topics are interpreted using the excellent pyLDAvis library:

images/img06_pyldavis.gif

Requirements

Minimum requirements:

  • Python 2.7+
  • NumPy 1.10+
  • Chainer 1.5.1+
  • spaCy 0.99+

Requirements for some features:

  • CUDA support
  • Testing utilities: py.test

lda2vec's People

Contributors: cemoody, cesarsalgado, eloiz, intohole, matheusportela, tingletech


lda2vec's Issues

clambda value

You mentioned in the paper that clambda = 200 performs well, but it seems that clambda is not used after it is set to 200 in lda2vec_run.py.

a DLL error when importing lda2vec

Hi,

I am new to lda2vec and very interested in learning it.

I have installed lda2vec and all required dependencies in their newest versions with the help of the setup.py file available here. I work with Enthought Canopy on Windows, Python 2.7.11. The only problem I had while installing dependencies was that I had to downgrade cymem (a package on which one of the dependencies depends) to an older version. When I import chainer, spacy, or numpy, the import goes smoothly. However, when I import lda2vec, I get the error below.

Does anyone know why this happens and how best to overcome it?

All the best,
Radoslaw


ImportError                               Traceback (most recent call last)
in <module>()
----> 1 import lda2vec
      2
      3

build\bdist.win-amd64\egg\lda2vec\__init__.py in <module>()

build\bdist.win-amd64\egg\lda2vec\tracking.py in <module>()

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scikit_learn-0.17.1-py2.7-win-amd64.egg\sklearn\__init__.py in <module>()
     55 else:
     56     from . import __check_build
---> 57     from .base import clone
     58     __check_build  # avoid flakes unused variable error
     59

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scikit_learn-0.17.1-py2.7-win-amd64.egg\sklearn\base.py in <module>()
      9 from scipy import sparse
     10 from .externals import six
---> 11 from .utils.fixes import signature
     12
     13

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scikit_learn-0.17.1-py2.7-win-amd64.egg\sklearn\utils\__init__.py in <module>()
      9
     10 from .murmurhash import murmurhash3_32
---> 11 from .validation import (as_float_array,
     12                          assert_all_finite,
     13                          check_random_state, column_or_1d, check_array,

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scikit_learn-0.17.1-py2.7-win-amd64.egg\sklearn\utils\validation.py in <module>()
     14
     15 from ..externals import six
---> 16 from ..utils.fixes import signature
     17
     18 FLOAT_DTYPES = (np.float64, np.float32, np.float16)

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scikit_learn-0.17.1-py2.7-win-amd64.egg\sklearn\utils\fixes.py in <module>()
    322     from ._scipy_sparse_lsqr_backport import lsqr as sparse_lsqr
    323 else:
--> 324     from scipy.sparse.linalg import lsqr as sparse_lsqr
    325
    326

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\linalg\__init__.py in <module>()
    110 from __future__ import division, print_function, absolute_import
    111
--> 112 from .isolve import *
    113 from .dsolve import *
    114 from .interface import *

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\linalg\isolve\__init__.py in <module>()
      4
      5 #from info import __doc__
----> 6 from .iterative import *
      7 from .minres import minres
      8 from .lgmres import lgmres

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\linalg\isolve\iterative.py in <module>()
      5 __all__ = ['bicg','bicgstab','cg','cgs','gmres','qmr']
      6
----> 7 from . import _iterative
      8
      9 from scipy.sparse.linalg.interface import LinearOperator

ImportError: DLL load failed: The specified procedure could not be found.

error with spacy.tokens.span.Span

I have started using spacy recently, and I am getting an error that the object has no attribute get_dependants. Is there a way to fix this?

in opinion_extractor(aspect_token, parsed_sentence)
     11
     12     # Check for Negative Opinions
---> 13     for negation in parsed_sentence.get_dependants(aspect_token):
     14         if negation.deprel == "neg":
     15             negation_check = True

AttributeError: 'spacy.tokens.span.Span' object has no attribute 'get_dependants'
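A hedged workaround: spaCy's API has no get_dependants method, but syntactic dependents are reachable through Token.children and can be filtered on the dep_ label, roughly as below (aspect_token is assumed to be a Token; take span.root first if you have a Span).

# Sketch of the same negation check using spaCy's Token.children iterator:
negation_check = False
for child in aspect_token.children:   # direct syntactic dependents
    if child.dep_ == "neg":           # 'neg' marks a negation modifier
        negation_check = True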

Loss and Prior becomes nan

Is it normal behavior that the loss and prior become nan soon after the training process starts?
I ran the sample code for the 20 newsgroups example.

[screenshot of training output, 2017-12-20]
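Not necessarily the cause here, but NaN losses early in training are often tamed by a smaller learning rate or by gradient clipping. A hedged sketch for Chainer 1.x, with Adam standing in for whatever optimizer the script actually uses:

import chainer

optimizer = chainer.optimizers.Adam()
optimizer.setup(model)  # model: the LDA2Vec chain being trained
# clip gradient norms so one large gradient cannot blow the loss up to NaN
optimizer.add_hook(chainer.optimizer.GradientClipping(5.0))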

Remove spaCy as a dependency

Hi @cemoody

I see spaCy being used ONCE in the code (to tokenize). Might I suggest making lda2vec more language-agnostic by removing the dependency and leaving the onus of tokenizing on the user? I was hoping to get started on lda2vec, but the spaCy requirement automatically implied English, and that made me reluctant.

Assertion error

File "/home/aum/PycharmProjects/learn_p/venv/src/lda2vec/lda2vec/preprocess.py", line 35, in tokenize
assert dat.min() >= 0, msg
AssertionError: Negative indices reserved for special tokens

What should I do?

TypeError: only length-1 arrays can be converted to Python scalars

I am getting this error while executing your 20newsgroups IPython notebook.

I am running the notebook on a remote IPython notebook server.

Also, I got a MemoryError while executing the code under the Preprocess text section.

And while executing the Fit the model section, I got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-7-c243e820b33b> in <module>()
      1 # Fit the model
----> 2 model = LDA2Vec(n_words, n_hidden, counts, dropout_ratio=0.2)
      3 model.add_categorical_feature(n_docs, n_topics, name='document_id')
      4 model.finalize()
      5 if os.path.exists('model.hdf5'):

/home/nipunsadvilkar/notebooks/lda2vec_model.pyc in __init__(self, n_documents, n_document_topics, n_units, n_vocab, dropout_ratio, train, counts, n_samples, word_dropout_ratio)
     16                  counts=None, n_samples=15, word_dropout_ratio=0.0):
     17         em = EmbedMixture(n_documents, n_document_topics, n_units,
---> 18                           dropout_ratio=dropout_ratio)
     19         kwargs = {}
     20         kwargs['mixture'] = em

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/embed_mixture.pyc in __init__(self, n_documents, n_topics, n_dim, dropout_ratio)
     66         self.n_dim = n_dim
     67         self.dropout_ratio = dropout_ratio
---> 68         factors = _orthogonal_matrix((n_topics, n_dim)).astype('float32')
     69         factors /= np.sqrt(n_topics + n_dim)
     70         super(EmbedMixture, self).__init__(

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/embed_mixture.pyc in _orthogonal_matrix(shape)
     10     # github.com/mila-udem/blocks/blob/master/blocks/initialization.py
     11     M1 = np.random.randn(shape[0], shape[0])
---> 12     M2 = np.random.randn(shape[1], shape[1])
     13 
     14     # QR decomposition of matrix with entries in N(0, 1) is random

mtrand.pyx in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:17781)()

mtrand.pyx in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:18260)()

mtrand.pyx in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:2055)()

TypeError: only length-1 arrays can be converted to Python scalars

Also, why does it raise an error while importing the LDA2Vec function?

ImportError: cannot import name LDA2Vec

Thanks for this great library! However, after python setup.py install, I cannot import the core object LDA2Vec. I also didn't see any files with the name LDA2Vec in the source code.
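For what it's worth, the tracebacks elsewhere on this page suggest the LDA2Vec class is defined in the example script lda2vec_model.py (under examples/twenty_newsgroups/lda2vec/), not in the installed lda2vec package, so the import only works with that file on your path:

# not part of the installed package; requires the example file on sys.path
from lda2vec_model import LDA2Vec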

TypeError trying to visualize npzfile data with pyLDAvis

I'm new to using pyLDAvis, but I ran the 20 newsgroups example, which generated topics.pyldavis.npz.
After loading that into npzfile, I did the following:

import pyLDAvis as vis

prepared_data = vis.prepare(npzfile['topic_term_dists'],npzfile['doc_topic_dists'],npzfile['doc_lengths'],npzfile['vocab'],npzfile['term_frequency'])

vis.prepared_data_to_html(prepared_data)

but I just get a TypeError

TypeError: 5.3336749 is not JSON serializable

(The same error occurs if I use show() or display() or anything else that expects PreparedData, though if I just print prepared_data, it looks like it contains reasonable topics and vocab...)

Am I just doing something wrong or is there a bug?
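If it helps anyone, a hedged workaround: pyLDAvis hands its output to the standard json module, which cannot encode numpy float32 scalars, so casting the arrays to float64 before prepare() avoids the TypeError.

import pyLDAvis as vis

prepared_data = vis.prepare(
    npzfile['topic_term_dists'].astype('float64'),   # float32 -> float64 for JSON
    npzfile['doc_topic_dists'].astype('float64'),
    npzfile['doc_lengths'],
    npzfile['vocab'],
    npzfile['term_frequency'])
vis.prepared_data_to_html(prepared_data)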

error with preprocess.py

Hello, I have this error in preprocess.py:

python3 preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
    from lda2vec import preprocess, Corpus
  File "/usr/local/lib/python3.5/dist-packages/lda2vec-0.1-py3.5.egg/lda2vec/__init__.py", line 1, in <module>
ImportError: No module named 'dirichlet_likelihood'

but there is no package named 'dirichlet_likelihood' on https://pypi.python.org/pypi.
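A plausible fix, assuming the failure is Python 3 dropping implicit relative imports: dirichlet_likelihood.py ships inside the lda2vec package itself (nothing needs to come from PyPI), so the bare import in lda2vec/__init__.py would need the explicit relative form.

# lda2vec/__init__.py -- sketch of the change, not a committed patch:
# replace the Python 2 style implicit import
#     import dirichlet_likelihood
# with the explicit relative form
from . import dirichlet_likelihood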

error with spacy

Hello, I have this error in hacker_news/data:
python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.linux-x86_64/egg/lda2vec/preprocess.py", line 76, in tokenize
    author_name = authors.categories
  File "spacy/tokens/doc.pyx", line 250, in noun_chunks (spacy/tokens/doc.cpp:8013)
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4890)
IndexError: list index out of range

No module named unicode

I'm using Python 3.5, and when I try to run processor.py it shows an error: No module named unicode. Can anyone help me fix this?

How to predict with a lda2vec model

When I have a lda2vec model, how can I predict the vector of a document with it? I couldn't find a prediction method in any of the following files: corpus.py, dirichlet_likelihood.py, embed_mixture.py, fake_data.py, __init__.py, negative_sampling.py, preprocess.py, topics.py, tracking.py, utils.py.

Axis Error lda2vec

I am getting the following error when I try to run the default preprocess.py in lda2vec (original GitHub). Can someone help me out with this?


AxisError Traceback (most recent call last)
in <module>()
2 # Make a ranked list of rare vs frequent words
3 corpus.update_word_count(tokens)
----> 4 corpus.finalize()

~/Arav/LDA2VEC_TopicModel/lda2vec-master/lda2vec/corpus.py in finalize(self)
153 # Return the loose keys and counts in descending count order
154 # so that the counts arrays is already in compact order
--> 155 self.keys_loose, self.keys_counts, n_keys = self._loose_keys_ordered()
156 self.keys_compact = np.arange(n_keys).astype('int32')
157 self.loose_to_compact = {l: c for l, c in

~/Arav/LDA2VEC_TopicModel/lda2vec-master/lda2vec/corpus.py in _loose_keys_ordered(self)
102 keys, counts = keys[order], counts[order]
103 # Add in the specials as a prefix to the other keys
--> 104 specials = np.sort(self.specials.values())
105 keys = np.concatenate((specials, keys))
106 empty = np.zeros(len(specials), dtype='int32')

~/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in sort(a, axis, kind, order)
845 else:
846 a = asanyarray(a).copy(order="K")
--> 847 a.sort(axis=axis, kind=kind, order=order)
848 return a
849

AxisError: axis -1 is out of bounds for array of dimension 0
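A hedged reading of the traceback: under Python 3, dict.values() returns a view object, which numpy treats as a 0-d object array, so np.sort fails with this AxisError. Wrapping the view in list() restores the Python 2 behaviour that corpus.py appears to assume.

# corpus.py, _loose_keys_ordered (around line 104) -- sketch of the fix:
specials = np.sort(list(self.specials.values()))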

Installation issues on OSX

I'm having trouble installing this successfully on OSX. I'm downloading the repository locally and installing via python setup.py install. This throws the following error:

7 warnings generated.
zip_safe flag not set; analyzing archive contents...
error: SandboxViolation: open('/Users/conanmcmurtrie/anaconda/include/murmurhash/MurmurHash2.h', 'wb') {}

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This package cannot be safely installed by EasyInstall, and may not
support alternate installation locations even if you run its setup
script by hand.  Please inform the package's author and the EasyInstall
maintainers to find out if a fix or workaround is available.

Any ideas?

AxisError: axis -1 is out of bounds for array of dimension 0

When I run preprocess.py, I get the following problem. How should I resolve it? Thanks very much!

runfile('D:/Python project/3Olive-lda2vec-master/examples/twenty_newsgroups/data/preprocess.py', wdir='D:/Python project/3Olive-lda2vec-master/examples/twenty_newsgroups/data')
Traceback (most recent call last):

File "", line 1, in
runfile('D:/Python project/3Olive-lda2vec-master/examples/twenty_newsgroups/data/preprocess.py', wdir='D:/Python project/3Olive-lda2vec-master/examples/twenty_newsgroups/data')

File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)

File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "D:/Python project/3Olive-lda2vec-master/examples/twenty_newsgroups/data/preprocess.py", line 35, in
corpus.finalize()

File "D:\Python project\3Olive-lda2vec-master\lda2vec\corpus.py", line 155, in finalize
self.keys_loose, self.keys_counts, n_keys = self._loose_keys_ordered()

File "D:\Python project\3Olive-lda2vec-master\lda2vec\corpus.py", line 104, in _loose_keys_ordered
specials = np.sort(self.specials.values())

File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 822, in sort
a.sort(axis=axis, kind=kind, order=order)

AxisError: axis -1 is out of bounds for array of dimension 0
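This looks like the same Python 3 dict.values() issue as the AxisError report two issues above; the list() wrapper sketched there (in corpus.py, line 104) should apply here as well.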

Set number of epochs

Hi. I'm new to this.
I am wondering how the author set the number of epochs for the examples: the number of epochs in twenty_newsgroups is 200, while in hacker_news it is 5000.

the result of preprocess

When I run preprocess.py in twenty_newsgroups, I get results like these:

2 --> SKIP
4 , --> ÉÏ
5 . --> ÉÏ
13 - --> ÉÏ
15 ) --> ÉÏ
16 " --> ÉÏ
17 ( --> ÉÏ
19 : --> ÉÏ
24 ? --> ÉÏ
36 ' --> ÉÏ
43 / --> ÉÏ
49 ! --> ÉÏ
51 ; --> ÉÏ
61 < --> ÉÏ
76 ... --> §£.§£.
79 -- --> -4
90 ] --> ÉÏ
100 max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax --> Malavika_Jagannathan_?[email protected]
108 [ --> ÉÏ
126 | --> ÉÏ
226 } --> ÉÏ
231 10 --> -0

I don't know what I should do to fix this, or whether these are the right results.

Does lda2vec modelling require cuDNN and HDF5?

Hi,

I read on the main page of this GitHub repo that the only requirement for modelling is CUDA support. At present, I'm trying to use code from the lda2vec_run.py file to run the modelling. I have seen that towards the end of the file the output is saved to hdf5, but I can't find any documentation on whether CUDA + cuDNN and/or HDF5 is required to run an lda2vec model.

I ask because at present, despite installing the CUDA toolkit together with Visual Studio 2013 and setting the environment variables for Chainer, I cannot correctly import cuda into chainer.

My code is this:

import os

import chainer
from chainer import cuda
from chainer import serializers
import chainer.optimizers as O
import numpy as np

from lda2vec import utils
from lda2vec import prepare_topics, print_top_words_per_topic
from lda2vec_model import LDA2Vec

gpu_id = int(os.getenv('CUDA_GPU', 0))
cuda.get_device(gpu_id).use()  # this is where the problem currently occurs
print "Using GPU " + str(gpu_id)

I get an error as follows:

RuntimeError Traceback (most recent call last)
in <module>()
     20
     21 gpu_id = int(os.getenv('CUDA_GPU', 0))
---> 22 cuda.get_device(gpu_id).use()  # this is where the problem currently occurs
     23 print "Using GPU " + str(gpu_id)
     24

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\chainer\cuda.pyc in get_device(*args)
159 continue
160 if not isinstance(arg, numpy.ndarray):
--> 161 check_cuda_available()
162 if isinstance(arg, cupy.ndarray):
163 if arg.device is None:

C:\Users\user\AppData\Local\Enthought\Canopy\User\lib\site-packages\chainer\cuda.pyc in check_cuda_available()
80 '(see https://github.com/pfnet/chainer#installation).')
81 msg += str(_resolution_error)
---> 82 raise RuntimeError(msg)
83 if not cudnn_enabled:
84 warnings.warn(

RuntimeError: CUDA environment is not correctly set up
(see https://github.com/pfnet/chainer#installation).cannot import name core

I would be grateful for help and any clarification.

Dimension of Topic Vector

According to the LDA algorithm, the dimension of the topic matrix (word probabilities) should be K x V, where K is the number of topics and V is the vocabulary size (http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf). But here the dimension is defined as K x W, where W is the word vector dimension:

#Number of dimensions in a single word vector
n_units = int(os.getenv('n_units', 300))

Kindly explain how these dimensions can still follow the LDA algorithm.

Thanks in advance!!
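One way to reconcile the two views, as an explanatory sketch rather than code from the package: lda2vec stores topics as K vectors in the W-dimensional word-embedding space and recovers LDA-style K x V topic-word probabilities with a softmax over dot products against all V word vectors, as described in the paper.

import numpy as np

def topic_word_dists(topic_vectors, word_vectors):
    # topic_vectors: (K, W) topics living in the embedding space
    # word_vectors:  (V, W) one embedding per vocabulary word
    logits = topic_vectors.dot(word_vectors.T)        # (K, V) similarity scores
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)   # rows are K x V distributions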

The lda2vec package should have a big warning for Windows users

Hi,

One probably final comment from me. The lda2vec package should have a HUGE warning sign that trying it on Windows is a big challenge. Making the dependencies work on Windows is a real pain and requires considerable experience with similar problems.

Chainer doesn't really work on Windows systems. I learned to appreciate the magnitude of the problem only later into my work with lda2vec. For the record, I did manage to run the preprocessing and modelling code from the Hacker News example after quite significant modifications of the code, but it was a very time-consuming feat with limited success. For example, cuDNN still doesn't work, and I need to do all computations on the CPU. Adapting this package to a Windows system on one's own is only achievable for well-trained IT experts.

All the best,
RK

The top words are very similar after 5-6 epochs

[screenshot of topic-term distribution, 2016-06-16]

I was rerunning the script for 20_newsgroups, and this is the topic-term distribution after 1 epoch. From the picture, we can see that the top words for each topic are actually very similar. Is this normal, or was I implementing something wrong? I encountered the same issue when I ran the script on other corpora: after 10 epochs, the top words were almost identical, with the top words being "the", "a", etc.

Using 2 GPUs

Is it even possible to utilize 2 GPUs without messing with the code a lot?

Non-existent "topics.pyldavis.npz"

The Jupyter Notebook states that "After running lda2vec_run.py script in examples/twenty_newsgroups/lda2vec directory a topics.pyldavis.npz will be created..."

However, looking at the code of lda2vec_run.py, the creation of "topics.pyldavis.npz" is nowhere to be found. In the end, the script only creates a "lda2vec.hdf5" file. How can this "lda2vec.hdf5" file be used in the Jupyter Notebook tutorial?

Thanks,
Nicholas
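For reference, a hedged sketch of producing the file by hand: lda2vec_run.py imports a prepare_topics helper (see the import lines quoted in the cuDNN/HDF5 issue above), and its output can be saved with numpy. The weights, factors, word_vectors, and vocab names here are assumptions about the trained model's attributes rather than the script's exact variables.

# sketch only; argument names are assumptions about the trained model
import numpy as np
from lda2vec import prepare_topics

data = prepare_topics(weights, factors, word_vectors, vocab)
np.savez('topics.pyldavis', **data)   # writes topics.pyldavis.npz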

lda2vec requires cymem package as well

Hi,

Just a quick note: I worked through the preprocess.py file and figured out that the cymem package is also required. Moreover, the newest version of cymem is not appropriate; I had to install cymem version 1.30 before I could install the lda2vec dependencies correctly. I think that should be mentioned to those interested in using the package.

MemoryError

I am working with Python 2.7 64-bit. I tried to run lda2vec without pretrained vectors (on the fly). I am getting a MemoryError at the preprocess step with the twenty_newsgroups example. Could you please help me figure out this issue?

Incompatibility with spacy>=0.100

spacy deprecated the use of the LOCAL_DATA_DIR variable after version 0.100. The spacy source indicates that you can now call a Language class without a data_dir.

Using spacy==0.100.5, I was getting the following error:

Traceback (most recent call last):
  File "cluster_articles_lda2vec.py", line 35, in <module>
    tokens, vocab = preprocess.tokenize(texts, max_length, tag=False, parse=False, entity=False)
  File "/Users/brianabelson/.virtualenvs/similar/lib/python2.7/site-packages/lda2vec/preprocess.py", line 65, in tokenize
    nlp = English(data_dir=data_dir)
  File "/Users/brianabelson/.virtualenvs/similar/lib/python2.7/site-packages/spacy/language.py", line 231, in __init__
    vocab = self.default_vocab(package)
  File "/Users/brianabelson/.virtualenvs/similar/lib/python2.7/site-packages/spacy/language.py", line 165, in default_vocab
    return Vocab.from_package(package, get_lex_attr=get_lex_attr)
  File "spacy/vocab.pyx", line 65, in spacy.vocab.Vocab.from_package (spacy/vocab.cpp:3592)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/brianabelson/.virtualenvs/similar/lib/python2.7/site-packages/sputnik/package_stub.py", line 68, in open
    raise default(self.file_path(*path_parts))
IOError: ~/.virtualenvs/similar/lib/python2.7/site-packages/spacy/en/data/vocab/strings.json

This is related to #8 and #2.
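A minimal sketch of the change implied above, assuming spacy >= 0.100 where the data_dir argument has been dropped:

# in lda2vec/preprocess.py, tokenize(): drop the data_dir argument
from spacy.en import English
nlp = English()  # spacy >= 0.100 resolves its own model data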

Trouble with executing preprocess.py

I'm trying to run the twenty_newsgroups example, but I'm running into a lot of problems.
I realized this code uses old packages, so I tried the suggested versions.
When I ran preprocess.py with spacy 0.99, it said:
Traceback (most recent call last): File "preprocess.py", line 31, in <module> n_threads=4) File "/home/allen.wu/lda2vec-master/lda2vec/preprocess.py", line 72, in tokenize for row, doc in enumerate(nlp.pipe(texts, **kwargs)): AttributeError: 'English' object has no attribute 'pipe'

I upgraded spacy to 0.100.1, and this error turned up:
Traceback (most recent call last): File "preprocess.py", line 31, in <module> n_threads=4) File "/home/allen.wu/lda2vec-master/lda2vec/preprocess.py", line 68, in tokenize nlp = English() File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/language.py", line 202, in __init__ package = util.get_package_by_name() File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/util.py", line 28, in get_package_by_name raise RuntimeError("Model not installed. Please run 'python -m " RuntimeError: Model not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

I followed the instructions and ran 'python -m spacy.en.download'; then:
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/en/download.py", line 58, in <module> plac.call(main) File "/usr/local/lib/python2.7/dist-packages/plac-0.9.6-py2.7.egg/plac_core.py", line 328, in call cmd, result = parser.consume(arglist) File "/usr/local/lib/python2.7/dist-packages/plac-0.9.6-py2.7.egg/plac_core.py", line 207, in consume return cmd, self.func(*(args + varargs + extraopts), **kwargs) File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/en/download.py", line 42, in main package = sputnik.install(about.__name__, about.__version__, about.__default_model__) File "/home/allen.wu/.local/lib/python2.7/site-packages/sputnik/__init__.py", line 44, in install archive = cache.fetch(package_name) File "/home/allen.wu/.local/lib/python2.7/site-packages/sputnik/cache.py", line 57, in fetch package = self.get(package_string) File "/home/allen.wu/.local/lib/python2.7/site-packages/sputnik/package_list.py", line 61, in get raise PackageNotFoundException(package_string) sputnik.package_list.PackageNotFoundException: en_default==1.0.5

Afterwards, I upgraded the spacy package to the latest version. It said:
Traceback (most recent call last): File "preprocess.py", line 12, in <module> from lda2vec import preprocess, Corpus File "/home/allen.wu/lda2vec-master/lda2vec/__init__.py", line 4, in <module> import preprocess File "/home/allen.wu/lda2vec-master/lda2vec/preprocess.py", line 1, in <module> import spacy File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/__init__.py", line 4, in <module> from .cli.info import info as cli_info File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/cli/__init__.py", line 1, in <module> from .download import download File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/cli/download.py", line 10, in <module> from .link import link File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/cli/link.py", line 7, in <module> from ..compat import symlink_to, path2str File "/home/allen.wu/.local/lib/python2.7/site-packages/spacy/compat.py", line 11, in <module> from thinc.neural.util import copy_array File "/home/allen.wu/.local/lib/python2.7/site-packages/thinc/neural/__init__.py", line 1, in <module> from ._classes.model import Model File "/home/allen.wu/.local/lib/python2.7/site-packages/thinc/neural/_classes/model.py", line 12, in <module> from ..train import Trainer File "/home/allen.wu/.local/lib/python2.7/site-packages/thinc/neural/train.py", line 3, in <module> from .optimizers import Adam, SGD, linear_decay File "optimizers.pyx", line 13, in init thinc.neural.optimizers File "ops.pyx", line 52, in init thinc.neural.ops AttributeError: 'module' object has no attribute 'PinnedMemoryPool'

I've been stuck on this for a few days. Has anyone succeeded in running this example?

document inference

hi,

could you provide example for predicting new document topics or vectors from trained model?

thanks.

tokenize error

I have been following your instructions to test lda2vec, but I got an error when I tried to run this line:
tokens, vocab = preprocess.tokenize(texts,max_length,tag=False,parse=False,entity=False)

runfile('/Users/lm/Dropbox/Athena/Feature_Reduction/WordVectors/lda2vec_test.py', wdir='/Users/m/Dropbox/Athena/Feature_Reduction/WordVectors')
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('/Users/lm/Dropbox/Athena/Feature_Reduction/WordVectors/lda2vec_test.py', wdir='/Users/lm/Dropbox/Athena/Feature_Reduction/WordVectors')

  File "/Users/lm/Documents/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "/Users/lm/Documents/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 81, in execfile
    builtins.execfile(filename, *where)

  File "/Users/lm/Dropbox/Athena/Feature_Reduction/WordVectors/lda2vec_test.py", line 29, in <module>
    tokens, vocab = preprocess.tokenize(texts, max_length, tag=False, parse=False, entity=False)

  File "build/bdist.macosx-10.5-x86_64/egg/lda2vec/preprocess.py", line 65, in tokenize
    nlp = English(data_dir=data_dir)

  File "/Users/lm/Documents/anaconda/lib/python2.7/site-packages/spacy/language.py", line 210, in __init__
    vocab = self.default_vocab(package)

  File "/Users/lm/Documents/anaconda/lib/python2.7/site-packages/spacy/language.py", line 144, in default_vocab
    return Vocab.from_package(package, get_lex_attr=get_lex_attr)

  File "spacy/vocab.pyx", line 65, in spacy.vocab.Vocab.from_package (spacy/vocab.cpp:3592)
    with package.open(('vocab', 'strings.json')) as file_:

  File "/Users/lm/Documents/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/Users/lm/Documents/anaconda/lib/python2.7/site-packages/sputnik/package_stub.py", line 68, in open
    raise default(self.file_path(*path_parts))

IOError: /Users/lm/Documents/anaconda/lib/python2.7/site-packages/spacy/en/data/vocab/strings.json

I have updated the related modules (numpy, spacy, ...) to the newest versions, but I still get this error.
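A likely remedy, going by the error message spacy itself prints in another issue on this page: the English model data isn't installed, and on spacy of this vintage it is fetched with

python -m spacy.en.download

after which strings.json should exist under spacy/en/data.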

Using in Tensorflow

Is it possible to use this in TensorFlow? I am asking because we might need to run it in a multi-cluster setup.

Is there any example provided?

IndexError: Error calculating span: Can't find end

Running on OS X 10.11.6
$ python --version
Python 2.7.11 :: Anaconda custom (x86_64)

$ python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.macosx-10.5-x86_64/egg/lda2vec/preprocess.py", line 78, in tokenize
    # Chop timestamps into days
  File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3955)
  File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5105)
IndexError: Error calculating span: Can't find end

Related to:
#38

Getting JSON SPACY Issue

I get the error below. I tried to force-reinstall spacy, but that didn't fix the problem.


IOError Traceback (most recent call last)
in <module>()
22 max_length = 10000 # Limit of 10k words per document
23 tokens, vocab = preprocess.tokenize(texts, max_length, tag=False,
---> 24 parse=False, entity=False)
25 corpus = Corpus()
26 # Make a ranked list of rare vs frequent words

//anaconda/lib/python2.7/site-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in tokenize(texts, max_length, skip, attr, **kwargs)
63 if nlp is None:
64 data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
---> 65 nlp = English(data_dir=data_dir)
66 data = np.zeros((len(texts), max_length), dtype='int32')
67 data[:] = skip

//anaconda/lib/python2.7/site-packages/spacy/language.pyc in __init__(self, data_dir, vocab, tokenizer, tagger, parser, entity, matcher, serializer, load_vectors, package)
229
230 if vocab in (None, True):
--> 231 vocab = self.default_vocab(package)
232 self.vocab = vocab
233 if tokenizer in (None, True):

//anaconda/lib/python2.7/site-packages/spacy/language.pyc in default_vocab(cls, package, get_lex_attr)
163 get_lex_attr = cls.default_lex_attrs()
164 if hasattr(package, 'dir_path'):
--> 165 return Vocab.from_package(package, get_lex_attr=get_lex_attr)
166 else:
167 return Vocab.load(package, get_lex_attr)

//anaconda/lib/python2.7/site-packages/spacy/vocab.pyx in spacy.vocab.Vocab.from_package (spacy/vocab.cpp:3592)()
63 lemmatizer=lemmatizer, serializer_freqs=serializer_freqs)
64
---> 65 with package.open(('vocab', 'strings.json')) as file_:
66 self.strings.load(file_)
67 self.load_lexemes(package.file_path('vocab', 'lexemes.bin'))

//anaconda/lib/python2.7/contextlib.pyc in __enter__(self)
     15     def __enter__(self):
     16         try:
---> 17             return self.gen.next()
18 except StopIteration:
19 raise RuntimeError("generator didn't yield")

//anaconda/lib/python2.7/site-packages/sputnik/package_stub.pyc in open(self, path_parts, mode, encoding, default)
65 else:
66 if isinstance(default, type) and issubclass(default, Exception):
---> 67 raise default(self.file_path(*path_parts))
68 elif isinstance(default, Exception):
69 raise default

IOError: //anaconda/lib/python2.7/site-packages/spacy/en/data/vocab/strings.json

Documentation outdated

The README for the hacker_news example refers to a model.py and a visualize.py.
I couldn't find those files.
