lda-project / lda
Topic modeling with latent Dirichlet allocation using Gibbs sampling
Home Page: https://lda.readthedocs.io/
License: Mozilla Public License 2.0
In my main code I configure:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
But I get lda's logger configuration:
INFO:lda:<450> log likelihood: -111, per-word: -3.4701
INFO:lda:<460> log likelihood: -111, per-word: -3.4701
INFO:lda:<470> log likelihood: -111, per-word: -3.4701
INFO:lda:<480> log likelihood: -111, per-word: -3.4701
If in lda's __init__ file I comment out line 11:
...
__version__ = pbr.version.VersionInfo('lda').version_string()
# logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())
I get the correct logger config:
2015-02-23 16:05:37,286 : INFO : <470> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,288 : INFO : <480> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,290 : INFO : <490> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,291 : INFO : <500> log likelihood: -111, per-word: -3.4701
I took this from gensim's init file.
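The two-sided pattern described above (a NullHandler inside the library, basicConfig in the application, as gensim does) can be sketched as:

```python
import logging

# Library side (what lda/__init__.py can do, following gensim): attach
# only a NullHandler so the application controls output and formatting.
logging.getLogger('lda').addHandler(logging.NullHandler())

# Application side: configure handlers and format once for the program.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
```

With this split, the library emits records but never installs its own handlers, so the application's format wins.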
python setup.py sdist
should transparently generate the C code if it's not there. This means learning more about pbr.
There are many PEP8 errors that should be fixed.
This shows up only under
but not
======================================================================
ERROR: lda.tests.test_lda_transform.TestLDATransform.test_lda_transform_basic_sparse
----------------------------------------------------------------------
testtools.testresult.real._StringException: Traceback (most recent call last):
File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/tests/test_lda_transform.py", line 85, in test_lda_transform_basic_sparse
doc_topic_test = model.transform(dtm_test)
File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/lda.py", line 174, in transform
WS, DS = lda.utils.matrix_to_lists(X)
File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/utils.py", line 44, in matrix_to_lists
if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:
File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/scipy/sparse/compressed.py", line 586, in sum
ret[major_index] = value
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
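One possible workaround (an assumption, not lda's actual fix) is to count stored entries per row from the CSR indptr array, which sidesteps the sparse .sum(axis=1) call that triggered the unsafe int64-to-int32 cast:

```python
import numpy as np
import scipy.sparse as sp

# Toy document-term matrix with an all-zero row (an empty document).
doc_word = sp.csr_matrix(np.array([[1, 0, 2],
                                   [0, 0, 0],
                                   [3, 1, 0]]))

# np.diff over indptr gives the number of stored entries in each row,
# so empty documents can be detected without a sparse row sum.
entries_per_row = np.diff(doc_word.indptr)
has_empty_doc = bool(np.any(entries_per_row == 0))
```

This only counts stored nonzeros, which is exactly what the empty-document check in matrix_to_lists needs.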
It appears there is a nice, GPL-2 ARMS sampler and slice sampler in C: http://cran.r-project.org/web/packages/SamplerCompare/index.html
written by Madeleine Thompson
Following semantic versioning guidelines at http://semver.org/ (lda's API is stable)
The rust docs have a nice model (see http://doc.rust-lang.org/index.html#guides)
e.g., do a quick start and then a more detailed example
Do you have any advice for using LDA when the number of clusters is unknown a priori?
I: pybuild base:170: python2.7 setup.py clean
error in setup command: Error parsing /tmp/buildd/lda-1.0.2/setup.cfg: OSError: [Errno 2] No such file or directory
E: pybuild pybuild:256: clean: plugin distutils failed with: exit code=1: python2.7 setup.py clean
dh_auto_clean: pybuild --clean -i python{version} -p 2.7 --dir . returned exit code 13
debian/rules:10: recipe for target 'clean' failed
some sort of pbr interaction
Is there any way or principle to achieve this goal?
Take a simple example: docA has two topic probabilities, say P(topic=1)=0.21 and P(topic=2)=0.22.
How can I produce a higher probability for docA in topic 1 or topic 2, or simply assign it to one?
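If a hard assignment is all that's needed, taking the argmax of the document-topic row is the usual approach; the numbers below are illustrative, not real model output:

```python
import numpy as np

# Toy document-topic distribution for one document over five topics.
# Near-ties like 0.21 vs 0.22 simply go to the larger value.
doc_topic = np.array([[0.21, 0.22, 0.19, 0.18, 0.20]])

# Hard-assign the document to its most probable topic.
assignment = int(np.argmax(doc_topic, axis=1)[0])
```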
Add
python setup.py build_sphinx
should generate no errors.
Thanks for an excellent LDA implementation in Cython.
I'm developing a model that extends the LDA-model and thus I need to modify the source code in this package and then run it straight from my locally modified source code. However, when trying to import the package lda
from the source directory, I run into the same problem as #25 "ImportError: No module named _lda" (I'm also using Ubuntu 14.04).
What would the best approach be to run a locally modified version of this package?
How do I train an old model with additional data rather than train a new model with all the data?
Running tox
indicates the problems.
Hi,
I stumbled across an issue when pipelining the scikit-learn CountVectorizer with the lda model. In the transform method of lda, the line X = np.atleast_2d(X)
can be problematic due to the scipy sparse matrix returned from CountVectorizer. This issue indicates that not only does the dimension conversion not work with the scipy sparse matrix, but also that there doesn't seem to be an appetite to handle it.
My work around was to wrap the pipeline object, override transform and convert the word count sparse matrix to a dense 2D before it gets to lda's transform.
def transform(self, X, **kwargs):
    x_t = self.count_vectorizer.transform(X)
    return self.lda.transform(np.asarray(x_t.todense()), **kwargs)
If there's a better way of navigating this issue, please let me know. Otherwise, it would be great to have a check put into the transform method that avoids the use of X = np.atleast_2d(X) in the case of a scipy sparse matrix.
Thanks
Rob
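One way the suggested guard could look (issparse is scipy's; the helper function itself is hypothetical, not lda's code):

```python
import numpy as np
import scipy.sparse as sp

def to_dense_2d(X):
    """Densify scipy sparse input before np.atleast_2d, which does
    not understand sparse matrices and would wrap one in a 0-d array."""
    if sp.issparse(X):
        return np.asarray(X.todense())
    return np.atleast_2d(X)

dense = to_dense_2d(sp.csr_matrix([[0, 1], [2, 0]]))
```

The same branch, placed at the top of transform, would make the pipeline wrapper above unnecessary.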
Can I represent my corpus as sparse vectors, like in gensim?
Must I fit the LDA with a dense input X of shape (n_samples, n_features)?
It seems inefficient and unnecessary to do that.
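A sketch of building a scipy CSR document-term matrix from gensim-style (term_id, count) lists; whether a given lda version accepts sparse input to fit directly is version-dependent, but the conversion itself is cheap:

```python
import scipy.sparse as sp

# Toy gensim-style corpus: each document is a list of (term_id, count)
# pairs; vocabulary size 4 is assumed for the example.
corpus = [[(0, 2), (3, 1)], [(1, 4)]]

rows, cols, vals = [], [], []
for doc_id, doc in enumerate(corpus):
    for term_id, count in doc:
        rows.append(doc_id)
        cols.append(term_id)
        vals.append(count)

# CSR document-term matrix, one row per document.
dtm = sp.csr_matrix((vals, (rows, cols)), shape=(len(corpus), 4))
```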
See the bug ogrisel/wheelhouse-uploader/issues/1
The algorithm for inferring topics on new unseen docs is mentioned in the transform
method, i.e. iterated pseudo-counts (I missed this NOTE at first).
AFAIK, however, inference on new docs is typically implemented using Gibbs sampling too. Could you please give me some brief hints on how these two methods differ in terms of accuracy and efficiency?
The numpy docstring format is supported as part of newer Sphinx releases.
dtm below is a sparse matrix of floats. The error here is unhelpful (the error for a numpy array is better).
In [34]: clf.fit(dtm)
INFO:lda:n_documents: 2740
INFO:lda:vocab_size: 50000
INFO:lda:n_words: 2739
INFO:lda:n_topics: 40
INFO:lda:n_iter: 2000
WARNING:lda:all zero column in document-term matrix found
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-34-ba91fde2c737> in <module>()
----> 1 clf.fit(dtm)
/home/ar/work/lda/lda-ariddell/lda/lda.py in fit(self, X, y)
118 Returns the instance itself.
119 """
--> 120 self._fit(X)
121 return self
122
/home/ar/work/lda/lda-ariddell/lda/lda.py in _fit(self, X)
213 random_state = lda.utils.check_random_state(self.random_state)
214 rands = self._rands.copy()
--> 215 self._initialize(X)
216 for it in range(self.n_iter):
217 # FIXME: using numpy.roll with a random shift might be faster
/home/ar/work/lda/lda-ariddell/lda/lda.py in _initialize(self, X)
257 np.testing.assert_equal(N, len(WS))
258 for i in range(N):
--> 259 w, d = WS[i], DS[i]
260 z_new = i % n_topics
261 ZS[i] = z_new
IndexError: index 0 is out of bounds for axis 0 with size 0
MCMC should yield similar or lower perplexity (train and test)
Adding a function that would estimate document-completion perplexity would be a nice feature. It has been requested on the mailing list.
Perhaps something like model.score(n_iter=N, burnin=M, method='completion')
would be roughly consistent with the scikit-learn api?
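For reference, the standard relationship between per-word log likelihood and perplexity (plain math, not lda's API; the value is the per-word figure from the progress log earlier in this page):

```python
import math

# perplexity = exp(-per-word log likelihood)
per_word_loglik = -3.4701
perplexity = math.exp(-per_word_loglik)  # roughly 32
```

A score method like the one proposed would estimate the held-out per-word log likelihood and could report either quantity.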
Maybe I see an error in file lda.py line 261:
N = int(X.sum())
N means the number of words, but here it is the sum of all indices. Maybe it should instead sum the length of each element in X, i.e. sum([len(i) for i in X])?
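For what it's worth, X here is a document-term count matrix, so X.sum() really is the total number of word tokens rather than a sum of indices; a toy check:

```python
import numpy as np

# Each entry X[d, w] is how many times word w occurs in document d,
# so summing all entries counts every token in the corpus.
X = np.array([[2, 0, 1],    # doc 0: 3 tokens
              [0, 4, 0]])   # doc 1: 4 tokens
N = int(X.sum())            # total tokens across the corpus
```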
After running pip install lda,
I go to the Python interpreter and get:
>>> import lda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lda/__init__.py", line 7, in <module>
from lda.lda import LDA # noqa
File "lda/lda.py", line 10, in <module>
import lda._lda
ImportError: No module named _lda
Alternatively, I cloned the repo and ran python setup.py build
and I got:
error: can't copy 'lda/_lda.c': doesn't exist or not a regular file
Hi,
I apologise in advance if this is not an issue with the software but our own server.
I installed and used lda on my local Mac (with anaconda) without problems, both with pip install and the install from source via make + python setup.py install. However, neither appears to work on our linux server (with anaconda). The lda library appears to be installed, i.e.
In [1]: import l
lda libfuturize linecache llpython llvm_array llvmpy logging
lib2to3 libpasteurize linuxaudiodev llvm llvm_cbuilder locale lxml
but when I actually load it, I get
In [1]: import lda
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-4b4ad2b47765> in <module>()
----> 1 import lda
/homes/matthiasm/code/lda/lda/__init__.py in <module>()
5 import pbr.version
6
----> 7 from lda.lda import LDA # noqa
8
9 __version__ = pbr.version.VersionInfo('lda').version_string()
/homes/matthiasm/code/lda/lda/lda.py in <module>()
8 import numpy as np
9
---> 10 import lda._lda
11 import lda.utils
12
ImportError: No module named _lda
Any hints to how I might resolve this?
Cheers,
Matthias
I need to save the fitted model so that I can reload it to predict on new data next time.
Can we persist an LDA model for later re-use? What's the simplest way of doing that?
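A minimal sketch using the standard library; a fitted lda.LDA instance holds plain Python/NumPy state, so pickle.dump(model, open('model.pkl', 'wb')) should work the same way (joblib is another common choice for NumPy-heavy objects). The dict below just stands in for a real model:

```python
import io
import pickle

# Stand-in for a fitted model's state; a real workflow would dump the
# model object itself to a file opened in binary mode.
state = {'doc_topic_': [[0.9, 0.1]], 'n_topics': 2}

buf = io.BytesIO()        # in practice: open('model.pkl', 'wb')
pickle.dump(state, buf)   # persist
buf.seek(0)
restored = pickle.load(buf)  # reload later for prediction
```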
Just an informal note in the README. For example, "lda is 0.5x as fast as MALLET".
Something went wrong with the sdist pre hook. I removed it from the 1.0 release but it should be there as a useful check during the release process.
Hi!
I'm trying to install the HEAD version using pip from repo but it gives an error:
(my_env) 00:37:31 ~/Documents/codes/my_env $ pip install -e git+https://github.com/ariddell/lda.git@master\#egg\=lda
Obtaining lda from git+https://github.com/ariddell/lda.git@master#egg=lda
Updating /home/paulo/.virtualenvs/my_env/src/lda clone (to master)
Requirement already satisfied (use --upgrade to upgrade): pbr!=0.7,<1.0,>=0.6 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): numpy<2.0,>=1.6.1 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): pip in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from pbr!=0.7,<1.0,>=0.6->lda)
Installing collected packages: lda
Running setup.py develop for lda
Complete output from command /home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps:
running develop
running egg_info
writing pbr to lda.egg-info/pbr.json
writing lda.egg-info/PKG-INFO
writing dependency_links to lda.egg-info/dependency_links.txt
writing top-level names to lda.egg-info/top_level.txt
writing requirements to lda.egg-info/requires.txt
[pbr] Processing SOURCES.txt
warning: LocalManifestMaker: standard file '-c' not found
[pbr] In git context, generating filelist from git
warning: no files found matching 'AUTHORS'
warning: no files found matching 'ChangeLog'
warning: no previously-included files found matching '.gitreview'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
reading manifest template 'MANIFEST.in'
warning: no files found matching 'AUTHORS'
warning: no files found matching 'ChangeLog'
warning: no previously-included files found matching '.gitignore'
warning: no previously-included files found matching '.gitreview'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
writing manifest file 'lda.egg-info/SOURCES.txt'
running build_ext
building 'lda._lda' extension
gcc -pthread -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/paulo/.pyenv/versions/3.4.3/include/python3.4m -c lda/_lda.c -o build/temp.linux-x86_64-3.4/lda/_lda.o
gcc: error: lda/_lda.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4
----------------------------------------
Command "/home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/paulo/.virtualenvs/my_env/src/lda
I also tried to follow the [installation instructions for Linux](http://pythonhosted.org/lda/installation.html#linux) and copy the contents:
cp -R /usr/lib/python3/dist-packages/lda* ~/.virtualenvs/my_env/lib/python3.4/site-packages/
but the same error happens. I'm stuck now.
I'm using Ubuntu 14.04.3, Python 3.4.3 (from yyuu/pyenv) and virtualenv.
Thanks!
i.e., Wallach et al. NIPS 2009
Make the API slightly more consistent with scikit-learn, see comments on scikit-learn lda branch
e.g., coveralls
For those interested in compiling by hand on Windows, there are instructions on Stack Overflow: https://stackoverflow.com/a/2838827/121704
This library has been hugely useful to us. Would it be possible to work with scipy's sparse matrices? It doesn't look like scipy and numpy play well together with their matrices; for example, numpy.sum doesn't work with scipy sparse matrices.
Any ideas?
Thanks again.
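Until sparse input is supported end to end, calling the sparse matrix's own .sum method (rather than numpy.sum) and converting the result is a reliable pattern:

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[1, 0],
                            [2, 3]]))

# The matrix's own .sum returns a 1 x n np.matrix; flatten it into an
# ordinary 1-d ndarray for downstream numpy code.
col_totals = np.asarray(X.sum(axis=0)).ravel()
```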
Converting from ldac to dtm is rather slow.
It should probably take an n_iter
as an argument.
The reuters sample corpus is included in the test data, so there's nothing to download.
Just sort theta
and relabel everything. Label switching is confusing.
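A sketch of one canonical relabelling: order topics by their total mass in theta and permute the columns accordingly (toy numbers, not real output):

```python
import numpy as np

# Toy document-topic matrix theta: rows are documents, columns topics.
theta = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.5, 0.3]])

# Sort topics by descending total mass, then relabel by permuting
# columns; every run then presents topics in the same canonical order.
order = np.argsort(-theta.sum(axis=0))
theta_relabelled = theta[:, order]
```

The same permutation would need to be applied to the topic-word matrix so the labels stay consistent.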
To double-check, the derivation is in Cowans' thesis.
logging
messages do not show up in IPython notebook.
At minimum, a brief "Monitoring the sampler" section of the documentation could describe the problem.
Alternatively, provide a traceplot of the complete loglikelihoods? à la Stan or PyMC?
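A possible user-side workaround, assuming the messages are emitted but lost: attach an explicit StreamHandler pointed at sys.stdout, which notebook cells do capture:

```python
import logging
import sys

# Route lda's progress messages to stdout so they appear in the
# notebook cell output instead of disappearing.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(asctime)s : %(message)s'))

lda_logger = logging.getLogger('lda')
lda_logger.setLevel(logging.INFO)
lda_logger.addHandler(handler)
```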
This might stabilize label-switching and is easier than implementing truncated HDP.
Cython wraps the C++ std::unordered_map -- perhaps this could be used to speed up gammaln calls? I believe hca
uses this strategy at certain points.
Subject says it all.