
lda: Topic modeling with latent Dirichlet allocation


NOTE: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and is tested on Linux, OS X, and Windows.

You can read more about lda in the documentation.

Installation

pip install lda

Getting started

lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.

The following demonstrates how to inspect a model of a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)

Requirements

Python ≥3.10 and NumPy.

Caveat

lda aims for simplicity. (It happens to be fast, as essential parts are written in C via Cython.) If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.

Notes

Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000). Inference using collapsed Gibbs sampling is described in Griffiths and Steyvers (2004).

License

lda is licensed under Version 2.0 of the Mozilla Public License.

Contributors

ariddell, b-trout, ghuls, katrinleinweber, luoshao23, riddella, severinsimmler, tdhopper


lda's Issues

Scipy int64 / int32 error on OS X Python 3.3

This shows up only under

  • OS X, Clang, Python 3.3.5, Numpy 1.7.1

but not

  • OS X, Clang, Python 3.4.3, Numpy 1.7.1
======================================================================
ERROR: lda.tests.test_lda_transform.TestLDATransform.test_lda_transform_basic_sparse
----------------------------------------------------------------------
testtools.testresult.real._StringException: Traceback (most recent call last):

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/tests/test_lda_transform.py", line 85, in test_lda_transform_basic_sparse
    doc_topic_test = model.transform(dtm_test)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/lda.py", line 174, in transform
    WS, DS = lda.utils.matrix_to_lists(X)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/utils.py", line 44, in matrix_to_lists
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/scipy/sparse/compressed.py", line 586, in sum
    ret[major_index] = value

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

Is it a bug?

I may have found an error in lda.py at line 261:
N = int(X.sum())
N is supposed to be the number of words, but this is the sum of all the values in X. Shouldn't it instead sum the length of each element in X, i.e. sum(len(i) for i in X)?
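Note that X is a document-term matrix of per-document word counts (see Getting started above), so X.sum() does total the word tokens; a minimal numpy sketch:

```python
import numpy as np

# A tiny document-term matrix: 2 documents, 3 vocabulary terms.
# Entry [d, w] is how many times term w occurs in document d.
X = np.array([[2, 0, 1],
              [0, 3, 1]])

# X.sum() is the total number of word tokens in the corpus,
# which is exactly what N is meant to count.
N = int(X.sum())  # 2 + 1 + 3 + 1 = 7 tokens
```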

Improve error reporting when user passes sparse float matrix

dtm below is a sparse matrix of floats. The error here is unhelpful (the message is better when a numpy array is passed).

In [34]: clf.fit(dtm)
INFO:lda:n_documents: 2740
INFO:lda:vocab_size: 50000
INFO:lda:n_words: 2739
INFO:lda:n_topics: 40
INFO:lda:n_iter: 2000
WARNING:lda:all zero column in document-term matrix found
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-ba91fde2c737> in <module>()
----> 1 clf.fit(dtm)

/home/ar/work/lda/lda-ariddell/lda/lda.py in fit(self, X, y)
    118             Returns the instance itself.
    119         """
--> 120         self._fit(X)
    121         return self
    122 

/home/ar/work/lda/lda-ariddell/lda/lda.py in _fit(self, X)
    213         random_state = lda.utils.check_random_state(self.random_state)
    214         rands = self._rands.copy()
--> 215         self._initialize(X)
    216         for it in range(self.n_iter):
    217             # FIXME: using numpy.roll with a random shift might be faster

/home/ar/work/lda/lda-ariddell/lda/lda.py in _initialize(self, X)
    257         np.testing.assert_equal(N, len(WS))
    258         for i in range(N):
--> 259             w, d = WS[i], DS[i]
    260             z_new = i % n_topics
    261             ZS[i] = z_new

IndexError: index 0 is out of bounds for axis 0 with size 0

Can't install on Ubuntu using pip install -e

Hi!

I'm trying to install the HEAD version using pip from repo but it gives an error:

(my_env) 00:37:31 ~/Documents/codes/my_env $ pip install -e git+https://github.com/ariddell/lda.git@master\#egg\=lda 
Obtaining lda from git+https://github.com/ariddell/lda.git@master#egg=lda
  Updating /home/paulo/.virtualenvs/my_env/src/lda clone (to master)
Requirement already satisfied (use --upgrade to upgrade): pbr!=0.7,<1.0,>=0.6 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): numpy<2.0,>=1.6.1 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): pip in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from pbr!=0.7,<1.0,>=0.6->lda)
Installing collected packages: lda
  Running setup.py develop for lda
    Complete output from command /home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing pbr to lda.egg-info/pbr.json
    writing lda.egg-info/PKG-INFO
    writing dependency_links to lda.egg-info/dependency_links.txt
    writing top-level names to lda.egg-info/top_level.txt
    writing requirements to lda.egg-info/requires.txt
    [pbr] Processing SOURCES.txt
    warning: LocalManifestMaker: standard file '-c' not found

    [pbr] In git context, generating filelist from git
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    reading manifest template 'MANIFEST.in'
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitignore'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    writing manifest file 'lda.egg-info/SOURCES.txt'
    running build_ext
    building 'lda._lda' extension
    gcc -pthread -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/paulo/.pyenv/versions/3.4.3/include/python3.4m -c lda/_lda.c -o build/temp.linux-x86_64-3.4/lda/_lda.o
    gcc: error: lda/_lda.c: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.
    error: command 'gcc' failed with exit status 4

    ----------------------------------------
Command "/home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/paulo/.virtualenvs/my_env/src/lda

I also tried to follow the [installation instructions for Linux](http://pythonhosted.org/lda/installation.html#linux) and copy the contents:

cp -R /usr/lib/python3/dist-packages/lda* ~/.virtualenvs/my_env/lib/python3.4/site-packages/

but the same error occurs. I'm stuck now.

I'm using Ubuntu 14.04.3, Python 3.4.3 (from yyuu/pyenv) and virtualenv.

Thanks!

Support scipy sparse matrices

This library has been hugely useful to us. Would it be possible for it to work with scipy's sparse matrices? scipy and numpy don't seem to play well together here; for example, numpy.sum doesn't work with scipy sparse matrices.

Any ideas?

Thanks again.
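As the README above notes, recent versions of lda accept sparse input directly; for versions that predate this, densifying the matrix is a workaround for corpora small enough to fit in memory. A sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small document-term matrix stored sparsely.
dense = np.array([[2, 0, 1],
                  [0, 3, 1]])
X = csr_matrix(dense)

# Recent lda versions accept X directly: lda.LDA(...).fit(X).
# For older versions, densify first (feasible only for small corpora):
X_dense = np.asarray(X.todense())
```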

Document-completion perplexity

Adding a function that would estimate document-completion perplexity would be a nice feature. It has been requested on the mailing list.

Perhaps something like model.score(n_iter=N, burnin=M, method='completion') would be roughly consistent with the scikit-learn api?
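Until such a method exists, a rough whole-corpus perplexity can be derived from the model's total log likelihood (assuming it is available, e.g. via model.loglikelihood()) and the token count X.sum(); this is not document-completion perplexity, just the standard exponentiated negative per-word log likelihood:

```python
import numpy as np

# Hypothetical numbers standing in for a fitted model:
# total log likelihood (e.g. model.loglikelihood()) and token count (X.sum()).
loglikelihood = -84010 * 3.47
n_tokens = 84010

# Perplexity = exp(-loglik / N): lower is better.
perplexity = np.exp(-loglikelihood / n_tokens)
```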

How to run from modified source?

Thanks for an excellent LDA implementation in Cython.

I'm developing a model that extends the LDA-model and thus I need to modify the source code in this package and then run it straight from my locally modified source code. However, when trying to import the package lda from the source directory, I run into the same problem as #25 "ImportError: No module named _lda" (I'm also using Ubuntu 14.04).

What would the best approach be to run a locally modified version of this package?

iterated pseudo-counts in transform_single

The algorithm for inferring topics on new, unseen documents is mentioned in the transform method: iterated pseudo-counts (I missed this NOTE at first).

AFAIK, however, inference on a new document is typically implemented with Gibbs sampling as well. Could you give me some brief hints on how these two methods differ in terms of accuracy and efficiency?
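For readers unfamiliar with the term, here is a hedged numpy sketch of what iterated pseudo-count inference looks like (an illustration, not lda's actual implementation): alternate soft token-to-topic assignments with a re-estimate of the document-topic distribution, holding the topic-word distributions fixed.

```python
import numpy as np

def transform_single(doc_words, topic_word, n_iter=20):
    """Infer a topic distribution for one document via iterated
    pseudo-counts, holding topic_word fixed.

    doc_words  -- array of word ids, one entry per token
    topic_word -- (n_topics, n_vocab) topic-word probabilities
    """
    n_topics = topic_word.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)   # start uniform
    for _ in range(n_iter):
        # soft responsibility of each topic for each token
        phi = topic_word[:, doc_words] * theta[:, None]
        phi /= phi.sum(axis=0, keepdims=True)
        # pseudo-counts -> updated document-topic distribution
        theta = phi.sum(axis=1)
        theta /= theta.sum()
    return theta

# Toy model: 2 topics over a 3-word vocabulary.
topic_word = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.1, 0.8]])
theta = transform_single(np.array([0, 0, 2]), topic_word)
```

Compared with running a Gibbs sampler on the new document, this deterministic scheme avoids sampling noise and extra burn-in iterations, at the cost of finding only a point estimate.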

Debian/Ubuntu ppa build broken

I: pybuild base:170: python2.7 setup.py clean
error in setup command: Error parsing /tmp/buildd/lda-1.0.2/setup.cfg: OSError: [Errno 2] No such file or directory
E: pybuild pybuild:256: clean: plugin distutils failed with: exit code=1: python2.7 setup.py clean
dh_auto_clean: pybuild --clean -i python{version} -p 2.7 --dir . returned exit code 13
debian/rules:10: recipe for target 'clean' failed

some sort of pbr interaction

Installation fails (using pip and cloning the repo)

After running pip install lda, I went to the Python interpreter and got

>>> import lda
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lda/__init__.py", line 7, in <module>
    from lda.lda import LDA  # noqa
  File "lda/lda.py", line 10, in <module>
    import lda._lda
ImportError: No module named _lda

Alternatively, I cloned the repo, ran python setup.py build, and got

error: can't copy 'lda/_lda.c': doesn't exist or not a regular file

"ImportError: No module named _lda" on linux

Hi,

I apologise in advance if this is not an issue with the software but our own server.

I installed and used lda on my local Mac (with anaconda) without problems, both with pip install and the install from source via make + python setup.py install. However, neither appears to work on our linux server (with anaconda). The lda library appears to be installed, i.e.

In [1]: import l
lda            libfuturize    linecache      llpython       llvm_array     llvmpy         logging        
lib2to3        libpasteurize  linuxaudiodev  llvm           llvm_cbuilder  locale         lxml  

but when I actually load it, I get

In [1]: import lda
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-4b4ad2b47765> in <module>()
----> 1 import lda

/homes/matthiasm/code/lda/lda/__init__.py in <module>()
      5 import pbr.version
      6 
----> 7 from lda.lda import LDA  # noqa
      8 
      9 __version__ = pbr.version.VersionInfo('lda').version_string()

/homes/matthiasm/code/lda/lda/lda.py in <module>()
      8 import numpy as np
      9 
---> 10 import lda._lda
     11 import lda.utils
     12 

ImportError: No module named _lda

Any hints to how I might resolve this?

Cheers,
Matthias

not support sparse vectors?

Can I represent my corpus as sparse vectors, as in gensim?

Must I fit the LDA with input X of shape (n_samples, n_features)? That seems inefficient and unnecessary.
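A gensim-style corpus (each document a list of (word_id, count) pairs) can be converted to the (n_samples, n_features) document-term matrix that lda.LDA.fit expects; a minimal sketch:

```python
import numpy as np

# gensim-style corpus: each document is a list of (word_id, count) pairs
corpus = [[(0, 2), (2, 1)],
          [(1, 3), (2, 1)]]
n_vocab = 3

# Build the dense (n_samples, n_features) document-term matrix.
# (A scipy.sparse matrix built the same way also works, per the README.)
X = np.zeros((len(corpus), n_vocab), dtype=np.int64)
for d, doc in enumerate(corpus):
    for word_id, count in doc:
        X[d, word_id] = count
```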

Persisting a model

Can we persist an LDA model for later re-use? What's the simplest way of doing that?
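A fitted lda.LDA model holds its state in plain numpy arrays (doc_topic_, components_, etc.), so pickling should be the simplest route; a sketch using a stand-in object in place of a real fitted model:

```python
import pickle
import numpy as np

class FittedModel:
    """Stand-in for a fitted lda.LDA model (hypothetical, for illustration)."""
    def __init__(self):
        self.doc_topic_ = np.array([[0.9, 0.1]])

model = FittedModel()

# Persist the model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and restore it later.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
```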

Fix documentation

  • running python setup.py build_sphinx should generate no errors.
  • README.rst needs to validate.

logger configuration overrides project's logger

in my main code I configure:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

But I get lda's logger configuration:

INFO:lda:<450> log likelihood: -111, per-word: -3.4701
INFO:lda:<460> log likelihood: -111, per-word: -3.4701
INFO:lda:<470> log likelihood: -111, per-word: -3.4701
INFO:lda:<480> log likelihood: -111, per-word: -3.4701

If, in lda's __init__ file, I comment out the logging.basicConfig call on line 11:

...
__version__ = pbr.version.VersionInfo('lda').version_string()

# logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())

I get the correct logger config:

2015-02-23 16:05:37,286 : INFO : <470> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,288 : INFO : <480> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,290 : INFO : <490> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,291 : INFO : <500> log likelihood: -111, per-word: -3.4701

I took this from gensim's init file.
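Until the library stops calling logging.basicConfig at import time, one application-side workaround (a sketch) is to clear the root logger's handlers after importing lda and then apply your own configuration:

```python
import logging

# Assume `import lda` has already run and called logging.basicConfig(...).
# Remove any handlers it installed on the root logger...
root = logging.getLogger()
for handler in list(root.handlers):
    root.removeHandler(handler)

# ...then apply the application's own configuration.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)
```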

lda transform and scipy sparse matrices

Hi,

I stumbled across an issue when pipelining scikit-learn's CountVectorizer with the lda model. In lda's transform method, the line X = np.atleast_2d(X) can be problematic with the scipy sparse matrix returned by CountVectorizer. This issue indicates not only that the dimension conversion doesn't work with scipy sparse matrices, but also that there doesn't seem to be an appetite to handle it.

My work around was to wrap the pipeline object, override transform and convert the word count sparse matrix to a dense 2D before it gets to lda's transform.

def transform(self, X, **kwargs):
    x_t = self.count_vectorizer.transform(X)
    return self.lda.transform(np.asarray(x_t.todense()), **kwargs)

If there's a better way of navigating this issue, please let me know. Otherwise, it would be great to have a check in the transform method that avoids X = np.atleast_2d(X) when a scipy sparse matrix is passed.

Thanks

Rob
