
lda: Topic modeling with latent Dirichlet allocation


NOTE: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and is tested on Linux, OS X, and Windows.

You can read more about lda in the documentation.

Installation

pip install lda

Getting started

lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.

The following demonstrates how to inspect a model of a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)

Requirements

Python ≥3.10 and NumPy.

Caveat

lda aims for simplicity. (It happens to be fast, as essential parts are written in C via Cython.) If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.

Notes

Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000). Inference using collapsed Gibbs sampling is described in Griffiths and Steyvers (2004).

License

lda is licensed under Version 2.0 of the Mozilla Public License.

Contributors

ariddell, b-trout, ghuls, katrinleinweber, luoshao23, riddella, severinsimmler, tdhopper


lda's Issues

Scipy int64 / int32 error on OS X Python 3.3

This shows up only under

  • OS X, Clang, Python 3.3.5, Numpy 1.7.1

but not

  • OS X, Clang, Python 3.4.3, Numpy 1.7.1
======================================================================
ERROR: lda.tests.test_lda_transform.TestLDATransform.test_lda_transform_basic_sparse
----------------------------------------------------------------------
testtools.testresult.real._StringException: Traceback (most recent call last):

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/tests/test_lda_transform.py", line 85, in test_lda_transform_basic_sparse
    doc_topic_test = model.transform(dtm_test)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/lda.py", line 174, in transform
    WS, DS = lda.utils.matrix_to_lists(X)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/utils.py", line 44, in matrix_to_lists
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/scipy/sparse/compressed.py", line 586, in sum
    ret[major_index] = value

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

Is it a bug?

I may have found an error in lda.py at line 261:
N = int(X.sum())
N is supposed to be the number of words, but this is the sum of all the values in X. Shouldn't it instead sum the length of each element in X, i.e. sum(len(i) for i in X)?
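Note that X is a document-term matrix of per-document word counts (see Getting started above), so X.sum() does total the word tokens; a minimal numpy sketch:

```python
import numpy as np

# A tiny document-term matrix: 2 documents, 3 vocabulary terms.
# Entry [d, w] is how many times term w occurs in document d.
X = np.array([[2, 0, 1],
              [0, 3, 1]])

# X.sum() is the total number of word tokens in the corpus,
# which is exactly what N is meant to count.
N = int(X.sum())  # 2 + 1 + 3 + 1 = 7 tokens
```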

Improve error reporting when user passes sparse float matrix

dtm below is a sparse matrix of floats. The error here is unhelpful (the message is better when a numpy array is passed).

In [34]: clf.fit(dtm)
INFO:lda:n_documents: 2740
INFO:lda:vocab_size: 50000
INFO:lda:n_words: 2739
INFO:lda:n_topics: 40
INFO:lda:n_iter: 2000
WARNING:lda:all zero column in document-term matrix found
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-ba91fde2c737> in <module>()
----> 1 clf.fit(dtm)

/home/ar/work/lda/lda-ariddell/lda/lda.py in fit(self, X, y)
    118             Returns the instance itself.
    119         """
--> 120         self._fit(X)
    121         return self
    122 

/home/ar/work/lda/lda-ariddell/lda/lda.py in _fit(self, X)
    213         random_state = lda.utils.check_random_state(self.random_state)
    214         rands = self._rands.copy()
--> 215         self._initialize(X)
    216         for it in range(self.n_iter):
    217             # FIXME: using numpy.roll with a random shift might be faster

/home/ar/work/lda/lda-ariddell/lda/lda.py in _initialize(self, X)
    257         np.testing.assert_equal(N, len(WS))
    258         for i in range(N):
--> 259             w, d = WS[i], DS[i]
    260             z_new = i % n_topics
    261             ZS[i] = z_new

IndexError: index 0 is out of bounds for axis 0 with size 0

Can't install on Ubuntu using pip install -e

Hi!

I'm trying to install the HEAD version using pip from repo but it gives an error:

(my_env) 00:37:31 ~/Documents/codes/my_env $ pip install -e git+https://github.com/ariddell/lda.git@master\#egg\=lda 
Obtaining lda from git+https://github.com/ariddell/lda.git@master#egg=lda
  Updating /home/paulo/.virtualenvs/my_env/src/lda clone (to master)
Requirement already satisfied (use --upgrade to upgrade): pbr!=0.7,<1.0,>=0.6 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): numpy<2.0,>=1.6.1 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): pip in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from pbr!=0.7,<1.0,>=0.6->lda)
Installing collected packages: lda
  Running setup.py develop for lda
    Complete output from command /home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing pbr to lda.egg-info/pbr.json
    writing lda.egg-info/PKG-INFO
    writing dependency_links to lda.egg-info/dependency_links.txt
    writing top-level names to lda.egg-info/top_level.txt
    writing requirements to lda.egg-info/requires.txt
    [pbr] Processing SOURCES.txt
    warning: LocalManifestMaker: standard file '-c' not found

    [pbr] In git context, generating filelist from git
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    reading manifest template 'MANIFEST.in'
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitignore'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    writing manifest file 'lda.egg-info/SOURCES.txt'
    running build_ext
    building 'lda._lda' extension
    gcc -pthread -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/paulo/.pyenv/versions/3.4.3/include/python3.4m -c lda/_lda.c -o build/temp.linux-x86_64-3.4/lda/_lda.o
    gcc: error: lda/_lda.c: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.
    error: command 'gcc' failed with exit status 4

    ----------------------------------------
Command "/home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/paulo/.virtualenvs/my_env/src/lda

I also tried to follow the [installation instructions for Linux](http://pythonhosted.org/lda/installation.html#linux) and copy the contents:

cp -R /usr/lib/python3/dist-packages/lda* ~/.virtualenvs/my_env/lib/python3.4/site-packages/

but the same error occurs. I'm stuck now.

I'm using Ubuntu 14.04.3, Python 3.4.3 (from yyuu/pyenv) and virtualenv.

Thanks!

Support scipy sparse matrices

This library has been hugely useful to us. Would it be possible for it to work with scipy's sparse matrices? scipy and numpy don't seem to play well together here; for example, numpy.sum doesn't work with scipy sparse matrices.

Any ideas?

Thanks again.
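As the README above notes, recent versions of lda accept sparse input directly; for versions that predate this, densifying the matrix is a workaround for corpora small enough to fit in memory. A sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small document-term matrix stored sparsely.
dense = np.array([[2, 0, 1],
                  [0, 3, 1]])
X = csr_matrix(dense)

# Recent lda versions accept X directly: lda.LDA(...).fit(X).
# For older versions, densify first (feasible only for small corpora):
X_dense = np.asarray(X.todense())
```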

Document-completion perplexity

Adding a function that would estimate document-completion perplexity would be a nice feature. It has been requested on the mailing list.

Perhaps something like model.score(n_iter=N, burnin=M, method='completion') would be roughly consistent with the scikit-learn api?
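Until such a method exists, a rough whole-corpus perplexity can be derived from the model's total log likelihood (assuming it is available, e.g. via model.loglikelihood()) and the token count X.sum(); this is not document-completion perplexity, just the standard exponentiated negative per-word log likelihood:

```python
import numpy as np

# Hypothetical numbers standing in for a fitted model:
# total log likelihood (e.g. model.loglikelihood()) and token count (X.sum()).
loglikelihood = -84010 * 3.47
n_tokens = 84010

# Perplexity = exp(-loglik / N): lower is better.
perplexity = np.exp(-loglikelihood / n_tokens)
```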

How to run from modified source?

Thanks for an excellent LDA implementation in Cython.

I'm developing a model that extends the LDA-model and thus I need to modify the source code in this package and then run it straight from my locally modified source code. However, when trying to import the package lda from the source directory, I run into the same problem as #25 "ImportError: No module named _lda" (I'm also using Ubuntu 14.04).

What would the best approach be to run a locally modified version of this package?

iterated pseudo-counts in transform_single

The algorithm for inferring topics on new, unseen documents is mentioned in the transform method: iterated pseudo-counts (I missed this NOTE at first).

AFAIK, however, inference on a new document is typically implemented with Gibbs sampling as well. Could you give me some brief hints on how these two methods differ in terms of accuracy and efficiency?
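For readers unfamiliar with the term, here is a hedged numpy sketch of what iterated pseudo-count inference looks like (an illustration, not lda's actual implementation): alternate soft token-to-topic assignments with a re-estimate of the document-topic distribution, holding the topic-word distributions fixed.

```python
import numpy as np

def transform_single(doc_words, topic_word, n_iter=20):
    """Infer a topic distribution for one document via iterated
    pseudo-counts, holding topic_word fixed.

    doc_words  -- array of word ids, one entry per token
    topic_word -- (n_topics, n_vocab) topic-word probabilities
    """
    n_topics = topic_word.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)   # start uniform
    for _ in range(n_iter):
        # soft responsibility of each topic for each token
        phi = topic_word[:, doc_words] * theta[:, None]
        phi /= phi.sum(axis=0, keepdims=True)
        # pseudo-counts -> updated document-topic distribution
        theta = phi.sum(axis=1)
        theta /= theta.sum()
    return theta

# Toy model: 2 topics over a 3-word vocabulary.
topic_word = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.1, 0.8]])
theta = transform_single(np.array([0, 0, 2]), topic_word)
```

Compared with running a Gibbs sampler on the new document, this deterministic scheme avoids sampling noise and extra burn-in iterations, at the cost of finding only a point estimate.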

Debian/Ubuntu ppa build broken

I: pybuild base:170: python2.7 setup.py clean
error in setup command: Error parsing /tmp/buildd/lda-1.0.2/setup.cfg: OSError: [Errno 2] No such file or directory
E: pybuild pybuild:256: clean: plugin distutils failed with: exit code=1: python2.7 setup.py clean
dh_auto_clean: pybuild --clean -i python{version} -p 2.7 --dir . returned exit code 13
debian/rules:10: recipe for target 'clean' failed

some sort of pbr interaction

Installation fails (using pip and cloning the repo)

After running pip install lda, I went to the Python interpreter and got

>>> import lda
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lda/__init__.py", line 7, in <module>
    from lda.lda import LDA  # noqa
  File "lda/lda.py", line 10, in <module>
    import lda._lda
ImportError: No module named _lda

Alternatively, I cloned the repo, ran python setup.py build, and got

error: can't copy 'lda/_lda.c': doesn't exist or not a regular file

"ImportError: No module named _lda" on linux

Hi,

I apologise in advance if this is not an issue with the software but our own server.

I installed and used lda on my local Mac (with anaconda) without problems, both with pip install and the install from source via make + python setup.py install. However, neither appears to work on our linux server (with anaconda). The lda library appears to be installed, i.e.

In [1]: import l
lda            libfuturize    linecache      llpython       llvm_array     llvmpy         logging        
lib2to3        libpasteurize  linuxaudiodev  llvm           llvm_cbuilder  locale         lxml  

but when I actually load it, I get

In [1]: import lda
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-4b4ad2b47765> in <module>()
----> 1 import lda

/homes/matthiasm/code/lda/lda/__init__.py in <module>()
      5 import pbr.version
      6 
----> 7 from lda.lda import LDA  # noqa
      8 
      9 __version__ = pbr.version.VersionInfo('lda').version_string()

/homes/matthiasm/code/lda/lda/lda.py in <module>()
      8 import numpy as np
      9 
---> 10 import lda._lda
     11 import lda.utils
     12 

ImportError: No module named _lda

Any hints to how I might resolve this?

Cheers,
Matthias

not support sparse vectors?

Can I represent my corpus as sparse vectors, as in gensim?

Must I fit the LDA with input X of shape (n_samples, n_features)? That seems inefficient and unnecessary.
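A gensim-style corpus (each document a list of (word_id, count) pairs) can be converted to the (n_samples, n_features) document-term matrix that lda.LDA.fit expects; a minimal sketch:

```python
import numpy as np

# gensim-style corpus: each document is a list of (word_id, count) pairs
corpus = [[(0, 2), (2, 1)],
          [(1, 3), (2, 1)]]
n_vocab = 3

# Build the dense (n_samples, n_features) document-term matrix.
# (A scipy.sparse matrix built the same way also works, per the README.)
X = np.zeros((len(corpus), n_vocab), dtype=np.int64)
for d, doc in enumerate(corpus):
    for word_id, count in doc:
        X[d, word_id] = count
```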

Persisting a model

Can we persist an LDA model for later re-use? What's the simplest way of doing that?
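A fitted lda.LDA model holds its state in plain numpy arrays (doc_topic_, components_, etc.), so pickling should be the simplest route; a sketch using a stand-in object in place of a real fitted model:

```python
import pickle
import numpy as np

class FittedModel:
    """Stand-in for a fitted lda.LDA model (hypothetical, for illustration)."""
    def __init__(self):
        self.doc_topic_ = np.array([[0.9, 0.1]])

model = FittedModel()

# Persist the model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and restore it later.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
```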

Fix documentation

  • running python setup.py build_sphinx should generate no errors.
  • README.rst needs to validate.

logger configuration overrides project's logger

in my main code I configure:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

But I get lda's logger configuration:

INFO:lda:<450> log likelihood: -111, per-word: -3.4701
INFO:lda:<460> log likelihood: -111, per-word: -3.4701
INFO:lda:<470> log likelihood: -111, per-word: -3.4701
INFO:lda:<480> log likelihood: -111, per-word: -3.4701

If, in lda's __init__ file, I comment out the logging.basicConfig call on line 11:

...
__version__ = pbr.version.VersionInfo('lda').version_string()

# logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())

I get the correct logger config:

2015-02-23 16:05:37,286 : INFO : <470> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,288 : INFO : <480> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,290 : INFO : <490> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,291 : INFO : <500> log likelihood: -111, per-word: -3.4701

I took this from gensim's init file.
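Until the library stops calling logging.basicConfig at import time, one application-side workaround (a sketch) is to clear the root logger's handlers after importing lda and then apply your own configuration:

```python
import logging

# Assume `import lda` has already run and called logging.basicConfig(...).
# Remove any handlers it installed on the root logger...
root = logging.getLogger()
for handler in list(root.handlers):
    root.removeHandler(handler)

# ...then apply the application's own configuration.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)
```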

lda transform and scipy sparse matrices

Hi,

I stumbled across an issue when pipelining scikit-learn's CountVectorizer with the lda model. In lda's transform method, the line X = np.atleast_2d(X) can be problematic with the scipy sparse matrix returned by CountVectorizer. This issue indicates not only that the dimension conversion doesn't work with scipy sparse matrices, but also that there doesn't seem to be an appetite to handle it.

My work around was to wrap the pipeline object, override transform and convert the word count sparse matrix to a dense 2D before it gets to lda's transform.

def transform(self, X, **kwargs):
    x_t = self.count_vectorizer.transform(X)
    return self.lda.transform(np.asarray(x_t.todense()), **kwargs)

If there's a better way of navigating this issue, please let me know. Otherwise, it would be great to have a check in the transform method that avoids X = np.atleast_2d(X) when a scipy sparse matrix is passed.

Thanks

Rob
