
lda's People

Contributors

ariddell, ghuls, katrinleinweber, luoshao23, riddella, severinsimmler


lda's Issues

logger configuration overrides project's logger

In my main code I configure:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

But I get lda's logger configuration:

INFO:lda:<450> log likelihood: -111, per-word: -3.4701
INFO:lda:<460> log likelihood: -111, per-word: -3.4701
INFO:lda:<470> log likelihood: -111, per-word: -3.4701
INFO:lda:<480> log likelihood: -111, per-word: -3.4701

If I comment out line 11 in lda's __init__.py:

...
__version__ = pbr.version.VersionInfo('lda').version_string()

# logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())

I get the correct logger config:

2015-02-23 16:05:37,286 : INFO : <470> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,288 : INFO : <480> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,290 : INFO : <490> log likelihood: -111, per-word: -3.4701
2015-02-23 16:05:37,291 : INFO : <500> log likelihood: -111, per-word: -3.4701

I took this approach from gensim's __init__.py.
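For completeness, a minimal sketch (not project code) of how the two sides interact once the library only attaches a NullHandler: the application's basicConfig call controls the format, and the 'lda' logger can still be tuned independently.

import logging

# Application-side configuration; with the NullHandler fix, lda's messages
# propagate to the root logger and pick up this format.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# The library's logger can still be adjusted without touching other loggers,
# e.g. to silence the per-iteration log-likelihood messages.
logging.getLogger('lda').setLevel(logging.WARNING)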

Scipy int64 / int32 error on OS X Python 3.3

This shows up only under

  • OS X, Clang, Python 3.3.5, Numpy 1.7.1

but not

  • OS X, Clang, Python 3.4.3, Numpy 1.7.1
======================================================================
ERROR: lda.tests.test_lda_transform.TestLDATransform.test_lda_transform_basic_sparse
----------------------------------------------------------------------
testtools.testresult.real._StringException: Traceback (most recent call last):

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/tests/test_lda_transform.py", line 85, in test_lda_transform_basic_sparse
    doc_topic_test = model.transform(dtm_test)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/lda.py", line 174, in transform
    WS, DS = lda.utils.matrix_to_lists(X)

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/lda/utils.py", line 44, in matrix_to_lists
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:

  File "/Users/travis/build/ariddell/lda-wheel-builder/venv/lib/python3.3/site-packages/scipy/sparse/compressed.py", line 586, in sum
    ret[major_index] = value

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

Debian/Ubuntu ppa build broken

I: pybuild base:170: python2.7 setup.py clean
error in setup command: Error parsing /tmp/buildd/lda-1.0.2/setup.cfg: OSError: [Errno 2] No such file or directory
E: pybuild pybuild:256: clean: plugin distutils failed with: exit code=1: python2.7 setup.py clean
dh_auto_clean: pybuild --clean -i python{version} -p 2.7 --dir . returned exit code 13
debian/rules:10: recipe for target 'clean' failed

Some sort of pbr interaction seems to be involved.

Fix documentation

  • running python setup.py build_sphinx should generate no errors.
  • README.rst needs to validate (a rough docutils check is sketched below).
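One rough way to check that README.rst parses cleanly is to feed it through docutils directly; this assumes docutils is installed and is not part of the project's tooling, just a sketch:

from docutils.core import publish_doctree

with open('README.rst') as f:
    source = f.read()

# halt_level=2 turns reST warnings into errors, so invalid markup fails loudly.
publish_doctree(source, settings_overrides={'halt_level': 2})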

How to run from modified source?

Thanks for an excellent LDA implementation in Cython.

I'm developing a model that extends LDA, so I need to modify the source code in this package and run it directly from my locally modified copy. However, when I try to import the package lda from the source directory, I run into the same problem as #25, "ImportError: No module named _lda" (I'm also using Ubuntu 14.04).

What would the best approach be to run a locally modified version of this package?

lda transform and scipy sparse matrices

Hi,

I stumbled across an issue when pipelining the scikit-learn CountVectorizer with the lda model. In lda's transform method, the line X = np.atleast_2d(X) can be problematic with the scipy sparse matrix returned by CountVectorizer. This issue indicates that not only does the dimension conversion fail for scipy sparse matrices, but also that there doesn't seem to be much appetite for handling them.

My workaround was to wrap the pipeline object, override transform, and convert the sparse word-count matrix to a dense 2-D array before it reaches lda's transform:

def transform(self, X, **kwargs):
    x_t = self.count_vectorizer.transform(X)
    return self.lda.transform(np.asarray(x_t.todense()), **kwargs)

If there's a better way of navigating this issue, please let me know. Otherwise, it would be great to have a check added to the transform method that avoids X = np.atleast_2d(X) when the input is a scipy sparse matrix.
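A minimal sketch of the kind of check being suggested (the helper name _ensure_2d is made up for illustration; only sp.issparse and np.atleast_2d are real APIs):

import numpy as np
import scipy.sparse as sp

def _ensure_2d(X):
    # Leave scipy sparse matrices alone (they are already 2-D);
    # only promote dense input with np.atleast_2d.
    if sp.issparse(X):
        return X
    return np.atleast_2d(np.asarray(X))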

Thanks

Rob

No support for sparse vectors?

Can I represent my corpus as sparse vectors, like in gensim?

Must I fit the LDA model with input X of shape (n_samples, n_features)? Doing that seems inefficient and unnecessary.
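For what it's worth, here is a rough sketch (not part of lda; the function name bow_to_csr is made up) of building a scipy CSR document-term matrix from a gensim-style bag-of-words corpus, which avoids materialising a dense (n_samples, n_features) array. Whether a given lda version accepts sparse input directly is a separate question; see the sparse-matrix issues elsewhere in this list.

import numpy as np
import scipy.sparse as sp

def bow_to_csr(corpus, vocab_size):
    # corpus: list of documents, each a list of (token_id, count) pairs.
    rows, cols, data = [], [], []
    for doc_id, doc in enumerate(corpus):
        for token_id, count in doc:
            rows.append(doc_id)
            cols.append(token_id)
            data.append(int(count))
    return sp.csr_matrix((data, (rows, cols)),
                         shape=(len(corpus), vocab_size), dtype=np.int64)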

iterated pseudo-counts in transform_single

The algorithm for inferring topics on new, unseen documents is mentioned in the transform method, i.e. iterated pseudo-counts (I missed this NOTE at first).

AFAIK, however, inference on new documents is typically implemented with Gibbs sampling as well. Could you give me some brief hints on how these two methods differ in terms of accuracy and efficiency?
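To make the contrast concrete, here is a rough illustration of the iterated pseudo-count idea written for this note (it is not the package's transform_single; the function name and parameters are assumptions): instead of sampling a topic assignment per word as Gibbs sampling does, each word keeps a full distribution over topics, and those soft assignments are iterated to a fixed point.

import numpy as np

def iterated_pseudo_counts(doc_word_counts, topic_word, alpha=0.1,
                           max_iter=20, tol=1e-3):
    # doc_word_counts: 1-D int array of counts for one document (len = vocab).
    # topic_word: trained topic-word distribution, shape (n_topics, vocab).
    n_topics = topic_word.shape[0]
    # Expand the document into individual word tokens.
    word_ids = np.repeat(np.arange(len(doc_word_counts)),
                         doc_word_counts.astype(int))
    # Soft (fractional) topic assignments, one row per token.
    PZS = np.full((len(word_ids), n_topics), 1.0 / n_topics)
    for _ in range(max_iter):
        # Pseudo-counts from all *other* tokens play the role that sampled
        # counts play in collapsed Gibbs sampling.
        PZS_new = topic_word[:, word_ids].T * (PZS.sum(axis=0) - PZS + alpha)
        PZS_new /= PZS_new.sum(axis=1, keepdims=True)
        converged = np.abs(PZS_new - PZS).sum() < tol
        PZS = PZS_new
        if converged:
            break
    theta = PZS.sum(axis=0) + alpha
    return theta / theta.sum()

Roughly, this trades the sampling noise of a short Gibbs chain for a deterministic approximation, so results are reproducible without averaging over samples and usually need only a handful of passes per document.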

Improve error reporting when user passes sparse float matrix

dtm below is a sparse matrix of floats. The error here is unhelpful (the error for a numpy array is better).

In [34]: clf.fit(dtm)
INFO:lda:n_documents: 2740
INFO:lda:vocab_size: 50000
INFO:lda:n_words: 2739
INFO:lda:n_topics: 40
INFO:lda:n_iter: 2000
WARNING:lda:all zero column in document-term matrix found
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-ba91fde2c737> in <module>()
----> 1 clf.fit(dtm)

/home/ar/work/lda/lda-ariddell/lda/lda.py in fit(self, X, y)
    118             Returns the instance itself.
    119         """
--> 120         self._fit(X)
    121         return self
    122 

/home/ar/work/lda/lda-ariddell/lda/lda.py in _fit(self, X)
    213         random_state = lda.utils.check_random_state(self.random_state)
    214         rands = self._rands.copy()
--> 215         self._initialize(X)
    216         for it in range(self.n_iter):
    217             # FIXME: using numpy.roll with a random shift might be faster

/home/ar/work/lda/lda-ariddell/lda/lda.py in _initialize(self, X)
    257         np.testing.assert_equal(N, len(WS))
    258         for i in range(N):
--> 259             w, d = WS[i], DS[i]
    260             z_new = i % n_topics
    261             ZS[i] = z_new

IndexError: index 0 is out of bounds for axis 0 with size 0

Document-completion perplexity

Adding a function that would estimate document-completion perplexity would be a nice feature. It has been requested on the mailing list.

Perhaps something like model.score(n_iter=N, burnin=M, method='completion') would be roughly consistent with the scikit-learn API?
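A rough sketch of what such a score could compute, under the usual definition of document completion (the function name completion_perplexity and its signature are hypothetical, not the package's API; it assumes dtm_test is a dense integer document-term matrix and that the fitted model exposes transform and topic_word_): infer the doc-topic distribution from one half of each held-out document and evaluate perplexity on the other half.

import numpy as np

def completion_perplexity(model, dtm_test, split=0.5, seed=0):
    rng = np.random.RandomState(seed)
    observed = np.zeros_like(dtm_test)
    held_out = np.zeros_like(dtm_test)
    for d, row in enumerate(dtm_test):
        # Expand counts into tokens and split them at random.
        tokens = np.repeat(np.arange(len(row)), row)
        rng.shuffle(tokens)
        cut = int(len(tokens) * split)
        np.add.at(observed[d], tokens[:cut], 1)
        np.add.at(held_out[d], tokens[cut:], 1)
    theta = model.transform(observed)              # doc-topic from observed halves
    log_pw = np.log(theta.dot(model.topic_word_))  # per-word predictive probabilities
    log_lik = (held_out * log_pw).sum()
    return np.exp(-log_lik / held_out.sum())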

Is it a bug?

I may have found an error in lda.py at line 261:
N = int(X.sum())
N is meant to be the number of words, but this is the sum of all the indices. Should it instead sum the length of each element of X, i.e. sum([len(i) for i in X])?
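For context, a small example of the two interpretations (the array X here is made up for illustration): when X is a document-term count matrix, as the package expects, X.sum() is exactly the total number of word tokens, whereas summing lengths of rows counts matrix cells instead.

import numpy as np

# Two documents over a 4-word vocabulary; entries are word counts.
X = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1]])

int(X.sum())                 # 7 == total number of word tokens
sum(len(row) for row in X)   # 8 == n_documents * vocab_size, not a token count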

Installation fails (using pip and cloning the repo)

After running pip install lda I go to the Python interpreter and get:

>>> import lda
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lda/__init__.py", line 7, in <module>
    from lda.lda import LDA  # noqa
  File "lda/lda.py", line 10, in <module>
    import lda._lda
ImportError: No module named _lda

Alternatively, I cloned the repo and ran python setup.py build, and I got:

error: can't copy 'lda/_lda.c': doesn't exist or not a regular file

"ImportError: No module named _lda" on linux

Hi,

I apologise in advance if this is not an issue with the software but our own server.

I installed and used lda on my local Mac (with anaconda) without problems, both with pip install and the install from source via make + python setup.py install. However, neither appears to work on our linux server (with anaconda). The lda library appears to be installed, i.e.

In [1]: import l
lda            libfuturize    linecache      llpython       llvm_array     llvmpy         logging        
lib2to3        libpasteurize  linuxaudiodev  llvm           llvm_cbuilder  locale         lxml  

but when I actually load it, I get

In [1]: import lda
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-4b4ad2b47765> in <module>()
----> 1 import lda

/homes/matthiasm/code/lda/lda/__init__.py in <module>()
      5 import pbr.version
      6 
----> 7 from lda.lda import LDA  # noqa
      8 
      9 __version__ = pbr.version.VersionInfo('lda').version_string()

/homes/matthiasm/code/lda/lda/lda.py in <module>()
      8 import numpy as np
      9 
---> 10 import lda._lda
     11 import lda.utils
     12 

ImportError: No module named _lda

Any hints to how I might resolve this?

Cheers,
Matthias

Persisting a model

Can we persist an LDA model for later re-use? What's the simplest way of doing that?
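Since a fitted model is an ordinary Python object holding numpy arrays, one straightforward option (a sketch, not an official API, and assuming model is a fitted LDA instance whose attributes are all picklable) is the standard library's pickle; joblib would work similarly if installed.

import pickle

# Save a fitted model.
with open('lda_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load it again later.
with open('lda_model.pkl', 'rb') as f:
    model = pickle.load(f)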

Can't install on Ubuntu using pip install -e

Hi!

I'm trying to install the HEAD version using pip from the repo, but it gives an error:

(my_env) 00:37:31 ~/Documents/codes/my_env $ pip install -e git+https://github.com/ariddell/lda.git@master\#egg\=lda 
Obtaining lda from git+https://github.com/ariddell/lda.git@master#egg=lda
  Updating /home/paulo/.virtualenvs/my_env/src/lda clone (to master)
Requirement already satisfied (use --upgrade to upgrade): pbr!=0.7,<1.0,>=0.6 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): numpy<2.0,>=1.6.1 in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from lda)
Requirement already satisfied (use --upgrade to upgrade): pip in /home/paulo/.virtualenvs/my_env/lib/python3.4/site-packages (from pbr!=0.7,<1.0,>=0.6->lda)
Installing collected packages: lda
  Running setup.py develop for lda
    Complete output from command /home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing pbr to lda.egg-info/pbr.json
    writing lda.egg-info/PKG-INFO
    writing dependency_links to lda.egg-info/dependency_links.txt
    writing top-level names to lda.egg-info/top_level.txt
    writing requirements to lda.egg-info/requires.txt
    [pbr] Processing SOURCES.txt
    warning: LocalManifestMaker: standard file '-c' not found

    [pbr] In git context, generating filelist from git
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    reading manifest template 'MANIFEST.in'
    warning: no files found matching 'AUTHORS'
    warning: no files found matching 'ChangeLog'
    warning: no previously-included files found matching '.gitignore'
    warning: no previously-included files found matching '.gitreview'
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    writing manifest file 'lda.egg-info/SOURCES.txt'
    running build_ext
    building 'lda._lda' extension
    gcc -pthread -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/paulo/.pyenv/versions/3.4.3/include/python3.4m -c lda/_lda.c -o build/temp.linux-x86_64-3.4/lda/_lda.o
    gcc: error: lda/_lda.c: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.
    error: command 'gcc' failed with exit status 4

    ----------------------------------------
Command "/home/paulo/.virtualenvs/my_env/bin/python3 -c "import setuptools, tokenize; __file__='/home/paulo/.virtualenvs/my_env/src/lda/setup.py'; exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/paulo/.virtualenvs/my_env/src/lda

I also tried to follow the [installation instructions for Linux](http://pythonhosted.org/lda/installation.html#linux) and copy the contents:

cp -R /usr/lib/python3/dist-packages/lda* ~/.virtualenvs/my_env/lib/python3.4/site-packages/

but the error still happens. I'm stuck now.

I'm using Ubuntu 14.04.3, Python 3.4.3 (from yyuu/pyenv) and virtualenv.

Thanks!

Support scipy sparse matrices

This library has been hugely useful to us. Would it be possible for it to work with scipy's sparse matrices? It doesn't look like scipy and numpy play well together with their matrices; for example, numpy.sum doesn't work with scipy sparse matrices.
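On the numpy.sum point: the sparse matrix's own .sum() method does reduce as expected, so code that needs row or column totals can use it and flatten the resulting matrix. A small illustration with plain scipy/numpy, unrelated to lda's internals:

import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[2, 0, 1],
                            [0, 0, 3]]))

row_totals = np.asarray(X.sum(axis=1)).ravel()   # array([3, 3])
col_totals = np.asarray(X.sum(axis=0)).ravel()   # array([2, 0, 4])
total = X.sum()                                  # 6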

Any ideas?

Thanks again.
