infvoclda's Introduction

InfVocLDA

InfVocLDA is a Latent Dirichlet Allocation topic modeling package based on variational Bayesian learning in the online setting, developed by the Cloud Computing Research Team at the [University of Maryland, College Park](http://www.umd.edu). You can find more details about this project in our paper [Online Latent Dirichlet Allocation with Infinite Vocabulary](http://kzhai.github.io/paper/2013_icml.pdf), which appeared in ICML 2013.

Please download the latest version from our GitHub repository.

Please send any bugs or problems to Ke Zhai ([email protected]).

Install and Build

This package depends on several external Python libraries, such as numpy, scipy, and nltk. After downloading the source code package, unzip the datasets into the 'input' directory. The package includes a few standard datasets: ap, de-news, and 20-newsgroup.
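
Before launching, it can help to confirm that the libraries named above are importable. The following is a minimal sketch, not part of the package: the helper name `missing_modules` is hypothetical, and the list only covers the three libraries mentioned here, so other dependencies may exist.

```python
# Hypothetical helper (not part of InfVocLDA): report which of the
# libraries named in this README cannot be imported.
REQUIRED = ("numpy", "scipy", "nltk")


def missing_modules(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All listed libraries are importable.")
```

Run it with the same interpreter you intend to use for the launch commands, since each interpreter has its own set of installed libraries.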

Launch and Execute

First, change to the source code directory

cd InfVocLDA/src

To launch the online LDA with pre-defined vocabulary, run the following command

python -m fixvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=20-news --number_of_topics=10 --number_of_documents=18600 --batch_size=100

To launch the online LDA with dynamic vocabulary, run the following command

python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000

In any case, you can get help information and usage hints by running the following commands

python -m fixvoc.launch --help
python -m infvoc.launch --help

infvoclda's People

Contributors

kzhai

infvoclda's Issues

How to evaluate whether the output is good or bad?

I got output after running the program. The output file has the same words in every topic. I expected the words to change over time as it runs on the news dataset. I want to share my output here.
[image]
I am also sharing the gamma distribution here.
hyper_parameter :
input_directory=../../input/de-news
corpus_name=de-news
dictionary_file=None
number_of_documents=9800
number_of_topics=10
truncation_level=1000
vocab_prune_interval=2
snapshot_interval=10
batch_size=100
online_iterations=98
tau=64.0
kappa=0.75
alpha_theta=0.1
alpha_beta=1000.0

I want to know how to evaluate this output, and why the words are not changing over time here.

batch_size, e_step function

Hello, I would really like to add your LDA with infinite vocabulary to the Creme library (an online machine learning library): https://github.com/creme-ml/creme

I have a question about the e_step function. You initialize the batch_size variable from the length of the wordids variable; however, a few lines later, your comment suggests that batch_size is an integer that refers to the number of documents.

Here len(wordids) = number of words in the document if you set batch_size to 1.

Did you intentionally initialize the batch_size variable from len(wordids)? Should it be batch_size = self._batch_size?

def e_step(self, wordids, directory=None):
    batch_size = len(wordids);

Your comment:

# Now, for each document document_index update that document's phi_d for every words
for document_index in xrange(batch_size):

Thank you in advance for your feedback to confirm that batch_size = len(wordids)

Raphaël
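
One reading that would reconcile the code with its comment (an assumption, not confirmed by the author in this thread): if wordids holds one entry per document in the mini-batch, each entry being that document's list of word ids, then len(wordids) is the number of documents. A minimal sketch with made-up data:

```python
# Assumption: wordids is a list with one entry per document, and each
# entry is that document's list of word ids. The values are invented
# for illustration only.
wordids = [
    [0, 3, 7],     # document 0
    [2, 3],        # document 1
    [1, 4, 5, 9],  # document 2
]

# Under that layout, the outer length counts documents, not words.
batch_size = len(wordids)
tokens_per_document = [len(document) for document in wordids]

# A single document passed as a flat list would be miscounted:
single_document = [0, 3, 7]
flat_length = len(single_document)       # counts words
wrapped_length = len([single_document])  # counts documents
```

Under this assumption, a mini-batch of one document would need to be passed as a list containing one list, otherwise len(wordids) returns the word count as described above.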

AttributeError: 'FreqDist' object has no attribute 'inc'

ubgpu@ubgpu:~/github/InfVocLDA/src$ python -m fixvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=20-news --number_of_topics=10 --number_of_documents=18600 --batch_size=100
successfully load all training documents...
successfully load all the words from ../input/20-news/voc.dat...
========== ========== ========== ========== ==========
output_directory=../output/20-news/15Jun17-223315-fixvoc-D18600-K10-I10-B100-O186-t64-k0.6-at0.1-ae1.22546e-05-False-False/
input_directory=../input/20-news
corpus_name=20-news
dictionary_file=../input/20-news/voc.dat
number_of_documents=18600
number_of_topics=10
snapshot_interval=10
batch_size=100
online_iterations=186
tau=64.0
kappa=0.6
alpha_theta=0.1
alpha_eta=1.22546016029e-05
hybrid_mode=False
hash_oov_words=False
========== ========== ========== ========== ==========
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/ubgpu/github/InfVocLDA/src/fixvoc/launch.py", line 222, in <module>
    main()
  File "/home/ubgpu/github/InfVocLDA/src/fixvoc/launch.py", line 189, in main
    olda.export_beta(os.path.join(output_directory, 'exp_beta-0'), 50);
  File "fixvoc/inferencer.py", line 89, in export_beta
    freqdist.inc(word, self._exp_E_log_beta[k, self._vocab[word]]);
AttributeError: 'FreqDist' object has no attribute 'inc'
ubgpu@ubgpu:~/github/InfVocLDA/src$
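
This error is consistent with NLTK 3, in which FreqDist became a subclass of collections.Counter and the inc method was removed; counts are updated by item assignment instead. Below is a minimal sketch of the replacement pattern. It uses collections.Counter as a stand-in so that it runs without NLTK, and the helper name `add_weight` and the sample words are hypothetical.

```python
from collections import Counter


def add_weight(freqdist, sample, weight):
    """Equivalent of the old freqdist.inc(sample, weight) call."""
    freqdist[sample] += weight
    return freqdist


# Counter stands in for nltk.FreqDist here (an assumption made so the
# sketch runs without NLTK installed).
freqdist = Counter()
add_weight(freqdist, "topic_word", 0.25)
add_weight(freqdist, "topic_word", 0.5)
add_weight(freqdist, "other_word", 0.125)

# most_common works the same way on a Counter-based FreqDist.
top_words = [word for word, _ in freqdist.most_common(1)]
```

Applied to the failing line, the call would presumably become `freqdist[word] += self._exp_E_log_beta[k, self._vocab[word]]`, though this has not been tested against the package.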

launch.py: error: no such option: --desired_truncation_level

ubgpu@ubgpu:~/github/InfVocLDA/src$ python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --desired_truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
Usage: launch.py [options]

launch.py: error: no such option: --desired_truncation_level
ubgpu@ubgpu:~/github/InfVocLDA/src$

What is word_trace here?

I am a little confused about the word_trace matrix. Can anyone explain what its role is?

launch doesn't work and index issues

Hi,

I am running the instructions on Windows 10 with Python 3.6. After editing the "print" statements inside launch (because print changed in Python 3), I keep getting the following error after launching the command:

Traceback (most recent call last):
  File "C:\Users\Jorge Castillo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Users\Jorge Castillo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 153, in _get_module_details
    code = loader.get_code(mod_name)
  File "", line 781, in get_code
  File "", line 741, in source_to_code
  File "", line 219, in _call_with_frames_removed
  File "C:\Users\Jorge Castillo(...)\InfVocLDA-master\src\infvoc\launch.py", line 232
    gamma_path = os.path.join(output_directory, "gamma.txt")
    ^
SyntaxError: invalid syntax

I'm launching the command (following the instructions) in the Windows cmd, from the source directory (i.e., InfVocLDA-master\src), and I keep getting this error, so I can't get it installed. Also, replacing "python" with "py" doesn't fix the error. What am I doing wrong? Thanks for the help.

launch issues

Hi,
What are the recommended Python and nltk versions to run this project?

At first, python complained about missing GoodTuringProbDist. I provided SimpleGoodTuringProbDist.

Now getting:

python launch.py --input_directory=../../input/ --output_directory=../../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
========== ========== ========== ========== ==========
output_directory=../../output/de-news/15Sep18-223048-D9800-K10-T4000-P10-I10-B100-O98-t64-k0.75-at0.1-ab1000/
input_directory=../../input/de-news
corpus_name=de-news
dictionary_file=None
number_of_documents=9800
number_of_topics=10
truncation_level=4000
vocab_prune_interval=10
snapshot_interval=10
batch_size=100
online_iterations=98
tau=64.0
kappa=0.75
alpha_theta=0.1
alpha_beta=1000.0
========== ========== ========== ========== ==========
successfully load all training documents...
Traceback (most recent call last):
  File "launch.py", line 248, in <module>
    main()
  File "launch.py", line 189, in main
    import hybrid;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/hybrid.py", line 13, in <module>
    import nchar;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/nchar.py", line 13, in <module>
    from nltk.util import ingrams
ImportError: cannot import name ingrams
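
This ImportError is consistent with NLTK 3, where nltk.util.ingrams no longer exists and nltk.util.ngrams returns a lazy iterator instead. Below is a minimal compatibility sketch, assuming nchar.py only needs an iterator over n-grams (this has not been verified against the package source). The pure-Python fallback is added only so the sketch runs without NLTK.

```python
try:
    from nltk.util import ingrams  # NLTK 2.x name
except ImportError:
    try:
        # NLTK 3.x: ngrams yields tuples lazily, like the old ingrams.
        from nltk.util import ngrams as ingrams
    except ImportError:
        # Fallback so the sketch runs without NLTK (illustration only).
        def ingrams(sequence, n):
            items = list(sequence)
            for start in range(len(items) - n + 1):
                yield tuple(items[start:start + n])


# Character bigrams of a short string.
bigrams = list(ingrams("abcd", 2))
```

Replacing the import in nchar.py with the first two branches would keep the rest of the module unchanged under either NLTK version, provided the assumption above holds.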
