kzhai / infvoclda

Online Latent Dirichlet Allocation with Infinite Vocabulary using Variational Inference

Home Page: https://github.com/kzhai/InfVocLDA

License: Apache License 2.0


infvoclda's Introduction

InfVocLDA

InfVocLDA is a Latent Dirichlet Allocation topic modeling package based on variational Bayesian learning in an online setting, developed by the Cloud Computing Research Team at the [University of Maryland, College Park](http://www.umd.edu). You can find more details about this project in our paper [Online Latent Dirichlet Allocation with Infinite Vocabulary](http://kzhai.github.io/paper/2013_icml.pdf), which appeared at ICML 2013.

Please download the latest version from our GitHub repository.

Please send any bugs or problems to Ke Zhai ([email protected]).

Install and Build

This package depends on several external Python libraries, including numpy, scipy and nltk. After downloading the source code package, unzip the datasets into the 'input' directory. The package includes a few standard datasets: ap, de-news and 20-newsgroup.
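Before launching, it can help to confirm that the three libraries named above are importable. The following is a minimal sketch, not part of the package: the helper name `missing_packages` is hypothetical, and it assumes a Python 3.4+ interpreter for `importlib.util.find_spec` (the package itself targets Python 2.7, judging by the tracebacks in the issues).

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three dependencies named in the README.
    missing = missing_packages(["numpy", "scipy", "nltk"])
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only confirms that the packages can be located; it does not verify that the installed versions are compatible with the code.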

Launch and Execute

First, change to the source code directory

cd InfVocLDA/src

To launch the online LDA with pre-defined vocabulary, run the following command

python -m fixvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=20-news --number_of_topics=10 --number_of_documents=18600 --batch_size=100

To launch the online LDA with dynamic vocabulary, run the following command

python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000

At any time, you can get help information and usage hints by running the following commands

python -m fixvoc.launch --help
python -m infvoc.launch --help


infvoclda's Issues

how to evaluate whether output is good or bad?

I got output after running the program. The output files contain the same words in each topic; I expected the words to change over time. I ran it on the news dataset and want to share my output here.

I am also sharing a gamma distribution here.
hyper_parameter :
input_directory=../../input/de-news
corpus_name=de-news
dictionary_file=None
number_of_documents=9800
number_of_topics=10
truncation_level=1000
vocab_prune_interval=2
snapshot_interval=10
batch_size=100
online_iterations=98
tau=64.0
kappa=0.75
alpha_theta=0.1
alpha_beta=1000.0

I want to know how to evaluate this output, and why the words are not changing over time.

batch_size, e_step function

Hello, I would really like to add your lda with an infinite vocabulary feature to the Creme library (online machine learning library) https://github.com/creme-ml/creme

I have a question about the e_step function: you initialize the batch_size variable from the length of the wordids variable, but a few lines later your comment suggests that batch_size is an integer that refers to the number of documents.

Here len(wordids) = number of words in the document if you set batch_size to 1.

Did you intentionally initialize the batch_size variable from len(wordids)? Should it instead be batch_size = self._batch_size?

InfVocLDA/src/infvoc/hybrid.py

Lines 279 to 280 in 05a8789

 def e_step(self, wordids, directory=None): 

 batch_size = len(wordids);

Your comment:

InfVocLDA/src/infvoc/hybrid.py

Lines 292 to 293 in 05a8789

 # Now, for each document document_index update that document's phi_d for every words 

 for document_index in xrange(batch_size):

Thank you in advance for confirming whether batch_size = len(wordids) is intended.

Raphaël
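The question above hinges on the shape of wordids. The sketch below is an illustration only and assumes that wordids holds one inner list of word ids per document, which is what the loop over document_index suggests; the repository code itself is not reproduced here, and the helper name `words_per_document` is hypothetical.

```python
def words_per_document(wordids):
    """Return the number of word ids in each document of a mini-batch."""
    # Assumption: wordids is a list with one entry (a list of ids) per document.
    batch_size = len(wordids)
    counts = []
    for document_index in range(batch_size):
        counts.append(len(wordids[document_index]))
    return counts


# A hypothetical mini-batch of three documents.
batch = [[0, 3, 3, 7], [2, 5], [1, 1, 4]]
print(len(batch))                 # number of documents in the batch
print(words_per_document(batch))  # word count per document

# A single document must still be wrapped in an outer list;
# otherwise len() counts its words rather than documents.
single = [0, 3, 3, 7]
print(len([single]), len(single))
```

Under that assumption, len(wordids) and the configured batch size agree for every full batch, and len(wordids) also stays correct for a final batch that is smaller than the configured size.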

AttributeError: 'FreqDist' object has no attribute 'inc'

ubgpu@ubgpu:/github/InfVocLDA/src$ python -m fixvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=20-news --number_of_topics=10 --number_of_documents=18600 --batch_size=100
successfully load all training documents...
successfully load all the words from ../input/20-news/voc.dat...
========== ========== ========== ========== ==========
output_directory=../output/20-news/15Jun17-223315-fixvoc-D18600-K10-I10-B100-O186-t64-k0.6-at0.1-ae1.22546e-05-False-False/
input_directory=../input/20-news
corpus_name=20-news
dictionary_file=../input/20-news/voc.dat
number_of_documents=18600
number_of_topics=10
snapshot_interval=10
batch_size=100
online_iterations=186
tau=64.0
kappa=0.6
alpha_theta=0.1
alpha_eta=1.22546016029e-05
hybrid_mode=False
hash_oov_words=False
========== ========== ========== ========== ==========
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ubgpu/github/InfVocLDA/src/fixvoc/launch.py", line 222, in <module>
main()
File "/home/ubgpu/github/InfVocLDA/src/fixvoc/launch.py", line 189, in main
olda.export_beta(os.path.join(output_directory, 'exp_beta-0'), 50);
File "fixvoc/inferencer.py", line 89, in export_beta
freqdist.inc(word, self._exp_E_log_beta[k, self._vocab[word]]);
AttributeError: 'FreqDist' object has no attribute 'inc'
ubgpu@ubgpu:/github/InfVocLDA/src$
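This error is typical of running code written for NLTK 2 against NLTK 3, where FreqDist became a subclass of collections.Counter and the inc method was removed. The sketch below shows the Counter-style replacement for a call such as freqdist.inc(word, weight). It uses collections.Counter as a stand-in so that it runs without NLTK; the helper name `top_words` and the sample weights are hypothetical.

```python
from collections import Counter


def top_words(weights, n):
    """Return the n words with the largest accumulated weight."""
    freqdist = Counter()  # stand-in for nltk.FreqDist in NLTK 3
    for word, weight in weights.items():
        # NLTK 2 style: freqdist.inc(word, weight)
        # Counter style: add the weight to the entry directly.
        freqdist[word] += weight
    return [word for word, _ in freqdist.most_common(n)]


# Hypothetical per-topic word weights.
print(top_words({"market": 0.1, "bank": 0.5, "euro": 0.3}, 2))
```

Applying the same substitution at the line shown in the traceback (fixvoc/inferencer.py, line 89) would be the corresponding change, though other NLTK 2 calls in the package may need similar updates.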

launch.py: error: no such option: --desired_truncation_level

ubgpu@ubgpu:~/github/InfVocLDA/src$ python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --desired_truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
Usage: launch.py [options]

launch.py: error: no such option: --desired_truncation_level
ubgpu@ubgpu:~/github/InfVocLDA/src$

What is word_trace here?

I am a little confused about the word_trace matrix. Can anyone explain what its role is?

launch doesn't work and index issues

Hi,

I am running the instructions on Windows 10 with Python 3.6. After editing the "print" statements inside launch (because print changed in Python 3), I keep getting the following error when I launch the command:

Traceback (most recent call last):
File "C:\Users\Jorge Castillo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Users\Jorge Castillo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 153, in _get_module_details
code = loader.get_code(mod_name)
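File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed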
File "C:\Users\Jorge Castillo(...)\InfVocLDA-master\src\infvoc\launch.py", line 232
gamma_path = os.path.join(output_directory, "gamma.txt")
^
SyntaxError: invalid syntax

I'm launching the command (following the instructions) in the Windows command prompt, from the source directory (i.e., InfVocLDA-master\src), and I get this error and cannot get the package running. Replacing "python" with "py" does not fix the error either. What am I doing wrong? Thanks for your help.

launch issues

Hi,
What are the recommended Python and nltk versions to run this project?

At first, Python complained about a missing GoodTuringProbDist, so I substituted SimpleGoodTuringProbDist.

Now getting:

python launch.py --input_directory=../../input/ --output_directory=../../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
========== ========== ========== ========== ==========
output_directory=../../output/de-news/15Sep18-223048-D9800-K10-T4000-P10-I10-B100-O98-t64-k0.75-at0.1-ab1000/
input_directory=../../input/de-news
corpus_name=de-news
dictionary_file=None
number_of_documents=9800
number_of_topics=10
truncation_level=4000
vocab_prune_interval=10
snapshot_interval=10
batch_size=100
online_iterations=98
tau=64.0
kappa=0.75
alpha_theta=0.1
alpha_beta=1000.0
========== ========== ========== ========== ==========
successfully load all training documents...
Traceback (most recent call last):
  File "launch.py", line 248, in <module>
    main()
  File "launch.py", line 189, in main
    import hybrid;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/hybrid.py", line 13, in <module>
    import nchar;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/nchar.py", line 13, in <module>
    from nltk.util import ingrams
ImportError: cannot import name ingrams
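The ingrams function existed in NLTK 2 and is no longer available in NLTK 3, where nltk.util.ngrams returns an iterator and serves the same purpose. A compatibility import along the following lines is one possible workaround; it is a sketch rather than a tested patch for nchar.py, and the pure-Python fallback at the end is an addition so that the example also runs where NLTK is not installed.

```python
try:
    # NLTK 2: ingrams is the lazy (iterator) n-gram generator.
    from nltk.util import ingrams
except ImportError:
    try:
        # NLTK 3: ngrams already returns an iterator.
        from nltk.util import ngrams as ingrams
    except ImportError:
        # Fallback without NLTK (illustrative only).
        def ingrams(sequence, n):
            """Yield consecutive n-tuples from sequence."""
            items = list(sequence)
            return zip(*[items[i:] for i in range(n)])


# Character bigrams of a short string.
print(list(ingrams("abcd", 2)))
```

With this import in place, existing calls to ingrams(sequence, n) can stay unchanged.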
