LF-LDA and LF-DMM latent feature topic models

The implementations of the LF-LDA and LF-DMM latent feature topic models, as described in my TACL paper:

Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313, 2015. [.bib] [Datasets] [Example: 20Newsgroups topics, top-50 words]

The implementations of the LDA and DMM topic models are available at http://jldadmm.sourceforge.net/

Usage

This section describes how to use the implementations from the command line or terminal, using the pre-compiled LFTM.jar file.

It is expected that Java 1.7+ is already available on the command line or terminal (for example, by adding Java to the PATH environment variable on Windows).

The pre-compiled LFTM.jar file and the source code are in the jar and src folders, respectively. Users can recompile the source code by simply running ant (it is expected that ant is already installed). In addition, users can find input examples in the test folder.

File format of input topic-modeling corpus

Similar to the corpus.txt file in the test folder, each line in the input topic-modeling corpus represents a document. Here, a document is a sequence of words/tokens separated by whitespace characters. Users should preprocess the input topic-modeling corpus before training the topic models, for example: down-casing, removing non-alphabetic characters and stop-words, and removing words shorter than 3 characters or appearing fewer than a certain number of times.
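
For illustration, here is a minimal preprocessing sketch in Java (the file names raw_corpus.txt and corpus.txt are placeholders, and stop-word removal is omitted for brevity):

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class PreprocessCorpus {
    public static void main(String[] args) throws Exception {
        List<String[]> docs = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("raw_corpus.txt"))) {
            // down-case and keep alphabetic characters only
            String[] tokens = line.toLowerCase().replaceAll("[^a-z]+", " ").trim().split("\\s+");
            docs.add(tokens);
            for (String t : tokens) counts.merge(t, 1, Integer::sum);
        }
        try (PrintWriter out = new PrintWriter("corpus.txt")) {
            for (String[] tokens : docs) {
                StringBuilder sb = new StringBuilder();
                for (String t : tokens)
                    // drop words shorter than 3 characters or occurring fewer than 5 times
                    if (t.length() >= 3 && counts.get(t) >= 5) sb.append(t).append(' ');
                String doc = sb.toString().trim();
                if (!doc.isEmpty()) out.println(doc); // skip documents that become empty
            }
        }
    }
}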

Format of input word-vector file

Similar to the wordVectors.txt file in the test folder, each line in the input word-vector file starts with a word type, followed by its vector representation.

To obtain vector representations of words, users can either use pre-trained word vectors learned from large external corpora OR train word vectors on the input topic-modeling corpus itself.

When using pre-trained word vectors learned from large external corpora, users must remove from the input topic-modeling corpus any words that are not found in the input word-vector file.
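
A minimal sketch of this filtering step (file names are placeholders; documents that become empty after filtering are dropped, to avoid blank lines in the corpus):

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class FilterByVectors {
    public static void main(String[] args) throws Exception {
        // collect the vocabulary of the word-vector file (first token of each line)
        Set<String> vocab = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("wordVectors.txt")))
            vocab.add(line.trim().split("\\s+", 2)[0]);
        try (PrintWriter out = new PrintWriter("corpus.filtered.txt")) {
            for (String line : Files.readAllLines(Paths.get("corpus.txt"))) {
                StringBuilder sb = new StringBuilder();
                for (String token : line.trim().split("\\s+"))
                    if (vocab.contains(token)) sb.append(token).append(' ');
                String doc = sb.toString().trim();
                // drop documents that become empty, to avoid blank lines in the corpus
                if (!doc.isEmpty()) out.println(doc);
            }
        }
    }
}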

Some sets of the pre-trained word vectors can be found at:

Word2Vec: https://code.google.com/p/word2vec/

GloVe: http://nlp.stanford.edu/projects/glove/

If the input topic-modeling corpus is very domain-specific, the domain of the external corpus (from which the word vectors are derived) should not be too different from that of the input topic-modeling corpus. For example, when working in the biomedical domain, users may use Word2Vec or GloVe to learn 50- or 100-dimensional word vectors on the large external MEDLINE corpus instead of using the pre-trained Word2Vec or GloVe word vectors.
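
For example, a hypothetical invocation of the original word2vec tool to learn 100-dimensional skip-gram vectors in the plain-text format expected here (the corpus and output file names are placeholders):

$ ./word2vec -train medline.txt -output medlineVectors.txt -cbow 0 -size 100 -window 5 -min-count 5 -binary 0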

Training LF-LDA and LF-DMM

$ java [-Xmx2G] -jar jar/LFTM.jar -model <LFLDA_or_LFDMM> -corpus <Input_corpus_file_path> -vectors <Input_vector_file_path> [-ntopics <int>] [-alpha <double>] [-beta <double>] [-lambda <double>] [-initers <int>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]

where hyper-parameters in [ ] are optional.

  • -model: Specify the topic model: LFLDA or LFDMM.

  • -corpus: Specify the path to the input training corpus file.

  • -vectors: Specify the path to the file containing word vectors.

  • -ntopics <int>: Specify the number of topics. The default value is 20.

  • -alpha <double>: Specify the hyper-parameter alpha. Following [1, 2], the default value is 0.1.

  • -beta <double>: Specify the hyper-parameter beta. The default value is 0.01. Following [2], you might also want to try a beta value of 0.1 for short texts.

  • -lambda <double>: Specify the mixture weight lambda (0.0 < lambda <= 1.0). Set lambda to 1.0 to obtain the best topic coherence. For document clustering/classification evaluation, fine-tune this parameter to obtain the highest scores if you have time; otherwise, try both 0.6 and 1.0 (as a rule of thumb, use lambda = 0.6 for normal text corpora and 1.0 for short-text corpora).

  • -initers <int>: Specify the number of initial sampling iterations to separate the counts for the latent feature component and the Dirichlet multinomial component. The default value is 2000.

  • -niters <int>: Specify the number of sampling iterations for the latent feature topic models. The default value is 200.

  • -twords <int>: Specify the number of the most probable topical words. The default value is 20.

  • -name <String>: Specify a name for the topic-modeling experiment. The default value is "model".

  • -sstep <int>: Specify a step to save the sampling output (-sstep value < -niters value). The default value is 0 (i.e. only saving the output from the last sample).

NOTE: topic vectors are learned in parallel, so run the LFTM code on a multi-CPU/multi-core machine to obtain a significantly faster training process; for example, use a multi-core computer, or set the number of CPUs requested for a remote job equal to the number of topics.

Examples:

$ java -jar jar/LFTM.jar -model LFLDA -corpus test/corpus.txt -vectors test/wordVectors.txt -ntopics 4 -alpha 0.1 -beta 0.01 -lambda 0.6 -initers 500 -niters 50 -name testLFLDA

Basically, with this command we run 500 LDA sampling iterations (i.e., -initers 500) for initialization and then run 50 LF-LDA sampling iterations (i.e., -niters 50). The output files are saved in the same folder as the input training corpus file, in this case the test folder. This produces the output files testLFLDA.theta, testLFLDA.phi, testLFLDA.topWords, testLFLDA.topicAssignments and testLFLDA.paras, containing the document-to-topic distributions, topic-to-word distributions, top topical words, topic assignments and model hyper-parameters, respectively. Similarly, we perform:

$ java -jar jar/LFTM.jar -model LFDMM -corpus test/corpus.txt -vectors test/wordVectors.txt -ntopics 4 -alpha 0.1 -beta 0.1 -lambda 1.0 -initers 500 -niters 50 -name testLFDMM

This produces the output files testLFDMM.theta, testLFDMM.phi, testLFDMM.topWords, testLFDMM.topicAssignments and testLFDMM.paras.

In the LF-LDA and LF-DMM latent feature topic models, a word is generated either by the latent feature topic-to-word component OR by the topic-to-word Dirichlet multinomial component. In the implementation, instead of using a binary selection variable to record this, I simply add the number of topics to the actual topic assignment value. For example, with 20 topics, the output topic assignment 3 23 4 4 24 3 23 3 23 3 23 for a document means that the first word in the document is generated from topic 3 by the latent feature topic-to-word component. The second word is also generated from topic 3 (since 23 - 20 = 3), but by the topic-to-word Dirichlet multinomial component. The same applies to the remaining words in the document.
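
The following sketch (illustrative only; K is the -ntopics value used in training) decodes one line of a .topicAssignments file according to this convention:

public class DecodeAssignments {
    public static void main(String[] args) {
        int K = 20; // the -ntopics value used in training
        String line = "3 23 4 4 24 3 23 3 23 3 23"; // one line of <name>.topicAssignments
        for (String s : line.trim().split("\\s+")) {
            int v = Integer.parseInt(s);
            int topic = v < K ? v : v - K; // actual topic index
            String component = v < K ? "latent feature" : "Dirichlet multinomial";
            System.out.println("topic " + topic + " via the " + component + " component");
        }
    }
}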

Document clustering evaluation

Here, we treat each topic as a cluster, and we assign every document the topic with the highest probability given the document. To get the clustering scores of Purity and normalized mutual information, we perform:

$ java -jar jar/LFTM.jar -model Eval -label <Golden_label_file_path> -dir <Directory_path> -prob <Document-topic-prob/Suffix>

  • -label: Specify the path to the ground-truth label file. Each line in this label file contains the gold label of the corresponding document in the input training corpus. See the corpus.LABEL and corpus.txt files in the test folder.

  • -dir: Specify the path to the directory containing document-to-topic distribution files.

  • -prob: Specify the name of a document-to-topic distribution file, or a common suffix of a group of such files, in the specified directory.

Examples:

The command $ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob testLFLDA.theta will produce the clustering score for the testLFLDA.theta file.

The command $ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob testLFDMM.theta will produce the clustering score for the testLFDMM.theta file.

The command $ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob theta will produce the clustering scores for all the document-to-topic distribution files whose names end with theta. In this case, the distribution files are testLFLDA.theta and testLFDMM.theta. It also reports the mean and standard deviation of the clustering scores.
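
For reference, the clustering step can also be reproduced outside the tool. The sketch below (illustrative only; the built-in Eval mode above remains the authoritative implementation) assigns each document its highest-probability topic from a .theta file and computes Purity against the gold labels; NMI is omitted for brevity:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class PurityFromTheta {
    public static void main(String[] args) throws Exception {
        List<String> theta = Files.readAllLines(Paths.get("test/testLFLDA.theta"));
        List<String> labels = Files.readAllLines(Paths.get("test/corpus.LABEL"));
        // cluster id -> (gold label -> count)
        Map<Integer, Map<String, Integer>> clusters = new HashMap<>();
        for (int d = 0; d < theta.size(); d++) {
            String[] probs = theta.get(d).trim().split("\\s+");
            int best = 0; // topic with the highest probability for document d
            for (int k = 1; k < probs.length; k++)
                if (Double.parseDouble(probs[k]) > Double.parseDouble(probs[best])) best = k;
            clusters.computeIfAbsent(best, c -> new HashMap<>())
                    .merge(labels.get(d).trim(), 1, Integer::sum);
        }
        int correct = 0; // documents carrying the majority label of their cluster
        for (Map<String, Integer> cluster : clusters.values())
            correct += Collections.max(cluster.values());
        System.out.printf("Purity = %.4f%n", (double) correct / theta.size());
    }
}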

Inference of topic distribution on unseen corpus

To infer topics on an unseen/new corpus using a pre-trained LF-LDA/LF-DMM topic model, we perform:

$ java -jar jar/LFTM.jar -model <LFLDAinf_or_LFDMMinf> -paras <Hyperparameter_file_path> -corpus <Unseen_corpus_file_path> [-initers <int>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]

  • -paras: Specify the path to the hyper-parameter file produced by the pre-trained LF-LDA/LF-DMM topic model.

Examples:

$ java -jar jar/LFTM.jar -model LFLDAinf -paras test/testLFLDA.paras -corpus test/corpus_test.txt -initers 500 -niters 50 -name testLFLDAinf

$ java -jar jar/LFTM.jar -model LFDMMinf -paras test/testLFDMM.paras -corpus test/corpus_test.txt -initers 500 -niters 50 -name testLFDMMinf
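
Assuming the inference run writes its <name>.theta file next to the unseen corpus, as training does (an assumption; check your output folder), each line of that file gives one unseen document's topic distribution. A small sketch to read it and report the most likely topic per document:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MostLikelyTopic {
    public static void main(String[] args) throws Exception {
        List<String> rows = Files.readAllLines(Paths.get("test/testLFLDAinf.theta"));
        for (int d = 0; d < rows.size(); d++) {
            String[] p = rows.get(d).trim().split("\\s+");
            int best = 0; // index of the highest-probability topic
            for (int k = 1; k < p.length; k++)
                if (Double.parseDouble(p[k]) > Double.parseDouble(p[best])) best = k;
            System.out.println("doc " + d + " -> topic " + best + " (p = " + p[best] + ")");
        }
    }
}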

Acknowledgments

The LF-LDA and LF-DMM implementations use utilities including the LBFGS implementation from the MALLET toolkit, the random number generator from the Java version of MersenneTwister, the Parallel.java utility from the Mines Java Toolkit, and a Java command-line arguments parser. I would like to thank the authors of these utilities for sharing their code.

References

[1] Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. 2011. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14:178–203.

[2] Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.

lftm's Issues

InvalidOptimizableException: zero slope ???

Hi Dat Quoc, thanks for your great work!

I have tried to perform topic modelling on my twitter corpus using LFDMM:
java -jar jar/LFTM.jar -model LFDMM -corpus fsd/preprocessed_tweets.txt -vectors embeddings/glove.twitter.word2vec.27B.100d.txt -ntopics 100 -name LFDMM -lambda 1.0 -twords 10

But I keep getting cc.mallet.optimize.InvalidOptimizableException: Slope = 0.0 is zero exceptions:

Initial sampling iteration: 2000
LFDMM sampling iteration: 1
Estimating topic vectors ...
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
LFDMM sampling iteration: 2
Estimating topic vectors ...
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
L-BFGS initial gradient is zero; saying converged
cc.mallet.optimize.InvalidOptimizableException: Slope = 0.0 is zero
at cc.mallet.optimize.BackTrackLineSearch.optimize(BackTrackLineSearch.java:112)
at utility.LBFGS.optimize(Unknown Source)
at models.LFDMM$1.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:719)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:719)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at utility.Parallel$LoopIntAction.compute(Unknown Source)
at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

Why? Do I need to use different settings for alpha and beta instead of the default ones? LFLDA seems to work fine.

Thanks.

ArrayIndexOutOfBounds Error When Training New Corpus

I face the following error when I am trying to train a new pre-processed corpus:
java.lang.ArrayIndexOutOfBoundsException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at java.util.concurrent.ForkJoinTask.getThrowableException(Unknown Source)
        at java.util.concurrent.ForkJoinTask.reportException(Unknown Source)
        at java.util.concurrent.ForkJoinTask.join(Unknown Source)
        at java.util.concurrent.ForkJoinPool.invoke(Unknown Source)
        at utility.Parallel.loop(Unknown Source)
        at utility.Parallel.loop(Unknown Source)
        at models.LFLDA.optimizeTopicVectors(Unknown Source)
        at models.LFLDA.inference(Unknown Source)
        at LFTM.main(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
        at models.TopicVectorOptimizer.<init>(Unknown Source)
        at models.LFLDA$1.compute(Unknown Source)
        at utility.Parallel$LoopIntAction.compute(Unknown Source)
        at utility.Parallel$LoopIntAction.compute(Unknown Source)
        at utility.Parallel$LoopIntAction.compute(Unknown Source)
        at java.util.concurrent.RecursiveAction.exec(Unknown Source)
        at java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(Unknown Source)
        at java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
        at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

The dataset I used is attached in this post. Can someone help me with this? I read the previous issue and made sure there are no empty lines in my corpus.

Please help.

Thank you so much guys.

Best regards,
Kelvin
LFDLA-Challenges-Full.txt

word2id Vocabulary

Thank you for sharing your work.

Is there any way to get the word2id or id2word for the vocabulary? In jLDADMM, you had the script write out a .vocabulary file. However, there is no corresponding output for this project.

I need the word2id as I am using the topic-word distribution to do some custom keyword scoring. Without the right vocabulary order, I have no idea which word each column in the matrix refers to.

How would I be able to get the word2id in this case? After a quick exploration, I know it is definitely not following the word2id for the word embeddings file.

EDIT:
I realised that you have a function writeDictionary() for writing the word2id to a file, but it is not used in the write() function. I think it would be a good idea to include it.

The same goes for the writeTopicVectors() function. I believe users would benefit from having access to those. I have recompiled the jar file after making changes to include them, and they give me the expected outputs.

LFDMM prediction is a vector of NaN for long documents

Hi Dat Quoc,

Many thanks for your work.

I ran the LFDMM algorithm on my corpus, which mixes short documents and long documents. Checking the LFDMM.theta file, I found that for a long document containing more than 76 words, the result for that document in LFDMM.theta is a list of NaN:
"NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN"

Is there any threshold on document length in the LFDMM algorithm? I looked at the source code but still cannot figure out where this happens.

Hope to see your response.
Thanks, Dat Quoc

Inconsistent handling of blank lines in input corpora

I ran into a (minor) issue using this tool to work with newspaper archive data.

In constructing my corpus, the process of removing words that did not have corresponding word-vectors resulted in empty lines in my input corpus.

The DMM model worked on the corpus without a problem, suggesting that there is a working mechanism in the code for handling this situation.

However, when I attempted to run DMMinf using the resulting model, I received a fatal error:

Error: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:657)
        at java.util.ArrayList.get(ArrayList.java:433)
        at models.LFDMM_Inf.sampleSingleInitialIteration(Unknown Source)
        at models.LFDMM_Inf.inference(Unknown Source)
        at LFTM.main(Unknown Source)

The obvious solution to my problem is to fix my corpus-producing code, and make sure I don't feed empty lines into the DMMinf model.

But I post the issue here in case a future user runs into the same issue, or in case you would like to fix a minor bug in your otherwise excellent tool.

Get the probability of a document belonging to a topic

Suppose I have a training set of tweets, as in test/corpus.txt. It's straightforward how to create the topic clusters.
Now, I have a test set (in one file) and I want to get the probability of each tweet (each line of the file) belonging to one of the topic clusters found in the first step.

Example: From the testLFLDA.topWords you have:

Topic0: iphone great siri ios time awesome amazing day loving yeah shows pretty store year love job million macbook phone mango

Topic1: android nexus cream ice sandwich ics samsung phone search good galaxy nice works iphone smart mango screen windows awesome beautiful

Topic2: facebook love free retweets users world application ios work blackberry technology today feel power mac show fucking impressive email working

Topic3: windows good people lol facebook bookcase haven back sleep agree social great man shit ipad text wow happy store cloud

If I have a tweet I enjoy using siri in my iphone, I would expect a result such as [0.5, 0.1, 0.3, 0.1], where each value is the probability for topic0, topic1, etc.

I don't have any gold labels and I don't need any labels. Is that possible? If yes, how?

How to evaluate topic coherence?

Hi, I'm a little confused about how to compute the NPMI score used to evaluate topic coherence in your paper. Is there any code for this evaluation?

What is the time complexity of LFTM?

I am interested in LFTM, and I find that this model takes a lot of time compared with other topic models. So I would like to know the time complexity of this method.
Thanks.

How do I use pre-trained word vectors?

Hi,
first of all, thanks a lot for your implementation of this method! Unfortunately, I am quite new to Java, so I first tried to run your code on the test example, which worked.

Now I want to apply topic modelling to my own corpus of 4800 short texts. I have them in the .txt file format with 1 line per text. I named it corpus.txt.

I would like to use pre-trained vectors from Word2Vec. The file I get from them is a .bin.gz file. However, in your example, you have a .txt file containing the word vectors.

How do I now change this to use the pre-trained vectors? I tried java -jar jar/LFTM.jar -model LFLDA -corpus test/corpus.txt -vectors test/knowledge-vectors-skipgram1000.bin -ntopics 4 -alpha 0.1 -beta 0.01 -lambda 0.6 -initers 500 -niters 50 -name testLFLDA but this gets me:

Reading topic modeling corpus: test/corpus.txt
Reading word vectors from word-vectors file test/knowledge-vectors-skipgram1000.bin...
java.lang.ArrayIndexOutOfBoundsException: 1
        at models.LFLDA.readWordVectorsFile(Unknown Source)
        at models.LFLDA.<init>(Unknown Source)
        at LFTM.main(Unknown Source)
The word "ordered" doesn't have a corresponding vector!!!
Error: null
java.lang.Exception
        at models.LFLDA.readWordVectorsFile(Unknown Source)
        at models.LFLDA.<init>(Unknown Source)
        at LFTM.main(Unknown Source)

The word "ordered" is the first word in my texts, so it does not seem to work at all. Same if I try the bin.gz file.

Could you help me out in that matter?
Thanks a lot!
