ucla-scai / source-lda
Source-LDA: Enhancing probabilistic topic models using prior knowledge sources (ICDE 2017)
License: GNU Affero General Public License v3.0
My apologies for yet another question (actually, two), but I wanted to be able to apply the model to an unseen document, and so I was digging into the algorithm at the end of last week. As I understand it, one of the key differences between a mixed model and source-LDA is the estimation of lambda, which represents the extent to which the distribution of a topic differs from the distribution of the topic's knowledge source. I can't find lambda in the source-LDA script--it appears to use the algorithm that the paper describes for a mixed model. If I'm missing it, where should I be looking?
Also, it occurred to me that scaling might be a problem for estimation of the known topics: as far as I can tell, for a given word in a given topic, the formula uses the count for that word in that topic's knowledge source (in the numerator) and the total count for that word in the entire knowledge source. Like most algorithms of this sort, LDA uses raw counts rather than proportions, since the denominators in the proportions divide out. Don't raw frequencies pose a problem, though, for knowledge sources that are much bigger or smaller than the articles in the corpus?
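To make the concern concrete, here is a toy Python sketch (my own names and numbers, not the repo's code): with raw counts, two knowledge-source articles that give a word the same relative frequency contribute very different pseudo-counts when the articles differ in length, while proportions rescaled to a common length do not.

```python
def topic_word_weight(ks_count, corpus_count, beta=0.01):
    # Hypothetical Dirichlet-style weight: the knowledge-source count acts
    # as a pseudo-count added to the corpus count (plus smoothing beta).
    return corpus_count + ks_count + beta

# The same word is 5% of a 10,000-word source article and 5% of a
# 100-word one, yet the raw counts differ by two orders of magnitude.
long_src = topic_word_weight(ks_count=500, corpus_count=3)
short_src = topic_word_weight(ks_count=5, corpus_count=3)

# Rescaling each source to a common length (here 100 hypothetical
# pseudo-words) removes the length effect.
scale = 100
norm_long = topic_word_weight(ks_count=scale * 500 / 10000, corpus_count=3)
norm_short = topic_word_weight(ks_count=scale * 5 / 100, corpus_count=3)
```

With raw counts the long article dominates the weight; after rescaling, both articles contribute identically, which is what I would expect if only the relative frequency is supposed to matter.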
After a couple of months of working with the algorithm (including both modifying these scripts and writing another application that implements it with online variational Bayes), I think I see a problem with the math in the "Pop_sample" and "Populate_prob" functions.
The use of the probability density function to allow variation in the extent to which each topic matches the knowledge source is very clever; however, I don't think the code accomplishes what was intended, because, when calculating the "pr" vector, it aggregates the values for all A intervals for each known topic before sampling each word.
The effect of performing the sampling in this fashion is that no information is extracted that would let the algorithm weight the estimation toward some intervals and away from others (which, I believe, is the point of using the probability density function in the first place). It looks like each interval should instead be treated as a separate topic (though the "calculate_phi" function can probably stay as it is, since it isn't used in the estimation).
Mind you, breaking up the known topics in this way would reduce the statistical power of the algorithm for the known topics--and with small corpora, that could weight the results toward the (undivided, hence more accurately estimable) unknown/arbitrary topics.
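To show what I mean, here is a hypothetical sketch (the function names and the explicit random draw `r` are mine, not the repo's) contrasting the two schemes: summing the A lambda intervals into one weight per known topic before sampling, versus giving each (topic, interval) pair its own entry in the sampling vector so the draw also identifies an interval.

```python
def sample_aggregated(pr_per_interval, r):
    # Collapse the A intervals of each known topic into a single weight and
    # draw a topic; which interval matched the word is lost.
    weights = [sum(intervals) for intervals in pr_per_interval]
    threshold = r * sum(weights)
    cum = 0.0
    for k, w in enumerate(weights):
        cum += w
        if cum >= threshold:
            return k  # topic index only

def sample_per_interval(pr_per_interval, r):
    # Treat each (topic, interval) pair as a separate outcome, so the draw
    # also records which lambda interval the word supported.
    flat = [(k, a, w)
            for k, intervals in enumerate(pr_per_interval)
            for a, w in enumerate(intervals)]
    threshold = r * sum(w for _, _, w in flat)
    cum = 0.0
    for k, a, w in flat:
        cum += w
        if cum >= threshold:
            return k, a  # topic index and interval index
```

Only the second scheme produces the per-interval assignments that would let the estimation concentrate mass on the intervals (i.e., the lambda values) that actually fit the data.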
Thank you for all your help to date. I have a more conceptual question now.
As I understand from the paper, two aspects distinguish the Source-LDA model from the bijective and known-mixture models:
1. Source-LDA relaxes the assumptions about phi, so that the phi of a known topic can diverge somewhat from the distribution implied by the corresponding entry in the knowledge source.
2. Source-LDA assumes that the known topics in the knowledge source are a superset of the known topics in the corpus, and therefore prunes topics as part of the estimation process.
I'm working with a knowledge source and corpus for which the first point should absolutely be true (and therefore I don't want to use the known-mixture model), but where I need (because of the end use to which the model will be put) to keep all the topics in the knowledge source.
In terms of code, this is easy to do by disabling all the calls to the "prune" function. Before I do, though, I want to make sure that the relaxed assumptions about phi can still hold if the topics aren't pruned. Is that the case?
I'm writing a new version of "Load_corpus" to handle loading a bunch of unseen documents in order to estimate theta for them using a previously estimated model. This might not be that important, but I'm curious why the "reverse" function is applied to the vector ("lines_test") holding the test set.
EDIT: Never mind; I see now you're just restoring the original order.
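For anyone else who wonders about this, here is a guess at the pattern in miniature (illustrative Python, not the repo's C++): if documents are consumed by popping from the back of a vector, storing them reversed restores the original corpus order during processing.

```python
# Hypothetical illustration: load the test set reversed, then consume it
# by popping from the back, which yields documents in their original order.
lines_test = ["doc1", "doc2", "doc3"]
lines_test.reverse()  # stored back-to-front: ["doc3", "doc2", "doc1"]

processed = []
while lines_test:
    processed.append(lines_test.pop())  # pop() removes the last element
# processed == ["doc1", "doc2", "doc3"]
```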
I have been trying to run the Reuters shell script on my laptop (in VirtualBox), and it takes over an hour to execute a single iteration of CTM (CPU goes to 100%). Is this normal? Do I need to compile GSL with ATLAS, for example? Is there anything else you would suggest to improve performance?
The logic behind your method is compelling, but I'm having trouble reproducing your analyses: the article seems a little vague on exactly what you did, and I'm not sure what formats are needed for the inputs (including the knowledge source). Where could I look for more information?