ucla-scai / source-lda
Source-LDA: Enhancing probabilistic topic models using prior knowledge sources (ICDE 2017)
License: GNU Affero General Public License v3.0
My apologies for yet another question (actually, two), but I wanted to be able to apply the model to an unseen document, and so I was digging into the algorithm at the end of last week. As I understand it, one of the key differences between a mixed model and source-LDA is the estimation of lambda, which represents the extent to which the distribution of a topic differs from the distribution of the topic's knowledge source. I can't find lambda in the source-LDA script--it appears to use the algorithm that the paper describes for a mixed model. If I'm missing it, where should I be looking?
Also, it occurred to me that scaling might be a problem for estimation of the known topics: as far as I can tell, for a given word in a given topic, the formula uses the count for that word in that topic's knowledge source (in the numerator) and the total count for that word in the entire knowledge source. Like most algorithms of this sort, LDA uses raw counts rather than proportions, since the denominators in the proportions divide out. Don't raw frequencies pose a problem, though, for knowledge sources that are much bigger or smaller than the articles in the corpus?
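To make the concern concrete, here is a toy Python sketch (my own names and numbers, not the repo's code): with raw counts, two knowledge-source articles that give a word the same relative frequency contribute very different pseudo-counts when the articles differ in length, while proportions rescaled to a common length do not.

```python
def topic_word_weight(ks_count, corpus_count, beta=0.01):
    # Hypothetical Dirichlet-style weight: the knowledge-source count acts
    # as a pseudo-count added to the corpus count (plus smoothing beta).
    return corpus_count + ks_count + beta

# The same word is 5% of a 10,000-word source article and 5% of a
# 100-word one, yet the raw counts differ by two orders of magnitude.
long_src = topic_word_weight(ks_count=500, corpus_count=3)
short_src = topic_word_weight(ks_count=5, corpus_count=3)

# Rescaling each source to a common length (here 100 hypothetical
# pseudo-words) removes the length effect.
scale = 100
norm_long = topic_word_weight(ks_count=scale * 500 / 10000, corpus_count=3)
norm_short = topic_word_weight(ks_count=scale * 5 / 100, corpus_count=3)
```

With raw counts the long article dominates the weight; after rescaling, both articles contribute identically, which is what I would expect if only the relative frequency is supposed to matter.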
After a couple of months of working with the algorithm (including both modifying these scripts and writing another application that implements it with online variational Bayes), I think I see a problem with the math in the "Pop_sample" and "Populate_prob" functions.
The use of the probability density function to allow variation in the extent to which each topic matches the knowledge source is very clever; however, I don't think the code accomplishes what was intended, because, when calculating the "pr" vector, it aggregates the values for all A intervals for each known topic before sampling each word.
The effect of performing the sampling in this fashion is that no information is extracted that would let the algorithm weight the estimation toward some intervals and away from others (which, I believe, is the point of using the probability density function in the first place). It looks like each interval should instead be treated as a separate topic (though the "calculate_phi" function can probably stay as it is, since it isn't used in the estimation).
Mind you, breaking up the known topics in this way would reduce the statistical power of the algorithm for the known topics--and with small corpora, that could weight the results toward the (undivided, hence more accurately estimable) unknown/arbitrary topics.
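To show what I mean, here is a hypothetical sketch (the function names and the explicit random draw `r` are mine, not the repo's) contrasting the two schemes: summing the A lambda intervals into one weight per known topic before sampling, versus giving each (topic, interval) pair its own entry in the sampling vector so the draw also identifies an interval.

```python
def sample_aggregated(pr_per_interval, r):
    # Collapse the A intervals of each known topic into a single weight and
    # draw a topic; which interval matched the word is lost.
    weights = [sum(intervals) for intervals in pr_per_interval]
    threshold = r * sum(weights)
    cum = 0.0
    for k, w in enumerate(weights):
        cum += w
        if cum >= threshold:
            return k  # topic index only

def sample_per_interval(pr_per_interval, r):
    # Treat each (topic, interval) pair as a separate outcome, so the draw
    # also records which lambda interval the word supported.
    flat = [(k, a, w)
            for k, intervals in enumerate(pr_per_interval)
            for a, w in enumerate(intervals)]
    threshold = r * sum(w for _, _, w in flat)
    cum = 0.0
    for k, a, w in flat:
        cum += w
        if cum >= threshold:
            return k, a  # topic index and interval index
```

Only the second scheme produces the per-interval assignments that would let the estimation concentrate mass on the intervals (i.e., the lambda values) that actually fit the data.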
Thank you for all your help to date. I have a more conceptual question now.
As I understand from the paper, two aspects distinguish the Source-LDA model from the bijective and known-mixture models:
1. Source-LDA relaxes the assumptions about phi, so that the phi of a known topic can diverge somewhat from the distribution implied by the corresponding entry in the knowledge source.
2. Source-LDA assumes that the known topics in the knowledge source are a superset of the known topics in the corpus, and therefore prunes topics as part of the estimation process.
I'm working with a knowledge source and corpus for which the first point should absolutely be true (and therefore I don't want to use the known-mixture model), but where I need (because of the end use to which the model will be put) to keep all the topics in the knowledge source.
In terms of code, this is easy to do by disabling all the calls to the "prune" function. Before I do, though, I want to make sure that the relaxed assumptions about phi can still hold if the topics aren't pruned. Is that the case?
I'm writing a new version of "Load_corpus" to handle loading a bunch of unseen documents in order to estimate theta for them using a previously estimated model. This might not be that important, but I'm curious why the "reverse" function is applied to the vector ("lines_test") holding the test set.
EDIT: Never mind; I see now you're just restoring the original order.
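For anyone else who wonders about this, here is a guess at the pattern in miniature (illustrative Python, not the repo's C++): if documents are consumed by popping from the back of a vector, storing them reversed restores the original corpus order during processing.

```python
# Hypothetical illustration: load the test set reversed, then consume it
# by popping from the back, which yields documents in their original order.
lines_test = ["doc1", "doc2", "doc3"]
lines_test.reverse()  # stored back-to-front: ["doc3", "doc2", "doc1"]

processed = []
while lines_test:
    processed.append(lines_test.pop())  # pop() removes the last element
# processed == ["doc1", "doc2", "doc3"]
```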
I have been trying to run the Reuters shell script on my laptop (in VirtualBox), and it takes over an hour to execute a single iteration of CTM (CPU goes to 100%). Is this normal? Do I need to compile GSL with ATLAS, for example? Is there anything else you would suggest to improve performance?
The logic behind your method is compelling, but I'm having trouble reproducing your analyses: the article seems a little vague on exactly what you did, and I'm not sure what formats are needed for the inputs (including the knowledge source). Where could I look for more information?