
source-lda's People

Contributors: juwood03

source-lda's Issues

Question about the Algorithm

My apologies for yet another question (actually, two), but I wanted to be able to apply the model to an unseen document, and so I was digging into the algorithm at the end of last week. As I understand it, one of the key differences between a mixed model and source-LDA is the estimation of lambda, which represents the extent to which the distribution of a topic differs from the distribution of the topic's knowledge source. I can't find lambda in the source-LDA script--it appears to use the algorithm that the paper describes for a mixed model. If I'm missing it, where should I be looking?
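
To make my reading of the paper concrete, here is a minimal sketch (C++, illustrative names only, not the repo's code) assuming lambda acts as an exponent on the knowledge-source word counts, so that lambda = 1 reproduces the source distribution exactly:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical sketch: one reading of lambda is as an exponent on the
    // knowledge-source word counts, so lambda = 1 reproduces the source
    // distribution, lambda < 1 flattens it, and lambda > 1 sharpens it.
    std::vector<double> scaled_source_phi(const std::vector<double>& src_counts,
                                          double lambda) {
        std::vector<double> phi(src_counts.size());
        double norm = 0.0;
        for (std::size_t w = 0; w < src_counts.size(); ++w) {
            phi[w] = std::pow(src_counts[w], lambda);
            norm += phi[w];
        }
        for (double& p : phi) p /= norm;  // normalize to a probability distribution
        return phi;
    }

If that reading is right, I would expect a per-topic lambda (or an interval over lambda values) to show up somewhere in the sampling code.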

Also, it occurred to me that scaling might be a problem for the estimation of the known topics: as far as I can tell, for a given word in a given topic, the formula uses the count for that word under that topic in the knowledge source (in the numerator) and the total count for that word in the entire knowledge source (in the denominator). Like most algorithms of this sort, LDA uses raw counts rather than proportions, since the denominators in the proportions divide out. Don't raw frequencies pose a problem, though, for knowledge sources that are much bigger or smaller than the articles in the corpus?
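
For concreteness, the kind of rescaling I have in mind is sketched below; target_mass is a hypothetical parameter (for example, the mean document length of the corpus):

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch: convert a knowledge-source article's raw counts to
    // proportions, then rescale to a common pseudo-count mass so that very
    // long or very short source articles exert comparable influence.
    std::vector<double> rescale_source_counts(const std::vector<double>& raw_counts,
                                              double target_mass) {
        double total = 0.0;
        for (double c : raw_counts) total += c;
        std::vector<double> scaled(raw_counts.size());
        for (std::size_t w = 0; w < raw_counts.size(); ++w)
            scaled[w] = (raw_counts[w] / total) * target_mass;  // proportion * common mass
        return scaled;
    }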

Algorithm Error?

After a couple of months of working with the algorithm (including both modifying these scripts and writing another application that implements it with online variational Bayes), I think I see a problem with the math in the "Pop_sample" and "Populate_prob" functions.

The use of the probability density function to allow variation in the extent to which each topic matches the knowledge source is very clever; however, I don't think the code accomplishes what was intended, because, when calculating the "pr" vector, it aggregates the values for all A intervals for each known topic before sampling each word.

The effect of performing the sampling in this fashion is that no information is extracted that would allow the algorithm to weight the estimation toward some intervals and away from others (which, I believe, is the point of using the probability density function in the first place). It looks like each interval should instead be treated as a separate topic (though the "calculate_phi" function can probably stay as it is, since it isn't used in the estimation).
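
To illustrate the difference, here is a hedged sketch of the two sampling layouts; num_topics, A, and weight() are illustrative stand-ins, not the repo's actual identifiers:

    #include <cstddef>
    #include <vector>

    // Stand-in for whatever per-interval word weight Populate_prob computes.
    double weight(int /*topic*/, int /*interval*/) { return 1.0; }

    // As I read Pop_sample today: intervals are summed away before sampling,
    // so the sampled index identifies a topic but not an interval.
    std::vector<double> pr_aggregated(int num_topics, int A) {
        std::vector<double> pr(num_topics, 0.0);
        for (int k = 0; k < num_topics; ++k)
            for (int a = 0; a < A; ++a)
                pr[k] += weight(k, a);
        return pr;
    }

    // What I think should happen: one entry per (topic, interval) pair, so
    // the sampler can favor some intervals over others.
    std::vector<double> pr_per_interval(int num_topics, int A) {
        std::vector<double> pr;
        pr.reserve(static_cast<std::size_t>(num_topics) * A);
        for (int k = 0; k < num_topics; ++k)
            for (int a = 0; a < A; ++a)
                pr.push_back(weight(k, a));  // flat index encodes (k, a)
        return pr;
    }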

Mind you, breaking up the known topics in this way would reduce the statistical power of the algorithm for the known topics--and with small corpora, that could weight the results toward the (undivided, hence more accurately estimable) unknown/arbitrary topics.

Question about the Assumptions of the Source-LDA Model

Thank you for all your help to date. I have a more conceptual question now.

As I understand from the paper, there are two aspects that distinguish the Source-LDA model from the bijective and known-mixture models:

  1. The Source-LDA model relaxes the assumptions about phi, so that the phi of a known topic can diverge somewhat from the word distribution of the corresponding entry in the knowledge source.

  2. The Source-LDA model assumes that the topics in the knowledge source are a superset of the topics present in the corpus, and therefore prunes topics as part of the estimation process.

I'm working with a knowledge source and corpus for which the first point should absolutely be true (and therefore I don't want to use the known-mixture model), but where I need (because of the end use to which the model will be put) to keep all the topics in the knowledge source.

In terms of code, this is easy enough to do by disabling all the calls to the "prune" function. Before I do, though, I want to make sure that the relaxed assumptions about phi can still hold if the topics aren't pruned. Is that the case?
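
Concretely, I was planning something like the following (a hypothetical sketch; the flag and function names are mine, not the repo's):

    // Hypothetical sketch: gate pruning behind a flag instead of deleting
    // the calls, so the superset assumption can be toggled per run.
    bool keep_all_source_topics = true;  // would be set from a config option

    void maybe_prune() {
        if (keep_all_source_topics) return;  // keep every knowledge-source topic
        // ... original prune logic would run here ...
    }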

Question about "Load_corpus"

I'm writing a new version of "Load_corpus" to handle loading a bunch of unseen documents in order to estimate theta for them using a previously estimated model. This might not be that important, but I'm curious why the "reverse" function is applied to the vector ("lines_test") holding the test set.

EDIT: Never mind; I see now you're just restoring the original order.
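
For anyone else who trips over this, a minimal sketch of the pattern (illustrative names): if documents are consumed from the back of the vector, reversing it first means pop_back yields them in their original file order.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for whatever per-document work Load_corpus does.
    void process(const std::string& doc) { std::cout << doc << '\n'; }

    void consume_in_original_order(std::vector<std::string> lines_test) {
        std::reverse(lines_test.begin(), lines_test.end());
        while (!lines_test.empty()) {
            process(lines_test.back());  // back() is now the earliest document
            lines_test.pop_back();
        }
    }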

What sort of performance to expect?

I have been trying to run the reuters shell script on my laptop (in a VirtualBox VM), and a single iteration of CTM takes over an hour (CPU at 100%). Is this normal? Do I need to compile GSL against ATLAS, for example? Is there anything else you would suggest to improve performance?

Question about running the application

The logic behind your method is compelling, but I'm having trouble reproducing your analyses: the article seems a little vague on exactly what you did, and I'm not sure what formats are needed for the inputs (including the knowledge source). Where could I look for more information?
