tm-experiments's Issues

Investigate vocabulary mismatch in train_artm

In train_artm.py, when using CF data, doctopic.shape[0] (the number of documents in the ARTM theta matrix) differs from the number of documents reported by docword_input_path or doc_names.

The current fix is to save the new index (from the ARTM model) and use it in later operations. The places in the code related to this bug and its workaround are marked with a # TODO(https://github.com/src-d/tm-experiments/issues/21) for easy grepping.
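For reference, here is a minimal sketch of what that workaround amounts to, assuming the ARTM model exposes an index mapping each theta row back to the original document order (the names doctopic_index and align_doc_names are hypothetical, not the actual code in train_artm.py):

    import numpy as np

    def align_doc_names(doctopic: np.ndarray, doctopic_index, doc_names):
        # doctopic_index is assumed to give, for each row of the theta matrix,
        # the position of that document in the original docword/doc_names order
        if doctopic.shape[0] != len(doc_names):
            # mismatch: rebuild the document names from the model's own index
            doc_names = [doc_names[i] for i in doctopic_index]
        assert doctopic.shape[0] == len(doc_names)
        return doc_names

All later operations (saving topics per document, labeling, …) would then use the re-aligned doc_names instead of the original list.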

Use NN token splitter

Currently we split words into subtokens by splitting on underscores. We have an NN tokenizer that we could use instead to reduce noise.

Since the NN tokenizer is really slow on CPU, it should be possible to enable it only if a GPU is available.
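A minimal sketch of that switch, assuming the NN tokenizer is exposed as a callable and using the presence of nvidia-smi as a crude proxy for GPU availability (gpu_available and split_word are hypothetical names, not existing functions in the repo):

    import shutil
    from typing import Callable, List, Optional

    def gpu_available() -> bool:
        # crude check: assume a usable GPU when nvidia-smi is on the PATH
        return shutil.which("nvidia-smi") is not None

    def split_word(word: str, nn_tokenizer: Optional[Callable[[str], List[str]]] = None) -> List[str]:
        # enable the (slow) NN splitter only when it is available and a GPU is present,
        # otherwise keep the current underscore split
        if nn_tokenizer is not None and gpu_available():
            return nn_tokenizer(word)
        return [part for part in word.split("_") if part]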

ARTM bug - again

As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of it by using an experimental feature exposed by the API, which allows us to store the theta matrix in a second phi matrix.

However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across the interesting fact that, while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases the sum fell short. More precisely, although the document rows did exist in the theta matrix, they were all zeros.

As I was testing on a small corpus (bash files extracted from pytorch), I am not sure whether this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which applied to both large and small corpora.

After trying out different things, I made the following observations:

  • the problem usually appeared after the first and last phases of training
  • choosing a large number of topics compared to the range towards which the model converged did not change much; however, choosing too small a number of topics did
  • as previously, it usually affected small documents
  • an alternative way of retrieving the matrix corrected the problem after the first phase of training:
    # instead of
    doctopic, _, _ = model_artm.get_phi_dense(model_name="theta")
    # we do
    doctopic = model_artm.transform_sparse(batch_vectorizer)[0].todense().T
  • the above method did not work after the last phase of training, where we induce sparsity - it hit the same bug as previously: documents being cut out, resulting in an incorrect matrix shape. Past that phase, neither method worked (I tested literally all methods this time), and in some cases the current one gave "better" results, although almost never good ones.
  • when setting --sparse-doc-coeff to a lower value - or even 0 - the problem did not occur and the above method worked each time. However, doing so systematically decreased the model quality, more often than not by a lot. I also did not observe significant increases in performance with that regularizer in general in the past.
  • I did not find any issues, at any point, with the wordtopic (phi) matrix

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation, @m09):

  1. Systematically retrieve the theta matrix with the method shown above
  2. Check that the doctopic matrix is sane after each phase of training (in DEBUG mode, except for the last phase); see the sketch below.
  3. Save the doctopic and wordtopic matrices before inducing sparsity
  4. Compare results before and after inducing sparsity, and save them only if the doctopic matrix is sane and the results are better.

I will implement this in an upcoming PR - probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, in the hope that the package was more stable previously.
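For step 2, here is a rough sketch of what the sanity check could look like (a sketch only, under the assumption that in a sane theta matrix every document row exists and is non-null; doctopic_is_sane is a hypothetical helper):

    import numpy as np

    def doctopic_is_sane(doctopic, expected_num_docs: int, atol: float = 1e-3) -> bool:
        doctopic = np.asarray(doctopic)
        if doctopic.shape[0] != expected_num_docs:
            # documents were cut out of the theta matrix
            return False
        row_sums = doctopic.sum(axis=1)
        # null rows are exactly the symptom described above
        return bool(np.all(row_sums > atol))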

Experiments planning

Research the most insightful ways to apply topic modeling to source code.

  • Review existing literature: https://www.overleaf.com/7163137971dqhdpqmkdhyv
  • Create the feature extraction and model training pipeline:
    • Parse repositories at several points in history and extract features for each revision, following either the Diff or Hall model defined by Thomas (cf. the literature review)
    • Define a pipeline to compute a topic model over the complete features
    • Post-process the obtained model to obtain topics for each file in each considered point in history
    • Explore the interaction of the diff model with document sparsity regularization and general topic model training by changing the corpus we train on: instead of keeping diff documents (both added and deleted) to train on, try to concatenate all add-documents of a given document into a consolidated document for training, and use the diff-documents only to compute topics for each revision in post-processing
  • Make sure the models trained with the Python ARTM client are equivalent to the ones trained with the ARTM CLI.
  • Experiment on https://github.com/pytorch/pytorch/ (20+ releases):
    • Find the best way to automatically label topics:
      • Implement Mei's approach
      • Test different ways to define contexts for labeling: #31 (comment), with the added option of using the max/mean/median count across revisions for each identifier as a context (see the sketch after this list)
      • Test monoword labels to mitigate the bigram mining/ordering problem
      • Test filtering label names to use only some types of identifiers (class names, function names, …), potentially boosting them compared to others when we score them using the trained topic model
    • Research how to circumvent the need for a manually set number of topics
    • Measure topic evolution cohesion with existing changelogs
    • Define the best evaluation schemes for production setups (which metrics, at which stages)
  • Scale up the experiments on http://github.com/tensorflow/tensorflow/ (80+ releases), and to commit granularity. Currently blocked by:
    • bblfsh python client OOMs
    • gitbase UAST cache issues/OOMs when parsing UASTs
    • diffs are not parsable in gitbase
  • Compute the topics at the user level, not document level, using either blame to determine ownership (easy to try, not so precise) or identifiers in diffs (a bit harder, very precise), and apply all the above points to the obtained topics (automatic labels, evolution, …)
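For the max/mean/median labeling context mentioned above, a minimal sketch of the aggregation step (reduce_counts is a hypothetical helper; counts_per_rev maps each identifier to its per-revision counts):

    from statistics import mean, median
    from typing import Dict, List

    def reduce_counts(counts_per_rev: Dict[str, List[int]], how: str = "median") -> Dict[str, float]:
        # collapse the per-revision counts of each identifier into a single value,
        # used as the labeling context when scoring candidates with the trained topic model
        reducers = {"max": max, "mean": mean, "median": median}
        reduce_fn = reducers[how]
        return {identifier: reduce_fn(counts) for identifier, counts in counts_per_rev.items()}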
