src-d / tm-experiments
Topic Modeling Experiments on Source Code
License: Apache License 2.0
In train_artm.py, when using CF data, doctopic.shape[0] (the number of documents in the ARTM theta matrix) differs from the number of documents reported by docword_input_path or doc_names.
The current workaround is to save the new index (from the ARTM model) and use it in later operations. The places in the code related to this bug and its workaround are marked with a # TODO(https://github.com/src-d/tm-experiments/issues/21) for easy grepping.
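The realignment workaround amounts to trusting the index the model returns rather than the original document list. A minimal sketch (the names doc_names and theta_index are illustrative, not the actual variables in train_artm.py):

```python
def realign_doc_names(doc_names, theta_index):
    """Keep only the documents that actually appear in the theta matrix,
    in the order the ARTM model returned them, and report which original
    documents were silently dropped."""
    present = set(theta_index)
    missing = [name for name in doc_names if name not in present]
    aligned = list(theta_index)  # the model's order wins
    return aligned, missing

# toy example: one document was dropped by the model
doc_names = ["a.py", "b.py", "c.py"]
theta_index = ["a.py", "c.py"]
aligned, missing = realign_doc_names(doc_names, theta_index)
print(aligned)   # ['a.py', 'c.py']
print(missing)   # ['b.py']
```

All later operations then use the aligned list, so indices into doctopic stay consistent with document names.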
Currently we split subtokens from words on underscores. We have an NN tokenizer that we could use instead to reduce noise.
Since the NN tokenizer is very slow on CPU, it should be possible to enable it only when a GPU is available.
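The current underscore-based splitting amounts to something like this (a minimal sketch; the actual tokenizer in the repo may handle more cases):

```python
def split_subtokens(word):
    """Split a snake_case identifier into its subtokens, dropping the
    empty pieces left by leading, trailing, or doubled underscores."""
    return [part for part in word.split("_") if part]

print(split_subtokens("get_user_name"))  # ['get', 'user', 'name']
print(split_subtokens("__init__"))       # ['init']
```

This is exactly where the noise comes from: camelCase or misspelled identifiers pass through unsplit, which is what a learned tokenizer would improve on.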
As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of that by using an experimental feature the API exposes, which allows us to store the theta matrix in a second phi matrix.
However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across an interesting fact: while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases the sum came up short. More precisely, although the document rows in the theta matrix did exist, they were all zeros.
As I was testing on a small corpus (bash files extracted from pytorch), I am not sure this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which affected both large and small corpora.
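The sanity check described above takes a few lines to reproduce (illustrative only: theta here stands for any topics-by-documents matrix whose columns should each sum to 1):

```python
import numpy as np

def find_null_documents(theta):
    """Return the indices of documents whose topic distribution is all
    zeros. theta has shape (n_topics, n_documents); a healthy document
    column sums to 1."""
    col_sums = theta.sum(axis=0)
    return np.flatnonzero(np.isclose(col_sums, 0.0))

# toy theta: 3 topics, 4 documents, document 2 is null
theta = np.array([
    [0.5, 0.2, 0.0, 1.0],
    [0.3, 0.3, 0.0, 0.0],
    [0.2, 0.5, 0.0, 0.0],
])
print(theta.sum())                 # 3.0 instead of 4.0: one document is missing
print(find_null_documents(theta))  # [2]
```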
After trying out different things, I made a couple of observations:

```python
# instead of
doctopic, _, _ = model_artm.get_phi_dense(model_name="theta")
# we do
doctopic = model_artm.transform_sparse(batch_vectorizer)[0].todense().T
```
When setting --sparse-doc-coeff to a lower value, or even 0, the problem did not occur and the above method worked each time. However, doing so systematically decreased the model quality, more often than not by a lot. I also did not observe significant performance increases with that regularizer in general in the past.

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation @m09):
I will implement this in an upcoming PR, probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, hoping the package was more stable previously.
Research the most insightful ways to apply topic modeling to source code.