tm-experiments's Issues

Investigate vocabulary mismatch in train_artm

In train_artm.py, when using CF data, doctopic.shape[0] (the number of documents in the ARTM theta matrix) differs from the number of documents reported by docword_input_path or doc_names.

The current fix is to save the new index (from the ARTM model) and use it in later operations. The places in the code related to this bug and its workaround are marked with a # TODO(https://github.com/src-d/tm-experiments/issues/21) for easy grepping.
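For reference, here is a minimal sketch of what that workaround amounts to, assuming the ARTM model exposes an index mapping each theta row back to the original document order (the names doctopic_index and align_doc_names are hypothetical, not the actual code in train_artm.py):

    import numpy as np

    def align_doc_names(doctopic: np.ndarray, doctopic_index, doc_names):
        # doctopic_index is assumed to give, for each row of the theta matrix,
        # the position of that document in the original docword/doc_names order
        if doctopic.shape[0] != len(doc_names):
            # mismatch: rebuild the document names from the model's own index
            doc_names = [doc_names[i] for i in doctopic_index]
        assert doctopic.shape[0] == len(doc_names)
        return doc_names

All later operations (saving topics per document, labeling, …) would then use the re-aligned doc_names instead of the original list.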

Use NN token splitter

Currently we split words into subtokens by splitting on underscores. We have an NN tokenizer that we could use instead to reduce noise.

Since the NN tokenizer is really slow on CPU, it should be possible to enable it only if a GPU is available.
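A minimal sketch of that switch, assuming the NN tokenizer is exposed as a callable and using the presence of nvidia-smi as a crude proxy for GPU availability (gpu_available and split_word are hypothetical names, not existing functions in the repo):

    import shutil
    from typing import Callable, List, Optional

    def gpu_available() -> bool:
        # crude check: assume a usable GPU when nvidia-smi is on the PATH
        return shutil.which("nvidia-smi") is not None

    def split_word(word: str, nn_tokenizer: Optional[Callable[[str], List[str]]] = None) -> List[str]:
        # enable the (slow) NN splitter only when it is available and a GPU is present,
        # otherwise keep the current underscore split
        if nn_tokenizer is not None and gpu_available():
            return nn_tokenizer(word)
        return [part for part in word.split("_") if part]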

ARTM bug - again

As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of it by using an experimental feature exposed by the API, which allows us to store the theta matrix in a second phi matrix.

However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across the interesting fact that, while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases the sum fell short. More precisely, although the document rows did exist in the theta matrix, they were all zeros.

As I was testing on a small corpus (bash files extracted from pytorch), I am not sure whether this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which applied to both large and small corpora.

After trying out different things, I made the following observations:

  • the problem usually appeared after the first and last phases of training
  • choosing a large number of topics compared to the range towards which the model converged did not change much; however, choosing too small a number of topics did
  • as previously, it usually affected small documents
  • an alternative way of retrieving the matrix corrected the problem after the first phase of training:
    # instead of
    doctopic, _, _ = model_artm.get_phi_dense(model_name="theta")
    # we do
    doctopic = model_artm.transform_sparse(batch_vectorizer)[0].todense().T
  • the above method did not work after the last phase of training, where we induce sparsity - it hit the same bug as previously: documents being cut out, resulting in an incorrect matrix shape. Past that phase, neither method worked (I tested literally all methods this time), and in some cases the current one gave "better" results, although almost never good ones.
  • when setting --sparse-doc-coeff to a lower value - or even 0 - the problem did not occur and the above method worked each time. However, doing so systematically decreased the model quality, more often than not by a lot. I also did not observe significant increases in performance with that regularizer in general in the past.
  • I did not find any issues, at any point, with the wordtopic (phi) matrix

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation, @m09):

  1. Systematically retrieve the theta matrix with the method shown above
  2. Check that the doctopic matrix is sane after each phase of training (in DEBUG mode, except for the last phase); see the sketch below.
  3. Save the doctopic and wordtopic matrices before inducing sparsity
  4. Compare results before and after inducing sparsity, and save them only if the doctopic matrix is sane and the results are better.

I will implement this in an upcoming PR - probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, in the hope that the package was more stable previously.
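For step 2, here is a rough sketch of what the sanity check could look like (a sketch only, under the assumption that in a sane theta matrix every document row exists and is non-null; doctopic_is_sane is a hypothetical helper):

    import numpy as np

    def doctopic_is_sane(doctopic, expected_num_docs: int, atol: float = 1e-3) -> bool:
        doctopic = np.asarray(doctopic)
        if doctopic.shape[0] != expected_num_docs:
            # documents were cut out of the theta matrix
            return False
        row_sums = doctopic.sum(axis=1)
        # null rows are exactly the symptom described above
        return bool(np.all(row_sums > atol))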

Experiments planning

Research the most insightful ways to apply topic modeling to source code.

  • Review existing literature: https://www.overleaf.com/7163137971dqhdpqmkdhyv
  • Create the feature extraction and model training pipeline:
    • Parse repositories at several points in history and extract features for each revision, following either the Diff or Hall model defined by Thomas (cf. the literature review)
    • Define a pipeline to compute a topic model over the complete features
    • Post-process the obtained model to obtain topics for each file in each considered point in history
    • Explore the interaction of the diff model with document sparsity regularization and general topic model training by changing the corpus we train on: instead of keeping diff documents (both added and deleted) to train on, try to concatenate all add-documents of a given document into a consolidated document for training, and use the diff-documents only to compute topics for each revision in post-processing
  • Make sure the models trained with the Python ARTM client are equivalent to the ones trained with the ARTM CLI.
  • Experiment on https://github.com/pytorch/pytorch/ (20+ releases):
    • Find the best way to automatically label topics:
      • Implement Mei's approach
      • Test different ways to define contexts for labeling: #31 (comment), with the added option of using the max/mean/median count across revisions for each identifier as a context (see the sketch after this list)
      • Test monoword labels to mitigate the bigram mining/ordering problem
      • Test filtering label names to use only some types of identifiers (class names, function names, …), potentially boosting them compared to others when we score them using the trained topic model
    • Research how to circumvent the need for a manually set number of topics
    • Measure topic evolution cohesion with existing changelogs
    • Define the best evaluation schemes for production setups (which metrics, at which stages)
  • Scale up the experiments on http://github.com/tensorflow/tensorflow/ (80+ releases), and to commit granularity. Currently blocked by:
    • bblfsh python client OOMs
    • gitbase UAST cache issues/OOMs when parsing UASTs
    • diffs are not parsable in gitbase
  • Compute the topics at the user level, not document level, using either blame to determine ownership (easy to try, not so precise) or identifiers in diffs (a bit harder, very precise), and apply all the above points to the obtained topics (automatic labels, evolution, …)
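For the max/mean/median labeling context mentioned above, a minimal sketch of the aggregation step (reduce_counts is a hypothetical helper; counts_per_rev maps each identifier to its per-revision counts):

    from statistics import mean, median
    from typing import Dict, List

    def reduce_counts(counts_per_rev: Dict[str, List[int]], how: str = "median") -> Dict[str, float]:
        # collapse the per-revision counts of each identifier into a single value,
        # used as the labeling context when scoring candidates with the trained topic model
        reducers = {"max": max, "mean": mean, "median": median}
        reduce_fn = reducers[how]
        return {identifier: reduce_fn(counts) for identifier, counts in counts_per_rev.items()}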
