
Topic Modeling Experiments on Source Code

Getting Started

Note that it is good practice to limit the amount of memory Docker containers have access to with the -m flag (for example, docker run -m 8g ...); for brevity, we do not do so in the commands below.

Start by cloning the repository, and building the docker image:

git clone https://github.com/src-d/tm-experiments
docker build tm-experiments -t tmexp

If you have GPU(s) and want to use our Neural Identifier Splitter, consider building the image with:

docker build tm-experiments -t tmexp --build-arg USE_NN=true

In all of the following, the created data will be stored in a single directory (hereinafter referred to as /path/to/data), which would have the following structure if you were to run each command with the exact same arguments:

data
├── datasets
│   ├── my-dataset.pkl
│   ├── my-dataset-2.pkl
│   └── my-merged-dataset.pkl
├── bows
│   ├── my-hall-bow
│   │   ├── doc.bow_tm.txt
│   │   ├── docword.bow_tm.txt
│   │   ├── refs.bow_tm.txt
│   │   └── vocab.bow_tm.txt
│   └── my-diff-bow
│       ├── doc.bow_tm.txt
│       ├── docword.bow_concat_tm.txt
│       ├── docword.bow_tm.txt
│       ├── refs.bow_tm.txt
│       ├── vocab.bow_concat_tm.txt -> /data/bows/my-diff-bow/vocab.bow_tm.txt
│       ├── vocab.bow_tm.txt
│       └── wordcount.pkl
├── topics
│   └── my-diff-bow
│       ├── my-hdp-exp
│       │   ├── doctopic.npy
│       │   └── wordtopic.npy
│       └── my-artm-exp
│           ├── doctopic.npy
│           ├── wordtopic.npy
│           ├── labels.txt
│           ├── membership.pkl
│           └── metrics.pkl
└── visualisations
    └── my-diff-bow
        └── my-artm-exp
            ├── heatmap_distinctness.png
            ├── topic_1.png
            ├── topic_2.png
            ├── heatmap_assignment.png
            ├── heatmap_weight.png
            ├── heatmap_scatter.png
            └── heatmap_focus.png

For each command we only specify the required arguments; you can check the optional ones with docker run --rm -i tmexp $CMD --help.

preprocess command

This command will create a dataset from a cloned git repository. Before launching the command, you will thus need to clone one (or more) repositories into a directory, as well as start the Babelfish and Gitbase servers:

make start REPOS=~/path/to/cloned/repos

You can now launch the preprocessing with the following command (we mount the Docker socket in order to be able to relaunch Babelfish; this is a temporary hack):

docker run --rm -it -v /path/to/data:/data \
                    -v /var/run/docker.sock:/var/run/docker.sock \
                    --link tmexp_bblfshd:tmexp_bblfshd \
                    --link tmexp_gitbase:tmexp_gitbase \
  tmexp preprocess -r repo --dataset-name my-dataset

Once this job is finished, the output file should be located in /path/to/data/datasets/. Unless you wish to run this command once more, you can remove the Babelfish and Gitbase containers, as they will not be of further use, with:

docker stop tmexp_gitbase tmexp_bblfshd

For the sake of explaining the next command, we assume you ran it a second time, and created a dataset named my-dataset-2.

merge command

This command will merge multiple datasets created by the previous command. Assuming you created the two datasets my-dataset and my-dataset-2, you can launch the merging with the following command:

docker run --rm -it -v /path/to/data:/data \
  tmexp merge -i my-dataset my-dataset-2 --dataset-name my-merged-dataset

Once this job is finished, the output file should be located in /path/to/data/datasets/.

create-bow command

This command will create the input used for the topic modeling, i.e. bags-of-words, from datasets created by one of the above commands. You will need to choose one of two topic evolution models:

  • hall: each blob is considered to be a document
  • diff: we create delta-documents of added and deleted words for each series of documents in the hall model, using the tagged reference order for each repository

In the case of the diff model, two corpora will be created: one containing the delta-documents, and one where each document is the concatenation of all added delta-documents (a consolidated document). For more information about these evolution models, you can check out this paper.
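To make the diff model concrete, here is a minimal sketch of how delta-documents could be built from the successive versions of one document. This is illustrative only: the function name and data layout are hypothetical, not the tool's actual implementation.

from collections import Counter

def delta_documents(versions):
    """Yield (added, deleted) delta-documents from the word counts of
    one document across successive tagged references."""
    previous = Counter()
    for current in map(Counter, versions):
        # Counter subtraction keeps positive counts only.
        yield current - previous, previous - current  # added, deleted
        previous = current

# Hypothetical document seen at three successive tagged references.
versions = [
    ["parse", "token", "token"],
    ["parse", "token", "token", "ast"],
    ["parse", "ast"],
]
for added, deleted in delta_documents(versions):
    print(dict(added), dict(deleted))

In the concatenated corpus, all added delta-documents of a given document would then be merged into a single consolidated document.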

You can launch the bag of words creation with the following command (don't forget to specify the dataset name and the topic evolution model you want to use):

docker run --rm -it -v /path/to/data:/data \
  tmexp create-bow --topic-model hall --dataset-name my-dataset --bow-name my-hall-bow

Once this job is finished, the output files should be located in /path/to/data/bows/my-hall-bow/. For the sake of showing the additional output files, we assume you ran it a second time, and created bags named my-diff-bow.

train-artm command

This command will create an ARTM model from the bags-of-words created previously. You will need to specify the minimum number of documents belonging to a given topic for this topic to be kept (by default, belonging means the document has a topic probability over .5); this minimum can be given either as an absolute number of documents or relative to the total number of documents. For more information about ARTM models, you can check out this paper or the BigARTM documentation.
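For intuition, here is a minimal sketch of this filtering step, assuming a documents x topics probability matrix; the 0.5 threshold matches the default mentioned above, while the function and argument names are ours, not the tool's.

import numpy as np

def kept_topics(doctopic, min_docs_abs=None, min_docs_rel=None):
    """Return the indices of topics to which enough documents belong,
    where belonging means a topic probability over 0.5."""
    n_docs = doctopic.shape[0]
    min_docs = min_docs_abs if min_docs_abs is not None else min_docs_rel * n_docs
    belonging = (doctopic > 0.5).sum(axis=0)  # documents belonging to each topic
    return np.flatnonzero(belonging >= min_docs)

doctopic = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]])
print(kept_topics(doctopic, min_docs_abs=2))  # [0]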

You can launch the training with the following command (don't forget to specify the bow name and one of the min-docs arguments):

docker run --rm -it -v /path/to/data:/data \
  tmexp train-artm --bow-name my-diff-bow --exp-name my-artm-exp --min-docs-abs 1

Once this job is finished, the output files should be located in /path/to/data/topics/my-diff-bow/my-artm-exp.

train-hdp command

This command will allow you to create an HDP model from the bags-of-words created previously. For more information about HDP models, you can check out this paper or the Gensim documentation.

You can launch the training with the following command (don't forget to specify the bow name):

docker run --rm -it -v /path/to/data:/data \
  tmexp train-hdp --bow-name my-diff-bow --exp-name my-hdp-exp

Once this job is finished, the output files should be located in /path/to/data/topics/my-diff-bow/my-hdp-exp.

label command

This command will automatically label the topics of a previously created model. You will need to choose one of the following context creation methods, used to compute word probabilities:

  • hall: the context will be the hall model of the corpus, i.e. each blob will be a document in the context
  • last: only blobs from the last reference of each repository will be documents in the context
  • mean/median/max: the mean, median or max word count of each document (from the hall model) across all refs where it exists will be taken as a document in the context (see the sketch after this list)
  • concat: the concatenation of all added delta-documents (from the diff model) of the corpus will be a document in the context
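To make the mean/median/max methods concrete, here is a minimal sketch of the aggregation for a single document, assuming we have its word counts at each reference where it exists; names and data layout are illustrative only.

import numpy as np

def aggregate_context_doc(counts_per_ref, how="mean"):
    """Aggregate one document's word counts across the references where
    it exists, yielding a single context document."""
    agg = {"mean": np.mean, "median": np.median, "max": np.max}[how]
    words = set().union(*counts_per_ref)
    return {word: agg([counts.get(word, 0) for counts in counts_per_ref])
            for word in words}

# Hypothetical word counts of one document at the two refs where it exists.
counts_per_ref = [{"parse": 2, "token": 1}, {"parse": 4}]
print(aggregate_context_doc(counts_per_ref))  # mean counts: parse 3.0, token 0.5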

For more information about the method used to label topics, you can check out this paper.

You can launch the labeling with the following command (don't forget to specify the bow name, the experiment name and the context creation method):

docker run --rm -it -v /path/to/data:/data \
  tmexp label --bow-name my-diff-bow --exp-name my-artm-exp --context hall

Once this job is finished, the output file should be located in /path/to/data/topics/my-diff-bow/my-artm-exp.

postprocess command

This command will convert the corpus to the hall model if needed, then compute the total word count and topic assignment for each document, inputs that are needed to compute some of the metrics evaluating the quality of the topic model. You can launch the postprocessing with the following command (don't forget to specify the bow and experiment names):

docker run --rm -it -v /path/to/data:/data \
  tmexp postprocess --bow-name my-diff-bow --exp-name my-artm-exp

Once this job is finished, the total word count output file should be located in /path/to/data/bows/my-diff-bow/ and the topic membership output file should be located in /path/to/data/topics/my-diff-bow/my-artm-exp.

compute-metrics command

This command will compute metrics over the topic model from inputs created by the previous commands:

  • the distinctness between a pair of topics, which is defined as the Jensen–Shannon divergence between their word distributions. We also use this metric as our convergence metric for ARTM models. It gives a measure of the similarity, or lack thereof, between topics.
  • the assignment of a topic at a given version, which is defined as the mean of all membership values of the topic over all documents of that version. It gives a measure of the volume of documents related to the topic.
  • the weight of a topic at a given version, which is defined as the mean of all membership values of the topic over all documents of that version, weighted by the number of words in each document. It gives a measure of the volume of code related to the topic.
  • the scatter of a topic at a given version, which is defined as the entropy of the membership values of the topic for that version. It gives a measure of how spread out the topic is across documents.
  • the focus of a topic at a given version, which is defined as the proportion of documents of that version in which the topic has a membership value over 50 %. It gives a measure of how dominant the topic is in the documents where it is present. (All five metrics are sketched in code after this list.)
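As a rough illustration of these definitions, here is a sketch of how the per-version metrics could be computed from a documents x topics membership matrix and per-document word counts; the exact normalizations used by tmexp may differ, so treat this as a reading aid rather than the reference implementation.

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def distinctness(wordtopic, i, j):
    """Jensen-Shannon divergence between the word distributions of
    topics i and j (scipy returns the distance, hence the square)."""
    return jensenshannon(wordtopic[i], wordtopic[j]) ** 2

def version_metrics(membership, wordcount):
    """membership: (n_docs, n_topics) matrix for one version,
    wordcount: (n_docs,) total number of words per document."""
    assignment = membership.mean(axis=0)
    weight = (membership * wordcount[:, None]).sum(axis=0) / wordcount.sum()
    scatter = entropy(membership, axis=0)  # normalized per topic internally
    # Share of documents dominated by the topic, among those where it appears.
    appears = np.maximum((membership > 0).sum(axis=0), 1)
    focus = (membership > 0.5).sum(axis=0) / appears
    return assignment, weight, scatter, focus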

For more information about the selected metrics, you can check out this paper.

You can launch the computation with the following command (don't forget to specify the bow and experiment names):

docker run --rm -it -v /path/to/data:/data \
  tmexp compute-metrics --bow-name my-diff-bow --exp-name my-artm-exp

Once this job is finished, the output file should be located in /path/to/data/topics/my-diff-bow/my-artm-exp.

visualize command

This command will create visualizations for the metrics computed previously. If you built your topic model on a single repository, it will create evolution plots of all metrics for each topic, as well as a heatmap per metric across references and topics. If you built your topic model on multiple repositories, it will assume you did so with only one reference per repository, and will compute heatmaps per metric across repositories and topics. You can launch the creation with the following command (don't forget to specify the bow and experiment names):

docker run --rm -it -v /path/to/data:/data \
  tmexp visualize --bow-name my-diff-bow --exp-name my-artm-exp --max-topics 2

Once this job is finished, the output files should be located in /path/to/data/visualisations/my-diff-bow/my-artm-exp.
