```
pip install -r requirements.txt
python setup.py develop
```
Note that TensorFlow needs to be installed separately using these steps.
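For example (this exact command is an assumption; the required version may differ, so follow the steps referenced above):

```
pip install tensorflow
```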
First, copy `config.yml.example` to `config.yml`:

```
cp config.yml.example config.yml
```

Then, modify the content of `config.yml`. The configuration file is self-documented, and the most important parameters we used can be found in the paper.
We train our model on a dataset extracted from the competitive programming website AtCoder: https://atcoder.jp.
The dataset can be downloaded as an SQLite3 database: java-python-clones.db.gz.
You will most likely need to decompress the database before using it.
We also provide the raw data as a tarball, but it should generally not be needed: java-python-clones.tar.gz.
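For example, both archives can be unpacked with standard tools:

```
gunzip java-python-clones.db.gz     # produces java-python-clones.db
tar xzf java-python-clones.tar.gz   # extracts the raw data (usually not needed)
```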
The database contains both the text representation and the AST representation of the source code. All the data is in the `submissions` table. We describe the different columns of the table below.
Name | Type | Description |
---|---|---|
id | INTEGER | Primary key for the submission |
url | VARCHAR(255) | URL of the problem on AtCoder |
contest_type | VARCHAR(64) | Contest type on AtCoder (beginner or regular) |
contest_id | INTEGER | Contest ID on AtCoder |
problem_id | INTEGER | Problem ID on AtCoder |
problem_title | VARCHAR(255) | Problem title on AtCoder (usually in Japanese) |
filename | VARCHAR(255) | Original path of the file |
language | VARCHAR(64) | Full name of the language used |
language_code | VARCHAR(64) | Short name of the language used |
source_length | INTEGER | Source length in bytes |
exec_time | INTEGER | Execution time in ms |
tokens_count | INTEGER | Number of tokens in the source |
source | TEXT | Source code of the submission |
ast | TEXT | JSON encoded AST representation of the source code |
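As an illustration, the `submissions` table can be queried directly with Python's built-in `sqlite3` module. The file name assumes you decompressed `java-python-clones.db.gz` as above, and the `language_code` value used in the filter is an assumption; inspect the database to see the actual codes.

```python
import json
import sqlite3

# Open the decompressed database (see the download step above).
conn = sqlite3.connect("java-python-clones.db")

# Fetch a few submissions together with their JSON-encoded ASTs.
# 'python' is an assumed language_code value; run
# SELECT DISTINCT language_code FROM submissions to see the real codes.
rows = conn.execute(
    "SELECT id, language, tokens_count, ast FROM submissions "
    "WHERE language_code = 'python' LIMIT 5"
).fetchall()

for submission_id, language, tokens_count, ast_json in rows:
    ast = json.loads(ast_json)  # the exact AST structure is not documented here
    print(submission_id, language, tokens_count, type(ast))

conn.close()
```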
The database also contains a `samples` table which should be populated using the `suplearn-clone generate-dataset` command.
For the following steps to work, the model should already be configured in `config.yml`.
Before training the model, the clone pairs for training/cross-validation/test must first be generated using the following command:

```
suplearn-clone generate-dataset -c /path/to/config.yml
```
Once the data is generated, the model can be trained using the following command:

```
suplearn-clone train -c /path/to/config.yml
```
The model can be evaluated on test data using the following command:

```
suplearn-clone evaluate -c /path/to/config.yml -m /path/to/model.h5 --data-type=<dev|test> -o results.json
```

Note that `config.yml` should be the same file as the one used for training.
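The `-o results.json` flag saves the evaluation results as JSON. A minimal sketch for inspecting it, assuming nothing about the file's internal structure:

```python
import json

# Pretty-print whatever metrics the evaluation wrote to results.json.
with open("results.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))
```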
Pre-trained embeddings can be used by setting `model.languages.n.embeddings` in the configuration file.
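For example, with two languages (everything here except the `embeddings` key itself is an illustrative assumption; check `config.yml.example` for the actual schema):

```yaml
model:
  languages:
    - name: java                         # hypothetical keys and paths
      embeddings: embeddings/java.bin    # "n" = 0 in model.languages.n.embeddings
    - name: python
      embeddings: embeddings/python.bin  # "n" = 1
```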
This repository does not provide any functionality to train embeddings. Please check the bigcode-tools repository for instructions on how to train embeddings.
If you are using this for academic work, we would be thankful if you could cite the following paper:
```bibtex
@inproceedings{Perez:2019:CCD:3341883.3341965,
  author = {Perez, Daniel and Chiba, Shigeru},
  title = {Cross-language Clone Detection by Learning over Abstract Syntax Trees},
  booktitle = {Proceedings of the 16th International Conference on Mining Software Repositories},
  series = {MSR '19},
  year = {2019},
  location = {Montreal, Quebec, Canada},
  pages = {518--528},
  numpages = {11},
  url = {https://doi.org/10.1109/MSR.2019.00078},
  doi = {10.1109/MSR.2019.00078},
  acmid = {3341965},
  publisher = {IEEE Press},
  address = {Piscataway, NJ, USA},
  keywords = {clone detection, machine learning, source code representation},
}
```