
Machine Learning for Software refactoring

This repository contains the machine learning part of our work on using machine learning methods to recommend software refactoring.

Paper and appendix

The machine learning pipeline

This project contains all the Python scripts that are responsible for the ML pipeline.

Installing and configuring the database

This project requires Python 3.6 or higher.

First, install all the dependencies:

pip3 install --user -r requirements.txt

Then, create a dbconfig.ini file, following the example structure in dbconfig-example.ini. In this file, you configure your database connection.
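For orientation, a database configuration file for a project like this typically looks like the sketch below. The section and key names here are illustrative assumptions; the authoritative structure is the one in dbconfig-example.ini:

```ini
; Illustrative sketch only -- follow dbconfig-example.ini for the real keys.
[database]
host = localhost
port = 3306
user = refactoring
password = secret
database = refactoringdb
```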

Finally, configure the training in configs.py. There, you can define which datasets to analyze, which models to build, which undersampling algorithms to use, and so on. Please read the comments in this file.

Training and testing models

The main Python script that generates all the models and results is binary_classification.py. Run it by calling python3 binary_classification.py.

The results will be stored in a results/ folder.

The generated output is a text file with little structure. A quick way to extract results is by grepping:

  • cat *.txt | grep "CSV": returns a CSV with all the models and their precision, recall, and accuracy.
  • cat *.txt | grep "TIME": returns a CSV with how much time it took to train and test the model.

Before running the pipeline, we suggest warming up the cache. Warming up executes all the required queries and caches their results as CSV files. These queries can take a long time to run... and if you are like us, you will most likely re-execute your experiments many times! :) Thus, having them cached helps:

python3 warm_cache.py

If you need to clean up the cache, simply delete the _cache directory.

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.

machine-learning's People

Contributors

dahny, dependabot[bot], dvanderleij, egmaziero, jan-gerling, macro-mancer, mauricioaniche, rafadurelli, rdurelli, v2vivar


machine-learning's Issues

String as Features

We have a small collection of features within our training data that are strings, see below for the features.
How will we handle these features? For now, I disabled them, see here.

These features might be interesting for some refactoring types, especially the "Rename" refactoring types, so it would be good if we could use them.

Most ways to handle string features in machine learning are not applicable in our case, because:

  1. One-hot encoding / categorical variables: descriptors are difficult to map into categories that still retain the properties we are interested in.
  2. Convert to numbers: naively converting a name to a byte array will probably not yield good results, as the data is very messy and the lengths vary widely.
  3. Extract properties: we could extract properties from the names that we deem potentially relevant, e.g., number of characters, number of special characters, etc.
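Option 3 could be sketched roughly as below. The function and feature names are illustrative assumptions, not part of the current pipeline:

```python
import re

def name_features(name):
    """Extract simple numeric properties from an identifier name.
    Illustrative sketch; the chosen properties are assumptions."""
    return {
        "length": len(name),
        "special_chars": len(re.findall(r"[^A-Za-z0-9]", name)),
        "uppercase_chars": sum(1 for c in name if c.isupper()),
        "digits": sum(1 for c in name if c.isdigit()),
    }

features = name_features("getFullName_v2")
```

Each string column would then be replaced by these numeric columns before training.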

Features

Method Level:

  • fullMethodName
  • shortMethodName

Field Level:

  • fieldName

Variable Level:

  • variableName

Clean-up outdated files

We have many outdated, probably unused files in the ml src folder, e.g. results_parsing/. We should remove them to keep the project maintainable and easier to understand.

Upgrade tensorflow-gpu

We are currently using tensorflow-gpu, which is no longer the recommended package for GPU support in TensorFlow.
This creates issues with the Travis CI pipeline and might create further issues in the future.

Collect the IDs of the predicted methods in the test set

Right now, we only collect performance metrics (e.g., precision, recall, accuracy).

We need to collect some examples for future qualitative analysis. In other words, for each of the models we build, a collection of [method_id, expected_prediction, model_prediction].

This way we can later look at code examples of false positives, false negatives, etc.

I suppose all these changes will be:

  • Besides X_train, X_test, y_train, and y_test (which will be implemented in refactoring-ai/predicting-refactoring-ml#36), _single_run_model should also receive X_test_id.
  • _single_run_model then returns, besides the performance metrics, a dataframe as suggested above.
  • This should be printed to the logs in a way that is easy to parse later. Suggestion: "PRED,refactoring,model,id_element,expected_prediction,predicted_value". "PRED" is just a prefix that is easy to find with grep.

I'm using method as an example, but it can also be a class or a variable or a field, i.e., everything we predict.
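The idea above could be sketched like this. All names here are assumptions, not the actual _single_run_model API (the real code would likely return a pandas dataframe rather than a plain list):

```python
def prediction_log(refactoring, model_name, ids, y_true, y_pred):
    """Collect [id_element, expected_prediction, predicted_value] per test
    instance and emit grep-able "PRED,..." log lines. Sketch only."""
    rows = list(zip(ids, y_true, y_pred))
    lines = ["PRED,%s,%s,%s,%s,%s" % (refactoring, model_name, i, e, p)
             for i, e, p in rows]
    return rows, lines

rows, lines = prediction_log("Extract Method", "RandomForest",
                             [101, 102], [1, 0], [1, 1])
```

Grepping the logs for "PRED" then yields one CSV row per predicted element, including the false positive (102) in this example.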

Update Readme

The Readme is not up-to-date.

Update the Readme to the latest developments.

Linked to #10

Test of the ML part: Add more assertions

We should write more assertions to be 100% sure that all the transformations happened as expected.

We already have some assertions to make sure that, e.g., after balancing, the dataset is split 50%-50% between both classes. What other assertions should we add?

Suggestion:

  • Number of features at the end is the one expected
  • We have no duplicated data (to avoid some SQL query returning the same IDs more than once)
  • ... ?
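The suggested checks could be sketched as below, assuming feature matrices are NumPy arrays; the function name and messages are illustrative:

```python
import numpy as np

def check_dataset(x, y, expected_features):
    """Sanity checks for the transformed dataset. Sketch only."""
    # number of features at the end is the one expected
    assert x.shape[1] == expected_features, "unexpected feature count"
    # balanced classes after under/oversampling (50%-50%)
    classes, counts = np.unique(y, return_counts=True)
    assert len(set(counts)) == 1, "classes are not balanced"
    # no duplicated rows (e.g., a SQL query returning the same IDs twice)
    assert len(np.unique(x, axis=0)) == len(x), "duplicated rows in dataset"

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])
check_dataset(x, y, expected_features=2)
```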

Case sensitivity table names Windows vs Linux

pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT processmetrics.authorOwnership, processmetrics.bugFixCount, processmetrics.qtyMajorAuthors, processmetrics.qtyMinorAuthors, processmetrics.qtyOfAuthors, processmetrics.qtyOfCommits, processmetrics.refactoringsInvolved FROM stablecommit INNER JOIN commitmetadata ON stablecommit.commitmetadata_id = commitmetadata.id INNER JOIN processmetrics ON stablecommit.processmetrics_id = processmetrics.id WHERE stablecommit.level = 0 AND stablecommit.project_id in (select id from project where datasetName = "github") order by commitmetadata.commitDate': 1146 (42S02): Table 'refactoringdb.commitmetadata' doesn't exist

I think this is caused by case-sensitivity differences between Windows and Linux.
https://stackoverflow.com/questions/6134006/are-table-names-in-mysql-case-sensitive?rq=1
On Windows, MySQL table names are case-insensitive, as the post above explains. On Linux they are case-sensitive, which causes the Python scripts to fail: the tables are camel-cased, while the queries in the Python code use all lower case.
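One possible workaround (a sketch; the camel-cased spellings below are assumptions and would need to match the real schema) is to rewrite the lower-cased table names in the queries before sending them to MySQL:

```python
import re

# Map the lower-cased names used in the Python code to the camel-cased
# spellings the Linux MySQL server expects. Spellings are illustrative.
TABLE_NAMES = {
    "stablecommit": "StableCommit",
    "commitmetadata": "CommitMetaData",
    "processmetrics": "ProcessMetrics",
}

def fix_table_case(sql):
    """Rewrite whole-word table names to their camel-cased form."""
    for lower, actual in TABLE_NAMES.items():
        sql = re.sub(r"\b%s\b" % lower, actual, sql)
    return sql
```

Alternatively, setting lower_case_table_names in the MySQL server configuration avoids touching the queries at all.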

Different feature reduction strategies

Enable us to configure different feature reduction strategies (e.g., variance-based). Also enable the possibility of "no feature reduction".

See how we implemented the different balancing strategies, and follow the same code pattern.
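Following that pattern, the configuration could map a strategy name to a scikit-learn transformer (or None for "no feature reduction"). This is a sketch; the strategy keys are assumptions, not the actual configs.py values:

```python
from sklearn.feature_selection import VarianceThreshold

def build_feature_reduction(strategy):
    """Return a feature-reduction transformer for the given strategy name.
    Illustrative sketch mirroring the balancing-strategy pattern."""
    if strategy == "variance":
        # drops features whose variance is not above the threshold
        return VarianceThreshold(threshold=0.0)
    if strategy == "none":
        return None  # "no feature reduction"
    raise ValueError("unknown feature reduction strategy: %s" % strategy)

reducer = build_feature_reduction("variance")
```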

Do not balance data during training

I'm curious to see what happens if we don't balance the data during training. Does the number of FPs improve? (I am seeing this trend in other work I'm doing...)

Classifier Results

We want to investigate the results of the classifier. Therefore, we want to have a look at the original files.

For each result of a classifier store:

  • CommitID
  • Affected File/Class/Method/Variable
  • Repository ID

Is this paused ?

There has been little news, and the refactoring repo has been archived. Have you moved to something else? Is there a fork?
