
Machine Learning for Software refactoring

This repository contains the machine learning part of our work on using machine learning methods to recommend software refactoring.

Paper and appendix

The machine learning pipeline

This project contains all the Python scripts that are responsible for the ML pipeline.

Installing and configuring the database

This project requires Python 3.6 or higher.

First, install all the dependencies:

pip3 install --user -r requirements.txt

Then, create a dbconfig.ini file, following the example structure in dbconfig-example.ini. In this file, you configure your database connection.
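For orientation, a database configuration file for a project like this typically looks like the sketch below. The section and key names here are illustrative assumptions; the authoritative structure is the one in dbconfig-example.ini:

```ini
; Illustrative sketch only -- follow dbconfig-example.ini for the real keys.
[database]
host = localhost
port = 3306
user = refactoring
password = secret
database = refactoringdb
```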

Finally, configure the training in configs.py. There, you can define which datasets to analyze, which models to build, which undersampling algorithms to use, and so on. Please read the comments in this file.

Training and testing models

The main Python script that generates all the models and results is binary_classification.py. Run it by calling python3 binary_classification.py.

The results will be stored in a results/ folder.

The generated output is a text file with little structure. A quick way to extract results is by grepping:

  • cat *.txt | grep "CSV": returns a CSV with all the models and their precision, recall, and accuracy.
  • cat *.txt | grep "TIME": returns a CSV with how much time it took to train and test the model.

Before running the pipeline, we suggest warming up the cache. Warming up executes all the required queries and caches their results as CSV files. These queries can take a long time to run... and if you are like us, you will most likely re-execute your experiments many times! :) Thus, having them cached helps:

python3 warm_cache.py

If you need to clean up the cache, simply delete the _cache directory.

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.

machine-learning's People

Contributors

dahny, dependabot[bot], dvanderleij, egmaziero, jan-gerling, macro-mancer, mauricioaniche, rafadurelli, rdurelli, v2vivar


machine-learning's Issues

String as Features

We have a small collection of features within our training data that are strings, see below for the features.
How will we handle these features? For now, I disabled them, see here.

These features might be interesting for some refactoring types, especially the "Rename" refactoring types, so it would be good if we could use them.

Most ways to handle string features in machine learning are not applicable in our case, because:

  1. One-hot encoding / categorical variables: descriptors are difficult to map into categories that still retain the properties we are interested in.
  2. Convert to numbers: naively converting a name to a byte array will probably not yield good results, as the data is very messy and the lengths vary widely.
  3. Extract properties: we could extract properties from the names that we deem potentially relevant, e.g., number of characters, number of special characters, etc.
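Option 3 could be sketched roughly as below. The function and feature names are illustrative assumptions, not part of the current pipeline:

```python
import re

def name_features(name):
    """Extract simple numeric properties from an identifier name.
    Illustrative sketch; the chosen properties are assumptions."""
    return {
        "length": len(name),
        "special_chars": len(re.findall(r"[^A-Za-z0-9]", name)),
        "uppercase_chars": sum(1 for c in name if c.isupper()),
        "digits": sum(1 for c in name if c.isdigit()),
    }

features = name_features("getFullName_v2")
```

Each string column would then be replaced by these numeric columns before training.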

Features

Method Level:

  • fullMethodName
  • shortMethodName

Field Level:

  • fieldName

Variable Level:

  • variableName

Clean-up outdated files

We have many outdated, probably unused files in the ml src folder, e.g. results_parsing/. We should remove them to keep the project maintainable and easier to understand.

Upgrade tensorflow-gpu

We are currently using tensorflow-gpu, which is no longer the recommended package for GPU support in TensorFlow.
This creates issues with the Travis CI pipeline and might create further issues in the future.

Collect the IDs of the predicted methods in the test set

Right now, we only collect performance metrics (e.g., precision, recall, accuracy).

We need to collect some examples for future qualitative analysis. In other words, for each of the models we build, a collection of [method_id, expected_prediction, model_prediction].

This way we can later look at code examples of false positives, false negatives, etc.

I suppose all these changes will be:

  • Besides X_train, X_test, y_train, and y_test (which will be implemented in refactoring-ai/predicting-refactoring-ml#36), _single_run_model should also receive X_test_id.
  • _single_run_model then returns, besides the performance metrics, a dataframe as suggested above.
  • This should be printed to the logs in a way that is easy to parse later. Suggestion: "PRED,refactoring,model,id_element,expected_prediction,predicted_value". "PRED" is just a prefix that is easy to find with grep.

I'm using method as an example, but it can also be a class or a variable or a field, i.e., everything we predict.
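The idea above could be sketched like this. All names here are assumptions, not the actual _single_run_model API (the real code would likely return a pandas dataframe rather than a plain list):

```python
def prediction_log(refactoring, model_name, ids, y_true, y_pred):
    """Collect [id_element, expected_prediction, predicted_value] per test
    instance and emit grep-able "PRED,..." log lines. Sketch only."""
    rows = list(zip(ids, y_true, y_pred))
    lines = ["PRED,%s,%s,%s,%s,%s" % (refactoring, model_name, i, e, p)
             for i, e, p in rows]
    return rows, lines

rows, lines = prediction_log("Extract Method", "RandomForest",
                             [101, 102], [1, 0], [1, 1])
```

Grepping the logs for "PRED" then yields one CSV row per predicted element, including the false positive (102) in this example.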

Update Readme

The Readme is not up-to-date.

Update the Readme to the latest developments.

Linked to #10

Test of the ML part: Add more assertions

We should write more assertions to be 100% sure that all the transformations happened as expected.

We already have some assertions to make sure that, e.g., after balancing, the dataset is split 50%-50% between both classes. What other assertions should we add?

Suggestion:

  • Number of features at the end is the one expected
  • We have no duplicated data (to avoid some SQL query returning the same IDs more than once)
  • ... ?
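The suggested checks could be sketched as below, assuming feature matrices are NumPy arrays; the function name and messages are illustrative:

```python
import numpy as np

def check_dataset(x, y, expected_features):
    """Sanity checks for the transformed dataset. Sketch only."""
    # number of features at the end is the one expected
    assert x.shape[1] == expected_features, "unexpected feature count"
    # balanced classes after under/oversampling (50%-50%)
    classes, counts = np.unique(y, return_counts=True)
    assert len(set(counts)) == 1, "classes are not balanced"
    # no duplicated rows (e.g., a SQL query returning the same IDs twice)
    assert len(np.unique(x, axis=0)) == len(x), "duplicated rows in dataset"

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])
check_dataset(x, y, expected_features=2)
```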

Case sensitivity table names Windows vs Linux

pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT processmetrics.authorOwnership, processmetrics.bugFixCount, processmetrics.qtyMajorAuthors, processmetrics.qtyMinorAuthors, processmetrics.qtyOfAuthors, processmetrics.qtyOfCommits, processmetrics.refactoringsInvolved FROM stablecommit INNER JOIN commitmetadata ON stablecommit.commitmetadata_id = commitmetadata.id INNER JOIN processmetrics ON stablecommit.processmetrics_id = processmetrics.id WHERE stablecommit.level = 0 AND stablecommit.project_id in (select id from project where datasetName = "github") order by commitmetadata.commitDate': 1146 (42S02): Table 'refactoringdb.commitmetadata' doesn't exist

I think this is caused by case-sensitivity differences between Windows and Linux.
https://stackoverflow.com/questions/6134006/are-table-names-in-mysql-case-sensitive?rq=1
On Windows, MySQL table names are case-insensitive, as the post above explains. On Linux they are case-sensitive, which causes the Python scripts to fail: the tables are camel-cased, while the queries in the Python code use all lower case.
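One possible workaround (a sketch; the camel-cased spellings below are assumptions and would need to match the real schema) is to rewrite the lower-cased table names in the queries before sending them to MySQL:

```python
import re

# Map the lower-cased names used in the Python code to the camel-cased
# spellings the Linux MySQL server expects. Spellings are illustrative.
TABLE_NAMES = {
    "stablecommit": "StableCommit",
    "commitmetadata": "CommitMetaData",
    "processmetrics": "ProcessMetrics",
}

def fix_table_case(sql):
    """Rewrite whole-word table names to their camel-cased form."""
    for lower, actual in TABLE_NAMES.items():
        sql = re.sub(r"\b%s\b" % lower, actual, sql)
    return sql
```

Alternatively, setting lower_case_table_names in the MySQL server configuration avoids touching the queries at all.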

Different feature reduction strategies

Enable us to configure different feature reduction strategies (e.g., variance-based). Also enable the possibility of "no feature reduction".

See how we implemented the different balancing strategies, and follow the same code pattern.
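Following that pattern, the configuration could map a strategy name to a scikit-learn transformer (or None for "no feature reduction"). This is a sketch; the strategy keys are assumptions, not the actual configs.py values:

```python
from sklearn.feature_selection import VarianceThreshold

def build_feature_reduction(strategy):
    """Return a feature-reduction transformer for the given strategy name.
    Illustrative sketch mirroring the balancing-strategy pattern."""
    if strategy == "variance":
        # drops features whose variance is not above the threshold
        return VarianceThreshold(threshold=0.0)
    if strategy == "none":
        return None  # "no feature reduction"
    raise ValueError("unknown feature reduction strategy: %s" % strategy)

reducer = build_feature_reduction("variance")
```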

Do not balance data during training

I'm curious to see what happens if we don't balance the data during training. Does the number of FPs improve? (I am seeing this trend in other work I'm doing...)

Classifier Results

We want to investigate the results of the classifier. Therefore, we want to have a look at the original files.

For each result of a classifier store:

  • CommitID
  • Affected File/Class/Method/Variable
  • Repository ID

Is this paused ?

There has been little news, and the refactoring repo has been archived. Have you moved to something else? Is there a fork?
