wm-semeru / ds4se

Data Science for Software Engineering (ds4se) is an academic initiative to perform exploratory and causal inference analysis on software engineering artifacts and metadata. Data Management, Analysis, and Benchmarking for DL and Traceability.

Home Page: https://wm-csci-435-f19.github.io/ds4se/

License: Apache License 2.0

Dockerfile 0.01% Jupyter Notebook 99.50% Python 0.46% Shell 0.01% C 0.03% Makefile 0.01%

ds4se's People

Contributors

Stargazers

Watchers

Forkers

rmclanton leylig charleswang528

ds4se's Issues

Fix: exp Mongo simulation tests

There are pathing errors in the simulate mongodb method in 1.0_exp.i. These need to be fixed in order to use this as a basis for testing exploratory.

CSCI-SE-Proj2: Refactor 6.x

Missing Dependencies

Running list of dependencies that were not installed using requirements.txt
0.0
tokenizers

1.1
fastprogress - in ds4se.mgmnt.prep.bpe (?)

6.2
Lizard
Tree_sitter

3.3
Gensim
Prg

Fix CI/CD

Have the repository properly test the project with each push/pull request

CSCI-SE-Proj2: Facade architecture design

Add desc Test flags to settings.ini

The testing team needs to flag all of the tests created in the desc section of the nbs and then add those flags to the settings.ini file in the proper format. Before this issue can be submitted as complete, the tests must be invokable through the ds4se format of running tests.
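As a rough illustration of what "adding flags to settings.ini" could look like, the sketch below parses a pipe-separated `tst_flags` entry, which is the convention nbdev v1 projects commonly use. The flag names `desc_stats` and `desc_metrics` are hypothetical and the fragment is not taken from the ds4se repository.

```python
import configparser
from typing import List

# Hypothetical settings.ini fragment; the flag names are placeholders.
SETTINGS_INI = """
[DEFAULT]
lib_name = ds4se
tst_flags = desc_stats|desc_metrics
"""


def read_test_flags(ini_text: str) -> List[str]:
    """Return the test flags declared in a settings.ini-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser["DEFAULT"].get("tst_flags", "")
    # Flags are assumed to be separated by "|"; empty entries are dropped.
    return [flag.strip() for flag in raw.split("|") if flag.strip()]


flags = read_test_flags(SETTINGS_INI)
print(flags)
```

A check like this can confirm that every flag used in the notebooks is actually declared before the tests are run.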

Trace Value computations using the techniques

Test Cases For desc.stats

Formalizing existing and creating new test cases for methods in desc.stats file

Create/Implement TestDS4SE.ipynb

Create a Python program that, when run, executes nbdev tests on our selected nbdev files and prints a stub summary of the process to the user.
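One possible shape for such a runner is sketched below. It assumes the actual test invocation is supplied as a callable (`run_one`, a hypothetical name), so the sketch makes no claim about which nbdev command the project uses; the notebook names in the usage lines are placeholders.

```python
from typing import Callable, Dict, List


def run_notebook_tests(notebooks: List[str],
                       run_one: Callable[[str], bool]) -> Dict[str, bool]:
    """Run `run_one` on each notebook path and record pass/fail."""
    results: Dict[str, bool] = {}
    for nb in notebooks:
        try:
            results[nb] = bool(run_one(nb))
        except Exception:
            # A raised exception is counted as a failed notebook.
            results[nb] = False
    return results


def summarize(results: Dict[str, bool]) -> str:
    """Build a short, human-readable summary of the test run."""
    passed = sum(results.values())
    lines = [f"{'PASS' if ok else 'FAIL'}  {nb}" for nb, ok in results.items()]
    lines.append(f"{passed}/{len(results)} notebooks passed")
    return "\n".join(lines)


# Usage with a stand-in runner; replace the lambda with the real test call.
demo = run_notebook_tests(["a.ipynb", "b.ipynb"], lambda nb: nb.startswith("a"))
print(summarize(demo))
```

Keeping the runner injectable means the summary logic can be tested without executing any notebooks.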

CSCI-SE-Proj2: Create Branches for Sub Domains

Create Branches For Project 2:

SE_Proj2: Main branch of Project 2, used by everyone
SE_Proj2_Testing: Branch used by Yangchen and Alex
SE_Proj2_Refactor: Branch used by Robert and Will
SE_Proj2_Facade: Branch used by Charles and Daniel

These separate branches are meant to prevent unnecessary merge and push collisions. Domains can merge with each other when needed, while certain changes stay isolated until the group can confirm them.

Enumerate and run test cases over refactored components

Replace random generators with actual traceability models

Doc2vec and Word2vec are available

CSCI-SE-Proj2: Contact other team to see what components they would need

CSCI-SE-Proj2: Resolve deprecated function usage 6.0

Integrate CodeBERT to the baseline

Please use and integrate CodeBERT to the baseline in a separate component:
https://github.com/microsoft/CodeBERT

CSCI-SE-Proj2: Documentation for API Usage

Fix and Reorganize Facade Tests

There is a minor return error in the Facade file that needs correcting. The test cases should also be reorganized.

Test Cases for 1.0 exp

Test shannon dit and data frame methods

CSCI-SE-Proj2: Facade Proto Test Cases

Create Assertion Tests for the Facade created by the Facade team. These proto asserts are barebones assertions that will need to be updated as the Facade changes.

Fixing Corrupted Files

Certain files have become corrupted resulting in problems throughout the repo. Fix these files and push to main.

CSCI-SE-Proj2: Clean .vis

Implement Mutual Information Metric for Traceability Datasets

We use DIT as our core library for entropy metrics.
https://dit.readthedocs.io/en/latest/
https://github.com/dit/dit

Please focus on Copy Mutual Information: https://dit.readthedocs.io/en/latest/measures/divergences/copy_mutual_information.html
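For orientation, here is a plain-Python sketch of ordinary Shannon mutual information estimated from paired observations. It does not use dit and does not implement the Copy Mutual Information measure linked above. It only illustrates the quantity I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y))). The function name and the pairing of source and target tokens are assumptions made for the example.

```python
from collections import Counter
from math import log2
from typing import Hashable, Iterable, Tuple


def mutual_information(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Estimate I(X;Y) in bits from observed (x, y) pairs."""
    observed = list(pairs)
    n = len(observed)
    if n == 0:
        raise ValueError("at least one (x, y) pair is required")
    joint = Counter(observed)
    count_x = Counter(x for x, _ in observed)
    count_y = Counter(y for _, y in observed)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Perfectly dependent variables share one bit of information.
print(mutual_information([(0, 0), (1, 1)]))
# Independent variables share none.
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

A reference implementation like this can serve as a cross-check for values returned by a library-based version.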

Incorrect Behavior of get_cnt method in 1.1_exp.info

CSCI-SE-Proj2: Prototype

markdown file as a technical manual

architecture, installation, deployment, and usage

Integrate Probabilistic Libraries to compute Mutual Information

Import Comet Library Functionality to process the mutual information metric
(this is a test issue)

CSCI-SE-Proj2: Remove scrap code

System probability colab notebook tutorial

Create a colab notebook presenting a tutorial of how to use DS4SE to analyze a system probabilistically. Use the refactored information theory and statistical components.

Create documentation markdown file and pip page

Change method documentation into reST format

Add parameter details and return values to each function in the facade

Implementation of Facade Functions

Use refactored code to implement functions in facade to return actual calculated value

CSCI-SE-Proj2: List for Refactor team to refactor

Non-exported code

3.0_mining.ir.model
3.0_mining.unsupervised.traceability.ida
3.1_mining.ir.i
3.1_mining.unsupervised.traceability.eda - stuff in here, maybe need to export?

CSCI-SE-Proj2: Refactor 1.x

Test Cases for Facade

TraceLinkValue function for word2vec and doc2vec functionality
NumDoc
VocabSize
AverageToken
Vocab
VocabShared
SharedVocabSize
MutualInformation
CrossEntropy
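Several of the listed metrics can be pinned down with small assertion tests. The sketch below uses stand-in implementations with hypothetical names (`vocab`, `shared_vocab`, `average_tokens`) and assumes lowercase whitespace tokenization. The real facade functions and their signatures may differ.

```python
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return text.lower().split()


def vocab(docs: List[str]) -> Counter:
    """Token frequencies over a list of documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def shared_vocab(source_docs: List[str], target_docs: List[str]) -> Set[str]:
    """Tokens that occur in both the source and the target artifacts."""
    return set(vocab(source_docs)) & set(vocab(target_docs))


def average_tokens(docs: List[str]) -> float:
    """Mean number of tokens per document."""
    return sum(len(tokenize(doc)) for doc in docs) / len(docs)


source = ["open the file", "close the file"]
target = ["read the file contents"]

assert len(source) == 2                                # NumDoc
assert len(vocab(source)) == 4                         # VocabSize
assert average_tokens(source) == 3.0                   # AverageToken
assert shared_vocab(source, target) == {"the", "file"} # VocabShared
assert len(shared_vocab(source, target)) == 2          # SharedVocabSize
```

Small hand-checked corpora like this make the expected values easy to verify by inspection.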

CSCI-SE-Proj2: Test Cases For Desc 6.x

CSCI-SE-Proj2: Clean .exp files

Integrating CodeBERT

Please use and integrate CodeBERT to the baseline in a separate component:
https://github.com/microsoft/CodeBERT

Test Cases for 1.1_exp.info

Make test cases for methods exported in 1.1_exp.info

Traceability facade tutorial colab notebook

Notebook should present a tutorial of how to use the traceability facades.
Maybe use Libest

[Phase II] T-Miner & DS4SE

Phase II aims to fill the gaps needed for a fully functional T-Miner (beta) version. To have a stable version, we need to adopt new SE methodologies that work specifically for data science and machine learning. Such methodologies involve other frameworks such as DVC, nbdev, and TFX. This phase is composed of the following activities:

T-Miner

T-Miner Interoperability and Deployment. We must guarantee that T-Miner is communicating with the DS4SE library, Jenkins, and a deployed version of SecureReqNet.
T-Miner Navigation. We must guarantee that the proposed navigation is functional and stable. Important use cases: information recovery (traceability) and information analysis (entropy). The tool should retrieve, create, update, and delete traceability results.
Causal Inference View. We need to implement a causal inference view for T-Miner. CI should be consumed from DS4SE. However, no modules in DS4SE have been fully developed. This is a whole back-end solution to update our previous COMET solution.

DS4SE

Data repository integration. We have been employing DVC for data versioning. However, our projects are not fully integrated. We need to centralize all the SE-related data in a single remote. Our current architecture allows one remote per git project, which generates data redundancies.
Data Science/ML Continuous Integration. We want to adopt Continuous Machine Learning (CML). The main goal of CML is to keep all our experiments and models under control. Similar to TFX, DVC has its own pipeline solution.
Migrating Unsupervised Traceability Models into CML-DVC. All our unsupervised models will be shaped as an ML pipeline for further enhancement and development.