CodeS-distribution-shift-benchmark-datasets

This project is for the paper CodeS: Towards Code Model Generalization Under Distribution Shift.

Please check our project site for more details.

Dependencies

tensorflow==2.5.1
scikit-learn==0.24.2
numpy==1.19.5

Datasets and Models

All datasets and models can be downloaded from figshare.

Datasets: Python75.zip, Java250.zip, Python800.zip

Each dataset archive has the same directory structure. Take Python75.zip as an example:

├── raw                           # Raw data files scraped from online resources.
│   ├── LICENSE.md                # License for using the data
│   ├── [task_name]               # Code files in each task
│   │   └──  [submission_id].py   # Source code file (*.py) with the submission id.
│   ├── csv                       # Data descriptions
│   │   └──  [task_name].csv      # The description (e.g., Submission_id, Task_name, User) of each code file for this task
├── task                          # Datasets with the task distribution shift 
│   ├── LICENSE.md                # License for using the data
│   ├── pre-trained               # Data files for pre-trained language models
│   │   ├── train.jsonl           # Code files and labels in the training set
│   │   ├── id_test.jsonl         # Code files and labels in the ID test set
│   │   ├── ood_test.jsonl        # Code files and labels in the OOD test set
│   ├── token                     # Data files for DNN models
│   │   ├── train                 # Training set
│   │   │   ├── [task_name].tkn   # Token representations of source code files in this task for training
│   │   │   ├── info.json         # Information of the programming language and number of tokens
│   │   │   └── problems.json     # Information of the data size of each task
│   │   ├── id_test               # ID test set
│   │   │   ├── [task_name].tkn   # Token representations of source code files in this task for ID test
│   │   │   ├── info.json         # Information of the programming language and number of tokens
│   │   │   └── problems.json     # Information of the data size of each task
│   │   ├── ood_test              # OOD test set
│   │   │   ├── [task_name].tkn   # Token representations of source code files in this task for OOD test
│   │   │   ├── info.json         # Information of the programming language and number of tokens
│   │   │   └── problems.json     # Information of the data size of each task
├── user                          # Datasets with the programmer distribution shift 
│   └── ...                       # The same as task structure
├── time                          # Datasets with the time distribution shift 
│   └── ...                       # The same as task structure
├── token                         # Datasets with the token distribution shift 
│   └── ...                       # The same as task structure
├── cst                           # Datasets with the cst distribution shift 
│   └── ...                       # The same as task structure
├── random                        # Datasets with no distribution shift 
│   └── ...                       # The same as task structure
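
As a minimal sketch of how one of the pre-trained splits could be inspected, the snippet below reads a .jsonl file as one JSON object per line. The helper names `read_jsonl` and `summarize` and the example path are assumptions for illustration. The field names inside each record are not documented here, so the code only reports whichever keys it finds.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the number of records and the sorted list of keys seen."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return len(records), sorted(keys)


if __name__ == "__main__":
    # Hypothetical location after unzipping Python75.zip; adjust as needed.
    split = Path("Python75/task/pre-trained/train.jsonl")
    if split.exists():
        print(summarize(read_jsonl(split)))
```

Running the same helper on id_test.jsonl and ood_test.jsonl is a quick way to check that the three splits share the same record layout.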

Models: models.zip (trained models and OE detectors)

├── LICENSE.md                                                            # License for using the models
├── cnns                                                                  # Trained DNNs (CNN(sequence) and MLP(Bag)) using the training and ID test sets with different distribution shifts.
│   └── [DNN name]-[data name]-[distribution shift type].h5               # Trained DNN with a specific architecture, for a specific dataset with a certain distribution shift
├── oe_detectors                                                          # Trained OE detectors
│   └── [DNN name]-[data name]-[distribution shift type]-oe.h5            # OE detector with a specific architecture, for a specific dataset with a certain distribution shift
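
The naming pattern above can be turned into a small path helper. This is a sketch under stated assumptions: `model_path` is a hypothetical helper, and the example values "cnn", "java250", and "cst" are placeholders, since the exact strings used in the file names inside models.zip are not listed here. Loading uses `tf.keras.models.load_model`, which reads .h5 files.

```python
from pathlib import Path


def model_path(root, dnn, data, shift, oe=False):
    """Build the .h5 path following the naming pattern in the tree above."""
    folder = "oe_detectors" if oe else "cnns"
    suffix = "-oe" if oe else ""
    return Path(root) / folder / f"{dnn}-{data}-{shift}{suffix}.h5"


if __name__ == "__main__":
    # Placeholder values for illustration only; check the unzipped
    # models.zip for the actual file names.
    path = model_path("models", "cnn", "java250", "cst")
    if path.exists():
        import tensorflow as tf  # version pinned in Dependencies

        model = tf.keras.models.load_model(str(path))
        model.summary()
```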

OOD detectors

The implementations of the four OOD detectors are in the Detection/ directory.

mspDetector.py                         # The implementation of the Maximum Softmax Probability (MSP) detector.
odinDetector.py                        # The implementation of the Out-of-DIstribution detector for Neural networks (ODIN).
mahalanobisDetector.py                 # The implementation of the Mahalanobis detector.
oeDetector.py                          # The implementation of the Outlier Exposure (OE) detector.
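
To illustrate the idea behind the simplest of these detectors, the sketch below computes a Maximum Softmax Probability score from a vector of logits. It is not the code in mspDetector.py. The function names are assumptions, and the convention that a lower score suggests an OOD input is the usual one for MSP rather than something confirmed by this repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability; lower values suggest an OOD input."""
    return max(softmax(logits))


if __name__ == "__main__":
    confident = msp_score([8.0, 0.5, 0.1])  # one class dominates
    uncertain = msp_score([1.0, 0.9, 1.1])  # near-uniform prediction
    print(confident > uncertain)
```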

To obtain the AUROC of an OOD detector, run:

python Detection/evaluation.py --data_name java250 --result_dir [user define] --metric cst --detector odin

This command calculates the AUROC of the ODIN detector for the java250 dataset with the cst distribution shift.
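
For readers unfamiliar with the metric, here is a minimal sketch of how an AUROC can be computed from detector scores. It is not the code in Detection/evaluation.py. It assumes that higher scores mean "more in-distribution" and treats ID samples as the positive class; the repository may use a different convention. The function name `auroc` is hypothetical.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score.

    Ties count as half a win, which matches the area under the ROC
    curve when ID is the positive class.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


if __name__ == "__main__":
    # Toy scores for illustration only, not results from the paper.
    print(auroc([0.9, 0.3], [0.5, 0.1]))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; for large test sets, `sklearn.metrics.roc_auc_score` from the pinned scikit-learn dependency computes the same quantity more efficiently.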

How to use the OOD detectors:

  1. Clone Project CodeNet: git clone https://github.com/IBM/Project_CodeNet.git
  2. Copy the Detection directory into Project_CodeNet/model-experiments/token-based-similarity-classification/src/
  3. Run the evaluation command above to obtain the AUROC scores.

Acknowledgement

We thank Puri et al., the authors of Project CodeNet, for making their datasets and code publicly available. The raw source code files in Java250.zip and Python800.zip are from CodeNet. We also tokenize the source code files and build the models using the code in CodeNet.

We thank Liang et al., the authors of ODIN, for making their code publicly available. We built odinDetector.py on top of this open-source code.

We thank Lee et al., the authors of Mahalanobis, for making their code publicly available. We built mspDetector.py and mahalanobisDetector.py on top of this open-source code.

Support and maintenance

This project aims to facilitate research on distribution shift in source code understanding, and we welcome your contributions! Please submit an issue or a pull request and we will try our best to respond in a timely manner.

License

This project is under the CC0 license.

The raw source code files in Java250.zip and Python800.zip come from Project CodeNet and are under the Apache License 2.0.

We manually scraped the source code files in Python75.zip from AtCoder, a public programming contest site. Note: we only scrape public-facing data and respect the Privacy Policy and Copyright declared by AtCoder.

Note: we include a clear license file in each *.zip for the datasets and models.

If you use this project, please consider citing us:


@inproceedings{codes2023,
  author = {Hu, Qiang and Guo, Yuejun and Xie, Xiaofei and Cordy, Maxime and Ma, Lei and Papadakis, Mike and Traon, Yves Le},
  title = {CodeS: Towards Code Model Generalization Under Distribution Shift},
  year = {2023},
  booktitle = {IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)}
}
