
Project CodeNet


The goal of Project CodeNet is to provide the AI-for-Code research community with a large-scale, diverse, and high-quality curated dataset to drive innovation in AI techniques.

Introduction

A decade ago, Marc Andreessen famously wrote that "software is eating the world." Software now permeates every part of our existence; Google services combine for 2 billion lines of code, and a modern vehicle contains around 100 million lines of code. It's a monumental challenge to create, debug, maintain, and update these complex software systems. Recently, a fast-growing discipline known as AI for Code aims to help software developers improve their productivity by automating the software engineering process. AI for Code researchers have been leveraging technologies like NLP and augmenting them with code analysis and compilation techniques to perform a myriad of practical tasks, such as code search, summarization, and completion, as well as code-to-code translation. The discipline isn't limited to academic research either: Ruchir Puri, IBM Research's chief research scientist, discussed in a recent podcast how technologies from AI for Code are being used to modernize legacy software by helping migrate monolithic applications to microservices for IBM's enterprise clients.

AI for Code is poised to transition from proof-of-concept to widespread adoption. To provide a catalyst for such a tipping point, researchers at IBM Research have introduced Project CodeNet, a large-scale dataset for benchmarking and experimentation. Project CodeNet has many characteristics (large scale, diversity, etc.) similar to ImageNet, a huge dataset for imagery that had a dramatic impact on the field of computer vision research. Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. Project CodeNet aims to do for AI for Code what ImageNet did for computer vision.

Differentiation

There are a few differentiating features of Project CodeNet when compared to other similar efforts. In addition to the size of the dataset, the code samples are written in over 50 programming languages, though the dominant languages are C++, C, Python, and Java. The code samples in Project CodeNet are annotated with a rich set of information, such as the code size, memory footprint, CPU run time, and status, which indicates acceptance or error types. Over 90% of the problems come with a problem description, which contains a concise problem statement and specifications of the input and output formats. When available, we also extracted sample input and output from the problem description and provide them as part of the dataset. Users can execute the accepted code samples (over 50% of the submissions are accepted) to extract additional metadata and verify outputs from generative AI models for correctness.

Another area that Project CodeNet addressed is the quality of the data samples. From a paper by Allamanis, we learned that quite a large number of frequently used AI for Code datasets have duplicate or near-duplicate code samples, which can inflate performance metrics by as much as 100%. In addition, we found that problem-submission style datasets from online judging systems can contain clusters of identical problems, which will certainly skew the performance metrics. One example is POJ-104, in which problems 26 and 62 are identical. Therefore we identified the near-duplicates and the identical problem clusters in Project CodeNet and provide this information for the benefit of users.

Benchmarks

In light of these issues, we have extracted several benchmark datasets from CodeNet for users to perform code classification and code similarity experiments. They have been filtered to remove identical problem clusters and near-duplicate code samples, so that performance metrics can be measured on training and test data samples with the appropriate statistics. There are two C++ benchmark datasets that are similar to the popular POJ-104 but approximately ten times its size. We felt that the size increase is necessary, since 98% accuracy has already been achieved in code classification on POJ-104. An order-of-magnitude larger dataset will leave ample room to advance the state of the art with more complex neural networks and algorithms. The other two benchmark datasets are in Python and Java, which provide a different flavor because of the frequent use of library functions.

Potential use cases

The rich metadata and diversity open Project CodeNet to a plethora of use cases. The problem-submission relationship in Project CodeNet corresponds to type-4 similarity and can be used for code search and clone detection. The code samples in Project CodeNet are labeled with their acceptance status, so we can explore AI techniques to distinguish correct code samples from problematic ones. Project CodeNet's metadata also enables the tracking of how a submission evolves from problematic to accepted, which could be used for exploring automatic code correction. Each code sample is labeled with CPU run time and memory footprint, which can be used for regression studies and prediction. Given its wealth of programs written in a multitude of languages, Project CodeNet may serve as a valuable benchmark dataset for source-to-source translation.

Usability

To facilitate the creation of customized benchmarks and datasets, we provide a set of productivity tools to aggregate code samples based on user criteria. We are also releasing pre-processing tools to transform code samples into token sequences, simplified parse trees, and other code graph representations.

Models and experiments

We have performed numerous experiments on the CodeNet dataset. The goal of these experiments is to produce a set of baseline models and results against which users of the CodeNet dataset can gauge their research. The run scripts and training scripts are available in the model-experiments directory. The classification and similarity experiments use the benchmark datasets we extracted from CodeNet as training and test datasets. In addition to experiments based on token sequences, we also have experiments leveraging graph neural networks (GNNs). For the convenience of users interested in GNNs, we have included the simplified parse tree (SPT) representation of the code samples for each benchmark dataset. The masked language model experiment has a companion Jupyter notebook in the notebooks directory.

Problem Descriptions

For the vast majority of problem classes, short problem descriptions are available in 'doc/problem_descriptions.tar.gz', as a small HTML file per problem.

Relevant links

Download the dataset

Download the full dataset in our data repository.

Run tar -zxf Project_CodeNet_full.tar.gz to uncompress and untar it. The directory structure and how the code samples are organized are explained here.

The 4 benchmark datasets, Project_CodeNet_C++1000, Project_CodeNet_C++1400, Project_CodeNet_Python800, and Project_CodeNet_Java250, are included in the full dataset and are available separately in the "Archive Dataset File" column of the table in the "Get this Dataset" section of our data repository. They can be used for code classification and code similarity research as a replacement for, or in addition to, the POJ-104 dataset.

To expedite AI for code research using graph neural networks, we have included the simplified parse tree (SPT) representation of the code samples for each benchmark dataset. They are available in the "Archive SPT File" column of the table in the "Get this Dataset" section in our data repository.

Dataset overview

The Project CodeNet Dataset consists of a very large collection of source files, extensive metadata, tooling to access the dataset and make tailored selections, and documentation.

The basis of the dataset is the data available on two online judge web sites:

  1. AIZU Online Judge
  2. AtCoder

An online judge website offers programmers an opportunity to test their skills by posing programming problems in the form of courses or contests. Users may submit their solutions, which are then judged by an automatic review mechanism. The outcome is reported back to the user. Problem descriptions, user submissions, and associated metadata are available for study via various REST APIs.

The first step in constructing Project CodeNet is downloading the problem descriptions and the source code submissions from the websites mentioned above, followed by reshaping and consolidating the metadata and cleaning up the inconsistencies, omissions, and mistakes in the source data itself.

Dataset statistics

The dataset comprises 13,916,868 submissions, divided into 4053 problems (of which 5 are empty). Of the submissions, 53.6% (7,460,588) are accepted, 29.5% are marked as wrong answer, and the remainder suffer from one of the other possible rejection causes. The data contains submissions in 55 different languages, although 95% of them are coded in the six most common languages (C++, Python, Java, C, Ruby, C#). C++ is the most common language, with 8,008,527 submissions (57% of the total), of which 4,353,049 are accepted. Two pie charts in the repository depict the submission and status distributions of Project CodeNet.


A detailed overview of the dataset statistics can be found in this spreadsheet.

Data

The data consist of complete programs in a particular programming language. Each program is contained in a single file. The file has a name with an extension that denotes the programming language used. (More details about the specific programming language and the version of the compiler/interpreter used can be found in the metadata.)

Each program attempts to solve a certain programming task or problem. There are many problems and each problem might have many solutions in different languages. We refer to each program as a submission instead of a solution since it might not be complete and correct. Solutions are the accepted submissions that are compilable and executable, and at least correctly produce the expected results on all provided test cases. (Of course, according to the late Dijkstra, tests are no proof of correctness.)

Metadata

The metadata provides properties of interest about the problems and their submissions. Foremost, it formalizes the organization of the data and the relationship between problems, languages, and the source code files. The metadata allows users to query the data and to make specific selections among the large collection of problems, languages, and source files.

Metadata is made available in comma-separated value (CSV) files. This allows for easy processing, even with simple command-line tools. Some of the fields in the CSV files might be empty, and for submissions that are not accepted, some fields might have invalid entries such as negative numbers for CPU time. Extra checking needs to be implemented when parsing these files.
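For illustration, here is a minimal Python sketch of such defensive parsing. It assumes a local copy of the dataset and that each per-problem CSV carries a header row with the field names listed in the tables below; the path in the comment is only an example.

```python
import csv

def read_submissions(csv_path):
    """Yield rows of a per-problem metadata CSV with basic sanity checks."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Fields may be empty, and rejected submissions can carry invalid
            # values such as a negative cpu_time, so convert defensively.
            try:
                cpu_time = int(row.get("cpu_time", ""))
            except ValueError:
                cpu_time = None
            row["cpu_time"] = cpu_time if cpu_time is not None and cpu_time >= 0 else None
            yield row

# Example (path is illustrative):
# for sub in read_submissions("Project_CodeNet/metadata/p00001.csv"):
#     print(sub["submission_id"], sub["status"], sub["cpu_time"])
```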

The metadata is hierarchically organized on 2 levels: the first level is the dataset level that relates to all the different problems defined by the various dataset sources. The second level is the problem level that relates to all source code submissions pertaining to a single problem or task.

Metadata and data are deliberately kept fully separated within the file system.

Metadata at the dataset level

At the dataset level there is a single CSV file (problem_list.csv) listing all the different problems. Additionally, for each problem there is a more extensive description that sets the problem and any further requirements and constraints and often provides examples of data input and expected output.

The fields and their format of this CSV file are captured by the following table:

| name of column | data type | unit | description |
| --- | --- | --- | --- |
| id | string | none | unique anonymized id of the problem |
| name | string | none | short name of the problem |
| dataset | string | none | original dataset, AIZU or AtCoder |
| time_limit | int | millisecond | maximum time allowed for a submission |
| memory_limit | int | KB | maximum memory allowed for a submission |
| rating | int | none | rating, i.e., difficulty of the problem |
| tags | string | none | list of tags separated by "\|"; not used |
| complexity | string | none | degree of difficulty of the problem; not used |
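As a small illustration, the following Python sketch loads problem_list.csv using the column names above (the path assumes a local copy of the dataset):

```python
import csv

def load_problem_list(path="Project_CodeNet/metadata/problem_list.csv"):
    """Return a {problem_id: row} dictionary for problem_list.csv."""
    with open(path, newline="") as f:
        # time_limit, memory_limit, rating, tags, and complexity may be empty;
        # values are kept as strings and converted only where needed.
        return {row["id"]: row for row in csv.DictReader(f)}

# problems = load_problem_list()
# problems["p00001"]["name"]  # -> "List of Top 3 Hills"
```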

Metadata at the problem level

At the problem level there is one CSV file per problem; the contents of all these files are organized under one and the same header.

The fields and their format of this CSV file are captured by the following table:

| name of column | data type | unit | description |
| --- | --- | --- | --- |
| submission_id | string | none | unique anonymized id of the submission |
| problem_id | string | none | anonymized id of the problem |
| user_id | string | none | anonymized user id of the submission |
| date | int | seconds | date and time of submission in the Unix timestamp format (seconds since the epoch) |
| language | string | none | mapped language of the submission (ex: C++14 -> C++) |
| original_language | string | none | original language specification |
| filename_ext | string | none | extension of the filename that indicates the programming language used |
| status | string | none | acceptance status, or error type |
| cpu_time | int | millisecond | execution time |
| memory | int | KB | memory used |
| code_size | int | bytes | size of the submission source code in bytes |
| accuracy | string | none | number of tests passed; only for AIZU |

Here is a table of all the possible status values. The "abbreviation" and "numeric code" are sometimes seen in the original metadata on the websites; they are listed here for reference and completeness. These fields do not occur in the Project CodeNet metadata.

| status | abbreviation | numeric code |
| --- | --- | --- |
| Compile Error | CE | 0 |
| Wrong Answer | WA | 1 |
| Time Limit Exceeded | TLE | 2 |
| Memory Limit Exceeded | MLE | 3 |
| Accepted | AC | 4 |
| Judge Not Available | JNA | 5 |
| Output Limit Exceeded | OLE | 6 |
| Runtime Error | RE | 7 |
| WA: Presentation Error | PE | 8 |
| Waiting for Judging | WJ | |
| Waiting for Re-judging | WR | |
| Internal Error | IE | |
| Judge System Error | | |

Directory structure and naming convention

The data and metadata are organized in a rigorous directory structure. At the top level sits the Project CodeNet directory with several sub-directories: data, derived, metadata, and problem_descriptions:

  • data is further subdivided into a directory per problem and within each problem directory, directories for each language. The language directory contains all the source files supposed to be written in that particular programming or scripting language. When there are no submissions for a particular language, there will be no directory for it, but the problem directory will always be there, even if there are no submissions at all.

    The name of the directory for a programming language is the common name for the language using proper capitalization and special characters. This name is the consolidation of the names used in the metadata. Information is available about how the original language designations are mapped into the directory names and how these more general and common names are mapped to the submission file name extensions. As an example, a source could be designated c++14, which is mapped into the directory C++ (notice the capital C) and will get the extension .cpp.

  • derived holds information about near-duplicates, identical problem clusters, sample input and output for each problem, as well as the benchmarks.

  • metadata holds all the problem CSV files and the problem_list.csv file.

  • problem_descriptions holds HTML files for most problems, giving an extensive description of the problem, often accompanied with some sample input and expected output.

For the sake of creating a uniform set of metadata across all data sources, and to hide any sensitive information, some metadata fields are anonymized by randomly (but uniquely and consistently) renumbering problem, submission, and user identifiers (ids). The identifiers we use are defined by simple regular expressions:

  • problem ids are anonymized and follow this pattern: p[0-9]{5} (a p followed by exactly 5 digits).
  • submission ids are anonymized and follow this pattern: s[0-9]{9} (an s followed by exactly 9 digits).
  • user ids are anonymized and follow this pattern: u[0-9]{9} (a u followed by exactly 9 digits).
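These patterns can be checked mechanically; a minimal sketch, using the ids that appear in the walk-through example below:

```python
import re

PROBLEM_ID = re.compile(r"^p[0-9]{5}$")
SUBMISSION_ID = re.compile(r"^s[0-9]{9}$")
USER_ID = re.compile(r"^u[0-9]{9}$")

# Ids taken from the walk-through example below:
assert PROBLEM_ID.match("p00001")
assert SUBMISSION_ID.match("s300682070")
assert USER_ID.match("u558442027")
```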

Relationships among the metadata and data

The main relationship between problem metadata and data is the fact that each metadata record (a non-header row in a problem CSV file) describes one source file and provides all information about its location. The directory structure and naming convention as stated above are implicitly assumed.

Example of getting the source file for a particular submission

Starting at a CSV metadata entry for a particular submission, here is how to get to the corresponding source file. Say that the submission id is s300682070. Either we know this is a submission to problem p00001 upfront or we can grep through all Project_CodeNet/metadata/p?????.csv files to learn that. We get a brief description of this problem by looking at the p00001 entry in the Project_CodeNet/metadata/problem_list.csv:

p00001,List of Top 3 Hills,AIZU,1000,131072,,,

We can get a more verbose description of this problem by reading Project_CodeNet/problem_descriptions/p00001.html.

The Project_CodeNet/metadata/p00001.csv file provides the info on all submissions. For our selected submission we find:

s300682070,p00001,u558442027,1480319506,JavaScript,JavaScript,js,Accepted,60,15496,219,4/4

We see it is an Accepted submission in the language JavaScript with file extension .js.

The source file path therefore is: Project_CodeNet/data/p00001/JavaScript/s300682070.js
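A minimal Python sketch of this forward lookup (the helper function is ours; it simply applies the directory structure and naming convention described above):

```python
from pathlib import Path

def source_path(root, problem_id, language, submission_id, filename_ext):
    """Build a submission's file path from its metadata fields."""
    return Path(root) / "data" / problem_id / language / f"{submission_id}.{filename_ext}"

# Using the metadata row above:
# source_path("Project_CodeNet", "p00001", "JavaScript", "s300682070", "js")
# -> Project_CodeNet/data/p00001/JavaScript/s300682070.js
```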

Example of getting the metadata for a particular source file

Likewise, we can play the reverse game of finding the metadata entry for a given submission source file. Say the source file is Project_CodeNet/data/p00001/JavaScript/s300682070.js.

Encoded in this file name path we see the problem id p00001 and language JavaScript and of course the submission id s300682070. We find the metadata CSV file to be: Project_CodeNet/metadata/p00001.csv. Opening that file and searching for the submission id we find the entry:

s300682070,p00001,u558442027,1480319506,JavaScript,JavaScript,js,Accepted,60,15496,219,4/4
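The reverse lookup can likewise be scripted; a minimal sketch, assuming the per-problem CSVs carry the header row described earlier:

```python
import csv
from pathlib import Path

def metadata_for(source_file, root="Project_CodeNet"):
    """Return the metadata row of a submission source file, or None."""
    path = Path(source_file)
    problem_id, submission_id = path.parts[-3], path.stem
    with open(Path(root) / "metadata" / f"{problem_id}.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["submission_id"] == submission_id:
                return row
    return None

# metadata_for("Project_CodeNet/data/p00001/JavaScript/s300682070.js")
```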

Tools to process source files

The source files of Project CodeNet represent examples of some 50+ different programming and scripting languages. Of course not all languages are equally represented: most submissions are written in the more popular languages C, C++, Java, and Python.

To complement our large dataset of source code, a suite of tools and utilities will be provided. These tools target several purposes:

  • derive statistics from the dataset
  • access the dataset files to make selections
  • preprocess the source files to extract certain information
  • facilitate conversions between popular formats

Statistics

Since Project CodeNet uses the file system as storage and uses a rigorous directory structure, many (Linux) command-line utilities can be directly used to extract interesting statistics about the dataset. Utilities like ls, wc and grep are very useful. The CSV metadata can best be browsed using csvkit components like csvstat.

More elaborate statistics about the dataset can easily be retrieved using SQL queries on a database representation of the metadata. HSQLDB is a database that runs off a CSV file. Our CSV problem metadata files are simply stripped of their headers and concatenated. A suite of useful SQL queries is available. A separate document explains the necessary steps.
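As an alternative to csvkit or SQL, the same kind of statistics can be computed with a few lines of Python. The sketch below tallies submission statuses across all per-problem metadata files, assuming a local copy of the dataset in the layout described above:

```python
import csv
from collections import Counter
from pathlib import Path

def status_counts(metadata_dir="Project_CodeNet/metadata"):
    """Count submission statuses over all per-problem metadata files."""
    counts = Counter()
    # p?????.csv matches the per-problem files but not problem_list.csv.
    for csv_file in Path(metadata_dir).glob("p?????.csv"):
        with open(csv_file, newline="") as f:
            for row in csv.DictReader(f):
                counts[row["status"]] += 1
    return counts

# print(status_counts().most_common(5))
```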

Access and selection

As described above, it should be easy to create specific subsets of the dataset merely by copying (or symlinking) relevant files and/or directories. For more elaborate selections based on a subset or range of problems, a subset of languages, statuses, and code sizes, several Bash scripts are available to accomplish that. These scripts reside in the tools/aggregation-scripts directory and are separately documented in this README.
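The aggregation scripts themselves are written in Bash; purely for illustration, the following Python sketch performs a comparable selection with hypothetical criteria, symlinking accepted C++ submissions below a chosen code size:

```python
import csv
from pathlib import Path

def select(root="Project_CodeNet", out_dir="my_selection",
           language="C++", status="Accepted", max_code_size=2000):
    """Symlink submissions matching simple metadata criteria into out_dir."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for csv_file in (root / "metadata").glob("p?????.csv"):
        with open(csv_file, newline="") as f:
            for row in csv.DictReader(f):
                if (row["language"] != language or row["status"] != status
                        or not row["code_size"].isdigit()
                        or int(row["code_size"]) > max_code_size):
                    continue
                src = (root / "data" / row["problem_id"] / language
                       / f"{row['submission_id']}.{row['filename_ext']}")
                dst = out_dir / src.name
                if src.exists() and not dst.exists():
                    dst.symlink_to(src.resolve())

# select()  # fills my_selection/ with symlinks to the matching source files
```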

Pre-processing

We provide tools to convert code samples into representations that can be consumed by AI algorithms.

Whether and to what extent these pre-processing steps can successfully be applied to any given source file depends on several factors. Obviously, if a submission does not have Accepted status, it is to be expected that even simple tokenization may fail because of malformed lexical elements. But the situation for Accepted submissions is not always better: programmers might have used certain non-standard features of the language that happen to be accepted by a particular compiler or interpreter. A simple case is the use of a dollar sign as part of a C identifier. For languages like C and C++ that use a pre-processor, the use of macros and conditional defines can hugely change how the code ultimately looks.
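To illustrate the simplest of these pre-processing steps, here is a sketch (not the project's own tokenizer) that turns a Python code sample into a token sequence using the standard library's tokenize module; samples that fail to tokenize simply yield an empty list. The path in the comment uses a hypothetical submission id.

```python
import tokenize

def token_sequence(path):
    """Return (token type name, token text) pairs for a Python source file."""
    try:
        with open(path, "rb") as f:
            return [(tokenize.tok_name[tok.type], tok.string)
                    for tok in tokenize.tokenize(f.readline)]
    except (tokenize.TokenError, SyntaxError):
        # Rejected or non-standard submissions may fail to tokenize.
        return []

# token_sequence("Project_CodeNet/data/p00001/Python/s000000000.py")  # hypothetical id
```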

Contributors

Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss.


Issues

approximate gpu memory usage when training

Hello, thank you for your awesome work on large-scale source code dataset!

I am training CNN and GNN models for the code classification task by following the instructions in Section 8.1 of the paper and the scripts under the model-experiments/ directory.
As the dataset is pretty large (e.g., for C++ code with token sequences, I obtained 4,293,714 code solutions for 2,471 problems, with the longest sample having 200,520 tokens), I'd like to check with you what the approximate GPU memory usage was when training the models.

Also, would it be possible to get the implementation of C-BERT as well?

Thank you in advance!

Program halts whenever trying to import ogb.graphproppred

Hi, I am able to run from torch_geometric.data import InMemoryDataset successfully. But whenever I try to import anything from ogb, the program hangs indefinitely. The import is: from ogb.graphproppred import Evaluator

I followed the following command to install torch in conda environment:

CUDA=cu113
TORCH=1.11.0
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html
pip install torch-geometric
Note that I am using Ubuntu. My nvidia-smi output:

$ nvidia-smi
Wed Apr 20 05:09:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 27% 36C P8 9W / 151W | 1227MiB / 8116MiB | 1% Default |
| | | N/A

Artifacts for Project CodeNet paper are not available.

The initiative of having an authoritative dataset is valuable. It could serve as a benchmark for different source code processing tasks like code summarization or language to language translation.

For the CodeNet paper, results are reported for code classification and code similarity experiments. However, the models and configuration scripts for MLP, CNN, or GNN are not available, and artifacts are not made available for replication.

I would like to know whether there is a plan to share the replication package for the CodeNet paper. I would be happy to get involved if there is any assistance required to make the replication package ready.

Metadata information not found (problem_list.csv)

Hi,

I am looking for a description file that gives a brief summary of the implementation details of each problem set in the Project_CodeNet_Java250 dataset.

Right now, the dataset is available without any metadata description. I tried to find a file named "problem_list.csv", but in vain. Can you please share this information?

Does the metadata include compiler version and execution platform information

Hi,

Thank you for sharing this wonderful project.
You've annotated a rich set of information in the metadata, but does it include compiler version and execution platform information?

I think the compiler version or execution platform (e.g., x86) may impact the execution result and which programming syntax is accepted.

So how can we know the compiler version and platform information for the code samples in the dataset?
If this information is not included now, is it possible to add it?

Thank you!

WebSphere Portal to React modernization

Many customers are migrating from proprietary extensions to J2EE portlets in traditional e-commerce systems built on WebSphere Portal. Have any of these types of apps been migrated to newer UX frameworks like React? Are there any samples in CodeNet?

Why did you choose AIZU and AtCoder problems and code samples as CodeNet's raw data?

Hi,

Thank you for sharing this project.
I know that the code samples and problem descriptions are from AIZU & AtCoder:
The basis of the dataset is the data available on two online judge web sites:

AIZU Online Judge
AtCoder

I want to ask why you chose AIZU & AtCoder.
As there are a lot of online judge websites, is there anything special about these two sites,
and will you add code samples and problem descriptions from other online judge websites or other websites in the future?

Looking forward to your reply, thank you!

Lack of documentation on splitting the SPT (JSON) into CSVs to be fed to the trained GNN model for the code classification experiment

Hi,

I created an SPT from https://github.com/IBM/Project_CodeNet/tree/main/tools/spt-generator for a given sample C code:

Nitins-MacBook-Air:spt-generator nitinnanda$ ./scripts/run/spt-gen.sh -d=/Users/nitinnanda/Downloads/SPT ./examples/c/helloworld.c
/Users/nitinnanda/Downloads/SPT/helloworld.json is generated!
/Users/nitinnanda/Downloads/SPT/helloworld.csv is generated.
Nitins-MacBook-Air:spt-generator nitinnanda$ 

I created a trained model from https://github.com/IBM/Project_CodeNet/tree/main/model-experiments/gnn-based-experiments:

(pyg) Nitins-MacBook-Air:saved_models nitinnanda$ ls -ltr
total 11400
-rw-r--r--  1 nitinnanda  staff  5834983 Jun 21 18:07 gcn_lr1e-3_10_23.pt
(pyg) Nitins-MacBook-Air:saved_models nitinnanda$ 

Now the SPT here, helloworld.json, needs to be split into CSVs to feed to the trained model gcn_lr1e-3_10_23.pt; however, there is very little to no documentation on this split.

Please guide.

Would Loyc trees be useful for ML-based language translation and prediction?

I'm not a machine learning expert by any means. However, I think that there may be some value in running multi-language ML models against an intermediate language rather than against the original source text. For example, the following two statements are essentially identical in meaning, but not in form:

if (x > 0 && x <= y) {
  process(x, y);
} else {
  addError("Out of range");
}
if x > 0 and x <= y:
  process(x, y);
else:
  addError("Out of range");

For some years I've been developing an interchange format for code, which would map these two blocks to an identical intermediate form. While ML models can presumably learn all by themselves that these two blocks are "the same", I suspect it would be beneficial to "help" models understand common similarities in a more explicit way.

By analogy, 7-zip can compress a file down to a lower size when the original file is already expressed more efficiently (e.g. given an xml file and a binary file representing the same data, the size of the compressed binary file is smaller, even if the compression ratio isn't as good). Similarly, I speculate that adjusting different languages to be more similar to each other in a preprocessing step would spare the model from having to deal with irrelevant syntactic differences, allowing it to dedicate more resources to the (still very numerous) remaining differences.

Since I only work on this in my free time (and I have several other free-time projects), important parts of a Loyc tree standard are un/underdeveloped. Even so I would like to hear from ML researchers about whether something like this could be a useful tool for them.

Feature extraction for GNN example

Hi,

Thanks for releasing the dataset with baseline methods. I have a question regarding the GNN example in https://github.com/IBM/Project_CodeNet/tree/main/model-experiments/gnn-based-experiments . The data is preprocessed into the form of .csv files. May I know how I can obtain these files from the original C++1000 dataset? I'm asking partly because there seems to be some additional treatment, such as the next node token, on top of the SPT representation, so it would be great to have a step-by-step tutorial for obtaining the processed data in the first place.

I'm working on a project using the dataset, so your timely reply will be invaluable. Thanks in advance.

Dataset Split for Benchmark Code Classification

Hi,
Thanks for sharing such a great AI code dataset.

Page 8, Section 8.1 of the paper says: "For each experiment, 20% of the code samples are used for testing, while the rest are split in 4:1 for training and validation, respectively". However, page 12, "GNN with SPT", says: "We conduct 6/2/2 random split for each of the 4 benchmarks: i.e., 60% training data, 20% testing data, and 20% validation data."

I am confused about the differences. Could you please help me with this?

Best,
LJ.

How to run tests on generated samples?

Imagine that I have trained something to generate code according to the problem description. I would like to measure the quality of such generation in terms of the number of test cases that get a good verdict (AC). How can I run the tests?

Cannot get train, valid and test set from the corpus.

Hi,

Thanks for this wonderful project. One question: when I look through the data, I found that it has not been split into train/valid/test. Can you help split the data accordingly, so we can compare against the numbers posted in the table?


Thanks and best regards.

AttributeError: Can't pickle local object 'main.<locals>.<lambda>'

Hi,

Trying to replicate GNN-experiment on Host OS-MacOS BigSur v11.3.1 with below python packages.
Please Suggest.

(pyg) Nitins-MacBook-Air:gnn-based-experiments nitinnanda$ python list.py 
['certifi==2021.5.30', 'chardet==4.0.0', 'decorator==4.4.2', 'dill==0.3.4', 'googledrivedownloader==0.4', 'idna==2.10', 'isodate==0.6.0', 'jinja2==3.0.1', 'joblib==1.0.1', 'littleutils==0.2.2', 'markupsafe==2.0.1', 'networkx==2.5.1', 'numpy==1.20.3', 'ogb==1.3.1', 'outdated==0.2.1', 'pandas==1.2.4', 'pillow==8.2.0', 'pip==21.1.2', 'pyparsing==2.4.7', 'python-dateutil==2.8.1', 'python-louvain==0.15', 'pytz==2021.1', 'rdflib==5.0.0', 'requests==2.25.1', 'scikit-learn==0.24.2', 'scipy==1.6.3', 'setuptools==52.0.0.post20210125', 'six==1.16.0', 'threadpoolctl==2.1.0', 'torch-cluster==1.5.9', 'torch-geometric==1.7.1', 'torch-scatter==2.0.7', 'torch-sparse==0.6.10', 'torch-spline-conv==1.2.1', 'torch-summary==1.4.5', 'torch==1.9.0', 'torchsummary==1.5.1', 'torchvision==0.10.0', 'tqdm==4.61.1', 'typing-extensions==3.10.0.0', 'urllib3==1.26.5', 'wheel==0.36.2']
(pyg) Nitins-MacBook-Air:gnn-based-experiments nitinnanda$ 
(pyg) Nitins-MacBook-Air:gnn-based-experiments nitinnanda$ ./run.sh 
PYTHONPATH: :/Users/nitinnanda/Project_CodeNet/model-experiments/gnn-based-experiments
started experiments

Copyright (c) 2004-2016 California Institute of Technology.
Copyright (c) 2016-2021 The Uncertainty Quantification Foundation.
All rights reserved.

This software is available subject to the conditions and terms laid
out below. By downloading and using this software you are agreeing
to the following conditions.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met::

    - Redistribution of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    - Redistribution in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentations and/or other materials provided with the distribution.

    - Neither the names of the copyright holders nor the names of any of
      the contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


saved args: Namespace(batch_size=80, checkpoint='', checkpointing=1, clip=0.25, dataset='python1k', device=0, dir_data='/Users/nitinnanda/Project_CodeNet/model-experiments/gnn-based-experiments/data', dir_results='/Users/nitinnanda/Project_CodeNet/model-experiments/gnn-based-experiments/results', dir_save='/Users/nitinnanda/Project_CodeNet/model-experiments/gnn-based-experiments/saved_models', drop_ratio=0.0, emb_dim=300, epochs=1000, feat_nums='', filename='gcn_lr1e-3', gnn='gcn', lr=0.001, num_layer=5, num_workers=80, patience=20.0, runs=10)
Data loading done!
/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/site-packages/torch/utils/data/dataloader.py:478: UserWarning: This DataLoader will create 80 worker processes in total. Our suggested max number of worker in current system is 4 (`cpuset` is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Let's use 0 GPUs! -- DataParallel running also on CPU only
=====Run 1, Epoch 1
Iteration:   0%|                                                                                                                        | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 321, in <module>
    main()
  File "main.py", line 246, in main
    loss, train_perf = train(model, device, train_loader, optimizer, args, evaluator)
  File "main.py", line 35, in train
    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/Caskroom/miniconda/base/envs/pyg/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.<lambda>'
(pyg) Nitins-MacBook-Air:gnn-based-experiments nitinnanda$ 

Backdoor detected

Apologies if this has already been reported, but Windows Security is detecting a backdoor threat, Backdoor.PHP/Dirtelti.MTF, in 4 PHP files in the download from this repository.

BackdoorPHP/Dirtelti.MTF
Alert level: Severe
Status: Active
Date: 2021-06-05 10:58
Category: Backdoor
Details: This program provides remote access to the computer it is
installed on.
Learn more
Affected items:
containerfile: C:\Users\b\Downloads\Project_CodeNet.tar.gz
containerfile: C:\Windows\Temp\TMPDB22CA7DF94B81B2
file: C:\Users\b\Downloads\Project_CodeNet.tar.gz->(GZip)->Project_CodeNet/data/p03844/PHP/s069566612.php
file: C:\Users\b\Downloads\Project_CodeNet.tar.gz->(GZip)->Project_CodeNet/data/p03844/PHP/s308064656.php
file: C:\Users\b\Downloads\Project_CodeNet.tar.gz->(GZip)->Project_CodeNet/data/p03844/PHP/s686221600.php
file: C:\Users\b\Downloads\Project_CodeNet.tar.gz->(GZip)->Project_CodeNet/data/p03844/PHP/s967662473.php

Problem inputs specifications

Hi,

Are the input specifications or test inputs available somewhere, so that the submitted solutions can be run locally?

Cannot build WALA program

Hi everyone,

I am trying to build the WALA tool by running build.sh, but I have encountered some problems.
The detailed log is below. I guess your documentation is out of date because the packages in the pom.xml are old. Also, I cannot find them in the Maven repository, so I edited the pom.xml to update the package names, but this was not enough to fix the build.
Thank you very much for any support.
[INFO] Scanning for projects... [INFO] [INFO] -------------------< CodeNet:AnalysisGraphGenerator >------------------- [INFO] Building com.ibm.wala.codeNet 0.0.1-SNAPSHOT [INFO] --------------------------------[ jar ]--------------------------------- [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ AnalysisGraphGenerator --- [INFO] Deleting /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/target [INFO] [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ AnalysisGraphGenerator --- [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent! [INFO] skip non existing resourceDirectory /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/resources [INFO] [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ AnalysisGraphGenerator --- [INFO] Changes detected - recompiling the module! [WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent! [INFO] Compiling 6 source files to /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/target/classes [INFO] ------------------------------------------------------------- [ERROR] COMPILATION ERROR : [INFO] ------------------------------------------------------------- [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[25,31] cannot find symbol symbol: class Dependency location: package com.ibm.wala.ipa.slicer [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[165,53] cannot find symbol symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[176,85] cannot find symbol symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[297,50] cannot infer type arguments for com.ibm.wala.util.graph.labeled.SlowSparseNumberedLabeledGraph<> reason: cannot infer type-variable(s) T,U (actual and formal argument lists differ in length) [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[54,29] cannot find symbol symbol: class Dependency location: class com.ibm.wala.codeNet.GraphAugmentor [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[65,60] cannot find symbol symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? 
extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [INFO] 6 errors [INFO] ------------------------------------------------------------- [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 1.328 s [INFO] Finished at: 2022-10-04T16:08:46+07:00 [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.7.0:compile (default-compile) on project AnalysisGraphGenerator: Compilation failure: Compilation failure: [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[25,31] cannot find symbol [ERROR] symbol: class Dependency [ERROR] location: package com.ibm.wala.ipa.slicer [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[165,53] cannot find symbol [ERROR] symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) [ERROR] location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[176,85] cannot find symbol [ERROR] symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) [ERROR] location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/WalaToGNNFiles.java:[297,50] cannot infer type arguments for com.ibm.wala.util.graph.labeled.SlowSparseNumberedLabeledGraph<> [ERROR] reason: cannot infer type-variable(s) T,U [ERROR] (actual and formal argument lists differ in length) [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[54,29] cannot find symbol [ERROR] symbol: class Dependency [ERROR] location: class com.ibm.wala.codeNet.GraphAugmentor [ERROR] /home/minhnh46/Project_CodeNet/tools/analysis-graph-generator/src/main/java/com/ibm/wala/codeNet/GraphAugmentor.java:[65,60] cannot find symbol [ERROR] symbol: method getEdgeLabels(com.ibm.wala.ipa.slicer.Statement,com.ibm.wala.ipa.slicer.Statement) [ERROR] location: variable sdg of type com.ibm.wala.ipa.slicer.SDG<? extends com.ibm.wala.ipa.callgraph.propagation.InstanceKey> [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Script for testing Input-Output

In the README file for the input-output data in CodeNet the authors say - "For every problem, input was fed into accepted solution programs in order to check if the output from the solution program matched the output file". Would it be possible for the authors to share the script they used to automate the above process?

Also, another question I had was regarding the "verified" tag given to problems. The README file associated with the input-output data says -

"We considered problem input and output to be "verified" if program output matched the output file for at least one tested program in any of the four programming languages used. As mentioned, there may be differences in formatting of solutions. Thus, "verified" problem input will not always produce output that is identical to the output file in accepted programs."

My question here is, what then is the purpose of the "verified" tag if it does not ensure that the produced output matches the specifications given in the output file?

Any help regarding these issues will be greatly appreciated. Thanks in advance!

Issue with compiling `tools/analysis-graph-generator`

When I try to run ./build.sh in tools/analysis-graph-generator, the gradle task to build WALA 1.5.7-SNAPSHOT fails.

My Setup

  • Docker Ubuntu 20.04 LTS
  • Java 11 (openjdk)

Steps to reproduce

$ chmod +x build.sh
$ ./build.sh
Cloning into 'WALA'...

...

> Task :com.ibm.wala.cast.java.test.data:compileTestJava
/root/analysis-graphs/WALA/com.ibm.wala.cast.java.test.data/src/test/java/javaonepointfive/Varargs.java:63: warning: non-varargs call of varargs method with inexact argument type for last parameter;
                varargs(2, 3, new String[] { "hello", "world" });
                              ^
  cast to Object for a varargs call
  cast to Object[] for a non-varargs call and to suppress this warning
Note: /root/analysis-graphs/WALA/com.ibm.wala.cast.java.test.data/src/test/java/JLex/Main.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /root/analysis-graphs/WALA/com.ibm.wala.cast.java.test.data/src/test/java/JLex/Main.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
1 warning

> Task :com.ibm.wala.core:downloadKawa
Download https://ftp.gnu.org/pub/gnu/kawa/kawa-3.0.zip

> Task :com.ibm.wala.util:compileJavaUsingEcj
----------
1. ERROR in /root/analysis-graphs/WALA/com.ibm.wala.util/src/main/java/com/ibm/wala/util/intset/IntSetUtil.java (at line 33)
        MutableIntSetFactory<?> intSetFactory = intSetFactoryClass.newInstance();
                                                                   ^^^^^^^^^^^^^
The method newInstance() from the type Class<capture#3-of ? extends MutableIntSetFactory<?>> is deprecated
----------
1 problem (1 error)

> Task :com.ibm.wala.util:compileJavaUsingEcj FAILED

> Task :com.ibm.wala.core:compileJava
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':com.ibm.wala.util:compileJavaUsingEcj'.
> Process 'command '/opt/java/openjdk/bin/java'' finished with non-zero exit value 255

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1m 38s
91 actionable tasks: 68 executed, 23 up-to-date

Tokenizer used for masked-language-model experiment

Hi, thank you for the awesome project!

Could you share the scripts used to generate *.toks files in the masked-language-model experiment?
I could run the scripts under the corresponding directory based on the provided tokenized data (link), but it would be great if I could tokenize my own dataset and try out the BERT model.

Also, is it the same C-BERT model used in the paper? (of course, after being finetuned with each specific downstream task).

Thank you in advance,
Jihye

Missing information in problem_list.csv

Hello,

I can't seem to find information on rating, tags, and complexity in the problem_list.csv file uploaded with the metadata. Would it be possible to obtain an updated copy (or steps on how to obtain this information, especially the tags)? This information would be helpful for my work.
problem_list.csv

Please consider putting the problem descriptions in the "MiniCodeNet" dataset

The MiniCodeNet dataset was a good idea for people on slow connections, but I was disappointed to find it doesn't contain the problem descriptions, since those were really the only parts I cared about. Alternatively, maybe add a download with just the problem descriptions/metadata to the downloads page.

Thanks!

What is type-4 similarity?

A sentence from README. "The problem-submission relationship in Project CodeNet corresponds to type-4 similarity and can be used for code search and clone detection". Some readers (me included) do not know what this refers to. Turning "type-4 similarity" into a hyperlink to a page that explains the concept would be very useful.

How to get the train, valid and test split for the Code Similarity and Classification experiments.

Hi,
Thanks for sharing such a great AI code dataset.

I would like to ask where the split files for the benchmarks (used in the code similarity and classification experiments of Sections 8.1 and 8.2) can be found. I see that "For each experiment, 20% of the code samples are used for testing, while the rest are split in 4:1 for training and validation, respectively." Are there any further details?

Thanks a lot.

Best,
Lucas.
