
Comments (4)

guody5 commented on September 23, 2024

Thanks for opening this repository and providing the dataset to download!
I have the following three questions:

  1. In your paper, you mention that you used unimodal code to pretrain the replaced token detection task. Is it possible to download the unimodal code dataset? So far, the data I found in the CodeSearchNet repo is bimodal.
  2. You also provide the cleaned CodeSearchNet data here. Did you use the original CodeSearchNet data for pretraining and only the cleaned data for the Code Documentation Generation task?
  3. In your paper, you also carried out the task on C# using the CodeNN dataset. How do you tokenize the C# code? In their repository, they replaced parts of the code with tokens like CODE_STRING and CODE_INTEGER. Did you also do such token replacement for C#?

Thanks a lot in advance for answering these questions!

  1. CodeSearchNet provides unimodal data in pickle files (https://github.com/github/CodeSearchNet#data-details). Please find it at the link. For unimodal data, the value of the field "docstring_tokens" should be an empty list.
  2. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.
  3. We use the same pre-processing script as CodeNN to tokenize the C# code (https://github.com/sriniiyer/codenn/blob/0f7fbb8b298a84faf4a14a8f76199e39af685e4a/src/model/buildData.py#L81), including replacing some tokens with CODE_STRING and CODE_INTEGER. We then use the same code as this repo for the Code Documentation Generation task.
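Point 1 can be checked programmatically: a record is unimodal when its "docstring_tokens" field is an empty list. A minimal sketch of that split, using inlined stand-in records rather than the actual pickle files:

```python
# Sketch: splitting CodeSearchNet-style records into unimodal and bimodal
# subsets by testing whether "docstring_tokens" is empty, as described above.
# These records are illustrative stand-ins for entries from the pickle files.

records = [
    {"code": "def add(a, b): return a + b", "docstring_tokens": []},
    {"code": "def sub(a, b): return a - b",
     "docstring_tokens": ["Subtract", "b", "from", "a", "."]},
]

# Unimodal: code only (empty docstring_tokens); bimodal: code paired with docs.
unimodal = [r for r in records if not r["docstring_tokens"]]
bimodal = [r for r in records if r["docstring_tokens"]]

print(len(unimodal), len(bimodal))  # 1 1
```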
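To illustrate point 3, here is a rough sketch of CodeNN-style literal anonymization. This is not the actual buildData.py logic, just a simplified regex-based approximation of replacing string and integer literals with placeholder tokens before tokenization:

```python
import re

def anonymize_literals(code: str) -> str:
    """Replace literals with placeholder tokens (simplified illustration)."""
    code = re.sub(r'"[^"]*"', "CODE_STRING", code)   # double-quoted strings
    code = re.sub(r'\b\d+\b', "CODE_INTEGER", code)  # bare integer literals
    return code

print(anonymize_literals('Console.WriteLine("hi", 42);'))
# Console.WriteLine(CODE_STRING, CODE_INTEGER);
```

The real script handles more cases (escapes, other literal kinds); this only shows the general idea of the replacement step.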

from codebert.

matchlesswei commented on September 23, 2024

Thanks for the quick reply! Now I'm clear about points 1 and 3, but still a bit unsure about the second point.

  1. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.

So for the CodeBERT masked language modeling pretraining task, is the training data size for Python, for example, 251,820 (only the cleaned training set) or 412,178 (the original CodeSearchNet bimodal Python training data)?


guody5 commented on September 23, 2024

Thanks for the quick reply! Now I'm clear about points 1 and 3, but still a bit unsure about the second point.

  1. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.

So for the CodeBERT masked language modeling pretraining task, is the training data size for Python, for example, 251,820 (only the cleaned training set) or 412,178 (the original CodeSearchNet bimodal Python training data)?

The training set of the cleaned data is a subset of the CodeSearchNet training data. In your example, the 412,178 examples are used to pretrain CodeBERT.


matchlesswei commented on September 23, 2024

It's very clear now, thanks! I'll close this issue.

