
Comments (4)

guody5 commented on September 23, 2024

Thanks for opening this repository and providing the dataset to download!
I have the following three questions:

  1. In your paper, you mention that you used unimodal code to pretrain the replaced token detection task. Is it possible to download the unimodal code dataset? So far, the data I found in the CodeSearchNet repo is bimodal.
  2. You also provide the cleaned CodeSearchNet data here. Did you use the original CodeSearchNet data for pretraining and only the cleaned data for the Code Documentation Generation task?
  3. In your paper, you also carried out the task on C# using the CodeNN dataset. How do you tokenize the C# code? In their repository, they replaced parts of the code with tokens like CODE_STRING and CODE_INTEGER. Did you also do such token replacement for C#?

Thanks a lot in advance for answering these questions!

  1. CodeSearchNet provides unimodal data in pickle files (https://github.com/github/CodeSearchNet#data-details). Please find it at the link. For unimodal data, the value of the field "docstring_tokens" should be an empty list.
  2. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.
  3. We use the same pre-processing script as CodeNN to tokenize the C# code (https://github.com/sriniiyer/codenn/blob/0f7fbb8b298a84faf4a14a8f76199e39af685e4a/src/model/buildData.py#L81), including replacing some tokens with CODE_STRING and CODE_INTEGER. We then use the same code as this repo for the Code Documentation Generation task.
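Point 1 can be checked programmatically: a record is unimodal when its "docstring_tokens" field is an empty list. A minimal sketch of that split, using inlined stand-in records rather than the actual pickle files:

```python
# Sketch: splitting CodeSearchNet-style records into unimodal and bimodal
# subsets by testing whether "docstring_tokens" is empty, as described above.
# These records are illustrative stand-ins for entries from the pickle files.

records = [
    {"code": "def add(a, b): return a + b", "docstring_tokens": []},
    {"code": "def sub(a, b): return a - b",
     "docstring_tokens": ["Subtract", "b", "from", "a", "."]},
]

# Unimodal: code only (empty docstring_tokens); bimodal: code paired with docs.
unimodal = [r for r in records if not r["docstring_tokens"]]
bimodal = [r for r in records if r["docstring_tokens"]]

print(len(unimodal), len(bimodal))  # 1 1
```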
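To illustrate point 3, here is a rough sketch of CodeNN-style literal anonymization. This is not the actual buildData.py logic, just a simplified regex-based approximation of replacing string and integer literals with placeholder tokens before tokenization:

```python
import re

def anonymize_literals(code: str) -> str:
    """Replace literals with placeholder tokens (simplified illustration)."""
    code = re.sub(r'"[^"]*"', "CODE_STRING", code)   # double-quoted strings
    code = re.sub(r'\b\d+\b', "CODE_INTEGER", code)  # bare integer literals
    return code

print(anonymize_literals('Console.WriteLine("hi", 42);'))
# Console.WriteLine(CODE_STRING, CODE_INTEGER);
```

The real script handles more cases (escapes, other literal kinds); this only shows the general idea of the replacement step.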

from codebert.

matchlesswei commented on September 23, 2024

Thanks for the quick reply! Now I'm clear about points 1 and 3, but still a bit unsure about the second point.

  1. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.

So for the CodeBERT masked language modeling pretraining task, is the training data size for Python, for example, 251,820 (only the cleaned training set) or 412,178 (the original CodeSearchNet bimodal Python training data)?


guody5 commented on September 23, 2024

Thanks for the quick reply! Now I'm clear about points 1 and 3, but still a bit unsure about the second point.

  1. The cleaned dataset is derived from the original CodeSearchNet data. Only the training set is used to pretrain CodeBERT; the dev and test sets of the cleaned data are unseen during pre-training.

So for the CodeBERT masked language modeling pretraining task, is the training data size for Python, for example, 251,820 (only the cleaned training set) or 412,178 (the original CodeSearchNet bimodal Python training data)?

The training set of the cleaned data is a subset of the CodeSearchNet training data. In your example, the 412,178 examples are used to pretrain CodeBERT.


matchlesswei commented on September 23, 2024

It's very clear now, thanks! I'll close this issue.

