Comments (4)
Thanks for opening this repository and providing the dataset for download!
I have the following three questions:
- In your paper, you mentioned that you used unimodal code to pretrain the replaced token detection task. Is it possible to download the unimodal code dataset? So far, the data I found in the CodeSearchNet repo is bimodal.
- You also provide the cleaned CodeSearchNet data here. Did you use the original CodeSearchNet data for pretraining and only the cleaned one for the Code Documentation Generation task?
- In your paper, you also carried out the task on C# using the CodeNN dataset. How do you tokenize the C# code? In their repository, they replaced parts of the code with tokens like CODE_STRING and CODE_INTEGER. Did you also do such token replacement for C#?
Thanks a lot in advance for answering these questions!
- CodeSearchNet provides unimodal data in pickle files (https://github.com/github/CodeSearchNet#data-details). Please find it at that link. For unimodal data, the value of the field "docstring_tokens" is an empty list (see the first sketch after this list).
- The cleaned data is derived from the original CodeSearchNet data, and only the training set is used to pretrain CodeBERT. The dev and test sets of the cleaned data are unseen during pre-training.
- We use the same pre-processing script as CodeNN to tokenize the C# code (https://github.com/sriniiyer/codenn/blob/0f7fbb8b298a84faf4a14a8f76199e39af685e4a/src/model/buildData.py#L81), including replacing some tokens with CODE_STRING and CODE_INTEGER (illustrated in the second sketch below). We then use the same code as this repo for the Code Documentation Generation task.
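For reference, a minimal sketch of how the unimodal functions could be separated out. It assumes the per-language pickle layout described in the CodeSearchNet data details (the file name and dict keys below come from that documentation; adjust them if your download differs):

```python
import pickle

# Assumed local path: the CodeSearchNet data details describe per-language
# pickle files such as python_dedupe_definitions_v2.pkl containing every
# function, with and without docstrings.
PKL_PATH = "python_dedupe_definitions_v2.pkl"

with open(PKL_PATH, "rb") as f:
    definitions = pickle.load(f)  # a list of dicts, one per function

# Unimodal examples are functions with no paired docstring: their
# "docstring_tokens" field is an empty list.
unimodal = [d for d in definitions if not d.get("docstring_tokens")]
bimodal = [d for d in definitions if d.get("docstring_tokens")]

print(f"total={len(definitions)} unimodal={len(unimodal)} bimodal={len(bimodal)}")
```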
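And a rough illustration of the kind of literal replacement the CodeNN script performs. This is not the actual buildData.py pipeline (which lexes the C# code properly); the regexes below are a simplified stand-in to show the idea:

```python
import re

# Simplified stand-in for CodeNN's literal replacement; the real script
# (buildData.py) uses a proper C# lexer rather than regexes.
STRING_RE = re.compile(r'@?"(?:[^"\\]|\\.)*"')  # double-quoted string literals
CHAR_RE = re.compile(r"'(?:[^'\\]|\\.)'")       # char literals
INT_RE = re.compile(r"\b\d+\b")                 # integer literals

def normalize_csharp(code: str) -> str:
    code = STRING_RE.sub("CODE_STRING", code)
    code = CHAR_RE.sub("CODE_STRING", code)
    code = INT_RE.sub("CODE_INTEGER", code)
    return code

print(normalize_csharp('Console.WriteLine("total: {0}", 42);'))
# -> Console.WriteLine(CODE_STRING, CODE_INTEGER);
```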
Thanks for the quick reply! Now I'm clear about 1. and 3., but I'm still a bit unsure about the second point.
- The cleaned data is derived from the original CodeSearchNet data, and only the training set is used to pretrain CodeBERT. The dev and test sets of the cleaned data are unseen during pre-training.
So for the CodeBERT masked language modeling pretraining task, is the training data size, for example for Python, 251,820 (only the cleaned training set) or 412,178 (the original CodeSearchNet bimodal Python training data)?
The training set of the cleaned data is a subset of the CodeSearchNet training data. In your example, the 412,178 examples are used to pretrain CodeBERT.
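If you want to verify that number yourself, a small sketch along these lines should work, assuming the chunked jsonl.gz layout of the CodeSearchNet download (the glob path is a guess at the usual directory structure; adjust it to your local copy):

```python
import glob
import gzip
import json

# Assumed layout of the CodeSearchNet download; adjust the glob to match
# where the chunked train files live on your machine.
TRAIN_GLOB = "python/final/jsonl/train/python_train_*.jsonl.gz"

count = 0
for path in sorted(glob.glob(TRAIN_GLOB)):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            json.loads(line)  # each record pairs code with a docstring
            count += 1

print(count)  # expected: 412,178 for the Python train split
```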
It's very clear now, thanks! I'll close this issue.
Related Issues (20)
- request for fine-tuned checkpoint of CodeReviewer model HOT 7
- How long does it take to train the code2nl model in the codebert folder? HOT 1
- finetune-msg.sh no step to generate checkpoints? HOT 2
- CodeReviewer: Metadata for downloading github repos HOT 2
- done. HOT 1
- How to match text with code? HOT 1
- The Code Reviewer fine-tuning script freezes on multiprocessor functions on Windows. HOT 2
- Sharing human evaluation results for CodeReviewer (informativeness and relevance) HOT 3
- Missing Appendix in CodeReview Paper HOT 2
- Code completion with >=2 masks
- Sudden model failure during training (training loss and training ppl spike)
- Question about CodeReviewer: Can the order of input diff-lines influence the outcome?
- Questions about additional C/C++ training dataset HOT 1
- Request for Fine-Tuned GraphCodeBert Model for Code Clone Detection
- Questions about LCC dataset license
- Typo in readme of clonedetection
- Best way to finetune CodeReview Task HOT 1
- codesearch fine-tune epoch stuck at 0% or killed after saving cached files HOT 2
- GraphCodeBERT node vs. token level attention
- CodeReviewer fine-tuning time