Comments (11)
Sorry, we don't have this plan at the moment. You can use the released pipeline to finetune CodeBERT yourself. It won't take you too much time.
from codebert.
Hi @JohnGiorgi ,
CodeBERT is pretrained with a masked language modeling objective and a replaced token detection objective. It needs to be fine-tuned to support downstream tasks, whereas you are using CodeBERT directly for semantic text similarity / clustering without fine-tuning.
We have released a new pipeline for the Clone Detection task, which is similar to your task. Please refer to the website.
I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task.
I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!
Thanks a lot.
I just have two more questions:
- Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
- Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.
- Yes, we fine-tuned the model on 2 GPUs. See the last figure here for the training and inference cost.
- The POJ-104 dataset contains C/C++ programs, as mentioned in the figures here.
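For anyone trying to match the effective batch size on a different number of GPUs, here is a minimal sketch of the arithmetic. It assumes the batch size argument is per device and that gradient accumulation is available; some training scripts instead treat the batch size as a total that is split across GPUs, so check which convention your script uses. The function names and the numbers in the usage lines are hypothetical and are not taken from the CodeBERT pipeline.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Number of examples contributing to one optimizer step.

    Assumes per_gpu_batch_size is a per-device value (an assumption,
    not something confirmed for any particular training script).
    """
    if min(per_gpu_batch_size, num_gpus, grad_accum_steps) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_gpu_batch_size * num_gpus * grad_accum_steps


def accumulation_steps_to_match(target_effective: int,
                                per_gpu_batch_size: int,
                                num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = effective_batch_size(per_gpu_batch_size, num_gpus)
    if target_effective % per_step != 0:
        raise ValueError("target is not a multiple of per_gpu_batch_size * num_gpus")
    return target_effective // per_step


if __name__ == "__main__":
    # Hypothetical numbers: a reference run on 2 GPUs with 8 examples per GPU.
    target = effective_batch_size(per_gpu_batch_size=8, num_gpus=2)
    # Reproducing that on a single GPU with the same per-GPU batch size.
    print(accumulation_steps_to_match(target, per_gpu_batch_size=8, num_gpus=1))
```

If the per-GPU batch size does not divide the target evenly, the helper raises an error rather than silently rounding.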
Hi @JohnGiorgi, I am trying to detect whether two code snippets are similar using cosine similarity, much like what you mentioned earlier. I would like to know whether you were able to fine-tune the model, and whether you could share the approach you took.
Thanks a lot
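For reference, the cosine-similarity comparison described above can be sketched roughly as follows. This only shows the pooling and similarity steps in plain Python. The token vectors are made-up placeholders; in practice they would come from an encoder such as CodeBERT or UniXcoder, and mean pooling is just one assumed way to get a single vector per snippet. As noted earlier in the thread, embeddings from a model that has not been fine-tuned may not give meaningful similarity scores.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-size snippet embedding."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder token vectors standing in for encoder outputs (hypothetical values).
    snippet_a_tokens = [[0.2, 0.1, 0.7], [0.4, 0.3, 0.5]]
    snippet_b_tokens = [[0.3, 0.2, 0.6], [0.1, 0.0, 0.9], [0.5, 0.4, 0.3]]
    emb_a = mean_pool(snippet_a_tokens)
    emb_b = mean_pool(snippet_b_tokens)
    print(cosine_similarity(emb_a, emb_b))
```

A score close to 1 means the two embeddings point in nearly the same direction. Any threshold for calling two snippets "similar" would have to be chosen on your own data.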
Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!
@shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. I'm curious to know what you ended up doing. Did you use CodeBERT at all for this purpose, or did you take an alternative approach?
Hi, nashid. I suggest following this README: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl.
@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets, without any natural language.
So natural language such as a docstring is not present in my case.
Would UniXcoder still be effective in my case?
If you read the README carefully, you will see that UniXcoder doesn't need natural language.