Hi, I am interested in using CodeBERT for semantic text similarity /


Using CodeBERT for code based semantic search / clustering about codebert HOT 11 CLOSED

microsoft commented on September 23, 2024 4

Using CodeBERT for code based semantic search / clustering

from codebert.

Comments (11)

fengzhangyin commented on September 23, 2024 2

Sorry, we have no plans for this at the moment. You can use the released pipeline to fine-tune CodeBERT yourself. It won't take you too much time.


fengzhangyin commented on September 23, 2024

Hi @JohnGiorgi,
CodeBERT is pretrained with a masked language modeling objective and a replaced token detection objective. It needs to be fine-tuned to support downstream tasks, whereas you are using it directly for semantic text similarity / clustering without fine-tuning.

We have released a new pipeline for the Clone Detection task, which is similar to your task. Please refer to the website.


JohnGiorgi commented on September 23, 2024

I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task.

I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!


JohnGiorgi commented on September 23, 2024

Thanks a lot.

I just have two more questions:

Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.
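
To make the effective batch size question concrete, here is a minimal sketch of the usual arithmetic. It assumes the batch size flag is applied per GPU; some training scripts instead treat it as a global batch that is split across GPUs, so check how the script you run interprets it. The function name and the numbers are hypothetical and not taken from the CodeBERT documentation.

```python
def effective_batch_size(per_gpu_batch_size, num_gpus, grad_accum_steps=1):
    """Number of examples that contribute to each optimizer step.

    Assumes per_gpu_batch_size is applied independently on every GPU.
    """
    return per_gpu_batch_size * num_gpus * grad_accum_steps


# Hypothetical values for illustration only.
two_gpus = effective_batch_size(per_gpu_batch_size=8, num_gpus=2)
# On a single GPU, gradient accumulation can recover the same effective size.
one_gpu = effective_batch_size(per_gpu_batch_size=8, num_gpus=1, grad_accum_steps=2)
print(two_gpus, one_gpu)
```

Under this assumption, matching a 2-GPU run on one GPU means doubling either the per-GPU batch size or the number of gradient accumulation steps.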


guoday commented on September 23, 2024

Thanks a lot.

I just have two more questions:

Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.

Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.

Yes, we fine-tuned the model on 2 GPUs. See the last figure here for the training and inference cost.
The POJ-104 dataset contains C/C++ programs, as mentioned in the figures here.


shaileshj2803 commented on September 23, 2024

Hi @JohnGiorgi, I am trying to detect whether two code snippets are similar using cosine similarity, much like what you described earlier. I would like to know whether you were able to fine-tune the model and, if so, whether you could share the approach you took.
Thanks a lot.
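
For readers unfamiliar with the comparison described here, the following is a minimal sketch of cosine similarity between two embedding vectors. The vectors are made-up placeholders; in practice they would come from whichever model produces the code embeddings, and that step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Placeholder embeddings, not real model output.
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.4, 0.2, 1.4]   # same direction as emb_a
emb_c = [-0.7, 0.1, 0.2]  # nearly orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Because the score depends only on direction, scaling an embedding does not change it. Any threshold for calling two snippets "similar" would have to be chosen empirically for the model and data in use.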


JohnGiorgi commented on September 23, 2024

Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!


nashid commented on September 23, 2024

@shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. I am curious to know what you ended up doing. Did you use CodeBERT at all for this purpose, or did you take an alternative approach?
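
As an illustration of the search setup described here, this is a small sketch that ranks a corpus of precomputed snippet embeddings against a query embedding. It normalizes each vector to unit length so that the dot product equals the cosine similarity. The snippet names and vectors are hypothetical, and producing the embeddings with a model is outside the sketch.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def rank_by_similarity(query, corpus, top_k=3):
    """Return the top_k (name, score) pairs, highest cosine similarity first."""
    q = normalize(query)
    scored = []
    for name, vec in corpus.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical embeddings keyed by snippet name.
corpus = {
    "bubble_sort": [0.9, 0.1, 0.0],
    "quick_sort": [0.8, 0.3, 0.1],
    "http_get": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = rank_by_similarity(query, corpus, top_k=2)
for name, score in results:
    print(name, round(score, 3))
```

For a large corpus, the normalized vectors would normally be computed once and stored, and an approximate nearest-neighbor index could replace the linear scan.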


guoday commented on September 23, 2024

@shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. I am curious to know what you ended up doing. Did you use CodeBERT at all for this purpose, or did you take an alternative approach?

Hi, nashid. I suggest following this README: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl.


nashid commented on September 23, 2024

@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets, without any natural language.

So natural language such as a docstring is not present in my case.

Would UniXcoder still be effective in my case?


guoday commented on September 23, 2024

If you read the README carefully, you will see that UniXcoder doesn't need natural language.

