
Comments (11)

fengzhangyin commented on September 23, 2024

Sorry, we have no plans for this at the moment. You can use the released pipeline to fine-tune CodeBERT yourself; it shouldn't take much time.

from codebert.

fengzhangyin commented on September 23, 2024

Hi @JohnGiorgi,
CodeBERT is pretrained with a masked language modeling objective and a replaced token detection objective. It needs to be fine-tuned to support downstream tasks, whereas you are using CodeBERT directly for semantic text similarity / clustering without fine-tuning.

We have released a new pipeline for the Clone Detection task, which is similar to your task. Please refer to the website.
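
For readers unfamiliar with the similarity setup under discussion, here is a minimal sketch of how a similarity score is typically computed from encoder outputs: average the per-token vectors into one vector per snippet, then take the cosine of the two vectors. Mean pooling is an assumption chosen for illustration (the thread does not say which pooling was used), and the toy vectors below are invented stand-ins for real model outputs. As noted above, scores from a model that has not been fine-tuned may not be meaningful.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into a single fixed-size snippet vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy token vectors (hypothetical; a real encoder would produce these).
snippet_a_tokens = [[1.0, 0.0], [0.0, 1.0]]  # mean-pools to [0.5, 0.5]
snippet_b_tokens = [[2.0, 2.0]]              # mean-pools to [2.0, 2.0]

score = cosine_similarity(mean_pool(snippet_a_tokens), mean_pool(snippet_b_tokens))
```

The two pooled vectors point in the same direction, so the score is at the maximum of the cosine range; vector length does not affect the result.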

JohnGiorgi commented on September 23, 2024

I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task.

I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!

JohnGiorgi commented on September 23, 2024

Thanks a lot.

I just have two more questions:

  • Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
  • Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.
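
For context on the first question, the effective batch size in data-parallel training is usually the per-GPU batch size multiplied by the number of GPUs and by the number of gradient accumulation steps. The sketch below shows that arithmetic and how to match a target on different hardware. It assumes the batch size is specified per device; some training scripts instead take a total batch size and split it across GPUs, so check which convention the script uses. All numbers are hypothetical and are not taken from the CodeBERT documentation.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_gpu_batch_size * num_gpus * grad_accum_steps


def accum_steps_to_match(target: int, per_gpu_batch_size: int,
                         num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = per_gpu_batch_size * num_gpus
    if target % per_step != 0:
        raise ValueError("target is not a multiple of per_gpu_batch_size * num_gpus")
    return target // per_step


# Hypothetical reference setup: 8 examples per GPU on 2 GPUs.
reference = effective_batch_size(per_gpu_batch_size=8, num_gpus=2)

# Reproducing the same effective batch size on a single GPU.
steps = accum_steps_to_match(target=reference, per_gpu_batch_size=8, num_gpus=1)
```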

guoday commented on September 23, 2024

  1. Yes, we fine-tuned the model on 2 GPUs. See the last figure here for the training and inference cost.
  2. The POJ-104 dataset contains programs written in C and C++, as mentioned in the figures here.

shaileshj2803 commented on September 23, 2024

Hi @JohnGiorgi, I am trying to detect whether two code snippets are similar using cosine similarity, much like what you mentioned earlier. Were you able to fine-tune the model, and could you share the approach you took?
Thanks a lot.

JohnGiorgi commented on September 23, 2024

Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!

nashid commented on September 23, 2024

@shaileshj2803 and @JohnGiorgi, I am trying to do semantic code search based on cosine similarity. I'm curious what you ended up doing. Did you use CodeBERT for this purpose, or did you take an alternative approach?

guoday commented on September 23, 2024

Hi nashid, I suggest following this README: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl

nashid commented on September 23, 2024

@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets, with no natural language.

So natural language such as a docstring is not present in my case.

Would UniXcoder still be effective in my case?

guoday commented on September 23, 2024

If you read the README carefully, you will see that UniXcoder doesn't need natural language.
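
For the code-to-code case raised above, here is a minimal sketch of the search step once every snippet has been encoded to a vector: normalize the vectors, score each candidate against the query by dot product (which equals cosine similarity for unit vectors), and keep the top results. The snippet names and embedding values are invented for illustration. In practice the vectors would come from an encoder such as UniXcoder, whose API is not shown here.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_similarity(query_vec: Sequence[float],
                       corpus: Dict[str, Sequence[float]],
                       top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    query = l2_normalize(query_vec)
    scored = (
        (name, sum(q * c for q, c in zip(query, l2_normalize(vec))))
        for name, vec in corpus.items()
    )
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Hypothetical precomputed embeddings for three code snippets.
corpus = {
    "bubble_sort": [0.9, 0.1, 0.0],
    "quick_sort": [0.8, 0.3, 0.1],
    "http_get": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]  # embedding of the query snippet (also invented)

results = rank_by_similarity(query_embedding, corpus, top_k=2)
```

Normalizing once and storing the unit vectors avoids recomputing norms for every query, which matters once the corpus grows beyond a handful of snippets.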
