Comments (25)
Change config.num_labels=100 in run.py to config.num_labels=4.
from codebert.
Concatenating two functions and putting them into the "code" key is a direct way to solve your problem. However, it's unfair to the second function, since we truncate the input to block_size (i.e. 256 in this case) tokens.
I suggest you add two keys: one is "code1" and the other is "code2". Then you can change this:
def convert_examples_to_features(js, tokenizer, args):
    # source
    code = ' '.join(js['code'].split())
    code_tokens = tokenizer.tokenize(code)[:args.block_size - 2]
    source_tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
to
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim the longer sequence one token at a time until the pair fits.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
def convert_examples_to_features(js, tokenizer, args):
    # source
    code1 = ' '.join(js['code1'].split())
    code2 = ' '.join(js['code2'].split())
    code1_tokens = tokenizer.tokenize(code1)
    code2_tokens = tokenizer.tokenize(code2)
    # Reserve 3 slots for [CLS] and the two [SEP] tokens.
    _truncate_seq_pair(code1_tokens, code2_tokens, args.block_size - 3)
    source_tokens = [tokenizer.cls_token] + code1_tokens + [tokenizer.sep_token] + code2_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
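For a quick sanity check outside the training script, the truncation helper above can be exercised on toy token lists (a standalone sketch; the token strings are made up):

```python
# Standalone copy of the pair-truncation logic above, to check that the
# longer sequence is trimmed first until the pair fits the budget.
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = [f"a{i}" for i in range(10)]
b = [f"b{i}" for i in range(4)]
_truncate_seq_pair(a, b, 8)  # budget of 8 tokens for the whole pair
print(len(a), len(b))  # 4 4
```

Note that only the longer function loses tokens, so a short second function is no longer penalized the way plain concatenation would penalize it.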
Sorry, I have found a bug! Please change:
with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        if pred:
            f.write('1\n')
        else:
            f.write('0\n')
to
with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        f.write(str(pred) + '\n')
It was that! It is now predicting the 4 classes. Thanks @guoday
Thanks for the .zip
Just one question about the input format.
In the clonedetection folder there is just one .jsonl file, and each dictionary has two keys: "func" (which holds a function) and "idx". The train/test/valid.txt files then have lines containing two idxs (one for each function) and the binary label.
In the files you sent, the .jsonl has different keys, "code" and "label", and you did not send any .txt files. However, after inspecting "code" I noticed that it has only one function per line. Does this mean that, if what I want to do is compare two different functions and classify the pair into 4 classes, I should just concatenate them and plug them into the "code" key?
Thanks a lot
Thanks a lot for the prompt reply. I will try this out and let you know the result
Hum... The model is still just predicting '0' and '1' and not outputting labels '2' and '3' despite changing the config.num_labels. Any idea of what might be going on ?
Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model should do multi-class classification. You can print prob.shape in model.py and you will see a (bs, 4) shape in prediction.
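As an illustration of that shape check (a sketch with random stand-in values, not the repo's model.py), a (bs, hidden) feature matrix pushed through a (hidden, 4) linear head and a softmax yields exactly the (bs, 4) probabilities mentioned above:

```python
import numpy as np

bs, hidden, num_labels = 8, 768, 4
rng = np.random.default_rng(0)
features = rng.standard_normal((bs, hidden))          # stand-in for [CLS] embeddings
W = rng.standard_normal((hidden, num_labels)) * 0.02  # stand-in classifier weights
logits = features @ W
# Softmax over the label dimension gives one probability row per example.
prob = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(prob.shape)  # (8, 4)
```

If the printed shape is (bs, 2) rather than (bs, 4), config.num_labels was not applied before the classifier head was built.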
I will try that out. I will also create a very simple dataset:
code1: "a" code2: "a" -> 0
code1: "b" code2: "b" -> 1
code1: "c" code2: "c" -> 2
code1: "d" code2: "d" -> 3
To verify that the problem is not in my dataset (my dataset is unbalanced).
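Writing that toy dataset out as a .jsonl could look like this (a sketch; the "code1"/"code2"/"label" keys follow the modified convert_examples_to_features above, and the file name train.jsonl is illustrative):

```python
import json

# One toy pair per class, mirroring the four lines listed above.
pairs = [("a", "a", 0), ("b", "b", 1), ("c", "c", 2), ("d", "d", 3)]
with open("train.jsonl", "w") as f:
    for code1, code2, label in pairs:
        f.write(json.dumps({"code1": code1, "code2": code2, "label": label}) + "\n")
```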
The same problem remains...
However, a model.bin is indeed being created and saved in saved_models/checkpoint-best-acc, and the shape also matches the 4 labels.
It also seems I am receiving a warning saying that the model is not fine-tuned.
Right now I am training the model for just 1 epoch so I can debug faster. Maybe I need to increase the number of epochs?
My suggestion is:
- Fine-tune the model on your dataset. Then load the checkpoint and print prob. If the probabilities of labels 2 and 3 are normal, your dataset may be unbalanced.
- Or print labels in model.py to see whether there are examples with labels 2 and 3 in your input.
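One quick way to follow the second suggestion is to count how often each label appears in a split (a sketch; demo.jsonl is a throwaway file created just for the demonstration):

```python
import json
from collections import Counter

def label_counts(path):
    # Count labels across all non-empty lines of a JSON Lines file.
    with open(path) as f:
        return Counter(json.loads(line)["label"] for line in f if line.strip())

# Demo on a tiny unbalanced file: labels 2 and 3 never occur.
with open("demo.jsonl", "w") as f:
    for label in [0, 0, 0, 1]:
        f.write(json.dumps({"code1": "x", "code2": "y", "label": label}) + "\n")

counts = label_counts("demo.jsonl")
print(counts)  # labels 2 and 3 are missing here
```

If some classes are missing or rare in the training split, the model has little chance of ever predicting them.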
Thank you! I will try that
Change config.num_labels=100 in run.py to config.num_labels=4
Hi! Thanks for your reply. I have a question: is the data in your *.zip part of code_search_net? Thank you!
CodeBERT-classification.zip
No. This is only an example.
Thanks for your prompt reply. I'm trying to use the CodeBERT model for source code classification tasks. Currently I'm using the POJ-104 dataset. Can you give me some suggestions about other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.
Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE
Thank you!
Hi Guoday,
I want to use CodeBERT for source code classification (malicious vs. non-malicious) and also for multi-class classification.
I have 10 software packages and their source code, indexed by hash values.
Here are my queries:
1. A single source file contains many functions and is very long (a 7-digit token count, e.g. 1867949), but in your dataset the "code" key contains only one function, restricted to 256 tokens for the trained model. What should I do in this case?
Note: We don't know which individual function in the source code is malicious. We only know whether the whole source code is malicious or not.
2. Is CodeBERT suitable for C source code as well?
It would really help me if you could reply to my query.
Thank you.
Pooja K
- The maximum length of CodeBERT is only 512 tokens. It's hard to handle source code with a length in the millions of tokens, even for other neural networks.
- CodeBERT is also suitable for C source code.
Hi Guoday,
Good morning, and thanks a lot for the reply.
Could you please suggest any preprocessing for the source code so I can use CodeBERT on it? I mean, what could an approach to preprocessing my C source code look like?
You don't need to preprocess C source code. Treat it just like any other programming language and use the original C source code as the input.
The reason I am asking is that CodeBERT's length limit is only 512, but my input is longer. In this case, how can I reduce the input dimension or preprocess it so that I can use CodeBERT for classification? Should I split my source code into pieces of length 510, concatenate all the outputs from the last hidden layer, and then feed them to the classifier?
please refer to #16
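The split-and-pool idea described above can be sketched as follows (illustrative only, not the repo's code; window=510 leaves room for the [CLS] and [SEP] tokens within CodeBERT's 512-token limit):

```python
# Split a long token sequence into windows that each fit the model, so
# per-window embeddings can later be pooled (e.g. averaged) for classification.
def chunk_tokens(tokens, window=510):
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

tokens = [f"tok{i}" for i in range(1200)]  # stand-in for a long tokenized file
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [510, 510, 180]
```

Each chunk would then be wrapped with [CLS]/[SEP] and encoded separately; how the per-chunk vectors are combined (mean, max, or a small RNN) is a design choice.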
Thank you, Guoday.
That really cleared up my doubt.
Is it possible to get prediction scores too?
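One possible way (a sketch, not the repo's run.py; the logits values and variable names are illustrative) is to apply a softmax to the logits and write each predicted label together with its probability:

```python
import numpy as np

# Toy logits for two examples over 4 classes.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 3.0, 0.1]])
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
preds = probs.argmax(axis=-1)

with open("predictions.txt", "w") as f:
    for pred, row in zip(preds, probs):
        f.write(f"{pred}\t{row[pred]:.4f}\n")  # predicted label and its confidence
```

In the actual run.py this would mean keeping the probability array around instead of only the thresholded/argmaxed predictions when writing predictions.txt.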