Comments (25)

guoday avatar guoday commented on September 23, 2024 3

CodeBERT-classification.zip

Change config.num_labels=100 in run.py to config.num_labels=4

from codebert.

guoday avatar guoday commented on September 23, 2024 2

Concatenating the two functions and putting them into the "code" key is a direct way to solve your problem. However, it is unfair to the second function, since we truncate the input to block_size (i.e. 256 in this case) tokens, so most of the truncation falls on the second function.

I suggest adding two keys instead: "code1" and "code2". Then you can change this:

def convert_examples_to_features(js,tokenizer,args):
    #source
    code=' '.join(js['code'].split())
    code_tokens=tokenizer.tokenize(code)[:args.block_size-2]
    source_tokens =[tokenizer.cls_token]+code_tokens+[tokenizer.sep_token]
    source_ids =  tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids+=[tokenizer.pad_token_id]*padding_length
    return InputFeatures(source_tokens,source_ids,js['label'])

to

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Truncate the pair in place, one token at a time from the longer
    # sequence, until the combined length fits within max_length.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

def convert_examples_to_features(js, tokenizer, args):
    # Normalize whitespace and tokenize each function separately.
    code1 = ' '.join(js['code1'].split())
    code2 = ' '.join(js['code2'].split())
    code1_tokens = tokenizer.tokenize(code1)
    code2_tokens = tokenizer.tokenize(code2)
    # Reserve 3 positions for the special tokens: cls, sep, sep.
    _truncate_seq_pair(code1_tokens, code2_tokens, args.block_size - 3)
    source_tokens = [tokenizer.cls_token] + code1_tokens + [tokenizer.sep_token] + code2_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    # Pad the ids up to block_size.
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
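
To see what the pair truncation does, here is a minimal sketch on toy data. The token lists, the block_size of 11, and the "<s>" / "</s>" strings are stand-ins chosen for illustration, not values from run.py or from the real tokenizer.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Pop tokens from the longer list until the pair fits in max_length."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # On a tie the second list is shortened, as in the function above.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

# Toy stand-ins for tokenizer output (hypothetical values).
code1_tokens = ["a%d" % i for i in range(10)]
code2_tokens = ["b%d" % i for i in range(4)]
block_size = 11  # toy value; the thread uses 256

# Reserve 3 positions for the special tokens.
truncate_seq_pair(code1_tokens, code2_tokens, block_size - 3)
source_tokens = ["<s>"] + code1_tokens + ["</s>"] + code2_tokens + ["</s>"]

print(len(code1_tokens), len(code2_tokens), len(source_tokens))  # 4 4 11
```

Both functions keep a share of the budget, whereas plain concatenation followed by truncation would have cut only the second one.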

guoday avatar guoday commented on September 23, 2024 1

Sorry, I have found the bug - -!
Please change

with open(os.path.join(args.output_dir,"predictions.txt"),'w') as f:
    for example,pred in zip(eval_dataset.examples,preds):
        if pred:
            f.write('1\n')
        else:
            f.write('0\n')    

to

with open(os.path.join(args.output_dir,"predictions.txt"),'w') as f:
    for example,pred in zip(eval_dataset.examples,preds):
        f.write(str(pred)+'\n')
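
A small sketch of why the old loop hid classes 2 and 3: any non-zero prediction is truthy, so it was written as '1'. The preds list here is made up for illustration.

```python
# Hypothetical predicted class ids for five examples.
preds = [0, 1, 2, 3, 2]

# Old behaviour: the truthiness test collapses every non-zero class to '1'.
old_lines = ["1" if pred else "0" for pred in preds]

# Fixed behaviour: write the class id itself.
new_lines = [str(pred) for pred in preds]

print(old_lines)  # ['0', '1', '1', '1', '1']
print(new_lines)  # ['0', '1', '2', '3', '2']
```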

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024 1

It was that! It is now predicting the 4 classes. Thanks @guoday

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

Thanks for the .zip
Just one question about the input format.

In the clonedetection folder there is just one .jsonl, and each entry has two keys: "func" (which holds a function) and "idx". The train/test/valid.txt files then have lines containing two idxs (one for each function) and the binary label.

In the files you sent, the .jsonl has different keys, "code" and "label", and you did not send any .txt files. After inspecting "code", I noticed that it has only one function per line. Does this mean that, if I want to compare two different functions and classify the pair into one of 4 classes, I should just concatenate them and put them into the "code" key?

Thanks a lot

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

Thanks a lot for the prompt reply. I will try this out and let you know the result

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

Hmm... The model is still predicting only '0' and '1' and never outputs labels '2' or '3', despite changing config.num_labels. Any idea what might be going on?

guoday avatar guoday commented on September 23, 2024

Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model should do multi-class classification. You can print prob.shape in model.py, and you should see a (bs, 4) shape for the predictions.
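
As a rough illustration of what a (bs, 4) output looks like, the sketch below turns made-up logits into probabilities and predicted labels in pure Python. It assumes the classifier applies a softmax over the 4 classes, which is not confirmed in this thread; the logits are invented.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a batch of 2 examples and 4 classes.
batch_logits = [[2.0, 0.5, 0.1, -1.0],
                [0.0, 0.2, 3.0, 0.1]]

prob = [softmax(row) for row in batch_logits]
shape = (len(prob), len(prob[0]))               # (bs, num_labels)
preds = [row.index(max(row)) for row in prob]   # argmax per example

print(shape)  # (2, 4)
print(preds)  # [0, 2]
```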

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

I will try that out, I will also create a very simple dataset:

code1: "a" code2: "a" -> 0
code1: "b" code2: "b" -> 1
code1: "c" code2: "c" -> 2
code1: "d" code2: "d" -> 3

To verify that the problem is not in my dataset (my dataset is unbalanced)
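
A toy dataset like this could be written out as follows. This is only a sketch: the "code1" / "code2" / "label" keys follow the suggestion earlier in the thread, while the file name train.jsonl and the temporary directory are assumptions.

```python
import json
import os
import tempfile

# Four trivially separable pairs, one per class.
examples = [
    {"code1": "a", "code2": "a", "label": 0},
    {"code1": "b", "code2": "b", "label": 1},
    {"code1": "c", "code2": "c", "label": 2},
    {"code1": "d", "code2": "d", "label": 3},
]

# Hypothetical output location; one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read the file back to confirm the format round-trips.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), sorted(ex["label"] for ex in loaded))  # 4 [0, 1, 2, 3]
```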

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

The same problem remains...

However, a model.bin is indeed being created and saved in saved_models/checkpoint-best-acc, and the printed shape matches the 4 labels. [screenshot]

I am also getting a warning saying that the model is not fine-tuned. [screenshot]

Right now I am training the model for just 1 epoch so that I can debug faster. Maybe I need to increase the number of epochs?

guoday avatar guoday commented on September 23, 2024

My suggestion is:

  1. Fine-tune the model on your dataset, then load the checkpoint and print prob. If the probabilities for labels 2 and 3 look normal, your dataset may be unbalanced.
  2. Or print labels in model.py to check whether your input actually contains examples with labels 2 and 3.
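
The second check can also be done offline by counting labels before training. A minimal sketch, with an invented label list standing in for the "label" values read from the dataset:

```python
from collections import Counter

# Hypothetical labels collected from a training file.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
num_labels = 4

counts = Counter(labels)
# Classes that never occur cannot be learned, so they will never be predicted.
missing = [c for c in range(num_labels) if counts[c] == 0]

for c in range(num_labels):
    print(c, counts[c])
print("missing classes:", missing)  # missing classes: [3]
```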

PedroEstevesPT avatar PedroEstevesPT commented on September 23, 2024

Thank you! I will try that

QiushiSun avatar QiushiSun commented on September 23, 2024

Hi! Thanks for your reply. I have a question: is the data in your zip part of code_search_net? Thank you!

guoday avatar guoday commented on September 23, 2024

No. This is only an example.

QiushiSun avatar QiushiSun commented on September 23, 2024

Thanks for your prompt reply. I'm trying to use the CodeBERT model for source code classification tasks. Currently, I'm using the POJ-104 dataset. Can you suggest other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.

guoday avatar guoday commented on September 23, 2024

Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE

QiushiSun avatar QiushiSun commented on September 23, 2024

Thank you!

patelpooja363 avatar patelpooja363 commented on September 23, 2024

Hi Guoday,

I want to use CodeBERT for source code classification (malicious vs. non-malicious) and also for multi-class classification.
I have 10 software projects and their source code, identified by hash values.
Here are my questions:
1. A single source file contains many functions and its length is a 7-digit number (e.g. 1867949), but in your dataset the "code" key contains one function only, restricted to 256 tokens for the trained model. What should I do in this case?
Note: We don't know which individual function in the source code is malicious. We only know whether the whole source code is malicious or not.
2. Is CodeBERT also suitable for C source code?

It would really help me if you could reply to my questions.

Thank you.
Pooja K

guoday avatar guoday commented on September 23, 2024

  1. The maximum input length of CodeBERT is only 512 tokens. It's hard to handle source code whose length is a 7-digit number, even for other neural networks.
  2. CodeBERT is also suitable for C source code.

patelpooja363 avatar patelpooja363 commented on September 23, 2024

Hi Guoday,
Good morning.
Thanks a lot for the reply.

Could you please suggest any preprocessing for the source code so that I can use CodeBERT on it? I mean, what would be a good preprocessing approach for my C source code?

guoday avatar guoday commented on September 23, 2024

You don't need to preprocess C source code. Just as with other programming languages, use the original C source code as the input.

patelpooja363 avatar patelpooja363 commented on September 23, 2024

I am asking because CodeBERT's maximum length is only 512, but my input is longer. In this case, how can I reduce the input size or preprocess the code so that I can use CodeBERT for classification? Should I split my source code into chunks of length 510, concatenate the outputs of the last hidden layer for all chunks, and then feed that to the classifier?

guoday avatar guoday commented on September 23, 2024

please refer to #16
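
For readers without access to #16, the chunking idea from the question can be sketched roughly as below. This is not taken from #16: the window size of 510, the "<s>" / "</s>" stand-ins for the special tokens, the per-chunk probabilities, and the choice to average them are all assumptions made for illustration.

```python
def split_into_windows(tokens, window=510):
    """Split a long token list into consecutive windows of at most `window` tokens."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def mean_pool(chunk_probs):
    """Average per-chunk class probabilities into one file-level distribution."""
    n = len(chunk_probs)
    return [sum(col) / n for col in zip(*chunk_probs)]

# Hypothetical token list for one long source file.
tokens = ["tok%d" % i for i in range(1200)]
windows = split_into_windows(tokens)

# Add the two special tokens so that each input is at most 512 long.
inputs = [["<s>"] + w + ["</s>"] for w in windows]
print([len(x) for x in inputs])  # [512, 512, 182]

# Made-up per-chunk probabilities (non-malicious, malicious) from a classifier.
chunk_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
file_probs = mean_pool(chunk_probs)
print([round(p, 3) for p in file_probs])
```

Averaging is only one possible aggregation; taking the maximum of the malicious score across chunks is another option when a single bad function should flag the whole file.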

patelpooja363 avatar patelpooja363 commented on September 23, 2024

Thank you, Guoday.
That really cleared up my doubt.

ap-la avatar ap-la commented on September 23, 2024

Is it possible to get prediction scores too?
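
One way this might be done, assuming the per-class probabilities are available at evaluation time: write the score of the predicted class next to the label. The probabilities and the tab-separated line format below are made up for illustration and are not part of run.py.

```python
# Hypothetical per-class probabilities for two examples.
probs = [[0.7, 0.1, 0.1, 0.1],
         [0.05, 0.05, 0.8, 0.1]]

lines = []
for row in probs:
    pred = row.index(max(row))                     # predicted class id
    lines.append("%d\t%.4f" % (pred, row[pred]))   # label<TAB>score

print(lines)  # ['0\t0.7000', '2\t0.8000']
```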
