Comments (25)
Change config.num_labels=100 in run.py to config.num_labels=4.
from codebert.
Concatenating two functions and putting them into the "code" key is a direct way to solve your problem. However, it's unfair to the second function, since we truncate the input to block_size (i.e. 256 in this case) tokens.
I suggest you add two keys: one is "code1" and the other is "code2". Then you can change this:
def convert_examples_to_features(js, tokenizer, args):
    # source
    code = ' '.join(js['code'].split())
    code_tokens = tokenizer.tokenize(code)[:args.block_size - 2]
    source_tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
to
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim the longer sequence one token at a time until the pair fits.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
def convert_examples_to_features(js, tokenizer, args):
    # source
    code1 = ' '.join(js['code1'].split())
    code2 = ' '.join(js['code2'].split())
    code1_tokens = tokenizer.tokenize(code1)
    code2_tokens = tokenizer.tokenize(code2)
    # Reserve 3 slots for [CLS] and the two [SEP] tokens.
    _truncate_seq_pair(code1_tokens, code2_tokens, args.block_size - 3)
    source_tokens = [tokenizer.cls_token] + code1_tokens + [tokenizer.sep_token] + code2_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
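For a quick sanity check outside the training script, the truncation helper above can be exercised on toy token lists (a standalone sketch; the token strings are made up):

```python
# Standalone copy of the pair-truncation logic above, to check that the
# longer sequence is trimmed first until the pair fits the budget.
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = [f"a{i}" for i in range(10)]
b = [f"b{i}" for i in range(4)]
_truncate_seq_pair(a, b, 8)  # budget of 8 tokens for the whole pair
print(len(a), len(b))  # 4 4
```

Note that only the longer function loses tokens, so a short second function is no longer penalized the way plain concatenation would penalize it.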
Sorry, I have found a bug! Please change:
with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        if pred:
            f.write('1\n')
        else:
            f.write('0\n')
to
with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        f.write(str(pred) + '\n')
It was that! It is now predicting the 4 classes. Thanks @guoday
Thanks for the .zip
Just one question about the input format.
In the clonedetection folder there is just one .jsonl file, and each dictionary has two keys: "func" (which holds a function) and "idx". The train/test/valid.txt files then have lines containing two idxs (one for each function) and the binary label.
In the files you sent, the .jsonl has different keys, "code" and "label", and you did not send any .txt files. However, after inspecting "code" I noticed that it has only one function per line. Does this mean that, if what I want to do is compare two different functions and classify the pair into 4 classes, I should just concatenate them and plug them into the "code" key?
Thanks a lot
Thanks a lot for the prompt reply. I will try this out and let you know the result
Hum... The model is still just predicting '0' and '1' and not outputting labels '2' and '3' despite changing the config.num_labels. Any idea of what might be going on ?
Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model should do multi-class classification. You can print prob.shape in model.py and you will see a (bs, 4) shape in prediction.
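As an illustration of that shape check (a sketch with random stand-in values, not the repo's model.py), a (bs, hidden) feature matrix pushed through a (hidden, 4) linear head and a softmax yields exactly the (bs, 4) probabilities mentioned above:

```python
import numpy as np

bs, hidden, num_labels = 8, 768, 4
rng = np.random.default_rng(0)
features = rng.standard_normal((bs, hidden))          # stand-in for [CLS] embeddings
W = rng.standard_normal((hidden, num_labels)) * 0.02  # stand-in classifier weights
logits = features @ W
# Softmax over the label dimension gives one probability row per example.
prob = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(prob.shape)  # (8, 4)
```

If the printed shape is (bs, 2) rather than (bs, 4), config.num_labels was not applied before the classifier head was built.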
I will try that out. I will also create a very simple dataset:
code1: "a" code2: "a" -> 0
code1: "b" code2: "b" -> 1
code1: "c" code2: "c" -> 2
code1: "d" code2: "d" -> 3
To verify that the problem is not in my dataset (my dataset is unbalanced).
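Writing that toy dataset out as a .jsonl could look like this (a sketch; the "code1"/"code2"/"label" keys follow the modified convert_examples_to_features above, and the file name train.jsonl is illustrative):

```python
import json

# One toy pair per class, mirroring the four lines listed above.
pairs = [("a", "a", 0), ("b", "b", 1), ("c", "c", 2), ("d", "d", 3)]
with open("train.jsonl", "w") as f:
    for code1, code2, label in pairs:
        f.write(json.dumps({"code1": code1, "code2": code2, "label": label}) + "\n")
```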
The same problem remains...
However, a model.bin is indeed being created and saved in saved_models/checkpoint-best-acc, and the shape also matches the 4 labels.
It also seems I am receiving a warning saying that the model is not fine-tuned.
Right now I am training the model for just 1 epoch so I can debug faster. Maybe I need to increase the number of epochs?
My suggestion is:
- Fine-tune the model on your dataset. Then load the checkpoint and print prob. If the probabilities of labels 2 and 3 are normal, your dataset may be unbalanced.
- Or print labels in model.py to see whether there are examples with labels 2 and 3 in your input.
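One quick way to follow the second suggestion is to count how often each label appears in a split (a sketch; demo.jsonl is a throwaway file created just for the demonstration):

```python
import json
from collections import Counter

def label_counts(path):
    # Count labels across all non-empty lines of a JSON Lines file.
    with open(path) as f:
        return Counter(json.loads(line)["label"] for line in f if line.strip())

# Demo on a tiny unbalanced file: labels 2 and 3 never occur.
with open("demo.jsonl", "w") as f:
    for label in [0, 0, 0, 1]:
        f.write(json.dumps({"code1": "x", "code2": "y", "label": label}) + "\n")

counts = label_counts("demo.jsonl")
print(counts)  # labels 2 and 3 are missing here
```

If some classes are missing or rare in the training split, the model has little chance of ever predicting them.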
Thank you! I will try that
Change config.num_labels=100 in run.py to config.num_labels=4
Hi! Thanks for your reply. I have a question: is the data in your *.zip part of code_search_net? Thank you!
CodeBERT-classification.zip
No. This is only an example.
Thanks for your prompt reply. I'm trying to use the CodeBERT model for source code classification tasks. Currently I'm using the POJ-104 dataset. Can you give me some suggestions about other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.
Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE
Thank you!
Hi Guoday,
I want to use CodeBERT for source code classification (malicious vs. non-malicious) and also for multi-class classification.
I have 10 software packages and their source code, indexed by hash values.
Here are my queries:
1. A single source file contains many functions and is very long (a 7-digit token count, e.g. 1867949), but in your dataset the "code" key contains only one function, restricted to 256 tokens for the trained model. What should I do in this case?
Note: We don't know which individual function in the source code is malicious. We only know whether the whole source code is malicious or not.
2. Is CodeBERT suitable for C source code as well?
It would really help me if you could reply to my query.
Thank you.
Pooja K
- The maximum length of CodeBERT is only 512 tokens. It's hard to handle source code with a length in the millions of tokens, even for other neural networks.
- CodeBERT is also suitable for C source code.
Hi Guoday,
Good morning, and thanks a lot for the reply.
Could you please suggest any preprocessing for the source code so I can use CodeBERT on it? I mean, what could an approach to preprocessing my C source code look like?
You don't need to preprocess C source code. Treat it just like any other programming language and use the original C source code as the input.
The reason I am asking is that CodeBERT's length limit is only 512, but my input is longer. In this case, how can I reduce the input dimension or preprocess it so that I can use CodeBERT for classification? Should I split my source code into pieces of length 510, concatenate all the outputs from the last hidden layer, and then feed them to the classifier?
please refer to #16
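The split-and-pool idea described above can be sketched as follows (illustrative only, not the repo's code; window=510 leaves room for the [CLS] and [SEP] tokens within CodeBERT's 512-token limit):

```python
# Split a long token sequence into windows that each fit the model, so
# per-window embeddings can later be pooled (e.g. averaged) for classification.
def chunk_tokens(tokens, window=510):
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

tokens = [f"tok{i}" for i in range(1200)]  # stand-in for a long tokenized file
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [510, 510, 180]
```

Each chunk would then be wrapped with [CLS]/[SEP] and encoded separately; how the per-chunk vectors are combined (mean, max, or a small RNN) is a design choice.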
Thank you, Guoday.
That really cleared up my doubt.
Is it possible to get prediction scores too?
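One possible way (a sketch, not the repo's run.py; the logits values and variable names are illustrative) is to apply a softmax to the logits and write each predicted label together with its probability:

```python
import numpy as np

# Toy logits for two examples over 4 classes.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 3.0, 0.1]])
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
preds = probs.argmax(axis=-1)

with open("predictions.txt", "w") as f:
    for pred, row in zip(preds, probs):
        f.write(f"{pred}\t{row[pred]:.4f}\n")  # predicted label and its confidence
```

In the actual run.py this would mean keeping the probability array around instead of only the thresholded/argmaxed predictions when writing predictions.txt.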