Comments (3)
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> tokens=tokenizer.tokenize("def max(a,b):")
['def', 'Ġmax', '(', 'a', ',', 'b', '):']
>>> tokens=[tokenizer.cls_token]+tokens+[tokenizer.sep_token]
['<s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 9232, 19220, 1640, 102, 6, 428, 3256, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0][0]
tensor([[-0.1740, 0.2737, 0.0452, ..., -0.2411, -0.2950, 0.2668],
[-1.0550, -0.1229, 0.6714, ..., -0.5628, -0.1209, 0.4683],
[-0.9436, 0.3294, -0.0098, ..., -0.3375, -0.5014, 0.6879],
...,
[-0.3381, 0.4317, 0.4450, ..., -0.4600, -0.4070, 0.6626],
[-0.3735, -0.1088, 0.6358, ..., -0.6854, -0.0860, 0.2248],
[-0.1740, 0.2744, 0.0457, ..., -0.2414, -0.2962, 0.2675]],
grad_fn=<SelectBackward>)
from codebert.
That's just the token embeddings of the code, but I want to embed an NL-PL pair.
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
>>> context_embeddings.shape
torch.Size([1, 23, 768])
>>> context_embeddings[0]
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
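The output above is one 768-dimensional vector per token. If a single vector for the whole NL-PL pair is needed, one common choice is to take the vector at the first position (`<s>`). The sketch below wraps the same steps as the session above into two functions. The names `build_pair_tokens` and `embed_pair` are made up for this example and are not part of the CodeBERT repository, and taking the `<s>` vector is an assumption about what "embedding the pair" should mean here; mean pooling over tokens would be another option.

```python
from typing import List, Sequence


def build_pair_tokens(
    nl_tokens: Sequence[str],
    code_tokens: Sequence[str],
    cls_token: str = "<s>",
    sep_token: str = "</s>",
) -> List[str]:
    """Lay out an NL-PL pair as: <s> nl </s> code </s> (same layout as above)."""
    return [cls_token] + list(nl_tokens) + [sep_token] + list(code_tokens) + [sep_token]


def embed_pair(nl: str, code: str, model_name: str = "microsoft/codebert-base"):
    """Return one vector for the pair: the hidden state at the <s> position.

    Hypothetical helper. Imports are local so that build_pair_tokens can be
    used without torch/transformers installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference only, disables dropout

    tokens = build_pair_tokens(
        tokenizer.tokenize(nl),
        tokenizer.tokenize(code),
        tokenizer.cls_token,
        tokenizer.sep_token,
    )
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    with torch.no_grad():  # no gradients needed just to read embeddings
        hidden = model(torch.tensor(token_ids)[None, :])[0]  # [1, seq_len, hidden]

    return hidden[0, 0]  # vector at position 0, i.e. the <s> token
```

Calling `embed_pair("return maximum value", "def max(a,b): if a>b: return a else return b")` should give a 1-D tensor of size 768 for the base model. Without fine-tuning on a pair task, this vector is not guaranteed to be a good similarity feature, so treat it as a starting point.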