
incoder's Introduction

InCoder: A Generative Model for Code Infilling and Synthesis

Daniel Fried*, Armen Aghajanyan*, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis

ICLR 2023

This repository hosts example code showing how to use the model with HuggingFace's transformers library. Code to replicate the evaluation results in our paper (in Fairseq, which we used to train the model) is coming soon!

See our project site, our paper, or the examples for more information.

Models

You can obtain the models from HuggingFace's hub: facebook/incoder-1B (1.3B parameters) and facebook/incoder-6B (6.7B parameters).
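
For example, a minimal loading sketch with transformers (both checkpoints load the same way):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")  # or "facebook/incoder-6B"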

Tokenizer

We use a custom tokenizer, which you can load from either "facebook/incoder-1B" or "facebook/incoder-6B" (they are identical). The model was trained with padding on the left-hand side of inputs, using a <pad> token which has ID 1.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")  # or "facebook/incoder-6B"
tokenizer.pad_token = "<pad>"
tokenizer.padding_side = "left"

When calling tokenizer.decode, it's important to pass clean_up_tokenization_spaces=False to avoid removing spaces after punctuation. For example:

tokenizer.decode(tokenizer.encode("from ."), clean_up_tokenization_spaces=False)

(Note: encoding prepends the <|endoftext|> token, as this marks the start of a document to our model. This token can be removed from the decoded output by passing skip_special_tokens=True to tokenizer.decode.)
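
For instance, assuming the tokenizer set up above (the outputs shown as comments are approximate):

ids = tokenizer.encode("from .")
tokenizer.decode(ids, clean_up_tokenization_spaces=False)
# -> '<|endoftext|>from .'  (special token retained)
tokenizer.decode(ids, clean_up_tokenization_spaces=False, skip_special_tokens=True)
# -> 'from .'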

Requirements

PyTorch, tokenizers, and transformers. Our model requires HuggingFace's tokenizers >= 0.12.1, due to changes in the pretokenizer.

pip install torch
pip install 'tokenizers>=0.12.1'
pip install transformers

Usage

See example_usage.py for a demo script showing how to use the infilling capability of the model. Set BIG_MODEL = True in the script to use the 6.7B parameter model; otherwise the 1.3B parameter model will be used.

For an example of batched generation, see example_batched_usage.py.
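
For quick orientation, here is a minimal single-span infilling sketch, assuming the prompt format used by example_usage.py (left context, sentinel, right context, then the sentinel again to cue the infill); the sampling parameters are arbitrary and example_usage.py is the authoritative version:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

left = 'def count_words(s):\n    """Count the words in a string."""\n'
right = "\n    return counts\n"
# Replace the span to fill with <|mask:0|>, then repeat the sentinel to request the infill.
prompt = left + "<|mask:0|>" + right + "<|mask:0|>"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                         max_length=inputs.input_ids.shape[1] + 64)
text = tokenizer.decode(out[0], clean_up_tokenization_spaces=False)
# The generated span follows the final <|mask:0|> and ends at <|endofmask|>.
infill = text.split("<|mask:0|>")[-1].split("<|endofmask|>")[0]
print(left + infill + right)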

Paper

See our paper for research details on the method, training data, models, and experimental results.

Demo

See a demo of the 6.7B model on HF Spaces. [currently not working, apologies!]

License

CC-BY-NC 4.0

Credits

Thanks to Lucile Saulnier, Leandro von Werra, Nicolas Patry, Suraj Patil, Omar Sanseviero, and others at HuggingFace for help with the model release, and to Naman Goyal and Stephen Roller for the code our demo was based on!

incoder's People

Contributors

dpfried, eric-wallace


incoder's Issues

Generating SQL statements (Discussion-Topic)

Hi all!

I am trying to generate SQL statements with the InCoder model.
The first attempts look very promising!
Does anyone have an idea of how I can prompt the model with my database structure?
Something trivial, like which tables and attributes exist.

Thanks for any food for thought!
Greetings, Ole
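
One common approach, offered purely as a sketch and an assumption rather than anything from this repo, is to describe the schema in SQL comments ahead of the query, optionally with the metadata hint mentioned elsewhere on this page:

prompt = (
    "<| file ext=.sql |>\n"
    "-- Table: users(id INTEGER, name TEXT)\n"
    "-- Table: orders(id INTEGER, user_id INTEGER, total REAL)\n"
    "-- Total order value per user name:\n"
    "SELECT"
)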

Demo unavailable

The demo on HuggingFace spaces currently produces a runtime error

Half precision 6B model?

I'm having trouble loading the 6B model even on an RTX 3090. Is there a possibility you could share the model weights in half-precision format?

Anyway, thank you for your work, the model looks great.
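
For reference, a hedged loading sketch in half precision; the 6B model card on the hub documents a float16 revision, though treat the exact arguments as assumptions:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/incoder-6B",
    revision="float16",          # half-precision weights, if published on the hub
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).cuda()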

How to use incoder for code infilling task

Hello 😃
Great job!
I am reaching out to ask whether the pre-trained InCoder model provided by your team can be used for code infilling tasks.
However, I have been unable to locate any example code or documentation on using the pre-trained model for code infilling, and I'm not sure how to construct a correctly formatted input.
I expect the input to look like:

fn main() {
    let a = 10;
    let b = 20;
    let result: i32;
    asm!(
        <FILL HERE>
    )
}

and the output to look like:

fn main() {
    let a = 10;
    let b = 20;
    let result: i32;
    asm!(
        "add {0}, {1}", // Add the values of a and b
        out(reg) result,
        in(reg) a,
        in(reg) b
    )
}
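
For what it's worth, a sketch of how the <FILL HERE> marker above could be translated into the model's sentinel format (the variable names are illustrative):

left = "fn main() {\n    let a = 10;\n    let b = 20;\n    let result: i32;\n    asm!(\n"
right = "\n    )\n}"
prompt = left + "<|mask:0|>" + right + "<|mask:0|>"
# Generate from `prompt`; the model emits the span for <FILL HERE>, terminated by <|endofmask|>.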

Would the training code be released?

Hi, I am interested in your work and want to train a new model on my own dataset. Will the training code be released soon? Otherwise I will have to implement it myself :(

Also, could you kindly point me to any publicly available code for the same task? Thanks.

Failed to load Half precision 6B model

Weird?

return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Float but found Half
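
This error typically means half-precision weights are being run on CPU, where fp16 kernels such as layer_norm are not implemented. A common fix, assuming a CUDA device is available, is to move both model and inputs to the GPU:

# Move the fp16 model and the tokenized inputs onto CUDA before running inference.
model = model.half().cuda()
inputs = {k: v.cuda() for k, v in tokenizer(prompt, return_tensors="pt").items()}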

How to retrain `incoder` on my own dataset

Your work is impressive! I would like to use this model for my own code completion task. I want to know what the training inputs and targets look like (it would be even better if you could provide examples of the data you originally trained on).
For example, does the training data look like the following?

input_code = '''
import pandas as pd
<insert>
df['res'].value_counts()
'''


target_code = '''
import pandas as pd
df=pd.read_csv("t.csv")
df['res'].value_counts()

'''
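
For reference, Section 2.1 of the paper describes moving each masked span to the end of the document, after its sentinel, and terminating it with an end-of-mask token. An illustrative reconstruction (not the released training code) for the snippet above:

training_sequence = (
    "import pandas as pd\n"
    "<|mask:0|>"                  # sentinel replaces the masked span in place
    "df['res'].value_counts()\n"
    "<|mask:0|>"                  # sentinel repeated, then the span itself
    'df=pd.read_csv("t.csv")\n'
    "<|endofmask|>"               # terminates the infilled span
)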

insertion mode

Hey,

Want to confirm the prompt for insertion mode. Let's say

left context + '<|mask:0|>' + right context

Is this correct?
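
For reference, the infilling prompts built by example_usage.py also append the sentinel after the right context, which cues the model to produce the infill:

prompt = left_context + "<|mask:0|>" + right_context + "<|mask:0|>"
# Generation then continues until <|endofmask|>.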

Running on Human-Eval

Hello, I am trying to reproduce the results of your model on the Human-Eval Dataset and so far I am getting a lower-than-expected performance.
To make everything more clear:

  • I load the large 6B model from hugging face on its fp16 version.
  • I load the Human-Eval dataset and add a BOS = "<|endoftext|>" at the beginning of each code example.
  • I use the text-generation pipeline at p=0.95 and temp=0.8, creating 100 completions.
  • The post-processing code is relatively simple: I just look for typical stop tokens like \ndef or \n\n to truncate the generation and get a clean piece of code to pass to the human-eval evaluation.

Is there a different procedure/way that the Incoder model solves the Human-Eval dataset? Are the results published assuming the full 32bit weights or a different input format?
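
For illustration, a sketch of the kind of stop-sequence truncation described in the post-processing bullet above (the exact stop strings are assumptions):

def truncate_completion(text: str) -> str:
    """Cut generated text at the first stop sequence, if any."""
    stops = ["\ndef ", "\nclass ", "\n\n", '\n"""']
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]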

Tokenizer does not have a padding token.

Hi.

I'm setting up to finetune InCoder via torch with Dynamic Padding as follows:

from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B", use_fast=True, do_lower_case=False)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4)

train_dataset = tokenized_datasets["train"].shuffle(seed=42)
test_dataset = tokenized_datasets["test"].shuffle(seed=42)

data_collator = DataCollatorWithPadding(tokenizer)

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8, collate_fn=data_collator)
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=8, collate_fn=data_collator) 

I keep getting this error: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

I'm hesitant about using this fix, tokenizer.pad_token = tokenizer.eos_token, since in this model the <|endoftext|> token "marks the start of a document to our model" (unless they mean something different from "encoding prepends the <|endoftext|> token").

I plan to go ahead and use 0 as the padding token (tokenizer.add_special_tokens({'pad_token': '[0]'})), as it is the default in most cases, but I would like to know what causes the error, as I suppose it has something to do with the tokenizer architecture.

Thanks!
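
For what it's worth, the README's Tokenizer section above suggests a dedicated pad token rather than reusing <|endoftext|>; InCoder was trained with <pad> (ID 1) and left-side padding:

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B", use_fast=True)
tokenizer.pad_token = "<pad>"
tokenizer.padding_side = "left"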

cross-entropy loss

Hi, I have a question about cross-entropy loss.

In Section 2.1, it says we compute the probability of the sequence auto-regressively and train the model using cross-entropy loss on all tokens except the mask sentinel tokens <Mask:k>.

In the codebase you provided to me two days ago (link), I found that the criterion weights of these sentinel tokens are set to zero: self.criterion_weights[self.sentinel_tokens[i]] = 0.0.

Now I want to implement this in my own code. I have only one sentinel token, MASK, and my code is:

loss = F.cross_entropy(logits, target, ignore_index=MASK)

Is this implementation correct? And why does it ensure that MASK will not be generated at inference time?

Thanks.
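
For comparison, a minimal sketch of the two masking approaches discussed here (MASK_ID, the shapes, and the vocab size are placeholders):

import torch
import torch.nn.functional as F

MASK_ID = 3                            # hypothetical sentinel token id
logits = torch.randn(8, 100)           # (positions, vocab)
target = torch.randint(0, 100, (8,))

# 1) ignore_index skips positions whose *target* is the sentinel.
loss_a = F.cross_entropy(logits, target, ignore_index=MASK_ID)

# 2) A per-class weight of zero, as in the referenced codebase, excludes
#    the same positions from the loss.
weights = torch.ones(100)
weights[MASK_ID] = 0.0
loss_b = F.cross_entropy(logits, target, weight=weights)

Note that neither variant guarantees the sentinel is never generated at inference time; it only removes the training signal toward predicting it, which in practice keeps its probability low.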

Use on HuggingFace API Inference

I'm trying to modify the python code here to work with the HuggingFace Inference API. I'm using a simple example, but I'm not getting the correct response. My prompt for infill is:

print("Hello W<|mask:0|>!")<|mask:0|>

and I'm sending it (trying to follow arguments used in python example from this repo) with:

import json
import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/incoder-1B"
data = dict(inputs='print("Hello W<|mask:0|>!")<|mask:0|>',
            options=dict(use_gpu=False, use_cache=False, wait_for_model=True),
            parameters=dict(return_full_text=False, temperature=0.5, top_p=0.95, do_sample=True))
web_response = requests.request(
    "POST", API_URL, headers=headers, data=json.dumps(data))
response = json.loads(web_response.content.decode("utf-8"))
print(response)

The infills vary, but look something like '\n\n<|/<|/<|/<|/\n\n<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/<|/', !!!!!!!!!!!!, or \n\n\n\n\n\n. The demo page gives reasonable values, like ho (though, confusingly, never orld). Adding a metadata hint (<| file ext=.py |>) produces worse infills, and adding the extra mask <|mask:1|> does not help.

I wonder if any of the maintainers have an idea of what could cause this. I've opened the same issue on the HuggingFace forum, since there may be a problem with the inference API rather than with my usage of the model.

StackOverflow data

Hi,

I have a question about your data collection pipeline.

For the GitHub and GitLab data you only collected training data from repos with permissive licenses; did you apply similar filtering to the StackOverflow data? StackOverflow content is licensed under Creative Commons Attribution-ShareAlike 4.0 by default, a non-permissive copyleft license.

Thanks,
-Ali

Evaluation on Code-to-text CodeXGLUE task: references preprocessing

Hi, I was wondering if you could share the preprocessing script of the reference comments/docstrings in the code-to-text task from CodeXGLUE to remove the extra context.

Also, the reference solution is sometimes long, with many lines, while the candidate solution has only one; do you keep only one line for the references too?

Thanks in advance.

Use InCoder for semantic code search

Did you try using the InCoder model to encode source code into a dense vector (an embedding) for semantic code search?

If so, which layer's outputs work best for that?

Thank you!
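
The paper does not evaluate embeddings, but one common recipe, purely as a sketch, is to mean-pool a hidden layer of the decoder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Mean over token positions of the last layer; which layer works best is an open question.
embedding = outputs.hidden_states[-1].mean(dim=1)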
