Comments (7)
We have evaluated the model on HumanEval using the evaluation harness; BF16 and FP16 give scores close to full precision. You can run the evaluation yourself to check the numbers, using the parameters we specify in the paper (for example, we use top-p sampling instead of greedy decoding, we strip the prompt before generation, and we post-process the output to remove the eos_token and any text after certain stop tokens).
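The prompt stripping and stop-token handling can be sketched as follows. This is a hypothetical helper, not the harness's actual code, and the stop sequences shown are illustrative:

```python
# Illustrative stop sequences; the harness's real list may differ.
STOP_SEQUENCES = ["\nclass ", "\ndef ", "\n#", "\nif __name__", "<|endoftext|>"]

def postprocess(prompt: str, generation: str, stop_sequences=STOP_SEQUENCES) -> str:
    """Strip the echoed prompt, then truncate at the earliest stop sequence."""
    # Remove the prompt if the model returns prompt + completion.
    if generation.startswith(prompt):
        generation = generation[len(prompt):]
    # Cut at the first occurrence of any stop sequence (eos included).
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]
```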
As for the playground, it calls the inference endpoint to generate code, which is equivalent to calling model.generate locally; just make sure you use the same parameters as the playground. It uses random sampling, so it's normal not to get exactly the same result as the playground.
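Locally, that corresponds to something like the sketch below, assuming a standard transformers setup. The sampling values are illustrative assumptions, not the playground's confirmed defaults:

```python
# Illustrative sampling parameters; values are assumptions, not the
# playground's confirmed defaults.
SAMPLING_PARAMS = dict(
    do_sample=True,      # random sampling, so outputs vary run to run
    temperature=0.2,
    top_p=0.95,          # top-p (nucleus) sampling, as mentioned above
    max_new_tokens=256,
)

def generate_locally(prompt: str, checkpoint: str = "bigcode/starcoder") -> str:
    # Lazy import so the parameter sketch above is usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id,
                             **SAMPLING_PARAMS)
    # Decode and strip the echoed prompt from the output.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
```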
from starcoder.
Update: I finally made StarCoder output reasonable code by following https://huggingface.co/spaces/bigcode/bigcode-playground/blob/2009abb380464f89aba1603069e720f031735cce/app.py and replicated a pretty nice pass@1.
The detailed usage is listed here: https://github.com/evalplus/evalplus/blob/694528a1e933ea1d12559f41cebac1a6ad1100dc/codegen/model.py#L494
- Use infilling
- Set repetition_penalty=1
- Set temperature=1e-2 instead of 0
Note that maybe not all of these are necessary to make "greedy" decoding work, but this is the configuration I found to be feasible. Thanks for creating the great bigcode project!
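The near-greedy configuration above would look roughly like this as transformers generate() keyword arguments. The values come from the list above except for the token budget, which is an assumption:

```python
# Near-greedy decoding as described above: sample with a near-zero
# temperature instead of using the temperature=0 greedy path.
NEAR_GREEDY_PARAMS = dict(
    do_sample=True,
    temperature=1e-2,        # near-zero instead of exactly 0
    repetition_penalty=1.0,  # explicitly disable the repetition penalty
    max_new_tokens=512,      # illustrative budget, not from the comment
)
```

These kwargs would be passed straight through, e.g. model.generate(**inputs, **NEAR_GREEDY_PARAMS).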
Hello, I have the same problem. This is my code.
Facing the same issue: I get good results with the HF inference API but not locally.
@loubnabnl Thanks for the reply. After some in-depth debugging, I found that StarCoder tends to work better with a higher temperature and does not seem to suit greedy decoding or very low temperatures such as 0.1. I am curious whether it is expected that StarCoder struggles under a greedy decoding setting for benchmarking, because many other models I have tried actually achieve a better pass@1 with greedy decoding than with random sampling. Thanks!
In addition, did you use autoregressive generation in the evaluation? The playground code seems to use infilling. I had the same experience with SantaCoder, where autoregressive (left-to-right) generation did not seem to work reasonably but the infilling mode worked fine.
It's great that it works. By the way, I ran HumanEval on StarCoder with greedy decoding and the score is pretty high by default (this is left-to-right generation with no infilling; the playground doesn't use infilling by default unless you add the <FILL_HERE> token to your prompt).
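For reference, StarCoder's fill-in-the-middle mode uses the <fim_prefix>, <fim_suffix>, and <fim_middle> special tokens. The sketch below shows how a prompt containing <FILL_HERE> could be converted into that format; the helper name is ours, not from the playground code:

```python
def to_fim_prompt(prompt: str, marker: str = "<FILL_HERE>") -> str:
    """Convert a <FILL_HERE>-style prompt into StarCoder's FIM token format."""
    if marker not in prompt:
        return prompt  # no marker: plain left-to-right generation
    # Split once around the marker; the model generates the middle span.
    prefix, suffix = prompt.split(marker, 1)
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
```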
CLI in evaluation harness:
accelerate launch main.py --model bigcode/starcoder --max_length_generation 512 --tasks humaneval --n_samples 1 --batch_size 1 --temperature 0 --do_sample False --precision bf16 --allow_code_execution --use_auth_token
Result:
{
  "humaneval": {
    "pass@1": 0.3475609756097561
  },
  "config": {
    "model": "bigcode/starcoder",
    "temperature": 0.0,
    "n_samples": 1
  }
}
Related Issues (20)
- Generating Embeddings of Code Tokens using StarCoder HOT 1
- Fine-tuning Starcoder or Octocoder for IDE Integration: Instruction Tuning vs Base Model Training Approach HOT 1
- does this support deepspeed zero train?
- inference problem
- Could somebody guide me how to fine-tune with fill-in-middle task based on StarCoderBase? HOT 1
- HuggingFaceH4/oasst1_en - missing dataset HOT 1
- Empty Generations / Failing Reproducing 40% on HumanEval HOT 3
- How many shots are used for evaluating HumanEval? HOT 1
- Fine tuning With SQLcoder-7b
- torch.cuda.OutOfMemoryError on HuhhingFace NVidia 4xA10G Large HOT 2
- Question about Improving Code Generation with Promting
- Better inference based on starcode2-3b model HOT 1
- FileNotFoundError: [Errno 2] No such file or directory: 'checkpoint-100/model-00001-of-00003.safetensors'
- Is finetune.py incompatible with older GPUs?
- What should be masking id . should it be -100 only . giving device side assert triggered
- v0.10.0 of Peft breaks finetune.py
- RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- Removal request & notice: permissive licensing might often still be unsuitable(!) for training set inclusion HOT 2
- zero3 DPO starcoder OOM
- Can starcoder be used to create a structured file format?