
Comments (7)

loubnabnl commented on August 28, 2024

We evaluated the model on HumanEval using the evaluation harness; bf16 and fp16 give scores close to full precision. You can run the evaluation yourself to check the numbers using the parameters we specify in the paper (for example, we use top-p sampling instead of greedy decoding, we strip the prompt before generation, and we post-process to remove eos_token and any text after certain stop tokens).
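The pre- and post-processing steps mentioned above can be sketched roughly as follows. This is not the harness's actual code: the function names, the stop-sequence list, and the `<|endoftext|>` default are assumptions for illustration, and reading "strip the prompt" as trimming surrounding whitespace is also an assumption.

```python
from typing import Sequence

# Hypothetical stop sequences; the real list is defined by the evaluation harness.
STOP_SEQUENCES = ("\nclass", "\ndef", "\n#", "\nif", "\nprint")


def prepare_prompt(prompt: str) -> str:
    """Trim surrounding whitespace from the prompt before generation (assumed meaning of 'strip')."""
    return prompt.strip()


def postprocess(completion: str,
                eos_token: str = "<|endoftext|>",
                stop_sequences: Sequence[str] = STOP_SEQUENCES) -> str:
    """Remove the EOS token and everything from the earliest stop sequence onward."""
    completion = completion.replace(eos_token, "")
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


prompt = prepare_prompt("def add(a, b):\n    ")
# Stand-in for the text the model produced after the prompt.
raw_completion = "\n    return a + b\n\ndef unrelated():\n    pass<|endoftext|>"
print(repr(prompt))
print(repr(postprocess(raw_completion)))
```

Truncating at the earliest stop sequence keeps only the first function body, which is what the unit tests in HumanEval exercise.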

As for the playground, it calls the inference endpoint to generate code, which is equivalent to calling model.generate; just make sure you use the same parameters as the playground. It uses random sampling, so it is normal not to get exactly the same result as the playground:
image

from starcoder.

ganler commented on August 28, 2024

Update: I finally got StarCoder to output reasonable code by following https://huggingface.co/spaces/bigcode/bigcode-playground/blob/2009abb380464f89aba1603069e720f031735cce/app.py

and reproduced a pretty good pass@1:

image

The detailed usage is listed here: https://github.com/evalplus/evalplus/blob/694528a1e933ea1d12559f41cebac1a6ad1100dc/codegen/model.py#L494

  • Use infilling
  • Set repetition_penalty=1
  • Set temperature=1e-2 instead of 0

Note that not all of these may be necessary to make "greedy" decoding work, but this is the configuration I found to work. Thanks for creating the great bigcode project!
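A minimal sketch of the infilling prompt format, assuming StarCoder's `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` special tokens and the `<FILL_HERE>` marker used by the playground. The helper name `build_prompt` is hypothetical; the linked app.py is the authoritative version. The settings from the list above are shown as a plain dictionary that could be passed to model.generate.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"
FIM_INDICATOR = "<FILL_HERE>"


def build_prompt(text: str) -> str:
    """Wrap text in fill-in-the-middle tokens when it contains the marker."""
    if FIM_INDICATOR not in text:
        return text  # plain left-to-right prompt
    # Split at the first marker: code before it is the prefix, code after it the suffix.
    prefix, suffix = text.split(FIM_INDICATOR, 1)
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Settings from the list above; do_sample=True is an assumption so that
# the temperature actually takes effect.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 1e-2,
    "repetition_penalty": 1.0,
}

print(build_prompt("def add(a, b):\n    <FILL_HERE>\n"))
```

The model then generates the missing middle after the `<fim_middle>` token, conditioned on both the prefix and the suffix.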

rookielyb commented on August 28, 2024

Hello, I have the same problem. This is my code:

WeCom screenshot

I ran the prediction 10 times and did not get a single correct result:

WeCom screenshot

But when I use your API, I get the correct result:

WeCom screenshot
WeCom screenshot

Why is this? I'm having a hard time reproducing your HumanEval results. I hope to hear from you!

jithurjacob commented on August 28, 2024

I'm facing the same issue: I get good results with the HF Inference API but not locally.

ganler commented on August 28, 2024

@loubnabnl Thanks for the reply. After some in-depth debugging, I found that StarCoder tends to work better at a higher temperature and does not seem suited to greedy decoding or a very low temperature such as 0.1. I am curious whether running StarCoder with greedy decoding for benchmarking is unsupported, as the results are not quite reasonable: many other models I tried achieve an even better pass@1 with greedy decoding than with random sampling. Thanks!
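For context on how temperature relates to greedy decoding, here is a small standalone sketch of temperature-scaled softmax. The logits are made-up numbers, not StarCoder outputs; the point is only that as the temperature shrinks, sampling concentrates on the highest-scoring token and approaches greedy decoding.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up values for three candidate tokens
for temperature in (1.0, 0.1, 1e-2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

A temperature of exactly 0 would divide by zero here, which is one reason implementations treat greedy decoding as a separate mode instead of a temperature value.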

ganler commented on August 28, 2024

In addition, did you use autoregressive generation in the evaluation? The playground code seems to use infilling. I had the same experience with SantaCoder, where autoregressive (AR) generation does not seem to work reasonably but the infilling mode works fine.

loubnabnl commented on August 28, 2024

It's great that it works. By the way, I ran HumanEval on StarCoder with greedy decoding and the score is pretty high by default (this is left-to-right generation with no infilling; the playground doesn't use infilling unless you add the <FILL_HERE> token to your prompt).

CLI in evaluation harness:

accelerate launch main.py \
  --model bigcode/starcoder \
  --max_length_generation 512 \
  --tasks humaneval \
  --n_samples 1 \
  --batch_size 1 \
  --temperature 0 \
  --do_sample False \
  --precision bf16 \
  --allow_code_execution \
  --use_auth_token

Result:

{
  "humaneval": {
    "pass@1": 0.3475609756097561
  },
  "config": {
    "model": "bigcode/starcoder",
    "temperature": 0.0,
    "n_samples": 1
  }
}
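To make the reported number easier to interpret, here is a sketch of the commonly used unbiased pass@k estimator. With n_samples=1 it reduces to the fraction of problems solved. HumanEval has 164 problems; the count of 57 solved problems is inferred from the reported score and is not stated in the thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


num_problems = 164  # size of HumanEval
num_solved = 57     # inferred from the reported score, not stated in the thread

# With one sample per problem, each problem contributes either 1.0 or 0.0.
solved_term = num_solved * pass_at_k(n=1, c=1, k=1)
failed_term = (num_problems - num_solved) * pass_at_k(n=1, c=0, k=1)
score = (solved_term + failed_term) / num_problems
print(score)
```

Because greedy decoding is deterministic, a single sample per problem is enough for pass@1 in this setting; sampling-based runs usually draw more samples per problem to reduce the variance of the estimate.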
