
bigcode-evaluation-harness's Introduction

Code Generation LM Evaluation Harness

Features

This is a framework for the evaluation of code generation models. This work is inspired by EleutherAI/lm-evaluation-harness for evaluating language models in general. We welcome contributions to fix issues, enhance features, and add new benchmarks. You can find contribution guides in docs/guide.md and CONTRIBUTING.md, and more documentation in docs/README.md.

Below are the features and tasks of this framework:

  • Features:

    • Any autoregressive model available on the Hugging Face Hub can be used, but we recommend using code generation models trained specifically on code, such as SantaCoder, InCoder and CodeGen.
    • We provide multi-GPU text generation with accelerate, and Dockerfiles for running evaluation inside Docker containers for security and reproducibility.
  • Tasks:

    • 7 code generation Python tasks (with unit tests): HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 for both completion (left-to-right) and insertion (FIM) mode.
    • HumanEvalPack extends HumanEval to 3 scenarios across 6 languages via human translations and was released with OctoPack.
    • MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
    • Recode applied to the HumanEval benchmark. It evaluates the robustness of code-generation models.
    • PAL (Program-aided Language Models) evaluation for grade-school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains that mix text and code.
    • Code to text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP. Documentation translation task from CodeXGLUE.
    • CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
    • Concode for Java code generation (2-shot setting and evaluation with BLEU score).
    • 3 multilingual downstream classification tasks: Java Complexity prediction, Java code equivalence prediction, C code defect prediction.
    • SantaCoder-FIM for evaluating FIM on Python code using Exact Match (see the sketch after this list). Further details are described in SantaCoder. It includes two tasks:
      • StarCoderFIM: which uses the default FIM tokens "<fim_prefix>", "<fim_middle>", "<fim_suffix>", and
      • SantaCoderFIM: which uses SantaCoder FIM tokens "<fim-prefix>", "<fim-middle>", "<fim-suffix>"
    • Mercury for evaluating computational efficiency of Python code generation.
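
To make the FIM format above concrete, here is a minimal sketch (not the harness's own code) of how a prompt could be assembled from the sentinel tokens listed for StarCoderFIM and SantaCoderFIM; the exact construction inside the tasks may differ.

# Minimal FIM prompt sketch using the sentinel tokens listed above.
# Illustrative only; the harness's own prompt construction may differ.
def build_fim_prompt(prefix: str, suffix: str, style: str = "starcoder") -> str:
    if style == "starcoder":
        pre, suf, mid = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
    elif style == "santacoder":
        pre, suf, mid = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"
    else:
        raise ValueError(f"unknown FIM style: {style}")
    # The model is expected to generate the missing middle after the final sentinel.
    return f"{pre}{prefix}{suf}{suffix}{mid}"

print(build_fim_prompt("def add(a, b):\n    return ", "\n"))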

More details about each task can be found in the documentation in docs/README.md.

Setup

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness

Install torch based on your device type, and install the other packages using:

pip install -e .

Running the DS-1000 benchmark requires additional version constraints to be satisfied.

# python version must be 3.7.10
pip install -e ".[ds1000]" # installs all additional dependencies except PyTorch
# torch==1.12.1 required. Download version with relevant GPU support etc., e.g.,
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

# to suppress any tensorflow optimization warnings, 
# precede call to "accelerate launch" with "TF_CPP_MIN_LOG_LEVEL=3"

# on some systems, tensorflow will attempt to allocate all GPU memory
# to its process at import which will raise a CUDA out-of-memory error
# setting "export TF_FORCE_GPU_ALLOW_GROWTH=true" resolves this

Also make sure you have git-lfs installed and are logged in to the Hugging Face Hub:

huggingface-cli login

We use accelerate to generate code/text in parallel when multiple GPUs are present (multi-GPU mode). You can configure it using:

accelerate config

This evaluation harness can also be used in evaluation-only mode with a multi-CPU setting. For large models, we recommend specifying the precision of the model with the --precision flag rather than through accelerate config, so that only one copy of the model is kept in memory. You can also load models in 8-bit with the --load_in_8bit flag or in 4-bit with --load_in_4bit, provided bitsandbytes is installed along with compatible transformers and accelerate versions.
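
For reference, the quantized loading enabled by these flags corresponds roughly to the following direct transformers call; the checkpoint name is only an example, and depending on your transformers version the same options may instead be passed through a BitsAndBytesConfig.

# Rough sketch of what --load_in_8bit maps to when loading a model directly
# with transformers + bitsandbytes (the harness wires this up for you).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/santacoder"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,       # use load_in_4bit=True for 4-bit quantization
    device_map="auto",       # let accelerate place the weights
    trust_remote_code=True,
)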

The evaluation part (solution execution) for MultiPL-E requires extra dependencies for some programming languages; we provide a Dockerfile with all the dependencies, see the Docker section for more details.

Usage

You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for generation, the evaluation only requires CPUs, so it can be beneficial to separate the two steps. By default, both generation and evaluation are performed.

For more details on how to evaluate on the tasks, please refer to the documentation in docs/README.md.

Generation and evaluation

Below is an example to generate and evaluate on a task.

accelerate launch  main.py \
  --model <MODEL_NAME> \
  --tasks <TASK_NAME> \
  --limit <NUMBER_PROBLEMS> \
  --max_length_generation <MAX_LENGTH> \
  --temperature <TEMPERATURE> \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
  --precision <PRECISION> \
  --allow_code_execution \
  --save_generations
  • limit represents the number of problems to solve; if it is not provided, all problems in the benchmark are selected.
  • allow_code_execution enables execution of the generated code: it is off by default; read the displayed warning before passing this flag to enable execution.
  • Some models with custom code on the HF Hub, like SantaCoder, require passing --trust_remote_code; for private models add --use_auth_token.
  • save_generations saves the post-processed generations in a JSON file at save_generations_path (generations.json by default). You can also save references by passing --save_references.
  • max_length_generation is the maximum token length of the generation, including the input tokens. The default is 512, but for some tasks like GSM8K and GSM-Hard, the complete prompt with 8-shot examples (as used in PAL) takes up ~1500 tokens, so the value should be larger; 2048 is the recommended max_length_generation for these tasks.

Some tasks don't require code execution, such as codexglue_code_to_text-<LANGUAGE>, codexglue_code_to_text-python-left, conala and concode, which use BLEU evaluation. In addition, we generate one candidate solution per problem for these tasks, so use n_samples=1 and batch_size=1. (Note that batch_size should always be less than or equal to n_samples.)

  • For APPS tasks, you can use n_samples=1 for strict and average accuracies (from the original APPS paper) and n_samples>1 for pass@k.
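
For context, the pass@k numbers reported by the execution-based tasks typically follow the unbiased estimator from the Codex paper; the sketch below shows that formula in isolation (the harness computes it through its evaluation metric code, not this exact function).

# Unbiased pass@k estimator (Chen et al., 2021): given n samples per problem,
# of which c pass the unit tests, estimate the probability that at least one
# of k randomly drawn samples passes.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for a problem, 61 of them correct.
print(pass_at_k(200, 61, 1))   # ~0.305
print(pass_at_k(200, 61, 10))  # higher, since any of 10 draws may pass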

Generation only

If you want to generate solutions without executing and evaluating the code, pass --generation_only in addition to the instructions above. This will save the solutions in a JSON file at the path given by save_generations_path in the working directory.

This can be useful if you don't want to execute code on the machine you use for generation, for security or efficiency reasons. For instance, you can run the generations on multiple GPUs, then switch to a multi-worker CPU machine or Docker container for the execution.

Evaluation only

If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the load_generations_path argument. You may need to reconfigure accelerate to use multiple CPUs.

Below is an example; be mindful of specifying the arguments appropriate for the task you are evaluating, and note that the model value here only serves to document the experiment. Also add --n_samples to specify the number of samples to evaluate per problem (usually the same value used during generation).

accelerate launch  main.py   --tasks mbpp  --allow_code_execution  --load_generations_path generations.json  --model incoder-temperature-08

Docker containers

For safety, we provide Dockerfiles to run the execution inside a Docker container. To do that, first run the generation on your machine and save the output, for example in generations.json, by adding the --generation_only flag to the command. Then use the Docker image that we provide:

$ docker pull ghcr.io/bigcode-project/evaluation-harness
$ docker tag ghcr.io/bigcode-project/evaluation-harness evaluation-harness

If you want to evaluate on MultiPL-E, we provide a different Dockerfile, since it requires more dependencies; use:

$ docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
$ docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple

Building Docker images

If you modify the evaluation harness, you may want to rebuild the docker images.

Here's how to build a docker image for the evaluation harness:

$ sudo make DOCKERFILE=Dockerfile  all

This creates an image called evaluation-harness and runs a test on it. To skip the test, remove all from the command.

For MultiPL-E:

$ sudo make DOCKERFILE=Dockerfile-multiple all

This creates an image called evaluation-harness-multiple.

Evaluating inside a container

Suppose you generated text with the bigcode/santacoder model and saved it in generations_py.json with:

accelerate launch  main.py \
    --model bigcode/santacoder  \
    --tasks multiple-py  \
    --max_length_generation 650 \
    --temperature 0.8   \
    --do_sample True  \
    --n_samples 200  \
    --batch_size 200  \
    --trust_remote_code \
    --generation_only \
    --save_generations \
    --save_generations_path generations_py.json

To evaluate on generations_py.json (or another file), run the container (here from the evaluation-harness-multiple image), mount the file with -v, specify n_samples, and allow code execution with --allow_code_execution (and add --limit with the number of problems if it was used during generation):

$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
    --model bigcode/santacoder \
    --tasks multiple-py \
    --load_generations_path /app/generations_py.json \
    --allow_code_execution  \
    --temperature 0.8 \
    --n_samples 200

Implementing new tasks

To implement a new task in this evaluation harness, see the guide in docs/guide.md. There are also contribution guidelines in CONTRIBUTING.md.

Documentation

We provide documentation for the existing benchmarks and how to run the evaluation in docs/README.md.

Remarks

  • Currently, we use data-parallel evaluation across multiple GPUs with accelerate; this assumes that the model fits on a single GPU.

Acknowledgements

We thank EleutherAI for their work on the lm-evaluation-harness, from which this repository is inspired.

Cite as

@misc{bigcode-evaluation-harness,
  author       = {Ben Allal, Loubna and
                  Muennighoff, Niklas and
                  Kumar Umapathi, Logesh and
                  Lipkin, Ben and
                  von Werra, Leandro},
  title = {A framework for the evaluation of code generation models},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},
  year = 2022,
}

bigcode-evaluation-harness's People

Contributors

andre15silva, anil-gurbuz, arjunguha, armelrandy, benlipkin, cassanof, changwangss, chiyeunglaw, didier-durand, elfsong, ganler, icsawyer, infinitylogesh, iq179, keytoyze, loubnabnl, lvwerra, manandey, maxmatical, meher-m, mitya52, muennighoff, raymondli0, sedrickkeh, shehrozek-cerebras, siviltaram, terryyz, thomwolf, vikparuchuri


bigcode-evaluation-harness's Issues

Llama 7B fails for Human Eval

Running humaneval with Llama 7B gets 0 for pass@1 and pass@10, but it does achieve the expected values (pass@1 around 0.10) in other repos.

To reproduce, simply run

accelerate launch  main.py \
  --model huggyllama/llama-7b \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution

which returns

"humaneval": {
    "pass@1": 0.0,
    "pass@10": 0.0
  },

I reduced n_samples from 200 to 20 to make this run in about 1 hour on a single A100.

Running the same human_eval with CodeCapybara's repo gets the correct values {'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}

Could be related to huggingface/transformers#22402, although I tried explicitly setting the eos, bos and pad token ids the same as CodeCapybara (see here) and didn't see a change, so it might be something else.

If anyone has successfully run it here, would appreciate some tips!

main() crashes with --allow-code-execution=True

The call to .generate() in utils.py complete_code() seems to be misconfigured, since it produces the stack trace below.

Here I use model='hf-internal-testing/tiny-random-gpt2' (but codeparrot fails in the same way) and allow-code-execution=True.

Traceback (most recent call last):
  File "~/bigcode-evaluation-harness/main.py", line 147, in <module>
    main()
  File "~/bigcode-evaluation-harness/main.py", line 132, in main
    results[task] = evaluator.evaluate(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 193, in evaluate
    generations, references = self.generate_text(task)
  File "~/bigcode-evaluation-harness/lm_eval/evaluator.py", line 70, in generate_text
    generations = parallel_generations(
  File "~/bigcode-evaluation-harness/lm_eval/generation.py", line 140, in parallel_generations
    generations = complete_code(
  File "~/bigcode-evaluation-harness/lm_eval/utils.py", line 177, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1320, in generate
    return self.sample(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/generation_utils.py", line 1938, in sample
    outputs = self(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1048, in forward
    transformer_outputs = self.transformer(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 835, in forward
    position_embeds = self.wpe(position_ids)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/Users/marco/mambaforge/envs/BigCode/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Suggest tasks for the Evaluation Harness

Creating an Evaluation Harness for code LLMs

We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!

Existing list

Please take a look at the existing sheet of evaluation benchmarks here.

Contribute

Please use the following template to suggest new tasks for the Evaluation Harness.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |

Here's the Markdown snippet that you can copy/paste:

|Name|Link|Number of samples|Languages|Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |

HumanEval post-processing

For the HumanEval task, we remove the last block, based on the stop tokens: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/tasks/humaneval.py#L70

If no stop word is found in the generation (for example if, by chance, the generation ends exactly at the function's last return statement, or before), then remove_last_block removes the entire generation and returns an empty string.

It seems to me that we should instead remove anything that comes after the first block, and only if there is a match with one of the stop tokens.

If this issue makes sense, happy to create a PR for that.
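
For illustration, a minimal sketch of the proposed behaviour (truncate at the first stop-token occurrence and otherwise leave the generation untouched) could look like the following; this is not the harness implementation, just a way to make the suggestion concrete.

# Illustrative post-processing: keep everything up to the first stop token,
# and leave the generation unchanged when no stop token appears.
def truncate_at_stop_tokens(generation, stop_tokens):
    cut = len(generation)
    for token in stop_tokens:
        idx = generation.find(token)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

print(truncate_at_stop_tokens("    return x\n\nclass Foo:", ["\nclass", "\ndef", "\nprint"]))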

Error: list index out of range when testing on multi-GPU?

bigcode-evaluation-harness/lm_eval/utils.py:388 in complete_code

    385             if not INFILL_MODE:
    386                 gen_code = gen_code[len(prefix) :]
    387             if postprocess:
 -> 388                 code_gens[sample].append(
    389                     task.postprocess_generation(gen_code, int(sample))

Cannot run eval with local model directory

Hi. Thank you for your hard work!

I am trying to run bigcode/starcoder model on a server. I downloaded the huggingface repo with git and transferred the folder to the server.

To be extra sure I wasn't passing the path wrong, I modified main.py:

    parser.add_argument(
        "--model",
        default="/home/ubuntu/.cache/huggingface/hub/models--bigcode--starcoder/snapshots/8a57e3930912e5d22ddc4d5f46b4b99f169afbe9",
        help="Model to evaluate, provide a repo name in Hugging Face hub or a local path",
    )

Once I run python main.py I get:

Traceback (most recent call last):
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 230, in <module>
    main()
  File "/home/ubuntu/star-coder/bigcode-evaluation-harness/main.py", line 176, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 829, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 536, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'

Any idea on how to fix this issue? Had trouble authenticating my huggingface that is why I am trying to load a local model.

Show metric in outfile

{
  "codexglue_code_to_text-python-left": 0.06565988797511521,
  "config": {
    "model": "bigcode/christmas-models"
  }
}

It would be better to also include the metric name, imo.

Any plan for attaching release tag?

I would like to express my sincere gratitude for the ease of code eval provided by your repository.
I see that tasks and features are being added rapidly thanks to lots of contributors.
Do you have any plan for attaching release tags for version control?

add TransCoder task for code translation

Add this code translation (with unit tests) task: https://github.com/facebookresearch/TransCoder. The C++ -> Python subset was used in PaLM. This requires:

  • adding the evaluation metric to HuggingFace evaluate https://huggingface.co/docs/evaluate/index
  • adding the TransCoder dataset to HuggingFace hub, there already is this dataset but make sure it matches the original dataset in the GitHub repo.
  • adding the benchmark to the evaluation harness in a few-shot setting (similarly to the PaLM approach)

santacoder fp16 causes NaN on humaneval?

Just wondering if we need to use fp32 for evaluation of santacoder?
I tried fp16 evaluation because I fine-tuned santacoder on the stack-dedup Python dataset for 1000 steps with fp16 precision. But when I ran fp16 evaluation on humaneval, it led to the following error (for both --model=bigcode/santacoder and --model=myfp16_finetuned_santacoder):

File "/home/ywen/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2583, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The error went away when I used --precision=fp32, yielding 37.19% pass@100 on humaneval, which is fairly close to the number reported in the paper. This is the command I used to run the fp16 evaluation on humaneval.

accelerate launch main.py \
    --model bigcode/santacoder \
    --max_length_generation 368 \
    --tasks humaneval \
    --temperature 0.4 \
    --n_samples 100 \
    --batch_size 20 \
    --allow_code_execution \
    --trust_remote_code \
    --use_auth_token \
    --generation_only \
    --precision fp16 \
    --save_generations 

requirements.txt doesn't support newer models (KeyError)

(Related issue here: #73)
The requirements.txt file lists transformers==4.25.1, which doesn't support a lot of the newer models such as bigcode/starcoder and huggyllama/llama-7b (it gives errors such as KeyError: 'gpt_bigcode'). This should be simple to fix from the user side (just install a newer version of transformers), but just thought I'd flag it here since it's probably best if the requirements.txt can accommodate these newer models.

improve the prompt examples of one-shot setting in APPS evaluation

Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split, one for each call type (Standard Input and Call-based).

We want to improve these examples:

  • analyse the different question types of APPS and build 2 or 3 examples that cover these types (make sure they aren't in the test set)
  • see how models behave given different examples (you can play with the model demos/Spaces in this org; there's codeparrot, incoder and codegen)
  • the prompt shouldn't end up being too long

Problem launching evaluation

Hi, I am trying to run the evaluation of Santacoder using the script provided, but I am getting the following error which I am not able to find out why:

File "/home/kcdharma/ndec/eval/bin/accelerate", line 8, in <module>
  sys.exit(main())
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
  args.func(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 910, in launch_command
  simple_launcher(args)
File "/home/kcdharma/ndec/eval/lib/python3.10/site-packages/accelerate/commands/launch.py", line 397, in simple_launcher
  process = subprocess.Popen(cmd, env=current_env)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
  self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1762, in _execute_child
  env_list.append(k + b'=' + os.fsencode(v))
File "/usr/lib/python3.10/os.py", line 810, in fsencode
  filename = fspath(filename)  # Does type-checking of `filename`.

TypeError: expected str, bytes or os.PathLike object, not NoneType

Any help is appreciated.

support for batch size > 1 for single problem generations (n_samples=1)

The below works when setting batch_size 1 🧐

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 840.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 949.22it/s]
number of problems for this task is 14918
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 83, in complete_code
    for step, batch in tqdm(enumerate(dataloader)):
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/data_loader.py", line 491, in __iter__
    observed_batch_size = find_batch_size(batch)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/utils/operations.py", line 177, in find_batch_size
    raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
TypeError: Can only find the batch size of tensors but got <class 'NoneType'>.

Probably related:

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --limit 8 --max_length_generation 512 --do_sample False --n_samples 100 --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 782.52it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 901.68it/s]
number of problems for this task is 8
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 87, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/transformers/generation/utils.py", line 1513, in generate
    raise ValueError(
ValueError: num_return_sequences has to be 1, but is 16 when doing greedy search.

[Minor] Conflicting dependencies in requirements.txt

Running pip install -r requirements.txt gives me

ERROR: Cannot install -r requirements.txt (line 1) and huggingface_hub==0.8.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested huggingface_hub==0.8.1
    transformers 4.25.1 depends on huggingface-hub<1.0 and >=0.10.0

Instead,

huggingface_hub>=0.10.0

fixed it for me and hasn't broken anything so far.

Add Reasoning tasks to the evaluation

In recent times, code generation models have been shown to be good at solving natural language and/or math reasoning tasks (1 and 2). So it would be good to evaluate the BigCode models on these tasks.

As discussed in the evaluation meeting, we could explore adding the PAL datasets and/or reasoning tasks from HELM.

PAL Datasets:

Query around n_samples argument

Hi, I am performing code generations using the following command

accelerate launch  main.py --model bigcode/santacoder --tasks humaneval --max_length_generation 256 \
--temperature 0.8 --top_p 0.95 --do_sample True --generation_only --n_samples 100 --batch_size 32 \
--output_generations generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json \
--save_generations --allow_code_execution --trust_remote_code

I am expecting the number of candidate generations per task to be around 100. However, on inspecting the generations/santacoder_temperature_0.8_top_p_0.95_task_humaneval.json file I see that there are 96 generations per task.

Is there something I am missing? Thanks
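
One plausible explanation, assuming generations are produced in whole batches per problem (an assumption about the harness's behaviour, not something confirmed here), is that the requested 100 samples get rounded down to a multiple of the batch size:

# Hypothetical batch-rounding arithmetic (assumption, not confirmed harness behaviour).
n_samples, batch_size = 100, 32
full_batches = n_samples // batch_size   # 3 full batches
print(full_batches * batch_size)         # 96 generations per problem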

failed evaluation on GSM8K

I tried to run your code in a Docker container from ghcr.io/bigcode-project/evaluation-harness.

The exact bash command is

accelerate launch  main.py \
  --model bigcode/starcoder \
  --use_auth_token \
  --max_length_generation 512 \
  --tasks pal-gsm8k-greedy \
  --n_samples 1 \
  --temperature 0 \
  --batch_size 1 \
  --do_sample False \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./output/starcoder_on_gsm8k.json

However, it returns the following:

Evaluating generations...
{
  "pal-gsm8k-greedy": {
    "accuracy": 0.0,
    "num_failed_execution": 1319
  },
  "config": {
    "model": "bigcode/starcoder",
    "revision": null,
    "temperature": 0.0,
    "n_samples": 1
  }
}

where the saved generation contents look like this: [screenshot of the saved generations omitted]

Any solutions?

When I evaluated the dataset APPS, I got the error RuntimeError: stack.size() >= frames.back().function->n_inputs INTERNAL ASSERT FAILED

I've tried many nodes and this error is always reported. According to this link, it seems that the torch version needs to be upgraded, but the highest torch version supported for Python 3.7 is 1.13.1, so it looks like a dead end. How can I avoid this problem, given that others have evaluated APPS successfully? My evaluation command is as follows:
accelerate launch main.py \
  --model bigcode/starcoder \
  --tasks apps-introductory \
  --max_length_generation 2048 \
  --temperature 0.8 \
  --n_samples 1 \
  --batch_size 32 \
  --save_generations \
  --precision bf16 \
  --save_generations_path generations.json \
  --metric_output_path evaluation_results.json \
  --allow_code_execution

MultiPL-E Integration

As part of the integration of the MultiPL-E benchmark, create a Dockerfile/Docker image with all the dependencies required to execute the code generations for the different programming languages.

Design prompts for few-shot evaluation tasks

We do not have natural-language prompts for all tasks in the Evaluation Harness. We would like either to find prompts that have been adopted by other research groups or to design prompts that work well for the task at hand. For example, we would appreciate help with designing prompts for the following tasks (see the sketch after this list):

  • APPS
  • CoNaLA
  • Concode
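
As a starting point for the discussion, a hypothetical 2-shot prompt skeleton is sketched below; the instruction wording and example problems are placeholders, not vetted prompts for any of the tasks above.

# Hypothetical few-shot prompt skeleton; the contents are placeholders only.
FEW_SHOT_TEMPLATE = (
    "{instruction}\n\n"
    "### Example\n"
    "Problem:\n{example_problem}\n"
    "Solution:\n{example_solution}\n\n"
    "### Task\n"
    "Problem:\n{problem}\n"
    "Solution:\n"
)

prompt = FEW_SHOT_TEMPLATE.format(
    instruction="Write a Python function that solves the problem below.",
    example_problem="Return the sum of a list of integers.",
    example_solution="def list_sum(xs):\n    return sum(xs)",
    problem="Return the largest element of a list of integers.",
)
print(prompt)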

Variable max_length_generation

Allow max_length_generation to change from batch to batch to speed up tasks where the length varies a lot.
For tasks scored with exact match, we even know the maximum length of each sample, so it would be nice to limit the max length just for those samples. We need to be careful to make this work with batching.

Getting Zeros for StarCoder on multiple-js

I am running the following :

accelerate launch  main.py   \
  --model bigcode/starcoder   \
  --max_length_generation 512  \
  --tasks multiple-js   \
  --n_samples 120  \
  --batch_size 10  \
  --temperature 0.2  \
  --precision bf16  \
  --allow_code_execution   --use_auth_token

The result is:

{
  "multiple-js": {
    "pass@1": 0.0,
    "pass@10": 0.0,
    "pass@100": 0.0
  },
  "config": {
    "model": "bigcode/starcoderbase",
    "temperature": 0.1,
    "n_samples": 120
  }
}

Are there any other parameters that I might be missing?

Error Running Odex Integration Code

Hi, I am trying to test the PR submitted to support the Odex and Conala tasks. The repository is here. I am able to successfully run the bigcode-evaluation-harness code for inference. However, the same setup throws an error when I run the PR code.

Here is the accelerate config used

$ accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo? [yes/NO]: NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]: 1
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

The command used is

$ accelerate launch main.py --model Salesforce/codegen-350M-mono --tasks odex-en --temperature 0.8 --top_p 0.95 --do_sample True --n_samples 100 --batch_size 10 --save_generations --allow_code_execution

This is the error I am getting

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    from lm_eval.evaluator import Evaluator
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/evaluator.py", line 5, in <module>
    from lm_eval import tasks
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/__init__.py", line 3, in <module>
    from . import apps, codexglue_code_to_text, conala, concode, humaneval, mbpp, codexglue_text_to_text, odex, mconala
  File "/home/rudra/bigcode-evaluation-harness/lm_eval/tasks/codexglue_code_to_text.py", line 56, in <module>
    def compute_codexglue_code_to_text_bleu(gold_and_predicted_items: list[tuple[str, str]]):
TypeError: 'type' object is not subscriptable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2237438) of binary: /home/rudra/.cache/A100/bin/python
Traceback (most recent call last):
  File "/home/rudra/.cache/A100/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rudra/.cache/A100/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-23_02:34:25
  host      : cccxc578.pok.ibm.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2237438)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Consider a refactoring

Before adding more tasks, it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:

  • How can we make it as easy as possible to add new metrics? It's possible that we may want to add a few dozen more datasets, each with some quirks. We can look at other frameworks like the lm-evaluation-harness to see how it's done there and whether it makes sense to build on top of it or just take inspiration. For example, it would be nice if adding a new evaluation required changes in as few places as possible.
  • Going for multilinguality, we might need to run the code execution in different environments. Maybe we should decouple generation and execution by saving the results to disk in between.
  • For the execution part, we will probably need to think about Docker environments to execute code in different frameworks.

These are just a few thoughts, let me know if you think this makes sense @loubnabnl.

Would be better to save generations on the fly

This would be a bigger refactoring, but imo it would be better to save generations as each one finishes, and along with that offer restarting from previously unfinished generations (e.g. if the run is interrupted).

just leaving this here if someone is interested

Add tests to the evaluation harness

Add tests to the existing evaluation benchmarks to make sure they are not broken by new additions. (e.g: ensure fixed generations for a specific model using greedy sampling with a fixed seed)

MBPP eval extremely slow for CodeGen2 and Replit-Code

Hi, I have been trying to evaluate CodeGen2 and Replit-Code models on the mbpp task, but the code runs extremely slow. While the corresponding eval time for other models is around 2 hours, the ETA for these 2 models varies significantly and sometimes goes up to > 90 hrs. Any help to resolve this issue? Thanks!

8-bit models unsupported

Currently, the harness raises an exception when used with 8-bit models:

Traceback (most recent call last):
  File "bigcode-evaluation-harness/main.py", line 233, in <module>
    main()
  File "bigcode-evaluation-harness/main.py", line 216, in main
    results[task] = evaluator.evaluate(task)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 67, in evaluate
    generations, references = self.generate_text(task_name)
  File "bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "bigcode-evaluation-harness/lm_eval/generation.py", line 83, in parallel_generations
    model = model.to(accelerator.device)
  File "/root/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1873, in to
    raise ValueError(
ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

for context, this is the model i've been trying to eval: https://huggingface.co/cassanof/santacoder-lua/tree/main

seems like a check is needed for every .to call... any suggestions?
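
One possible guard, assuming transformers sets an is_loaded_in_8bit attribute on quantized models (worth double-checking against your transformers version), would be to skip the device move shown in the traceback above for such models:

# Sketch of a guard around the device move in generation.py from the traceback;
# `is_loaded_in_8bit` is assumed to be set by transformers when the model was
# loaded with load_in_8bit=True, so the .to() call is skipped in that case.
if not getattr(model, "is_loaded_in_8bit", False):
    model = model.to(accelerator.device)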

How to evaluate the model memory efficiently?

Thanks for the great work and convenient benchmarking tool!

I would like to evaluate the CodeGen-16B model on the humaneval benchmark. At my disposal are A6000 GPUs with 48 GB of memory each. The evaluation script crashes due to CUDA out of memory here (i.e. accelerator.prepare), even with the smallest batch size of 1.

Since this is model evaluation, I would expect most of the memory to be occupied by the model parameters (no optimizer states).
Naively, this model should fit on a single GPU if loaded in half precision, since 2 x 16 = 32 < 48. However, when setting mixed precision to fp16 in accelerate launch, I still face the OOM error.

What measures would you suggest to fit the model onto a single GPU?

Reproducing the performance of HumanEval on starcoder

Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.

But when I used the evaluation tool to evaluate HumanEval performance on starcoder, I obtained the following results.

{
  "humaneval": {
    "pass@1": 0.3011280487804878,
    "pass@10": 0.41708568124396794,
    "pass@100": 0.5175640419344132
  },
  "config": {
    "model": "../ckpt/starcoder",
    "temperature": 0.2,
    "n_samples": 200
  }
}

This is lower than the paper's result of pass@1 = 33.6. Did I miss anything crucial? All parameters are at their default values.
