Comments (5)
I also ran evaluations of llama3-8b-instruct and llama3-70b-instruct, using the default generation settings from the generation_config.json shipped with the Llama 3 checkpoints:
```json
{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
```
Here are my results on GSM8K:
| Model | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot-CoT |
|---|---|---|
| 8b-instruct | 76.42 | 79.6 |
| 70b-instruct | 90.35 | 93.0 |
and BBH:
| Model | BBH 3-shot-CoT Exact-Match | Official BBH 3-shot-CoT (base model) |
|---|---|---|
| 8b-instruct | 63.17 | 61.1 |
| 70b-instruct | 49.35 | 81.3 |
Note that the performance of 70b-instruct on BBH is quite low. I found the cause: the default vLLM backend version mishandles Llama 3's two EOS tokens, and the older Transformers release lacks the patch to the corresponding EOS stopping criteria. I therefore bumped the default vllm version in lm_eval to 0.4.2 and upgraded transformers to 4.40.2, after which I got reasonable results:
GSM8K:
| Model | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot-CoT |
|---|---|---|
| 8b-instruct | 75.44 | 79.6 |
| 70b-instruct | 91.05 | 93.0 |
BBH:
| Model | BBH 3-shot-CoT Exact-Match | Official BBH 3-shot-CoT (base model) |
|---|---|---|
| 8b-instruct | 64.6 | 61.1 |
| 70b-instruct | 83.38 | 81.3 |
from lm-evaluation-harness.
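A small stdlib-only sketch of the kind of version guard that would have caught this, using the minimum versions from the comment above (the function names are mine):

```python
def version_tuple(v: str) -> tuple:
    """Parse an 'X.Y.Z' version string into a comparable tuple of ints.
    Real code should use packaging.version, which also handles
    suffixes like '4.40.0.dev0' that this simple parser would reject."""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Minimums that handled Llama 3's dual EOS tokens, per the results above.
REQUIRED = {"vllm": "0.4.2", "transformers": "4.40.2"}

print(meets_minimum("4.38.0", REQUIRED["transformers"]))  # False: too old
print(meets_minimum("4.40.2", REQUIRED["transformers"]))  # True
```

Comparing tuples of ints rather than raw strings matters: as strings, "0.4.10" would sort before "0.4.2".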
Hi there!
Could you share more about what you ran and what scores you got?
On the Instruct Llama3-8B model, gsm8k_cot should use the same prompt Meta describes in https://github.com/meta-llama/llama3/blob/main/eval_details.md#gsm8k . One difference: you may need to pass --gen_kwargs max_gen_toks=512, since that limit is mentioned in the linked eval_details.md and I believe we default to a maximum of 256 generated tokens.
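For concreteness, a full invocation might look like the following; the model path and batch size are placeholders, and only the gsm8k_cot task and max_gen_toks=512 come from the discussion above:

```shell
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k_cot \
    --gen_kwargs max_gen_toks=512 \
    --batch_size auto
```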
from lm-evaluation-harness.
Hello! These are the results I got from running llama3 on the gsm8k and gsm8k_cot tasks. Thanks for your reply!
from lm-evaluation-harness.
Could you try the suggested change (a higher maximum number of generated tokens)? And this is with the instruct model, right?
from lm-evaluation-harness.
Hello! I tried llama3-8b-instruct with max_gen_toks=512 in gsm8k_cot, with the following results:
The transformers version is 4.38.0. Thanks for your reply!
from lm-evaluation-harness.