Comments (5)
I also ran evaluations of llama3-8b-instruct and llama3-70b-instruct, using the default generation settings from the generation_config.json shipped with the Llama 3 checkpoints:
```json
{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
```
Here are my results on GSM8K:
| Model | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot-CoT |
|---|---|---|
| 8b-instruct | 76.42 | 79.6 |
| 70b-instruct | 90.35 | 93.0 |
and BBH:
| Model | BBH 3-shot-CoT Exact-Match | Official BBH 3-shot-CoT (base model) |
|---|---|---|
| 8b-instruct | 63.17 | 61.1 |
| 70b-instruct | 49.35 | 81.3 |
Note that the performance of 70b-instruct on BBH is quite low. I found the cause: the default vLLM backend version mishandles Llama 3's two EOS tokens, and the older Transformers release lacks the patch to the corresponding EOS stopping criteria. I therefore bumped the default vllm version in lm_eval to 0.4.2 and upgraded transformers to 4.40.2, after which I got reasonable results:
GSM8K:
| Model | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot-CoT |
|---|---|---|
| 8b-instruct | 75.44 | 79.6 |
| 70b-instruct | 91.05 | 93.0 |
BBH:
| Model | BBH 3-shot-CoT Exact-Match | Official BBH 3-shot-CoT (base model) |
|---|---|---|
| 8b-instruct | 64.6 | 61.1 |
| 70b-instruct | 83.38 | 81.3 |
from lm-evaluation-harness.
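A small stdlib-only sketch of the kind of version guard that would have caught this, using the minimum versions from the comment above (the function names are mine):

```python
def version_tuple(v: str) -> tuple:
    """Parse an 'X.Y.Z' version string into a comparable tuple of ints.
    Real code should use packaging.version, which also handles
    suffixes like '4.40.0.dev0' that this simple parser would reject."""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Minimums that handled Llama 3's dual EOS tokens, per the results above.
REQUIRED = {"vllm": "0.4.2", "transformers": "4.40.2"}

print(meets_minimum("4.38.0", REQUIRED["transformers"]))  # False: too old
print(meets_minimum("4.40.2", REQUIRED["transformers"]))  # True
```

Comparing tuples of ints rather than raw strings matters: as strings, "0.4.10" would sort before "0.4.2".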
Hi there!
Could you share more about what you ran and what scores you got?
On the Instruct Llama3-8B model, gsm8k_cot should use the same prompt Meta describes in https://github.com/meta-llama/llama3/blob/main/eval_details.md#gsm8k . One difference: you may need to pass --gen_kwargs max_gen_toks=512, since that limit is mentioned in the linked eval_details.md and I believe we default to a maximum of 256 generated tokens.
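For concreteness, a full invocation might look like the following; the model path and batch size are placeholders, and only the gsm8k_cot task and max_gen_toks=512 come from the discussion above:

```shell
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k_cot \
    --gen_kwargs max_gen_toks=512 \
    --batch_size auto
```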
from lm-evaluation-harness.
Hello! These are the results I got from running llama3 on the gsm8k and gsm8k_cot tasks. Thanks for your reply!
from lm-evaluation-harness.
Could you try the suggested change (a higher maximum number of generated tokens)? And this is with the instruct model, right?
from lm-evaluation-harness.
Hello! I tried llama3-8b-instruct with max_gen_toks=512 in gsm8k_cot, with the following results:
The transformers version is 4.38.0. Thanks for your reply!
from lm-evaluation-harness.