The prompt template on GitHub compares two models instead of scoring

Only support baseline=True and pairwise=True? about arena-hard-auto HOT 1 CLOSED

lm-sys commented on August 13, 2024

Only support baseline=True and pairwise=True?

Comments (1)

CodingWithTim commented on August 13, 2024

Hi! If you haven't already, please check out the evaluation process detailed in the blog post.

Unlike MT Bench, Arena Hard v0.1 uses an enhanced pairwise comparison method to evaluate models. We found that this method works better than single-score judging. However, baseline and pairwise don't have to be set to True.

If you set baseline=False and remove \n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|> from the prompt_template in judge_config, your judge will look at only one answer. This way you can also judge MT Bench easily, once you change the system prompt and the regex pattern to match MT Bench single-score judgment.
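As a rough illustration of the template change described above, here is a minimal Python sketch. Only the Assistant B block is taken from this reply; the rest of the template, the function names, and the `[[score]]` regex are assumptions made for illustration, so check them against your actual judge config and system prompt.

```python
import re

# The block quoted in the reply above, as it would appear once the
# config's escape sequences are decoded into real newlines.
ASSISTANT_B_BLOCK = (
    "\n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n"
    "<|The End of Assistant B's Answer|>"
)

# Hypothetical pairwise template; the real one lives in the judge config.
PAIRWISE_TEMPLATE = (
    "<|User Prompt|>\n{question_1}\n\n"
    "<|The Start of Assistant A's Answer|>\n{answer_1}\n"
    "<|The End of Assistant A's Answer|>"
    + ASSISTANT_B_BLOCK
)


def to_single_answer_template(template: str) -> str:
    """Return the template with the Assistant B block removed."""
    if ASSISTANT_B_BLOCK not in template:
        raise ValueError("Assistant B block not found in template")
    return template.replace(ASSISTANT_B_BLOCK, "")


# Assumed single-score pattern: the judge is told to answer like "Rating: [[7]]".
SCORE_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def extract_score(judgment: str):
    """Return the first [[score]] in the judgment as a float, or None."""
    match = SCORE_PATTERN.search(judgment)
    return float(match.group(1)) if match else None


single_template = to_single_answer_template(PAIRWISE_TEMPLATE)
prompt = single_template.format(question_1="What is 2 + 2?", answer_1="4")
score = extract_score("The answer is correct and concise. Rating: [[7]]")
```

With the B block gone, the template no longer needs an `answer_2` value, so formatting it with only the question and one answer works.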

If you set pairwise=False, each prompt is evaluated only once. Because of positional bias and variance, we recommend evaluating each prompt twice: once with the baseline answer in the first position and once with the baseline answer in the second position. But you can set pairwise=False to save cost when evaluating many checkpoints.
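The two-game scheme can be sketched as follows. This is not the repository's code: the `judge` callable, the function names, and the `"A>B"` verdict label are hypothetical stand-ins for a real judge call.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, answer_a, answer_b) -> verdict label.
Judge = Callable[[str, str, str], str]


def judge_games(
    question: str,
    baseline_answer: str,
    model_answer: str,
    judge: Judge,
    pairwise: bool = True,
) -> List[Tuple[str, str]]:
    """Run one game, or two with swapped positions when pairwise is True."""
    games = [("baseline_first", judge(question, baseline_answer, model_answer))]
    if pairwise:
        # Second game: same prompt, baseline moved to the second position.
        games.append(
            ("baseline_second", judge(question, model_answer, baseline_answer))
        )
    return games


def first_position_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy judge with pure positional bias: it always prefers answer A."""
    return "A>B"


two_games = judge_games("q", "baseline text", "model text", first_position_judge)
one_game = judge_games(
    "q", "baseline text", "model text", first_position_judge, pairwise=False
)
```

With the toy judge, the baseline wins the first game and the model wins the second, so the bias cancels out across the pair. With pairwise=False only the first game runs, which halves the judge calls but leaves the bias uncorrected.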

Hope this is helpful!
