
logikon-ai / cot-eval

7 stars, 2 watchers, 1 fork, 1.45 MB

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Home Page: https://huggingface.co/spaces/logikon/open_cot_leaderboard

License: MIT License

Languages: Jupyter Notebook 91.95%, Python 6.93%, Shell 0.97%, Dockerfile 0.15%
Topics: chain-of-thought, gen-ai, leaderboard, llm, llms-benchmarking, llms-reasoning

cot-eval's People

Contributors

ggbetz


Forkers

josephrp

cot-eval's Issues

Evaluate: NousResearch/Nous-Hermes-Llama2-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=NousResearch/Nous-Hermes-Llama2-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6
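
For orientation, here is a minimal sketch of how a parameter block like the one above would typically map onto vLLM's Python API (assuming the cot-eval pipeline drives vLLM, as the VLLM_SWAP_SPACE setting suggests; the prompt and sampling settings are illustrative placeholders, not the pipeline's actual values):

# Hedged sketch: NEXT_MODEL_* / vLLM parameters mapped onto vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="NousResearch/Nous-Hermes-Llama2-13b",  # NEXT_MODEL_PATH
    revision="main",                              # NEXT_MODEL_REVISION
    dtype="float16",                              # NEXT_MODEL_PRECISION
    max_model_len=2048,                           # MAX_LENGTH
    gpu_memory_utilization=0.8,                   # GPU_MEMORY_UTILIZATION
    swap_space=6,                                 # VLLM_SWAP_SPACE (GiB)
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Let's think step by step: ..."], sampling)
print(outputs[0].outputs[0].text)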

Evaluate: 01-ai/Yi-1.5-34B-Chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-1.5-34B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: CohereForAI/c4ai-command-r-v01

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=CohereForAI/c4ai-command-r-v01
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

Note: Will require 2× A100-80GB GPUs.
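
The parameter block above doesn't expose a tensor-parallel setting; assuming the pipeline runs on vLLM and such a setting is available, splitting the model across the two GPUs would roughly correspond to the sketch below (tensor_parallel_size=2 is the assumption matching the note above):

# Sketch only: shard the model across 2 GPUs via vLLM tensor parallelism.
from vllm import LLM

llm = LLM(
    model="CohereForAI/c4ai-command-r-v01",
    dtype="float16",
    max_model_len=2048,
    gpu_memory_utilization=0.7,
    swap_space=8,
    tensor_parallel_size=2,  # one shard per A100-80GB
)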

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: tiiuae/falcon-11B

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=tiiuae/falcon-11B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: core42/jais-XX

For XX in [13b, 13b-chat, 30b-v3, 30b-chat-v3]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=core42/jais-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8
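
Since this issue covers four jais variants, the {XX} placeholder has to be expanded into one parameter set per model. A minimal sketch of that expansion (the .env-style output files are an assumption, not a description of what cot-eval actually consumes):

# Expand the {XX} placeholder into one parameter set per jais variant.
# Writing .env-style files is illustrative; adapt to the pipeline's real input format.
variants = ["13b", "13b-chat", "30b-v3", "30b-chat-v3"]

for xx in variants:
    params = {
        "NEXT_MODEL_PATH": f"core42/jais-{xx}",
        "NEXT_MODEL_REVISION": "main",
        "NEXT_MODEL_PRECISION": "bfloat16",
        "MAX_LENGTH": 2048,
        "GPU_MEMORY_UTILIZATION": 0.7,
        "VLLM_SWAP_SPACE": 8,
    }
    with open(f"jais-{xx}.env", "w") as f:
        f.write("\n".join(f"{k}={v}" for k, v in params.items()) + "\n")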

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: Qwen/Qwen1.5-XX-Chat

For {XX} in [0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-XX-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: internlm/internlm2-math-XX

For XX in [20b, 7b]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=internlm/internlm2-math-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openbmb/Eurus-70b-sft

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-70b-sft
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

running evals in parallel

Make sure that running cot-evals in parallel doesn't create conflicts when uploading final results (or at earlier stages of the pipeline).
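
One way to keep parallel runs from colliding is to have each run upload its results to a model- and run-specific path in the results dataset instead of a shared file. A rough sketch with huggingface_hub (the repo id and path layout are assumptions, not the project's actual conventions):

# Sketch: per-run result paths so concurrent uploads don't touch the same file.
from huggingface_hub import HfApi

def upload_results(local_file: str, model_id: str, run_id: str) -> None:
    api = HfApi()
    api.upload_file(
        path_or_fileobj=local_file,
        # unique per model and run, so parallel commits modify different files
        path_in_repo=f"results/{model_id.replace('/', '__')}/{run_id}.json",
        repo_id="logikon/cot-eval-results",  # assumed dataset repo
        repo_type="dataset",
        commit_message=f"Add cot-eval results for {model_id} ({run_id})",
    )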

Evaluate: allenai/OLMo-1B

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/OLMo-1B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-70b-chat-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: allenai/tulu-2-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: mistralai/Mixtral-8x7B-v0.1

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: google/gemma-7b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: google/gemma-7b-it

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b-it
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: 01-ai/Yi-34B-Chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Meta-Llama-3-XXX

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters (for XXX in ["8B", "8B-Instruct", "70B", "70B-Instruct"]):

NEXT_MODEL_PATH=meta-llama/Meta-Llama-3-XXX
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=12

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: jetmoe/jetmoe-8b | -8b-sft | -8b-chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.
  • Wait for jetmoe support in vLLM

Parameters:

NEXT_MODEL_PATH=jetmoe/jetmoe-8b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.5
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-13b-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: upstage/SOLAR-10.7B-Instruct-v1.0

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-Instruct-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: databricks/dbrx-instruct

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for dbrx support in vLLM, update container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=databricks/dbrx-instruct
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=16

Note: Will probably need 6+ A100-80GB GPUs.

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: nvidia/nemotron-3-8b-XXX

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters (for XXX in ["base-4k", "chat-4k-sft", "chat-4k-rlhf", "chat-4k-steerlm"]):

NEXT_MODEL_PATH=nvidia/nemotron-3-8b-XXX
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=12

ToDos:

  • Wait for vLLM support / ported model
  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Together.ai support

@ggbetz Adding this so we don't forget.

This includes

  • Integrating together.ai with LangChain (you said this was already done)

  • Making it compatible with eval-harness (my thought is to use together.ai's OpenAI-compatible API, which should work with the harness).
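
If the together.ai route is taken, the simplest path is probably their OpenAI-compatible endpoint, which OpenAI-style clients (and, by extension, harness integrations built on them) can target directly. A hedged sketch (model id and prompt are illustrative):

# Sketch: query together.ai through its OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",  # illustrative model id
    messages=[{"role": "user", "content": "Let's think step by step: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)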

Evaluate: upstage/SOLAR-10.7B-v1.0

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: internlm/internlm2-XX

For XX in [7B, 20B, Chat-7B, Chat-20B]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=internlm/internlm2-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: allenai/tulu-2-dpo-7b

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: 01-ai/Yi-34B

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

Evaluate: meta-llama/Llama-2-13b-chat-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: meta-llama/Llama-2-70b-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: Qwen/Qwen1.5-XX

For {XX} in [0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B]:

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

Evaluate: allenai/tulu-2-dpo-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: openbmb/Eurus-7b-kto

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-7b-kto
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: NousResearch/Nous-Hermes-Llama2-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=NousResearch/Nous-Hermes-Llama2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: allenai/tulu-2-dpo-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: allenai/tulu-2-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=64

Evaluate: databricks/dbrx-base

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for dbrx support in vllm, update container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=databricks/dbrx-base
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=16

Note: Will probably need 6+ A100-80GB GPUs.

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: microsoft/Phi-3

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

For XXX in:

  • 128k
  • 4k

Parameters:

NEXT_MODEL_PATH=microsoft/Phi-3-mini-XXX-instruct
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: mistralai/Mixtral-8x7B-Instruct-v0.1

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-Instruct-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: Qwen/Qwen-72B-Chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen-72B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: ai21labs/Jamba-v0.1

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for Jamba support in vLLM or implement HF transformers evaluation (see the sketch below), and update the cot-eval container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=ai21labs/Jamba-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4
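
If vLLM support lags behind, the HF transformers fallback mentioned in the checklist could look roughly like the sketch below; whether trust_remote_code is still required depends on the installed transformers version, and a model of this size will need several GPUs (device_map="auto" shards it across whatever is visible):

# Sketch of a plain HF transformers fallback for Jamba (not the cot-eval pipeline itself).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # matches NEXT_MODEL_PRECISION above
    device_map="auto",           # shard across available GPUs
    trust_remote_code=True,      # may be unnecessary on recent transformers versions
)

inputs = tokenizer("Let's think step by step: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))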

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: Qwen/Qwen1.5-MoE-XX

For XX in [A2.7B-Chat, A2.7B]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-MoE-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: CohereForAI/c4ai-command-r-plus

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=CohereForAI/c4ai-command-r-plus
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openchat/openchat-3.5-0106-gemma

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openchat/openchat-3.5-0106-gemma
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results dataset (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)
