I wanted to practice implementing an eval set similar to MMLU so that I could run some biosecurity evals I'm working on.
The current goal is to support ChatGPT and Claude. I don't have the resources to test other LLMs.
The actual biosecurity evals are not in here for security reasons. You should put your own in `data/test.csv`, and the format should be:

```
"question?","correct choice","incorrect 1","incorrect 2","incorrect 3"
"What is 1+2?","3","4","-3","0"
```
`data/fewshot.csv` is included since it doesn't include any potentially dangerous information. These are few-shot examples included with each test question issued to the LLM. The format is the same as `data/test.csv`, but adds one field which is the example's step-by-step reasoning.
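For example, assuming the reasoning is appended as the last field (the reasoning text below is invented for illustration), a `data/fewshot.csv` row might look like:

```
"What is 1+2?","3","4","-3","0","Adding 1 and 2 gives 3, so the answer is 3."
```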
When sending the evals to the model, the choices are shuffled.
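Here's a minimal sketch of that shuffling step (my own helper, assuming the correct answer text is distinct from every incorrect choice):

```python
import random

def shuffled_choices(correct: str, incorrect: list[str]) -> tuple[list[str], int]:
    """Shuffle all the choices; return them plus the index of the correct one."""
    choices = [correct] + incorrect
    random.shuffle(choices)
    # Relies on the correct answer being unique among the choices.
    return choices, choices.index(correct)
```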
This project uses a Python virtualenv and assumes it's in the `.venv/` directory.
I haven't included any continuous integration scripts because I don't have the resources to run them. Before pushing, you should run, at minimum:

```
.venv/bin/pip install -r requirements.txt
.venv/bin/pytest .
.venv/bin/mypy . --strict
```
To run, do:

```
OPENAI_API_KEY="your-key-here" ANTHROPIC_API_KEY="your-other-key-here" .venv/bin/python eval.py
```

Note that it only queries one of Claude or ChatGPT at a time; you need to set that option in `eval.py`.
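As an illustration only (I'm guessing at the shape of the toggle; the constant, function, and model names below are not the actual `eval.py` code), the selection might look something like:

```python
# Hypothetical sketch of a provider toggle; names and models are illustrative.
import anthropic
from openai import OpenAI

PROVIDER = "anthropic"  # set to "openai" to query ChatGPT instead

def ask(prompt: str) -> str:
    """Send one eval prompt to the selected provider and return its reply text."""
    if PROVIDER == "openai":
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text
```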
Copyright Juno Woods, PhD 2024
MIT license