autoevals's Introduction

Autoevals

Autoevals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)

Autoevals is developed by the team at Braintrust.

Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Installation

Autoevals is distributed as a Python library on PyPI and a Node.js library on npm.

Python

pip install autoevals

Node.js

npm install autoevals

Example

Use Autoevals to model-grade an example LLM completion using the factuality prompt. By default, Autoevals uses your OPENAI_API_KEY environment variable to authenticate with OpenAI's API.
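
If the variable is not already set in your shell, one option is to set it in-process before constructing an evaluator. Below is a minimal Python sketch; the key value is a placeholder, and it assumes Autoevals reads OPENAI_API_KEY at evaluation time.

import os

# Assumption: setting the variable before the first evaluator call is enough,
# since Autoevals reads OPENAI_API_KEY when it contacts the OpenAI API.
os.environ["OPENAI_API_KEY"] = "sk-..."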

Python

from autoevals.llm import *

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

Node.js

import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata.rationale}`);
})();

Using Braintrust with Autoevals

Once you grade an output using Autoevals, it's convenient to use Braintrust to log and compare your evaluation results.

Python

from autoevals.llm import *
import braintrust

# Create a new LLM-based evaluator
evaluator = Factuality()

# Set up an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

# Set up a Braintrust experiment to log our eval to
experiment = braintrust.init(
    project="Autoevals", api_key="YOUR_BRAINTRUST_API_KEY"
)

# Start a span and run our evaluator
with experiment.start_span() as span:
    result = evaluator(output, expected, input=input)

    # The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
    print(f"Factuality score: {result.score}")
    print(f"Factuality metadata: {result.metadata['rationale']}")

    span.log(
        inputs={"query": input},
        output=output,
        expected=expected,
        scores={
            "factuality": result.score,
        },
        metadata={
            "factuality": result.metadata,
        },
    )

print(experiment.summarize())

Node.js

Create a file named example.eval.js (the name must end with .eval.js or .eval.ts):

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Autoevals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});

Then, run

npx braintrust run example.eval.js

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Moderation
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers

RAGAS

  • Context precision
  • Context relevancy
  • Context recall
  • Context entities recall
  • Faithfulness
  • Answer relevance
  • Answer semantic similarity
  • Answer correctness
  • Aspect critique

Composite

  • Semantic list contains
  • JSON validity

Embeddings

  • Embedding similarity
  • BERTScore

Heuristic

  • Levenshtein distance
  • Numeric difference
  • JSON diff
  • Jaccard distance

Statistical

  • BLEU
  • ROUGE
  • METEOR
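
All of these scorers follow the same calling convention as the LLM-based examples above: construct the scorer, then call it with output, expected, and (optionally) input. Below is a minimal sketch using two of the heuristic scorers; it assumes the Python class names Levenshtein and NumericDiff correspond to the Levenshtein distance and numeric difference entries listed above.

Python

from autoevals import Levenshtein, NumericDiff

# String similarity via Levenshtein distance, normalized so that 1.0 means an exact match
string_result = Levenshtein()(output="People's Republic of China", expected="China")
print(f"Levenshtein score: {string_result.score}")

# Numeric closeness, normalized to [0, 1]
number_result = NumericDiff()(output=104.5, expected=100)
print(f"NumericDiff score: {number_result.score}")

Neither of these scorers calls an LLM, so no API key is required.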

Custom Evaluation Prompts

Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:

Python

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_prefix,
    choice_scores=output_scores,
    use_cot=True,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")

Node.js

import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  const evaluator = LLMClassifierFromTemplate({
    name: "TitleQuality",
    promptTemplate,
    choiceScores,
    useCoT: true,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });

  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();

Creating custom scorers

You can also create your own scoring functions that do not use LLMs. For example, to test whether the word 'banana' is in the output, you can use the following:

Python

from autoevals import Score


def banana_scorer(output, expected, input):
    return Score(name="banana_scorer", score=1 if "banana" in output else 0)


input = "What is 1 banana + 2 bananas?"
output = "3"
expected = "3 bananas"

result = banana_scorer(output, expected, input)

print(f"Banana score: {result.score}")

Node.js

import { Score } from "autoevals";

const bananaScorer = ({ output, expected, input }): Score => {
  return { name: "banana_scorer", score: output.includes("banana") ? 1 : 0 };
};

(async () => {
  const input = "What is 1 banana + 2 bananas?";
  const output = "3";
  const expected = "3 bananas";

  const result = await bananaScorer({ output, expected, input });
  console.log(`Banana score: ${result.score}`);
})();

Why does this library exist?

There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:

  • Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in number.py to see how it's done for numeric differences (a sketch of one such normalization follows this list).
  • Parsing the outputs of model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.
  • Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to Autoevals, we couldn't find an open source library where you can simply pass in input, output, and expected values through a bunch of different evaluation methods.
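
To make the first point concrete, here is a minimal sketch of one such normalization for numeric values: a relative-difference formula roughly along the lines of what number.py does (the library's exact handling may differ).

Python

def normalized_numeric_score(output: float, expected: float) -> float:
    # Map the gap between two numbers onto [0, 1]: identical values score 1,
    # and the score decays as the relative difference grows.
    if output == expected:
        return 1.0
    return 1 - abs(expected - output) / (abs(expected) + abs(output))


print(normalized_numeric_score(100, 100))  # 1.0
print(normalized_numeric_score(90, 100))   # ~0.95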

Documentation

The full docs are available in the Braintrust documentation.

autoevals's Issues

[Feat] Can you read openai api key from the .env

The docs on autoevals did not show that openai.api_key needs to be set.

My code snippet currently looks like this; it would be nice if autoevals could read OPENAI_API_KEY from my env and use it for the request:

from autoevals.llm import *
import autoevals
import openai  # needed for the openai.api_key assignment below

# litellm completion call
import litellm

question = "which country has the highest population"
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": question,
        }
    ],
)

# use the autoevals Factuality() evaluator
evaluator = Factuality()
openai.api_key = ""  # set your openai api key for the evaluator
result = evaluator(
    output=response.choices[0]["message"]["content"],
    expected="India",
    input=question,
)

print(result)

General Question about the Evaluator LLM

Hi! I am wondering if it's possible to use open-source or self-deployed LLMs (and not only OpenAI) as the judge or evaluator. If so, could you please point me to an example or the part of the docs that explains this? Thanks!

Factuality Evaluator failing

Tried this code snippet:

from autoevals.llm import *
import openai

openai.api_key = "sk-"

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)
print(result)

# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

I see this error:
Score(name='Factuality', score=0, metadata={}, error=KeyError('usage'))
Factuality score: 0

Traceback (most recent call last):
  File "/Users/ishaanjaffer/Github/litellm/litellm/tests/test_autoeval.py", line 19, in <module>
    print(f"Factuality metadata: {result.metadata['rationale']}")
KeyError: 'rationale'

Any suggestions on how I can debug this?
