
Variant Comparison

Overview

evalite.each() lets you compare multiple task variants (models, prompts, configs) within a single eval. Use it to:

  • Compare different models on the same dataset
  • A/B test prompt strategies
  • Test different config parameters (temperature, system prompts, etc.)

Basic Usage

import { evalite } from "evalite";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";
import { Factuality, Levenshtein } from "autoevals";

evalite.each([
  { name: "GPT-4o mini", input: { model: openai("gpt-4o-mini"), temp: 0.7 } },
  { name: "GPT-4o", input: { model: openai("gpt-4o"), temp: 0.7 } },
  {
    name: "Claude Sonnet",
    input: { model: anthropic("claude-3-5-sonnet-latest"), temp: 1.0 },
  },
])("Compare models", {
  data: async () => [
    { input: "What's the capital of France?", expected: "Paris" },
    { input: "What's the capital of Germany?", expected: "Berlin" },
  ],
  task: async (input, variant) => {
    // Each variant supplies its own model instance and temperature.
    const result = await generateText({
      model: variant.model,
      temperature: variant.temp,
      prompt: input,
    });
    return result.text;
  },
  scorers: [Factuality, Levenshtein],
});
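
The same pattern covers the config-parameter case from the overview. The sketch below is a minimal, illustrative example (the variant names, temperature values, and single data row are assumptions, not part of the Evalite API) that keeps the model and prompt fixed and varies only the sampling temperature:

import { evalite } from "evalite";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { Factuality } from "autoevals";

evalite.each([
  { name: "temp 0", input: { temp: 0 } },
  { name: "temp 0.7", input: { temp: 0.7 } },
  { name: "temp 1.2", input: { temp: 1.2 } },
])("Temperature sweep", {
  data: async () => [
    { input: "What's the capital of France?", expected: "Paris" },
  ],
  task: async (input, variant) => {
    // Only the temperature changes between variants, so score differences
    // can be attributed to that one setting.
    const result = await generateText({
      model: openai("gpt-4o-mini"),
      temperature: variant.temp,
      prompt: input,
    });
    return result.text;
  },
  scorers: [Factuality],
});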

Example: Prompt Comparison

import { evalite } from "evalite";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { Levenshtein } from "autoevals";

evalite.each([
  {
    name: "Direct",
    input: {
      system: "Answer concisely.",
    },
  },
  {
    name: "Chain of Thought",
    input: {
      system: "Think step by step, then answer.",
    },
  },
  {
    name: "Few-Shot",
    input: {
      system: `Examples:
Q: What's 2+2? A: 4
Q: What's 5+3? A: 8
Now answer the question.`,
    },
  },
])("Prompt Strategies", {
  data: async () => [
    { input: "What's 12 * 15?", expected: "180" },
    // ...
  ],
  task: async (input, variant) => {
    // The question is the same for every variant; only the system prompt changes.
    const result = await generateText({
      model: openai("gpt-4o-mini"),
      system: variant.system,
      prompt: input,
    });
    return result.text;
  },
  scorers: [Levenshtein],
});
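
Because the variants are plain objects, you can also define the list once and pass it to evalite.each by reference. The sketch below assumes a hypothetical PromptVariant type (it is not part of Evalite) that mirrors the variant shape used above:

import { evalite } from "evalite";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { Levenshtein } from "autoevals";

// Hypothetical helper type mirroring the { name, input } shape used above.
type PromptVariant = { name: string; input: { system: string } };

const promptVariants: PromptVariant[] = [
  { name: "Direct", input: { system: "Answer concisely." } },
  { name: "Chain of Thought", input: { system: "Think step by step, then answer." } },
];

evalite.each(promptVariants)("Prompt Strategies (shared variants)", {
  data: async () => [{ input: "What's 12 * 15?", expected: "180" }],
  task: async (input, variant) => {
    // As in the examples above, the task receives the variant's `input` object.
    const result = await generateText({
      model: openai("gpt-4o-mini"),
      system: variant.system,
      prompt: input,
    });
    return result.text;
  },
  scorers: [Levenshtein],
});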