
CI/CD

Evalite integrates seamlessly into CI/CD pipelines, allowing you to validate LLM-powered features as part of your automated testing workflow.

Static UI Export

Export eval results as a static HTML bundle for viewing in CI artifacts without running a live server.

Basic Usage

evalite export

Exports the latest full run to the ./evalite-export directory.

Options

Custom output directory:

evalite export --output=./my-export

Export a specific run:

evalite export --run-id=123

Export Structure

The generated bundle contains:

  • index.html - Standalone UI (works without a server)
  • data/*.json - Pre-computed API responses
  • files/* - Images, audio, etc. from eval results
  • assets/* - UI JavaScript/CSS

Viewing Exports

Local preview:

npx serve -s ./evalite-export

Static hosting: Upload to artifact.ci, S3, GitHub Pages, etc.

CI Integration Example

A GitHub Actions workflow that runs the evals and uploads the exported UI as an artifact:

name: Run Evals
on: [push, pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
      - run: npm install
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx evalite --threshold=70
      - name: Export UI
        run: npx evalite export --output=./ui-export
      - name: Upload static UI
        uses: actions/upload-artifact@v4
        with:
          name: evalite-ui
          path: ui-export

View the results by downloading the artifact and running npx serve -s ./ui-export.

Running on CI

Run Evalite in run-once mode (default):

evalite

Executes all evals and exits.

Score Thresholds

Fail CI builds if scores fall below threshold:

evalite --threshold=70

Evalite exits with code 1 if the average score is below 70.

JSON Export

For programmatic analysis, export raw JSON:

evalite --outputPath=./results.json

Export Format

The export has a typed, hierarchical structure:

import type { Evalite } from "evalite";
type Output = Evalite.Exported.Output;

Contains:

  • run: Metadata (id, runType, createdAt)
  • evals: Array of evaluations with:
    • Basic info (name, filepath, duration, status, averageScore)
    • results: Individual test results with:
      • Test data (input, output, expected)
      • scores: Scorer results
      • traces: LLM call traces
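
As one example, here is a minimal sketch of consuming the export in a Node script. It assumes the JSON was written with --outputPath=./results.json and uses only the fields listed above; whether averageScore is a 0-1 fraction or a 0-100 percentage is an assumption here, so check your own export before gating on it.

import { readFileSync } from "node:fs";
import type { Evalite } from "evalite";

// Load the JSON export produced by `evalite --outputPath=./results.json`.
const output: Evalite.Exported.Output = JSON.parse(
  readFileSync("./results.json", "utf-8"),
);

// Print each eval's average score and fail the process if any is too low.
// Assumption: averageScore is a 0-1 fraction; adjust the cutoff if your
// export uses a 0-100 scale.
let failed = false;
for (const e of output.evals) {
  console.log(`${e.name}: ${e.averageScore}`);
  if (e.averageScore < 0.7) failed = true;
}
process.exit(failed ? 1 : 0);

Unlike the built-in --threshold flag, which gates on the overall average, a script like this can enforce a cutoff per eval.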

Use Cases

  • Analytics: Import into dashboards for performance tracking
  • Archiving: Store historical results for comparison (see the sketch after this list)
  • Custom tooling: Build scripts around eval data
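
As a sketch of the archiving case, the snippet below appends a one-line summary per run to an NDJSON file. The file name and summary shape are this example's own choices, not part of Evalite; only the run and evals fields come from the export format above.

import { appendFileSync, readFileSync } from "node:fs";
import type { Evalite } from "evalite";

const output: Evalite.Exported.Output = JSON.parse(
  readFileSync("./results.json", "utf-8"),
);

// One line per run; easy to grep, diff, or bulk-load into a dashboard later.
const summary = {
  runId: output.run.id,
  createdAt: output.run.createdAt,
  evals: output.evals.map((e) => ({
    name: e.name,
    averageScore: e.averageScore,
    status: e.status,
  })),
};
appendFileSync("./eval-history.ndjson", JSON.stringify(summary) + "\n");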