Experiments

Experiments are the core abstraction for running evaluations in VoltAgent. They define how to test your agents, what data to use, and how to measure success.

Creating Experiments

Use createExperiment from @voltagent/evals to define an evaluation experiment:

import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";

export default createExperiment({
  id: "customer-support-quality",
  label: "Customer Support Quality",
  description: "Evaluate customer support agent responses",

  // Reference a dataset by name
  dataset: {
    name: "support-qa-dataset",
  },

  // Define the runner function to evaluate
  runner: async ({ item, index, total }) => {
    // Access the dataset item
    const input = item.input;
    const expected = item.expected;

    // Run your evaluation logic
    const response = await myAgent.generateText(input);

    // Return the output
    return {
      output: response.text,
      metadata: {
        timestamp: Date.now(),
        modelUsed: "gpt-4o-mini",
      },
    };
  },

  // Configure scorers
  scorers: [
    scorers.exactMatch,
    {
      scorer: scorers.levenshtein,
      threshold: 0.8,
    },
  ],

  // Pass criteria
  passCriteria: {
    type: "meanScore",
    min: 0.7,
  },
});

Experiment Configuration

Core Fields

interface ExperimentConfig {
  // Unique identifier for the experiment
  id: string;

  // The runner function that executes for each dataset item
  runner: ExperimentRunner;

  // Optional but recommended
  label?: string;
  description?: string;
}
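Only id and runner are strictly required; label and description are optional but recommended. A minimal experiment can therefore be as small as the sketch below, which configures no dataset, scorers, or pass criteria (in practice you will usually add them as shown earlier):

import { createExperiment } from "@voltagent/evals";

// Minimal sketch: only the required id and runner fields.
// The runner simply echoes the dataset item's input as its output.
export default createExperiment({
  id: "minimal-echo",
  runner: async ({ item }) => {
    return { output: item.input };
  },
});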

Runner Function

The runner function is the code under evaluation. It is called once per dataset item, receives a context object, and returns the output to be scored:

type ExperimentRunner = (context: ExperimentRunnerContext) => Promise<ExperimentRunnerReturn>;

interface ExperimentRunnerContext {
  item: ExperimentDatasetItem; // Current dataset item
  index: number; // Item index
  total?: number; // Total items (if known)
  signal?: AbortSignal; // For cancellation
  voltOpsClient?: any; // VoltOps client if configured
  runtime?: {
    runId?: string;
    startedAt?: number;
    tags?: readonly string[];
  };
}

Example runners:

// Simple text generation
runner: async ({ item }) => {
  const result = await processInput(item.input);
  return {
    output: result,
    metadata: {
      confidence: 0.95,
    },
  };
};

// Using expected value for comparison
runner: async ({ item }) => {
  const prompt = `Question: ${item.input}\nExpected answer format: ${item.expected}`;
  const result = await generateResponse(prompt);
  return { output: result };
};

// With error handling
runner: async ({ item, signal }) => {
  try {
    const result = await processWithTimeout(item.input, signal);
    return { output: result };
  } catch (error) {
    return {
      output: null,
      metadata: {
        error: error.message,
        failed: true,
      },
    };
  }
};

// Accessing runtime context
runner: async ({ item, index, total, runtime }) => {
  console.log(`Processing item ${index + 1}/${total}`);
  console.log(`Run ID: ${runtime?.runId}`);

  const result = await process(item.input);
  return {
    output: result,
  };
};

Dataset Configuration

Experiments can use datasets in multiple ways:

// Reference registered dataset by name
dataset: {
  name: "my-dataset"
}

// Reference by ID
dataset: {
  id: "dataset-uuid",
  versionId: "version-uuid" // Optional specific version
}

// Limit number of items
dataset: {
  name: "large-dataset",
  limit: 100 // Only use first 100 items
}

// Inline items
dataset: {
  items: [
    {
      id: "1",
      input: { prompt: "What is 2+2?" },
      expected: "4"
    },
    {
      id: "2",
      input: { prompt: "Capital of France?" },
      expected: "Paris"
    }
  ]
}

// Dynamic resolver
dataset: {
  resolve: async ({ limit, signal }) => {
    const items = await fetchDatasetItems(limit);
    return {
      items,
      total: items.length,
      dataset: {
        name: "Dynamic Dataset",
        description: "Fetched at runtime"
      }
    };
  }
}

Dataset Item Structure

interface ExperimentDatasetItem {
  id: string; // Unique item ID
  label?: string; // Optional display name
  input: any; // Input data (your format)
  expected?: any; // Expected output (optional)
  extra?: Record<string, any>; // Additional data
  metadata?: Record<string, any>; // Item metadata

  // Automatically added if from a registered dataset
  datasetId?: string;
  datasetVersionId?: string;
  datasetName?: string;
}
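For example, a single support-style item matching this shape could look like the following (the id, input, expected, and metadata values here are purely illustrative):

// An illustrative dataset item matching the shape above.
const item = {
  id: "item-42",
  label: "Refund policy question",
  input: { prompt: "How do I request a refund?" },
  expected: "Point the customer to the refund form and offer to open a ticket.",
  metadata: { category: "billing" },
};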

Scorers Configuration

Configure how experiments use scorers:

import { scorers } from "@voltagent/scorers";

scorers: [
  // Use prebuilt scorer directly
  scorers.exactMatch,

  // Configure scorer with threshold
  {
    scorer: scorers.levenshtein,
    threshold: 0.9,
    name: "String Similarity",
  },

  // Custom scorer with metadata
  {
    scorer: myCustomScorer,
    threshold: 0.7,
    metadata: {
      category: "custom",
      version: "1.0.0",
    },
  },
];

Pass Criteria

Define success conditions for your experiments:

// Single criterion - mean score
passCriteria: {
  type: "meanScore",
  min: 0.8,
  label: "Average Quality",
  scorerId: "exact-match" // Optional: specific scorer
}

// Single criterion - pass rate
passCriteria: {
  type: "passRate",
  min: 0.9,
  label: "90% Pass Rate",
  severity: "error" // "error" or "warn"
}

// Multiple criteria (all must pass)
passCriteria: [
  {
    type: "meanScore",
    min: 0.7,
    label: "Overall Quality"
  },
  {
    type: "passRate",
    min: 0.95,
    label: "Consistency Check",
    scorerId: "exact-match"
  }
]

VoltOps Integration

Configure VoltOps for cloud-based tracking:

voltOps: {
  client: voltOpsClient, // VoltOps client instance
  triggerSource: "ci", // Source identifier
  autoCreateRun: true, // Auto-create eval runs
  autoCreateScorers: true, // Auto-register scorers
  tags: ["nightly", "regression"] // Tags for filtering
}
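The client field expects a configured VoltOps client instance. A sketch of wiring one up, assuming the VoltOpsClient constructor exported by @voltagent/core and API keys supplied via environment variables (adjust the import and key names to your setup):

import { VoltOpsClient } from "@voltagent/core";

// Assumption: VoltOpsClient comes from @voltagent/core and accepts the
// project's public/secret keys; the env var names below are placeholders.
const voltOpsClient = new VoltOpsClient({
  publicKey: process.env.VOLTOPS_PUBLIC_KEY ?? "",
  secretKey: process.env.VOLTOPS_SECRET_KEY ?? "",
});

// Pass the instance through the voltOps block shown above, e.g.
// voltOps: { client: voltOpsClient, autoCreateRun: true, tags: ["nightly"] }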

Experiment Binding

Link your local experiment definition to an experiment in VoltOps:

experiment: {
  name: "production-quality-check", // VoltOps experiment name
  id: "exp-uuid", // Or use an existing ID
  autoCreate: true // Create if it doesn't exist
}

Running Experiments

Via CLI

Save your experiment to a file:

// experiments/support-quality.ts
import { createExperiment } from "@voltagent/evals";

export default createExperiment({
  id: "support-quality",
  dataset: { name: "support-dataset" },
  runner: async ({ item }) => {
    // evaluation logic
    return { output: "response" };
  },
});

Run with:

npm run volt eval run --experiment ./experiments/support-quality.ts

Programmatically

import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";

const summary = await runExperiment(experiment, {
  concurrency: 5, // Run 5 items in parallel

  onItemComplete: (event) => {
    console.log(`Completed item ${event.index}/${event.total}`);
    console.log(`Score: ${event.result.scores[0]?.score}`);
  },

  onComplete: (summary) => {
    console.log(`Experiment completed: ${summary.passed ? "PASSED" : "FAILED"}`);
    console.log(`Mean score: ${summary.meanScore}`);
  },
});

Complete Example

Here's a complete example from the codebase:

import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";
import { Agent } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";

const supportAgent = new Agent({
  name: "Support Agent",
  instructions: "You are a helpful customer support agent.",
  model: openai("gpt-4o-mini"),
});

export default createExperiment({
  id: "support-agent-eval",
  label: "Support Agent Evaluation",
  description: "Evaluates support agent response quality",

  dataset: {
    name: "support-qa-v2",
    limit: 100, // Test on first 100 items
  },

  runner: async ({ item, index, total }) => {
    console.log(`Processing ${index + 1}/${total}`);

    try {
      const response = await supportAgent.generateText({
        messages: [{ role: "user", content: item.input.prompt }],
      });

      return {
        output: response.text,
        metadata: {
          model: "gpt-4o-mini",
          tokenUsage: response.usage,
        },
      };
    } catch (error) {
      return {
        output: null,
        metadata: {
          error: error.message,
          failed: true,
        },
      };
    }
  },

  scorers: [
    {
      scorer: scorers.exactMatch,
      threshold: 1.0,
    },
    {
      scorer: scorers.levenshtein,
      threshold: 0.8,
      name: "String Similarity",
    },
  ],

  passCriteria: [
    {
      type: "meanScore",
      min: 0.75,
      label: "Overall Quality",
    },
    {
      type: "passRate",
      min: 0.9,
      scorerId: "exact-match",
      label: "Exact Match Rate",
    },
  ],

  experiment: {
    name: "support-agent-regression",
    autoCreate: true,
  },

  voltOps: {
    autoCreateRun: true,
    tags: ["regression", "support"],
  },
});

Result Structure

When running experiments, you get a summary with this structure:

interface ExperimentSummary {
  experimentId: string;
  runId: string;
  status: "completed" | "failed" | "cancelled";
  passed: boolean;

  startedAt: number;
  completedAt: number;
  durationMs: number;

  results: ExperimentItemResult[];

  // Aggregate metrics
  totalItems: number;
  completedItems: number;
  meanScore: number;
  passRate: number;

  // Pass criteria results
  criteriaResults?: {
    label?: string;
    passed: boolean;
    value: number;
    threshold: number;
  }[];

  metadata?: Record<string, unknown>;
}
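A script (for example in CI) can consume this summary to report metrics and fail the build when the pass criteria are not met. The field names follow the interface above; exiting with a non-zero code is just one convention:

import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";

const summary = await runExperiment(experiment, { concurrency: 5 });

// Aggregate metrics from the summary
console.log(`${summary.completedItems}/${summary.totalItems} items in ${summary.durationMs}ms`);
console.log(`Mean score: ${summary.meanScore}, pass rate: ${summary.passRate}`);

// Per-criterion results are present when passCriteria were configured
for (const criterion of summary.criteriaResults ?? []) {
  const status = criterion.passed ? "PASS" : "FAIL";
  console.log(`${status} ${criterion.label ?? "criterion"}: ${criterion.value} (min ${criterion.threshold})`);
}

// Fail the CI job if the experiment did not pass
if (!summary.passed) {
  process.exit(1);
}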

Best Practices

1. Use Descriptive IDs

id: "gpt4-customer-support-accuracy-v2"; // Good
id: "test1"; // Bad

2. Handle Errors Gracefully

runner: async ({ item }) => {
  try {
    const result = await process(item.input);
    return { output: result };
  } catch (error) {
    // Return error info for analysis
    return {
      output: null,
      metadata: {
        error: error.message,
        errorType: error.constructor.name,
      },
    };
  }
};

3. Add Meaningful Metadata

runner: async ({ item, runtime }) => {
  const startTime = Date.now();
  const result = await process(item.input);

  return {
    output: result,
    metadata: {
      processingTimeMs: Date.now() - startTime,
      runId: runtime?.runId,
      itemCategory: item.metadata?.category,
    },
  };
};

4. Use Appropriate Concurrency

// For rate-limited APIs
await runExperiment(experiment, {
  concurrency: 2, // Low concurrency
});

// For local processing
await runExperiment(experiment, {
  concurrency: 10, // Higher concurrency
});

5. Tag Experiments Properly

voltOps: {
  tags: ["model:gpt-4", "version:2.1.0", "type:regression", "priority:high"]
}
