Experiments
Experiments are the core abstraction for running evaluations in VoltAgent. They define how to test your agents, what data to use, and how to measure success.
Creating Experiments
Use createExperiment from @voltagent/evals to define an evaluation experiment:
import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";
export default createExperiment({
id: "customer-support-quality",
label: "Customer Support Quality",
description: "Evaluate customer support agent responses",
// Reference a dataset by name
dataset: {
name: "support-qa-dataset",
},
// Define the runner function to evaluate
runner: async ({ item, index, total }) => {
// Access the dataset item
const input = item.input;
const expected = item.expected;
// Run your evaluation logic
const response = await myAgent.generateText(input);
// Return the output
return {
output: response.text,
metadata: {
respondedAt: Date.now(),
modelUsed: "gpt-4o-mini",
},
};
},
// Configure scorers
scorers: [
scorers.exactMatch,
{
scorer: scorers.levenshtein,
threshold: 0.8,
},
],
// Pass criteria
passCriteria: {
type: "meanScore",
min: 0.7,
},
});
Experiment Configuration
Required Fields
interface ExperimentConfig {
// Unique identifier for the experiment
id: string;
// The runner function that executes for each dataset item
runner: ExperimentRunner;
// Optional but recommended
label?: string;
description?: string;
}
Runner Function
The runner function is what you're evaluating. It receives a context object and produces output:
type ExperimentRunner = (context: ExperimentRunnerContext) => Promise<ExperimentRunnerReturn>;
interface ExperimentRunnerContext {
item: ExperimentDatasetItem; // Current dataset item
index: number; // Item index
total?: number; // Total items (if known)
signal?: AbortSignal; // For cancellation
voltOpsClient?: any; // VoltOps client if configured
runtime?: {
runId?: string;
startedAt?: number;
tags?: readonly string[];
};
}
Example runners:
// Simple text generation
runner: async ({ item }) => {
const result = await processInput(item.input);
return {
output: result,
metadata: {
confidence: 0.95,
},
};
};
// Using expected value for comparison
runner: async ({ item }) => {
const prompt = `Question: ${item.input}\nExpected answer format: ${item.expected}`;
const result = await generateResponse(prompt);
return { output: result };
};
// With error handling
runner: async ({ item, signal }) => {
try {
const result = await processWithTimeout(item.input, signal);
return { output: result };
} catch (error) {
return {
output: null,
metadata: {
error: error.message,
failed: true,
},
};
}
};
// Accessing runtime context
runner: async ({ item, index, total, runtime }) => {
console.log(`Processing item ${index + 1}/${total}`);
console.log(`Run ID: ${runtime?.runId}`);
const result = await process(item.input);
return {
output: result,
};
};
Dataset Configuration
Experiments can use datasets in multiple ways:
// Reference registered dataset by name
dataset: {
name: "my-dataset"
}
// Reference by ID
dataset: {
id: "dataset-uuid",
versionId: "version-uuid" // Optional specific version
}
// Limit number of items
dataset: {
name: "large-dataset",
limit: 100 // Only use first 100 items
}
// Inline items
dataset: {
items: [
{
id: "1",
input: { prompt: "What is 2+2?" },
expected: "4"
},
{
id: "2",
input: { prompt: "Capital of France?" },
expected: "Paris"
}
]
}
// Dynamic resolver
dataset: {
resolve: async ({ limit, signal }) => {
const items = await fetchDatasetItems(limit);
return {
items,
total: items.length,
dataset: {
name: "Dynamic Dataset",
description: "Fetched at runtime"
}
};
}
}
Dataset Item Structure
interface ExperimentDatasetItem {
id: string; // Unique item ID
label?: string; // Optional display name
input: any; // Input data (your format)
expected?: any; // Expected output (optional)
extra?: Record<string, any>; // Additional data
metadata?: Record<string, any>; // Item metadata
// Automatically added if from registered dataset
datasetId?: string;
datasetVersionId?: string;
datasetName?: string;
}
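For illustration, here is an inline item that fills in the optional fields (a minimal sketch with made-up values; pass it to the experiment as dataset: { items }):
// An inline dataset item using the optional fields (values are illustrative).
const items = [
  {
    id: "faq-001",
    label: "Refund policy question",
    input: { prompt: "How do I request a refund?" },
    expected: "Refunds can be requested within 30 days of purchase.",
    extra: { locale: "en-US" }, // free-form data your runner can read via item.extra
    metadata: { category: "billing" }, // handy for filtering or grouping results
  },
];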
Scorers Configuration
Configure how experiments use scorers:
import { scorers } from "@voltagent/scorers";
scorers: [
// Use prebuilt scorer directly
scorers.exactMatch,
// Configure scorer with threshold
{
scorer: scorers.levenshtein,
threshold: 0.9,
name: "String Similarity",
},
// Custom scorer with metadata
{
scorer: myCustomScorer,
threshold: 0.7,
metadata: {
category: "custom",
version: "1.0.0",
},
},
];
Pass Criteria
Define success conditions for your experiments:
// Single criterion - mean score
passCriteria: {
type: "meanScore",
min: 0.8,
label: "Average Quality",
scorerId: "exact-match" // Optional: specific scorer
}
// Single criterion - pass rate
passCriteria: {
type: "passRate",
min: 0.9,
label: "90% Pass Rate",
severity: "error" // "error" or "warn"
}
// Multiple criteria (all must pass)
passCriteria: [
{
type: "meanScore",
min: 0.7,
label: "Overall Quality"
},
{
type: "passRate",
min: 0.95,
label: "Consistency Check",
scorerId: "exact-match"
}
]
VoltOps Integration
Configure VoltOps for cloud-based tracking:
voltOps: {
client: voltOpsClient, // VoltOps client instance
triggerSource: "ci", // Source identifier
autoCreateRun: true, // Auto-create eval runs
autoCreateScorers: true, // Auto-register scorers
tags: ["nightly", "regression"] // Tags for filtering
}
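If you construct the client yourself, a minimal sketch looks like this (assuming VoltOpsClient is exported from @voltagent/core and accepts public/secret key options; verify against your installed version — answerQuestion is a stand-in for your own agent call):
import { VoltOpsClient } from "@voltagent/core";
import { createExperiment } from "@voltagent/evals";

// Assumption: the client takes its credentials as options; here they come from env vars.
const voltOpsClient = new VoltOpsClient({
  publicKey: process.env.VOLTOPS_PUBLIC_KEY!,
  secretKey: process.env.VOLTOPS_SECRET_KEY!,
});

// Stand-in for your real agent call.
const answerQuestion = async (input: unknown) => `stub answer for ${JSON.stringify(input)}`;

export default createExperiment({
  id: "support-quality-ci",
  dataset: { name: "support-dataset" },
  runner: async ({ item }) => ({ output: await answerQuestion(item.input) }),
  voltOps: {
    client: voltOpsClient,
    triggerSource: "ci",
    autoCreateRun: true,
    tags: ["nightly", "regression"],
  },
});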
Experiment Binding
Bind your local experiment definition to an experiment in VoltOps:
experiment: {
name: "production-quality-check", // VoltOps experiment name
id: "exp-uuid", // Or use existing ID
autoCreate: true // Create it if it doesn't exist
}
Running Experiments
Via CLI
Save your experiment to a file:
// experiments/support-quality.ts
import { createExperiment } from "@voltagent/evals";
export default createExperiment({
id: "support-quality",
dataset: { name: "support-dataset" },
runner: async ({ item }) => {
// evaluation logic
return { output: "response" };
},
});
Run with:
npm run volt eval run --experiment ./experiments/support-quality.ts
Programmatically
import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";
const summary = await runExperiment(experiment, {
concurrency: 5, // Run 5 items in parallel
onItemComplete: (event) => {
console.log(`Completed item ${event.index}/${event.total}`);
console.log(`Score: ${event.result.scores[0]?.score}`);
},
onComplete: (summary) => {
console.log(`Experiment completed: ${summary.passed ? "PASSED" : "FAILED"}`);
console.log(`Mean score: ${summary.meanScore}`);
},
});
Complete Example
Here's a complete example from the codebase:
import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";
import { Agent } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
const supportAgent = new Agent({
name: "Support Agent",
instructions: "You are a helpful customer support agent.",
model: openai("gpt-4o-mini"),
});
export default createExperiment({
id: "support-agent-eval",
label: "Support Agent Evaluation",
description: "Evaluates support agent response quality",
dataset: {
name: "support-qa-v2",
limit: 100, // Test on first 100 items
},
runner: async ({ item, index, total }) => {
console.log(`Processing ${index + 1}/${total}`);
try {
const response = await supportAgent.generateText({
messages: [{ role: "user", content: item.input.prompt }],
});
return {
output: response.text,
metadata: {
model: "gpt-4o-mini",
tokenUsage: response.usage,
},
};
} catch (error) {
return {
output: null,
metadata: {
error: error.message,
failed: true,
},
};
}
},
scorers: [
{
scorer: scorers.exactMatch,
threshold: 1.0,
},
{
scorer: scorers.levenshtein,
threshold: 0.8,
name: "String Similarity",
},
],
passCriteria: [
{
type: "meanScore",
min: 0.75,
label: "Overall Quality",
},
{
type: "passRate",
min: 0.9,
scorerId: "exact-match",
label: "Exact Match Rate",
},
],
experiment: {
name: "support-agent-regression",
autoCreate: true,
},
voltOps: {
autoCreateRun: true,
tags: ["regression", "support"],
},
});
Result Structure
When running experiments, you get a summary with this structure:
interface ExperimentSummary {
experimentId: string;
runId: string;
status: "completed" | "failed" | "cancelled";
passed: boolean;
startedAt: number;
completedAt: number;
durationMs: number;
results: ExperimentItemResult[];
// Aggregate metrics
totalItems: number;
completedItems: number;
meanScore: number;
passRate: number;
// Pass criteria results
criteriaResults?: {
label?: string;
passed: boolean;
value: number;
threshold: number;
}[];
metadata?: Record<string, unknown>;
}
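In CI it is common to let the process exit code reflect the pass criteria. A minimal sketch that consumes the summary fields above (the experiment import path is illustrative):
import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";

const summary = await runExperiment(experiment, { concurrency: 4 });

console.log(`Status: ${summary.status}, mean score: ${summary.meanScore.toFixed(3)}`);
for (const criterion of summary.criteriaResults ?? []) {
  const state = criterion.passed ? "passed" : "failed";
  console.log(`${criterion.label ?? "criterion"}: ${state} (${criterion.value} vs min ${criterion.threshold})`);
}

// Fail the CI job when the pass criteria are not met.
if (!summary.passed) {
  process.exit(1);
}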
Best Practices
1. Use Descriptive IDs
id: "gpt4-customer-support-accuracy-v2"; // Good
id: "test1"; // Bad
2. Handle Errors Gracefully
runner: async ({ item }) => {
try {
const result = await process(item.input);
return { output: result };
} catch (error) {
// Return error info for analysis
return {
output: null,
metadata: {
error: error.message,
errorType: error.constructor.name,
},
};
}
};
3. Add Meaningful Metadata
runner: async ({ item, runtime }) => {
const startTime = Date.now();
const result = await process(item.input);
return {
output: result,
metadata: {
processingTimeMs: Date.now() - startTime,
runId: runtime?.runId,
itemCategory: item.metadata?.category,
},
};
};
4. Use Appropriate Concurrency
// For rate-limited APIs
await runExperiment(experiment, {
concurrency: 2, // Low concurrency
});
// For local processing
await runExperiment(experiment, {
concurrency: 10, // Higher concurrency
});
5. Tag Experiments Properly
voltOps: {
tags: ["model:gpt-4", "version:2.1.0", "type:regression", "priority:high"];
}
Next Steps
- Datasets - Learn about creating and managing datasets
- Building Custom Scorers - Create domain-specific scorers
- Prebuilt Scorers - Explore available scorers
- CLI Reference - Run experiments from the command line