Live Evaluations
Live evaluations run scorers against real-time agent interactions. Attach scorers to agents during initialization to sample production traffic, enforce safety guardrails, and monitor conversation quality without running separate evaluation jobs.
Configuring Live Scorers
Define scorers in the eval config when creating an agent:
import { Agent, VoltAgentObservability } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";
const observability = new VoltAgentObservability();
const agent = new Agent({
name: "support-agent",
instructions: "Answer customer questions about products.",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
environment: "prod-us-east",
sampling: { type: "ratio", rate: 0.1 },
scorers: {
moderation: {
scorer: createModerationScorer({
model: openai("gpt-4o-mini"),
threshold: 0.5,
}),
},
},
},
});
Scorers execute asynchronously after the agent response is generated. Scoring does not block the user-facing response.
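For example (a minimal sketch, assuming the agent defined above and the core generateText method):
// The user-facing call resolves as soon as the model responds.
const response = await agent.generateText("Where is my refund?");
console.log(response.text);
// Scorers configured under eval run in the background afterwards; their
// results appear in observability (and in any onResult callbacks) later.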
Eval Configuration
Required Fields
None - all fields are optional. If no scorers are defined, evaluation is disabled.
Optional Fields
triggerSource
Tags the evaluation run with a trigger identifier. Use to distinguish between environments or traffic sources.
triggerSource: "production"; // live traffic
triggerSource: "staging"; // pre-production
triggerSource: "manual"; // manual testing
Default: "live" when unspecified.
environment
Labels the evaluation with an environment tag. Appears in telemetry and VoltOps dashboards.
environment: "prod-us-east";
environment: "local-dev";
sampling
Controls how many interactions are scored. Use sampling to reduce latency and LLM costs on high-volume agents.
Ratio-based:
sampling: {
type: "ratio",
rate: 0.1, // score 10% of interactions
}
Count-based:
sampling: {
type: "count",
rate: 100, // score every 100th interaction
}
Always sample:
sampling: { type: "ratio", rate: 1 } // 100%
When unspecified, sampling defaults to scoring every interaction (rate: 1).
Sampling decisions are made independently for each scorer. Set sampling at the eval level (applies to all scorers) or per-scorer to override.
scorers
Map of scorer configurations. Each key identifies a scorer instance, and the value defines the scorer function and parameters.
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.5 }),
},
keyword: {
scorer: keywordMatchScorer,
params: { keyword: "refund" },
},
}
redact
Function to remove sensitive data from evaluation payloads before storage. Called synchronously before scoring.
redact: (payload) => ({
...payload,
input: payload.input?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
output: payload.output?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
});
The redacted payload is stored in observability but scoring uses the original unredacted version.
Scorer Configuration
Each entry in the scorers map has this structure:
{
scorer: LocalScorerDefinition | (() => Promise<LocalScorerDefinition>),
params?: Record<string, unknown> | ((payload: AgentEvalContext) => Record<string, unknown>),
sampling?: SamplingPolicy,
id?: string,
onResult?: (result: AgentEvalResult) => void | Promise<void>,
}
Fields
scorer (required)
The scoring function. Use prebuilt scorers from @voltagent/scorers or custom implementations via buildScorer.
Prebuilt scorer:
import { createModerationScorer } from "@voltagent/scorers";
scorer: createModerationScorer({ model, threshold: 0.5 });
Custom scorer:
import { buildScorer } from "@voltagent/core";
const customScorer = buildScorer({
id: "length-check",
type: "agent",
label: "Response Length",
})
.score(({ payload }) => {
const length = payload.output?.length ?? 0;
return { score: length > 50 ? 1 : 0 };
})
.build();
Lazy-loaded scorer:
scorer: async () => {
const { createAnswerCorrectnessScorer } = await import("@voltagent/scorers");
return createAnswerCorrectnessScorer();
};
params
Static or dynamic parameters passed to the scorer.
Static:
params: {
keyword: "refund",
threshold: 0.8,
}
Dynamic:
params: (payload) => ({
keyword: extractKeyword(payload.input),
threshold: 0.8,
});
Dynamic params are resolved before each scorer invocation.
sampling
Override the global sampling policy for this scorer.
sampling: { type: "ratio", rate: 0.05 } // 5% for this scorer only
id
Override the scorer's default ID. Useful when using the same scorer multiple times with different params.
scorers: {
keywordRefund: {
scorer: keywordScorer,
id: "keyword-refund",
params: { keyword: "refund" },
},
keywordReturn: {
scorer: keywordScorer,
id: "keyword-return",
params: { keyword: "return" },
},
}
onResult
Callback invoked after scoring completes. Use for custom logging, alerting, or side effects.
onResult: async (result) => {
if (result.score !== null && result.score < 0.5) {
await alertingService.send({
message: `Low score: ${result.scorerName} = ${result.score}`,
});
}
};
Scorer Context
Scorers receive an AgentEvalContext object with these properties:
interface AgentEvalContext {
agentId: string;
agentName: string;
operationId: string;
operationType: "generateText" | "streamText" | string;
input: string | null; // normalized string
output: string | null; // normalized string
rawInput: unknown; // original input value
rawOutput: unknown; // original output value
userId?: string;
conversationId?: string;
traceId: string;
spanId: string;
timestamp: string;
metadata?: Record<string, unknown>;
rawPayload: AgentEvalPayload;
}
Use input and output for text-based scorers. Access rawInput and rawOutput for structured data.
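For instance, a scorer that validates structured output can read rawOutput directly (a sketch; the shape of rawOutput depends on how the agent produced its output, so the orderId field here is hypothetical):
import { buildScorer } from "@voltagent/core";
const structuredOutputScorer = buildScorer({
  id: "has-order-id",
  type: "agent",
})
  .score(({ payload }) => {
    // rawOutput preserves the original value, e.g. an object from generateObject
    const raw = payload.rawOutput as { orderId?: string } | null;
    return { score: raw?.orderId ? 1 : 0 };
  })
  .build();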
Building Custom Scorers
Use buildScorer to create scorers with custom logic:
import { buildScorer } from "@voltagent/core";
const lengthScorer = buildScorer({
id: "response-length",
type: "agent",
label: "Response Length Check",
})
.score(({ payload, params }) => {
const minLength = (params.minLength as number) ?? 50;
const length = payload.output?.length ?? 0;
return {
score: length >= minLength ? 1 : 0,
metadata: { actualLength: length, minLength },
};
})
.reason(({ score, params }) => {
const minLength = (params.minLength as number) ?? 50;
return {
reason:
score >= 1
? `Response meets minimum length of ${minLength} characters.`
: `Response is shorter than ${minLength} characters.`,
};
})
.build();
Builder Methods
.score(fn)
Defines the scoring function. Return { score, metadata? } or just the numeric score.
.score(({ payload, params, results }) => {
const match = payload.output?.includes(params.keyword);
return {
score: match ? 1 : 0,
metadata: { keyword: params.keyword, matched: match },
};
})
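The function may also return the number directly instead of an object (equivalent sketch):
.score(({ payload, params }) => (payload.output?.includes(params.keyword as string) ? 1 : 0))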
Context properties:
- payload - AgentEvalContext with input/output
- params - Resolved parameters
- results - Shared results object for multi-stage scoring
.reason(fn) (optional)
Generates human-readable explanations. Return { reason: string }.
.reason(({ score, params }) => ({
reason: score >= 1 ? "Match found" : "No match",
}))
.build()
Returns the LocalScorerDefinition object.
LLM Judge Scorers
Use AI SDK's generateObject to build LLM-based evaluators:
import { buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";
const JUDGE_SCHEMA = z.object({
score: z.number().min(0).max(1).describe("Score from 0 to 1"),
reason: z.string().describe("Detailed explanation"),
});
const helpfulnessScorer = buildScorer({
id: "helpfulness",
label: "Helpfulness Judge",
})
.score(async ({ payload }) => {
const prompt = `Rate the response for clarity and helpfulness.
User Input: ${payload.input}
Assistant Response: ${payload.output}
Provide a score from 0 to 1 with an explanation.`;
const response = await generateObject({
model: openai("gpt-4o-mini"),
schema: JUDGE_SCHEMA,
prompt,
maxTokens: 200,
});
return {
score: response.object.score,
metadata: {
reason: response.object.reason,
},
};
})
.build();
The judge calls the LLM with a structured schema, ensuring consistent scoring output.
Prebuilt Scorers
Moderation
import { createModerationScorer } from "@voltagent/scorers";
createModerationScorer({
model: openai("gpt-4o-mini"),
threshold: 0.5, // fail if score < 0.5
});
Flags unsafe content (toxicity, bias, etc.) using LLM-based classification.
Answer Correctness
import { createAnswerCorrectnessScorer } from "@voltagent/scorers";
const scorer = createAnswerCorrectnessScorer({
buildPayload: ({ payload, params }) => ({
input: payload.input,
output: payload.output,
expected: params.expectedAnswer,
}),
});
Evaluates factual accuracy against an expected answer. Supply the expected value through buildPayload, typically from params as shown above.
Answer Relevancy
import { createAnswerRelevancyScorer } from "@voltagent/scorers";
const scorer = createAnswerRelevancyScorer({
strictness: 3,
buildPayload: ({ payload, params }) => ({
input: payload.input,
output: payload.output,
context: params.referenceContext,
}),
});
Checks whether the output addresses the input. The strictness setting controls how strictly relevancy is judged.
Keyword Match
import { buildScorer } from "@voltagent/core";
const keywordScorer = buildScorer({
id: "keyword-match",
type: "agent",
})
.score(({ payload, params }) => {
const keyword = params.keyword as string;
const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
return { score: matched ? 1 : 0 };
})
.build();
// Usage:
scorers: {
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
},
}
VoltOps Integration
When a VoltOps client is configured globally, live scorer results are forwarded automatically:
import VoltAgent, { Agent, VoltAgentObservability } from "@voltagent/core";
import { VoltOpsClient } from "@voltagent/sdk";
const voltOpsClient = new VoltOpsClient({
publicKey: process.env.VOLTAGENT_PUBLIC_KEY,
secretKey: process.env.VOLTAGENT_SECRET_KEY,
});
const observability = new VoltAgentObservability();
new VoltAgent({
agents: { support: agent },
observability,
voltOpsClient, // enables automatic forwarding
});
The framework creates evaluation runs, registers scorers, appends results, and finalizes summaries. Each batch of scores (per agent interaction) becomes a separate run in VoltOps.
Sampling Strategies
Ratio Sampling
Sample a percentage of interactions:
sampling: { type: "ratio", rate: 0.1 } // 10% of traffic
Use for high-volume agents where scoring every interaction is expensive.
Count Sampling
Sample every Nth interaction:
sampling: { type: "count", rate: 100 } // every 100th interaction
Use when you need predictable sampling intervals or rate-limiting.
Per-Scorer Sampling
Override sampling for specific scorers:
eval: {
sampling: { type: "ratio", rate: 1 }, // default: score all
scorers: {
moderation: {
scorer: moderationScorer,
sampling: { type: "ratio", rate: 1 }, // always run moderation
},
helpfulness: {
scorer: helpfulnessScorer,
sampling: { type: "ratio", rate: 0.05 }, // 5% for expensive LLM judge
},
},
}
Error Handling
If a scorer throws an exception, the result is marked status: "error" and the error message is captured in errorMessage. Other scorers continue executing.
.score(({ payload, params }) => {
if (!params.keyword) {
throw new Error("keyword parameter is required");
}
// ...
})
The error appears in observability storage and VoltOps telemetry.
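To react to failed runs programmatically, an onResult callback can check for the error status (a sketch, assuming the status and errorMessage fields described above are exposed on the result):
onResult: async (result) => {
  if (result.status === "error") {
    console.warn(`Scorer failed: ${result.scorerName} - ${result.errorMessage}`);
  }
};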
Best Practices
Use Sampling for Expensive Scorers
LLM judges and embedding-based scorers consume tokens and add latency. Sample aggressively:
sampling: { type: "ratio", rate: 0.05 } // 5% for LLM judges
Combine Fast and Slow Scorers
Run lightweight scorers (keyword match, length checks) on all interactions. Sample LLM judges at lower rates.
scorers: {
keyword: {
scorer: keywordScorer,
sampling: { type: "ratio", rate: 1 }, // 100%
},
helpfulness: {
scorer: helpfulnessScorer,
sampling: { type: "ratio", rate: 0.1 }, // 10%
},
}
Use Redaction for PII
Strip sensitive data before storage:
redact: (payload) => ({
...payload,
input: payload.input?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
output: payload.output?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
});
Scorers receive unredacted data. Only storage and telemetry are redacted.
Use Thresholds for Alerts
Set thresholds and trigger alerts on failures:
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.7 }),
onResult: async (result) => {
if (result.score !== null && result.score < 0.7) {
await alertingService.send({
severity: "high",
message: `Moderation failed: ${result.score}`,
});
}
},
},
}
Tag Environments
Use environment to distinguish between deployments:
environment: process.env.NODE_ENV === "production" ? "prod" : "staging";
Filter telemetry by environment in VoltOps dashboards.
Examples
Moderation + Keyword Matching
import { Agent, VoltAgentObservability, buildScorer } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";
const moderationModel = openai("gpt-4o-mini");
const keywordScorer = buildScorer({
id: "keyword-match",
type: "agent",
})
.score(({ payload, params }) => {
const keyword = params.keyword as string;
const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
return { score: matched ? 1 : 0, metadata: { keyword, matched } };
})
.build();
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
sampling: { type: "ratio", rate: 1 },
scorers: {
moderation: {
scorer: createModerationScorer({ model: moderationModel, threshold: 0.5 }),
},
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
},
},
},
});
LLM Judge for Helpfulness
import { Agent, buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const HELPFULNESS_SCHEMA = z.object({
score: z.number().min(0).max(1),
reason: z.string(),
});
const helpfulnessScorer = buildScorer({
id: "helpfulness",
label: "Helpfulness",
})
.score(async ({ payload }) => {
const agent = new Agent({
name: "helpfulness-judge",
model: openai("gpt-4o-mini"),
instructions: "You rate responses for helpfulness",
});
const prompt = `Rate the response for clarity, accuracy, and helpfulness.
User Input: ${payload.input}
Assistant Response: ${payload.output}
Provide a score from 0 to 1 with an explanation.`;
const response = await agent.generateObject(prompt, HELPFULNESS_SCHEMA);
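// Record the raw judge output so the .reason step below can read it via results.raw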
const rawResults = (payload as any).results?.raw ?? {};
rawResults.helpfulnessJudge = response.object;
return {
score: response.object.score,
metadata: { reason: response.object.reason },
};
})
.reason(({ results }) => {
const judge = results.raw?.helpfulnessJudge as { reason?: string };
return { reason: judge?.reason ?? "No explanation provided." };
})
.build();
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
sampling: { type: "ratio", rate: 0.1 }, // 10% sampling
scorers: {
helpfulness: { scorer: helpfulnessScorer },
},
},
});
Multiple Scorers with Different Sampling
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
environment: "prod-us-east",
sampling: { type: "ratio", rate: 1 }, // default: score everything
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.5 }),
sampling: { type: "ratio", rate: 1 }, // always run
},
answerCorrectness: {
scorer: createAnswerCorrectnessScorer(),
sampling: { type: "ratio", rate: 0.05 }, // 5% (expensive)
params: (payload) => ({
expectedAnswer: lookupExpectedAnswer(payload.input),
}),
},
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
sampling: { type: "ratio", rate: 1 }, // cheap, always run
},
},
},
});
Combining Offline and Live Evaluations
Use live evals for real-time monitoring and offline evals for regression testing:
- Live: Sample 5-10% of production traffic with fast scorers (moderation, keyword match)
- Offline: Run comprehensive LLM judges on curated datasets nightly
Both share the same scorer definitions. Move scorers between eval types as needed.
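For example, a scorer kept in a shared module can back both setups (a sketch; wiring the same definition into an offline run is covered in Offline Evaluations):
// scorers/keyword.ts - shared between live and offline evals
import { buildScorer } from "@voltagent/core";
export const keywordScorer = buildScorer({
  id: "keyword-match",
  type: "agent",
})
  .score(({ payload, params }) => {
    const keyword = (params.keyword as string).toLowerCase();
    return { score: payload.output?.toLowerCase().includes(keyword) ? 1 : 0 };
  })
  .build();
// Live: reference keywordScorer in the agent's eval.scorers map.
// Offline: pass the same definition to your offline evaluation runs.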
Next Steps
- Offline Evaluations - Regression testing and CI integration
- Prebuilt Scorers - Full catalog of prebuilt scorers
- Building Custom Scorers - Create your own evaluation scorers