Alerts



Alerts notify your team when AI agent metrics exceed defined thresholds. VoltOps evaluates alert conditions every minute and triggers notifications through configured channels.

Alert Components

An alert consists of:

  • Metric: What to monitor (error rate, latency)
  • Condition: Threshold and condition type (count or percent)
  • Time Window: Evaluation period (5, 15, 30, or 60 minutes)
  • Filters: Scope the alert to specific traces
  • Channels: Where to send notifications (webhook, Slack)
  • Cooldown: Minimum time between notifications

Creating an Alert

Navigate to the Alerts page in VoltOps and click "Create Alert".

Selecting a Metric

| Metric | Description |
| --- | --- |
| Errored Runs | Counts traces with status: error or error_count > 0 |
| Latency | Calculates average trace duration in the time window |

Condition Types

For error rate alerts:

  • Count: Trigger when error count exceeds N runs
  • Percent: Trigger when error percentage exceeds N%

For latency alerts:

  • Trigger when average latency exceeds N seconds
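For example, with 200 traces in the window and 12 errors, a count condition with a threshold of 10 triggers, while a percent condition with a threshold of 10 does not (12 / 200 = 6%). The sketch below illustrates the difference, assuming a simplified trace shape; it is illustrative only, not VoltOps internals.

```typescript
// Illustrative only: how count vs. percent error-rate conditions evaluate a
// window of traces. The Trace shape and helper names are assumptions, not the VoltOps API.
interface Trace {
  status: "error" | "success" | "in_progress";
}

// Count condition: trigger when the number of errored runs exceeds N.
function exceedsCount(traces: Trace[], n: number): boolean {
  return traces.filter((t) => t.status === "error").length > n;
}

// Percent condition: trigger when errored runs exceed N% of all runs in the window.
function exceedsPercent(traces: Trace[], n: number): boolean {
  if (traces.length === 0) return false;
  const errors = traces.filter((t) => t.status === "error").length;
  return (errors / traces.length) * 100 > n;
}
```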

Time Windows

Available windows: 5, 15, 30, or 60 minutes. The alert evaluates all traces within this rolling window.

Cooldown Period

After an alert triggers, VoltOps waits for the cooldown period before sending another notification. Available options: 5, 15, 30, 60, or 120 minutes.

Filters

Filters narrow the scope of an alert to specific traces. Multiple filters are combined with AND logic.

Available Filter Fields

| Field | Type | Operators | Description |
| --- | --- | --- | --- |
| Status | select | eq, neq | Trace status: error, success, in_progress |
| Latency (ms) | number | gt, lt | Trace duration in milliseconds |
| Model | text | eq, neq | LLM model name |
| User ID | text | eq, neq | User identifier from trace |
| Input | text | contains | Trace input content |
| Output | text | contains | Trace output content |
| Error Message | text | eq, contains | Error message from failed traces |
| Agent/Workflow Name | text | eq, neq, contains | Name of the root span |
| Entity Type | select | eq, neq | agent or workflow |
| Metadata | text | eq, neq, contains | Custom metadata key-value pairs |

Filter Operators

| Operator | Description |
| --- | --- |
| eq | Equals |
| neq | Not equals |
| gt | Greater than |
| lt | Less than |
| contains | Contains substring (case-insensitive) |
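
As an illustration of how filters compose, the sketch below scopes an alert to errored production traces slower than five seconds. The object shape is an assumption for readability, not the VoltOps alert schema; configure the equivalent filters in the VoltOps UI.

```typescript
// Hypothetical filter shape for illustration; not the actual VoltOps alert schema.
type FilterOperator = "eq" | "neq" | "gt" | "lt" | "contains";

interface AlertFilter {
  field: string; // e.g. "Status", "Latency (ms)", or a metadata key
  operator: FilterOperator;
  value: string | number;
}

// Multiple filters are combined with AND logic:
// errored traces, slower than 5000 ms, from the production environment.
const filters: AlertFilter[] = [
  { field: "Status", operator: "eq", value: "error" },
  { field: "Latency (ms)", operator: "gt", value: 5000 },
  { field: "metadata.environment", operator: "eq", value: "production" },
];
```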

Metadata Filters

Filter by custom metadata fields using dot notation:

  • metadata.environment - Direct metadata key
  • metadata.context.region - Nested context key
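
For example, if your instrumentation attaches metadata like the object below (a hypothetical example), metadata.environment resolves to "production" and metadata.context.region resolves to "eu-west-1".

```typescript
// Hypothetical metadata attached to a trace by your own instrumentation.
const metadata = {
  environment: "production", // matched by metadata.environment
  context: {
    region: "eu-west-1",     // matched by metadata.context.region
  },
};
```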

Notification Channels

Webhook

Send HTTP POST requests to any URL when an alert triggers.

Configuration:

  • URL: Endpoint to receive the webhook
  • Headers: Optional HTTP headers (JSON format)
  • Body: Optional custom payload (JSON format)

Default Payload:

```json
{
  "alert_id": "uuid",
  "alert_name": "High Error Rate",
  "metric": "error_rate",
  "value": 15.5,
  "threshold": 10,
  "timestamp": "2024-01-15T10:30:00Z"
}
```
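
A minimal receiver sketch for this payload, assuming a Node.js runtime. The X-Alert-Token check stands in for a shared secret you could configure yourself under Headers; it is not something VoltOps requires.

```typescript
import { createServer } from "node:http";

// Minimal receiver for the default payload shown above (illustrative sketch).
const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/voltops-alerts") {
    res.statusCode = 404;
    res.end();
    return;
  }
  // Example shared-secret check; send the matching header via the Headers option.
  if (req.headers["x-alert-token"] !== process.env.ALERT_TOKEN) {
    res.statusCode = 401;
    res.end();
    return;
  }

  let body = "";
  req.setEncoding("utf8");
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const payload = JSON.parse(body);

    // "Send Test Notification" adds is_test: true; skip routing for test payloads.
    if (!payload.is_test) {
      console.log(
        `Alert ${payload.alert_name}: ${payload.metric} = ${payload.value} ` +
          `(threshold ${payload.threshold}) at ${payload.timestamp}`
      );
      // Forward to your paging or ticketing system here.
    }
    res.statusCode = 200;
    res.end("ok");
  });
});

server.listen(3000);
```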

Slack

Send notifications to a Slack channel using Incoming Webhooks.

Setup:

  1. Create an Incoming Webhook in your Slack workspace (see Slack's Incoming Webhooks documentation)
  2. Copy the webhook URL (format: https://hooks.slack.com/services/...)
  3. Paste the URL in the Slack channel configuration

Slack notifications include:

  • Alert name and metric
  • Current value and threshold
  • Link to view the incident
  • Link to view a sample trace (when available)

Testing Notifications

Click "Send Test Notification" to verify your channel configuration. Test webhooks include an is_test: true field.

Incidents

When an alert triggers, VoltOps creates an incident. Incidents track the lifecycle of an alert from trigger to resolution.

Incident Statuses

| Status | Description |
| --- | --- |
| Open | Alert triggered, awaiting response |
| Acknowledged | Team member is investigating |
| Snoozed | Temporarily muted until a specified time |
| Resolved | Issue addressed, incident closed |

Incident Workflow

  1. Alert triggers → Incident created with status open
  2. Team member takes ownership → Status changes to acknowledged
  3. Investigation complete → Status changes to resolved

Alternatively:

  • Snooze: Temporarily silence the incident. When the snooze period expires, the incident reopens if conditions still trigger.
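
The lifecycle above can be pictured as a small state machine. The sketch below is illustrative only; VoltOps manages these transitions for you from the incident view.

```typescript
// Illustrative state machine for the incident lifecycle; the allowed transitions
// are an assumption based on the workflow described above.
type IncidentStatus = "open" | "acknowledged" | "snoozed" | "resolved";

const transitions: Record<IncidentStatus, IncidentStatus[]> = {
  open: ["acknowledged", "snoozed", "resolved"],
  acknowledged: ["snoozed", "resolved"],
  snoozed: ["open"], // reopens when the snooze expires if conditions still trigger
  resolved: [],      // closed
};

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return transitions[from].includes(to);
}
```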

Incident Details

Each incident contains:

  • Payload: Metric value at trigger time, threshold, and sample trace ID
  • Assignee: Team member responsible for resolution
  • Notes: Comments added during investigation
  • Timestamps: Triggered at, resolved at

Dashboard

The Alerts dashboard displays:

| Metric | Description |
| --- | --- |
| Total Incidents | Number of incidents in the selected period |
| Active Incidents | Currently open, acknowledged, or snoozed incidents |
| Avg Resolve Time | Mean time from trigger to resolution |
| Sparkline | Daily incident counts over the period |

Toggle between 7-day and 30-day views using the period selector.

Alert Evaluation

VoltOps runs a scheduled job every minute that:

  1. Queries traces within each alert's time window
  2. Applies the configured filters
  3. Calculates the metric value
  4. Compares against the threshold
  5. Creates an incident if triggered and no open incident exists
  6. Sends notifications respecting the cooldown period

If an incident is snoozed and the snooze period expires, the incident reopens and notifications resume.
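
The sketch below condenses one evaluation pass for a single alert into code. Every name and shape in it is hypothetical; it mirrors the steps above rather than the actual VoltOps implementation.

```typescript
// Hypothetical sketch of one evaluation pass; not the VoltOps implementation.
interface Incident {
  status: "open" | "acknowledged" | "snoozed" | "resolved";
  lastNotifiedAt: number; // epoch milliseconds
}

// Steps 1-3 (query, filter, compute) and the persistence/notification side effects
// are injected here so the control flow of steps 4-6 stays visible.
interface EvaluationContext {
  queryMetricValue: (windowMinutes: number) => Promise<number>;
  findOpenIncident: () => Promise<Incident | undefined>;
  createIncident: (value: number) => Promise<Incident>;
  notifyChannels: (value: number) => Promise<void>;
}

async function evaluateAlert(
  alert: { threshold: number; windowMinutes: number; cooldownMinutes: number },
  ctx: EvaluationContext,
  now: number = Date.now()
): Promise<void> {
  // Steps 1-3: traces in the rolling window are queried, filtered, and reduced to one value.
  const value = await ctx.queryMetricValue(alert.windowMinutes);

  // Step 4: compare against the threshold.
  if (value <= alert.threshold) return;

  // Step 5: create an incident only if no open incident exists for this alert.
  const incident = (await ctx.findOpenIncident()) ?? (await ctx.createIncident(value));

  // Step 6: notify, respecting the cooldown since the last notification.
  if (now - incident.lastNotifiedAt >= alert.cooldownMinutes * 60_000) {
    await ctx.notifyChannels(value);
  }
}
```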
