Alertmanager¶
Alert routing, notification channels, and inhibition rules for the home security intelligence system.
Key Files:

- monitoring/alertmanager.yml:1-215 - Alertmanager configuration
- monitoring/alerting-rules.yml:1-1116 - Alert rule definitions
- monitoring/prometheus-rules.yml:1-169 - Recording rules for SLIs
- monitoring/grafana/provisioning/alerting/log-alerts.yml - Grafana log-based alerts
Overview¶
Alertmanager receives alerts from Prometheus based on metric thresholds and routes them to appropriate notification channels. The system uses severity-based routing with escalation paths, grouping to reduce alert noise, and inhibition rules to prevent cascading alerts during major incidents.
Alerts are categorized by severity (critical, warning, info) and component (pipeline, infrastructure, ai-services, database). Route matching determines the notification channel: critical alerts fan out to multiple channels (webhook and PagerDuty), warnings go to the standard channels (Slack, email), and info alerts are logged only.
The configuration supports both self-hosted and cloud deployments with flexible notification targets including webhooks, email, Slack, and PagerDuty.
Architecture¶
graph TD
subgraph "Alert Sources"
PROM[Prometheus<br/>alerting-rules.yml]
GRAF[Grafana Alerts<br/>log-alerts.yml]
end
subgraph "Alertmanager"
REC[Receiver<br/>alertmanager.yml:1-20]
ROUTE[Route Matching<br/>alertmanager.yml:95-180]
GROUP[Grouping]
INHIB[Inhibition Rules<br/>alertmanager.yml:181-215]
end
subgraph "Notification Channels"
WH[Webhook<br/>Backend API]
SLACK[Slack Integration]
EMAIL[Email SMTP]
PD[PagerDuty]
end
PROM --> |alert| REC
GRAF --> |alert| REC
REC --> ROUTE
ROUTE --> GROUP
GROUP --> INHIB
INHIB --> |critical| WH
INHIB --> |critical| PD
INHIB --> |warning| SLACK
INHIB --> |warning| EMAIL
Alert Configuration¶
Global Settings¶
Base configuration (monitoring/alertmanager.yml:1-30):
global:
  # Time to wait before declaring an alert resolved
  resolve_timeout: 5m

  # SMTP configuration for email alerts
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: '${SMTP_PASSWORD}'
  smtp_require_tls: true

  # Slack API URL (can be overridden per receiver)
  slack_api_url: '${SLACK_WEBHOOK_URL}'

  # PagerDuty Events API endpoint
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
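Note that Alertmanager does not expand environment variable references such as ${SMTP_PASSWORD} in its configuration file itself; they must be substituted before startup. One common pattern is an envsubst step in the container entrypoint (a sketch; the .tpl path is illustrative, not from this repo):
# Render the template with secrets from the environment, then start Alertmanager
envsubst < /etc/alertmanager/alertmanager.yml.tpl > /etc/alertmanager/alertmanager.yml
exec alertmanager --config.file=/etc/alertmanager/alertmanager.yml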
Receiver Configuration¶
Define notification targets (monitoring/alertmanager.yml:35-90):
receivers:
  # Critical alerts: webhook + PagerDuty
  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://backend:8000/api/webhooks/alerts'
        send_resolved: true
        http_config:
          basic_auth:
            username: 'alertmanager'
            password_file: '/etc/alertmanager/webhook_password'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'

  # Warning alerts: Slack
  - name: 'warning-alerts'
    slack_configs:
      - channel: '#hsi-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  # Info alerts: log only (no notification)
  - name: 'info-alerts'
    webhook_configs:
      - url: 'http://backend:8000/api/webhooks/alerts?severity=info'
        send_resolved: false

  # Null receiver for silenced alerts
  - name: 'null'
Route Configuration¶
Route matching rules (monitoring/alertmanager.yml:95-180):
route:
  # Default receiver for unmatched alerts
  receiver: 'warning-alerts'

  # Time to wait before sending the first notification for a new group
  group_wait: 30s

  # Time between notification batches for the same group
  group_interval: 5m

  # Time before re-sending if an alert is still firing
  repeat_interval: 4h

  # Group alerts by these labels
  group_by: ['alertname', 'severity', 'component']

  # Child routes (evaluated in order, first match wins)
  routes:
    # Database alerts to the critical channel; this route must precede the
    # severity catch-alls below, or warning/info database alerts never reach it
    - match_re:
        alertname: 'Database.*'
      receiver: 'critical-alerts'
      group_by: ['alertname', 'instance']

    # Critical infrastructure alerts
    - match:
        severity: critical
        component: infrastructure
      receiver: 'critical-alerts'
      group_wait: 10s
      repeat_interval: 1h

    # Critical pipeline alerts
    - match:
        severity: critical
        component: pipeline
      receiver: 'critical-alerts'
      group_wait: 10s

    # Critical AI service alerts
    - match:
        severity: critical
        component: ai-services
      receiver: 'critical-alerts'
      group_wait: 10s

    # Warning alerts by component
    - match:
        severity: warning
      receiver: 'warning-alerts'

    # Info alerts (logging only)
    - match:
        severity: info
      receiver: 'info-alerts'
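The routing tree can be exercised offline before deploying a change. A sketch using amtool (the label values are illustrative):
# Validate the whole configuration file
amtool check-config monitoring/alertmanager.yml

# Show which receiver a given label set routes to
amtool config routes test --config.file=monitoring/alertmanager.yml \
  severity=critical component=pipeline
# expected output: critical-alerts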
Inhibition Rules¶
Prevent alert cascades (monitoring/alertmanager.yml:181-215):
inhibit_rules:
  # If the backend is down, suppress all dependent pipeline alerts
  - source_match:
      alertname: 'BackendDown'
      severity: 'critical'
    target_match:
      component: 'pipeline'
    equal: ['instance']

  # If the GPU is unavailable, suppress AI service alerts
  - source_match:
      alertname: 'GPUUnavailable'
    target_match:
      component: 'ai-services'

  # If Redis is down, suppress cache-related alerts
  - source_match:
      alertname: 'RedisDown'
    target_match_re:
      alertname: 'Cache.*|Queue.*'

  # If Prometheus is down, suppress metric-based alerts
  - source_match:
      alertname: 'PrometheusDown'
    target_match_re:
      alertname: '.*Latency.*|.*Rate.*|.*Queue.*'

  # Critical severity inhibits warning for the same alertname
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
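An inhibition rule can be verified end to end by firing a synthetic source alert and confirming the target is suppressed. A sketch using amtool against a non-production instance (label values are illustrative):
# Fire the source alert
amtool alert add alertname=BackendDown severity=critical instance=backend-1 \
  --alertmanager.url=http://alertmanager:9093

# Fire a dependent pipeline alert on the same instance
amtool alert add alertname=DetectionQueueBacklog severity=warning \
  component=pipeline instance=backend-1 \
  --alertmanager.url=http://alertmanager:9093

# The pipeline alert should only appear when inhibited alerts are included
amtool alert query --inhibited --alertmanager.url=http://alertmanager:9093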
Alert Rule Definitions¶
Infrastructure Alerts¶
Service availability (monitoring/alerting-rules.yml:15-80):
| Alert | Condition | Severity | For |
|---|---|---|---|
BackendDown | up{job="hsi-backend-metrics"} == 0 | critical | 1m |
RedisDown | up{job="redis"} == 0 | critical | 1m |
PrometheusDown | up{job="prometheus"} == 0 | critical | 1m |
PostgresDown | up{job="postgres"} == 0 | critical | 1m |
JaegerDown | up{job="jaeger"} == 0 | warning | 5m |
Example rule:
- alert: BackendDown
  expr: up{job="hsi-backend-metrics"} == 0
  for: 1m
  labels:
    severity: critical
    component: infrastructure
  annotations:
    summary: 'Backend service is down'
    description: 'The HSI backend service has been unreachable for more than 1 minute'
    runbook_url: 'https://docs.hsi.local/runbooks/backend-down'
GPU Alerts¶
GPU health monitoring (monitoring/alerting-rules.yml:85-150):
| Alert | Condition | Severity | For |
|---|---|---|---|
GPUUnavailable | hsi_gpu_available != 1 | critical | 2m |
GPUTemperatureHigh | hsi_gpu_temperature > 85 | critical | 5m |
GPUTemperatureWarning | hsi_gpu_temperature > 75 | warning | 10m |
GPUMemoryHigh | hsi_gpu_memory_used_mb / hsi_gpu_memory_total_mb > 0.95 | warning | 5m |
GPUUtilizationLow | hsi_gpu_utilization < 10 | info | 30m |
Example rule:
- alert: GPUTemperatureHigh
  expr: hsi_gpu_temperature > 85
  for: 5m
  labels:
    severity: critical
    component: infrastructure
  annotations:
    summary: 'GPU temperature critical: {{ $value }}C'
    description: 'GPU temperature has exceeded 85C for 5 minutes. Thermal throttling imminent.'
Pipeline Alerts¶
Processing health (monitoring/alerting-rules.yml:155-280):
| Alert | Condition | Severity | For |
|---|---|---|---|
DetectionQueueBacklog | hsi_detection_queue_depth > 100 | warning | 5m |
DetectionQueueCritical | hsi_detection_queue_depth > 500 | critical | 2m |
AnalysisQueueBacklog | hsi_analysis_queue_depth > 50 | warning | 5m |
PipelineLatencyHigh | hsi_detect_latency_p95_ms > 60000 | warning | 10m |
PipelineErrorRateHigh | rate(hsi_pipeline_errors_total[5m]) > 0.1 | warning | 5m |
NoDetectionsProcessed | rate(hsi_detections_processed_total[15m]) == 0 | critical | 15m |
NoEventsCreated | rate(hsi_events_created_total[30m]) == 0 | warning | 30m |
Example rule:
- alert: DetectionQueueBacklog
  expr: hsi_detection_queue_depth > 100
  for: 5m
  labels:
    severity: warning
    component: pipeline
  annotations:
    summary: 'Detection queue backlog: {{ $value }} items'
    description: 'Detection queue has more than 100 items waiting for 5+ minutes'
AI Service Alerts¶
Model health (monitoring/alerting-rules.yml:285-400):
| Alert | Condition | Severity | For |
|---|---|---|---|
YOLO26LatencyHigh | P95 > 5s | warning | 10m |
NemotronLatencyHigh | P95 > 60s | warning | 10m |
EnrichmentErrorRate | Error rate > 5% | warning | 5m |
LLMContextOverflow | Truncation rate > 10% | warning | 15m |
AIServiceUnavailable | Service down | critical | 2m |
Example rule:
- alert: NemotronLatencyHigh
  expr: |
    histogram_quantile(0.95,
      rate(hsi_ai_request_duration_seconds_bucket{service="nemotron"}[5m])
    ) > 60
  for: 10m
  labels:
    severity: warning
    component: ai-services
  annotations:
    summary: 'Nemotron P95 latency high: {{ $value | humanizeDuration }}'
    description: 'LLM inference P95 latency exceeds 60s for 10+ minutes'
Database Alerts¶
PostgreSQL health (monitoring/alerting-rules.yml:405-500):
| Alert | Condition | Severity | For |
|---|---|---|---|
DatabaseConnectionsHigh | Active > 80% max | warning | 5m |
DatabaseConnectionsCritical | Active > 95% max | critical | 2m |
SlowQueriesHigh | rate > 1/min | warning | 10m |
DatabaseDiskUsageHigh | Usage > 80% | warning | 15m |
DatabaseDiskUsageCritical | Usage > 95% | critical | 5m |
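The conditions in this table are expressed against exporter gauges rather than hsi_ metrics. A sketch of the 80% connection rule, assuming the default postgres_exporter metric names (pg_stat_activity_count, pg_settings_max_connections), which are an assumption about this deployment:
- alert: DatabaseConnectionsHigh
  expr: |
    # both aggregations drop labels, so the division matches cleanly
    sum(pg_stat_activity_count) / max(pg_settings_max_connections) > 0.8
  for: 5m
  labels:
    severity: warning
    component: database
  annotations:
    summary: 'Database connections above 80% of max_connections'
    description: 'Active connections have exceeded 80% of max_connections for 5+ minutes'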
Cache Alerts¶
Redis health (monitoring/alerting-rules.yml:505-580):
| Alert | Condition | Severity | For |
|---|---|---|---|
CacheHitRateLow | Hit rate < 50% | warning | 15m |
CacheEvictionsHigh | Evictions > 1000/min | warning | 5m |
RedisMemoryHigh | Used > 80% max | warning | 10m |
RedisMemoryCritical | Used > 95% max | critical | 5m |
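The Redis memory rules follow the same pattern, assuming the default redis_exporter gauges (redis_memory_used_bytes, redis_memory_max_bytes):
- alert: RedisMemoryHigh
  expr: |
    # redis_memory_max_bytes is 0 when maxmemory is unset, which would make
    # the ratio +Inf; the second clause guards against that case
    (redis_memory_used_bytes / redis_memory_max_bytes > 0.8)
      and (redis_memory_max_bytes > 0)
  for: 10m
  labels:
    severity: warning
    component: cache
  annotations:
    summary: 'Redis memory above 80% of maxmemory'
    description: 'Redis has used more than 80% of maxmemory for 10+ minutes'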
SLO Burn Rate Alerts¶
Multi-window SLO monitoring (monitoring/alerting-rules.yml:585-700):
| Alert | Condition | Severity | Windows |
|---|---|---|---|
APIAvailabilityBurnRateFast | 14.4x burn rate | critical | 1h, 5m |
APIAvailabilityBurnRateMedium | 6x burn rate | critical | 6h, 30m |
APIAvailabilityBurnRateSlow | 3x burn rate | warning | 1d, 6h |
Example multi-window burn rate rule:
- alert: APIAvailabilityBurnRateFast
  expr: |
    (
      hsi:burn_rate:api_availability_1h > 14.4
    and
      hsi:burn_rate:api_availability_5m > 14.4
    )
  for: 2m
  labels:
    severity: critical
    component: slo
  annotations:
    summary: 'API availability SLO burn rate critical'
    description: 'Fast burn: consuming error budget at 14.4x. Will exhaust 30d budget in 2 days at current rate.'
Recording Rules for Alerts¶
Pre-computed metrics for alerting (monitoring/prometheus-rules.yml:119-169):
# Error budget calculations
- record: hsi:error_budget:api_availability_remaining
  expr: |
    1 - (
      (1 - hsi:api_availability:ratio_rate30d)
      /
      (1 - 0.995)
    )

# Burn rate calculations
- record: hsi:burn_rate:api_availability_1h
  expr: |
    (1 - hsi:api_availability:ratio_rate1h) / (1 - 0.995)
- record: hsi:burn_rate:api_availability_6h
  expr: |
    (1 - hsi:api_availability:ratio_rate6h) / (1 - 0.995)
- record: hsi:burn_rate:api_availability_1d
  expr: |
    (1 - hsi:api_availability:ratio_rate1d) / (1 - 0.995)
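To see where the thresholds come from: with a 99.5% SLO the error budget is 0.5%, so a burn rate of 14.4 corresponds to a short-window error rate of 14.4 x 0.5% = 7.2% (availability of roughly 92.8%), and the 30-day budget is exhausted in 30d / 14.4, about 2.1 days, which matches the "budget in 2 days" annotation above. (The 5m and 30m short-window rates used by the burn-rate alerts are presumably defined analogously elsewhere in the file.)
burn_rate       = (1 - availability) / (1 - SLO)
14.4            = (1 - 0.928)        / (1 - 0.995)
time_to_exhaust = 30d / 14.4         ~ 2.1 days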
Grafana Log-Based Alerts¶
Log pattern alerting (monitoring/grafana/provisioning/alerting/log-alerts.yml):
| Alert | LogQL | Severity |
|---|---|---|
HighErrorRate | ERROR/CRITICAL > 5% | warning |
CriticalLogSpike | CRITICAL count > 10/5m | critical |
NoLogsReceived | count == 0 for 10m | critical |
Example:
- alert: HighErrorRate
  expr: |
    sum(count_over_time({container=~"backend|ai-.*"} |~ "ERROR|CRITICAL" [5m]))
      / sum(count_over_time({container=~"backend|ai-.*"} [5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    source: logs
  annotations:
    summary: 'High error rate in logs'
    description: 'Error rate exceeds 5% for 5 minutes'
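When this alert fires, the underlying counts can be inspected directly with logcli (the Loki address here is an assumption about this deployment):
# Count ERROR/CRITICAL lines across backend and AI containers over 5m windows
logcli query --addr=http://loki:3100 \
  'sum(count_over_time({container=~"backend|ai-.*"} |~ "ERROR|CRITICAL" [5m]))'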
Alert Labels¶
Standard labels for routing and filtering:
| Label | Values | Purpose |
|---|---|---|
severity | critical, warning, info | Route selection, escalation |
component | infrastructure, pipeline, ai-services, database, cache, slo | Team routing |
instance | Service instance | Deduplication, inhibition |
source | prometheus, grafana, logs | Alert origin |
Alert Annotations¶
Standard annotations for context:
| Annotation | Purpose |
|---|---|
summary | Brief alert title |
description | Detailed explanation |
runbook_url | Link to remediation docs |
dashboard_url | Link to relevant dashboard |
value | Current metric value |
Testing Alerts¶
Verify alert rules:
# Check Prometheus rule syntax
promtool check rules monitoring/alerting-rules.yml
# Test rule evaluation
promtool test rules monitoring/alerting-rules-test.yml
# Query Prometheus for active alerts
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts'
# Check Alertmanager status
curl -s http://alertmanager:9093/api/v2/status | jq
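The test file referenced above is a promtool unit-test file. A minimal sketch covering BackendDown (the series values and timings are illustrative, not taken from the real test file):
# monitoring/alerting-rules-test.yml (sketch)
rule_files:
  - alerting-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # healthy for one sample, then down
      - series: 'up{job="hsi-backend-metrics"}'
        values: '1 0 0 0'
    alert_rule_test:
      # after the 1m for-duration the alert should be firing
      - eval_time: 3m
        alertname: BackendDown
        exp_alerts:
          - exp_labels:
              severity: critical
              component: infrastructure
              job: hsi-backend-metrics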
Silences¶
Create temporary silences for maintenance:
# Create silence via API
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "GPUTemperatureWarning", "isRegex": false}
    ],
    "startsAt": "2024-01-15T10:00:00Z",
    "endsAt": "2024-01-15T12:00:00Z",
    "createdBy": "admin",
    "comment": "GPU maintenance window"
  }'
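The same silence can be created with amtool, which fills in the timestamps for you (a sketch using standard amtool flags):
amtool silence add alertname=GPUTemperatureWarning \
  --duration=2h \
  --comment="GPU maintenance window" \
  --author=admin \
  --alertmanager.url=http://alertmanager:9093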
Related Documents¶
- Prometheus Metrics - Metric definitions used in alerts
- Grafana Dashboards - Alert visualization
- Structured Logging - Log-based alerting