# Prometheus Alerting

Configure alerts for AI pipeline failures, infrastructure issues, and SLO violations.

This guide covers the alerting rules and Alertmanager configuration for Home Security Intelligence. The monitoring stack is optional and enabled with `--profile monitoring`.
## Overview

The alerting system consists of three components:

- **Prometheus** - evaluates alerting rules against collected metrics
- **Alertmanager** - routes, groups, and delivers alerts to receivers
- **Backend webhook** - receives alerts for in-app notification and logging
## Alert Severity Levels

| Severity | Description | Response Time | Examples |
|---|---|---|---|
| `critical` | System down, data loss imminent | Immediate | AI detector unavailable, GPU overheating |
| `warning` | Performance degradation, approaching limits | Within hours | High error rate, queue backlog |
| `info` | Informational, worth monitoring | Best effort | Prometheus target down |
## Quick Start

### Enable Monitoring Stack

```bash
# Start with the monitoring profile
docker compose --profile monitoring -f docker-compose.prod.yml up -d

# Verify services are running
docker compose -f docker-compose.prod.yml ps | grep -E "(prometheus|alertmanager|grafana)"
```
### Access Points
| Service | URL | Purpose |
|---|---|---|
| Prometheus | http://localhost:9090 | Metrics and alert status |
| Alertmanager | http://localhost:9093 | Alert routing and silencing |
| Grafana | http://localhost:3002 | Dashboards and visualization |
### View Active Alerts

```bash
# Prometheus alerts
curl http://localhost:9090/api/v1/alerts | jq

# Alertmanager alerts
curl http://localhost:9093/api/v2/alerts | jq
```
## Pre-Configured Alerts

### AI Pipeline Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `AIDetectorUnavailable` | YOLO26 health check fails | 2 min | critical |
| `AIBackendDown` | Backend API unreachable | 1 min | critical |
| `AINemotronTimeout` | P95 inference > 120s | 5 min | warning |
| `AIDetectorSlow` | P95 detection > 5s | 5 min | warning |
| `AIHighErrorRate` | Error rate > 10% | 5 min | warning |
| `AIPipelineErrorSpike` | > 50 errors in 5 min | 2 min | warning |
Example alert definition:

```yaml
- alert: AIDetectorUnavailable
  expr: hsi_ai_healthy == 0
  for: 2m
  labels:
    severity: critical
    component: ai
    service: yolo26
  annotations:
    summary: 'AI detector service is unavailable'
    description: 'YOLO26 object detection service has been unhealthy for > 2 minutes.'
    runbook_url: 'https://github.com/.../wiki/Runbooks#aidetectorunavailable'
```
### GPU Resource Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `AIGPUOverheating` | Temperature > 85°C | 2 min | critical |
| `AIGPUTemperatureWarning` | Temperature > 75°C | 5 min | warning |
| `AIGPUMemoryCritical` | VRAM usage > 95% | 2 min | critical |
| `AIGPUMemoryWarning` | VRAM usage > 85% | 5 min | warning |
GPU memory pressure formula:
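The exact expression lives in the rule file; a plausible sketch, assuming the exporter publishes `hsi_gpu_memory_used_bytes` and `hsi_gpu_memory_total_bytes` gauges (hypothetical metric names):

```promql
# VRAM usage percentage; > 85 triggers the warning alert, > 95 the critical one
100 * hsi_gpu_memory_used_bytes / hsi_gpu_memory_total_bytes
```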
### Queue Depth Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `AIDetectionQueueBacklog` | Queue depth > 100 | 5 min | warning |
| `AIAnalysisQueueBacklog` | Queue depth > 50 | 5 min | warning |

A queue backlog indicates that processing cannot keep up with incoming images.
### Infrastructure Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `DatabaseUnhealthy` | PostgreSQL health check fails | 2 min | critical |
| `RedisUnhealthy` | Redis health check fails | 2 min | critical |
| `PrometheusTargetDown` | Scrape target unreachable | 5 min | warning |
### System Health Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `AISystemDegraded` | System health = 0.5 | 5 min | warning |
| `AISystemUnhealthy` | System health = 0 | 2 min | critical |
### Prometheus Self-Monitoring Alerts

Alerts that watch Prometheus itself, ensuring the observability infrastructure stays healthy.

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| `PrometheusNotScrapingSelf` | Self-scrape target down | 2 min | critical |
| `PrometheusConfigReloadFailed` | Config reload unsuccessful | 5 min | critical |
| `PrometheusRuleEvaluationFailures` | Rule evaluation errors | 5 min | warning |
| `PrometheusRuleEvaluationSlow` | Rule evaluation takes longer than the interval | 10 min | warning |
| `PrometheusScrapeFailuresHigh` | Scrape sync failures > 10% | 5 min | critical |
| `PrometheusTargetsUnhealthy` | > 20% of targets down | 5 min | warning |
| `PrometheusNotificationQueueFull` | Notification queue > 90% capacity | 5 min | warning |
| `PrometheusNotificationsFailing` | > 5 notification failures in 5 min | 5 min | critical |
| `PrometheusTSDBCompactionsFailing` | TSDB compaction failures | 5 min | warning |
| `PrometheusTSDBHeadTruncationsFailing` | TSDB head truncation failures | 5 min | critical |
| `PrometheusTSDBWALCorruptions` | WAL corruptions detected | 1 min | warning |
| `PrometheusStorageFillingUp` | TSDB storage > 80% full | 15 min | warning |
| `PrometheusQueryLoadHigh` | Avg query duration > 10s | 10 min | warning |
| `PrometheusRestarted` | Instance restarted | 0 min | info |
| `PrometheusAlertmanagerDown` | No Alertmanager discovered | 5 min | warning |
| `PrometheusSamplesRejected` | Out-of-order or duplicate samples | 10 min | warning |
Example self-monitoring alert:

```yaml
- alert: PrometheusConfigReloadFailed
  expr: prometheus_config_last_reload_successful == 0
  for: 5m
  labels:
    severity: critical
    component: monitoring
    service: prometheus
  annotations:
    summary: 'Prometheus configuration reload failed'
    description: 'Configuration reload has been failing for > 5 minutes. New rules are not being applied.'
```
## Alert Routing (Alertmanager)

### Default Configuration

Alerts are routed based on `severity` and `component` labels:

```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'component', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts - immediate
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts - batched
    - match:
        severity: warning
      receiver: 'warning-receiver'
      group_wait: 2m
      repeat_interval: 6h
```
### Alert Grouping

Alerts are grouped to reduce notification noise:

| Parameter | Value | Purpose |
|---|---|---|
| `group_by` | `[alertname, component, severity]` | Group similar alerts together |
| `group_wait` | 30s | Wait before sending the first notification for a new group |
| `group_interval` | 5m | Wait before notifying about new alerts added to a group |
| `repeat_interval` | 4h | Wait before resending an unresolved notification |
### Inhibition Rules

Higher-severity alerts suppress related lower-severity alerts:

| Source Alert | Suppresses |
|---|---|
| `HSIPipelineDown` | All `HSI*` alerts |
| `HSIDatabaseUnhealthy` | Queue and latency alerts |
| `HSIRedisUnhealthy` | Queue alerts |
| `HSIGPUMemoryHigh` | `HSIGPUMemoryElevated` |
| `HSICriticalErrorRate` | `HSIHighErrorRate` |
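These mappings correspond to `inhibit_rules` entries in `monitoring/alertmanager.yml`. As a sketch (the `component` label values here are assumptions), the database rule might look like:

```yaml
inhibit_rules:
  # While the database is down, suppress dependent queue and latency alerts
  - source_matchers:
      - alertname = HSIDatabaseUnhealthy
    target_matchers:
      - component =~ "queue|latency"
    equal: ['instance']
```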
## Configuring Notification Channels

### Webhook (Default)

All alerts are sent to the backend webhook for in-app notification:

```yaml
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://backend:8000/api/webhooks/alerts'
        send_resolved: true
```
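For context, the backend endpoint receives Alertmanager's standard webhook payload (format version 4); an abridged example:

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "default-receiver",
  "groupLabels": { "alertname": "AIDetectorUnavailable" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "AIDetectorUnavailable", "severity": "critical" },
      "annotations": { "summary": "AI detector service is unavailable" },
      "startsAt": "2025-01-09T00:00:00Z"
    }
  ]
}
```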
### Slack Integration

Uncomment and configure in `monitoring/alertmanager.yml`:

```yaml
receivers:
  - name: 'critical-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#hsi-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
```
Required setup:

- Create a Slack webhook: https://api.slack.com/messaging/webhooks
- Set `slack_api_url` in the Alertmanager global config, or `api_url` per receiver
- Configure the channel and message format
### Email Integration

Uncomment and configure SMTP settings:

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@hsi.local'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password' # pragma: allowlist secret

receivers:
  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
```
### PagerDuty Integration

For on-call rotation and escalation:

```yaml
receivers:
  - name: 'critical-receiver'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY' # pragma: allowlist secret
        severity: critical
```
## Custom Alert Configuration

### Adding Custom Alerts

Edit `monitoring/prometheus_rules.yml`:

```yaml
groups:
  - name: custom_alerts
    interval: 15s
    rules:
      - alert: HighDetectionLatency
        expr: |
          histogram_quantile(0.95,
            rate(hsi_detection_duration_seconds_bucket[5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
          component: ai
          service: pipeline
        annotations:
          summary: 'Detection latency is high'
          description: 'P95 detection latency exceeded 10 seconds for 5 minutes.'
```
### Validating Rules

Use `promtool` to validate rules before deployment:

```bash
# Validate rule file syntax
docker compose exec prometheus promtool check rules /etc/prometheus/prometheus_rules.yml

# Test rule expressions
docker compose exec prometheus promtool test rules /etc/prometheus/test_rules.yml
```
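A unit-test file for `promtool test rules` pairs synthetic input series with expected alerts. A minimal sketch for the `AIDetectorUnavailable` rule (the repository's actual `test_rules.yml` may differ):

```yaml
rule_files:
  - prometheus_rules.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # Healthy for 2m (8 samples at 15s), then unhealthy
      - series: 'hsi_ai_healthy'
        values: '1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # Condition true from 2m; with `for: 2m` the alert fires at 4m, so it is firing at 6m
      - eval_time: 6m
        alertname: AIDetectorUnavailable
        exp_alerts:
          - exp_labels:
              severity: critical
              component: ai
              service: yolo26
```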
### Reloading Configuration

After editing rules or Alertmanager config:

```bash
# Reload Prometheus rules (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Reload Alertmanager config
curl -X POST http://localhost:9093/-/reload
```
## SLI/SLO Recording Rules

Pre-computed Service Level Indicators for dashboard efficiency:

### API Availability

```promql
# Success rate (non-5xx responses)
hsi:api_requests:success_rate_5m

# Availability ratios
hsi:api_availability:ratio_rate1h
hsi:api_availability:ratio_rate6h
hsi:api_availability:ratio_rate1d
hsi:api_availability:ratio_rate30d
```

### Detection Latency

```promql
# P95 and P99 latency
hsi:detection_latency:p95_5m
hsi:detection_latency:p99_5m

# Within SLO (< 2s)
hsi:detection_latency:within_slo_rate5m
```

### Analysis Latency

```promql
# P95 and P99 latency
hsi:analysis_latency:p95_5m
hsi:analysis_latency:p99_5m

# Within SLO (< 30s)
hsi:analysis_latency:within_slo_rate5m
```

### Error Budget

```promql
# Remaining error budget (target 99.5% availability)
hsi:error_budget:api_availability_remaining

# Burn rates
hsi:burn_rate:api_availability_1h
hsi:burn_rate:api_availability_6h
```
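As an illustration, a recording rule such as `hsi:detection_latency:p95_5m` would be defined along these lines (a sketch; the actual definitions live in `monitoring/prometheus-rules.yml`):

```yaml
groups:
  - name: sli_recording_rules
    interval: 15s
    rules:
      # Pre-compute the P95 so dashboards query a cheap gauge instead of the raw histogram
      - record: hsi:detection_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            rate(hsi_detection_duration_seconds_bucket[5m])
          )
```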
## Alert Silencing

### Temporary Silence via UI

- Open the Alertmanager UI: http://localhost:9093
- Click the "Silences" tab
- Click "New Silence"
- Configure matchers (e.g., `alertname=AIDetectorSlow`)
- Set a duration and comment
### Silence via API

```bash
# Create a 2-hour silence for detector slow alerts
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "AIDetectorSlow", "isRegex": false}],
    "startsAt": "2025-01-09T00:00:00Z",
    "endsAt": "2025-01-09T02:00:00Z",
    "createdBy": "operator",
    "comment": "Planned maintenance"
  }'
```
### List Active Silences

Query the Alertmanager API: `curl http://localhost:9093/api/v2/silences | jq`
## Runbooks

Each alert includes a `runbook_url` annotation linking to resolution steps. Create runbooks in your wiki:

### Example Runbook: AIDetectorUnavailable

Symptoms:

- YOLO26 health check fails
- No new detections are processed

Diagnosis:

```bash
# Check container status
docker compose -f docker-compose.prod.yml ps ai-yolo26

# Check container logs
docker compose -f docker-compose.prod.yml logs --tail=100 ai-yolo26

# Check GPU availability
nvidia-smi
```

Resolution:

- Container crashed: restart it, e.g. `docker compose -f docker-compose.prod.yml restart ai-yolo26`
- GPU OOM: check GPU memory usage and reduce concurrent inferences
- Model loading failure: check the model path and permissions
## Troubleshooting

### Alerts Not Firing

- Check if the rule is loaded, e.g. `curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'`
- Verify the metric exists, e.g. query `hsi_ai_healthy` at http://localhost:9090/graph
- Test the expression manually, e.g. `curl -s 'http://localhost:9090/api/v1/query?query=hsi_ai_healthy==0' | jq`

### Alerts Not Being Delivered

- Check Alertmanager status: `curl -s http://localhost:9093/api/v2/status | jq`
- View pending alerts: `curl -s http://localhost:9093/api/v2/alerts | jq`
- Check the receiver configuration (the `receivers:` section in `monitoring/alertmanager.yml`)

### Too Many Alerts (Alert Fatigue)

- Increase the `for` duration to filter transient issues
- Adjust thresholds to reduce false positives
- Use inhibition rules to suppress related alerts
- Increase `group_interval` and `repeat_interval`

### Missing Metrics

- Check scrape targets: `curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'`
- Verify the backend metrics endpoint, e.g. `curl -s http://localhost:8000/metrics | head` (adjust the port to your deployment)
## Configuration Files

| File | Purpose |
|---|---|
| `monitoring/prometheus.yml` | Main Prometheus configuration |
| `monitoring/prometheus_rules.yml` | Alerting rules |
| `monitoring/prometheus-rules.yml` | SLI/SLO recording rules |
| `monitoring/alertmanager.yml` | Alert routing and receivers |
| `monitoring/alerting-rules.yml` | Additional alerting rules |
## See Also
- Monitoring and Observability - GPU monitoring, token tracking, tracing
- SLO Definitions - Service Level Objectives
- Troubleshooting Index - Common issues
- Prometheus Documentation
- Alertmanager Documentation