# SLI/SLO Framework for Home Security Intelligence
This document defines the Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the Home Security Intelligence platform.
## Overview
The SLI/SLO framework provides quantifiable measures of service reliability and performance, enabling data-driven decisions about system changes and incident response.
## Service Level Objectives

### SLO 1: API Availability
| Attribute | Value |
|---|---|
| Target | 99.5% |
| Window | 30-day rolling |
| SLI | Ratio of successful HTTP responses (non-5xx) to total requests |
| Measurement | `sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
Error Budget: 0.5% = 3.6 hours/month of allowed unavailability
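As a sketch of how this SLI rolls up to the 30-day objective, the recording rule defined under Recording Rules below can be averaged over the window; the exact dashboard expressions may differ, so treat these queries as illustrative:

```promql
# 30-day availability vs. the 99.5% target (1 = meeting SLO, 0 = violating)
avg_over_time(hsi:api_availability:ratio_rate5m[30d]) >= bool 0.995

# Fraction of the 0.5% error budget remaining (1 = untouched, 0 = exhausted)
1 - (1 - avg_over_time(hsi:api_availability:ratio_rate5m[30d])) / (1 - 0.995)
```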
### SLO 2: Event Processing Latency
| Attribute | Value |
|---|---|
| Target | P95 < 5 seconds |
| Window | 30-day rolling |
| SLI | 95th percentile of event processing time |
| Measurement | `histogram_quantile(0.95, rate(hsi_event_processing_duration_seconds_bucket[5m]))` |
Error Budget: 5% of events may exceed 5s latency
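For the latency SLOs, budget consumption can also be read directly from the histogram as the fraction of fast events. The query below is a sketch that assumes a bucket boundary exists at `le="5"`; SLOs 3 and 4 follow the same pattern with `le="2"` and `le="30"`:

```promql
# Fraction of events processed within 5 s (complements the P95 view)
sum(rate(hsi_event_processing_duration_seconds_bucket{le="5"}[5m]))
  / sum(rate(hsi_event_processing_duration_seconds_count[5m]))
```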
### SLO 3: Detection Latency
| Attribute | Value |
|---|---|
| Target | P95 < 2 seconds |
| Window | 30-day rolling |
| SLI | 95th percentile of YOLO26 detection inference time |
| Measurement | `histogram_quantile(0.95, rate(hsi_detection_duration_seconds_bucket[5m]))` |
Error Budget: 5% of detections may exceed 2s latency
### SLO 4: Analysis Latency
| Attribute | Value |
|---|---|
| Target | P95 < 30 seconds |
| Window | 30-day rolling |
| SLI | 95th percentile of Nemotron LLM analysis time |
| Measurement | `histogram_quantile(0.95, rate(hsi_analysis_duration_seconds_bucket[5m]))` |
Error Budget: 5% of analyses may exceed 30s latency
### SLO 5: WebSocket Availability
| Attribute | Value |
|---|---|
| Target | 99% |
| Window | 30-day rolling |
| SLI | Ratio of successful WebSocket connections to total connection attempts |
| Measurement | `sum(rate(hsi_websocket_connections_successful[5m])) / sum(rate(hsi_websocket_connection_attempts[5m]))` |
Error Budget: 1% = 7.2 hours/month of allowed unavailability
## Error Budget Policy

### Error Budget Consumption Flowchart
```mermaid
flowchart TD
    Start([Check Error Budget]) --> Calculate[Calculate budget consumed<br/>for 30-day window]
    Calculate --> Check{Budget Consumed?}
    Check -->|"< 50%"| Green[Normal Operations]
    Check -->|"50-75%"| Yellow[Caution Zone]
    Check -->|"75-90%"| Orange[Feature Freeze]
    Check -->|"> 90%"| Red[Emergency Response]
    Green --> GreenActions["Continue feature development<br/>Normal release cadence<br/>Standard monitoring"]
    Yellow --> YellowActions["Increase monitoring frequency<br/>Delay risky changes<br/>Review recent deployments"]
    Orange --> OrangeActions["Halt new features<br/>Focus on reliability<br/>Root cause analysis required"]
    Red --> RedActions["All hands on reliability<br/>Incident response mode<br/>Rollback consideration"]
    GreenActions --> Monitor[Continue Monitoring]
    YellowActions --> Monitor
    OrangeActions --> Monitor
    RedActions --> Monitor
    Monitor --> Start
    style Green fill:#c8e6c9,stroke:#2e7d32
    style Yellow fill:#fff9c4,stroke:#f9a825
    style Orange fill:#ffe0b2,stroke:#ef6c00
    style Red fill:#ffcdd2,stroke:#c62828
    style GreenActions fill:#e8f5e9
    style YellowActions fill:#fffde7
    style OrangeActions fill:#fff3e0
    style RedActions fill:#ffebee
```

### Consumption Thresholds
| Threshold | Action |
|---|---|
| < 50% | Normal operations, feature development continues |
| 50-75% | Increased monitoring, caution with risky changes |
| 75-90% | Feature freeze, focus on reliability improvements |
| > 90% | Emergency response, all hands on reliability |
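Budget consumption itself is a single query per SLO. A sketch for API availability, using the recording rule defined under Recording Rules below; the result maps directly onto the thresholds above:

```promql
# Fraction of the 30-day error budget consumed (0.5 = 50%, 1.0 = exhausted)
(1 - avg_over_time(hsi:api_availability:ratio_rate5m[30d])) / (1 - 0.995)
```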
## Burn Rate Alerting
We use multi-window burn rate alerting to detect SLO violations early:
| Window | Burn Rate | Alert Severity | Time to Exhaust Budget |
|---|---|---|---|
| 1h | 14.4x | Critical | ~2 days (50 hours) |
| 6h | 6x | Critical | 5 days |
| 1d | 3x | Warning | 10 days |
| 3d | 1x | Info | 30 days |

Time to exhaust = 30-day window / burn rate (for example, 720 h / 14.4 ≈ 50 h).
### Burn Rate Alerting Windows Visualization
```mermaid
flowchart LR
    subgraph "Multi-Window Burn Rate Detection"
        direction TB
        subgraph "1h Window"
            W1[1 hour] --> B1["14.4x burn rate"]
            B1 --> A1["CRITICAL<br/>~2d to exhaust"]
        end
        subgraph "6h Window"
            W2[6 hours] --> B2["6x burn rate"]
            B2 --> A2["CRITICAL<br/>5d to exhaust"]
        end
        subgraph "1d Window"
            W3[1 day] --> B3["3x burn rate"]
            B3 --> A3["WARNING<br/>10d to exhaust"]
        end
        subgraph "3d Window"
            W4[3 days] --> B4["1x burn rate"]
            B4 --> A4["INFO<br/>30d to exhaust"]
        end
    end
    subgraph "Alert Logic"
        A1 --> Page["Page on-call immediately"]
        A2 --> Page
        A3 --> Notify["Notify team, investigate"]
        A4 --> Log["Log for review"]
    end
    style A1 fill:#ffcdd2,stroke:#c62828
    style A2 fill:#ffcdd2,stroke:#c62828
    style A3 fill:#fff9c4,stroke:#f9a825
    style A4 fill:#e3f2fd,stroke:#1976d2
    style Page fill:#ffebee
    style Notify fill:#fffde7
    style Log fill:#e3f2fd
```

How it works:
- Short windows (1h, 6h) detect rapid budget consumption requiring immediate action
- Long windows (1d, 3d) detect gradual degradation for proactive investigation
- An alert fires only when the burn rate is exceeded in both a long window and a paired short window (for example, 1h paired with 5m), which prevents false positives from brief spikes
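A minimal sketch of the fastest-burning alert for SLO 1, pairing the 1h window with a 5m short window as in the standard multi-window pattern; the rule name and the 5m pairing are assumptions, not excerpts from our rule files:

```yaml
- alert: HSIAPIFastBurn
  # Error ratio over both windows must exceed burn rate x budget (14.4 * 0.5%)
  expr: |
    (1 - avg_over_time(hsi:api_availability:ratio_rate5m[1h])) > (14.4 * 0.005)
    and
    (1 - avg_over_time(hsi:api_availability:ratio_rate5m[5m])) > (14.4 * 0.005)
  labels:
    severity: critical
  annotations:
    summary: "API error budget burning at >=14.4x (~2% of the 30-day budget per hour)"
```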
## Recording Rules
Pre-computed metrics for efficient dashboard queries:
```yaml
# SLI Recording Rules (prometheus-rules.yml)
groups:
  - name: hsi_sli_recording  # rule files require a named group; exact name may differ
    rules:
      - record: hsi:api_availability:ratio_rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - record: hsi:detection_latency:p95_5m
        expr: histogram_quantile(0.95, rate(hsi_detection_duration_seconds_bucket[5m]))
      - record: hsi:analysis_latency:p95_5m
        expr: histogram_quantile(0.95, rate(hsi_analysis_duration_seconds_bucket[5m]))
```
## Alert Rules

### Critical Alerts
| Alert Name | Condition | For |
|---|---|---|
| HSIPipelineDown | All backend replicas unavailable | 1m |
| HSIDatabaseUnhealthy | PostgreSQL connection failures | 2m |
| HSIRedisUnhealthy | Redis connection failures | 2m |
| HSIGPUMemoryHigh | GPU memory > 90% | 5m |
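As a hedged sketch, the first of these could be expressed against the standard `up` metric; the job label is an assumption about the scrape configuration:

```yaml
- alert: HSIPipelineDown
  # absent() covers the case where no hsi-backend series exists at all
  expr: (sum(up{job="hsi-backend"}) == 0) or absent(up{job="hsi-backend"})
  for: 1m
  labels:
    severity: critical
```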
### Warning Alerts
| Alert Name | Condition | For |
|---|---|---|
| HSIDetectionQueueHigh | Detection queue > 100 items | 5m |
| HSIAnalysisQueueHigh | Analysis queue > 50 items | 5m |
| HSIHighErrorRate | Error rate > 5% | 5m |
| HSISlowDetection | P95 detection latency > 2s | 10m |
| HSISlowAnalysis | P95 analysis latency > 30s | 10m |
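The latency alerts can be built directly on the recording rules above; a sketch for HSISlowDetection, with the threshold and hold time taken from the table:

```yaml
- alert: HSISlowDetection
  expr: hsi:detection_latency:p95_5m > 2  # P95 above the 2 s SLO target
  for: 10m
  labels:
    severity: warning
```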
## Dashboard

The SLO dashboard (`monitoring/grafana/dashboards/slo.json`) provides:
- SLO Compliance Gauges - Current compliance for each SLO
- Error Budget Remaining - Time-based visualization of remaining budget
- Burn Rate Trends - Multi-window burn rate graphs
- Historical SLI Trends - 30-day rolling SLI values
## Implementation Notes

### Metric Sources
- API metrics: FastAPI middleware via Prometheus client
- Detection metrics: YOLO26 service instrumentation
- Analysis metrics: Nemotron service instrumentation
- WebSocket metrics: WebSocket handler instrumentation
- Infrastructure metrics: Redis exporter, PostgreSQL exporter
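A minimal sketch of how these sources might appear in the scrape configuration; job names and targets are assumptions (the exporter ports shown are those exporters' defaults):

```yaml
scrape_configs:
  - job_name: hsi-backend          # FastAPI /metrics via the Prometheus client
    static_configs:
      - targets: ["backend:8000"]
  - job_name: redis-exporter
    static_configs:
      - targets: ["redis-exporter:9121"]
  - job_name: postgres-exporter
    static_configs:
      - targets: ["postgres-exporter:9187"]
```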
### Data Retention
- Raw metrics: 15 days
- Recording rules (aggregated): 90 days
- Dashboard snapshots: 365 days
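The raw-metric tier maps to a single Prometheus flag; the sketch below shows illustrative launch arguments. The longer tiers imply a separate long-term store for recorded series and Grafana-side snapshot retention, which this flag does not cover:

```yaml
# Sketch: Prometheus launch flags, e.g. in a docker-compose service definition
command:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=15d   # raw metrics kept for 15 days
```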
## Related Documentation
- Prometheus Rules
- Alerting Rules
- Alertmanager Configuration
- SLO Dashboard (see dashboards directory)