
Prometheus Metrics

Custom Prometheus metrics for pipeline monitoring, AI service performance, and business intelligence.

Key Files:

  • backend/core/metrics.py:1-3240 - All metric definitions and helpers
  • backend/core/sanitization.py - Label sanitization for cardinality control
  • monitoring/prometheus.yml:1-410 - Prometheus scrape configuration
  • monitoring/prometheus-rules.yml:1-169 - Recording rules for SLIs

Overview

The system exposes Prometheus metrics via the /api/metrics endpoint. Metrics cover the complete AI pipeline from image detection through LLM analysis, including queue depths, latencies, error rates, and business metrics like events by risk level.
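The handler itself is a thin wrapper around the client library's text exposition. A minimal sketch, assuming FastAPI and prometheus_client (the real handler lives at backend/core/metrics.py:1724-1730 and may differ):

from fastapi import APIRouter, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

router = APIRouter()

@router.get("/api/metrics")
def metrics() -> Response:
    # Render every metric in the default registry in Prometheus text format
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)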

All metrics use the hsi_ prefix (home security intelligence) and follow Prometheus naming conventions: counters end with _total, histograms measuring duration end with _seconds, and gauges use descriptive names without a suffix.

Label cardinality is controlled through sanitization functions that validate values against allowlists before recording. This prevents unbounded metric growth from unexpected values.
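A minimal sketch of such a sanitizer, using the error-type allowlist documented under Pipeline Error Counters below (the actual functions in backend/core/sanitization.py may be structured differently):

# Unknown values collapse to a fixed fallback so each label set stays bounded
ALLOWED_ERROR_TYPES = frozenset({
    "connection_error", "timeout_error", "validation_error",
    "rate_limit_error", "unknown_error",
})

def sanitize_error_type(value: str) -> str:
    return value if value in ALLOWED_ERROR_TYPES else "unknown_error"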

Architecture

graph TD
    subgraph "Application"
        SVC[Services] --> MS[MetricsService<br/>metrics.py:696-1244]
        MS --> COUNTER[Counters<br/>metrics.py:259-365]
        MS --> HIST[Histograms<br/>metrics.py:247-295]
        MS --> GAUGE[Gauges<br/>metrics.py:107-224]
    end

    subgraph "Exposition"
        COUNTER --> REG[Prometheus Registry]
        HIST --> REG
        GAUGE --> REG
        REG --> EP[/api/metrics<br/>metrics.py:1724-1730]
    end

    subgraph "Collection"
        EP --> PROM[Prometheus Server]
        PROM --> RULES[Recording Rules<br/>prometheus-rules.yml]
    end

    subgraph "Visualization"
        PROM --> GRAF[Grafana Dashboards]
        RULES --> GRAF
    end

Metric Categories

Queue Depth Gauges

Monitor pipeline backpressure (backend/core/metrics.py:107-117):

| Metric | Type | Description |
| --- | --- | --- |
| hsi_detection_queue_depth | Gauge | Images waiting for detection |
| hsi_analysis_queue_depth | Gauge | Batches waiting for LLM analysis |

PromQL Examples:

# Current detection queue depth
hsi_detection_queue_depth

# Queue backed up (>100 items; an alert rule would sustain this with a 5-minute "for" clause)
hsi_detection_queue_depth > 100
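
On the recording side these are plain gauges, so producers set them to the current queue size rather than incrementing. A sketch with assumed wiring:

from prometheus_client import Gauge

DETECTION_QUEUE_DEPTH = Gauge(
    "hsi_detection_queue_depth", "Images waiting for detection"
)

def update_queue_depth(queue) -> None:
    # Gauges report a point-in-time value; set() replaces the previous reading
    DETECTION_QUEUE_DEPTH.set(queue.qsize())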

Stage Duration Histograms

Track pipeline latency (backend/core/metrics.py:232-253):

| Metric | Labels | Buckets |
| --- | --- | --- |
| hsi_stage_duration_seconds | stage | 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0 |

Stage values: detect, batch, analyze

PromQL Examples:

# P95 detection latency over 5 minutes
histogram_quantile(0.95, sum(rate(hsi_stage_duration_seconds_bucket{stage="detect"}[5m])) by (le))

# Average analysis time
rate(hsi_stage_duration_seconds_sum{stage="analyze"}[5m]) / rate(hsi_stage_duration_seconds_count{stage="analyze"}[5m])
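
Stages are timed at the call site and recorded through the MetricsService helper shown later on this page; timed_stage here is a hypothetical wrapper:

import time
from contextlib import contextmanager

@contextmanager
def timed_stage(metrics, stage: str):
    # Record wall-clock duration into hsi_stage_duration_seconds{stage=...}
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics.observe_stage_duration(stage, time.perf_counter() - start)

# Usage:
#   with timed_stage(get_metrics_service(), "detect"):
#       run_detection(image)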

AI Service Request Duration

Track external AI service latency (backend/core/metrics.py:276-295):

| Metric | Labels | Buckets |
| --- | --- | --- |
| hsi_ai_request_duration_seconds | service | 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0 |

Service values: yolo26, nemotron, florence, clip, enrichment

PromQL Examples:

# P99 Nemotron request latency
histogram_quantile(0.99, sum(rate(hsi_ai_request_duration_seconds_bucket{service="nemotron"}[5m])) by (le))

# Average YOLO26 inference time
rate(hsi_ai_request_duration_seconds_sum{service="yolo26"}[5m]) / rate(hsi_ai_request_duration_seconds_count{service="yolo26"}[5m])

Event and Detection Counters

Track throughput (backend/core/metrics.py:259-269):

| Metric | Type | Description |
| --- | --- | --- |
| hsi_events_created_total | Counter | Security events created |
| hsi_detections_processed_total | Counter | Detections through YOLO26 |

PromQL Examples:

# Events per minute
rate(hsi_events_created_total[1m]) * 60

# Detections per second
rate(hsi_detections_processed_total[5m])

Detection Class Distribution

Track what objects are detected (backend/core/metrics.py:313-318):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_detections_by_class_total | object_class | Detections by COCO class |

Object classes are sanitized to COCO vocabulary (backend/core/sanitization.py).

PromQL Examples:

# Top 5 detected classes
topk(5, sum by (object_class) (rate(hsi_detections_by_class_total[1h])))

# Person detections per minute
rate(hsi_detections_by_class_total{object_class="person"}[1m]) * 60

Detection Confidence Histogram

Track model confidence distribution (backend/core/metrics.py:322-329):

| Metric | Buckets |
| --- | --- |
| hsi_detection_confidence | 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99 |

PromQL Examples:

# Median confidence score
histogram_quantile(0.5, sum by (le) (rate(hsi_detection_confidence_bucket[5m])))

# Percentage of detections with confidence above 0.9
100 * (1 - sum(rate(hsi_detection_confidence_bucket{le="0.9"}[5m])) / sum(rate(hsi_detection_confidence_count[5m])))

Risk Score Distribution

Track LLM-assigned risk scores (backend/core/metrics.py:343-357):

| Metric | Labels | Buckets/Description |
| --- | --- | --- |
| hsi_risk_score | - | 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 |
| hsi_events_by_risk_level_total | level | Counter by risk level |

Level values: low, medium, high, critical

PromQL Examples:

# Median risk score
histogram_quantile(0.5, sum by (le) (rate(hsi_risk_score_bucket[1h])))

# Critical events per hour
increase(hsi_events_by_risk_level_total{level="critical"}[1h])

# High-risk event rate
rate(hsi_events_by_risk_level_total{level=~"high|critical"}[5m])

LLM Token Metrics

Track Nemotron usage (backend/core/metrics.py:599-624):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_nemotron_tokens_input_total | camera_id | Input tokens sent |
| hsi_nemotron_tokens_output_total | camera_id | Output tokens received |
| hsi_nemotron_tokens_per_second | - | Current throughput gauge |
| hsi_nemotron_token_cost_usd_total | camera_id | Estimated cost |

PromQL Examples:

# Tokens per second (current)
hsi_nemotron_tokens_per_second

# Total tokens in last hour
increase(hsi_nemotron_tokens_input_total[1h]) + increase(hsi_nemotron_tokens_output_total[1h])

# Daily estimated cost
sum(increase(hsi_nemotron_token_cost_usd_total[24h]))
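
Recording reduces to incrementing labeled counters by the token counts returned with each response. A sketch with assumed metric object names (prometheus_client appends the _total suffix to counter sample names):

from prometheus_client import Counter

NEMOTRON_TOKENS_INPUT = Counter(
    "hsi_nemotron_tokens_input", "Input tokens sent", ["camera_id"]
)
NEMOTRON_TOKEN_COST_USD = Counter(
    "hsi_nemotron_token_cost_usd", "Estimated cost", ["camera_id"]
)

def record_llm_usage(camera_id: str, input_tokens: int, cost_usd: float) -> None:
    # Counters accept arbitrary non-negative increments, not just +1
    NEMOTRON_TOKENS_INPUT.labels(camera_id=camera_id).inc(input_tokens)
    NEMOTRON_TOKEN_COST_USD.labels(camera_id=camera_id).inc(cost_usd)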

LLM Context Utilization

Track context window usage (backend/core/metrics.py:373-402):

| Metric | Labels | Buckets/Description |
| --- | --- | --- |
| hsi_llm_context_utilization | - | 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0 |
| hsi_llm_context_utilization_ratio | model | Current utilization gauge |
| hsi_prompts_truncated_total | - | Prompts requiring truncation |
| hsi_prompts_high_utilization_total | - | Prompts exceeding warning threshold |

PromQL Examples:

# P95 context utilization
histogram_quantile(0.95, sum by (le) (rate(hsi_llm_context_utilization_bucket[5m])))

# Truncation rate per minute
rate(hsi_prompts_truncated_total[5m]) * 60

Cache Metrics

Track Redis cache effectiveness (backend/core/metrics.py:537-594):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_cache_hits_total | cache_type | Cache hits |
| hsi_cache_misses_total | cache_type | Cache misses |
| hsi_cache_invalidations_total | cache_type, reason | Cache invalidations |
| hsi_cache_stale_hits_total | cache_type | Stale-while-revalidate hits |
| hsi_cache_background_refresh_total | cache_type, status | Background refreshes |

Cache types: event_stats, cameras, system, dashboard_stats

PromQL Examples:

# Cache hit ratio
sum(rate(hsi_cache_hits_total[5m])) / (sum(rate(hsi_cache_hits_total[5m])) + sum(rate(hsi_cache_misses_total[5m])))

# Cache invalidations by reason
sum by (reason) (rate(hsi_cache_invalidations_total[1h]))
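
A sketch of how a hit or miss might be recorded around a Redis lookup (helper and metric objects are illustrative, not the actual implementation):

import redis
from prometheus_client import Counter

CACHE_HITS = Counter("hsi_cache_hits", "Cache hits", ["cache_type"])
CACHE_MISSES = Counter("hsi_cache_misses", "Cache misses", ["cache_type"])

def cached_get(client: redis.Redis, cache_type: str, key: str):
    # Count the outcome under the appropriate cache_type label
    value = client.get(key)
    counter = CACHE_HITS if value is not None else CACHE_MISSES
    counter.labels(cache_type=cache_type).inc()
    return value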

Pipeline Error Counters

Track errors by type (backend/core/metrics.py:301-306):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_pipeline_errors_total | error_type | Pipeline errors by type |

Error types are sanitized to an allowlist including: connection_error, timeout_error, validation_error, rate_limit_error, unknown_error

PromQL Examples:

# Error rate by type
sum by (error_type) (rate(hsi_pipeline_errors_total[5m]))

# Total errors per minute
sum(rate(hsi_pipeline_errors_total[1m])) * 60
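
Mapping raised exceptions onto the allowlist is the usual pattern; a hypothetical classifier:

def classify_error(exc: Exception) -> str:
    # Collapse exception classes onto the sanitized error_type vocabulary;
    # TimeoutError is checked first since it subclasses OSError
    if isinstance(exc, TimeoutError):
        return "timeout_error"
    if isinstance(exc, (ConnectionError, OSError)):
        return "connection_error"
    if isinstance(exc, ValueError):
        return "validation_error"
    return "unknown_error"

# PIPELINE_ERRORS_TOTAL.labels(error_type=classify_error(exc)).inc()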

Worker Pool Metrics

Track pipeline worker state (backend/core/metrics.py:119-202):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_worker_restarts_total | worker_name | Worker restart count |
| hsi_worker_crashes_total | worker_name | Worker crash count |
| hsi_worker_status | worker_name | Current status (0-4) |
| hsi_pipeline_worker_state | worker_name | State (0=stopped, 1=running, 2=restarting, 3=failed) |
| hsi_pipeline_worker_consecutive_failures | worker_name | Consecutive failure count |
| hsi_pipeline_worker_uptime_seconds | worker_name | Uptime since last start |
| hsi_worker_active_count | - | Total active workers |
| hsi_worker_busy_count | - | Workers processing tasks |
| hsi_worker_idle_count | - | Workers waiting for tasks |

PromQL Examples:

# All workers running
count(hsi_pipeline_worker_state == 1)

# Workers in failed state
count(hsi_pipeline_worker_state == 3)

# Worker utilization
hsi_worker_busy_count / hsi_worker_active_count
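
The state gauge encodes an enum as an integer, following the 0-3 mapping in the table above; a sketch:

from enum import IntEnum
from prometheus_client import Gauge

class WorkerState(IntEnum):
    STOPPED = 0
    RUNNING = 1
    RESTARTING = 2
    FAILED = 3

PIPELINE_WORKER_STATE = Gauge(
    "hsi_pipeline_worker_state", "Pipeline worker state", ["worker_name"]
)

def set_worker_state(worker_name: str, state: WorkerState) -> None:
    PIPELINE_WORKER_STATE.labels(worker_name=worker_name).set(int(state))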

Enrichment Model Metrics

Track Model Zoo performance (backend/core/metrics.py:433-477):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_enrichment_model_calls_total | model | Calls per model |
| hsi_enrichment_model_duration_seconds | model | Inference duration histogram |
| hsi_enrichment_model_errors_total | model | Errors per model |
| hsi_enrichment_success_rate | model | Success rate gauge (0-1) |
| hsi_enrichment_partial_batches_total | - | Batches with partial success |
| hsi_enrichment_batch_status_total | status | Batch outcomes |

Model values: brisque, violence, clothing, vehicle, pet, depth, pose, action, weather, fashion-clip

PromQL Examples:

# P95 enrichment latency by model
histogram_quantile(0.95, sum by (model, le) (rate(hsi_enrichment_model_duration_seconds_bucket[5m])))

# Enrichment error rate
sum(rate(hsi_enrichment_model_errors_total[5m])) / sum(rate(hsi_enrichment_model_calls_total[5m]))

Cost Tracking Metrics

Track inference costs (backend/core/metrics.py:630-689):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_gpu_seconds_total | model | GPU time consumed |
| hsi_estimated_cost_usd_total | service | Estimated cloud-equivalent cost |
| hsi_event_analysis_cost_usd_total | camera_id | Cost per event |
| hsi_daily_cost_usd | - | Current daily cost gauge |
| hsi_monthly_cost_usd | - | Current monthly cost gauge |
| hsi_budget_utilization_ratio | period | Budget utilization (0-1+) |
| hsi_cost_per_detection_usd | - | Average cost per detection |
| hsi_cost_per_event_usd | - | Average cost per event |

PromQL Examples:

# Daily cost
hsi_daily_cost_usd

# Budget utilization
hsi_budget_utilization_ratio{period="monthly"}

# Cost per event
hsi_cost_per_event_usd

Queue Overflow Metrics

Track backpressure handling (backend/core/metrics.py:505-531):

| Metric | Labels | Description |
| --- | --- | --- |
| hsi_queue_overflow_total | queue_name, policy | Overflow events |
| hsi_queue_items_moved_to_dlq_total | queue_name | Items to dead-letter queue |
| hsi_queue_items_dropped_total | queue_name | Items dropped |
| hsi_queue_items_rejected_total | queue_name | Items rejected |

PromQL Examples:

# Overflow events by policy
sum by (policy) (rate(hsi_queue_overflow_total[1h]))

# DLQ rate
rate(hsi_queue_items_moved_to_dlq_total[5m])

MetricsService Class

The MetricsService (backend/core/metrics.py:696-1244) provides a centralized interface for recording metrics with automatic sanitization:

# From backend/core/metrics.py:696-720
class MetricsService:
    """Centralized service for recording Prometheus metrics."""

    def record_event_created(self) -> None:
        EVENTS_CREATED_TOTAL.inc()

    def record_detection_by_class(self, object_class: str) -> None:
        safe_class = sanitize_object_class(object_class)
        DETECTIONS_BY_CLASS_TOTAL.labels(object_class=safe_class).inc()

    def observe_stage_duration(self, stage: str, duration_seconds: float) -> None:
        STAGE_DURATION_SECONDS.labels(stage=stage).observe(duration_seconds)

Usage:

from backend.core.metrics import get_metrics_service

metrics = get_metrics_service()
metrics.record_event_created()
metrics.observe_stage_duration("detect", 0.245)
metrics.record_detection_by_class("person")
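
get_metrics_service returns a module-level singleton so metric objects are registered with the Prometheus registry exactly once. A plausible shape (the actual accessor in backend/core/metrics.py may differ):

_metrics_service: MetricsService | None = None

def get_metrics_service() -> MetricsService:
    # Lazily create the shared instance; repeated calls return the same object
    global _metrics_service
    if _metrics_service is None:
        _metrics_service = MetricsService()
    return _metrics_service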

Recording Rules

Pre-computed SLI metrics (monitoring/prometheus-rules.yml):

| Rule | Expression | Purpose |
| --- | --- | --- |
| hsi:api_requests:success_rate_5m | avg_over_time(probe_success{job="blackbox-http-ready"}[5m]) | API availability |
| hsi:detection_latency:p95_5m | histogram_quantile(0.95, ...) | Detection P95 latency |
| hsi:analysis_latency:p95_5m | histogram_quantile(0.95, ...) | Analysis P95 latency |
| hsi:gpu:memory_utilization | hsi_gpu_memory_used_mb / hsi_gpu_memory_total_mb | GPU memory % |
| hsi:error_budget:api_availability_remaining | Budget calculation | SLO error budget |
| hsi:burn_rate:api_availability_1h | Burn rate calculation | SLO burn rate |

Configuration

Prometheus scrape configuration (monitoring/prometheus.yml:35-45):

- job_name: 'hsi-backend-metrics'
  metrics_path: /api/metrics
  scrape_interval: 15s
  static_configs:
    - targets:
        - 'backend:8000'
  relabel_configs:
    - target_label: service
      replacement: 'home-security-intelligence'

Histogram Bucket Selection

Buckets are designed for the expected latency distributions:

| Use Case | Buckets | Rationale |
| --- | --- | --- |
| Stage durations | 10ms - 60s | Covers fast detections to slow analyses |
| AI requests | 100ms - 120s | Includes long LLM generation |
| Confidence | 0.5 - 0.99 | Focus on high-confidence detections |
| Risk scores | 10 - 100 | Full 0-100 range in 10-point increments |
| Context utilization | 0.5 - 1.0 | Focus on high utilization |
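
With prometheus_client, buckets are fixed when a histogram is declared. The stage duration histogram might be declared like this (illustrative, not the exact source):

from prometheus_client import Histogram

STAGE_DURATION_SECONDS = Histogram(
    "hsi_stage_duration_seconds",
    "Pipeline stage duration in seconds",
    ["stage"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0),
)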

Testing

Run metrics tests:

uv run pytest backend/tests/unit/core/test_metrics.py -v

| Test | Purpose |
| --- | --- |
| test_record_event_created | Counter increment |
| test_observe_stage_duration | Histogram observation |
| test_label_sanitization | Cardinality protection |
| test_metrics_service_singleton | Single instance |
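
A counter test can use the client registry's sample introspection; a sketch of what test_record_event_created might look like (the actual tests may differ):

from prometheus_client import REGISTRY

from backend.core.metrics import get_metrics_service

def test_record_event_created():
    # get_sample_value returns None when no sample has been recorded yet
    before = REGISTRY.get_sample_value("hsi_events_created_total") or 0.0
    get_metrics_service().record_event_created()
    after = REGISTRY.get_sample_value("hsi_events_created_total")
    assert after == before + 1.0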