System Operations API¶

This guide covers system health monitoring, configuration, alerts, logging, notification preferences, and service management.

Total Endpoints: 61 (Health: 6, GPU: 2, Config: 2, Statistics: 8, Cleanup: 3, Severity: 2, Alerts: 8, Logs: 5, Notifications: 10, Services: 5, Models: 4, Anomaly Config: 2, Circuit Breakers: 2, Metrics: 1, WebSocket Registry: 1)

System Operations Overview¶

The following diagram illustrates the key system operations components and their relationships:

Mermaid source (click to expand)

flowchart TB
    subgraph HealthChecks["Health Check Hierarchy"]
        LIVE["/health<br/>Liveness Probe"]
        READY["/api/system/health/ready<br/>Readiness Probe"]
        FULL["/api/system/health<br/>Full Health Check"]
    end

    subgraph Services["Monitored Services"]
        DB[(PostgreSQL)]
        REDIS[(Redis)]
        AI[AI Services<br/>YOLO26 + Nemotron]
    end

    subgraph Workers["Pipeline Workers"]
        GPU[gpu_monitor]
        CLEAN[cleanup_service]
        DET[detection_worker]
        ANAL[analysis_worker]
    end

    LIVE -->|"Process alive?"| READY
    READY -->|"Services + Workers"| FULL
    FULL --> DB
    FULL --> REDIS
    FULL --> AI
    READY --> GPU
    READY --> CLEAN
    READY --> DET
    READY --> ANAL

    style LIVE fill:#4ADE80,color:#000
    style READY fill:#FFB800,color:#000
    style FULL fill:#3B82F6,color:#fff
    style DB fill:#FFB800,color:#000
    style REDIS fill:#DC382D,color:#fff
    style AI fill:#A855F7,color:#fff

Health Checks¶

The system exposes multiple health endpoints for monitoring and orchestration.

Endpoints¶

Method	Endpoint	Description
GET	`/health`	Liveness probe
GET	`/ready`	Simple readiness probe
GET	`/api/system/health`	Detailed health check
GET	`/api/system/health/live`	Fast liveness probe (<10ms)
GET	`/api/system/health/ready`	Readiness probe with details
GET	`/api/system/health/full`	Comprehensive health check
GET	`/api/system/health/websocket`	WebSocket broadcaster health

Liveness Probe¶

Simple check that the process is running:

GET /health

Response:

{ "status": "alive" }

Always returns 200 OK if the HTTP server is responding.

Fast Liveness Probe¶

Ultra-fast liveness check for Kubernetes (responds in under 10ms):

GET /api/system/health/live

Response:

{
  "status": "alive",
  "timestamp": "2025-12-23T10:30:00Z"
}

This endpoint performs NO external checks - it immediately returns "alive" to indicate the process is running. Use this for Kubernetes liveness probes that need to detect hung processes quickly without waiting for dependency checks.

Simple Readiness Probe¶

Check if the system is ready to receive traffic:

GET /ready

Response: 200 OK (ready) or 503 Service Unavailable (not ready)

{
  "ready": true,
  "status": "ready"
}

This is the canonical readiness probe endpoint for Kubernetes/Docker health checks.

Detailed Health Check¶

Comprehensive check of all services:

GET /api/system/health

Response: 200 OK (healthy) or 503 Service Unavailable (degraded/unhealthy)

{
  "status": "healthy",
  "services": {
    "database": {
      "status": "healthy",
      "message": "Database operational",
      "details": null
    },
    "redis": {
      "status": "healthy",
      "message": "Redis connected",
      "details": { "redis_version": "7.0.0" }
    },
    "ai": {
      "status": "healthy",
      "message": "AI services operational",
      "details": { "yolo26": "healthy", "nemotron": "healthy" }
    }
  },
  "timestamp": "2025-12-23T10:30:00Z"
}

Status Values:

Status	Description
`healthy`	All services functioning normally
`degraded`	Some services impaired, limited operation
`unhealthy`	Critical services down

Health Status State Machine¶

The following state diagram shows how the system transitions between health states:

Mermaid source (click to expand)

stateDiagram-v2
    [*] --> Healthy: System Startup Complete

    Healthy --> Degraded: Non-critical service fails
    Healthy --> Unhealthy: Critical service fails

    Degraded --> Healthy: All services recover
    Degraded --> Unhealthy: Critical service fails

    Unhealthy --> Degraded: Critical service recovers<br/>but non-critical still down
    Unhealthy --> Healthy: All services recover

    Healthy: All Services Operational
    Healthy: Database: healthy
    Healthy: Redis: healthy
    Healthy: AI: healthy

    Degraded: Partial Operation
    Degraded: Some services impaired
    Degraded: Core functionality available

    Unhealthy: Service Unavailable
    Unhealthy: Returns 503
    Unhealthy: Requires intervention

Readiness Probe¶

Check if system can accept requests:

GET /api/system/health/ready

Response: 200 OK (ready) or 503 Service Unavailable (not ready)

{
  "ready": true,
  "status": "ready",
  "services": { ... },
  "workers": [
    { "name": "gpu_monitor", "running": true, "message": null },
    { "name": "cleanup_service", "running": true, "message": null },
    { "name": "detection_worker", "running": true, "message": null },
    { "name": "analysis_worker", "running": true, "message": null }
  ],
  "timestamp": "2025-12-23T10:30:00Z"
}

Readiness Criteria:

Database must be healthy
Redis must be healthy
Pipeline workers must be running

Full Health Check¶

Comprehensive health status including all system components:

GET /api/system/health/full

Response: 200 OK (healthy) or 503 Service Unavailable (critical services unhealthy)

{
  "status": "healthy",
  "infrastructure": {
    "postgres": { "status": "healthy", "latency_ms": 5.2 },
    "redis": { "status": "healthy", "latency_ms": 1.1 }
  },
  "ai_services": {
    "yolo26": { "status": "healthy", "loaded": true },
    "nemotron": { "status": "healthy", "loaded": true },
    "florence": { "status": "healthy", "loaded": false },
    "clip": { "status": "healthy", "loaded": false },
    "enrichment": { "status": "healthy" }
  },
  "circuit_breakers": {
    "yolo26": { "state": "closed", "failures": 0 },
    "nemotron": { "state": "closed", "failures": 0 }
  },
  "workers": {
    "gpu_monitor": { "running": true },
    "cleanup_service": { "running": true },
    "detection_worker": { "running": true },
    "analysis_worker": { "running": true }
  },
  "timestamp": "2025-12-23T10:30:00Z"
}

WebSocket Health¶

Check WebSocket broadcaster circuit breaker status:

GET /api/system/health/websocket

Response:

{
  "event_broadcaster": {
    "state": "closed",
    "failures": 0,
    "last_failure": null
  },
  "system_broadcaster": {
    "state": "closed",
    "failures": 0,
    "last_failure": null
  },
  "timestamp": "2025-12-23T10:30:00Z"
}

Circuit Breaker States:

State	Description
`closed`	Normal operation, events flowing
`open`	Failures detected, events may be delayed
`half_open`	Testing recovery, limited operations allowed

GPU Monitoring¶

Monitor GPU statistics for the AI pipeline.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/gpu`	Current GPU statistics
GET	`/api/system/gpu/history`	GPU stats time series

Current GPU Stats¶

GET /api/system/gpu

Response:

{
  "gpu_name": "NVIDIA RTX A5500",
  "utilization": 75.5,
  "memory_used": 12000,
  "memory_total": 24000,
  "temperature": 65.0,
  "power_usage": 150.0,
  "inference_fps": 30.5
}

GPU History¶

GET /api/system/gpu/history?since=2025-12-23T09:00:00Z&limit=100

Parameters:

Name	Type	Default	Description
since	datetime	null	Lower bound for recorded_at
limit	integer	300	Max samples (max 5000)

Response:

{
  "samples": [
    {
      "recorded_at": "2025-12-23T10:00:00Z",
      "gpu_name": "NVIDIA RTX A5500",
      "utilization": 72.0,
      "memory_used": 11500,
      "memory_total": 24000,
      "temperature": 64.0,
      "power_usage": 145.0,
      "inference_fps": 28.5
    }
  ],
  "count": 1,
  "limit": 100
}

Configuration¶

Runtime configuration management.

Configuration Relationships¶

The following diagram shows how system configuration parameters relate to each other and affect system behavior:

Mermaid source (click to expand)

flowchart LR
    subgraph Global["Global Configuration"]
        APP[app_name]
        VER[version]
        RET[retention_days<br/>1-365]
    end

    subgraph Pipeline["Pipeline Configuration"]
        BATCH[batch_window_seconds<br/>Default: 90s]
        IDLE[batch_idle_timeout_seconds<br/>Default: 30s]
        CONF[detection_confidence_threshold<br/>0.0-1.0]
    end

    subgraph Severity["Severity Mapping"]
        LOW[low<br/>0-29]
        MED[medium<br/>30-59]
        HIGH[high<br/>60-84]
        CRIT[critical<br/>85-100]
    end

    subgraph Cleanup["Data Lifecycle"]
        CLEAN[Cleanup Service]
        EVENTS[Events Table]
        DETECT[Detections Table]
        LOGS[Logs Table]
    end

    RET --> CLEAN
    CLEAN -->|"Delete older than"| EVENTS
    CLEAN -->|"Delete older than"| DETECT
    CLEAN -->|"Delete older than"| LOGS

    CONF -->|"Filter below"| DETECT
    BATCH -->|"Aggregate into"| EVENTS
    IDLE -->|"Close after"| EVENTS

    EVENTS -->|"risk_score"| LOW
    EVENTS -->|"risk_score"| MED
    EVENTS -->|"risk_score"| HIGH
    EVENTS -->|"risk_score"| CRIT

    style RET fill:#FFB800,color:#000
    style BATCH fill:#3B82F6,color:#fff
    style IDLE fill:#3B82F6,color:#fff
    style CONF fill:#3B82F6,color:#fff
    style LOW fill:#4ADE80,color:#000
    style MED fill:#FBBF24,color:#000
    style HIGH fill:#F97316,color:#fff
    style CRIT fill:#E74856,color:#fff
    style CLEAN fill:#64748B,color:#fff

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/config`	Get configuration
PATCH	`/api/system/config`	Update configuration

Get Configuration¶

GET /api/system/config

Response:

{
  "app_name": "Home Security Intelligence",
  "version": "0.1.0",
  "retention_days": 30,
  "batch_window_seconds": 90,
  "batch_idle_timeout_seconds": 30,
  "detection_confidence_threshold": 0.5
}

Update Configuration¶

Requires API key authentication.

PATCH /api/system/config
X-API-Key: your-api-key
Content-Type: application/json

{
  "retention_days": 14,
  "detection_confidence_threshold": 0.6
}

Configurable Fields:

Field	Type	Range	Description
`retention_days`	integer	1-365	Data retention period
`batch_window_seconds`	integer	>= 1	Detection batch window
`batch_idle_timeout_seconds`	integer	>= 1	Batch idle timeout
`detection_confidence_threshold`	float	0.0-1.0	Minimum confidence filter

Statistics¶

System-wide statistics and metrics.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/stats`	System statistics
GET	`/api/system/storage`	Storage statistics
GET	`/api/system/telemetry`	Pipeline telemetry
GET	`/api/system/pipeline`	Pipeline status
GET	`/api/system/performance`	Current performance metrics
GET	`/api/system/performance/history`	Historical performance
GET	`/api/system/pipeline-latency`	Pipeline latency metrics
GET	`/api/system/pipeline-latency/history`	Pipeline latency history

System Statistics¶

GET /api/system/stats

Response:

{
  "total_cameras": 4,
  "total_events": 156,
  "total_detections": 892,
  "uptime_seconds": 86400.5
}

Storage Statistics¶

GET /api/system/storage

Response:

{
  "disk_used_bytes": 107374182400,
  "disk_total_bytes": 536870912000,
  "disk_free_bytes": 429496729600,
  "disk_usage_percent": 20.0,
  "thumbnails": { "file_count": 1500, "size_bytes": 75000000 },
  "images": { "file_count": 10000, "size_bytes": 5000000000 },
  "clips": { "file_count": 50, "size_bytes": 500000000 },
  "events_count": 156,
  "detections_count": 892,
  "gpu_stats_count": 2880,
  "logs_count": 5000,
  "timestamp": "2025-12-30T10:30:00Z"
}

Pipeline Telemetry¶

GET /api/system/telemetry

Response:

{
  "queues": {
    "detection_queue": 5,
    "analysis_queue": 2
  },
  "latencies": {
    "watch": { "avg_ms": 10.0, "p95_ms": 40.0, "sample_count": 500 },
    "detect": { "avg_ms": 200.0, "p95_ms": 600.0, "sample_count": 500 },
    "batch": null,
    "analyze": null
  },
  "timestamp": "2025-12-27T10:30:00Z"
}

Performance Metrics¶

Get comprehensive real-time performance metrics:

GET /api/system/performance

Response:

{
  "gpu": {
    "utilization": 75.5,
    "memory_used_mb": 12000,
    "memory_total_mb": 24000,
    "temperature_c": 65.0,
    "power_watts": 150.0
  },
  "ai_models": {
    "yolo26": { "status": "loaded", "vram_mb": 2048 },
    "nemotron": { "status": "loaded", "vram_mb": 8192 }
  },
  "inference": {
    "avg_latency_ms": 45.2,
    "p95_latency_ms": 120.5,
    "throughput_fps": 22.3
  },
  "databases": {
    "postgres": { "status": "healthy", "connections": 10 },
    "redis": { "status": "healthy", "memory_mb": 256 }
  },
  "host": {
    "cpu_percent": 35.2,
    "ram_percent": 62.1,
    "disk_percent": 45.0
  },
  "containers": [
    { "name": "backend", "status": "running" },
    { "name": "ai", "status": "running" }
  ],
  "alerts": [],
  "timestamp": "2025-12-27T10:30:00Z"
}

Performance History¶

Get historical performance metrics for time-series visualization:

GET /api/system/performance/history?time_range=5m

Parameters:

Name	Type	Default	Description
time_range	string	5m	Time range: `5m`, `15m`, or `60m`

Response:

{
  "snapshots": [
    { "timestamp": "2025-12-27T10:29:00Z", "gpu": { ... }, ... },
    { "timestamp": "2025-12-27T10:30:00Z", "gpu": { ... }, ... }
  ],
  "time_range": "5m",
  "count": 60
}

Pipeline Latency¶

Get detailed pipeline latency metrics with percentiles:

GET /api/system/pipeline-latency?window_minutes=60

Parameters:

Name	Type	Default	Description
window_minutes	integer	60	Time window for statistics (in minutes)

Response:

{
  "watch_to_detect": {
    "avg_ms": 15.2,
    "min_ms": 5.0,
    "max_ms": 45.0,
    "p50_ms": 12.0,
    "p95_ms": 35.0,
    "p99_ms": 42.0,
    "sample_count": 500
  },
  "detect_to_batch": {
    "avg_ms": 200.0,
    "min_ms": 150.0,
    "max_ms": 350.0,
    "p50_ms": 190.0,
    "p95_ms": 300.0,
    "p99_ms": 340.0,
    "sample_count": 500
  },
  "batch_to_analyze": { ... },
  "total_pipeline": { ... },
  "timestamp": "2025-12-27T10:30:00Z"
}

Pipeline Stages:

Stage	Description
`watch_to_detect`	File watcher to YOLO26 processing start
`detect_to_batch`	Detection completion to batch aggregation
`batch_to_analyze`	Batch completion to Nemotron analysis start
`total_pipeline`	Total end-to-end processing time

Pipeline Latency History¶

Get pipeline latency history for time-series charts:

GET /api/system/pipeline-latency/history?since=60&bucket_seconds=60

Parameters:

Name	Type	Default	Description
since	integer	60	Minutes of history (1-1440)
bucket_seconds	integer	60	Time bucket size in seconds (10-3600)

Response:

{
  "snapshots": [
    {
      "timestamp": "2025-12-27T10:29:00Z",
      "watch_to_detect": { "avg_ms": 15.0, "p50_ms": 12.0, "p95_ms": 35.0 },
      "detect_to_batch": { "avg_ms": 200.0, "p50_ms": 190.0, "p95_ms": 300.0 },
      "batch_to_analyze": { ... },
      "total_pipeline": { ... }
    }
  ],
  "count": 60
}

Data Cleanup¶

Manage data retention and cleanup.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/cleanup/status`	Cleanup service status
POST	`/api/system/cleanup`	Trigger cleanup
POST	`/api/system/cleanup/orphaned-files`	Clean orphaned files

Cleanup Status¶

GET /api/system/cleanup/status

Response:

{
  "running": true,
  "retention_days": 30,
  "cleanup_time": "03:00",
  "delete_images": false,
  "next_cleanup": "2025-12-31T03:00:00Z",
  "timestamp": "2025-12-30T10:30:00Z"
}

Trigger Cleanup¶

Requires API key authentication.

POST /api/system/cleanup?dry_run=true
X-API-Key: your-api-key

Parameters:

Name	Type	Default	Description
dry_run	boolean	false	Preview without deleting

Response:

{
  "events_deleted": 15,
  "detections_deleted": 89,
  "gpu_stats_deleted": 2880,
  "logs_deleted": 150,
  "thumbnails_deleted": 89,
  "images_deleted": 0,
  "space_reclaimed": 524288000,
  "retention_days": 30,
  "dry_run": true,
  "timestamp": "2025-12-27T10:30:00Z"
}

Orphaned File Cleanup¶

Find and clean up orphaned files (files on disk not referenced in database):

Requires API key authentication.

POST /api/system/cleanup/orphaned-files?dry_run=true
X-API-Key: your-api-key

Parameters:

Name	Type	Default	Description
dry_run	boolean	true	Preview without deleting (default safe)

Response:

{
  "thumbnails": {
    "orphaned_count": 45,
    "orphaned_size_bytes": 2250000,
    "files": ["thumb_001.jpg", "thumb_002.jpg"]
  },
  "clips": {
    "orphaned_count": 5,
    "orphaned_size_bytes": 50000000,
    "files": ["clip_old.mp4"]
  },
  "total_orphaned_files": 50,
  "total_size_bytes": 52250000,
  "dry_run": true,
  "timestamp": "2025-12-27T10:30:00Z"
}

Storage Directories Scanned:

Thumbnails directory (video_thumbnails_dir setting)
Clips directory (clips_directory setting)

Severity Configuration¶

Configure risk score to severity level mappings.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/severity`	Get severity definitions
PUT	`/api/system/severity`	Update thresholds

Get Severity Definitions¶

GET /api/system/severity

Response:

{
  "definitions": [
    { "severity": "low", "min_score": 0, "max_score": 29, "color": "#22c55e" },
    { "severity": "medium", "min_score": 30, "max_score": 59, "color": "#eab308" },
    { "severity": "high", "min_score": 60, "max_score": 84, "color": "#f97316" },
    { "severity": "critical", "min_score": 85, "max_score": 100, "color": "#ef4444" }
  ],
  "thresholds": {
    "low_max": 29,
    "medium_max": 59,
    "high_max": 84
  }
}

Update Severity Thresholds¶

Requires API key authentication.

PUT /api/system/severity
X-API-Key: your-api-key
Content-Type: application/json

{
  "low_max": 20,
  "medium_max": 45,
  "high_max": 70
}

Thresholds must be strictly ordered: low_max < medium_max < high_max

Alert Rules¶

Configure automated alert triggers based on security events.

Alert Flow Diagram¶

The following diagram shows how detections flow through the alert system, from initial detection through risk assessment to notification delivery:

Mermaid source (click to expand)

flowchart TB
    subgraph Detection["Detection Stage"]
        CAM[Camera Image]
        YOLO26[YOLO26<br/>Object Detection]
    end

    subgraph RiskAssessment["Risk Assessment"]
        NEM[Nemotron LLM<br/>Risk Analysis]
        SCORE{Risk Score<br/>0-100}
    end

    subgraph AlertMatching["Alert Rule Matching"]
        RULES[Alert Rules Engine]
        THRESH{Score >= Threshold?}
        TYPE{Object Type Match?}
        SCHED{Within Schedule?}
        COOL{Cooldown Elapsed?}
    end

    subgraph Notification["Notification Delivery"]
        EMAIL[Email Channel]
        WEBHOOK[Webhook Channel]
        PUSH[Push Channel]
    end

    CAM --> YOLO26
    YOLO26 -->|Detections| NEM
    NEM --> SCORE

    SCORE -->|"85+"| RULES
    SCORE -->|"60-84"| RULES
    SCORE -->|"30-59"| RULES
    SCORE -->|"0-29"| RULES

    RULES --> THRESH
    THRESH -->|Yes| TYPE
    THRESH -->|No| DROP1[No Alert]
    TYPE -->|Yes| SCHED
    TYPE -->|No| DROP2[No Alert]
    SCHED -->|Yes| COOL
    SCHED -->|No| DROP3[No Alert]
    COOL -->|Yes| EMAIL
    COOL -->|Yes| WEBHOOK
    COOL -->|Yes| PUSH
    COOL -->|No| DROP4[Cooldown Active]

    style CAM fill:#64748B,color:#fff
    style YOLO26 fill:#A855F7,color:#fff
    style NEM fill:#A855F7,color:#fff
    style RULES fill:#3B82F6,color:#fff
    style EMAIL fill:#4ADE80,color:#000
    style WEBHOOK fill:#4ADE80,color:#000
    style PUSH fill:#4ADE80,color:#000
    style DROP1 fill:#6B7280,color:#fff
    style DROP2 fill:#6B7280,color:#fff
    style DROP3 fill:#6B7280,color:#fff
    style DROP4 fill:#FFB800,color:#000

Endpoints¶

Method	Endpoint	Description
GET	`/api/alerts/rules`	List alert rules
POST	`/api/alerts/rules`	Create rule
GET	`/api/alerts/rules/{id}`	Get rule
PUT	`/api/alerts/rules/{id}`	Update rule
DELETE	`/api/alerts/rules/{id}`	Delete rule
POST	`/api/alerts/rules/{id}/test`	Test rule
POST	`/api/alerts/{alert_id}/acknowledge`	Acknowledge alert
POST	`/api/alerts/{alert_id}/dismiss`	Dismiss alert

List Alert Rules¶

GET /api/alerts/rules?enabled=true&severity=high

Response:

{
  "rules": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "High Risk Person Detection",
      "enabled": true,
      "severity": "high",
      "risk_threshold": 75,
      "object_types": ["person"],
      "camera_ids": null,
      "schedule": {
        "days": ["mon", "tue", "wed", "thu", "fri"],
        "start_time": "22:00",
        "end_time": "06:00"
      },
      "cooldown_seconds": 300,
      "channels": ["email", "webhook"]
    }
  ],
  "count": 1
}

Create Alert Rule¶

POST /api/alerts/rules
Content-Type: application/json

{
  "name": "Critical Person Detection",
  "severity": "critical",
  "risk_threshold": 85,
  "object_types": ["person"],
  "schedule": {
    "days": ["sun", "mon", "tue", "wed", "thu", "fri", "sat"],
    "start_time": "00:00",
    "end_time": "06:00"
  },
  "cooldown_seconds": 600,
  "channels": ["email", "webhook"]
}

Matching Criteria:

Risk score >= threshold
Detection object type matches
Camera ID matches (or null = all)
Current time within schedule
Cooldown period elapsed

Test Alert Rule¶

Test a rule against historical events:

POST /api/alerts/rules/550e8400-e29b-41d4-a716-446655440000/test
Content-Type: application/json

{
  "event_ids": [1, 2, 3],
  "test_time": "2025-12-23T03:00:00Z",
  "limit": 50
}

Acknowledge Alert¶

Mark an alert as acknowledged:

POST /api/alerts/{alert_id}/acknowledge

Response: 200 OK, 404 Not Found, or 409 Conflict

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "acknowledged",
  "acknowledged_at": "2025-12-23T10:35:00Z",
  "rule_id": "550e8400-e29b-41d4-a716-446655440001",
  "event_id": 123,
  "created_at": "2025-12-23T10:30:00Z"
}

Status Requirements: Only alerts with status PENDING or DELIVERED can be acknowledged.

Dismiss Alert¶

Mark an alert as dismissed:

POST /api/alerts/{alert_id}/dismiss

Response: 200 OK, 404 Not Found, or 409 Conflict

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "dismissed",
  "dismissed_at": "2025-12-23T10:40:00Z",
  "rule_id": "550e8400-e29b-41d4-a716-446655440001",
  "event_id": 123,
  "created_at": "2025-12-23T10:30:00Z"
}

Status Requirements: Only alerts with status PENDING, DELIVERED, or ACKNOWLEDGED can be dismissed.

Logs¶

View and query application logs.

Endpoints¶

Method	Endpoint	Description
GET	`/api/logs`	List logs
GET	`/api/logs/stats`	Log statistics
GET	`/api/logs/{id}`	Get log entry
POST	`/api/logs/frontend`	Submit single frontend log
POST	`/api/logs/frontend/batch`	Submit multiple logs

List Logs¶

GET /api/logs?level=ERROR&component=detection_worker&limit=100

Parameters:

Name	Type	Description
level	string	DEBUG, INFO, WARNING, ERROR, CRITICAL
component	string	Module/component name
camera_id	string	Associated camera
source	string	`backend` or `frontend`
search	string	Search in message text
start_date	datetime	Filter from date
end_date	datetime	Filter to date
limit	integer	Page size (1-1000, default: 100)
offset	integer	Page offset (default: 0)

Response:

{
  "logs": [
    {
      "id": 1,
      "timestamp": "2025-12-23T10:30:00Z",
      "level": "ERROR",
      "component": "detection_worker",
      "message": "Connection refused to YOLO26 service",
      "camera_id": "front_door",
      "request_id": "abc123",
      "duration_ms": null,
      "extra": { "host": "ai:8001" },
      "source": "backend"
    }
  ],
  "count": 1
}

Log Statistics¶

GET /api/logs/stats

Response:

{
  "total_today": 5000,
  "errors_today": 15,
  "warnings_today": 50,
  "by_component": {
    "detection_worker": 1500,
    "analysis_worker": 1000
  },
  "by_level": {
    "INFO": 4000,
    "WARNING": 50,
    "ERROR": 15
  },
  "top_component": "detection_worker"
}

Submit Frontend Log¶

Allow frontend to send logs to backend for centralized logging and debugging.

Single Log Entry¶

POST /api/logs/frontend
Content-Type: application/json

{
  "level": "ERROR",
  "message": "Failed to load component",
  "timestamp": "2024-01-21T12:00:00Z",
  "component": "Dashboard",
  "context": {"userId": "123", "action": "load"},
  "url": "/dashboard",
  "user_agent": "Mozilla/5.0..."
}

Request Body Fields:

Field	Type	Required	Description
`level`	string	Yes	Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
`message`	string	Yes	Log message text
`timestamp`	string	No	ISO 8601 timestamp (defaults to server time)
`component`	string	No	Frontend component name
`context`	object	No	Additional context data (JSON object)
`url`	string	No	Page URL where the log originated
`user_agent`	string	No	Browser user agent string

Response:

{
  "success": true,
  "count": 1,
  "message": "Log entry recorded"
}

Batch Log Entries¶

Submit multiple log entries in a single request (1-100 entries). This is more efficient for high-volume logging scenarios.

POST /api/logs/frontend/batch
Content-Type: application/json

{
  "entries": [
    {
      "level": "INFO",
      "message": "Page loaded successfully",
      "component": "App",
      "timestamp": "2024-01-21T12:00:00Z"
    },
    {
      "level": "ERROR",
      "message": "API call failed",
      "component": "ApiClient",
      "context": {"endpoint": "/api/events", "status": 500},
      "timestamp": "2024-01-21T12:00:01Z"
    },
    {
      "level": "WARNING",
      "message": "Slow render detected",
      "component": "EventTimeline",
      "context": {"render_time_ms": 2500},
      "timestamp": "2024-01-21T12:00:02Z"
    }
  ]
}

Request Body:

Field	Type	Required	Description
`entries`	array	Yes	Array of log entries (1-100 max)

Each entry in the array follows the same schema as the single log entry endpoint.

Response:

{
  "success": true,
  "count": 3,
  "message": "3 log entries recorded"
}

Error Response (validation failure):

{
  "success": false,
  "count": 0,
  "message": "Batch must contain 1-100 entries",
  "errors": ["entries: ensure this value has at most 100 items"]
}

Use Cases:

Error Reporting: Capture frontend errors with stack traces and component context
Performance Monitoring: Log slow renders, API latency, and user interactions
User Journey Tracking: Track navigation and feature usage for debugging
Debugging: Centralize frontend and backend logs for correlation

Notification Preferences¶

Configure global notification settings and per-camera overrides.

Endpoints¶

Method	Endpoint	Description
GET	`/api/notification-preferences`	Get global preferences
PUT	`/api/notification-preferences`	Update preferences
GET	`/api/notification-preferences/cameras`	List camera settings
GET	`/api/notification-preferences/cameras/{camera_id}`	Get camera setting
PUT	`/api/notification-preferences/cameras/{camera_id}`	Update camera setting
GET	`/api/notification-preferences/quiet-hours`	Get quiet hours
POST	`/api/notification-preferences/quiet-hours`	Create quiet hours
DELETE	`/api/notification-preferences/quiet-hours/{id}`	Delete quiet hours

Global Preferences¶

GET /api/notification-preferences

Response:

{
  "id": 1,
  "enabled": true,
  "sound": "default",
  "risk_filters": ["critical", "high", "medium"]
}

Update Global Preferences¶

PUT /api/notification-preferences
Content-Type: application/json

{
  "enabled": true,
  "sound": "alert",
  "risk_filters": ["critical", "high"]
}

Sound Options: none, default, alert, chime, urgent

Camera-Specific Settings¶

PUT /api/notification-preferences/cameras/front_door
Content-Type: application/json

{
  "enabled": true,
  "risk_threshold": 75
}

Quiet Hours¶

Create a quiet hours period:

POST /api/notification-preferences/quiet-hours
Content-Type: application/json

{
  "label": "Night Hours",
  "start_time": "22:00:00",
  "end_time": "23:59:00",
  "days": ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
}

Notification Testing¶

Test notification delivery.

Endpoints¶

Method	Endpoint	Description
GET	`/api/notification/config`	Get notification config
POST	`/api/notification/test`	Test notification

Get Notification Configuration¶

GET /api/notification/config

Response:

{
  "notification_enabled": true,
  "email_configured": true,
  "webhook_configured": true,
  "push_configured": false,
  "available_channels": ["email", "webhook"],
  "smtp_host": "smtp.example.com",
  "smtp_port": 587,
  "smtp_from_address": "alerts@example.com"
}

Test Notification¶

POST /api/notification/test
Content-Type: application/json

{
  "channel": "email",
  "email_recipients": ["test@example.com"]
}

Channels: email, webhook, push

Services Management¶

Manage and monitor container services.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/services`	List all services
POST	`/api/system/services/{name}/restart`	Restart service
POST	`/api/system/services/{name}/start`	Start service
POST	`/api/system/services/{name}/enable`	Enable service
POST	`/api/system/services/{name}/disable`	Disable service

List Services¶

Get status of all managed services:

GET /api/system/services?category=ai

Parameters:

Name	Type	Description
category	string	Filter by category: `infrastructure`, `ai`, `monitoring`

Response:

{
  "services": [
    {
      "name": "backend",
      "category": "infrastructure",
      "status": "running",
      "enabled": true,
      "failure_count": 0,
      "last_restart": null
    },
    {
      "name": "ai",
      "category": "ai",
      "status": "running",
      "enabled": true,
      "failure_count": 0,
      "last_restart": null
    }
  ],
  "summaries": {
    "infrastructure": { "total": 3, "healthy": 3, "unhealthy": 0 },
    "ai": { "total": 1, "healthy": 1, "unhealthy": 0 },
    "monitoring": { "total": 2, "healthy": 2, "unhealthy": 0 }
  }
}

Restart Service¶

Manually restart a service:

POST /api/system/services/ai/restart

Response: 200 OK, 400 Bad Request (disabled), or 404 Not Found

{
  "action": "restart",
  "service": "ai",
  "success": true,
  "message": "Service ai restarted successfully",
  "service_info": {
    "name": "ai",
    "status": "running",
    "enabled": true,
    "failure_count": 0
  }
}

Start Service¶

Start a stopped service:

POST /api/system/services/ai/start

Response: 200 OK, 400 Bad Request (already running or disabled), or 404 Not Found

Enable Service¶

Re-enable a disabled service (allows self-healing to resume):

POST /api/system/services/ai/enable

Response: 200 OK or 404 Not Found

Disable Service¶

Disable a service (prevents self-healing restarts):

POST /api/system/services/ai/disable

Response: 200 OK or 404 Not Found

Model Zoo¶

Manage AI models in the Model Zoo.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/models`	Get model registry
GET	`/api/system/models/{model_name}`	Get specific model status
GET	`/api/system/model-zoo/status`	Compact model status
GET	`/api/system/model-zoo/latency/history`	Model latency history

Get Model Registry¶

Get comprehensive information about all AI models:

GET /api/system/models

Response:

{
  "vram_budget_mb": 1650,
  "vram_used_mb": 512,
  "vram_available_mb": 1138,
  "models": [
    {
      "name": "yolo11-license-plate",
      "category": "detection",
      "status": "loaded",
      "vram_mb": 256,
      "last_used": "2025-12-23T10:30:00Z",
      "enabled": true
    },
    {
      "name": "insightface",
      "category": "recognition",
      "status": "unloaded",
      "vram_mb": 512,
      "last_used": null,
      "enabled": true
    }
  ]
}

Model Categories:

Category	Description
`detection`	Object detection (YOLO variants)
`recognition`	Face/license plate recognition
`ocr`	Optical character recognition
`embedding`	Visual embeddings (CLIP)
`depth-estimation`	Depth estimation models
`pose`	Human pose estimation

Get Model Status¶

Get detailed status for a specific model:

GET /api/system/models/yolo11-license-plate

Response: 200 OK or 404 Not Found

{
  "name": "yolo11-license-plate",
  "category": "detection",
  "status": "loaded",
  "vram_mb": 256,
  "last_used": "2025-12-23T10:30:00Z",
  "enabled": true,
  "load_time_ms": 1250,
  "inference_count": 1500
}

Model Zoo Status¶

Compact status for all 18 Model Zoo models:

GET /api/system/model-zoo/status

Response:

{
  "models": [
    {
      "name": "yolo11-license-plate",
      "category": "detection",
      "status": "loaded",
      "vram_mb": 256
    }
  ],
  "categories": {
    "detection": { "loaded": 2, "total": 4 },
    "recognition": { "loaded": 0, "total": 2 }
  },
  "timestamp": "2025-12-23T10:30:00Z"
}

Model Latency History¶

Get latency history for a specific model:

GET /api/system/model-zoo/latency/history?model=yolo11-license-plate&since=60&bucket_seconds=60

Parameters:

Name	Type	Required	Default	Description
model	string	Yes	-	Model name
since	integer	No	60	Minutes of history (1-1440)
bucket_seconds	integer	No	60	Time bucket size in seconds (10-3600)

Response: 200 OK or 404 Not Found

{
  "model": "yolo11-license-plate",
  "snapshots": [
    {
      "timestamp": "2025-12-23T10:29:00Z",
      "avg_ms": 45.2,
      "p50_ms": 42.0,
      "p95_ms": 85.0,
      "sample_count": 100
    }
  ],
  "count": 60
}

Anomaly Detection Configuration¶

Configure anomaly detection thresholds for baseline deviation alerts.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/anomaly-config`	Get anomaly config
PATCH	`/api/system/anomaly-config`	Update anomaly config

Get Anomaly Config¶

GET /api/system/anomaly-config

Response:

{
  "threshold_stdev": 3.0,
  "min_samples": 100,
  "decay_factor": 0.1,
  "window_days": 7
}

Configuration Fields:

Field	Type	Description
`threshold_stdev`	float	Standard deviations for anomaly detection
`min_samples`	integer	Minimum samples before detection is reliable
`decay_factor`	float	Exponential decay factor for EWMA
`window_days`	integer	Rolling window size in days

Update Anomaly Config¶

Requires API key authentication.

PATCH /api/system/anomaly-config
X-API-Key: your-api-key
Content-Type: application/json

{
  "threshold_stdev": 2.5,
  "min_samples": 50
}

Note: decay_factor and window_days are not configurable at runtime as they affect historical data calculations.

Circuit Breakers¶

Manage circuit breakers that protect external services from cascading failures.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/circuit-breakers`	Get all circuit breakers
POST	`/api/system/circuit-breakers/{name}/reset`	Reset circuit breaker

Get Circuit Breakers¶

Get status of all circuit breakers:

GET /api/system/circuit-breakers

Response:

{
  "circuit_breakers": [
    {
      "name": "yolo26",
      "state": "closed",
      "failure_count": 0,
      "success_count": 150,
      "last_failure": null,
      "last_state_change": "2025-12-23T08:00:00Z"
    },
    {
      "name": "nemotron",
      "state": "closed",
      "failure_count": 0,
      "success_count": 75,
      "last_failure": null,
      "last_state_change": "2025-12-23T08:00:00Z"
    }
  ],
  "timestamp": "2025-12-23T10:30:00Z"
}

Reset Circuit Breaker¶

Manually reset a circuit breaker to CLOSED state:

Requires API key authentication (when enabled).

POST /api/system/circuit-breakers/yolo26/reset
X-API-Key: your-api-key

Response: 200 OK, 400 Bad Request (invalid name), or 404 Not Found

{
  "name": "yolo26",
  "previous_state": "open",
  "new_state": "closed",
  "message": "Circuit breaker yolo26 reset to closed state",
  "timestamp": "2025-12-23T10:30:00Z"
}

Prometheus Metrics¶

Export metrics in Prometheus exposition format.

Endpoints¶

Method	Endpoint	Description
GET	`/api/metrics`	Prometheus metrics

Get Metrics¶

GET /api/metrics

Response: 200 OK with text/plain content type

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/events",status="200"} 1523

# HELP detection_latency_seconds Detection processing latency
# TYPE detection_latency_seconds histogram
detection_latency_seconds_bucket{le="0.1"} 450
detection_latency_seconds_bucket{le="0.5"} 980
detection_latency_seconds_bucket{le="1.0"} 1050
detection_latency_seconds_sum 523.45
detection_latency_seconds_count 1050

# HELP gpu_utilization_percent Current GPU utilization
# TYPE gpu_utilization_percent gauge
gpu_utilization_percent 75.5

# HELP active_events_count Number of active security events
# TYPE active_events_count gauge
active_events_count 3

This endpoint is designed for Prometheus scraping and returns all registered application metrics.

WebSocket Event Registry¶

Discover available WebSocket event types.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/websocket/events`	List WebSocket event types

List WebSocket Event Types¶

Get all available WebSocket event types with schemas:

GET /api/system/websocket/events

Response:

{
  "event_types": [
    {
      "type": "detection.new",
      "channel": "detections",
      "description": "New detection from AI pipeline",
      "deprecated": false,
      "payload_schema": { ... },
      "example": { ... }
    },
    {
      "type": "event.created",
      "channel": "events",
      "description": "New security event created",
      "deprecated": false,
      "payload_schema": { ... },
      "example": { ... }
    }
  ],
  "channels": ["detections", "events", "alerts", "cameras", "jobs", "system"],
  "total_count": 24,
  "deprecated_count": 2
}

Channels:

Channel	Description
`detections`	AI detection pipeline events
`events`	Security event lifecycle events
`alerts`	Alert notifications and state changes
`cameras`	Camera status and configuration
`jobs`	Background job lifecycle events
`system`	System health and status events

Circuit Breaker Pattern¶

The system uses circuit breakers to protect external services from cascading failures. This is critical for maintaining system stability when AI services (YOLO26, Nemotron) or Redis experience issues.

Circuit Breaker States¶

Mermaid source (click to expand)

stateDiagram-v2
    [*] --> CLOSED: Initial State

    CLOSED --> OPEN: failures >= threshold (5)
    OPEN --> HALF_OPEN: recovery_timeout (30s) elapsed
    HALF_OPEN --> CLOSED: success_threshold met
    HALF_OPEN --> OPEN: any failure

    CLOSED: Normal Operation
    CLOSED: Calls pass through
    CLOSED: Track consecutive failures
    CLOSED: Reset on success

    OPEN: Circuit Tripped
    OPEN: Calls rejected immediately
    OPEN: CircuitBreakerError raised
    OPEN: Waiting for recovery timeout

    HALF_OPEN: Recovery Testing
    HALF_OPEN: Limited calls allowed
    HALF_OPEN: Track success/failure
    HALF_OPEN: Careful service probing

Circuit Breaker Configuration¶

Parameter	Default	Description
`failure_threshold`	5	Consecutive failures before opening
`recovery_timeout`	30s	Wait time before testing recovery
`half_open_max_calls`	3	Max calls allowed in half-open state
`success_threshold`	2	Successes needed to close circuit

For detailed circuit breaker implementation and WebSocket resilience patterns, see the Resilience Architecture documentation.

System Monitoring API - Worker supervisor, pipeline status, Prometheus integration
Core Resources API - Cameras, events, detections
AI Pipeline API - Enrichment and batch processing
Real-time API - WebSocket streams
Resilience Architecture - Circuit breakers, retry logic, DLQ management