System Monitoring API¶

This guide covers advanced system monitoring endpoints for health checks, worker supervision, pipeline status, and Prometheus integration.

Total Endpoints: 16 (Monitoring: 2, Supervisor: 6, Pipeline: 2, WebSocket Registry: 1, Performance: 2, Full Health: 1, Circuit Breakers: 2)

System Monitoring Overview¶

The following diagram illustrates the key system monitoring components and their relationships:

Mermaid source (click to expand)

flowchart TB
    subgraph HealthChecks["Health Check Endpoints"]
        FULL["/api/system/health/full<br/>Comprehensive Health"]
        WS["/api/system/health/websocket<br/>WebSocket Health"]
    end

    subgraph Monitoring["Prometheus Monitoring"]
        MON_HEALTH["/api/system/monitoring/health<br/>Stack Health"]
        MON_TARGETS["/api/system/monitoring/targets<br/>Scrape Targets"]
    end

    subgraph Supervisor["Worker Supervisor"]
        SUP_STATUS["/api/system/supervisor<br/>Supervisor Status"]
        SUP_WORKERS["/supervisor/workers/{name}/*<br/>Worker Control"]
        SUP_HISTORY["/supervisor/restart-history<br/>Restart History"]
    end

    subgraph Pipeline["Pipeline Status"]
        PIPE_STATUS["/api/system/pipeline<br/>Pipeline Architecture"]
        WS_EVENTS["/api/system/websocket/events<br/>Event Registry"]
    end

    FULL --> WS
    FULL --> MON_HEALTH
    MON_HEALTH --> MON_TARGETS
    SUP_STATUS --> SUP_WORKERS
    SUP_WORKERS --> SUP_HISTORY
    PIPE_STATUS --> WS_EVENTS

    style FULL fill:#3B82F6,color:#fff
    style WS fill:#3B82F6,color:#fff
    style MON_HEALTH fill:#FFB800,color:#000
    style MON_TARGETS fill:#FFB800,color:#000
    style SUP_STATUS fill:#4ADE80,color:#000
    style SUP_WORKERS fill:#4ADE80,color:#000
    style SUP_HISTORY fill:#4ADE80,color:#000
    style PIPE_STATUS fill:#A855F7,color:#fff
    style WS_EVENTS fill:#A855F7,color:#fff

Prometheus Monitoring¶

Monitor the health of the Prometheus monitoring stack and scrape targets.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/monitoring/health`	Monitoring stack health status
GET	`/api/system/monitoring/targets`	Prometheus scrape target status

Monitoring Stack Health¶

Get comprehensive health status of the monitoring infrastructure:

GET /api/system/monitoring/health

Response:

{
  "healthy": true,
  "prometheus_reachable": true,
  "prometheus_url": "http://prometheus:9090",
  "targets_summary": [
    {
      "job": "backend",
      "total": 1,
      "up": 1,
      "down": 0,
      "unknown": 0
    },
    {
      "job": "redis-exporter",
      "total": 1,
      "up": 1,
      "down": 0,
      "unknown": 0
    }
  ],
  "exporters": [
    {
      "name": "redis-exporter",
      "status": "up",
      "endpoint": "redis-exporter:9121",
      "last_scrape": "2025-12-27T10:30:00Z",
      "error": null
    },
    {
      "name": "json-exporter",
      "status": "up",
      "endpoint": "json-exporter:7979",
      "last_scrape": "2025-12-27T10:30:00Z",
      "error": null
    },
    {
      "name": "blackbox-exporter",
      "status": "up",
      "endpoint": "blackbox-exporter:9115",
      "last_scrape": "2025-12-27T10:30:00Z",
      "error": null
    }
  ],
  "metrics_collection": {
    "collecting": true,
    "last_successful_scrape": "2025-12-27T10:30:00Z",
    "scrape_interval_seconds": 15,
    "total_series": 15000
  },
  "issues": [],
  "timestamp": "2025-12-27T10:30:00Z"
}

Health Determination:

Condition	Result
Prometheus reachable, no targets down	`healthy: true`
Prometheus reachable, some targets down	`healthy: true` (if up > down)
Prometheus reachable, majority targets down	`healthy: false`
Prometheus unreachable	`healthy: false`

Known Exporters Tracked:

Exporter	Default Endpoint	Description
redis-exporter	redis-exporter:9121	Redis metrics
json-exporter	json-exporter:7979	JSON endpoint metrics
blackbox-exporter	blackbox-exporter:9115	Probe-based monitoring

Prometheus Scrape Targets¶

Get detailed status of all Prometheus scrape targets:

GET /api/system/monitoring/targets

Response: 200 OK or 503 Service Unavailable (if Prometheus unreachable)

{
  "targets": [
    {
      "job": "backend",
      "instance": "backend:8000",
      "health": "up",
      "labels": {
        "job": "backend",
        "instance": "backend:8000"
      },
      "last_scrape": "2025-12-27T10:30:00Z",
      "last_error": null,
      "scrape_duration_seconds": 0.045
    },
    {
      "job": "redis-exporter",
      "instance": "redis-exporter:9121",
      "health": "up",
      "labels": {
        "job": "redis-exporter",
        "instance": "redis-exporter:9121"
      },
      "last_scrape": "2025-12-27T10:30:00Z",
      "last_error": null,
      "scrape_duration_seconds": 0.012
    }
  ],
  "total": 5,
  "up": 5,
  "down": 0,
  "jobs": ["backend", "redis-exporter", "json-exporter", "blackbox-exporter", "prometheus"],
  "timestamp": "2025-12-27T10:30:00Z"
}

Target Health Values:

Health	Description
`up`	Target is healthy, scrape successful
`down`	Target is unhealthy, scrape failed
`unknown`	Target status cannot be determined

Worker Supervisor¶

The Worker Supervisor monitors background worker tasks and automatically restarts them with exponential backoff when they crash.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/supervisor`	Get supervisor status
GET	`/api/system/supervisor/status`	Alias for supervisor status
POST	`/api/system/supervisor/workers/{name}/start`	Start a stopped worker
POST	`/api/system/supervisor/workers/{name}/stop`	Stop a running worker
POST	`/api/system/supervisor/workers/{name}/restart`	Restart a worker
POST	`/api/system/supervisor/reset/{name}`	Reset worker restart count
GET	`/api/system/supervisor/restart-history`	Get restart history

Get Supervisor Status¶

Get status of the Worker Supervisor and all supervised workers:

GET /api/system/supervisor

Response:

{
  "running": true,
  "worker_count": 4,
  "workers": [
    {
      "name": "file_watcher",
      "status": "running",
      "restart_count": 0,
      "max_restarts": 5,
      "last_started_at": "2025-12-27T08:00:00Z",
      "last_crashed_at": null,
      "error": null
    },
    {
      "name": "detection_worker",
      "status": "running",
      "restart_count": 1,
      "max_restarts": 5,
      "last_started_at": "2025-12-27T09:15:00Z",
      "last_crashed_at": "2025-12-27T09:14:55Z",
      "error": "Connection reset by peer"
    },
    {
      "name": "analysis_worker",
      "status": "running",
      "restart_count": 0,
      "max_restarts": 5,
      "last_started_at": "2025-12-27T08:00:00Z",
      "last_crashed_at": null,
      "error": null
    },
    {
      "name": "cleanup_service",
      "status": "running",
      "restart_count": 0,
      "max_restarts": 5,
      "last_started_at": "2025-12-27T08:00:00Z",
      "last_crashed_at": null,
      "error": null
    }
  ],
  "timestamp": "2025-12-27T10:30:00Z"
}

Worker Status Values:

Status	Description
`running`	Worker is actively processing
`stopped`	Worker is stopped (can be started)
`crashed`	Worker crashed, pending restart
`restarting`	Worker is being restarted
`failed`	Worker exceeded max restart limit

Start Worker¶

Start a stopped or failed worker:

POST /api/system/supervisor/workers/file_watcher/start

Response: 200 OK, 400 Bad Request, 404 Not Found, or 503 Service Unavailable

{
  "success": true,
  "message": "Worker 'file_watcher' started successfully",
  "worker_name": "file_watcher"
}

Stop Worker¶

Stop a running worker:

POST /api/system/supervisor/workers/file_watcher/stop

Response: 200 OK, 400 Bad Request, 404 Not Found, or 503 Service Unavailable

{
  "success": true,
  "message": "Worker 'file_watcher' stopped successfully",
  "worker_name": "file_watcher"
}

Restart Worker¶

Restart a worker (stops if running, then starts):

POST /api/system/supervisor/workers/file_watcher/restart

Response: 200 OK, 400 Bad Request, 404 Not Found, or 503 Service Unavailable

{
  "success": true,
  "message": "Worker 'file_watcher' restarted successfully",
  "worker_name": "file_watcher"
}

Reset Worker¶

Reset a failed worker's restart count to allow new restart attempts:

POST /api/system/supervisor/reset/detection_worker

Response: 200 OK, 404 Not Found, or 503 Service Unavailable

{
  "success": true,
  "message": "Worker 'detection_worker' restart count reset"
}

When to Use:

When a worker exceeds its max_restarts limit, it enters failed status and won't be automatically restarted. Use this endpoint to reset the counter and allow the supervisor to attempt restarts again after fixing the underlying issue.

Worker Name Validation¶

Worker names must:

Start with a letter (a-z, A-Z)
Contain only alphanumeric characters and underscores
Match the pattern: ^[a-zA-Z][a-zA-Z0-9_]*$

Invalid names return 400 Bad Request.

Get Restart History¶

Get paginated history of worker restart events:

GET /api/system/supervisor/restart-history?worker_name=detection_worker&limit=50&offset=0

Parameters:

Name	Type	Default	Description
worker_name	string	null	Filter by worker name (optional)
limit	integer	50	Max events to return (1-100)
offset	integer	0	Number of events to skip

Response:

{
  "items": [
    {
      "worker_name": "detection_worker",
      "timestamp": "2025-12-27T09:15:00Z",
      "attempt": 1,
      "status": "success",
      "error": null
    },
    {
      "worker_name": "detection_worker",
      "timestamp": "2025-12-27T09:14:55Z",
      "attempt": 0,
      "status": "crashed",
      "error": "Connection reset by peer"
    }
  ],
  "pagination": {
    "total": 2,
    "limit": 50,
    "offset": 0,
    "items_count": 2,
    "has_more": false
  }
}

Restart Event Status Values:

Status	Description
`success`	Worker was successfully restarted
`crashed`	Worker crashed (before restart)
`failed`	Restart attempt failed
`manual`	Manual restart via API

Pipeline Status¶

Monitor the AI processing pipeline components.

Endpoints¶

Method	Endpoint	Description
GET	`/api/system/pipeline`	Pipeline architecture status
GET	`/api/system/websocket/events`	WebSocket event registry

Get Pipeline Status¶

Get combined status of all pipeline operations:

GET /api/system/pipeline

Response:

{
  "file_watcher": {
    "running": true,
    "camera_root": "/export/foscam",
    "pending_tasks": 3,
    "observer_type": "native"
  },
  "batch_aggregator": {
    "active_batches": 2,
    "batches": [
      {
        "batch_id": "550e8400-e29b-41d4-a716-446655440000",
        "camera_id": "front_door",
        "detection_count": 5,
        "started_at": 1703678400.0,
        "age_seconds": 45.2,
        "last_activity_seconds": 12.5
      },
      {
        "batch_id": "550e8400-e29b-41d4-a716-446655440001",
        "camera_id": "backyard",
        "detection_count": 3,
        "started_at": 1703678420.0,
        "age_seconds": 25.1,
        "last_activity_seconds": 5.3
      }
    ],
    "batch_window_seconds": 90,
    "idle_timeout_seconds": 30
  },
  "degradation": {
    "mode": "normal",
    "is_degraded": false,
    "redis_healthy": true,
    "memory_queue_size": 0,
    "fallback_queues": {},
    "services": [
      {
        "name": "yolo26",
        "status": "healthy",
        "last_check": "2025-12-27T10:30:00Z",
        "consecutive_failures": 0,
        "error_message": null
      },
      {
        "name": "nemotron",
        "status": "healthy",
        "last_check": "2025-12-27T10:30:00Z",
        "consecutive_failures": 0,
        "error_message": null
      }
    ],
    "available_features": ["detection", "analysis", "entity_tracking", "enrichment"]
  },
  "timestamp": "2025-12-27T10:30:00Z"
}

Pipeline Components:

Component	Description
`file_watcher`	Monitors camera directories for new uploads
`batch_aggregator`	Groups detections into time-based batches
`degradation`	Handles graceful degradation when services fail

FileWatcher Observer Types:

Type	Description
`native`	Uses OS-native filesystem events (inotify)
`polling`	Falls back to polling when native unavailable

Degradation Modes:

Mode	Description
`normal`	All services healthy, full functionality
`degraded`	Some services unhealthy, reduced functionality
`minimal`	Critical services only, limited processing
`offline`	All external services unavailable

WebSocket Event Registry¶

List all available WebSocket event types with schemas:

GET /api/system/websocket/events

Response:

{
  "event_types": [
    {
      "type": "detection.new",
      "channel": "detections",
      "description": "New detection from AI pipeline",
      "deprecated": false,
      "payload_schema": {
        "type": "object",
        "properties": {
          "detection_id": { "type": "integer" },
          "camera_id": { "type": "string" },
          "object_type": { "type": "string" },
          "confidence": { "type": "number" }
        }
      },
      "example": {
        "detection_id": 123,
        "camera_id": "front_door",
        "object_type": "person",
        "confidence": 0.95
      }
    },
    {
      "type": "event.created",
      "channel": "events",
      "description": "New security event created",
      "deprecated": false,
      "payload_schema": {
        "type": "object",
        "properties": {
          "event_id": { "type": "integer" },
          "camera_id": { "type": "string" },
          "risk_score": { "type": "integer" },
          "summary": { "type": "string" }
        }
      },
      "example": {
        "event_id": 456,
        "camera_id": "front_door",
        "risk_score": 75,
        "summary": "Person detected at front door"
      }
    }
  ],
  "channels": ["detections", "events", "alerts", "cameras", "jobs", "system"],
  "total_count": 24,
  "deprecated_count": 2
}

WebSocket Channels:

Channel	Description
`detections`	AI detection pipeline events
`events`	Security event lifecycle events
`alerts`	Alert notifications and state changes
`cameras`	Camera status and configuration
`jobs`	Background job lifecycle events
`system`	System health and status events

Event Type Naming Convention:

Events follow a hierarchical naming pattern: {domain}.{action}

Examples:

detection.new - New detection from pipeline
event.created - Security event created
event.updated - Security event modified
camera.status_changed - Camera status change
job.progress - Background job progress update
system.health_changed - System health status change

Full Health Check¶

Get comprehensive health status for all system components.

Endpoint¶

Method	Endpoint	Description
GET	`/api/system/health/full`	Comprehensive health check

Get Full Health¶

GET /api/system/health/full

Response: 200 OK (healthy/degraded) or 503 Service Unavailable (unhealthy)

{
  "status": "healthy",
  "ready": true,
  "message": "All systems operational",
  "postgres": {
    "name": "postgres",
    "status": "healthy",
    "message": "Database operational",
    "details": null
  },
  "redis": {
    "name": "redis",
    "status": "healthy",
    "message": "Redis connected",
    "details": { "redis_version": "7.0.0" }
  },
  "ai_services": [
    {
      "name": "yolo26",
      "display_name": "YOLO26 Object Detection",
      "status": "healthy",
      "url": "http://ai:8001",
      "response_time_ms": 15.2,
      "error": null,
      "circuit_state": "closed",
      "last_check": "2025-12-27T10:30:00Z"
    },
    {
      "name": "nemotron",
      "display_name": "Nemotron LLM Risk Analysis",
      "status": "healthy",
      "url": "http://ai:8002",
      "response_time_ms": 25.5,
      "error": null,
      "circuit_state": "closed",
      "last_check": "2025-12-27T10:30:00Z"
    },
    {
      "name": "florence",
      "display_name": "Florence-2 Vision Language",
      "status": "healthy",
      "url": "http://ai:8003",
      "response_time_ms": 18.3,
      "error": null,
      "circuit_state": "closed",
      "last_check": "2025-12-27T10:30:00Z"
    },
    {
      "name": "clip",
      "display_name": "CLIP Embedding Service",
      "status": "healthy",
      "url": "http://ai:8004",
      "response_time_ms": 12.1,
      "error": null,
      "circuit_state": "closed",
      "last_check": "2025-12-27T10:30:00Z"
    },
    {
      "name": "enrichment",
      "display_name": "Enrichment Service",
      "status": "healthy",
      "url": "http://ai:8005",
      "response_time_ms": 8.5,
      "error": null,
      "circuit_state": "closed",
      "last_check": "2025-12-27T10:30:00Z"
    }
  ],
  "circuit_breakers": {
    "total": 5,
    "open": 0,
    "half_open": 0,
    "closed": 5,
    "breakers": {
      "yolo26": "closed",
      "nemotron": "closed",
      "florence": "closed",
      "clip": "closed",
      "enrichment": "closed"
    }
  },
  "workers": [
    {
      "name": "file_watcher",
      "running": true,
      "critical": true
    },
    {
      "name": "cleanup_service",
      "running": true,
      "critical": false
    }
  ],
  "timestamp": "2025-12-27T10:30:00Z",
  "version": "0.1.0"
}

AI Services Tracked:

Service	Display Name	Critical
`yolo26`	YOLO26 Object Detection	Yes
`nemotron`	Nemotron LLM Risk Analysis	Yes
`florence`	Florence-2 Vision Language	No
`clip`	CLIP Embedding Service	No
`enrichment`	Enrichment Service	No

Overall Status Determination:

Condition	Status	Ready	HTTP Status
All services healthy	`healthy`	`true`	200
Only non-critical services unhealthy	`degraded`	`true`	200
Any critical service unhealthy	`unhealthy`	`false`	503

Critical Services:

PostgreSQL database
Redis cache
YOLO26 (object detection)
Nemotron (risk analysis)
File watcher worker

System Operations API - Core system health, GPU, configuration, and cleanup
Core Resources API - Cameras, events, detections
AI Pipeline API - Enrichment and batch processing
Real-time API - WebSocket streams
WebSocket Contracts - WebSocket message formats

Error Responses¶

All endpoints follow standard error response format:

{
  "detail": "Error description"
}

Status Code	Description
400	Invalid request parameters
404	Resource not found (worker, circuit breaker)
503	Service unavailable (supervisor not running)

System Monitoring API¶

System Monitoring Overview¶

Prometheus Monitoring¶

Endpoints¶

Monitoring Stack Health¶

Prometheus Scrape Targets¶

Worker Supervisor¶

Endpoints¶

Get Supervisor Status¶

Start Worker¶

Stop Worker¶

Restart Worker¶

Reset Worker¶

Worker Name Validation¶

Get Restart History¶

Pipeline Status¶

Endpoints¶

Get Pipeline Status¶

WebSocket Event Registry¶

Full Health Check¶

Endpoint¶

Get Full Health¶

Related Documentation¶

Error Responses¶