System API¶
The System API provides endpoints for health monitoring, system configuration, GPU statistics, and operational management in the NEM home security monitoring system.
Source: backend/api/routes/system.py
Overview¶
The System API provides:
- Health check endpoints for Kubernetes probes
- GPU and system statistics
- Configuration management
- Circuit breaker status
- Worker monitoring
Health Check Endpoints¶
Detailed Health Check¶
Returns a detailed system health check covering the database, Redis, and AI services.
Source: backend/api/routes/system.py:1049-1181
Response Caching¶
Results are cached for 5 seconds to reduce load from frequent health probes.
Source: backend/api/routes/system.py:308
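The 5-second cache described above can be pictured as a small time-to-live wrapper around the health check. This is a minimal sketch rather than the code in `system.py`: the `TTLCache` class, its method names, and the injectable `clock` are all hypothetical, introduced only for illustration.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Hold one computed value and reuse it until the TTL elapses (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested without sleeping
        self._value: Any = None
        self._expires_at = 0.0
        self._has_value = False

    def get_or_compute(self, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        if self._has_value and now < self._expires_at:
            return self._value       # still fresh: skip the expensive check
        self._value = compute()      # stale or empty: run the check and remember the result
        self._expires_at = now + self._ttl
        self._has_value = True
        return self._value
```

With a TTL of 5.0, any number of probes arriving within the same 5-second window trigger at most one real check.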
Response¶
{
"status": "healthy",
"services": {
"database": {
"status": "healthy",
"message": "Database operational",
"details": {
"pool": {
"size": 5,
"overflow": 0,
"checkedin": 4,
"checkedout": 1,
"total_connections": 5
}
}
},
"redis": {
"status": "healthy",
"message": "Redis connected",
"details": {
"redis_version": "7.4.0"
}
},
"ai": {
"status": "healthy",
"message": "AI services operational",
"details": {
"yolo26": "healthy",
"nemotron": "healthy"
}
}
},
"timestamp": "2026-01-23T12:00:00Z",
"recent_events": [
{
"timestamp": "2026-01-23T11:55:00Z",
"service": "redis",
"event_type": "recovery",
"message": "Redis connection restored"
}
]
}
Health Status Values¶
| Status | Description |
|---|---|
| healthy | All services operational |
| degraded | Some services unhealthy but core functionality available |
| unhealthy | Critical services down |
HTTP Status Codes¶
| Code | Description |
|---|---|
| 200 | Healthy |
| 503 | Degraded or unhealthy |
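Read together, the two tables above imply that only `healthy` maps to 200, while `degraded` and `unhealthy` both map to 503. A minimal sketch of that mapping follows; the function name `http_status_for` is hypothetical and not taken from the source.

```python
def http_status_for(health_status: str) -> int:
    """Map an overall health status to the HTTP code listed in the table above."""
    # Only "healthy" yields 200; "degraded" and "unhealthy" both yield 503.
    return 200 if health_status == "healthy" else 503
```

A probe that only inspects the status code will therefore treat a degraded system the same way as an unhealthy one.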
Readiness Probe¶
Kubernetes-style readiness probe that reports the result of each check and the state of each pipeline worker.
Source: backend/api/routes/system.py:1188-1328
Checks Performed¶
- Database connectivity (critical)
- Redis connectivity (required for queue processing)
- Critical pipeline workers (detection, analysis)
- Worker supervisor health
Source: backend/api/routes/system.py:564-596
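The checks listed above can be combined into a single readiness decision. The sketch below assumes every listed check must report `healthy` for the probe to pass. That rule is an assumption made for illustration, as is the function name `evaluate_readiness`; the actual logic in `system.py` may weigh the checks differently.

```python
from typing import Dict, Tuple


def evaluate_readiness(checks: Dict[str, str]) -> Tuple[bool, int]:
    """Return (ready, http_status) for a mapping of check name to state.

    Assumption: all checks are required, so a single non-healthy check
    makes the service not ready.
    """
    ready = bool(checks) and all(state == "healthy" for state in checks.values())
    return ready, (200 if ready else 503)
```

For example, a `checks` mapping in which `redis` is `unhealthy` and everything else is `healthy` yields `(False, 503)`.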
Response¶
{
"ready": true,
"status": "ready",
"checks": {
"database": "healthy",
"redis": "healthy",
"pipeline_workers": "healthy",
"supervisor": "healthy"
},
"workers": [
{
"name": "detection_worker",
"running": true,
"message": null
},
{
"name": "analysis_worker",
"running": true,
"message": null
},
{
"name": "batch_aggregator",
"running": true,
"message": null
}
]
}
HTTP Status Codes¶
| Code | Description |
|---|---|
| 200 | Ready to receive traffic |
| 503 | Not ready |
WebSocket Health¶
Check WebSocket broadcaster health.
Source: backend/api/routes/system.py:1331-1394
Response¶
{
"status": "healthy",
"connected_clients": 5,
"broadcaster_running": true,
"last_broadcast": "2026-01-23T12:00:00Z"
}
Full Health Check¶
Comprehensive health check including all AI services and circuit breakers.
Source: backend/api/schemas/health.py:316-385
Response¶
{
"status": "healthy",
"ready": true,
"message": "All systems operational",
"postgres": {
"name": "postgres",
"status": "healthy",
"message": "Database operational",
"details": null
},
"redis": {
"name": "redis",
"status": "healthy",
"message": "Redis connected",
"details": {
"redis_version": "7.4.0"
}
},
"ai_services": [
{
"name": "yolo26",
"display_name": "YOLO26 Object Detection",
"status": "healthy",
"url": "http://ai-yolo26:8095",
"response_time_ms": 45.2,
"circuit_state": "closed",
"error": null,
"last_check": "2026-01-23T12:00:00Z"
},
{
"name": "nemotron",
"display_name": "Nemotron LLM",
"status": "healthy",
"url": "http://nemotron:8080",
"response_time_ms": 120.5,
"circuit_state": "closed",
"error": null,
"last_check": "2026-01-23T12:00:00Z"
}
],
"circuit_breakers": {
"total": 5,
"closed": 5,
"open": 0,
"half_open": 0,
"breakers": {
"yolo26": "closed",
"nemotron": "closed",
"florence": "closed",
"clip": "closed",
"enrichment": "closed"
}
},
"workers": [
{
"name": "file_watcher",
"running": true,
"critical": true
}
],
"timestamp": "2026-01-23T12:00:00Z",
"version": "0.1.0"
}
Source: backend/api/schemas/health.py:316-385
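The `circuit_breakers` object in the response pairs per-breaker states with aggregate counts. As an illustration of how the two relate, the sketch below derives the counts from the `breakers` mapping. The helper name `summarize_breakers` is hypothetical; the field names are those shown in the example response.

```python
from collections import Counter
from typing import Dict


def summarize_breakers(breakers: Dict[str, str]) -> dict:
    """Build the aggregate counts from a mapping of breaker name to state."""
    counts = Counter(breakers.values())
    return {
        "total": len(breakers),
        "closed": counts["closed"],        # Counter returns 0 for states that do not occur
        "open": counts["open"],
        "half_open": counts["half_open"],
        "breakers": dict(breakers),
    }
```

A client can use the same calculation to verify that `total` equals the sum of `closed`, `open`, and `half_open`.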
Monitoring Endpoints¶
Prometheus-Style Health¶
Prometheus-compatible health endpoint for monitoring integrations.
Response¶
Monitoring Targets¶
Get monitoring target information for service discovery.
Response¶
{
"targets": [
{
"name": "backend",
"url": "http://backend:8000",
"health_endpoint": "/api/system/health"
},
{
"name": "yolo26",
"url": "http://ai-yolo26:8095",
"health_endpoint": "/health"
}
]
}
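A service-discovery client would typically combine each target's `url` and `health_endpoint` into one probe URL. A minimal sketch, assuming the two fields are meant to be concatenated (the helper name `health_urls` is hypothetical):

```python
from typing import Dict


def health_urls(payload: dict) -> Dict[str, str]:
    """Map each target name to its full health-check URL."""
    return {
        target["name"]: target["url"].rstrip("/") + target["health_endpoint"]
        for target in payload["targets"]
    }
```

Applied to the example response, this gives `http://backend:8000/api/system/health` for `backend` and `http://ai-yolo26:8095/health` for `yolo26`.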
GPU Statistics¶
Get GPU Stats¶
Get current GPU utilization and memory statistics.
Source: backend/api/routes/system.py:634-683
Response¶
{
"recorded_at": "2026-01-23T12:00:00Z",
"gpu_name": "NVIDIA GeForce RTX 4090",
"utilization": 45.5,
"memory_used": 8192,
"memory_total": 24576,
"temperature": 65,
"power_usage": 250,
"inference_fps": 30.5,
"fan_speed": 45,
"sm_clock": 2520,
"memory_bandwidth_utilization": 35.2,
"pstate": "P0",
"throttle_reasons": [],
"power_limit": 450,
"sm_clock_max": 2520,
"compute_processes_count": 3,
"pcie_replay_counter": 0,
"temp_slowdown_threshold": 83,
"memory_clock": 10501,
"memory_clock_max": 10501,
"pcie_link_gen": 4,
"pcie_link_width": 16,
"pcie_tx_throughput": 1024,
"pcie_rx_throughput": 512,
"encoder_utilization": 0,
"decoder_utilization": 15,
"bar1_used": 256
}
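Several fields in this response are most useful as ratios or differences. The sketch below derives three such values on the client side. It assumes only that `memory_used`/`memory_total`, `power_usage`/`power_limit`, and `temperature`/`temp_slowdown_threshold` each share a unit, since the source does not state the units. The helper name `gpu_headroom` and its output keys are hypothetical.

```python
def gpu_headroom(stats: dict) -> dict:
    """Derive unit-independent indicators from a GPU stats payload."""
    return {
        # Share of GPU memory currently in use, as a percentage.
        "memory_used_pct": round(100.0 * stats["memory_used"] / stats["memory_total"], 1),
        # Share of the power limit currently drawn, as a percentage.
        "power_used_pct": round(100.0 * stats["power_usage"] / stats["power_limit"], 1),
        # Distance between the current temperature and the slowdown threshold.
        "thermal_headroom": stats["temp_slowdown_threshold"] - stats["temperature"],
    }
```

For the example response, memory use is about a third of the total, power draw is a little over half of the limit, and the temperature is 18 units below the slowdown threshold.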
Caching¶
Results are cached for 5 seconds.
Source: backend/api/routes/system.py:347-356
Get GPU Stats History¶
Get historical GPU statistics.
Query Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| hours | integer | 24 | Number of hours to retrieve (1-168) |
| interval | string | "5m" | Aggregation interval (1m, 5m, 15m, 1h) |
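Clients can validate these parameters before sending a request. The sketch below applies the documented range and interval set and builds a query string. The function name `build_history_query` is hypothetical, and the endpoint path is left out because it is not given here.

```python
from urllib.parse import urlencode

ALLOWED_INTERVALS = ("1m", "5m", "15m", "1h")


def build_history_query(hours: int = 24, interval: str = "5m") -> str:
    """Validate the documented parameters and return an encoded query string."""
    if not 1 <= hours <= 168:
        raise ValueError("hours must be between 1 and 168")
    if interval not in ALLOWED_INTERVALS:
        raise ValueError(f"interval must be one of {ALLOWED_INTERVALS}")
    return urlencode({"hours": hours, "interval": interval})
```

Calling it with no arguments produces the defaults from the table, `hours=24&interval=5m`.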
Configuration¶
Get Configuration¶
Get current system configuration.
Response¶
{
"batch_timeout_seconds": 90,
"batch_idle_timeout_seconds": 30,
"retention_days": 30,
"detection_confidence_threshold": 0.5,
"risk_score_thresholds": {
"low": 0,
"medium": 30,
"high": 60,
"critical": 80
}
}
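One plausible reading of `risk_score_thresholds` is that each value is the inclusive lower bound of its level. That interpretation is an assumption, not something the source states, and the helper name `risk_level` is hypothetical. Under that assumption, a score can be classified as follows:

```python
from typing import Dict, Optional


def risk_level(score: float, thresholds: Dict[str, float]) -> Optional[str]:
    """Return the highest level whose lower bound the score reaches.

    Assumption: each threshold is an inclusive lower bound for its level.
    """
    level = None
    for name, minimum in sorted(thresholds.items(), key=lambda item: item[1]):
        if score >= minimum:
            level = name
    return level
```

With the example thresholds, a score of 29 would fall under `low` and a score of 30 under `medium`.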
Update Configuration¶
Update system configuration.
Request Body¶
Circuit Breakers¶
Get Circuit Breaker Status¶
Get status of all circuit breakers.
Response¶
{
"breakers": [
{
"name": "yolo26",
"state": "closed",
"failure_count": 0,
"last_failure": null,
"last_success": "2026-01-23T12:00:00Z"
},
{
"name": "nemotron",
"state": "open",
"failure_count": 3,
"last_failure": "2026-01-23T11:55:00Z",
"last_success": "2026-01-23T11:50:00Z",
"reset_at": "2026-01-23T12:00:30Z"
}
]
}
Circuit Breaker States¶
| State | Description |
|---|---|
| closed | Normal operation, requests pass through |
| open | Service failing, requests fail immediately |
| half_open | Testing recovery, limited requests allowed |
Source: backend/api/schemas/health.py:187-198
Reset Circuit Breaker¶
Manually reset a circuit breaker to closed state.
Path Parameters¶
| Parameter | Type | Description |
|---|---|---|
| name | string | Circuit breaker name |
Worker Management¶
Get Worker Status¶
Get status of all background workers.
Response¶
{
"workers": [
{
"name": "gpu_monitor",
"running": true,
"message": null
},
{
"name": "cleanup_service",
"running": true,
"message": null
},
{
"name": "file_watcher",
"running": true,
"message": null
},
{
"name": "detection_worker",
"running": true,
"message": null
},
{
"name": "analysis_worker",
"running": true,
"message": null
},
{
"name": "batch_aggregator",
"running": true,
"message": null
}
]
}
Source: backend/api/routes/system.py:445-561
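A monitoring script can reduce this response to the workers that need attention. A minimal sketch, using only the `name`, `running`, and `message` fields shown above (the helper name `stopped_workers` is hypothetical):

```python
from typing import List, Tuple


def stopped_workers(payload: dict) -> List[Tuple[str, object]]:
    """Return (name, message) for every worker that is not running."""
    return [
        (worker["name"], worker.get("message"))
        for worker in payload["workers"]
        if not worker["running"]
    ]
```

An empty list means every background worker reported `running: true`.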
Data Models¶
HealthResponse¶
| Field | Type | Description |
|---|---|---|
| status | string | Overall status (healthy, degraded, unhealthy) |
| services | object | Individual service statuses |
| timestamp | datetime | Response timestamp |
| recent_events | array | Recent health events |
ReadinessResponse¶
Source: backend/api/schemas/health.py:86-134
| Field | Type | Description |
|---|---|---|
| ready | boolean | Overall readiness |
| checks | object | Individual check results |
CheckResult¶
Source: backend/api/schemas/health.py:49-83
| Field | Type | Description |
|---|---|---|
| status | string | healthy, unhealthy, or degraded |
| latency_ms | float | Check latency in milliseconds |
| error | string | Error message if unhealthy |
ServiceHealthState Enum¶
Source: backend/api/schemas/health.py:171-184
- healthy - Service is fully operational
- unhealthy - Service is down or has critical issues
- degraded - Partially operational
- unknown - Status cannot be determined
CircuitState Enum¶
Source: backend/api/schemas/health.py:187-198
- closed - Normal operation
- open - Service failing, requests blocked
- half_open - Testing recovery
AIServiceHealthStatus¶
Source: backend/api/schemas/health.py:201-234
| Field | Type | Description |
|---|---|---|
| name | string | Service identifier |
| display_name | string | Human-readable name |
| status | string | Health state |
| url | string | Service URL |
| response_time_ms | float | Response time in milliseconds |
| circuit_state | string | Circuit breaker state |
| error | string | Error message |
| last_check | datetime | Last check timestamp |
InfrastructureHealthStatus¶
Source: backend/api/schemas/health.py:237-259
| Field | Type | Description |
|---|---|---|
| name | string | Service name |
| status | string | Health state |
| message | string | Status message |
| details | object | Additional details |
WorkerHealthStatus¶
Source: backend/api/schemas/health.py:295-313
| Field | Type | Description |
|---|---|---|
| name | string | Worker name |
| running | boolean | Running status |
| critical | boolean | Whether worker is critical |
Circuit Breaker Implementation¶
The health check system uses circuit breakers to prevent cascading failures.
Source: backend/api/routes/system.py:158-254
Configuration¶
| Parameter | Default | Description |
|---|---|---|
| failure_threshold | 3 | Failures before opening circuit |
| reset_timeout | 30s | Time before retrying |
Behavior¶
- Closed: Normal operation, health checks executed
- After failure_threshold failures (default 3): Circuit opens, health checks skipped
- After timeout: Circuit becomes half-open, allows one request
- On success: Circuit closes, normal operation resumes
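The behavior listed above amounts to a three-state machine. The sketch below is one way to express it, using the documented defaults. It is not the implementation in `system.py`: the class and method names are hypothetical, and two details are assumptions, namely that a failed probe in the half-open state reopens the circuit, and that a success resets the failure count.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Illustrative closed / open / half_open state machine."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self.failure_count = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"   # timeout elapsed: let one probe through
                return True
            return False                   # still open: skip the health check
        if self.state == "half_open":
            return False                   # a probe is already in flight
        return True                        # closed: normal operation

    def record_success(self) -> None:
        self.state = "closed"
        self.failure_count = 0             # assumption: success clears the counter

    def record_failure(self) -> None:
        self.failure_count += 1
        # Assumption: a failed half-open probe reopens the circuit immediately.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each health check and report the outcome with `record_success()` or `record_failure()`.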
Timeouts and Limits¶
| Constant | Value | Description |
|---|---|---|
| HEALTH_CHECK_TIMEOUT_SECONDS | 5.0 | Health check timeout |
| HEALTH_CACHE_TTL_SECONDS | 5.0 | Health cache duration |
| AI_HEALTH_CHECK_TIMEOUT_SECONDS | 3.0 | AI service check timeout |
| MAX_CONCURRENT_HEALTH_CHECKS | 10 | Max concurrent checks |
Source: backend/api/routes/system.py:297-298, 308, 820-824
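To show how a per-check timeout and a concurrency cap usually work together, the sketch below runs a set of async checks under `asyncio.wait_for` and an `asyncio.Semaphore`. Only the two constant values come from the table above. The function `run_checks`, its signature, and the choice to report timeouts and exceptions as `unhealthy` are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict

HEALTH_CHECK_TIMEOUT_SECONDS = 5.0
MAX_CONCURRENT_HEALTH_CHECKS = 10


async def run_checks(
    checks: Dict[str, Callable[[], Awaitable[None]]],
    timeout: float = HEALTH_CHECK_TIMEOUT_SECONDS,
    max_concurrent: int = MAX_CONCURRENT_HEALTH_CHECKS,
) -> Dict[str, str]:
    """Run all checks concurrently, bounded by a semaphore and a per-check timeout."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(name: str, check: Callable[[], Awaitable[None]]):
        async with semaphore:                      # at most max_concurrent checks at once
            try:
                await asyncio.wait_for(check(), timeout=timeout)
                return name, "healthy"
            except asyncio.TimeoutError:
                return name, "unhealthy"           # assumption: a timeout counts as unhealthy
            except Exception:
                return name, "unhealthy"           # assumption: any error counts as unhealthy

    results = await asyncio.gather(*(run_one(n, c) for n, c in checks.items()))
    return dict(results)
```

Because each check is wrapped individually, one slow service delays the overall response by at most the timeout instead of blocking the other checks.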
Related Documentation¶
- Health Schemas - Schema details
- Error Handling - Error response formats
- Background Services - Worker documentation