Error Recovery Flow

This document describes the error handling and recovery mechanisms used throughout the system, including circuit breakers, retry logic, and graceful degradation.

Circuit Breaker Pattern

Source: backend/services/circuit_breaker.py:1-42

# backend/services/circuit_breaker.py:1-42
"""Circuit breaker pattern implementation for external service protection.

This module provides a circuit breaker implementation that protects external services
from cascading failures. When a service experiences repeated failures, the circuit
breaker "opens" to prevent further calls, allowing the service time to recover.

States:
    - CLOSED: Normal operation, calls pass through
    - OPEN: Circuit tripped, calls are rejected immediately
    - HALF_OPEN: Recovery testing, limited calls allowed to test service health

Features:
    - Configurable failure thresholds and recovery timeouts
    - Half-open state for gradual recovery testing
    - Excluded exceptions that don't count as failures
    - Thread-safe async implementation
    - Registry for managing multiple circuit breakers
    - Prometheus metrics integration for monitoring
"""

Circuit Breaker State Machine

stateDiagram-v2
    [*] --> CLOSED: Initial state

    CLOSED --> OPEN: failure_count >= failure_threshold
    note right of CLOSED
        Normal operation
        Calls pass through
    end note

    OPEN --> HALF_OPEN: recovery_timeout elapsed
    note right of OPEN
        Service protection
        Calls rejected immediately
    end note

    HALF_OPEN --> CLOSED: success_count >= success_threshold
    HALF_OPEN --> OPEN: Any failure
    note right of HALF_OPEN
        Recovery testing
        Limited calls allowed
    end note
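
The transitions above amount to a small piece of bookkeeping. The following is a minimal, illustrative sketch of that state machine, not the project's implementation: it omits the half-open call limit, and the class and method names are hypothetical.

# Illustrative sketch of the transitions shown above; not the project's code.
import time

class CircuitStateMachine:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 success_threshold: int = 2) -> None:
        self.state = "closed"
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self._opened_at = 0.0

    def allow_call(self) -> bool:
        """OPEN -> HALF_OPEN once recovery_timeout elapses; reject while OPEN."""
        if self.state == "open" and time.monotonic() - self._opened_at >= self.recovery_timeout:
            self.state = "half_open"
            self.success_count = 0
        return self.state != "open"

    def record_success(self) -> None:
        if self.state == "half_open":
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = "closed"      # HALF_OPEN -> CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0         # any success resets the failure streak

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"            # trip: CLOSED -> OPEN, or HALF_OPEN -> OPEN
            self._opened_at = time.monotonic()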

CircuitState Enum

Source: backend/services/circuit_breaker.py:118-124

# backend/services/circuit_breaker.py:118-124
class CircuitState(StrEnum):
    """Circuit breaker states."""

    CLOSED = auto()
    OPEN = auto()
    HALF_OPEN = auto()
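
Because CircuitState is a StrEnum whose members use auto(), each member's value is its lowercased name, so states compare equal to plain strings. This is convenient for logging and for metric label values:

# StrEnum + auto() yields lowercase string values.
from enum import StrEnum, auto

class CircuitState(StrEnum):
    CLOSED = auto()
    OPEN = auto()
    HALF_OPEN = auto()

assert CircuitState.OPEN == "open"
assert CircuitState.HALF_OPEN == "half_open"
assert f"state={CircuitState.CLOSED}" == "state=closed"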

CircuitBreakerConfig

Source: backend/services/circuit_breaker.py:126-142

# backend/services/circuit_breaker.py:126-142
@dataclass(slots=True)
class CircuitBreakerConfig:
    """Configuration for circuit breaker behavior.

    Attributes:
        failure_threshold: Number of failures before opening circuit
        recovery_timeout: Seconds to wait before transitioning to half-open
        half_open_max_calls: Maximum calls allowed in half-open state
        success_threshold: Successes needed in half-open to close circuit
        excluded_exceptions: Exception types that don't count as failures
    """

    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    success_threshold: int = 2
    excluded_exceptions: tuple[type[Exception], ...] = ()

Service Circuit Breaker Configurations

Source: backend/main.py:248-297

AI Services (Aggressive)

# backend/main.py:265-270
ai_service_config = CircuitBreakerConfig(
    failure_threshold=5,        # Opens after 5 consecutive failures
    recovery_timeout=30.0,      # Wait 30s before recovery attempt
    half_open_max_calls=3,      # Allow 3 test calls in half-open
    success_threshold=2,        # 2 successes close the circuit
)

Infrastructure Services (Tolerant)

# backend/main.py:273-278
infrastructure_config = CircuitBreakerConfig(
    failure_threshold=10,       # More tolerant - 10 failures
    recovery_timeout=60.0,      # Longer recovery - 60s
    half_open_max_calls=5,      # More test calls
    success_threshold=3,        # More successes needed
)

Service-Specific Configurations

| Service    | Failure Threshold | Recovery Timeout | Source                                |
|------------|-------------------|------------------|---------------------------------------|
| YOLO26     | 5                 | 60s              | Dedicated config (detector_client.py) |
| Nemotron   | 5                 | 30s              | AI config                             |
| PostgreSQL | 10                | 60s              | Infrastructure config                 |
| Redis      | 10                | 60s              | Infrastructure config                 |

Circuit Breaker Sequence Diagram

sequenceDiagram
    participant Client
    participant CB as CircuitBreaker
    participant Service as External Service

    Note over CB: State: CLOSED

    Client->>CB: Request 1
    CB->>Service: Forward request
    Service--xCB: Failure
    CB->>CB: failure_count++

    Client->>CB: Request 2
    CB->>Service: Forward request
    Service--xCB: Failure
    CB->>CB: failure_count++

    Note over CB: ... (failures continue)

    Client->>CB: Request 5
    CB->>Service: Forward request
    Service--xCB: Failure
    CB->>CB: failure_count = 5 >= threshold
    CB->>CB: State: OPEN
    CB-->>Client: CircuitBreakerError

    Note over CB: State: OPEN (30s timeout)

    Client->>CB: Request 6
    CB-->>Client: CircuitBreakerError (immediate)

    Note over CB: 30 seconds elapse

    CB->>CB: State: HALF_OPEN

    Client->>CB: Request 7 (test call)
    CB->>Service: Forward request
    Service-->>CB: Success
    CB->>CB: success_count++

    Client->>CB: Request 8 (test call)
    CB->>Service: Forward request
    Service-->>CB: Success
    CB->>CB: success_count = 2 >= threshold
    CB->>CB: State: CLOSED

    Note over CB: Normal operation resumes

YOLO26 Circuit Breaker

Source: backend/services/detector_client.py:295-309

# backend/services/detector_client.py:295-309
self._circuit_breaker = CircuitBreaker(
    name="yolo26",
    config=CircuitBreakerConfig(
        failure_threshold=5,        # Opens after 5 consecutive failures
        recovery_timeout=60.0,      # Waits 60 seconds before attempting recovery
        half_open_max_calls=3,      # Allow 3 test calls in half-open
        success_threshold=2,        # 2 successes close the circuit
        excluded_exceptions=(ValueError,),  # HTTP 4xx errors don't trip circuit
    ),
)
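
The excluded ValueError only works if the client raises it for non-retryable responses. A hedged sketch of that mapping follows; the real DetectorClient's error handling may differ, and httpx is assumed as the HTTP library:

# Illustrative: translate HTTP 4xx into ValueError so excluded_exceptions
# skips them; 5xx still counts as a circuit breaker failure.
import httpx

async def call_detector(client: httpx.AsyncClient, url: str, payload: dict) -> dict:
    response = await client.post(url, json=payload)
    if 400 <= response.status_code < 500:
        # Caller error, not a service failure: excluded from breaker counts.
        raise ValueError(f"detector rejected request: HTTP {response.status_code}")
    response.raise_for_status()  # 5xx raises httpx.HTTPStatusError
    return response.json()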

Retry Logic

Exponential Backoff with Jitter

Source: backend/services/event_broadcaster.py:145-244

# backend/services/event_broadcaster.py:204-209
# Calculate exponential backoff with jitter
delay = min(base_delay * (2**attempt), max_delay)
jitter = delay * random.uniform(0.1, 0.3)  # 10-30% jitter
total_delay = delay + jitter
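
Run standalone, the same formula reproduces the schedule in the table below. This is a minimal sketch: the constants mirror the defaults used by broadcast_with_retry, and the 0-based attempt numbering is an assumption.

# Standalone reproduction of the backoff formula above.
import random

BASE_DELAY = 1.0   # DEFAULT_BASE_DELAY
MAX_DELAY = 30.0   # DEFAULT_MAX_DELAY

def retry_delay(attempt: int) -> float:
    """Delay before retry `attempt` (0-based), with 10-30% jitter."""
    delay = min(BASE_DELAY * (2**attempt), MAX_DELAY)
    jitter = delay * random.uniform(0.1, 0.3)
    return delay + jitter

for attempt in range(4):
    print(f"retry {attempt + 1}: ~{retry_delay(attempt):.1f}s")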

Retry Timing

| Attempt | Base Delay | With Jitter | Cumulative |
|---------|------------|-------------|------------|
| 1       | 0s         | 0s          | 0s         |
| 2       | 2s         | 2.2-2.6s    | ~2.4s      |
| 3       | 4s         | 4.4-5.2s    | ~7.2s      |
| 4       | 8s         | 8.8-10.4s   | ~16.8s     |
| Max     | 30s        | 33-39s      | Capped     |

Detector Client Retry

Source: backend/services/detector_client.py:31-34

# backend/services/detector_client.py:31-34
# Retry Logic (NEM-1343):
#     - Configurable max retries via DETECTOR_MAX_RETRIES setting (default: 3)
#     - Exponential backoff: 2^attempt seconds between retries (capped at 30s)
#     - Only retries transient failures (connection, timeout, HTTP 5xx)

Nemotron Analyzer Retry

Source: backend/services/nemotron_analyzer.py:24-27

# backend/services/nemotron_analyzer.py:24-27
# Retry Logic (NEM-1343):
#     - Configurable max retries via NEMOTRON_MAX_RETRIES setting (default: 3)
#     - Exponential backoff: 2^attempt seconds between retries (capped at 30s)
#     - Only retries transient failures (connection, timeout, HTTP 5xx)

Broadcast Retry

Source: backend/services/event_broadcaster.py:145-182

# backend/services/event_broadcaster.py:145-182
async def broadcast_with_retry[T](
    broadcast_func: Callable[[], Awaitable[T]],
    message_type: str,
    *,
    max_retries: int = DEFAULT_MAX_RETRIES,    # 3
    base_delay: float = DEFAULT_BASE_DELAY,     # 1.0s
    max_delay: float = DEFAULT_MAX_DELAY,       # 30.0s
    metrics: BroadcastRetryMetrics | None = None,
) -> T:
    """Execute a broadcast function with retry logic and exponential backoff.

    This function wraps any broadcast operation with retry logic that:
    - Uses exponential backoff (1s, 2s, 4s, etc.) with jitter
    - Logs each retry attempt with context
    - Records metrics for monitoring
    - Raises the final exception if all retries are exhausted
    """

Error Recovery Matrix

Detection Pipeline Errors

| Error              | Location       | Handling       | Recovery          |
|--------------------|----------------|----------------|-------------------|
| Image truncated    | FileWatcher    | Skip file      | Camera retries    |
| Connection refused | DetectorClient | Retry 3x       | Circuit breaker   |
| Timeout            | DetectorClient | Retry 3x       | Circuit breaker   |
| HTTP 5xx           | DetectorClient | Retry 3x       | Circuit breaker   |
| HTTP 4xx           | DetectorClient | No retry       | Log and skip      |
| Circuit open       | DetectorClient | Immediate fail | Wait for recovery |
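
The retry/no-retry split in this table comes down to classifying exceptions. A hedged sketch of such a predicate follows; it is illustrative only, httpx is assumed, and the real DetectorClient may classify differently:

# Illustrative classifier for the table above: retry transient failures only.
import httpx

def is_transient(exc: Exception) -> bool:
    if isinstance(exc, (httpx.ConnectError, httpx.TimeoutException)):
        return True                                # connection refused / timeout
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code >= 500     # 5xx yes, 4xx no
    return False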

Analysis Pipeline Errors

| Error            | Location         | Handling | Recovery           |
|------------------|------------------|----------|--------------------|
| Batch not found  | NemotronAnalyzer | Skip     | Data inconsistency |
| LLM timeout      | NemotronAnalyzer | Retry 3x | Backoff            |
| Parse error      | NemotronAnalyzer | No retry | Log and skip       |
| Validation error | NemotronAnalyzer | No retry | Log and skip       |

Enrichment Pipeline Errors

| Error               | Location           | Handling      | Recovery                        |
|---------------------|--------------------|---------------|---------------------------------|
| Service unavailable | EnrichmentPipeline | Skip model    | Continue with partial results   |
| Timeout             | EnrichmentPipeline | Skip model    | Continue with partial results   |
| All models fail     | EnrichmentPipeline | Return FAILED | LLM analyzes without enrichment |
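
The "continue with partial results" strategy maps naturally onto asyncio.gather with return_exceptions=True. A minimal sketch, assuming each enrichment model is an independent awaitable (the real EnrichmentPipeline's structure may differ):

# Illustrative: run enrichment models concurrently, keep whatever succeeds.
import asyncio
from collections.abc import Awaitable

async def enrich_partial(models: dict[str, Awaitable[dict]]) -> dict[str, dict]:
    names = list(models)
    results = await asyncio.gather(*models.values(), return_exceptions=True)
    # Failed models are skipped; the LLM proceeds with whatever enrichment exists.
    return {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}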

Broadcast Errors

| Error             | Location         | Handling       | Recovery           |
|-------------------|------------------|----------------|--------------------|
| Redis unavailable | EventBroadcaster | Retry 3x       | Degrade gracefully |
| WebSocket closed  | EventBroadcaster | Buffer message | Client reconnects  |
| All retries fail  | EventBroadcaster | Log error      | Message lost       |

WebSocket Circuit Breaker

Source: backend/services/event_broadcaster.py:388-396

# backend/services/event_broadcaster.py:388-396
# Circuit breaker for WebSocket connection resilience
self._circuit_breaker = WebSocketCircuitBreaker(
    failure_threshold=self.MAX_RECOVERY_ATTEMPTS,  # 5
    recovery_timeout=30.0,
    half_open_max_calls=1,
    success_threshold=1,
    name="event_broadcaster",
)

Recovery Sequence Diagram

sequenceDiagram
    participant DW as Detection Worker
    participant CB as Circuit Breaker
    participant RT as YOLO26
    participant DLQ as Dead Letter Queue

    DW->>CB: Send detection request
    CB->>RT: Forward request
    RT--xCB: Connection refused

    loop Retry with backoff
        DW->>CB: Retry (attempt 2)
        CB->>RT: Forward request
        RT--xCB: Timeout

        DW->>CB: Retry (attempt 3)
        CB->>RT: Forward request
        RT--xCB: HTTP 503
    end

    Note over DW: Max retries exhausted

    CB->>CB: failure_count = 5
    CB->>CB: State: OPEN

    DW->>DLQ: Move to dead letter queue

    Note over CB: Recovery timeout (60s)

    CB->>CB: State: HALF_OPEN

    DW->>CB: New request (test)
    CB->>RT: Forward request
    RT-->>CB: Success

    CB->>CB: success_count++
    CB->>CB: State: CLOSED

    Note over DW: Normal operation resumes

Prometheus Metrics

Source: backend/services/circuit_breaker.py:64-97

# backend/services/circuit_breaker.py:64-97
# Legacy metrics
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state",
    "Current state of the circuit breaker (0=closed, 1=open, 2=half_open)",
    labelnames=["service"],
)

CIRCUIT_BREAKER_FAILURES_TOTAL = Counter(
    "circuit_breaker_failures_total",
    "Total number of failures recorded by the circuit breaker",
    labelnames=["service"],
)

CIRCUIT_BREAKER_STATE_CHANGES_TOTAL = Counter(
    "circuit_breaker_state_changes_total",
    "Total number of state transitions",
    labelnames=["service", "from_state", "to_state"],
)

# HSI-prefixed metrics (Grafana dashboard)
HSI_CIRCUIT_BREAKER_STATE = Gauge(
    "hsi_circuit_breaker_state",
    "Current state of the circuit breaker (0=closed, 1=open, 2=half_open)",
    labelnames=["service"],
)

HSI_CIRCUIT_BREAKER_TRIPS_TOTAL = Counter(
    "hsi_circuit_breaker_trips_total",
    "Total number of times the circuit breaker has tripped",
    labelnames=["service"],
)
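
A breaker updates these metrics on every transition. The following hedged sketch shows one way to wire that up, using the metric objects defined above; record_state_change is a hypothetical helper, and the 0/1/2 encoding follows the gauge's help text:

# Illustrative: record a state transition in both metric families.
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}

def record_state_change(service: str, from_state: str, to_state: str) -> None:
    CIRCUIT_BREAKER_STATE.labels(service=service).set(STATE_VALUES[to_state])
    HSI_CIRCUIT_BREAKER_STATE.labels(service=service).set(STATE_VALUES[to_state])
    CIRCUIT_BREAKER_STATE_CHANGES_TOTAL.labels(
        service=service, from_state=from_state, to_state=to_state
    ).inc()
    if to_state == "open":
        HSI_CIRCUIT_BREAKER_TRIPS_TOTAL.labels(service=service).inc()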

CircuitBreakerMetrics

Source: backend/services/circuit_breaker.py:145-184

# backend/services/circuit_breaker.py:145-184
@dataclass(slots=True)
class CircuitBreakerMetrics:
    """Metrics for circuit breaker monitoring.

    Attributes:
        name: Circuit breaker name
        state: Current state
        failure_count: Consecutive failures
        success_count: Consecutive successes in half-open
        total_calls: Total calls attempted
        rejected_calls: Calls rejected due to open circuit
        last_failure_time: Timestamp of last failure
        last_state_change: Timestamp of last state transition
    """

    name: str
    state: CircuitState
    failure_count: int = 0
    success_count: int = 0
    total_calls: int = 0
    rejected_calls: int = 0
    last_failure_time: datetime | None = None
    last_state_change: datetime | None = None

Graceful Degradation Strategies

Detection Service Down

Normal Flow:
  Image -> YOLO26 -> Detection -> Batch -> LLM

Degraded Flow (YOLO26 down):
  Image -> Queue (waiting) -> DLQ after max retries

Recovery:
  Circuit closes -> Process DLQ -> Resume normal flow

LLM Service Down

Normal Flow:
  Batch -> Nemotron LLM -> Event -> Broadcast

Degraded Flow (Nemotron down):
  Batch -> Queue (waiting) -> DLQ after max retries

Recovery:
  Circuit closes -> Process DLQ -> Resume normal flow

Enrichment Service Down

Normal Flow:
  Detection -> Enrichment -> LLM (enriched prompt)

Degraded Flow (Enrichment down):
  Detection -> LLM (basic prompt, no enrichment)

Recovery:
  Service available -> Full enrichment resumes

Redis Down

Normal Flow:
  Event -> Redis Pub/Sub -> WebSocket Clients

Degraded Flow (Redis down):
  Event -> DB (events still stored)
  WebSocket: No real-time updates

Recovery:
  Redis available -> Pub/Sub resumes
  Clients reconnect -> Get latest state via API

Dead Letter Queue

When all retries are exhausted, messages go to the Dead Letter Queue:

# DLQ structure
DLQ:detection_queue -> [failed detection jobs]
DLQ:analysis_queue -> [failed analysis batches]
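
A minimal sketch of moving an exhausted job onto its DLQ list, following the DLQ:<queue> key naming above; the job payload shape and the redis.asyncio usage are assumptions, not the project's confirmed implementation:

# Illustrative: push a permanently failed job to its dead letter queue.
import json
import redis.asyncio as redis

async def move_to_dlq(r: redis.Redis, queue: str, job: dict, error: str) -> None:
    entry = json.dumps({"job": job, "error": error})
    await r.lpush(f"DLQ:{queue}", entry)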

DLQ Processing

  1. Manual inspection via /api/dlq endpoint
  2. Automatic retry after recovery
  3. Manual requeue via admin API