Resilience Architecture

This document details the resilience patterns implemented in the Home Security Intelligence system to ensure reliable operation even when external services (YOLO26, Nemotron LLM, Redis) experience failures.


Table of Contents

  1. Resilience Overview
  2. Circuit Breaker Pattern
  3. Retry Handler with Exponential Backoff
  4. Dead-Letter Queue (DLQ) Management
  5. Service Health Monitoring
  6. Graceful Degradation
  7. Recovery Strategies
  8. Configuration Reference
  9. WebSocket Circuit Breaker and Degraded Mode

Resilience Overview

The system implements multiple layers of resilience to handle failures gracefully:

Resilience Architecture Overview

Layered resilience architecture showing circuit breakers, retry logic with exponential backoff, dead-letter queues, and health monitoring.

Mermaid source
flowchart TB
    subgraph Input["Incoming Request"]
        REQ[Service Call]
    end

    subgraph CircuitBreaker["Circuit Breaker Layer"]
        CB{Circuit<br/>State?}
        CLOSED[CLOSED<br/>Normal Operation]
        OPEN[OPEN<br/>Fast Fail]
        HALF[HALF_OPEN<br/>Test Recovery]
    end

    subgraph Retry["Retry Layer"]
        RT{Retry<br/>Attempt?}
        BACKOFF[Exponential<br/>Backoff]
        EXEC[Execute<br/>Operation]
    end

    subgraph Outcome["Outcome Handling"]
        SUCCESS[Success<br/>Reset Counters]
        FAIL[Failure<br/>Increment Counter]
        DLQ[Dead Letter<br/>Queue]
    end

    subgraph Recovery["Recovery Services"]
        HM[Health<br/>Monitor]
        AUTO[Auto<br/>Restart]
    end

    REQ --> CB
    CB -->|Closed| CLOSED --> RT
    CB -->|Open| OPEN --> FAIL
    CB -->|Half-Open| HALF --> RT

    RT -->|Yes| BACKOFF --> EXEC
    RT -->|Max Retries| DLQ

    EXEC -->|OK| SUCCESS
    EXEC -->|Error| FAIL

    FAIL -->|Threshold Met| OPEN
    SUCCESS --> CLOSED

    HM -->|Unhealthy| AUTO
    AUTO -->|Restart| HM

    style OPEN fill:#E74856,color:#fff
    style SUCCESS fill:#76B900,color:#fff
    style DLQ fill:#A855F7,color:#fff
    style HALF fill:#FFB800,color:#000

Resilience Components

Component Location Responsibility
CircuitBreaker backend/services/circuit_breaker.py:270 Prevents cascading failures by failing fast
RetryHandler backend/services/retry_handler.py:184 Exponential backoff with DLQ support
ServiceHealthMonitor backend/services/health_monitor.py:44 Periodic health checks and auto-recovery
DegradationManager backend/services/degradation_manager.py Graceful degradation during outages

Circuit Breaker Pattern

The circuit breaker prevents cascading failures by monitoring failure rates and temporarily blocking calls to unhealthy services, letting callers fail fast instead of hanging on timeouts.

Circuit Breaker States

Circuit Breaker State Machine

State machine showing transitions between CLOSED (normal), OPEN (tripped), and HALF_OPEN (testing) states.

Mermaid source
stateDiagram-v2
    [*] --> CLOSED: Initial State

    CLOSED --> OPEN: failures >= threshold
    OPEN --> HALF_OPEN: recovery_timeout elapsed
    HALF_OPEN --> CLOSED: success_threshold met
    HALF_OPEN --> OPEN: any failure

    CLOSED: Normal Operation
    CLOSED: Calls pass through
    CLOSED: Track failures

    OPEN: Circuit Tripped
    OPEN: Calls rejected immediately
    OPEN: CircuitBreakerError raised

    HALF_OPEN: Recovery Testing
    HALF_OPEN: Limited calls allowed
    HALF_OPEN: Track successes

Implementation Details

The CircuitBreaker class at line 270 implements the pattern:

# backend/services/circuit_breaker.py:270
class CircuitBreaker:
    """Circuit breaker for protecting external service calls.

    Implements the circuit breaker pattern with three states:
    - CLOSED: Normal operation, calls pass through
    - OPEN: Service failing, calls rejected immediately
    - HALF_OPEN: Testing recovery, limited calls allowed
    """

    def __init__(
        self,
        name: str,
        config: CircuitBreakerConfig | None = None,
    ) -> None:
        self._name = name
        self._config = config or CircuitBreakerConfig()
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        # ...

Circuit Breaker Configuration

The CircuitBreakerConfig at line 139 defines behavior:

Parameter Default Description
failure_threshold 5 Failures before opening circuit
recovery_timeout 30.0s Wait time before testing recovery
half_open_max_calls 3 Max calls allowed in half-open state
success_threshold 2 Successes needed to close circuit
excluded_exceptions () Exception types that don't count as failures

Usage Pattern

from backend.services.circuit_breaker import CircuitBreakerConfig, CircuitBreakerError, get_circuit_breaker

# Get or create circuit breaker for a service
breaker = get_circuit_breaker(
    "yolo26",
    CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout=30.0,
    )
)

# Execute through circuit breaker
try:
    result = await breaker.call(detector_client.detect_objects, image_path)
except CircuitBreakerError:
    # Service unavailable, use fallback
    result = []

Circuit Breaker Registry

The CircuitBreakerRegistry at line 1018 manages multiple breakers:

Circuit Breaker Registry

Global registry managing circuit breakers for yolo26, nemotron, and redis services.

Mermaid source
flowchart TB
    subgraph Registry["CircuitBreakerRegistry"]
        R[Global Registry]
    end

    subgraph Breakers["Individual Circuit Breakers"]
        B1[yolo26<br/>breaker]
        B2[nemotron<br/>breaker]
        B3[redis<br/>breaker]
    end

    subgraph Services["Protected Services"]
        S1[YOLO26<br/>:8095]
        S2[Nemotron LLM<br/>:8091]
        S3[Redis<br/>:6379]
    end

    R --> B1
    R --> B2
    R --> B3

    B1 --> S1
    B2 --> S2
    B3 --> S3

    style S1 fill:#3B82F6,color:#fff
    style S2 fill:#3B82F6,color:#fff
    style S3 fill:#A855F7,color:#fff
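
The registry's get-or-create behavior can be pictured with a minimal sketch. This is illustrative only; the production class at backend/services/circuit_breaker.py:1018 may differ in locking and bookkeeping, and only the get_circuit_breaker() helper name is taken from the usage example above.

# Illustrative sketch of the registry's get-or-create behavior (not the
# production implementation at circuit_breaker.py:1018).
class CircuitBreakerRegistry:
    def __init__(self) -> None:
        self._breakers: dict[str, CircuitBreaker] = {}

    def get_or_create(
        self, name: str, config: CircuitBreakerConfig | None = None
    ) -> CircuitBreaker:
        # Callers asking for the same service name share one breaker instance,
        # so failure counts and state are tracked per service, not per caller.
        if name not in self._breakers:
            self._breakers[name] = CircuitBreaker(name, config)
        return self._breakers[name]

# A module-level registry backs the get_circuit_breaker() helper used earlier.
_registry = CircuitBreakerRegistry()

def get_circuit_breaker(name: str, config: CircuitBreakerConfig | None = None) -> CircuitBreaker:
    return _registry.get_or_create(name, config)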

Retry Handler with Exponential Backoff

The RetryHandler at line 184 provides automatic retries with exponential backoff for transient failures.

Retry Flow

Retry Handler Flow

Retry flow showing exponential backoff calculation, jitter application, cap enforcement, and dead-letter queue handling.

Mermaid source
flowchart TB
    subgraph Input["Job Processing"]
        JOB[Detection/Analysis Job]
    end

    subgraph RetryLoop["Retry Handler"]
        ATT{Attempt<br/>N of Max?}
        EXEC[Execute<br/>Operation]
        CHK{Success?}
        CALC[Calculate<br/>Backoff Delay]
        WAIT[Wait with<br/>Jitter]
    end

    subgraph Outcomes["Final Outcome"]
        OK[Success<br/>Return Result]
        DLQ[Move to DLQ<br/>dlq:queue_name]
    end

    JOB --> ATT
    ATT -->|Attempt N| EXEC
    ATT -->|Max Exceeded| DLQ

    EXEC --> CHK
    CHK -->|Yes| OK
    CHK -->|No| CALC

    CALC --> WAIT
    WAIT --> ATT

    style OK fill:#76B900,color:#fff
    style DLQ fill:#E74856,color:#fff

Exponential Backoff Algorithm

The RetryConfig at line 64 configures backoff behavior:

# backend/services/retry_handler.py:64
import random
from dataclasses import dataclass


@dataclass
class RetryConfig:
    """Configuration for retry behavior."""

    max_retries: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 30.0
    exponential_base: float = 2.0
    jitter: bool = True

    def get_delay(self, attempt: int) -> float:
        """Calculate delay: base * (exponential_base ^ (attempt - 1))"""
        delay = self.base_delay_seconds * (self.exponential_base ** (attempt - 1))
        delay = min(delay, self.max_delay_seconds)
        if self.jitter:
            jitter_amount = delay * 0.25 * random.random()
            delay = delay + jitter_amount
        return delay

Backoff Timing Example

Attempt Base Delay With Jitter (0-25%)
1 1.0s 1.0s - 1.25s
2 2.0s 2.0s - 2.5s
3 4.0s 4.0s - 5.0s
4 8.0s 8.0s - 10.0s
5 16.0s 16.0s - 20.0s
6+ 30.0s (max) 30.0s - 37.5s
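
The schedule above, combined with the DLQ hand-off on exhaustion, can be sketched as a simple loop. This is a condensed illustration, not the production RetryHandler at retry_handler.py:184; only RetryConfig.get_delay() and the dlq:<queue_name> key format come from this document.

# Illustrative retry loop; the real RetryHandler adds logging, retryable-exception
# filtering, and the richer DLQ metadata shown in the next section.
import asyncio
import json

async def run_with_retries(operation, job: dict, config: RetryConfig, redis, queue_name: str):
    last_error: Exception | None = None
    for attempt in range(1, config.max_retries + 1):
        try:
            return await operation(job)
        except Exception as exc:
            last_error = exc
            if attempt < config.max_retries:
                # Capped, jittered exponential backoff before the next attempt.
                await asyncio.sleep(config.get_delay(attempt))
    # All attempts failed: park the job in the dead-letter queue for inspection.
    await redis.rpush(f"dlq:{queue_name}", json.dumps({"original_job": job, "error": str(last_error)}))
    return None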

Dead-Letter Queue (DLQ) Management

Jobs that exhaust all retry attempts are moved to dead-letter queues for manual inspection and reprocessing.

DLQ Architecture

Dead-letter queue architecture showing processing queues (detection_queue, analysis_queue) flowing through workers to the retry handler, with failed jobs moving to DLQ storage (dlq:detection_queue, dlq:analysis_queue) and the DLQ management API providing inspection, requeue, and clear operations

DLQ system architecture with queue workers, retry handling, and management API for failed job recovery.

Mermaid source
flowchart TB
    subgraph ProcessingQueues["Processing Queues"]
        DQ[detection_queue]
        AQ[analysis_queue]
    end

    subgraph Workers["Queue Workers"]
        DW[DetectionQueueWorker]
        AW[AnalysisQueueWorker]
    end

    subgraph RetryLayer["Retry Handler"]
        RH[RetryHandler<br/>max_retries=3]
    end

    subgraph DLQs["Dead Letter Queues"]
        DLQ1[dlq:detection_queue]
        DLQ2[dlq:analysis_queue]
    end

    subgraph Management["DLQ Management API"]
        API[/api/dlq/*]
        INSPECT[Inspect Jobs]
        REQUEUE[Requeue Jobs]
        CLEAR[Clear Queue]
    end

    DQ --> DW
    AQ --> AW
    DW --> RH
    AW --> RH

    RH -->|Exhausted| DLQ1
    RH -->|Exhausted| DLQ2

    API --> INSPECT
    API --> REQUEUE
    API --> CLEAR

    DLQ1 -.->|Manual| REQUEUE
    DLQ2 -.->|Manual| REQUEUE
    REQUEUE -.->|Return to| DQ
    REQUEUE -.->|Return to| AQ

    style DLQ1 fill:#E74856,color:#fff
    style DLQ2 fill:#E74856,color:#fff
    style API fill:#3B82F6,color:#fff

DLQ Job Format

Jobs in the DLQ include failure metadata:

{
  "original_job": {
    "camera_id": "front_door",
    "file_path": "/export/foscam/front_door/image_001.jpg",
    "timestamp": "2024-01-15T10:30:00.000000"
  },
  "error": "Connection refused: YOLO26 service unavailable",
  "attempt_count": 3,
  "first_failed_at": "2024-01-15T10:30:01.000000",
  "last_failed_at": "2024-01-15T10:30:15.000000",
  "queue_name": "detection_queue"
}

DLQ Statistics

The DLQStats dataclass at line 175 summarizes dead-letter queue depths:

# backend/services/retry_handler.py:175
@dataclass
class DLQStats:
    """Statistics about dead-letter queues."""

    detection_queue_count: int = 0
    analysis_queue_count: int = 0
    total_count: int = 0

DLQ API Endpoints

Endpoint Method Description
/api/dlq/stats GET Get DLQ statistics
/api/dlq/{queue_name} GET List jobs in a DLQ
/api/dlq/{queue_name}/requeue POST Move job back to processing
/api/dlq/{queue_name} DELETE Clear all jobs in DLQ
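
These endpoints can be exercised programmatically, for example from an operations script. The sketch below uses httpx; the base URL and response field names are assumptions, only the paths come from the table above.

# Illustrative DLQ management calls; base URL and response fields are assumptions.
import httpx

async def requeue_failed_detections(base_url: str = "http://localhost:8000") -> None:
    async with httpx.AsyncClient(base_url=base_url) as client:
        stats = (await client.get("/api/dlq/stats")).json()
        if stats.get("detection_queue_count", 0) > 0:
            # Move failed detection jobs back onto the processing queue.
            await client.post("/api/dlq/detection_queue/requeue")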

Service Health Monitoring

The ServiceHealthMonitor at line 44 continuously monitors external services and orchestrates automatic recovery.

Health Check Flow

Health Check Flow

Service health monitoring flow showing monitored services, state transitions, and recovery actions.

Mermaid source
flowchart TB
    subgraph Monitor["ServiceHealthMonitor"]
        LOOP[Health Check Loop<br/>Every 15s]
    end

    subgraph Services["Monitored Services"]
        S1[YOLO26<br/>GET /health]
        S2[Nemotron<br/>GET /health]
        S3[Redis<br/>PING]
    end

    subgraph States["Service States"]
        HEALTHY[healthy<br/>Normal operation]
        UNHEALTHY[unhealthy<br/>Health check failed]
        RESTARTING[restarting<br/>Restart in progress]
        FAILED[failed<br/>Max retries exceeded]
    end

    subgraph Recovery["Recovery Actions"]
        BACKOFF[Exponential<br/>Backoff]
        RESTART[Restart<br/>Service]
        BROADCAST[WebSocket<br/>Broadcast]
    end

    LOOP --> S1
    LOOP --> S2
    LOOP --> S3

    S1 & S2 & S3 -->|OK| HEALTHY
    S1 & S2 & S3 -->|Fail| UNHEALTHY

    UNHEALTHY --> BACKOFF
    BACKOFF --> RESTART
    RESTART -->|Success| HEALTHY
    RESTART -->|Fail| RESTARTING
    RESTARTING -->|Max Retries| FAILED

    HEALTHY --> BROADCAST
    UNHEALTHY --> BROADCAST
    FAILED --> BROADCAST

    style HEALTHY fill:#76B900,color:#fff
    style UNHEALTHY fill:#FFB800,color:#000
    style FAILED fill:#E74856,color:#fff
    style RESTARTING fill:#3B82F6,color:#fff

Health Monitor Implementation

# backend/services/health_monitor.py:44
class ServiceHealthMonitor:
    """Monitors service health and orchestrates automatic recovery.

    Status values:
        - healthy: Service responding normally
        - unhealthy: Health check failed
        - restarting: Restart in progress
        - restart_failed: Restart attempt failed
        - failed: Max retries exceeded, giving up
    """

    def __init__(
        self,
        manager: ServiceManager,
        services: list[ServiceConfig],
        broadcaster: EventBroadcaster | None = None,
        check_interval: float = 15.0,
    ) -> None:
        self._manager = manager
        self._services = services
        self._broadcaster = broadcaster
        self._check_interval = check_interval
        # ...

Recovery Backoff Strategy

Recovery attempts use exponential backoff to avoid overwhelming recovering services:

Attempt Backoff Delay Formula
1 5s backoff_base * 2^0
2 10s backoff_base * 2^1
3 20s backoff_base * 2^2
4 40s backoff_base * 2^3
5 (Give up) Max retries exceeded
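
A condensed sketch of the check-and-restart loop described above. Names and helpers here are placeholders, not the production ServiceHealthMonitor; only the intervals, backoff formula, and restart commands come from this document.

# Illustrative health-check loop; helper names are placeholders.
import asyncio

async def health_check_loop(checks, manager, check_interval=15.0, backoff_base=5.0, max_retries=5):
    retry_counts: dict[str, int] = {name: 0 for name in checks}
    while True:
        for name, check in checks.items():  # check: async callable -> bool (GET /health or PING)
            if await check():
                retry_counts[name] = 0  # healthy: reset the restart counter
                continue
            attempt = retry_counts[name]
            if attempt >= max_retries:
                continue  # "failed" state: give up and wait for manual intervention
            # Restart backoff: backoff_base * 2^attempt -> 5s, 10s, 20s, 40s
            await asyncio.sleep(backoff_base * (2**attempt))
            await manager.restart(name)  # docker restart / systemctl restart
            retry_counts[name] = attempt + 1
        await asyncio.sleep(check_interval)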

Graceful Degradation

When services are unavailable, the system degrades gracefully rather than failing completely.

Degradation Modes

Graceful Degradation

Graceful degradation modes showing normal operation, failure scenarios, and degraded behaviors.

Mermaid source
flowchart TB
    subgraph Normal["Normal Operation"]
        N1[Full AI Pipeline]
        N2[Real-time Events]
        N3[Risk Scoring]
    end

    subgraph Degraded["Degraded Modes"]
        D1[Detection Only<br/>No LLM Analysis]
        D2[Queue Buffering<br/>Service Recovery]
        D3[Fallback Risk<br/>Score: 50, Medium]
    end

    subgraph Failed["Failure Scenarios"]
        F1[YOLO26<br/>Unavailable]
        F2[Nemotron<br/>Unavailable]
        F3[Redis<br/>Unavailable]
    end

    F1 -->|Skip Detection| D2
    F2 -->|Use Fallback| D3
    F3 -->|Fail Open| D1

    N1 --> F1
    N1 --> F2
    N1 --> F3

    style Normal fill:#76B900,color:#fff
    style Degraded fill:#FFB800,color:#000
    style Failed fill:#E74856,color:#fff

Degradation Behavior by Component

Component Failure Mode Degradation Behavior
YOLO26 Unreachable DetectorClient returns empty list, detection skipped
Nemotron Unreachable NemotronAnalyzer returns default risk (50, medium)
Redis Unreachable Deduplication fails open (allows processing)
Redis Pub/sub down WebSocket updates unavailable
PostgreSQL Unreachable Full system failure (critical dependency)
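
The Redis fail-open behavior, for example, can be sketched as follows. The key format, TTL, and function name are illustrative assumptions; the documented behavior is simply that deduplication allows processing when Redis is unreachable.

# Illustrative fail-open deduplication; key format and TTL are assumptions.
import logging

logger = logging.getLogger(__name__)

async def is_duplicate(redis, camera_id: str, file_path: str, ttl_seconds: int = 300) -> bool:
    key = f"dedup:{camera_id}:{file_path}"
    try:
        # SET NX succeeds only if the key did not exist yet.
        first_seen = await redis.set(key, "1", nx=True, ex=ttl_seconds)
        return not first_seen
    except Exception:
        # Redis unreachable: fail open and allow processing rather than drop the event.
        logger.warning("Deduplication unavailable; failing open for %s", key)
        return False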

Fallback Risk Assessment

When Nemotron is unavailable, the system uses a fallback risk assessment:

# backend/services/nemotron_analyzer.py (within analyze_batch)
# Create fallback risk data when LLM is unavailable
risk_data = {
    "risk_score": 50,
    "risk_level": "medium",
    "summary": "Analysis unavailable - LLM service error",
    "reasoning": "Failed to analyze detections due to service error",
}
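
A condensed view of how such a fallback slots into the analysis path; the wrapper below is illustrative, not the actual analyze_batch code.

# Illustrative wrapper; the real logic lives inside NemotronAnalyzer.analyze_batch.
async def analyze_with_fallback(llm_client, detections: list[dict]) -> dict:
    try:
        return await llm_client.analyze(detections)
    except Exception:
        # Nemotron unreachable or erroring: return a neutral, clearly-labeled result.
        return {
            "risk_score": 50,
            "risk_level": "medium",
            "summary": "Analysis unavailable - LLM service error",
            "reasoning": "Failed to analyze detections due to service error",
        }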

Recovery Strategies

Automatic Recovery Sequence

Mermaid source
sequenceDiagram
    participant HM as HealthMonitor
    participant SVC as External Service
    participant SM as ServiceManager
    participant WS as WebSocket

    Note over HM: Check interval: 15s
    HM->>SVC: Health check
    SVC--xHM: Timeout/Error

    HM->>WS: Broadcast "unhealthy"

    loop Retry with backoff
        HM->>HM: Calculate backoff (5s * 2^n)
        HM->>HM: Wait backoff period
        HM->>WS: Broadcast "restarting"
        HM->>SM: Restart service
        SM->>SVC: docker restart / systemctl restart
        HM->>HM: Wait 2s for startup
        HM->>SVC: Health check
        alt Healthy
            SVC-->>HM: OK
            HM->>WS: Broadcast "healthy"
        else Still Unhealthy
            SVC--xHM: Error
            Note over HM: Increment retry count
        end
    end

    alt Max retries exceeded
        HM->>WS: Broadcast "failed"
        Note over HM: Manual intervention required
    end

Service Manager Strategies

The system supports different restart strategies via the ServiceManager interface:

Strategy Implementation Use Case
ShellServiceManager Shell commands (systemctl, scripts) Development, native services
DockerServiceManager Docker CLI (docker restart) Production containers
PodmanServiceManager Podman CLI (podman restart) Podman deployments
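
The interface can be pictured as a small protocol with one implementation per strategy. This is a sketch assuming restart is the primary operation; the production interface may expose more.

# Illustrative ServiceManager strategy; the production interface may differ.
import asyncio
from typing import Protocol

class ServiceManager(Protocol):
    async def restart(self, service_name: str) -> bool: ...

class DockerServiceManager:
    async def restart(self, service_name: str) -> bool:
        # Shell out to the Docker CLI; Podman and shell variants swap the command.
        proc = await asyncio.create_subprocess_exec("docker", "restart", service_name)
        return await proc.wait() == 0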

Configuration Reference

Circuit Breaker Settings

Environment Variable Default Description
CIRCUIT_BREAKER_FAILURE_THRESHOLD 5 Failures before opening
CIRCUIT_BREAKER_RECOVERY_TIMEOUT 30 Seconds before half-open
CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS 3 Max test calls
CIRCUIT_BREAKER_SUCCESS_THRESHOLD 2 Successes to close

Retry Handler Settings

Environment Variable Default Description
RETRY_MAX_RETRIES 3 Maximum retry attempts
RETRY_BASE_DELAY 1.0 Initial delay (seconds)
RETRY_MAX_DELAY 30.0 Maximum delay (seconds)
RETRY_EXPONENTIAL_BASE 2.0 Backoff multiplier

Health Monitor Settings

Environment Variable Default Description
HEALTH_CHECK_INTERVAL 15.0 Check interval (seconds)
SERVICE_MAX_RETRIES 5 Max restart attempts
SERVICE_BACKOFF_BASE 5.0 Initial restart backoff
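
These variables map naturally onto the dataclasses shown earlier. The sketch below assumes plain os.getenv lookups; the project's actual settings loader may differ.

# Illustrative environment-variable mapping; the real settings loader may differ.
import os

def circuit_breaker_config_from_env() -> CircuitBreakerConfig:
    return CircuitBreakerConfig(
        failure_threshold=int(os.getenv("CIRCUIT_BREAKER_FAILURE_THRESHOLD", "5")),
        recovery_timeout=float(os.getenv("CIRCUIT_BREAKER_RECOVERY_TIMEOUT", "30")),
        half_open_max_calls=int(os.getenv("CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS", "3")),
        success_threshold=int(os.getenv("CIRCUIT_BREAKER_SUCCESS_THRESHOLD", "2")),
    )

def retry_config_from_env() -> RetryConfig:
    return RetryConfig(
        max_retries=int(os.getenv("RETRY_MAX_RETRIES", "3")),
        base_delay_seconds=float(os.getenv("RETRY_BASE_DELAY", "1.0")),
        max_delay_seconds=float(os.getenv("RETRY_MAX_DELAY", "30.0")),
        exponential_base=float(os.getenv("RETRY_EXPONENTIAL_BASE", "2.0")),
    )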

WebSocket Circuit Breaker and Degraded Mode

The system includes a dedicated WebSocket circuit breaker pattern for real-time connection resilience. This provides automatic recovery when Redis pub/sub experiences failures and graceful degradation when recovery fails.

Architecture Overview

WebSocket Circuit Breaker Architecture

WebSocket circuit breaker architecture showing backend services, Redis pub/sub, and frontend clients.

Mermaid source
flowchart TB
    subgraph Backend["Backend Services"]
        SB[SystemBroadcaster<br/>Port: /ws/system]
        EB[EventBroadcaster<br/>Port: /ws/events]
        CB1[WebSocketCircuitBreaker<br/>system_broadcaster]
        CB2[WebSocketCircuitBreaker<br/>event_broadcaster]
    end

    subgraph Redis["Redis Pub/Sub"]
        CH1[system_status channel]
        CH2[security_events channel]
    end

    subgraph Frontend["Frontend Clients"]
        WS[useWebSocket Hook]
        WSM[WebSocketManager<br/>Connection Deduplication]
    end

    SB --> CB1
    EB --> CB2
    CB1 --> CH1
    CB2 --> CH2
    CH1 -.->|Subscribe| SB
    CH2 -.->|Subscribe| EB

    SB -->|Broadcast| WS
    EB -->|Broadcast| WS
    WS --> WSM

    style CB1 fill:#FFB800,color:#000
    style CB2 fill:#FFB800,color:#000
    style WSM fill:#3B82F6,color:#fff

WebSocket Circuit Breaker States

The WebSocketCircuitBreaker implements the circuit breaker pattern specifically for WebSocket broadcaster services.

State Description Behavior
CLOSED Normal operation, WebSocket operations proceed normally All broadcasts pass through
OPEN Too many failures, operations blocked to allow recovery Broadcasts are rejected immediately
HALF_OPEN Testing recovery, limited operations allowed Single test operation allowed per recovery cycle

State Diagram

The WebSocket circuit breaker follows the same CLOSED / OPEN / HALF_OPEN state machine as the service-level breaker, with WebSocket-specific thresholds:

Mermaid source
stateDiagram-v2
    [*] --> CLOSED: Initial State

    CLOSED --> OPEN: failures >= threshold (5)
    OPEN --> HALF_OPEN: recovery_timeout (30s) elapsed
    HALF_OPEN --> CLOSED: success_threshold (1) met
    HALF_OPEN --> OPEN: any failure

    CLOSED: Normal Operation
    CLOSED: WebSocket broadcasts pass through
    CLOSED: Track consecutive failures
    CLOSED: Reset failure count on success

    OPEN: Circuit Tripped
    OPEN: Broadcasts rejected immediately
    OPEN: Waiting for recovery timeout
    OPEN: Degraded mode notification sent

    HALF_OPEN: Recovery Testing
    HALF_OPEN: Single test operation allowed
    HALF_OPEN: Track success/failure
    HALF_OPEN: Careful service probing

Configuration

Both SystemBroadcaster and EventBroadcaster use the following circuit breaker configuration:

Parameter Default Description
failure_threshold 5 (MAX_RECOVERY_ATTEMPTS) Consecutive failures before opening circuit
recovery_timeout 30.0s Wait time before transitioning to HALF_OPEN
half_open_max_calls 1 Max calls allowed in HALF_OPEN state
success_threshold 1 Successes needed in HALF_OPEN to close circuit

Backend: Broadcaster Integration

Both EventBroadcaster and SystemBroadcaster integrate the WebSocketCircuitBreaker for pub/sub listener resilience:

# backend/services/system_broadcaster.py
from backend.core.websocket_circuit_breaker import WebSocketCircuitBreaker, WebSocketCircuitState

class SystemBroadcaster:
    MAX_RECOVERY_ATTEMPTS = 5

    def __init__(self, ...):
        self._circuit_breaker = WebSocketCircuitBreaker(
            failure_threshold=self.MAX_RECOVERY_ATTEMPTS,
            recovery_timeout=30.0,
            half_open_max_calls=1,
            success_threshold=1,
            name="system_broadcaster",
        )
        self._is_degraded = False

    def is_degraded(self) -> bool:
        """Check if the broadcaster is in degraded mode."""
        return self._is_degraded

    def get_circuit_state(self) -> WebSocketCircuitState:
        """Get current circuit breaker state."""
        return self._circuit_breaker.get_state()
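
Within the pub/sub listener, the breaker gates each recovery attempt through the documented is_call_permitted(), record_success(), and record_failure() methods. The sketch below is condensed; the surrounding helper names are illustrative.

# Condensed recovery-gating sketch; helper names (_reset_pubsub_connection,
# _restart_listener_task) are placeholders, the circuit breaker calls are documented.
async def _attempt_recovery(self) -> None:
    if not self._circuit_breaker.is_call_permitted():
        # Circuit is OPEN: stop retrying and notify clients of degraded mode.
        self._enter_degraded_mode()
        return
    try:
        await self._reset_pubsub_connection()
        await self._restart_listener_task()
        self._circuit_breaker.record_success()
        self._is_degraded = False
    except Exception:
        self._circuit_breaker.record_failure()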

Degraded Mode

When the circuit breaker opens and recovery fails, the broadcaster enters degraded mode:

Mermaid source
sequenceDiagram
    participant Redis as Redis Pub/Sub
    participant CB as Circuit Breaker
    participant SB as SystemBroadcaster
    participant WS as WebSocket Clients

    Note over Redis: Redis connection fails

    loop Recovery Attempts (1-5)
        SB->>Redis: Attempt reconnect
        Redis--xSB: Connection failed
        SB->>CB: record_failure()
        CB->>CB: failure_count++
    end

    CB->>CB: failure_count >= 5
    CB->>SB: is_call_permitted() = false
    SB->>SB: Enter degraded mode
    SB->>WS: Broadcast service_status: degraded

    Note over SB: Manual restart required

Degraded Mode Behavior

  1. is_degraded() method - Returns True when all recovery attempts are exhausted
  2. Client notification - Connected clients receive a service_status message:
    {
      "type": "service_status",
      "data": {
        "service": "system_broadcaster",
        "status": "degraded",
        "message": "System status broadcasting is degraded. Updates may be delayed or unavailable.",
        "circuit_state": "open"
      }
    }
    
  3. Graceful handling - WebSocket connections are still accepted, but real-time broadcasts may be delayed or unavailable
  4. CRITICAL logging - Operator alert logged for manual intervention
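
A sketch of how the degraded-mode transition might assemble that notification. The message structure matches the example above; the helper names and log wording are placeholders.

# Illustrative degraded-mode transition; _send_to_connected_clients is a placeholder.
import asyncio
import logging

logger = logging.getLogger(__name__)

def _enter_degraded_mode(self) -> None:
    self._is_degraded = True
    message = {
        "type": "service_status",
        "data": {
            "service": "system_broadcaster",
            "status": "degraded",
            "message": "System status broadcasting is degraded. "
                       "Updates may be delayed or unavailable.",
            "circuit_state": "open",
        },
    }
    # Called from within the running event loop, so scheduling a task is safe here.
    # Clients stay connected; they are only told that updates may lag.
    asyncio.create_task(self._send_to_connected_clients(message))
    logger.critical("SystemBroadcaster has entered DEGRADED MODE; manual restart required")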

Recovery Sequence

The broadcaster attempts automatic recovery with exponential backoff:

Recovery Flow

Broadcaster recovery flow showing failure detection, recovery attempts, circuit breaker check, and outcomes.

Mermaid source
flowchart TB
    subgraph Failure["Failure Detection"]
        F1[Redis Connection Error]
        F2[Pub/Sub Listener Dies]
    end

    subgraph Recovery["Recovery Attempts"]
        R1{Attempt < 5?}
        R2[Record Failure]
        R3[Exponential Backoff<br/>1s, 2s, 4s, 8s...]
        R4[Reset Pub/Sub Connection]
        R5[Restart Listener Task]
    end

    subgraph CircuitBreaker["Circuit Breaker Check"]
        CB{is_call_permitted?}
        CB_BLOCK[Block Recovery<br/>Circuit OPEN]
    end

    subgraph Outcome["Outcome"]
        SUCCESS[Recovery Success<br/>Reset Counters]
        DEGRADED[Enter Degraded Mode<br/>Broadcast Status]
    end

    F1 --> R2
    F2 --> R2
    R2 --> CB

    CB -->|Yes| R1
    CB -->|No| CB_BLOCK --> DEGRADED

    R1 -->|Yes| R3 --> R4 --> R5
    R1 -->|No| DEGRADED

    R5 -->|Success| SUCCESS
    R5 -->|Failure| R2

    style DEGRADED fill:#E74856,color:#fff
    style SUCCESS fill:#76B900,color:#fff
    style CB_BLOCK fill:#FFB800,color:#000

Frontend: Client-Side Circuit Breaker Pattern

The frontend implements its own circuit breaker-like behavior through the reconnection logic in webSocketManager.ts. While not a traditional circuit breaker class, the maxReconnectAttempts mechanism provides equivalent protection:

  • Closed State (equivalent): Normal connection, reset on successful open
  • Open State (equivalent): hasExhaustedRetries = true, no more connection attempts
  • Half-Open State (equivalent): Each reconnection attempt tests if the server is available

This approach is more appropriate for client-side WebSocket connections where:

  1. The client cannot "block" operations like a backend service can
  2. The primary failure mode is disconnection, not request failures
  3. User feedback (connection status) is more important than request throttling

Frontend Circuit Breaker State Machine

Frontend State Machine

Frontend circuit breaker state machine showing Connected, Reconnecting, and Exhausted states.

Mermaid source
stateDiagram-v2
    [*] --> Connected: Initial connect()

    Connected --> Reconnecting: onClose event
    Connected: isConnected = true
    Connected: hasExhaustedRetries = false
    Connected: reconnectAttempts = 0

    Reconnecting --> Connected: onOpen event
    Reconnecting --> Reconnecting: attempt < maxReconnectAttempts
    Reconnecting --> Exhausted: attempt >= maxReconnectAttempts
    Reconnecting: isConnected = false
    Reconnecting: Exponential backoff + jitter
    Reconnecting: reconnectAttempts++

    Exhausted --> Connected: Manual connect() call
    Exhausted: hasExhaustedRetries = true
    Exhausted: onMaxRetriesExhausted() called
    Exhausted: No automatic reconnection

WebSocket Manager Architecture

The WebSocketManager provides connection deduplication and automatic reconnection:

WebSocket Manager Architecture

WebSocket Manager architecture showing React components, hook, manager singleton, and managed connection.

Mermaid source
flowchart TB
    subgraph Components["React Components"]
        C1[Dashboard]
        C2[EventFeed]
        C3[SystemStatus]
    end

    subgraph Hook["useWebSocket Hook"]
        H[useWebSocket<br/>Options & Callbacks]
    end

    subgraph Manager["WebSocketManager Singleton"]
        M[Connection Pool]
        SUB[Subscribers Map<br/>Reference Counting]
    end

    subgraph Connection["Managed Connection"]
        WS[WebSocket Instance]
        RT[Reconnect Logic<br/>Exponential Backoff]
        HB[Heartbeat Handler<br/>Ping/Pong]
    end

    C1 & C2 & C3 --> H
    H --> M
    M --> SUB --> WS
    WS --> RT
    WS --> HB

    style M fill:#3B82F6,color:#fff

Client Reconnection Configuration

// frontend/src/hooks/useWebSocket.ts
export interface WebSocketOptions {
  url: string;
  reconnect?: boolean; // Default: true
  reconnectInterval?: number; // Default: 1000ms (base interval)
  reconnectAttempts?: number; // Default: 5 (max attempts)
  connectionTimeout?: number; // Default: 10000ms
  autoRespondToHeartbeat?: boolean; // Default: true
  onMaxRetriesExhausted?: () => void; // Called when max attempts reached
}

Exponential Backoff with Jitter

// frontend/src/hooks/webSocketManager.ts
function calculateBackoffDelay(
  attempt: number,
  baseInterval: number,
  maxInterval: number = 30000
): number {
  const exponentialDelay = baseInterval * Math.pow(2, attempt);
  const cappedDelay = Math.min(exponentialDelay, maxInterval);
  const jitter = Math.random() * 0.25 * cappedDelay;
  return Math.floor(cappedDelay + jitter);
}
Attempt Base Delay Exponential Capped With Jitter (0-25%)
0 1000ms 1000ms 1000ms 1000-1250ms
1 1000ms 2000ms 2000ms 2000-2500ms
2 1000ms 4000ms 4000ms 4000-5000ms
3 1000ms 8000ms 8000ms 8000-10000ms
4 1000ms 16000ms 16000ms 16000-20000ms
5+ 1000ms 32000ms+ 30000ms 30000-37500ms

Client State Tracking

The useWebSocket hook exposes reconnection state:

export interface UseWebSocketReturn {
  isConnected: boolean; // Current connection status
  hasExhaustedRetries: boolean; // True if max attempts reached
  reconnectCount: number; // Current retry attempt count
  lastHeartbeat: Date | null; // Timestamp of last server heartbeat
  connect: () => void; // Manual reconnect trigger
  disconnect: () => void; // Manual disconnect
}

End-to-End Resilience Flow

Mermaid source
sequenceDiagram
    participant Client as Frontend Client
    participant WSM as WebSocketManager
    participant Backend as Backend (FastAPI)
    participant SB as SystemBroadcaster
    participant CB as Circuit Breaker
    participant Redis as Redis

    Note over Redis: Redis goes offline

    SB->>Redis: Pub/Sub subscribe
    Redis--xSB: Connection lost
    SB->>CB: record_failure()

    loop Recovery (up to 5 attempts)
        SB->>Redis: Reconnect attempt
        Redis--xSB: Still unavailable
        SB->>CB: record_failure()
    end

    CB->>SB: is_call_permitted() = false
    SB->>SB: _is_degraded = true
    SB->>Backend: Broadcast degraded status
    Backend->>Client: service_status: degraded

    Note over Client: Client shows degraded banner

    Client->>WSM: Connection lost
    WSM->>WSM: Start reconnection

    loop Client Reconnection (up to 5 attempts)
        WSM->>Backend: WebSocket connect
        Backend-->>WSM: Connection established
        Note over WSM: onClose triggered (backend may drop)
        WSM->>WSM: Exponential backoff
    end

    alt Max Retries Exceeded
        WSM->>Client: onMaxRetriesExhausted()
        Note over Client: Show "Connection lost" UI
    else Redis Recovers
        Redis->>SB: Connection restored
        SB->>CB: record_success()
        CB->>SB: is_call_permitted() = true
        SB->>SB: _is_degraded = false
        SB->>Backend: Resume broadcasting
        Backend->>Client: service_status: healthy
    end

Monitoring and Observability

Backend Metrics

The circuit breaker tracks metrics via get_metrics() and get_status():

Metric Description API Endpoint
failure_count Consecutive failures since last success GET /api/system/health/ready
success_count Consecutive successes in HALF_OPEN state Internal monitoring
total_failures Total failures recorded Prometheus metrics (if enabled)
total_successes Total successes recorded Prometheus metrics (if enabled)
last_failure_time Timestamp of last failure (monotonic) Circuit breaker status
last_state_change Timestamp of last state transition Circuit breaker status
opened_at Timestamp when circuit was last opened Circuit breaker status
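
These values can be pulled programmatically for dashboards or ad-hoc debugging. A short sketch using the documented get_metrics() accessor; dictionary keys beyond those listed in the table are assumptions.

# Illustrative metrics readout; key names beyond the table above are assumptions.
breaker = get_circuit_breaker("yolo26")
metrics = breaker.get_metrics()
print(
    f"failures={metrics.get('failure_count')} "
    f"total_failures={metrics.get('total_failures')} "
    f"opened_at={metrics.get('opened_at')}"
)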

Health Check Integration

# Check broadcaster health in health endpoints
broadcaster = get_system_broadcaster_sync()
if broadcaster.is_degraded():
    return {"status": "degraded", "reason": "WebSocket broadcasting unavailable"}

Client-Side Monitoring

// React component monitoring example
const { isConnected, hasExhaustedRetries, reconnectCount, lastHeartbeat } = useWebSocket({
  url: '/ws/system',
  onMaxRetriesExhausted: () => {
    console.error('WebSocket connection failed after max retries');
    showConnectionErrorBanner();
  },
  onHeartbeat: () => {
    updateLastHeartbeatIndicator();
  },
});

Supervisor Task (EventBroadcaster)

The EventBroadcaster includes an additional supervision layer that monitors listener health:

# backend/services/event_broadcaster.py
async def _supervise_listener(self) -> None:
    """Supervision task that monitors listener health and restarts if needed."""
    while self._is_listening:
        await asyncio.sleep(self.SUPERVISION_INTERVAL)  # 30 seconds

        listener_alive = self._listener_task is not None and not self._listener_task.done()

        if listener_alive:
            self._circuit_breaker.record_success()
            self._recovery_attempts = 0
        elif self._is_listening:
            # Listener died - attempt recovery
            if self._circuit_breaker.is_call_permitted():
                await self._restart_listener()
            else:
                self._enter_degraded_mode()

Backend vs Frontend Circuit Breaker Comparison

Aspect Backend (WebSocketCircuitBreaker) Frontend (WebSocketManager)
Implementation Dedicated class with explicit states Reconnection logic with attempt counter
State Tracking WebSocketCircuitState enum (CLOSED/OPEN/HALF_OPEN) Derived from reconnectAttempts counter
Failure Detection Explicit record_failure() calls onClose event triggers attempt increment
Recovery Testing HALF_OPEN state with limited calls Each reconnect attempt is a recovery test
Blocking Behavior Rejects operations when OPEN Stops automatic reconnection when exhausted
User Notification service_status WebSocket message onMaxRetriesExhausted callback
Manual Reset reset() method connect() method resets attempt counter
Timeout-based Reset Yes (recovery_timeout triggers HALF_OPEN) No (manual connect() required)
Thread Safety asyncio.Lock for async contexts Single-threaded JavaScript (not needed)
Metrics get_metrics() with counters and timestamps getConnectionState() with basic state

Configuration Summary

Component Setting Default Description
Backend CB failure_threshold 5 Failures before circuit opens
Backend CB recovery_timeout 30s Wait before HALF_OPEN transition
Backend SUPERVISION_INTERVAL 30s Listener health check interval
Frontend reconnectAttempts 5 Max client reconnection attempts
Frontend reconnectInterval 1000ms Base backoff interval
Frontend connectionTimeout 10000ms Connection establishment timeout
Frontend maxInterval (backoff) 30000ms Maximum backoff delay

Manual Recovery

When the system enters degraded mode, manual intervention is required:

# Check container health
docker compose -f docker-compose.prod.yml ps

# Check Redis connectivity
docker compose -f docker-compose.prod.yml exec redis redis-cli ping

# Restart the backend service
docker compose -f docker-compose.prod.yml restart backend

# View broadcaster logs
docker compose -f docker-compose.prod.yml logs backend | grep -i "broadcaster\|circuit"

Look for these log patterns:

Log Level Pattern Meaning
CRITICAL EventBroadcaster has entered DEGRADED MODE Requires manual restart
WARNING Circuit breaker is OPEN Recovery blocked, waiting for reset
INFO Restarting pub/sub listener (attempt N/5) Auto-recovery in progress
INFO transitioned HALF_OPEN -> CLOSED (recovered) Service successfully recovered

Related Documentation

Document Purpose
Resilience Patterns Guide Developer guide with code examples for circuit breakers, retry logic, and prompt injection prevention
AI Pipeline Detection and analysis flow
Real-Time WebSocket and pub/sub architecture
Data Model Database schema and relationships
Backend AGENTS.md Service implementation details
Frontend Hooks React hooks including useWebSocket
Backend Core Core infrastructure including Redis

This document describes the resilience architecture for the Home Security Intelligence system. For implementation details, see the source files referenced throughout this document.