Health Monitoring and Auto-Recovery¶

The ServiceHealthMonitor provides continuous health monitoring and automatic recovery of dependent services with exponential backoff restart strategies.

Source: backend/services/health_monitor.py

Overview¶

The ServiceHealthMonitor class (backend/services/health_monitor.py:44-401) provides:

Periodic health checks for all configured services
Automatic restart with exponential backoff on failure
Configurable max retries before giving up
WebSocket broadcast of service status changes
Graceful shutdown support
Event history tracking

Service Status Values¶

Status	Description
`healthy`	Service responding normally
`unhealthy`	Health check failed
`restarting`	Restart in progress
`restart_failed`	Restart attempt failed
`restart_disabled`	Service unhealthy but restart disabled
`failed`	Max retries exceeded, giving up

Configuration¶

ServiceConfig¶

Service configuration is defined in backend/services/service_managers.py:

@dataclass
class ServiceConfig:
    """Configuration for a monitored service."""

    name: str              # Service identifier
    health_url: str        # Health check endpoint
    restart_cmd: str | None  # Restart command (None = disabled)
    max_retries: int       # Max restart attempts
    backoff_base: float    # Base delay for exponential backoff

ServiceHealthMonitor Constructor¶

Defined in backend/services/health_monitor.py:59-89:

# backend/services/health_monitor.py:59-89
def __init__(
    self,
    manager: ServiceManager,
    services: list[ServiceConfig],
    broadcaster: EventBroadcaster | None = None,
    check_interval: float = 15.0,
    max_events: int = 100,
) -> None:
    """Initialize the health monitor.

    Args:
        manager: ServiceManager implementation for health checks and restarts
        services: List of service configurations to monitor
        broadcaster: Optional EventBroadcaster for WebSocket status updates
        check_interval: Seconds between health check cycles (default: 15.0)
        max_events: Maximum number of health events to track (default: 100)
    """

Configuration Parameters¶

Parameter	Type	Default	Description
`manager`	ServiceManager	Required	Implementation for health checks and restarts
`services`	list[ServiceConfig]	Required	Services to monitor
`broadcaster`	EventBroadcaster	None	WebSocket broadcaster for status updates
`check_interval`	float	15.0	Seconds between health check cycles
`max_events`	int	100	Maximum health events to retain

Health Event Tracking¶

HealthEvent Dataclass¶

Defined in backend/services/health_monitor.py:27-41:

# backend/services/health_monitor.py:27-41
@dataclass(slots=True)
class HealthEvent:
    """Represents a health-related event for tracking failure history."""

    timestamp: datetime
    service: str
    event_type: str  # "failure", "recovery", "restart"
    message: str | None = None

Event Types¶

Type	Description
`failure`	Service health check failed or restart failed
`recovery`	Service recovered from unhealthy state
`restart`	Restart attempt initiated

Health Check Loop¶

Health Check Strategy - Periodic monitoring with exponential backoff recovery

The main monitoring loop (backend/services/health_monitor.py:130-184):

# backend/services/health_monitor.py:130-184
async def _health_check_loop(self) -> None:
    """Main loop - check all services every N seconds."""
    while self._running:
        try:
            for service in self._services:
                if not self._running:
                    break

                try:
                    is_healthy = await self._manager.check_health(service)

                    if is_healthy:
                        # Service recovered or still healthy
                        if self._failure_counts.get(service.name, 0) > 0:
                            logger.info(f"Service {service.name} recovered")
                            self._failure_counts[service.name] = 0
                            await self._broadcast_status(service, "healthy", "Service recovered")
                    else:
                        # Service is unhealthy
                        await self._broadcast_status(service, "unhealthy", "Health check failed")
                        await self._handle_failure(service)

                except Exception as e:
                    await self._broadcast_status(service, "unhealthy", f"Health check error: {e}")
                    await self._handle_failure(service)

            await asyncio.sleep(self._check_interval)

        except asyncio.CancelledError:
            break

Failure Handling with Exponential Backoff¶

Recovery Algorithm¶

The _handle_failure() method (backend/services/health_monitor.py:186-287):

# backend/services/health_monitor.py:186-287
async def _handle_failure(self, service: ServiceConfig) -> None:
    """Handle service failure with exponential backoff restart."""
    # Check if restart is disabled
    if service.restart_cmd is None:
        await self._broadcast_status(service, "restart_disabled", "...")
        return

    # Increment failure count
    current_failures = self._failure_counts.get(service.name, 0) + 1
    self._failure_counts[service.name] = current_failures

    # Check max retries
    if current_failures > service.max_retries:
        await self._broadcast_status(service, "failed", "Max retries exceeded")
        return

    # Calculate exponential backoff: backoff_base * 2^(failures-1)
    backoff_delay = service.backoff_base * (2 ** (current_failures - 1))

    # Wait for backoff period
    await asyncio.sleep(backoff_delay)

    # Attempt restart
    await self._broadcast_status(service, "restarting", f"Attempt {current_failures}/{service.max_retries}")
    restart_success = await self._manager.restart(service)

    if restart_success:
        # Verify health after restart
        await asyncio.sleep(2)  # Brief pause before health check
        is_healthy = await self._manager.check_health(service)

        if is_healthy:
            self._failure_counts[service.name] = 0
            await self._broadcast_status(service, "healthy", "Service restarted successfully")

Backoff Timing¶

With backoff_base=5.0 (default):

Attempt	Backoff Delay	Formula
1	5s	`5 * 2^0`
2	10s	`5 * 2^1`
3	20s	`5 * 2^2`
4	40s	`5 * 2^3`
5	80s	`5 * 2^4`

WebSocket Status Broadcasting¶

Broadcast Format¶

Status updates use a canonical message envelope (backend/services/health_monitor.py:310-368):

# backend/services/health_monitor.py:350-359
event_data = {
    "type": "service_status",
    "data": {
        "service": service.name,
        "status": status,
        "message": message,
    },
    "timestamp": datetime.now(UTC).isoformat(),
}

Example WebSocket Message¶

{
  "type": "service_status",
  "data": {
    "service": "yolo26",
    "status": "restarting",
    "message": "Attempting restart (attempt 2/5)"
  },
  "timestamp": "2024-01-15T10:30:15.000000+00:00"
}

Lifecycle Management¶

Starting the Monitor¶

# backend/services/health_monitor.py:91-107
async def start(self) -> None:
    """Start the health check loop.

    This method is idempotent - calling when already running has no effect.
    """
    if self._running:
        logger.warning("ServiceHealthMonitor already running")
        return

    logger.info("Starting ServiceHealthMonitor")
    self._running = True
    self._failure_counts.clear()

    # Start health check loop in background
    self._task = asyncio.create_task(self._health_check_loop())

Stopping the Monitor¶

# backend/services/health_monitor.py:109-128
async def stop(self) -> None:
    """Stop the health check loop gracefully."""
    if not self._running:
        return

    logger.info("Stopping ServiceHealthMonitor")
    self._running = False

    # Cancel health check task
    if self._task:
        self._task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await self._task
        self._task = None

Status and Monitoring¶

get_status()¶

Get current failure counts (backend/services/health_monitor.py:370-382):

# backend/services/health_monitor.py:370-382
def get_status(self) -> dict[str, dict[str, int | str]]:
    """Get current status of all monitored services."""
    return {
        service.name: {
            "failure_count": self._failure_counts.get(service.name, 0),
            "max_retries": service.max_retries,
        }
        for service in self._services
    }

get_recent_events()¶

Get event history (backend/services/health_monitor.py:384-396):

# backend/services/health_monitor.py:384-396
def get_recent_events(self, limit: int = 50) -> list[HealthEvent]:
    """Get recent health events.

    Args:
        limit: Maximum number of events to return (default: 50)

    Returns:
        List of recent HealthEvent objects, most recent first
    """
    events = list(self._health_events)
    events.reverse()
    return events[:limit]

is_running Property¶

# backend/services/health_monitor.py:398-401
@property
def is_running(self) -> bool:
    """Check if the health monitor is running."""
    return self._running

Service Manager Integration¶

The monitor uses a ServiceManager interface for health checks and restarts:

# backend/services/service_managers.py
class ServiceManager(ABC):
    """Abstract interface for service management."""

    @abstractmethod
    async def check_health(self, service: ServiceConfig) -> bool:
        """Check if a service is healthy."""
        pass

    @abstractmethod
    async def restart(self, service: ServiceConfig) -> bool:
        """Restart a service."""
        pass

Available Implementations¶

Manager	Description	Use Case
`ShellServiceManager`	Shell commands	Development, systemd services
`DockerServiceManager`	Docker CLI	Production Docker/Podman containers

Usage Example¶

from backend.services.health_monitor import ServiceHealthMonitor
from backend.services.service_managers import (
    ServiceConfig,
    DockerServiceManager,
)
from backend.services.event_broadcaster import get_event_broadcaster

# Define services to monitor
services = [
    ServiceConfig(
        name="yolo26",
        health_url="http://localhost:8095/health",
        restart_cmd="docker restart yolo26",
        max_retries=5,
        backoff_base=5.0,
    ),
    ServiceConfig(
        name="nemotron",
        health_url="http://localhost:8091/health",
        restart_cmd="docker restart nemotron",
        max_retries=5,
        backoff_base=5.0,
    ),
    ServiceConfig(
        name="redis",
        health_url="redis://localhost:6379",
        restart_cmd=None,  # Restart disabled for Redis
        max_retries=3,
        backoff_base=5.0,
    ),
]

# Create monitor
manager = DockerServiceManager()
broadcaster = await get_event_broadcaster()

monitor = ServiceHealthMonitor(
    manager=manager,
    services=services,
    broadcaster=broadcaster,
    check_interval=15.0,
)

# Start monitoring
await monitor.start()

# Check status
status = monitor.get_status()
for service_name, service_status in status.items():
    print(f"{service_name}: failures={service_status['failure_count']}")

# Get recent events
events = monitor.get_recent_events(limit=10)
for event in events:
    print(f"{event.timestamp}: {event.service} - {event.event_type}")

# Stop gracefully
await monitor.stop()

Recovery Sequence Diagram¶

%%{init: {'theme': 'dark'}}%%
sequenceDiagram
    participant HM as Health Monitor
    participant SVC as Service
    participant DS as Docker/Systemd
    participant WS as WebSocket Clients

    HM->>SVC: check_health()
    SVC--xHM: unhealthy

    Note over HM: Wait backoff (5s)

    HM->>SVC: restart()
    SVC->>DS: restart command
    DS-->>SVC: restarted
    SVC-->>HM: success

    Note over HM: Wait 2s for startup

    HM->>SVC: check_health()
    SVC-->>HM: healthy

    HM->>WS: broadcast "healthy"

Best Practices¶

Set appropriate check intervals: Balance responsiveness with system load
Configure restart commands carefully: Test commands work in production environment
Monitor failure counts: Track failure_count metrics for alerting
Use WebSocket broadcasts: Keep UI informed of service status
Test recovery paths: Simulate failures to verify auto-recovery works

Circuit Breaker - Failure protection during restarts
Graceful Degradation - Service health tracking integration
Retry Handler - Retry logic coordination

Source: NEM-3458 - Health Monitoring Documentation