Dead Letter Queue Management¶
Monitor and recover failed AI pipeline jobs through the DLQ dashboard.
Time to read: ~8 min
Prerequisites: System running, access to the dashboard
Overview¶
The Dead Letter Queue (DLQ) is a holding area for jobs that have failed processing in the AI pipeline after exhausting all retry attempts. This guide covers how to monitor, investigate, and recover failed jobs using the DLQ dashboard.
Why Jobs End Up in the DLQ¶
Jobs move to the DLQ when they fail repeatedly. Common causes include:
| Failure Type | Description | Queue Affected |
|---|---|---|
| AI Service Unavailable | YOLO26 or Nemotron container is down | Both queues |
| Service Timeout | AI processing took too long | Both queues |
| GPU Memory Exhausted | VRAM full, model cannot process | Both queues |
| File Not Found | Image deleted before processing | Detection queue |
| Invalid Image Format | Corrupted or unsupported image | Detection queue |
| Context Length Exceeded | Too many detections in analysis batch | Analysis queue |
DLQ Architecture¶

Dead Letter Queue flow showing message lifecycle, retry mechanism, and manual review path.

```mermaid
flowchart TD
    subgraph Detection["Detection Pipeline"]
        FW[File Watcher] --> DQ[detection_queue]
        DQ --> YOLO[YOLO26 Detector]
        DQ -->|"3 failed retries"| DLQ_D[dlq:detection_queue]
    end
    subgraph Analysis["Analysis Pipeline"]
        BA[Batch Aggregator] --> AQ[analysis_queue]
        AQ --> NEM[Nemotron LLM]
        AQ -->|"3 failed retries"| DLQ_A[dlq:analysis_queue]
    end
    subgraph Recovery["Manual Recovery"]
        DLQ_D --> REVIEW[Dashboard Review]
        DLQ_A --> REVIEW
        REVIEW -->|Requeue| DQ
        REVIEW -->|Requeue| AQ
        REVIEW -->|Clear| TRASH[Delete]
    end
    style DLQ_D fill:#EF4444,color:#fff
    style DLQ_A fill:#EF4444,color:#fff
    style REVIEW fill:#3B82F6,color:#fff
```

Retry Behavior¶
Before a job reaches the DLQ, the system attempts processing with exponential backoff:
```mermaid
sequenceDiagram
    participant Q as Queue
    participant W as Worker
    participant S as AI Service
    participant D as DLQ
    Q->>W: Dequeue job
    W->>S: Process (attempt 1)
    S-->>W: Error
    Note over W: Wait 1s + jitter
    W->>S: Process (attempt 2)
    S-->>W: Error
    Note over W: Wait 2s + jitter
    W->>S: Process (attempt 3)
    S-->>W: Error
    W->>D: Move to DLQ
    Note over D: Job awaits manual review
```

| Setting | Default | Description |
|---|---|---|
| Max Retries | 3 | Number of attempts before a job moves to the DLQ |
| Base Delay | 1s | Initial delay between retries |
| Max Delay | 30s | Maximum delay between retries |
| Exponential Base | 2.0 | Multiplier for delay (1s, 2s, 4s, ...) |
| Jitter | 0-25% | Random variance to prevent a thundering herd |
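The delay schedule above can be sketched in a few lines of Python. `compute_delay` is a hypothetical helper for illustration, not the system's actual implementation:

```python
import random

def compute_delay(attempt: int, base: float = 1.0, exp_base: float = 2.0,
                  max_delay: float = 30.0, jitter: float = 0.25) -> float:
    """Exponential backoff: base * exp_base**(attempt-1), capped at max_delay,
    plus 0-25% random jitter to avoid a thundering herd."""
    delay = min(base * exp_base ** (attempt - 1), max_delay)
    return delay * (1 + random.uniform(0, jitter))

# With the defaults above, attempts 1-3 wait roughly 1s, 2s, 4s before jitter.
for attempt in range(1, 4):
    print(f"attempt {attempt}: {compute_delay(attempt):.2f}s")
```

The cap matters: without `max_delay`, attempt 10 would wait over 8 minutes; here it is clamped to 30s.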
DLQ Dashboard¶
The DLQ Monitor is available in the Settings page of the dashboard. It provides real-time visibility into failed jobs.
Accessing the Dashboard¶
- Navigate to Settings in the main navigation
- Scroll to the Dead Letter Queue section
- The monitor auto-refreshes every 30 seconds
Dashboard Features¶
Total Failed Badge
- Shows aggregate count of failed jobs across all queues
- Red badge indicates attention needed
Queue Panels
- Expandable panels for each queue (Detection and Analysis)
- Shows count per queue
- Click to expand and view individual jobs
Job Details

Each failed job displays:
- Error Message: The last error that caused failure
- Attempt Count: Number of processing attempts made
- First Failed At: When the job first failed
- Last Failed At: When the job most recently failed
- Original Payload: Expandable view of the original job data
Viewing Failed Items¶
Via Dashboard¶
- Expand a queue panel to see individual jobs
- Review error messages to understand failure causes
- Click "View payload" to inspect the original job data
- Look for patterns: same camera, similar timestamps, common errors
Via API¶
```bash
# Get DLQ statistics
curl http://localhost:8000/api/dlq/stats

# List jobs in detection DLQ (first 100)
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue?limit=100"

# List jobs in analysis DLQ with pagination
curl "http://localhost:8000/api/dlq/jobs/dlq:analysis_queue?start=100&limit=50"
```
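To spot failure patterns programmatically, the job listing can be grouped by error type and camera. A minimal sketch; it assumes each job carries `error_type` and `original_job.camera_id` fields as described on this page, which is an inference, not a schema guarantee:

```python
from collections import Counter

def summarize_failures(jobs: list[dict]) -> dict:
    """Count DLQ jobs by error type and camera to reveal clusters."""
    by_error = Counter(job.get("error_type", "unknown") for job in jobs)
    by_camera = Counter(job.get("original_job", {}).get("camera_id", "unknown")
                        for job in jobs)
    return {"by_error": dict(by_error), "by_camera": dict(by_camera)}

# Hypothetical sample data shaped like the DLQ job listing
sample = [
    {"error_type": "ConnectionError", "original_job": {"camera_id": "cam-1"}},
    {"error_type": "ConnectionError", "original_job": {"camera_id": "cam-1"}},
    {"error_type": "TimeoutError", "original_job": {"camera_id": "cam-2"}},
]
print(summarize_failures(sample))
```

A single dominant error type suggests a systemic problem; a single dominant camera suggests a configuration issue with that camera.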
Error Context (Enhanced Debugging)¶
Each DLQ job includes enriched error context for faster debugging:
| Field | Description |
|---|---|
| `error_type` | Exception class name (e.g., `ConnectionError`) |
| `stack_trace` | Truncated stack trace (max 4 KB) |
| `http_status` | HTTP status code, if the failure was a network error |
| `response_body` | Truncated AI service response (max 2 KB) |
| `retry_delays` | Actual delays applied between retries |
| `context` | System state at failure (queue depths, etc.) |
Retry Functionality¶
Requeue All Jobs¶
The "Requeue All" button moves all jobs from a DLQ back to their original processing queue for retry.
Via Dashboard:
- Expand the queue panel
- Click "Requeue All"
- Confirm the action in the dialog
- Jobs are moved to the processing queue
Via API:
```bash
# Requeue all detection failures
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue \
  -H "X-API-Key: your-api-key"

# Requeue all analysis failures
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue \
  -H "X-API-Key: your-api-key"
```
Limits:
- Maximum 10,000 jobs requeued per call (configurable via `MAX_REQUEUE_ITERATIONS`)
- If more jobs exist, the response indicates the limit was hit
Requeue Single Job¶
```bash
# Requeue oldest job from DLQ
curl -X POST http://localhost:8000/api/dlq/requeue/dlq:detection_queue \
  -H "X-API-Key: your-api-key"
```
Clearing/Purging the DLQ¶
When to Clear¶
Clear the DLQ when jobs are:
- From deleted cameras (no longer relevant)
- Old files that no longer exist
- Known invalid data that shouldn't be reprocessed
Via Dashboard¶
- Expand the queue panel
- Click "Clear All"
- Confirm the destructive action
- All jobs are permanently deleted
Via API¶
```bash
# Clear detection DLQ
curl -X DELETE http://localhost:8000/api/dlq/dlq:detection_queue \
  -H "X-API-Key: your-api-key"

# Clear analysis DLQ
curl -X DELETE http://localhost:8000/api/dlq/dlq:analysis_queue \
  -H "X-API-Key: your-api-key"
```
Warning: Clearing the DLQ permanently deletes all jobs. This cannot be undone. Always review jobs before clearing.
Monitoring DLQ Health¶
Dashboard Indicators¶
- Badge Color: Red indicates failed jobs present
- Queue Count: Number next to each queue name
- Auto-Refresh: Dashboard updates every 30 seconds
API Monitoring¶
```bash
# Quick health check
curl http://localhost:8000/api/dlq/stats | jq '.total_count'

# Watch DLQ growth
watch -n 10 'curl -s http://localhost:8000/api/dlq/stats | jq'
```
Prometheus Metrics (NEM-3891)¶
The DLQ depth is exposed as a Prometheus gauge metric for alerting and dashboards:
| Metric | Labels | Description |
|---|---|---|
| `hsi_dlq_depth` | `queue_name` | Current number of jobs in the DLQ |
| `hsi_queue_items_moved_to_dlq_total` | `queue_name` | Cumulative jobs moved to the DLQ |
Example PromQL Queries:
```promql
# Total jobs in all DLQs
sum(hsi_dlq_depth)

# Jobs in detection DLQ only
hsi_dlq_depth{queue_name="dlq:detection_queue"}

# Rate of jobs moving to DLQ per minute
rate(hsi_queue_items_moved_to_dlq_total[5m]) * 60
```
Prometheus Alerts¶
The following alerts are pre-configured in monitoring/prometheus_rules.yml:
| Alert | Threshold | Severity | Description |
|---|---|---|---|
| AIDLQHasMessages | > 0 for 5m | Warning | DLQ has failed messages requiring attention |
| AIDLQGrowing | +5 in 15m | Warning | DLQ growing rapidly (systemic issue) |
| AIDLQCritical | > 50 for 2m | Critical | DLQ critically large (data loss risk) |
Circuit Breaker Protection¶
The DLQ has circuit breaker protection to prevent cascading failures when Redis is overloaded:
| State | Behavior |
|---|---|
| CLOSED | Normal operation, jobs written to DLQ |
| OPEN | DLQ writes skipped, jobs logged as DATA LOSS |
| HALF_OPEN | Testing recovery, limited writes |
Configuration (environment variables):
| Variable | Default | Range | Description |
|---|---|---|---|
| `DLQ_CIRCUIT_BREAKER_FAILURE_THRESHOLD` | 5 | 1-50 | Number of DLQ write failures before opening the circuit breaker |
| `DLQ_CIRCUIT_BREAKER_RECOVERY_TIMEOUT` | 60.0 | 10.0-600.0 | Seconds to wait before attempting DLQ writes again after the circuit opens |
| `DLQ_CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS` | 3 | 1-10 | Maximum test calls allowed while the circuit is half-open |
| `DLQ_CIRCUIT_BREAKER_SUCCESS_THRESHOLD` | 2 | 1-10 | Successful DLQ writes needed to close the circuit from the half-open state |
These settings are defined in backend/core/config.py (lines 1596-1619) and can be overridden via environment variables.
When the circuit is open, check logs for CRITICAL DATA LOSS entries.
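The state transitions above can be modeled as a minimal circuit breaker. This is an illustrative sketch, not the system's implementation; for brevity it omits the half-open call cap (`DLQ_CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS`):

```python
import time

class CircuitBreaker:
    """Minimal model of the CLOSED / OPEN / HALF_OPEN states described above."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock           # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Should a DLQ write be attempted right now?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # start probing recovery
                self.successes = 0
                return True
            return False                  # write skipped: potential data loss
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"     # Redis healthy again
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Injecting a fake clock makes the timeout path testable without sleeping, which is why `clock` is a constructor parameter here.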
Recovery Workflows¶
After AI Service Outage¶
- Verify services are healthy:

  ```bash
  curl http://localhost:8000/api/system/health
  curl http://localhost:8095/health  # YOLO26
  curl http://localhost:8091/health  # Nemotron
  ```

- Check the DLQ size
- Requeue all jobs for retry:

  ```bash
  curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue \
    -H "X-API-Key: your-api-key"
  curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue \
    -H "X-API-Key: your-api-key"
  ```

- Monitor for new failures
Investigating Recurring Failures¶
- List the failed jobs
- Look for patterns:
  - Same `camera_id`: camera configuration issue
  - Same error: systemic problem
  - Clustered timestamps: transient outage
- Check the error context:
  - `error_type`: category of failure
  - `http_status`: network/service issues
  - `context`: system state at failure time
Handling Stale Jobs¶
Jobs may become stale when:
- Source files were deleted
- Camera was removed from system
- Retention policy cleaned up data
Identify stale jobs:

```bash
# List jobs and check file_path existence
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue" | \
  jq '.jobs[] | .original_job.file_path'
```
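The same check can be scripted. A sketch that flags jobs whose source file no longer exists; it assumes each job carries `original_job.file_path` as shown above:

```python
import os

def find_stale_jobs(jobs: list[dict]) -> list[dict]:
    """Return DLQ jobs whose source image file has been deleted."""
    stale = []
    for job in jobs:
        path = job.get("original_job", {}).get("file_path")
        if path and not os.path.exists(path):
            stale.append(job)
    return stale

# Hypothetical job shaped like a DLQ listing entry
jobs = [{"original_job": {"file_path": "/no/such/file.jpg"}}]
print(len(find_stale_jobs(jobs)))  # 1
```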
If all remaining jobs are confirmed stale, clear the queue using the `DELETE` endpoint described in Clearing/Purging the DLQ above.
Authentication¶
Destructive operations (requeue, clear) require API key authentication when `API_KEY_ENABLED=true`:
- HTTP Header (preferred): pass the key in the `X-API-Key` request header, as in the examples above
- Query Parameter (fallback): append the key as a query parameter when headers cannot be set
Read-only operations (stats, listing jobs) do not require authentication.
Configuration Reference¶
| Variable | Default | Description |
|---|---|---|
| `MAX_REQUEUE_ITERATIONS` | 10000 | Maximum jobs requeued in a single call |
| `API_KEY_ENABLED` | false | Require authentication for destructive operations |
| `API_KEYS` | [] | Valid API keys (JSON array) |
Related Documentation¶
- DLQ API Reference - Complete API documentation
- Architecture: Resilience - Retry and circuit breaker patterns
- AI Troubleshooting - Diagnosing AI service issues
- Monitoring Guide - System health monitoring