Dead Letter Queue Management¶
Monitor and recover failed AI pipeline jobs through the DLQ dashboard.
Time to read: ~8 min
Prerequisites: System running, access to the dashboard
Overview¶
The Dead Letter Queue (DLQ) is a holding area for jobs that have failed processing in the AI pipeline after exhausting all retry attempts. This guide covers how to monitor, investigate, and recover failed jobs using the DLQ dashboard.
Why Jobs End Up in the DLQ¶
Jobs move to the DLQ when they fail repeatedly. Common causes include:
| Failure Type | Description | Queue Affected |
|---|---|---|
| AI Service Unavailable | YOLO26 or Nemotron container is down | Both queues |
| Service Timeout | AI processing took too long | Both queues |
| GPU Memory Exhausted | VRAM full, model cannot process | Both queues |
| File Not Found | Image deleted before processing | Detection queue |
| Invalid Image Format | Corrupted or unsupported image | Detection queue |
| Context Length Exceeded | Too many detections in analysis batch | Analysis queue |
DLQ Architecture¶

Dead Letter Queue flow showing message lifecycle, retry mechanism, and manual review path.

```mermaid
flowchart TD
    subgraph Detection["Detection Pipeline"]
        FW[File Watcher] --> DQ[detection_queue]
        DQ --> YOLO[YOLO26 Detector]
        DQ -->|"3 failed retries"| DLQ_D[dlq:detection_queue]
    end
    subgraph Analysis["Analysis Pipeline"]
        BA[Batch Aggregator] --> AQ[analysis_queue]
        AQ --> NEM[Nemotron LLM]
        AQ -->|"3 failed retries"| DLQ_A[dlq:analysis_queue]
    end
    subgraph Recovery["Manual Recovery"]
        DLQ_D --> REVIEW[Dashboard Review]
        DLQ_A --> REVIEW
        REVIEW -->|Requeue| DQ
        REVIEW -->|Requeue| AQ
        REVIEW -->|Clear| TRASH[Delete]
    end
    style DLQ_D fill:#EF4444,color:#fff
    style DLQ_A fill:#EF4444,color:#fff
    style REVIEW fill:#3B82F6,color:#fff
```

Retry Behavior¶
Before a job reaches the DLQ, the system attempts processing with exponential backoff:
```mermaid
sequenceDiagram
    participant Q as Queue
    participant W as Worker
    participant S as AI Service
    participant D as DLQ
    Q->>W: Dequeue job
    W->>S: Process (attempt 1)
    S-->>W: Error
    Note over W: Wait 1s + jitter
    W->>S: Process (attempt 2)
    S-->>W: Error
    Note over W: Wait 2s + jitter
    W->>S: Process (attempt 3)
    S-->>W: Error
    W->>D: Move to DLQ
    Note over D: Job awaits manual review
```

| Setting | Default | Description |
|---|---|---|
| Max Retries | 3 | Number of attempts before a job moves to the DLQ |
| Base Delay | 1s | Initial delay between retries |
| Max Delay | 30s | Maximum delay between retries |
| Exponential Base | 2.0 | Multiplier for delay (1s, 2s, 4s, ...) |
| Jitter | 0-25% | Random variance to prevent a thundering herd |
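The delay schedule above can be sketched in a few lines of Python. `compute_delay` is a hypothetical helper for illustration, not the system's actual implementation:

```python
import random

def compute_delay(attempt: int, base: float = 1.0, exp_base: float = 2.0,
                  max_delay: float = 30.0, jitter: float = 0.25) -> float:
    """Exponential backoff: base * exp_base**(attempt-1), capped at max_delay,
    plus 0-25% random jitter to avoid a thundering herd."""
    delay = min(base * exp_base ** (attempt - 1), max_delay)
    return delay * (1 + random.uniform(0, jitter))

# With the defaults above, attempts 1-3 wait roughly 1s, 2s, 4s before jitter.
for attempt in range(1, 4):
    print(f"attempt {attempt}: {compute_delay(attempt):.2f}s")
```

The cap matters: without `max_delay`, attempt 10 would wait over 8 minutes; here it is clamped to 30s.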
DLQ Dashboard¶
The DLQ Monitor is available in the Settings page of the dashboard. It provides real-time visibility into failed jobs.
Accessing the Dashboard¶
- Navigate to Settings in the main navigation
- Scroll to the Dead Letter Queue section
- The monitor auto-refreshes every 30 seconds
Dashboard Features¶
Total Failed Badge
- Shows aggregate count of failed jobs across all queues
- Red badge indicates attention needed
Queue Panels
- Expandable panels for each queue (Detection and Analysis)
- Shows count per queue
- Click to expand and view individual jobs
Job Details

Each failed job displays:
- Error Message: The last error that caused failure
- Attempt Count: Number of processing attempts made
- First Failed At: When the job first failed
- Last Failed At: When the job most recently failed
- Original Payload: Expandable view of the original job data
Viewing Failed Items¶
Via Dashboard¶
- Expand a queue panel to see individual jobs
- Review error messages to understand failure causes
- Click "View payload" to inspect the original job data
- Look for patterns: same camera, similar timestamps, common errors
Via API¶
```bash
# Get DLQ statistics
curl http://localhost:8000/api/dlq/stats

# List jobs in detection DLQ (first 100)
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue?limit=100"

# List jobs in analysis DLQ with pagination
curl "http://localhost:8000/api/dlq/jobs/dlq:analysis_queue?start=100&limit=50"
```
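To spot failure patterns programmatically, the job listing can be grouped by error type and camera. A minimal sketch; it assumes each job carries `error_type` and `original_job.camera_id` fields as described on this page, which is an inference, not a schema guarantee:

```python
from collections import Counter

def summarize_failures(jobs: list[dict]) -> dict:
    """Count DLQ jobs by error type and camera to reveal clusters."""
    by_error = Counter(job.get("error_type", "unknown") for job in jobs)
    by_camera = Counter(job.get("original_job", {}).get("camera_id", "unknown")
                        for job in jobs)
    return {"by_error": dict(by_error), "by_camera": dict(by_camera)}

# Hypothetical sample data shaped like the DLQ job listing
sample = [
    {"error_type": "ConnectionError", "original_job": {"camera_id": "cam-1"}},
    {"error_type": "ConnectionError", "original_job": {"camera_id": "cam-1"}},
    {"error_type": "TimeoutError", "original_job": {"camera_id": "cam-2"}},
]
print(summarize_failures(sample))
```

A single dominant error type suggests a systemic problem; a single dominant camera suggests a configuration issue with that camera.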
Error Context (Enhanced Debugging)¶
Each DLQ job includes enriched error context for faster debugging:
| Field | Description |
|---|---|
| `error_type` | Exception class name (e.g., `ConnectionError`) |
| `stack_trace` | Truncated stack trace (max 4 KB) |
| `http_status` | HTTP status code, if the failure was a network error |
| `response_body` | Truncated AI service response (max 2 KB) |
| `retry_delays` | Actual delays applied between retries |
| `context` | System state at failure (queue depths, etc.) |
Retry Functionality¶
Requeue All Jobs¶
The "Requeue All" button moves all jobs from a DLQ back to their original processing queue for retry.
Via Dashboard:
- Expand the queue panel
- Click "Requeue All"
- Confirm the action in the dialog
- Jobs are moved to the processing queue
Via API:
```bash
# Requeue all detection failures
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue \
  -H "X-API-Key: your-api-key"

# Requeue all analysis failures
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue \
  -H "X-API-Key: your-api-key"
```
Limits:
- Maximum 10,000 jobs requeued per call (configurable via `MAX_REQUEUE_ITERATIONS`)
- If more jobs exist, the response indicates the limit was hit
Requeue Single Job¶
```bash
# Requeue oldest job from DLQ
curl -X POST http://localhost:8000/api/dlq/requeue/dlq:detection_queue \
  -H "X-API-Key: your-api-key"
```
Clearing/Purging the DLQ¶
When to Clear¶
Clear the DLQ when jobs are:
- From deleted cameras (no longer relevant)
- Old files that no longer exist
- Known invalid data that shouldn't be reprocessed
Via Dashboard¶
- Expand the queue panel
- Click "Clear All"
- Confirm the destructive action
- All jobs are permanently deleted
Via API¶
```bash
# Clear detection DLQ
curl -X DELETE http://localhost:8000/api/dlq/dlq:detection_queue \
  -H "X-API-Key: your-api-key"

# Clear analysis DLQ
curl -X DELETE http://localhost:8000/api/dlq/dlq:analysis_queue \
  -H "X-API-Key: your-api-key"
```
Warning: Clearing the DLQ permanently deletes all jobs. This cannot be undone. Always review jobs before clearing.
Monitoring DLQ Health¶
Dashboard Indicators¶
- Badge Color: Red indicates failed jobs present
- Queue Count: Number next to each queue name
- Auto-Refresh: Dashboard updates every 30 seconds
API Monitoring¶
```bash
# Quick health check
curl http://localhost:8000/api/dlq/stats | jq '.total_count'

# Watch DLQ growth
watch -n 10 'curl -s http://localhost:8000/api/dlq/stats | jq'
```
Prometheus Metrics (NEM-3891)¶
The DLQ depth is exposed as a Prometheus gauge metric for alerting and dashboards:
| Metric | Labels | Description |
|---|---|---|
| `hsi_dlq_depth` | `queue_name` | Current number of jobs in the DLQ |
| `hsi_queue_items_moved_to_dlq_total` | `queue_name` | Cumulative jobs moved to the DLQ |
Example PromQL Queries:
```promql
# Total jobs in all DLQs
sum(hsi_dlq_depth)

# Jobs in detection DLQ only
hsi_dlq_depth{queue_name="dlq:detection_queue"}

# Rate of jobs moving to DLQ per minute
rate(hsi_queue_items_moved_to_dlq_total[5m]) * 60
```
Prometheus Alerts¶
The following alerts are pre-configured in monitoring/prometheus_rules.yml:
| Alert | Threshold | Severity | Description |
|---|---|---|---|
| AIDLQHasMessages | > 0 for 5m | Warning | DLQ has failed messages requiring attention |
| AIDLQGrowing | +5 in 15m | Warning | DLQ growing rapidly (systemic issue) |
| AIDLQCritical | > 50 for 2m | Critical | DLQ critically large (data loss risk) |
Circuit Breaker Protection¶
The DLQ has circuit breaker protection to prevent cascading failures when Redis is overloaded:
| State | Behavior |
|---|---|
| CLOSED | Normal operation, jobs written to DLQ |
| OPEN | DLQ writes skipped, jobs logged as DATA LOSS |
| HALF_OPEN | Testing recovery, limited writes |
Configuration (environment variables):
| Variable | Default | Range | Description |
|---|---|---|---|
| `DLQ_CIRCUIT_BREAKER_FAILURE_THRESHOLD` | 5 | 1-50 | Number of DLQ write failures before opening the circuit breaker |
| `DLQ_CIRCUIT_BREAKER_RECOVERY_TIMEOUT` | 60.0 | 10.0-600.0 | Seconds to wait before attempting DLQ writes again after the circuit opens |
| `DLQ_CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS` | 3 | 1-10 | Maximum test calls allowed while the circuit is half-open |
| `DLQ_CIRCUIT_BREAKER_SUCCESS_THRESHOLD` | 2 | 1-10 | Successful DLQ writes needed to close the circuit from the half-open state |
These settings are defined in backend/core/config.py (lines 1596-1619) and can be overridden via environment variables.
When the circuit is open, check logs for CRITICAL DATA LOSS entries.
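The state transitions above can be modeled as a minimal circuit breaker. This is an illustrative sketch, not the system's implementation; for brevity it omits the half-open call cap (`DLQ_CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS`):

```python
import time

class CircuitBreaker:
    """Minimal model of the CLOSED / OPEN / HALF_OPEN states described above."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock           # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Should a DLQ write be attempted right now?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # start probing recovery
                self.successes = 0
                return True
            return False                  # write skipped: potential data loss
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"     # Redis healthy again
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Injecting a fake clock makes the timeout path testable without sleeping, which is why `clock` is a constructor parameter here.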
Recovery Workflows¶
After AI Service Outage¶
- Verify services are healthy:

  ```bash
  curl http://localhost:8000/api/system/health
  curl http://localhost:8095/health  # YOLO26
  curl http://localhost:8091/health  # Nemotron
  ```

- Check the DLQ size
- Requeue all jobs for retry:

  ```bash
  curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue \
    -H "X-API-Key: your-api-key"
  curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue \
    -H "X-API-Key: your-api-key"
  ```

- Monitor for new failures
Investigating Recurring Failures¶
- List the failed jobs
- Look for patterns:
  - Same `camera_id`: camera configuration issue
  - Same error: systemic problem
  - Clustered timestamps: transient outage
- Check the error context:
  - `error_type`: category of failure
  - `http_status`: network/service issues
  - `context`: system state at failure time
Handling Stale Jobs¶
Jobs may become stale when:
- Source files were deleted
- Camera was removed from system
- Retention policy cleaned up data
Identify stale jobs:

```bash
# List jobs and check file_path existence
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue" | \
  jq '.jobs[] | .original_job.file_path'
```
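The same check can be scripted. A sketch that flags jobs whose source file no longer exists; it assumes each job carries `original_job.file_path` as shown above:

```python
import os

def find_stale_jobs(jobs: list[dict]) -> list[dict]:
    """Return DLQ jobs whose source image file has been deleted."""
    stale = []
    for job in jobs:
        path = job.get("original_job", {}).get("file_path")
        if path and not os.path.exists(path):
            stale.append(job)
    return stale

# Hypothetical job shaped like a DLQ listing entry
jobs = [{"original_job": {"file_path": "/no/such/file.jpg"}}]
print(len(find_stale_jobs(jobs)))  # 1
```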
If all remaining jobs are confirmed stale, clear the queue using the `DELETE` endpoint described in Clearing/Purging the DLQ above.
Authentication¶
Destructive operations (requeue, clear) require API key authentication when `API_KEY_ENABLED=true`:
- HTTP Header (preferred): pass the key in the `X-API-Key` request header, as in the examples above
- Query Parameter (fallback): append the key as a query parameter when headers cannot be set
Read-only operations (stats, listing jobs) do not require authentication.
Configuration Reference¶
| Variable | Default | Description |
|---|---|---|
| `MAX_REQUEUE_ITERATIONS` | 10000 | Maximum jobs requeued in a single call |
| `API_KEY_ENABLED` | false | Require authentication for destructive operations |
| `API_KEYS` | [] | Valid API keys (JSON array) |
Related Documentation¶
- DLQ API Reference - Complete API documentation
- Architecture: Resilience - Retry and circuit breaker patterns
- AI Troubleshooting - Diagnosing AI service issues
- Monitoring Guide - System health monitoring