Troubleshooting Index¶
Your first stop when something goes wrong. Find your symptom, get to the solution fast.
Time to read: ~5 min · Prerequisites: None
Quick Self-Check Before Troubleshooting¶
Before diving into specific issues, run these quick checks:
# 1. System health (is everything running?)
curl -s http://localhost:8000/api/system/health | jq .
# 2. Service status (are all containers up?)
docker compose -f docker-compose.prod.yml ps
# 3. GPU status (is the GPU available?)
nvidia-smi
# 4. Recent logs (any obvious errors?)
docker compose -f docker-compose.prod.yml logs --tail=50 backend
If all services show "healthy" and containers are running, proceed to the specific symptom below.
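The four checks above can be wrapped in a single probe loop. A minimal sketch, assuming the default ports used throughout this guide; `probe` is a hypothetical helper, not part of the project:

```shell
# Probe each health endpoint and print OK/FAIL. Ports are the defaults
# used elsewhere in this guide (8000 backend, 8095 YOLO26, 8091 Nemotron).
probe() {  # usage: probe <url>
  if curl -fsS --max-time 5 "$1" >/dev/null 2>&1; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}

probe http://localhost:8000/api/system/health
probe http://localhost:8095/health
probe http://localhost:8091/health
```

Any FAIL line tells you which section below to jump to.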
Fast Triage Flow (Health → Fix)¶
Use this when you're not sure where to start.

Decision tree for diagnosing system health issues (diagram): start with the health check endpoint, then follow the branch for whichever service is failing to reach the right fix quickly.
Symptom Quick Reference Table¶
| Symptom | Likely Cause | Quick Fix | Detailed Guide |
|---|---|---|---|
| Dashboard shows no events | File watcher not running or AI services down | Restart backend | Events Not Appearing |
| Risk gauge stuck at 0 | Nemotron service unavailable | Start Nemotron LLM | AI Issues |
| Camera shows offline | Camera not uploading or folder path wrong | Check FTP and folder config | Camera Offline |
| AI not responding | Services not started or port conflicts | Start AI services | AI Not Working |
| WebSocket disconnected | Backend down or network issues | Check backend health | WebSocket Issues |
| High CPU/memory usage | Too many images or memory leak | Check queue sizes | Performance Issues |
| Disk space running out | Retention not configured | Run cleanup | Disk Space Issues |
| Slow detection response | GPU not being used | Check CUDA availability | Slow Performance |
| "Connection refused" errors | Service not running | Start the service | Connection Issues |
| CORS errors in browser | Frontend/backend URL mismatch | Update CORS_ORIGINS | CORS Errors |
Dashboard Shows No Events¶
What You See¶
- Empty activity feed
- No recent events in timeline
- Risk gauge may be at 0 or stale
Quick Diagnosis¶
# Check if events exist in database
curl -s "http://localhost:8000/api/events?limit=5" | jq .count
# Check pipeline status
curl -s http://localhost:8000/api/system/pipeline | jq .
Possible Causes (Most Likely First)¶
- File watcher not running - Images not being picked up
- AI services not running - Detections not being created
- No images uploaded - Cameras not sending images
- Batch not completing - Detections queued but not analyzed
Solutions¶
1. Check file watcher status:
curl -s http://localhost:8000/api/system/health/ready | jq '.workers[] | select(.name | contains("detection"))'
If not running, restart backend: docker compose -f docker-compose.prod.yml restart backend
2. Verify images are being uploaded:
3. Check AI service health:
curl http://localhost:8095/health # YOLO26
curl http://localhost:8091/health # Nemotron
curl http://localhost:8092/health # Florence-2 (optional)
curl http://localhost:8093/health # CLIP (optional)
curl http://localhost:8094/health # Enrichment (optional)
4. Check queue depths:
If queues are growing but not processing, AI services may be down.
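Step 4's "growing but not processing" condition can be expressed as a simple threshold check. A sketch, assuming Redis list queues named as in the Redis commands later in this guide; the threshold of 100 is an arbitrary illustration value, tune it for your load:

```shell
# Flag a queue as backed up when its depth exceeds a threshold.
queue_backed_up() {  # usage: queue_backed_up <depth> [threshold]
  depth="$1"; threshold="${2:-100}"
  [ "$depth" -gt "$threshold" ]
}

# Example wiring (needs redis-cli and a running Redis):
# depth=$(redis-cli llen detection_queue)
# queue_backed_up "$depth" && echo "detection_queue backed up at $depth"
```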
See: Connection Issues, AI Issues
Risk Gauge Stuck at 0¶
What You See¶
- Dashboard risk gauge shows 0 or minimal value
- Events exist but have null risk scores
- "Analyzing..." spinner never completes
Quick Diagnosis¶
# Check recent events for risk scores
curl -s "http://localhost:8000/api/events?limit=3" | jq '.events[].risk_score'
# Check Nemotron health
curl -s http://localhost:8091/health
Possible Causes¶
- Nemotron service not running - Most common cause
- Nemotron timeout - Model too slow or overloaded
- LLM response parsing failure - Invalid JSON from model
Solutions¶
1. Start Nemotron if not running:
2. Check Nemotron logs:
tail -f /tmp/nemotron-llm.log
# Or in container:
docker compose -f docker-compose.prod.yml logs -f ai-llm
3. Increase timeout if needed:
4. Test Nemotron directly:
curl -X POST http://localhost:8091/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Say hello", "max_tokens": 20}'
See: AI Issues - Analysis Failing
Camera Shows Offline¶
What You See¶
- Camera status indicator shows offline/error
- No new detections from specific camera
- last_seen_at timestamp is stale
Quick Diagnosis¶
# Check camera status in database
curl -s http://localhost:8000/api/cameras | jq '.cameras[] | {name, status, last_seen_at}'
# Check if images exist in camera folder
ls -lt /export/foscam/<camera_name>/ | head -5
Possible Causes¶
- Camera not FTP uploading - Network or camera config issue
- Wrong folder path - Camera registered with incorrect path
- File permissions - Backend cannot read camera folder
- Camera hardware issue - Camera offline or rebooting
Solutions¶
1. Verify camera is uploading:
2. Check folder path in camera settings:
3. Fix permissions:
4. Check FTP server status:
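Steps 2-3 can be checked programmatically: confirm the newest file in the camera folder is readable. A sketch; `folder_readable` is a hypothetical helper, and the `front_door` folder name and `backend` user in the example are placeholders:

```shell
# Exit 0 if the newest file in <dir> is readable, non-zero otherwise.
folder_readable() {  # usage: folder_readable <dir>
  dir="$1"
  newest=$(ls -t "$dir" 2>/dev/null | head -1)
  [ -n "$newest" ] && [ -r "$dir/$newest" ]
}

# Example: check a camera folder and fix ownership if needed
# if ! folder_readable /export/foscam/front_door; then
#   sudo chown -R backend:backend /export/foscam/front_door
# fi
```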
See: Connection Issues - File Watcher
AI Not Working¶
What You See¶
- Health check shows AI services as unhealthy
- Error: "YOLO26 service connection refused"
- Error: "Nemotron service connection refused"
- No detections being created
Quick Diagnosis¶
# Overall AI status
curl -s http://localhost:8000/api/system/health | jq '.services.ai'
# Individual service checks
curl http://localhost:8095/health # Should return {"status": "ok", ...}
curl http://localhost:8091/health # Should return {"status": "ok"}
Possible Causes (Most Likely First)¶
- AI services not started - Need to start them manually
- Port conflicts - Something else using 8095/8091
- GPU not available - CUDA not initialized
- Model files missing - Models not downloaded
Solutions¶
1. Start AI services:
# Both services
./scripts/start-ai.sh start
# Or individually
./ai/start_detector.sh # YOLO26
./ai/start_llm.sh # Nemotron
2. Check for port conflicts:
3. Verify CUDA:
4. Download models if missing:
5. Check AI service logs:
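For step 2, a port can be checked without extra tooling by reading `/proc/net/tcp`. A Linux-only sketch (`0A` is the hex code for the LISTEN state); ports match the table above:

```shell
# Return 0 if something is listening on the given TCP port (Linux only).
port_in_use() {  # usage: port_in_use <port>
  hex=$(printf '%04X' "$1")
  grep -qE ":$hex [0-9A-F]+:[0-9A-F]+ 0A " /proc/net/tcp /proc/net/tcp6 2>/dev/null
}

for p in 8095 8091; do
  port_in_use "$p" && echo "port $p in use" || echo "port $p free"
done
```

If a port is in use but the AI service is not running, something else has claimed it; find the owner with `ss -ltnp` and stop it or move the service to another port.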
See: AI Issues, GPU Issues
WebSocket Disconnected¶
What You See¶
- Dashboard shows "Disconnected" status
- Real-time updates stop working
- Events appear in timeline but feed not updating
- Browser console shows WebSocket errors
Quick Diagnosis¶
# Test WebSocket endpoint
websocat ws://localhost:8000/ws/events
# Check backend is responding
curl http://localhost:8000/api/system/health
Possible Causes¶
- Backend not running - Container down
- Proxy/firewall issues - WebSocket blocked
- Idle timeout - Connection closed due to inactivity
- Rate limiting - Too many reconnection attempts
Solutions¶
1. Restart backend:
2. Check backend logs:
3. Adjust idle timeout:
4. Check rate limits:
5. Clear browser cache and reload
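After a restart (step 1), the backend may take a few seconds to accept connections again; polling with backoff avoids reloading the dashboard too early. A sketch using the health URL from the diagnosis above; `wait_healthy` is a hypothetical helper:

```shell
# Poll a URL with exponential backoff until it responds or attempts run out.
wait_healthy() {  # usage: wait_healthy <url> [attempts]
  url="$1"; attempts="${2:-5}"; delay=1; i=0
  while [ "$i" -lt "$attempts" ]; do
    curl -fsS --max-time 5 "$url" >/dev/null 2>&1 && return 0
    sleep "$delay"; delay=$((delay * 2)); i=$((i + 1))
  done
  return 1
}

# docker compose -f docker-compose.prod.yml restart backend
# wait_healthy http://localhost:8000/api/system/health || echo "backend still down"
```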
See: Connection Issues - WebSocket
High CPU/Memory Usage¶
What You See¶
- System becomes slow or unresponsive
- Container restarts due to OOM
- Backend logs show high latency
Quick Diagnosis¶
# Container resource usage
docker stats
# Queue backlogs
curl -s http://localhost:8000/api/system/telemetry | jq '.queues'
# DLQ size
curl -s http://localhost:8000/api/dlq/stats
Possible Causes¶
- Queue backlog - Images piling up faster than processed
- Too many cameras - Overloading the system
- Memory leak - Long-running process issue
- Insufficient resources - Container limits too low
Solutions¶
1. Check and clear queues if backed up:
# View queue sizes
redis-cli llen detection_queue
redis-cli llen analysis_queue
# If severely backed up, you may need to clear
# CAUTION: This loses queued jobs
redis-cli del detection_queue
2. Increase container memory limits:
3. Reduce camera throughput:
- Increase camera upload interval
- Reduce number of active cameras
4. Restart services:
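Step 2's memory limit lives in the compose file. A minimal sketch of what the override might look like; the `4g` value and service layout are assumptions, match them to your actual docker-compose.prod.yml:

```yaml
services:
  backend:
    deploy:
      resources:
        limits:
          memory: 4g   # raise if the container is repeatedly OOM-killed
```

Apply with `docker compose -f docker-compose.prod.yml up -d backend` after editing.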
Disk Space Running Out¶
What You See¶
- Error: "No space left on device"
- Database operations fail
- Thumbnail generation fails
Quick Diagnosis¶
# Check disk usage
df -h
# Check storage stats
curl -s http://localhost:8000/api/system/storage | jq .
# Database size
psql -h localhost -U security -d security -c "SELECT pg_size_pretty(pg_database_size('security'));"
Possible Causes¶
- Retention too long - Default 30 days may be too much
- Cleanup not running - Scheduled cleanup failing
- Thumbnails accumulating - Largest storage consumer
- Camera images not cleaned - Original images retained
Solutions¶
1. Run immediate cleanup:
# Preview first
curl -s -X POST "http://localhost:8000/api/system/cleanup?dry_run=true" | jq .
# Execute cleanup
curl -X POST http://localhost:8000/api/system/cleanup
2. Reduce retention period:
3. Vacuum PostgreSQL:
4. Enable image deletion (if you have backups): Configure delete_images=True in cleanup service
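Cleanup is most urgent when the partition is nearly full; a sketch of a usage check that could gate an automated cleanup run. The 90% threshold and the `/` mount point are assumptions, point it at whichever filesystem holds your data:

```shell
# Print the used percentage of the filesystem holding <path>.
disk_pct_used() {  # usage: disk_pct_used <path>
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

pct=$(disk_pct_used /)
if [ "$pct" -ge 90 ]; then
  echo "WARNING: / is ${pct}% full, run the cleanup endpoint above"
fi
```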
See: Database Issues - Disk Space
Slow AI Inference¶
What You See¶
- Detection takes >100ms (expected: 30-50ms)
- LLM responses take >30s (expected: 2-5s)
- Queue backlogs growing
Quick Diagnosis¶
# Check GPU utilization
nvidia-smi
# Check device being used
curl -s http://localhost:8095/health | jq '.device'
# Check pipeline latency
curl -s http://localhost:8000/api/system/pipeline-latency | jq .
Possible Causes¶
- Running on CPU instead of GPU - Most common
- GPU thermal throttling - Temperature too high
- VRAM exhausted - Too many models loaded
- Other GPU processes - Competing for resources
Solutions¶
1. Verify GPU is being used:
# YOLO26 should show "cuda" or "cuda:0"
curl -s http://localhost:8095/health | jq '.device'
# Check GPU processes
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
2. Check temperature:
3. Free GPU memory:
# CAUTION: kills every process using the GPU, including your own services
sudo fuser -k /dev/nvidia*
# Restart AI services
./scripts/start-ai.sh restart
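Step 2's temperature check can be scripted with nvidia-smi's query interface. A sketch; the 83 °C threshold is an assumption (consumer cards typically begin throttling around 83-87 °C, check your card's spec):

```shell
# Read the GPU temperature in Celsius; prints nothing if no GPU is visible.
gpu_temp() {
  nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits 2>/dev/null | head -1
}

t=$(gpu_temp)
if [ -n "$t" ] && [ "$t" -ge 83 ]; then
  echo "GPU at ${t}C: likely thermal throttling, improve cooling"
fi
```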
See: GPU Issues
CORS Errors in Browser¶
What You See¶
- Browser console: "CORS policy blocked"
- API calls work in curl but fail in browser
- Dashboard shows errors loading data
Quick Diagnosis¶
# Test CORS headers
curl -v -X OPTIONS -H "Origin: http://localhost:5173" \
http://localhost:8000/api/events 2>&1 | grep -i "access-control"
Solutions¶
1. Update CORS_ORIGINS in .env:
2. Restart backend after changes:
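Step 1 boils down to listing every origin the browser loads the dashboard from. A sketch of the .env entry, assuming a comma-separated format (verify against your backend's settings parser); the LAN address is a placeholder:

```ini
# .env — each entry must match the scheme, host, and port in the browser's
# address bar exactly; http://localhost and http://127.0.0.1 are different origins
CORS_ORIGINS=http://localhost:5173,http://192.168.1.50:5173
```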
Emergency Procedures¶
System Won't Start At All¶
Symptoms: All services fail to start, immediate crashes
Steps:
- Check Docker is running: docker info
- Check for conflicting containers: docker ps -a
- Remove stuck containers: docker compose -f docker-compose.prod.yml down -v
- Check environment file: cat .env
- Rebuild if needed: docker compose -f docker-compose.prod.yml build --no-cache
- Start with verbose logs: docker compose -f docker-compose.prod.yml up
Database Corruption Suspected¶
Symptoms: SQL errors, missing data, inconsistent state
Steps:
- Stop all services: docker compose -f docker-compose.prod.yml down
- Create backup: pg_dump -h localhost -U security security > backup_emergency.sql
- Check for issues: psql -h localhost -U security -d security -c "\dt"
- If corrupted, restore from backup
- As a last resort, recreate the database
Complete Data Loss Recovery¶
Symptoms: Database empty, no events, no history
Steps:
- Check if data exists but the service cannot connect
- Check Docker volumes: docker volume ls
- Restore from backup if available
- If no backup, the system will start fresh when cameras resume uploading
Security Breach Suspected¶
Symptoms: Unauthorized access, unknown API activity, modified settings
Steps:
- Immediately stop external access
- Check API key usage if enabled:
- Review logs for suspicious activity:
- Rotate API keys if used
- Review CORS origins and network exposure
- Consider enabling API_KEY_ENABLED=true if not already
Information to Gather for Bug Reports¶
If you need to report an issue, collect this information:
# 1. System health summary
curl -s http://localhost:8000/api/system/health | jq . > health.json
# 2. System configuration (REDACT PASSWORDS)
curl -s http://localhost:8000/api/system/config | jq . > config.json
# 3. Recent backend logs
docker compose -f docker-compose.prod.yml logs --tail=200 backend > backend.log
# 4. GPU information
nvidia-smi > gpu.txt
# 5. Container status
docker compose -f docker-compose.prod.yml ps -a > containers.txt
# 6. Queue status
curl -s http://localhost:8000/api/system/telemetry | jq . > telemetry.json
# 7. Environment (REDACT SENSITIVE VALUES)
grep -vE 'PASSWORD|SECRET|KEY' .env > env_safe.txt
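The files above can be bundled into one archive to attach to the issue. A sketch, assuming the commands were run in the current directory:

```shell
# Collect the gathered files into a timestamped tarball.
out="bugreport_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$out"
for f in health.json config.json backend.log gpu.txt containers.txt \
         telemetry.json env_safe.txt; do
  [ -f "$f" ] && cp "$f" "$out/"
done
tar czf "$out.tar.gz" "$out"
echo "attach $out.tar.gz to your issue"
```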
Troubleshooting Detailed Guides¶
- GPU Issues - CUDA, VRAM, temperature, container GPU access
- Triton Rootless CUDA - ai-gateway cudaGetDeviceCount err=3 in rootless Podman
- Connection Issues - Network, containers, WebSocket, CORS
- AI Issues - YOLO26, Nemotron, pipeline, batch processing
- Database Issues - PostgreSQL connection, migrations, disk space
Getting Help¶
If you can't resolve an issue:
- Check this index first - Most common problems are covered
- Review specific troubleshooting pages - Detailed solutions for each area
- Search GitHub Issues - Someone may have solved it
- Open a new issue with:
- Clear description of the problem
- Steps to reproduce
- Information gathered (see above)
- What you've already tried
See Also¶
- AI Troubleshooting (Operator) - Quick AI fixes
- GPU Setup - GPU configuration guide
- Environment Variable Reference - Configuration options
- Glossary - Terms and definitions