
AI Services Troubleshooting

Diagnose and fix common AI service issues.

Time to read: ~10 min
Prerequisites: AI Services Management


Quick Diagnostics

Run these commands first to identify the issue:

# Check if services are running
./scripts/start-ai.sh status

# Test service health endpoints
curl http://localhost:8095/health   # YOLO26
curl http://localhost:8091/health   # Nemotron
curl http://localhost:8092/health   # Florence-2 (optional)
curl http://localhost:8093/health   # CLIP (optional)
curl http://localhost:8094/health   # Enrichment (optional)

# Check GPU availability
nvidia-smi

# View recent logs
tail -100 /tmp/yolo26-detector.log
tail -100 /tmp/nemotron-llm.log

YOLO26 Issues

Service Won't Start

Check logs:

tail -f /tmp/yolo26-detector.log

CUDA out of memory:

RuntimeError: CUDA out of memory

Solution:

  1. Close other GPU applications
  2. Check VRAM usage: nvidia-smi (a fuller query is shown below)
  3. Restart services: ./scripts/start-ai.sh restart
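To see how much headroom remains after closing other applications, a more detailed memory check can help. This is a minimal sketch using standard nvidia-smi query fields, not anything project-specific:

# Per-GPU memory summary (used vs. total)
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv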

Python dependency not found (YOLO26):

ModuleNotFoundError: No module named 'transformers'

Solution:

cd "$PROJECT_ROOT"
uv sync --extra dev
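To confirm the missing module is now importable in the project environment (assuming the project is managed with uv, as the sync command above implies):

# Verify the module imports cleanly inside the uv-managed environment
uv run python -c "import transformers; print(transformers.__version__)"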

Model file not found:

ImportError / ModuleNotFoundError in `ai/yolo26/model.py`

Solution: The model auto-downloads on first use. Wait for the download to complete (check the logs).


Nemotron LLM Issues

Service Won't Start

Check logs:

tail -f /tmp/nemotron-llm.log

llama-server not found:

command not found: llama-server

Solution: Install llama.cpp. See AI Installation.
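After installing, a quick sanity check confirms the binary is on PATH (the same --version call is used in the Getting Help section below):

# Confirm llama-server is installed and on PATH
command -v llama-server || echo "llama-server not on PATH"
llama-server --version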

Model file not found:

error: failed to load model

Solution:

./ai/download_models.sh
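Once the download finishes, it can help to confirm the GGUF file actually landed where ai/start_llm.sh expects it. The models/ directory below is an assumption; adjust to whatever path your download script uses:

# List downloaded model weights (directory is an assumption; adjust to your layout)
ls -lh ai/models/*.gguf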

CUDA initialization failed:

ggml_init_cublas: failed to initialize CUDA

Solution:

nvidia-smi
sudo systemctl restart nvidia-persistenced
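If nvidia-smi itself errors out, checking the kernel module and driver version can narrow things down. These are standard Linux/NVIDIA commands, not project-specific:

# Verify the NVIDIA kernel module is loaded and report the driver version
lsmod | grep -i nvidia
nvidia-smi --query-gpu=driver_version --format=csv,noheader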

Port already in use:

error: bind: Address already in use

Solution:

lsof -ti:8091 | xargs kill -9
./scripts/start-ai.sh restart
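Before force-killing, it is worth identifying what is actually holding the port, since it is sometimes a stale instance of the same service. lsof and ss are standard tools:

# Show the process currently bound to port 8091
lsof -i :8091
# Alternative if lsof is unavailable
ss -ltnp | grep ':8091'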

Service Unhealthy

Symptoms

Service running but health check fails.

Diagnosis

# Check if service is responding
curl -v http://localhost:8095/health
curl -v http://localhost:8091/health

# Check process status
./scripts/start-ai.sh status

# Monitor GPU
nvidia-smi -l 1
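If individual curls are inconclusive, a minimal sketch that probes every health endpoint in one pass (ports taken from the Quick Diagnostics section; curl flags are standard):

# Probe all AI service health endpoints and report status
for port in 8095 8091 8092 8093 8094; do
  if curl -sf "http://localhost:${port}/health" > /dev/null; then
    echo "port ${port}: healthy"
  else
    echo "port ${port}: unreachable or unhealthy"
  fi
done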

Solutions

  1. Restart services: ./scripts/start-ai.sh restart
  2. Check logs for errors
  3. Verify CUDA: python3 -c "import torch; print(torch.cuda.is_available())"

Slow Inference

Symptoms

  • Detection: 200-500ms instead of 30-50ms
  • LLM: 30-60 seconds instead of 2-5 seconds

Check GPU Utilization

nvidia-smi -l 1

Common Causes

CPU fallback (GPU not being used):

# Verify CUDA is being used
python3 -c "import torch; print(torch.cuda.is_available())"

Solution: Verify CUDA installation, restart services.
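If torch.cuda.is_available() returns True but inference is still slow, it can help to confirm which device PyTorch actually sees and which CUDA version it was built against (plain PyTorch calls, nothing project-specific):

# Report the detected GPU and the CUDA version PyTorch was built against
python3 -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.version.cuda)"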

Thermal throttling:

# Check GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv

If the temperature exceeds 85°C, improve cooling or reduce the load.
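To watch temperature over time rather than taking a single sample, a simple polling query works (standard nvidia-smi fields; the 5-second interval is arbitrary):

# Sample GPU temperature, utilization, and power draw every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw --format=csv -l 5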

Concurrent load:

Other processes are using the GPU. Close any unnecessary GPU applications; the query below lists them.
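A standard nvidia-smi query lists the processes currently holding GPU memory:

# List processes using the GPU and their memory footprint
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv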


Out of Memory (OOM)

Symptoms

Services crash with OOM errors.

Check VRAM Usage

nvidia-smi

Solutions

1. Free VRAM:

# Stop other GPU processes
fuser -k /dev/nvidia*

# Restart AI services
./scripts/start-ai.sh restart

2. Use smaller model:

Download the Q4_K_S quantization instead of Q4_K_M (saves ~500MB), then edit ai/start_llm.sh to point at the smaller model.
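To locate the line to change, a simple grep works if the quantization name appears literally in the script (an assumption about how ai/start_llm.sh references the model file):

# Find where the model file / quantization is referenced (assumes a literal match)
grep -n "Q4_K_M" ai/start_llm.sh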

3. Reduce batch sizes:

Edit the backend configuration in backend/core/config.py and reduce batch_window_seconds or the number of concurrent requests.
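Similarly, to locate the relevant setting in the backend config (setting name taken from the text above):

# Locate the batching setting in the backend configuration
grep -n "batch_window_seconds" backend/core/config.py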


Connection Issues

Backend Can't Reach AI Services

Symptoms:

httpx.ConnectError: [Errno 111] Connection refused

Check:

  1. Are AI services running? ./scripts/start-ai.sh status
  2. Is the URL correct in .env?
  3. Is there a firewall blocking the ports?

Docker/Podman networking:

The correct URL depends on your deployment mode (production compose DNS, host-run AI, or a backend container talking to host-run AI). Start with the checks below.

Container Can't Reach the Host

# Docker - verify host.docker.internal resolves
docker exec <container> getent hosts host.docker.internal

# Podman - verify host.containers.internal resolves
podman exec <container> getent hosts host.containers.internal

If the name does not resolve, use the host IP directly (one way to find it is sketched below).
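One way to find a usable host IP from inside the container is the default gateway, which on the default bridge network is typically the host. This is a hedged sketch: it assumes iproute2 is available in the container image, and custom networks may differ:

# From inside the container: the default gateway is usually the host on bridge networks
docker exec <container> ip route | awk '/default/ {print $3}'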


Log Analysis

YOLO26 Common Log Messages

Message                   | Meaning           | Action
--------------------------|-------------------|-------------------
Model loaded successfully | Service ready     | None
CUDA out of memory        | Not enough VRAM   | Free VRAM
Connection refused        | Port conflict     | Check port usage
Failed to load image      | Invalid image     | Check image format

Nemotron Common Log Messages

Message                      | Meaning            | Action
-----------------------------|--------------------|-----------------------
Model loaded                 | Service ready      | None
ggml_cuda_init: failed       | CUDA not available | Check NVIDIA drivers
bind: Address already in use | Port conflict      | Kill existing process
context length exceeded      | Prompt too long    | Reduce context size

Getting Help

When reporting issues, collect:

# System info
nvidia-smi
python3 --version
llama-server --version

# Service status
./scripts/start-ai.sh status

# Health checks
curl http://localhost:8095/health 2>&1
curl http://localhost:8091/health 2>&1

# Recent logs
tail -100 /tmp/yolo26-detector.log
tail -100 /tmp/nemotron-llm.log
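A minimal sketch that bundles all of the above into a single file to attach to a report (the output filename is arbitrary):

# Collect diagnostics into one file for bug reports
{
  nvidia-smi
  python3 --version
  llama-server --version
  ./scripts/start-ai.sh status
  curl -s http://localhost:8095/health
  curl -s http://localhost:8091/health
  tail -100 /tmp/yolo26-detector.log
  tail -100 /tmp/nemotron-llm.log
} > ai-diagnostics.txt 2>&1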
