
AI Services Performance Tuning

Optimize AI inference for throughput and latency.

Time to read: ~6 min
Prerequisites: AI Services Management


Performance Baselines

Expected performance on NVIDIA RTX A5500 (24GB):

Service   Metric            Target     Acceptable
YOLO26    Latency (single)  30-50ms    < 100ms
YOLO26    Throughput        20-30 fps  ≥ 10 fps
Nemotron  Latency           2-5s       < 10s
Nemotron  Tokens/sec        30-50      ≥ 15
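
A quick way to sanity-check these numbers is to time a few detection requests from the command line. This is only a sketch: it assumes the detection service exposes a single-image /detect endpoint on port 8095 (only /detect/batch appears later in this guide) and accepts an "image" form field, so adjust both to match your deployment.

# Time five single-image detection requests against a sample image.
# NOTE: the /detect path and the "image" form field are assumptions --
# check your detection service's API and adjust accordingly.
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
    -X POST http://localhost:8095/detect \
    -F "image=@sample.jpg"
done

Compare the reported times against the latency targets above while watching GPU utilization.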

GPU Monitoring

Real-time Monitoring

# Basic monitoring (1 second refresh)
nvidia-smi -l 1

# Detailed process view
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv

# Temperature and power
nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv -l 1
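
To capture utilization over a load test instead of watching it live, append the same query output to a CSV file for later review:

# Log utilization, VRAM, temperature and power once per second to a CSV file
# (stop with Ctrl+C when the load test finishes)
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu,power.draw \
  --format=csv -l 1 >> gpu_load_test.csv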

Dashboard Metrics

The system broadcasts GPU stats via WebSocket at /ws/system. Check the dashboard for:

  • GPU utilization %
  • VRAM usage
  • Temperature
  • Inference FPS
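
The same stream can be consumed outside the dashboard by subscribing to the WebSocket directly. A minimal sketch, assuming websocat is installed and the backend is reachable on port 8000 as in the other examples in this guide:

# Subscribe to the GPU stats broadcast (Ctrl+C to stop)
# Requires websocat: https://github.com/vi/websocat
websocat ws://localhost:8000/ws/system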

Historical Metrics

# Query GPU stats from API
curl "http://localhost:8000/api/system/gpu/history?since=2025-12-30T09:45:00Z&limit=300"

YOLO26 Tuning

Confidence Threshold

Higher thresholds reduce false positives but may miss detections.

# In .env
DETECTION_CONFIDENCE_THRESHOLD=0.5  # Default

# Conservative (fewer false positives)
DETECTION_CONFIDENCE_THRESHOLD=0.7

# Aggressive (catch more objects)
DETECTION_CONFIDENCE_THRESHOLD=0.3

Batch Processing

For high-throughput scenarios (multiple cameras):

# Send multiple images per request
curl -X POST http://localhost:8095/detect/batch \
  -F "images=@image1.jpg" \
  -F "images=@image2.jpg" \
  -F "images=@image3.jpg"

Hardware-Specific Tuning

GPU Series  Recommendation
RTX 30xx    Default settings work well
RTX 40xx    Can increase batch size
A-series    Enable FP16 for lower VRAM GPUs

Nemotron LLM Tuning

GPU Layers Configuration

Controls how many model layers run on GPU vs CPU. More GPU layers = faster inference but more VRAM.

Configuration locations and their defaults:

Location                 Default  Model     Rationale
docker-compose.prod.yml  35       Nano 30B  Conservative for 16GB GPUs
ai/nemotron/Dockerfile   35       Nano 30B  Matches compose default
ai/start_llm.sh          99       Mini 4B   All layers on GPU (small model)

Recommended settings by GPU VRAM:

VRAM   GPU_LAYERS  Notes
8GB    20-25       Partial offload, slower inference
12GB   30-35       Balanced performance
16GB   40-45       Good performance
24GB+  50-53       Maximum layers (Nano 30B has ~53 layers)

Setting GPU_LAYERS:

# In .env or shell
export GPU_LAYERS=45

# Or override in docker-compose command
GPU_LAYERS=45 docker compose -f docker-compose.prod.yml up ai-llm -d

# For host-run development (Mini 4B), use all layers
# ai/start_llm.sh already uses --n-gpu-layers 99
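
As a convenience, the VRAM table above can be turned into a small helper that reads total VRAM from nvidia-smi and prints a starting value. This is just a sketch of the table, not a tool shipped with the project:

# Suggest a GPU_LAYERS starting point from total VRAM (in MiB),
# following the recommendation table above
vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
if   [ "$vram_mib" -ge 24000 ]; then echo "GPU_LAYERS=53"
elif [ "$vram_mib" -ge 16000 ]; then echo "GPU_LAYERS=45"
elif [ "$vram_mib" -ge 12000 ]; then echo "GPU_LAYERS=35"
else echo "GPU_LAYERS=25"
fi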

Context Size Configuration

Controls the maximum context window. Larger contexts allow more batch data but use significantly more VRAM.

Configuration locations and their defaults:

Location                 Default  Rationale
docker-compose.prod.yml  131072   Production batch processing with multiple detections
ai/nemotron/Dockerfile   131072   Matches the production compose default
ai/start_llm.sh          4096     Development use; a smaller context is sufficient

VRAM impact of context size (approximate for Nano 30B):

CTX_SIZE  Additional VRAM  Use Case
2048      ~500MB           Minimal, single detection analysis
4096      ~1GB             Development, testing
8192      ~2GB             Small batches
32768     ~4GB             Medium batches
131072    ~8-12GB          Production batch processing

Setting CTX_SIZE:

# In .env or shell
export CTX_SIZE=8192

# Or override in docker-compose command
CTX_SIZE=8192 docker compose -f docker-compose.prod.yml up ai-llm -d

# For memory-constrained systems
CTX_SIZE=4096 GPU_LAYERS=25 docker compose -f docker-compose.prod.yml up ai-llm -d

Trade-offs:

  • Large context (131072): Can analyze many detections in a single batch, better contextual reasoning, higher VRAM
  • Small context (4096): Faster startup, lower VRAM, may need to split large batches
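
After changing CTX_SIZE, confirm what the server actually loaded and how much VRAM it consumed. A minimal check, assuming the compose service is named ai-llm as in the commands above and that the llama.cpp-based server logs its context size (typically as n_ctx):

# Confirm the context size the LLM server started with
docker compose -f docker-compose.prod.yml logs ai-llm | grep -i "n_ctx"

# Check VRAM actually in use once the model has loaded
nvidia-smi --query-gpu=memory.used,memory.total --format=csv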

Parallelism

Handle multiple concurrent requests:

# Default: 2 parallel requests
--parallel 2

# High-throughput (more VRAM)
--parallel 4

# Single request (lowest VRAM)
--parallel 1
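
To see whether a higher --parallel value helps your workload, fire a few requests concurrently and compare the wall-clock time against sequential calls. A rough sketch; LLM_URL is a placeholder for your llama.cpp server's completion endpoint, and the request body follows the standard llama.cpp /completion API:

# Fire 4 concurrent completion requests and wait for all of them
LLM_URL="${LLM_URL:?set this to your llama.cpp /completion endpoint}"
time (
  for i in 1 2 3 4; do
    curl -s -o /dev/null -X POST "$LLM_URL" \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Summarize: a person was detected at the front door.", "n_predict": 64}' &
  done
  wait
)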

Continuous Batching

Improves throughput for concurrent requests:

--cont-batching

This flag is enabled by default in ai/start_llm.sh.


System-Wide Tuning

Backend Worker Count

In docker-compose.prod.yml, adjust uvicorn workers:

command: ['uvicorn', 'backend.main:app', '--workers', '4']
  • 2 workers: Low memory usage
  • 4 workers: Balanced (default)
  • 8 workers: High concurrency (CPU-bound)

Detection Queue Size

In backend/core/config.py:

DETECTION_QUEUE_MAX_SIZE = 10000  # Default

Increase for high camera counts, decrease for memory efficiency.

Batch Timing

Trade-off between latency and context quality:

# .env

# Fast response (less context)
BATCH_WINDOW_SECONDS=30
BATCH_IDLE_TIMEOUT_SECONDS=10

# Better context (slower response)
BATCH_WINDOW_SECONDS=120
BATCH_IDLE_TIMEOUT_SECONDS=45

# Default
BATCH_WINDOW_SECONDS=90
BATCH_IDLE_TIMEOUT_SECONDS=30

Monitoring Performance

Inference Latency

Enable detailed timing in logs by setting the detector client's logger to DEBUG, for example:

# backend/services/detector_client.py
logging.getLogger(__name__).setLevel(logging.DEBUG)

Prometheus Metrics

If monitoring stack enabled:

curl http://localhost:8000/api/metrics | grep ai_

Available metrics:

  • ai_detection_latency_seconds
  • ai_analysis_latency_seconds
  • ai_detection_count
  • ai_error_count

Performance Checklist

Before production deployment:

  • [ ] GPU utilization under load < 90%
  • [ ] VRAM usage under load < 80% of total
  • [ ] Detection latency < 100ms
  • [ ] LLM latency < 10s
  • [ ] GPU temperature < 80C
  • [ ] No OOM errors in logs
  • [ ] Backend response time < 500ms
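
Several of these checks can be scripted. A rough sketch using commands already shown in this guide; the thresholds mirror the checklist, and the backend URL follows the earlier examples:

# Quick pre-deployment spot checks (run while the system is under load)
read -r util temp vram_used vram_total < <(
  nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | head -n1 | tr -d ','
)
echo "GPU utilization: ${util}%  (target < 90%)"
echo "GPU temperature: ${temp}C  (target < 80C)"
echo "VRAM: ${vram_used}/${vram_total} MiB  (target < 80% of total)"

# Backend response time (target < 500ms)
curl -s -o /dev/null -w "backend response: %{time_total}s\n" http://localhost:8000/api/metrics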
