Multi-GPU Support¶
This guide covers configuring and managing multi-GPU support for AI services in Home Security Intelligence.
Overview¶
Multi-GPU support allows you to distribute AI workloads across multiple GPUs, providing:
- Improved Performance - Run models concurrently to reduce latency
- Better Capacity - Utilize all available VRAM across GPUs
- Model Isolation - Keep the LLM stable by separating it from smaller models
- Future-Proofing - Prepare for larger models as the system evolves
Hardware Requirements¶
Minimum Requirements¶
| Component | Requirement |
|---|---|
| GPU | NVIDIA with CUDA support |
| VRAM (single-GPU setup) | 24 GB minimum to host all AI services on one GPU |
| Container Toolkit | NVIDIA Container Toolkit installed |
Reference Hardware¶
The system is optimized for configurations like:
| GPU | Model | VRAM | Best For |
|---|---|---|---|
| GPU 0 | RTX A5500 | 24 GB | Large models (LLM - Nemotron 30B) |
| GPU 1 | RTX A400 | 4 GB | Small/medium models (Enrichment) |
VRAM Requirements by Service¶
| Service | Model | VRAM Estimate | Default GPU |
|---|---|---|---|
| ai-llm | Nemotron-3-Nano-30B (Q4_K_M) | ~14-18 GB | 0 |
| ai-yolo26 | YOLO26 TensorRT FP16 | ~5 MB | 1 |
| ai-florence | Florence-2-Large | ~1.5 GB | 1 |
| ai-clip | CLIP ViT-L TensorRT | ~0.8 GB | 1 |
| ai-enrichment-light | Pose/Threat/Reid/Pet/Depth | ~1.2 GB | 1 |
| ai-enrichment | Vehicle/Fashion/Demographics | ~4.3 GB budget | 1 |
Note (NEM-5369): VRAM estimates for ai-llm depend on context size and flash attention settings:
- 131K context: ~21.7 GB
- 65K context: ~18 GB
- 32K context with flash attention: ~14-15 GB (default configuration)
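The per-GPU totals from the table and note above can be sanity-checked before saving an assignment. A minimal sketch using the documented estimates (the 14.5 GB LLM figure assumes the default 32K-context configuration; function and variable names are illustrative):

```python
# VRAM fit check against the documented estimates; real usage varies
# with model quantization, context size, and runtime overhead.
GPU_CAPACITY_GB = {0: 24.0, 1: 4.0}  # RTX A5500, RTX A400

SERVICE_VRAM_GB = {
    "ai-llm": 14.5,               # 32K context with flash attention (default)
    "ai-florence": 1.5,
    "ai-clip": 0.8,
    "ai-enrichment-light": 1.2,
}

ASSIGNMENTS = {"ai-llm": 0, "ai-florence": 1, "ai-clip": 1, "ai-enrichment-light": 1}

def check_fit(assignments, vram, capacity):
    """Return {gpu: (total_gb_used, fits_within_capacity)} per GPU."""
    totals = {gpu: 0.0 for gpu in capacity}
    for service, gpu in assignments.items():
        totals[gpu] += vram[service]
    return {gpu: (used, used <= capacity[gpu]) for gpu, used in totals.items()}

print(check_fit(ASSIGNMENTS, SERVICE_VRAM_GB, GPU_CAPACITY_GB))
```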
Accessing GPU Settings¶
The GPU Configuration page is accessible via the Settings menu:
- Navigate to Settings in the main navigation
- Select the GPU Configuration tab
- The page displays detected GPUs and current assignments
Understanding GPU Cards¶
Each detected GPU is displayed with real-time utilization:
```
GPU 0: RTX A5500  24 GB  [##########------]  19.3/24 GB used
GPU 1: RTX A400    4 GB  [##--------------]   0.3/4 GB used
```
Information shown per GPU:
- Index - Zero-based GPU identifier
- Name - GPU model name
- Total VRAM - Total video memory capacity
- Used VRAM - Current memory utilization
- Compute Capability - CUDA compute capability version
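A text utilization bar like the one on each GPU card can be rendered with a small helper. A sketch for illustration (the 16-character bar width is an assumption, not the UI's actual rendering code):

```python
def render_gpu_bar(used_gb: float, total_gb: float, width: int = 16) -> str:
    """Render a text utilization bar like the GPU cards shown above."""
    filled = round(width * used_gb / total_gb)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {used_gb:g}/{total_gb:g} GB used"

print(render_gpu_bar(19.3, 24))  # GPU 0
print(render_gpu_bar(0.3, 4))    # GPU 1
```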
Assignment Strategies¶
Select a strategy based on your priorities:
Manual¶
- Description: You control each service-to-GPU assignment explicitly
- Best For: Fine-tuned control, specific hardware configurations
- Algorithm: No automatic assignment; user sets each service manually
VRAM-Based (Recommended)¶
- Description: Largest models assigned to GPU with most available VRAM
- Best For: Maximizing VRAM utilization, general use
- Algorithm: Sort models by VRAM requirement descending, assign to GPU with most free space
Latency-Optimized¶
- Description: Critical path models on fastest GPU
- Best For: Minimizing detection-to-analysis latency
- Algorithm: Assigns ai-yolo26 and ai-llm to GPU 0 (typically fastest), distributes others
Isolation-First¶
- Description: LLM gets dedicated GPU, all other services share remaining GPUs
- Best For: Preventing LLM memory pressure from affecting other models
- Algorithm: ai-llm alone on largest GPU, everything else on remaining GPU(s)
Balanced¶
- Description: Distribute VRAM evenly across all GPUs
- Best For: Multi-GPU systems where you want even utilization
- Algorithm: Bin-packing to minimize VRAM variance across GPUs
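The VRAM-Based strategy described above can be sketched as a greedy assignment: sort models by requirement descending, then place each on the GPU with the most free VRAM. This is an illustrative sketch, not the service's actual implementation (the LLM figure below uses the 131K-context estimate):

```python
def vram_based_assign(services: dict[str, float], gpus: dict[int, float]) -> dict[str, int]:
    """Greedy VRAM-based assignment: largest models first,
    each placed on the GPU with the most remaining free VRAM."""
    free = dict(gpus)  # GPU index -> free VRAM in GB
    assignments = {}
    for name, need in sorted(services.items(), key=lambda kv: kv[1], reverse=True):
        gpu = max(free, key=free.get)  # GPU with most free VRAM
        assignments[name] = gpu
        free[gpu] -= need
    return assignments

print(vram_based_assign(
    {"ai-llm": 21.7, "ai-florence": 1.5, "ai-clip": 0.8, "ai-enrichment-light": 1.2},
    {0: 24.0, 1: 4.0},
))
# {'ai-llm': 0, 'ai-florence': 1, 'ai-enrichment-light': 1, 'ai-clip': 0}
```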
Manual Assignment¶
When using Manual strategy or overriding automatic assignments:
- Select Manual from the strategy dropdown (or keep the current strategy)
- For each service in the assignment table:
  - Select the target GPU from the dropdown
  - Optionally adjust the VRAM budget for ai-enrichment
- Review any warnings about VRAM capacity
- Click Save to persist changes
VRAM Budget Override¶
The ai-enrichment service supports a VRAM budget override:
- Default budget: 6.8 GB
- Adjust when assigning to a smaller GPU
- The system will auto-suggest appropriate budgets
Example: Assigning ai-enrichment to a 4 GB GPU will suggest a 3.5 GB budget.
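The exact heuristic the auto-suggestion uses is not specified here; a sketch that reproduces the documented example, assuming a fixed 0.5 GB headroom reserved for CUDA context and runtime overhead:

```python
DEFAULT_BUDGET_GB = 6.8
HEADROOM_GB = 0.5  # assumed reserve for CUDA context/runtime overhead

def suggest_vram_budget(gpu_total_gb: float) -> float:
    """Suggest an ai-enrichment VRAM budget for a GPU of the given size.
    Illustrative heuristic only; the service's own rule may differ."""
    return min(DEFAULT_BUDGET_GB, gpu_total_gb - HEADROOM_GB)

print(suggest_vram_budget(4.0))   # 3.5, matching the documented example
print(suggest_vram_budget(24.0))  # capped at the 6.8 GB default
```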
Applying Changes and Restart Flow¶
Changes to GPU assignments require container restarts:
- Save Configuration - Saves to database and generates YAML files
- Click "Apply & Restart Services" - Triggers container recreation
- Monitor Status - UI shows restart progress and health status
- Verify Health - All services should return to "running" status
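The monitoring step can also be scripted against the status endpoint listed in the API Reference. A hedged sketch: the response shape (a `status` field reaching `"complete"`) is an assumption, and `fetch_status` is any callable returning the parsed JSON body of `GET /api/system/gpu-config/status`:

```python
import time

def wait_for_restart(fetch_status, timeout_s: float = 300, interval_s: float = 5):
    """Poll the GPU-config status endpoint until the restart completes.
    The 'status'/'complete' field names are assumptions for illustration."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = fetch_status()
        if body.get("status") == "complete":
            return body
        time.sleep(interval_s)
    raise TimeoutError("services did not return to a healthy state in time")
```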
Generated Files¶
The system generates two configuration files:
| File | Purpose |
|---|---|
| `config/docker-compose.gpu-override.yml` | Docker Compose override for container orchestration |
| `config/gpu-assignments.yml` | Human-readable reference file |
Example Override File¶
```yaml
# Auto-generated by GPU Config Service - DO NOT EDIT MANUALLY
services:
  ai-llm:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - '0'
              capabilities:
                - gpu
  ai-enrichment:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - '1'
              capabilities:
                - gpu
    environment:
      - VRAM_BUDGET_GB=3.5
```
Troubleshooting¶
No GPUs Detected¶
Symptoms: UI shows "No GPUs detected" message
Solutions:
- Verify NVIDIA drivers are installed: `nvidia-smi`
- Check the NVIDIA Container Toolkit: `docker run --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Ensure containers have GPU access in `docker-compose.prod.yml`
- Click "Re-scan GPUs" to trigger detection
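If you need to script the driver check, `nvidia-smi` can emit machine-readable CSV via its standard `--query-gpu`/`--format` options. A sketch (the field parsing assumes GPU names contain no commas):

```python
import subprocess

def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line of: index, name, memory.total (MiB), memory.used (MiB)."""
    index, name, total_mib, used_mib = [f.strip() for f in line.split(",")]
    return {"index": int(index), "name": name,
            "total_gb": int(total_mib) / 1024, "used_gb": int(used_mib) / 1024}

def detect_gpus() -> list[dict]:
    """List GPUs via nvidia-smi; returns [] if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [parse_gpu_line(line) for line in out.strip().splitlines() if line]

# Sample line as nvidia-smi would print it (values illustrative):
print(parse_gpu_line("0, NVIDIA RTX A5500, 24564, 19766"))
```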
Service Fails to Start After Assignment Change¶
Symptoms: Service remains in "starting" or "error" state
Solutions:
- Check container logs: `podman logs ai-llm`
- Verify the GPU is available: the assigned GPU may be in use by another process
- Check VRAM capacity: The model may exceed available VRAM
- Review warnings shown during configuration
VRAM Budget Exceeded Warning¶
Symptoms: Warning about VRAM budget exceeding GPU capacity
Solutions:
- Accept the auto-adjusted budget suggestion
- Manually set a lower VRAM budget
- Assign the service to a larger GPU
- Note: Some models in Model Zoo may not load with reduced budget
Fallback Behavior¶
If an assigned GPU becomes unavailable at runtime:
- Services automatically fall back to any available GPU
- A warning is logged
- Performance may be affected
- Reassign services via UI once GPU is restored
FAQ¶
Can I assign multiple services to the same GPU?¶
Yes. This is the default configuration for single-GPU systems. Services share VRAM, so ensure total requirements don't exceed GPU capacity.
What happens during a container restart?¶
- The affected container is stopped
- The override file is applied
- The container is recreated with new GPU assignment
- Models are reloaded on the new GPU
- Health checks verify the service is operational
How do I revert to default configuration?¶
- Select "VRAM-Based" strategy
- Click "Preview Changes" to see proposed assignments
- Click "Apply" to restore recommended configuration
- Alternatively, delete `config/docker-compose.gpu-override.yml` and restart all services
Can I use AMD GPUs?¶
Currently, only NVIDIA GPUs with CUDA support are supported. AMD GPU support (ROCm) is a future consideration.
How often should I reconfigure GPU assignments?¶
- After adding/removing GPUs
- When changing AI models
- If you notice VRAM pressure or performance issues
- After system updates that affect GPU drivers
API Reference¶
For programmatic access, see the GPU Configuration API endpoints:
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/system/gpus | List detected GPUs with utilization |
| GET | /api/system/gpu-config | Get current assignments and strategies |
| PUT | /api/system/gpu-config | Update assignments (saves to DB/YAML) |
| POST | /api/system/gpu-config/apply | Apply config and restart services |
| GET | /api/system/gpu-config/status | Get restart progress and health |
| POST | /api/system/gpu-config/detect | Re-scan for GPUs |
| GET | /api/system/gpu-config/preview | Preview auto-assignment for strategy |
Related Documentation¶
- Design Document - Technical design and implementation details
- AI Pipeline - AI service architecture and VRAM requirements
- Container Orchestration - Container startup and health checks