Multi-GPU Support¶
This guide covers configuring and managing multi-GPU support for AI services in Home Security Intelligence.
Overview¶
Multi-GPU support allows you to distribute AI workloads across multiple GPUs, providing:
- Improved Performance - Run models concurrently to reduce latency
- Better Capacity - Utilize all available VRAM across GPUs
- Model Isolation - Keep the LLM stable by separating it from smaller models
- Future-Proofing - Prepare for larger models as the system evolves
Hardware Requirements¶
Minimum Requirements¶
| Component | Requirement |
|---|---|
| GPU | NVIDIA with CUDA support |
| VRAM (single-GPU setup) | 24 GB minimum to host all AI services on one GPU |
| Container Toolkit | NVIDIA Container Toolkit installed |
Reference Hardware¶
The system is optimized for configurations like:
| GPU | Model | VRAM | Best For |
|---|---|---|---|
| GPU 0 | RTX A5500 | 24 GB | Large models (LLM - Nemotron 30B) |
| GPU 1 | RTX A400 | 4 GB | Small/medium models (Enrichment) |
VRAM Requirements by Service¶
| Service | Model | VRAM Estimate | Default GPU |
|---|---|---|---|
| ai-llm | Nemotron-3-Nano-30B (Q4_K_M) | ~14-18 GB | 0 |
| ai-yolo26 | YOLO26 TensorRT FP16 | ~5 MB | 1 |
| ai-florence | Florence-2-Large | ~1.5 GB | 1 |
| ai-clip | CLIP ViT-L TensorRT | ~0.8 GB | 1 |
| ai-enrichment-light | Pose/Threat/Reid/Pet/Depth | ~1.2 GB | 1 |
| ai-enrichment | Vehicle/Fashion/Demographics | ~4.3 GB budget | 1 |
Note (NEM-5369): VRAM estimates for ai-llm depend on context size and flash attention settings:
- 131K context: ~21.7 GB
- 65K context: ~18 GB
- 32K context with flash attention: ~14-15 GB (default configuration)
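The per-GPU totals from the table and note above can be sanity-checked before saving an assignment. A minimal sketch using the documented estimates (the 14.5 GB LLM figure assumes the default 32K-context configuration; function and variable names are illustrative):

```python
# VRAM fit check against the documented estimates; real usage varies
# with model quantization, context size, and runtime overhead.
GPU_CAPACITY_GB = {0: 24.0, 1: 4.0}  # RTX A5500, RTX A400

SERVICE_VRAM_GB = {
    "ai-llm": 14.5,               # 32K context with flash attention (default)
    "ai-florence": 1.5,
    "ai-clip": 0.8,
    "ai-enrichment-light": 1.2,
}

ASSIGNMENTS = {"ai-llm": 0, "ai-florence": 1, "ai-clip": 1, "ai-enrichment-light": 1}

def check_fit(assignments, vram, capacity):
    """Return {gpu: (total_gb_used, fits_within_capacity)} per GPU."""
    totals = {gpu: 0.0 for gpu in capacity}
    for service, gpu in assignments.items():
        totals[gpu] += vram[service]
    return {gpu: (used, used <= capacity[gpu]) for gpu, used in totals.items()}

print(check_fit(ASSIGNMENTS, SERVICE_VRAM_GB, GPU_CAPACITY_GB))
```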
Accessing GPU Settings¶
The GPU Configuration page is accessible via the Settings menu:
- Navigate to Settings in the main navigation
- Select the GPU Configuration tab
- The page displays detected GPUs and current assignments
Understanding GPU Cards¶
Each detected GPU is displayed with real-time utilization:
```
GPU 0: RTX A5500  24 GB  [##########------]  19.3/24 GB used
GPU 1: RTX A400    4 GB  [##--------------]   0.3/4 GB used
```
Information shown per GPU:
- Index - Zero-based GPU identifier
- Name - GPU model name
- Total VRAM - Total video memory capacity
- Used VRAM - Current memory utilization
- Compute Capability - CUDA compute capability version
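A text utilization bar like the one on each GPU card can be rendered with a small helper. A sketch for illustration (the 16-character bar width is an assumption, not the UI's actual rendering code):

```python
def render_gpu_bar(used_gb: float, total_gb: float, width: int = 16) -> str:
    """Render a text utilization bar like the GPU cards shown above."""
    filled = round(width * used_gb / total_gb)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {used_gb:g}/{total_gb:g} GB used"

print(render_gpu_bar(19.3, 24))  # GPU 0
print(render_gpu_bar(0.3, 4))    # GPU 1
```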
Assignment Strategies¶
Select a strategy based on your priorities:
Manual¶
- Description: You control each service-to-GPU assignment explicitly
- Best For: Fine-tuned control, specific hardware configurations
- Algorithm: No automatic assignment; user sets each service manually
VRAM-Based (Recommended)¶
- Description: Largest models assigned to GPU with most available VRAM
- Best For: Maximizing VRAM utilization, general use
- Algorithm: Sort models by VRAM requirement descending, assign to GPU with most free space
Latency-Optimized¶
- Description: Critical path models on fastest GPU
- Best For: Minimizing detection-to-analysis latency
- Algorithm: Assigns ai-yolo26 and ai-llm to GPU 0 (typically fastest), distributes others
Isolation-First¶
- Description: LLM gets dedicated GPU, all other services share remaining GPUs
- Best For: Preventing LLM memory pressure from affecting other models
- Algorithm: ai-llm alone on largest GPU, everything else on remaining GPU(s)
Balanced¶
- Description: Distribute VRAM evenly across all GPUs
- Best For: Multi-GPU systems where you want even utilization
- Algorithm: Bin-packing to minimize VRAM variance across GPUs
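The VRAM-Based strategy described above can be sketched as a greedy assignment: sort models by requirement descending, then place each on the GPU with the most free VRAM. This is an illustrative sketch, not the service's actual implementation (the LLM figure below uses the 131K-context estimate):

```python
def vram_based_assign(services: dict[str, float], gpus: dict[int, float]) -> dict[str, int]:
    """Greedy VRAM-based assignment: largest models first,
    each placed on the GPU with the most remaining free VRAM."""
    free = dict(gpus)  # GPU index -> free VRAM in GB
    assignments = {}
    for name, need in sorted(services.items(), key=lambda kv: kv[1], reverse=True):
        gpu = max(free, key=free.get)  # GPU with most free VRAM
        assignments[name] = gpu
        free[gpu] -= need
    return assignments

print(vram_based_assign(
    {"ai-llm": 21.7, "ai-florence": 1.5, "ai-clip": 0.8, "ai-enrichment-light": 1.2},
    {0: 24.0, 1: 4.0},
))
# {'ai-llm': 0, 'ai-florence': 1, 'ai-enrichment-light': 1, 'ai-clip': 0}
```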
Manual Assignment¶
When using Manual strategy or overriding automatic assignments:
- Select Manual from the strategy dropdown (or keep the current strategy)
- For each service in the assignment table:
  - Select the target GPU from the dropdown
  - Optionally adjust the VRAM budget for ai-enrichment
- Review any warnings about VRAM capacity
- Click Save to persist changes
VRAM Budget Override¶
The ai-enrichment service supports a VRAM budget override:
- Default budget: 6.8 GB
- Adjust when assigning to a smaller GPU
- The system will auto-suggest appropriate budgets
Example: Assigning ai-enrichment to a 4 GB GPU will suggest a 3.5 GB budget.
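The exact heuristic the auto-suggestion uses is not specified here; a sketch that reproduces the documented example, assuming a fixed 0.5 GB headroom reserved for CUDA context and runtime overhead:

```python
DEFAULT_BUDGET_GB = 6.8
HEADROOM_GB = 0.5  # assumed reserve for CUDA context/runtime overhead

def suggest_vram_budget(gpu_total_gb: float) -> float:
    """Suggest an ai-enrichment VRAM budget for a GPU of the given size.
    Illustrative heuristic only; the service's own rule may differ."""
    return min(DEFAULT_BUDGET_GB, gpu_total_gb - HEADROOM_GB)

print(suggest_vram_budget(4.0))   # 3.5, matching the documented example
print(suggest_vram_budget(24.0))  # capped at the 6.8 GB default
```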
Applying Changes and Restart Flow¶
Changes to GPU assignments require container restarts:
- Save Configuration - Saves to database and generates YAML files
- Click "Apply & Restart Services" - Triggers container recreation
- Monitor Status - UI shows restart progress and health status
- Verify Health - All services should return to "running" status
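The monitoring step can also be scripted against the status endpoint listed in the API Reference. A hedged sketch: the response shape (a `status` field reaching `"complete"`) is an assumption, and `fetch_status` is any callable returning the parsed JSON body of `GET /api/system/gpu-config/status`:

```python
import time

def wait_for_restart(fetch_status, timeout_s: float = 300, interval_s: float = 5):
    """Poll the GPU-config status endpoint until the restart completes.
    The 'status'/'complete' field names are assumptions for illustration."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = fetch_status()
        if body.get("status") == "complete":
            return body
        time.sleep(interval_s)
    raise TimeoutError("services did not return to a healthy state in time")
```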
Generated Files¶
The system generates two configuration files:
| File | Purpose |
|---|---|
| `config/docker-compose.gpu-override.yml` | Docker Compose override for container orchestration |
| `config/gpu-assignments.yml` | Human-readable reference file |
Example Override File¶
```yaml
# Auto-generated by GPU Config Service - DO NOT EDIT MANUALLY
services:
  ai-llm:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - '0'
              capabilities:
                - gpu
  ai-enrichment:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - '1'
              capabilities:
                - gpu
    environment:
      - VRAM_BUDGET_GB=3.5
```
Troubleshooting¶
No GPUs Detected¶
Symptoms: UI shows "No GPUs detected" message
Solutions:
- Verify NVIDIA drivers are installed: `nvidia-smi`
- Check the NVIDIA Container Toolkit: `docker run --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Ensure containers have GPU access in `docker-compose.prod.yml`
- Click "Re-scan GPUs" to trigger detection
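If you need to script the driver check, `nvidia-smi` can emit machine-readable CSV via its standard `--query-gpu`/`--format` options. A sketch (the field parsing assumes GPU names contain no commas):

```python
import subprocess

def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line of: index, name, memory.total (MiB), memory.used (MiB)."""
    index, name, total_mib, used_mib = [f.strip() for f in line.split(",")]
    return {"index": int(index), "name": name,
            "total_gb": int(total_mib) / 1024, "used_gb": int(used_mib) / 1024}

def detect_gpus() -> list[dict]:
    """List GPUs via nvidia-smi; returns [] if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [parse_gpu_line(line) for line in out.strip().splitlines() if line]

# Sample line as nvidia-smi would print it (values illustrative):
print(parse_gpu_line("0, NVIDIA RTX A5500, 24564, 19766"))
```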
Service Fails to Start After Assignment Change¶
Symptoms: Service remains in "starting" or "error" state
Solutions:
- Check container logs: `podman logs ai-llm`
- Verify the GPU is available: the assigned GPU may be in use by another process
- Check VRAM capacity: The model may exceed available VRAM
- Review warnings shown during configuration
VRAM Budget Exceeded Warning¶
Symptoms: Warning about VRAM budget exceeding GPU capacity
Solutions:
- Accept the auto-adjusted budget suggestion
- Manually set a lower VRAM budget
- Assign the service to a larger GPU
- Note: Some models in Model Zoo may not load with reduced budget
Fallback Behavior¶
If an assigned GPU becomes unavailable at runtime:
- Services automatically fall back to any available GPU
- A warning is logged
- Performance may be affected
- Reassign services via UI once GPU is restored
FAQ¶
Can I assign multiple services to the same GPU?¶
Yes. This is the default configuration for single-GPU systems. Services share VRAM, so ensure total requirements don't exceed GPU capacity.
What happens during a container restart?¶
- The affected container is stopped
- The override file is applied
- The container is recreated with new GPU assignment
- Models are reloaded on the new GPU
- Health checks verify the service is operational
How do I revert to default configuration?¶
- Select "VRAM-Based" strategy
- Click "Preview Changes" to see proposed assignments
- Click "Apply" to restore recommended configuration
- Alternatively, delete `config/docker-compose.gpu-override.yml` and restart all services
Can I use AMD GPUs?¶
Currently, only NVIDIA GPUs with CUDA support are supported. AMD GPU support (ROCm) is a future consideration.
How often should I reconfigure GPU assignments?¶
- After adding/removing GPUs
- When changing AI models
- If you notice VRAM pressure or performance issues
- After system updates that affect GPU drivers
API Reference¶
For programmatic access, see the GPU Configuration API endpoints:
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/system/gpus | List detected GPUs with utilization |
| GET | /api/system/gpu-config | Get current assignments and strategies |
| PUT | /api/system/gpu-config | Update assignments (saves to DB/YAML) |
| POST | /api/system/gpu-config/apply | Apply config and restart services |
| GET | /api/system/gpu-config/status | Get restart progress and health |
| POST | /api/system/gpu-config/detect | Re-scan for GPUs |
| GET | /api/system/gpu-config/preview | Preview auto-assignment for strategy |
Related Documentation¶
- Design Document - Technical design and implementation details
- AI Pipeline - AI service architecture and VRAM requirements
- Container Orchestration - Container startup and health checks