Multi-GPU Support Design

Date: 2025-01-23
Status: Approved
Author: Claude (with Mike Svoboda)

Overview

Enable multi-GPU support for AI services, allowing users to pin specific models to different GPUs. This improves performance, capacity, and isolation, and future-proofs the system for larger model deployments.

Goals

Performance - Run more models concurrently to reduce latency
Capacity - Utilize all available VRAM across GPUs
Isolation - Keep the LLM stable by separating it from smaller models
Future-proofing - Prepare for adding more/larger models later
Generalization - Support any multi-GPU configuration, not just the reference hardware

Reference Hardware

GPU	Model	VRAM	Power	Best For
GPU 0	RTX A5500	24 GB	230W	Large models (LLM)
GPU 1	RTX A400	4 GB	50W	Small/medium models

Design Decisions

Decision	Choice	Rationale
Configuration approach	Hybrid (auto + manual)	Sensible defaults with user override capability
Configuration location	UI settings panel	User-friendly, no CLI required
Assignment strategies	5 options (manual, VRAM, latency, isolation, balanced)	Different users have different priorities
When changes apply	Container restart via UI	Reliable, atomic, user-controlled
Storage	Database + config file	DB for runtime, file for inspection/recovery
Validation	Warn but allow	Informative, not restrictive
Runtime errors	Fallback to available GPU	Graceful degradation

Architecture

Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                         UI: GPU Configuration                           │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐  │
│  │ ai-llm: GPU 0   │  │ ai-yolo26: GPU 0│  │ ai-enrichment: GPU 1    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────────────┘  │
│                                                                         │
│  ⚠️ Warning: ai-enrichment VRAM budget (6.8GB) exceeds GPU 1 (4GB)     │
│                                                                         │
│                    [ Save ]  [ Apply & Restart Services ]               │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      Backend: GPU Config Service                        │
│                                                                         │
│  1. Save assignments to PostgreSQL                                      │
│  2. Generate docker-compose.gpu-override.yml                            │
│  3. Call podman-compose to recreate affected containers                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              docker-compose.gpu-override.yml (generated)                │
│                                                                         │
│  services:                                                              │
│    ai-llm:                                                              │
│      deploy:                                                            │
│        resources:                                                       │
│          reservations:                                                  │
│            devices:                                                     │
│              - driver: nvidia                                           │
│                device_ids: ['0']  # RTX A5500                           │
│                capabilities: [gpu]                                      │
│    ai-enrichment:                                                       │
│      deploy:                                                            │
│        resources:                                                       │
│          reservations:                                                  │
│            devices:                                                     │
│              - driver: nvidia                                           │
│                device_ids: ['1']  # RTX A400                            │
│                capabilities: [gpu]                                      │
│      environment:                                                       │
│        - VRAM_BUDGET_GB=3.5  # Auto-adjusted for smaller GPU            │
└─────────────────────────────────────────────────────────────────────────┘

Database Schema

New Tables

CREATE TABLE gpu_configurations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_name VARCHAR(64) NOT NULL UNIQUE,  -- 'ai-llm', 'ai-yolo26', etc.
    gpu_index INTEGER,                          -- NULL = auto-assign
    strategy VARCHAR(32) DEFAULT 'manual',      -- 'manual', 'vram_based', 'latency_optimized', 'isolation_first', 'balanced'
    vram_budget_override FLOAT,                 -- Override VRAM budget (for enrichment)
    enabled BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE gpu_devices (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    gpu_index INTEGER NOT NULL UNIQUE,
    name VARCHAR(128),                          -- 'NVIDIA RTX A5500'
    vram_total_mb INTEGER,
    vram_available_mb INTEGER,
    compute_capability VARCHAR(16),             -- '8.6'
    last_seen_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE system_settings (
    key VARCHAR(64) PRIMARY KEY,
    value JSONB NOT NULL,
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Example: key='gpu_assignment_strategy', value='{"default": "vram_based"}'

Config File (synced from DB)

# config/gpu-assignments.yml (auto-generated, do not edit manually)
generated_at: '2025-01-23T10:30:00Z'
strategy: vram_based
assignments:
  ai-llm: { gpu: 0, vram_budget: null }
  ai-yolo26: { gpu: 0, vram_budget: null }
  ai-enrichment: { gpu: 1, vram_budget: 3.5 }
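The inspection file above could be rendered from the stored assignments along the following lines. This is a minimal sketch, not the project's implementation: `render_assignments_yaml` and `_scalar` are hypothetical names, and the formatting is hand-rolled only to keep the example dependency-free (a YAML library would be the more likely choice in practice).

```python
from datetime import datetime, timezone


def _scalar(value):
    """Format a Python value as a YAML scalar (None becomes null)."""
    return "null" if value is None else str(value)


def render_assignments_yaml(strategy, assignments, now=None):
    """Render config/gpu-assignments.yml as text.

    `assignments` maps service name -> (gpu_index, vram_budget); either
    element may be None. Hypothetical helper, for illustration only.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        "# config/gpu-assignments.yml (auto-generated, do not edit manually)",
        f"generated_at: '{stamp}'",
        f"strategy: {strategy}",
        "assignments:",
    ]
    for service, (gpu, budget) in assignments.items():
        lines.append(
            f"  {service}: {{ gpu: {_scalar(gpu)}, vram_budget: {_scalar(budget)} }}"
        )
    return "\n".join(lines) + "\n"


example = render_assignments_yaml(
    "vram_based",
    {"ai-llm": (0, None), "ai-yolo26": (0, None), "ai-enrichment": (1, 3.5)},
    now=datetime(2025, 1, 23, 10, 30, tzinfo=timezone.utc),
)
```

Passing `now` explicitly keeps the output deterministic, which is convenient for the unit tests planned in Phase 2.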

API Endpoints

Method	Endpoint	Purpose
GET	`/api/system/gpus`	Returns detected GPUs with current utilization
GET	`/api/system/gpu-config`	Returns current GPU assignments + available strategies
PUT	`/api/system/gpu-config`	Updates assignments (saves to DB, syncs to YAML)
POST	`/api/system/gpu-config/apply`	Applies config and restarts affected services
GET	`/api/system/gpu-config/status`	Returns restart progress / container health
POST	`/api/system/gpu-config/detect`	Re-scans GPUs (updates gpu_devices table)
GET	`/api/system/gpu-config/preview`	Preview auto-assignment for a given strategy

Request/Response Examples

GET /api/system/gpus

{
  "gpus": [
    {
      "index": 0,
      "name": "RTX A5500",
      "vram_total_mb": 24564,
      "vram_used_mb": 19304,
      "compute_capability": "8.6"
    },
    {
      "index": 1,
      "name": "RTX A400",
      "vram_total_mb": 4094,
      "vram_used_mb": 329,
      "compute_capability": "8.6"
    }
  ]
}
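As an illustration of how a client might consume this response, the sketch below derives free VRAM per GPU from the payload shown above. The function name `free_vram_mb` is an assumption made for this example; only the field names come from the response.

```python
import json

# Sample payload copied from the GET /api/system/gpus example.
SAMPLE_RESPONSE = """
{
  "gpus": [
    {"index": 0, "name": "RTX A5500", "vram_total_mb": 24564,
     "vram_used_mb": 19304, "compute_capability": "8.6"},
    {"index": 1, "name": "RTX A400", "vram_total_mb": 4094,
     "vram_used_mb": 329, "compute_capability": "8.6"}
  ]
}
"""


def free_vram_mb(payload: dict) -> dict[int, int]:
    """Map each GPU index to its free VRAM in MB (total minus used)."""
    return {
        gpu["index"]: gpu["vram_total_mb"] - gpu["vram_used_mb"]
        for gpu in payload["gpus"]
    }


free = free_vram_mb(json.loads(SAMPLE_RESPONSE))
print(free)
```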

PUT /api/system/gpu-config

{
  "strategy": "manual",
  "assignments": [
    { "service": "ai-llm", "gpu_index": 0 },
    { "service": "ai-yolo26", "gpu_index": 0 },
    { "service": "ai-enrichment", "gpu_index": 1, "vram_budget_override": 3.5 }
  ]
}

Response:

{
  "success": true,
  "warnings": ["ai-enrichment VRAM budget (6.8 GB) exceeds GPU 1 (4 GB). Auto-adjusted to 3.5 GB."]
}
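The auto-adjustment reported in the warning could work roughly as sketched here. The design does not state the rule behind the 3.5 GB figure; this example assumes a fixed 0.5 GB headroom below the GPU's capacity, and `adjust_vram_budget` and `headroom_gb` are hypothetical names.

```python
from typing import Optional


def adjust_vram_budget(
    service: str,
    gpu_index: int,
    requested_gb: float,
    gpu_total_gb: float,
    headroom_gb: float = 0.5,  # assumption: not specified in the design
) -> tuple[float, Optional[str]]:
    """Clamp a VRAM budget that exceeds the GPU and describe the change.

    Returns (budget_gb, warning). The warning is None when the requested
    budget already fits on the GPU.
    """
    if requested_gb <= gpu_total_gb:
        return requested_gb, None

    adjusted = max(gpu_total_gb - headroom_gb, 0.0)
    warning = (
        f"{service} VRAM budget ({requested_gb:g} GB) exceeds "
        f"GPU {gpu_index} ({gpu_total_gb:g} GB). "
        f"Auto-adjusted to {adjusted:g} GB."
    )
    return adjusted, warning


budget, warning = adjust_vram_budget("ai-enrichment", 1, 6.8, 4.0)
```

Because validation is "warn but allow", the function returns the warning rather than raising, leaving the caller free to save the assignment anyway.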

Container Orchestration

Override File Generation

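import yaml


def generate_override_file(assignments: list[GpuAssignment]) -> str:
    """Generate docker-compose.gpu-override.yml content."""
    services = {}
    for assignment in assignments:
        # A NULL gpu_index means "auto-assign"; it must be resolved to a
        # concrete index before an override entry can be written.
        if assignment.gpu_index is None:
            raise ValueError(f"{assignment.service_name} has no GPU assigned")
        service_config = {
            "deploy": {
                "resources": {
                    "reservations": {
                        "devices": [{
                            "driver": "nvidia",
                            "device_ids": [str(assignment.gpu_index)],
                            "capabilities": ["gpu"]
                        }]
                    }
                }
            }
        }
        # Compare against None so that an explicit budget of 0 is not dropped.
        if assignment.vram_budget_override is not None:
            service_config["environment"] = [
                f"VRAM_BUDGET_GB={assignment.vram_budget_override}"
            ]
        services[assignment.service_name] = service_config

    # sort_keys=False keeps services and keys in insertion order.
    return yaml.safe_dump({"services": services}, sort_keys=False)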

Container Restart Flow

import asyncio
from pathlib import Path


async def apply_gpu_config(self, assignments: list[GpuAssignment]) -> ApplyResult:
    # 1. Determine which services changed. This runs before the override
    #    file is overwritten so the comparison still sees the previous state.
    changed = self._diff_assignments(current=self._get_current(), new=assignments)

    # 2. Write override file
    override_path = Path("config/docker-compose.gpu-override.yml")
    override_path.write_text(generate_override_file(assignments))

    # 3. Recreate changed containers via subprocess
    for service in changed:
        await self._recreate_service(service)

    return ApplyResult(restarted=changed, success=True)

async def _recreate_service(self, service_name: str) -> None:
    """Recreate a single service with the new GPU config."""
    cmd = [
        "podman-compose",
        "-f", "docker-compose.prod.yml",
        "-f", "config/docker-compose.gpu-override.yml",
        "up", "-d", "--force-recreate", "--no-deps",
        service_name
    ]
    proc = await asyncio.create_subprocess_exec(*cmd)
    # Surface a failed recreate instead of reporting success unconditionally.
    if await proc.wait() != 0:
        raise RuntimeError(
            f"podman-compose failed to recreate {service_name} "
            f"(exit code {proc.returncode})"
        )
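The restart flow relies on `_diff_assignments`, which the design does not spell out. One plausible shape is sketched below, assuming `GpuAssignment` carries the three fields used in the override generator; the dataclass definition here is a stand-in for illustration, not the project's model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuAssignment:
    """Stand-in for the real model; fields inferred from the design."""
    service_name: str
    gpu_index: Optional[int]
    vram_budget_override: Optional[float] = None


def diff_assignments(
    current: list[GpuAssignment], new: list[GpuAssignment]
) -> list[str]:
    """Return names of services whose assignment is new or has changed."""
    before = {a.service_name: a for a in current}
    return [a.service_name for a in new if before.get(a.service_name) != a]


current = [
    GpuAssignment("ai-llm", 0),
    GpuAssignment("ai-yolo26", 0),
    GpuAssignment("ai-enrichment", 0),
]
new = [
    GpuAssignment("ai-llm", 0),
    GpuAssignment("ai-yolo26", 0),
    GpuAssignment("ai-enrichment", 1, vram_budget_override=3.5),
]
changed = diff_assignments(current, new)
```

Only services present in `new` are considered, so a service dropped from the configuration would need separate handling. Restricting the restart to changed services keeps untouched containers, such as the LLM, running.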

Frontend UI

Component Structure

src/components/settings/
├── GpuSettings.tsx           # Main container
├── GpuDeviceCard.tsx         # Shows each GPU with stats
├── GpuAssignmentTable.tsx    # Service → GPU mapping table
├── GpuStrategySelector.tsx   # Strategy dropdown with descriptions
└── GpuApplyButton.tsx        # Apply & Restart with status

UI Layout

┌─────────────────────────────────────────────────────────────────────────┐
│  Settings > GPU Configuration                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─ Detected GPUs ─────────────────────────────────────────────────┐   │
│  │  GPU 0: RTX A5500    24 GB   ████████████░░░░ 19.3/24 GB used   │   │
│  │  GPU 1: RTX A400      4 GB   ██░░░░░░░░░░░░░░  0.3/4 GB used    │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  Assignment Strategy: [ VRAM-based (Recommended) ▼ ]                    │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ ○ Manual         - You control each assignment                   │  │
│  │ ● VRAM-based     - Largest models on largest GPU                 │  │
│  │ ○ Latency-opt.   - Critical path models on fastest GPU           │  │
│  │ ○ Isolation      - LLM gets dedicated GPU                        │  │
│  │ ○ Balanced       - Distribute VRAM evenly                        │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌─ Service Assignments ───────────────────────────────────────────┐   │
│  │  Service          Model              VRAM Est.   GPU            │   │
│  │  ────────────────────────────────────────────────────────────── │   │
│  │  ai-llm           Nemotron-30B       ~21.7 GB    [ GPU 0 ▼ ]    │   │
│  │  ai-yolo26        YOLO26             ~650 MB     [ GPU 0 ▼ ]    │   │
│  │  ai-florence      Florence-2-L       ~1.5 GB     [ GPU 0 ▼ ]    │   │
│  │  ai-clip          CLIP ViT-L         ~1.2 GB     [ GPU 0 ▼ ]    │   │
│  │  ai-enrichment    Model Zoo          ~6.8 GB     [ GPU 1 ▼ ]    │   │
│  │                   └─ VRAM Budget: [ 3.5 ] GB  ⚠️ Adjusted       │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  ⚠️ Warning: ai-enrichment budget (6.8 GB) exceeds GPU 1 (4 GB).       │
│     Budget auto-adjusted to 3.5 GB. Some models may not load.          │
│                                                                         │
│  [ Preview Changes ]  [ Save ]  [ Apply & Restart Services ]           │
│                                                                         │
│  Status: ● All services running                                         │
└─────────────────────────────────────────────────────────────────────────┘

Assignment Strategies

Strategy	Description	Algorithm
Manual	User controls each assignment	No auto-assignment
VRAM-based	Largest models on largest GPU	Sort models by VRAM desc, assign to GPU with most free space
Latency-optimized	Critical path on fastest GPU	ai-yolo26 + ai-llm on the fastest GPU (GPU 0 on the reference hardware), others distributed
Isolation-first	LLM gets dedicated GPU	ai-llm alone on largest GPU, everything else shares remaining
Balanced	Distribute VRAM evenly	Bin-packing to minimize VRAM variance across GPUs
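Two of the strategies in the table can be sketched as simple greedy passes. This is one possible reading of the "Algorithm" column rather than a settled implementation; the function names and the use of MB-denominated dictionaries are assumptions.

```python
def assign_vram_based(
    models_mb: dict[str, int], gpu_free_mb: dict[int, int]
) -> dict[str, int]:
    """Place models largest-first on whichever GPU has the most free VRAM."""
    if not gpu_free_mb:
        raise ValueError("no GPUs available")
    free = dict(gpu_free_mb)  # copy so the caller's dict is not mutated
    result = {}
    by_size = sorted(models_mb.items(), key=lambda item: item[1], reverse=True)
    for service, need in by_size:
        gpu = max(free, key=lambda index: free[index])
        result[service] = gpu
        free[gpu] -= need  # may go negative: validation warns, does not block
    return result


def assign_isolation_first(
    models_mb: dict[str, int],
    gpu_total_mb: dict[int, int],
    llm_service: str = "ai-llm",
) -> dict[str, int]:
    """Give the LLM the largest GPU; everything else shares the remainder."""
    largest = max(gpu_total_mb, key=lambda index: gpu_total_mb[index])
    others = {i: mb for i, mb in gpu_total_mb.items() if i != largest}
    rest = {s: mb for s, mb in models_mb.items() if s != llm_service}
    result = {llm_service: largest}
    # With a single GPU there is nothing to isolate onto, so share it.
    result.update(assign_vram_based(rest, others or gpu_total_mb))
    return result


# VRAM estimates and GPU sizes taken from this document (in MB).
MODELS = {"ai-llm": 21700, "ai-yolo26": 650, "ai-florence": 1500,
          "ai-clip": 1200, "ai-enrichment": 6800}
GPUS = {0: 24564, 1: 4094}

vram_plan = assign_vram_based(MODELS, GPUS)
isolated_plan = assign_isolation_first(MODELS, GPUS)
```

On the reference hardware, the VRAM-based pass reproduces the assignments shown in the UI mock-up: everything on GPU 0 except ai-enrichment, which lands on GPU 1 and triggers the budget warning.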

Error Handling

Validation Warnings (Non-blocking)

Condition	Warning Message
VRAM exceeds GPU capacity	"ai-enrichment budget (6.8 GB) exceeds GPU 1 (4 GB)"
Multiple large models on small GPU	"Combined VRAM (~23 GB) exceeds GPU 1 capacity (4 GB)"
LLM on small GPU	"Nemotron-30B requires ~21.7 GB, GPU 1 only has 4 GB"
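A non-blocking validator producing messages like those in the table might look as follows. This is a sketch under assumptions: `validate_assignments` is a hypothetical name, sizes are passed in GB, and the message wording is modelled on the examples above.

```python
from collections import defaultdict


def validate_assignments(
    assignments: dict[str, int],
    model_gb: dict[str, float],
    gpu_gb: dict[int, float],
) -> list[str]:
    """Return warnings for over-committed GPUs; never raises on overage."""
    warnings = []
    total_per_gpu = defaultdict(float)
    count_per_gpu = defaultdict(int)

    for service, gpu in assignments.items():
        need = model_gb[service]
        total_per_gpu[gpu] += need
        count_per_gpu[gpu] += 1
        # A single model that cannot fit on its GPU.
        if need > gpu_gb[gpu]:
            warnings.append(
                f"{service} requires ~{need:g} GB, "
                f"GPU {gpu} only has {gpu_gb[gpu]:g} GB"
            )

    for gpu in sorted(total_per_gpu):
        # Several models that only overflow the GPU in combination.
        if count_per_gpu[gpu] > 1 and total_per_gpu[gpu] > gpu_gb[gpu]:
            warnings.append(
                f"Combined VRAM (~{total_per_gpu[gpu]:g} GB) exceeds "
                f"GPU {gpu} capacity ({gpu_gb[gpu]:g} GB)"
            )
    return warnings


llm_on_small_gpu = validate_assignments(
    {"ai-llm": 1}, {"ai-llm": 21.7}, {0: 24.0, 1: 4.0}
)
```

Returning a list instead of raising matches the "warn but allow" decision: the API can attach the list to its response while still saving the configuration.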

Runtime Fallbacks

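import logging
import os

import torch

logger = logging.getLogger(__name__)


def get_target_gpu() -> int:
    """Get assigned GPU, fall back to any available if assignment fails."""
    # Caveat: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES from 0,
    # so a value such as "1" may not be a valid index inside the process.
    assigned = os.environ.get("CUDA_VISIBLE_DEVICES")
    if assigned:
        try:
            index = int(assigned)  # ValueError for lists ("0,1") or GPU UUIDs
            torch.cuda.set_device(index)
            return index
        except (ValueError, RuntimeError) as e:
            logger.warning(f"Assigned GPU {assigned} unavailable: {e}, falling back")

    if torch.cuda.is_available():
        return 0
    raise RuntimeError("No GPU available")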

Edge Cases

Scenario	Behavior
GPU removed/failed	Service falls back to available GPU, logs warning
User assigns same GPU to all	Allowed (current behavior)
No GPUs detected	UI shows "No GPUs detected", disables controls
Container restart fails	UI shows error, rollback option offered
Config file write fails	API returns 500, DB transaction rolled back
Podman API unavailable	Graceful error message in UI

Implementation Phases

Phase 1: Backend Foundation

Add gpu_configurations, gpu_devices, and system_settings tables + Alembic migration
Create GpuConfigService with GPU detection via pynvml or nvidia-smi
Implement API endpoints: GET /gpus, GET/PUT /gpu-config
Add model VRAM estimates to existing model registry
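For the nvidia-smi detection path mentioned in this phase, a minimal sketch could shell out to the CLI and parse its CSV output. The helper names are hypothetical, and the result keys simply mirror the `/api/system/gpus` response; note that nvidia-smi reports memory in MiB.

```python
import csv
import io
import subprocess

# Query only fields that nvidia-smi is known to expose via --query-gpu.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi(output: str) -> list[dict]:
    """Parse 'index, name, total, used' CSV rows into dictionaries."""
    rows = csv.reader(io.StringIO(output), skipinitialspace=True)
    return [
        {
            "index": int(row[0]),
            "name": row[1].strip(),
            "vram_total_mb": int(row[2]),
            "vram_used_mb": int(row[3]),
        }
        for row in rows
        if row  # skip blank lines
    ]


def detect_gpus() -> list[dict]:
    """Return detected GPUs, or an empty list if nvidia-smi is unusable."""
    try:
        result = subprocess.run(
            NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_nvidia_smi(result.stdout)


sample = "0, NVIDIA RTX A5500, 24564, 19304\n1, NVIDIA RTX A400, 4094, 329\n"
parsed = parse_nvidia_smi(sample)
```

Returning an empty list on failure lines up with the "No GPUs detected" edge case, where the UI disables its controls instead of erroring.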

Phase 2: Override File Generation

Implement generate_override_file() for docker-compose override
Add config file sync on save (config/docker-compose.gpu-override.yml)
Add config/gpu-assignments.yml generation for human inspection
Unit tests for YAML generation

Phase 3: Container Orchestration

Implement apply_gpu_config() with podman-compose subprocess
Add restart status tracking and polling endpoint
Add rollback capability (restore previous override file)
Integration tests with container mocking
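The rollback capability listed above (restoring the previous override file) could be approached as in this sketch. The `.bak` suffix and both function names are assumptions for illustration; the design only states that the previous file is restored.

```python
import shutil
import tempfile
from pathlib import Path


def write_with_backup(path: Path, content: str) -> Path:
    """Write `content` to `path`, keeping the previous version as a backup."""
    backup = path.with_name(path.name + ".bak")
    if path.exists():
        shutil.copy2(path, backup)
    else:
        # No previous version: make sure a stale backup is not restored later.
        backup.unlink(missing_ok=True)
    path.write_text(content)
    return backup


def rollback(path: Path, backup: Path) -> bool:
    """Restore the backup; return False if there was nothing to restore."""
    if backup.exists():
        shutil.copy2(backup, path)
        return True
    path.unlink(missing_ok=True)  # first-ever write: remove the new file
    return False


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    override = Path(tmp) / "docker-compose.gpu-override.yml"
    override.write_text("services: {}\n")
    backup = write_with_backup(override, "services: {ai-llm: {}}\n")
    restored = rollback(override, backup)
    restored_text = override.read_text()
```

After a rollback, the affected containers would still need to be recreated against the restored file for the previous GPU assignments to take effect.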

Phase 4: Frontend UI

GPU device cards with real-time VRAM display
Strategy selector with descriptions
Assignment table with dropdowns
Warning display for VRAM overages
Apply & Restart button with progress indicator

Phase 5: Polish & Documentation

Auto-adjustment of VRAM budget when assigning to smaller GPU
"Preview Changes" diff view
Update CLAUDE.md and docs with multi-GPU setup instructions
Add GPU config to system backup/restore

File Changes Summary

Area	New/Modified Files
Database	`alembic/versions/xxx_add_gpu_config.py`
Models	`backend/models/gpu_config.py`
Services	`backend/services/gpu_config_service.py`
API	`backend/api/routes/gpu_config.py`
Frontend	`frontend/src/components/settings/Gpu*.tsx`
Frontend	`frontend/src/services/gpuConfigApi.ts`
Config	`config/docker-compose.gpu-override.yml` (generated)
Docs	`docs/development/multi-gpu.md`

Model VRAM Estimates

Service	Model	VRAM Estimate
ai-llm	Nemotron-3-Nano-30B (Q4_K_M)	~21.7 GB
ai-yolo26	YOLO26	~650 MB
ai-florence	Florence-2-Large	~1.5 GB
ai-clip	CLIP ViT-L	~1.2 GB
ai-enrichment	Model Zoo (9 models)	~6.8 GB budget

Usage Example

After implementation, users with multiple GPUs would:

Navigate to Settings > GPU Configuration
See detected GPUs with current utilization
Either select an auto-assignment strategy or manually assign services
Review warnings if VRAM allocations exceed capacity
Click Apply & Restart Services
Monitor restart progress in the UI
Verify services are healthy on their assigned GPUs

Future Considerations

Support for AMD GPUs (ROCm) - would require separate detection path
Multi-node GPU support (distributed across machines)
Dynamic model migration between GPUs without restart
GPU memory fragmentation monitoring
Automatic strategy recommendation based on workload patterns