
AI Services GHCR Deployment

Deploy the AI services stack using backend and frontend images from GitHub Container Registry (GHCR) alongside locally built AI service containers.

Time to read: ~12 min
Prerequisites: AI Installation, GPU Setup


Overview

This guide covers deploying the AI services stack using prebuilt images where they are available. Currently, the CI/CD pipeline publishes the backend and frontend images to GHCR, while AI service images are built locally due to their GPU-specific requirements and large model dependencies.

Image Availability

| Service | GHCR Image | Notes |
| --- | --- | --- |
| backend | ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:latest | Published on every merge to main |
| frontend | ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:latest | Published on every merge to main |
| ai-yolo26 | Build locally | YOLO26 object detection |
| ai-llm | Build locally | Nemotron LLM (llama.cpp) |
| ai-florence | Build locally | Florence-2 vision-language |
| ai-clip | Build locally | CLIP embeddings |
| ai-enrichment | Build locally | Vehicle/pet/clothing classification |

Why AI Services Are Built Locally

AI service containers are intentionally not published to GHCR because:

  1. Model files: Large AI models (2-18GB) need to be mounted at runtime (see the .env sketch after this list)
  2. GPU drivers: CUDA version must match the host's nvidia-container-toolkit
  3. Build customization: Operators may need different quantization levels or model versions
  4. Storage costs: Multi-GB images would be expensive to host and transfer
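
Model locations and cache paths are supplied through environment variables that the compose file interpolates (the volume-mount snippets later in this guide show where each one is used). A minimal .env sketch using the defaults assumed throughout this guide; adjust the paths to wherever your models actually live:

# .env (sketch): host-specific paths and tuning
AI_MODELS_PATH=/export/ai_models          # root directory for manually downloaded models
HF_CACHE=~/.cache/huggingface             # HuggingFace cache mounted into ai-yolo26
GPU_LAYERS=35                             # ai-llm: model layers offloaded to GPU
CTX_SIZE=131072                           # ai-llm: context window size (tokens)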

Quick Start

Deploy Full Stack (Backend/Frontend from GHCR + Local AI)

# 1. Pull latest backend and frontend from GHCR
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:latest
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:latest

# 2. Build AI services locally (first time only, ~10-15 min)
podman-compose -f docker-compose.prod.yml build ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

# 3. Start the full stack
podman-compose -f docker-compose.prod.yml up -d

# 4. Verify deployment
curl http://localhost:8000/api/system/health/ready

Deploy Core AI Only (No Optional Services)

# Build and start only YOLO26 and Nemotron
podman-compose -f docker-compose.prod.yml build ai-yolo26 ai-llm
podman-compose -f docker-compose.prod.yml up -d ai-yolo26 ai-llm

# Verify
curl http://localhost:8095/health  # YOLO26
curl http://localhost:8091/health  # Nemotron

AI Service Container Reference

ai-yolo26 (YOLO26)

Object detection service using the YOLO26 transformer model.

| Property | Value |
| --- | --- |
| Port | 8095 |
| VRAM | ~3-4GB |
| Base Image | pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime |
| Model | Auto-downloads from HuggingFace on first start |
| Health Check | GET /health |

Build:

podman-compose -f docker-compose.prod.yml build ai-yolo26

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| YOLO26_CONFIDENCE | 0.5 | Detection confidence threshold (0.0-1.0) |
| YOLO26_MODEL_PATH | PekingU/yolo26_r50vd_coco_o365 | HuggingFace model ID |

Volume Mounts:

volumes:
  # :U tells Podman to recursively chown the volume to match container user
  # Docker ignores the :U flag, making this backward compatible
  - ${HF_CACHE:-~/.cache/huggingface}:/cache/huggingface:U
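
Because the YOLO26 weights are auto-downloaded from HuggingFace on first start, you can optionally pre-populate the host cache so the container comes up without a download step. A sketch, assuming the huggingface_hub CLI is installed; the model ID is the YOLO26_MODEL_PATH default from the table above:

# Optional: pre-warm the HuggingFace cache on the host
pip install -U "huggingface_hub[cli]"
huggingface-cli download PekingU/yolo26_r50vd_coco_o365 --cache-dir "${HF_CACHE:-$HOME/.cache/huggingface}"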

ai-llm (Nemotron)

Large language model service for risk analysis using llama.cpp.

| Property | Value |
| --- | --- |
| Port | 8091 |
| VRAM | ~3GB (Mini 4B) or ~14GB (Nano 30B) |
| Base Image | nvidia/cuda:12.4.1-runtime-ubuntu22.04 |
| Model | Requires manual download (see below) |
| Health Check | GET /health |

Build:

podman-compose -f docker-compose.prod.yml build ai-llm

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| GPU_LAYERS | 35 | Number of model layers to offload to GPU |
| CTX_SIZE | 131072 | Context window size (tokens) |
| PARALLEL | 1 | Number of parallel inference slots |

Volume Mounts:

volumes:
  - ${AI_MODELS_PATH:-/export/ai_models}/nemotron/nemotron-3-nano-30b-a3b-q4km:/models:ro

Model Download (Production - Nano 30B):

Download from the official NVIDIA HuggingFace repository: nvidia/Nemotron-3-Nano-30B-A3B-GGUF

mkdir -p /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km
cd /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km
wget https://huggingface.co/nvidia/Nemotron-3-Nano-30B-A3B-GGUF/resolve/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf
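
If you prefer the HuggingFace CLI over wget, the same file can be fetched that way; a sketch, assuming the huggingface_hub CLI is installed (the repository and filename are the ones given above):

# Alternative download via the HuggingFace CLI
huggingface-cli download nvidia/Nemotron-3-Nano-30B-A3B-GGUF \
  Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf \
  --local-dir /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km

# Confirm the file is in place before starting ai-llm
ls -lh /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km/*.gguf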

ai-florence (Florence-2)

Vision-language model for dense captioning and visual understanding.

| Property | Value |
| --- | --- |
| Port | 8092 |
| VRAM | ~1.2GB |
| Base Image | pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime |
| Model | Requires manual download |
| Health Check | GET /health |

Build:

podman-compose -f docker-compose.prod.yml build ai-florence

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_PATH | /models/florence-2-large | Path to Florence-2 model |

Volume Mounts:

volumes:
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/florence-2-large:/models/florence-2-large:ro

Model Download:

mkdir -p /export/ai_models/model-zoo/florence-2-large
cd /export/ai_models/model-zoo/florence-2-large
git lfs install
git clone https://huggingface.co/microsoft/Florence-2-large .

ai-clip (CLIP ViT-L)

CLIP embedding service for entity re-identification.

| Property | Value |
| --- | --- |
| Port | 8093 |
| VRAM | ~800MB |
| Base Image | pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime |
| Model | Requires manual download |
| Health Check | GET /health |

Build:

podman-compose -f docker-compose.prod.yml build ai-clip

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| CLIP_MODEL_PATH | /models/clip-vit-l | Path to CLIP model |

Volume Mounts:

volumes:
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/clip-vit-l:/models/clip-vit-l:ro

Model Download:

mkdir -p /export/ai_models/model-zoo/clip-vit-l
cd /export/ai_models/model-zoo/clip-vit-l
git lfs install
git clone https://huggingface.co/openai/clip-vit-large-patch14 .

ai-enrichment (Combined Classification)

Combined service for vehicle, pet, and clothing classification, plus depth estimation.

| Property | Value |
| --- | --- |
| Port | 8094 |
| VRAM | ~2.5GB (all models loaded) |
| Base Image | pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime |
| Models | Requires manual download (4 models) |
| Health Check | GET /health |

Build:

podman-compose -f docker-compose.prod.yml build ai-enrichment

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| VEHICLE_MODEL_PATH | /models/vehicle-segment-classification | Vehicle classifier model |
| PET_MODEL_PATH | /models/pet-classifier | Pet classifier model |
| CLOTHING_MODEL_PATH | /models/fashion-clip | FashionCLIP model |
| DEPTH_MODEL_PATH | /models/depth-anything-v2-small | Depth estimation model |

Volume Mounts:

volumes:
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/vehicle-segment-classification:/models/vehicle-segment-classification:ro
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/pet-classifier:/models/pet-classifier:ro
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/fashion-clip:/models/fashion-clip:ro
  - ${AI_MODELS_PATH:-/export/ai_models}/model-zoo/depth-anything-v2-small:/models/depth-anything-v2-small:ro

Model Downloads:

# Create directories
mkdir -p /export/ai_models/model-zoo/{vehicle-segment-classification,pet-classifier,fashion-clip,depth-anything-v2-small}

# Vehicle classification
cd /export/ai_models/model-zoo/vehicle-segment-classification
git lfs install
git clone https://huggingface.co/lxyuan/vit-base-patch16-224-vehicle-segment-classification .

# Pet classifier
cd /export/ai_models/model-zoo/pet-classifier
git clone https://huggingface.co/microsoft/resnet-18 .

# FashionCLIP
cd /export/ai_models/model-zoo/fashion-clip
git clone https://huggingface.co/patrickjohncyh/fashion-clip .

# Depth estimation
cd /export/ai_models/model-zoo/depth-anything-v2-small
git clone https://huggingface.co/depth-anything/Depth-Anything-V2-Small .
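
After cloning the four repositories, a quick disk check confirms each model directory is actually populated (sizes vary by model):

# Verify the enrichment model directories are populated
du -sh /export/ai_models/model-zoo/{vehicle-segment-classification,pet-classifier,fashion-clip,depth-anything-v2-small}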

GPU Configuration

GPU Passthrough (Docker Compose)

All AI services require GPU access. The docker-compose.prod.yml includes this configuration:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
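
On multi-GPU hosts you can pin a service to a particular card by replacing count: 1 with the Compose device_ids field in that service's deploy block. A sketch (GPU index 0 is an example; support for device_ids may depend on your podman-compose or docker compose version):

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]      # pin to GPU 0; adjust to your hardware layout
          capabilities: [gpu]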

Verify GPU Access

# Test GPU access from container
podman run --rm --device nvidia.com/gpu=all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Or for Docker
docker run --rm --gpus all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

VRAM Requirements

| Deployment Scenario | Services | Total VRAM |
| --- | --- | --- |
| Core only | ai-yolo26 + ai-llm (Mini 4B) | ~7GB |
| Core (production) | ai-yolo26 + ai-llm (Nano 30B) | ~18GB |
| Full stack | All 5 AI services | ~22-24GB |
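
To see which scenario fits your hardware, check total and free VRAM before starting the stack:

# Report total and currently free VRAM per GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv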

Deployment Patterns

Pattern 1: Full Production Stack

Deploy backend and frontend from GHCR plus locally built AI services:

# Clone the repository
git clone https://github.com/mikesvoboda/nemotron-v3-home-security-intelligence.git
cd nemotron-v3-home-security-intelligence

# Run setup script
./setup.sh

# Pull backend/frontend from GHCR
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:latest
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:latest

# Build AI services
podman-compose -f docker-compose.prod.yml build ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

# Download models (see Model Downloads section above)

# Start all services
podman-compose -f docker-compose.prod.yml up -d

Pattern 2: Core Services Only

Deploy without optional AI services (Florence, CLIP, Enrichment):

# Build only core AI
podman-compose -f docker-compose.prod.yml build ai-yolo26 ai-llm

# Download Nemotron model from official NVIDIA repository
mkdir -p /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km
wget -O /export/ai_models/nemotron/nemotron-3-nano-30b-a3b-q4km/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf \
  https://huggingface.co/nvidia/Nemotron-3-Nano-30B-A3B-GGUF/resolve/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf

# Start core services only
podman-compose -f docker-compose.prod.yml up -d \
  postgres redis backend frontend ai-yolo26 ai-llm

Pattern 3: AI Services on Separate GPU Host

Run AI services on a dedicated GPU machine:

On GPU host:

# Clone repo on GPU host
git clone https://github.com/mikesvoboda/nemotron-v3-home-security-intelligence.git
cd nemotron-v3-home-security-intelligence

# Build and start AI services only
podman-compose -f docker-compose.prod.yml build ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment
podman-compose -f docker-compose.prod.yml up -d ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

On application host:

Configure .env to point to the GPU host:

GPU_HOST=10.0.0.50  # Your GPU host IP
YOLO26_URL=http://${GPU_HOST}:8095
NEMOTRON_URL=http://${GPU_HOST}:8091
FLORENCE_URL=http://${GPU_HOST}:8092
CLIP_URL=http://${GPU_HOST}:8093
ENRICHMENT_URL=http://${GPU_HOST}:8094

Start non-AI services:

podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:latest
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:latest

podman-compose -f docker-compose.prod.yml up -d postgres redis backend frontend
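
Before relying on the remote AI services, confirm that the application host can actually reach them (ports 8091-8095 must be reachable through the GPU host's firewall). A quick check from the application host, using the ports listed in this guide:

# Probe each remote AI service health endpoint
GPU_HOST=10.0.0.50   # same value as in .env
for port in 8091 8092 8093 8094 8095; do
  if curl -sf "http://${GPU_HOST}:${port}/health" > /dev/null; then
    echo "[OK]   port ${port}"
  else
    echo "[FAIL] port ${port}"
  fi
done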

Updating Containers

Update Backend/Frontend from GHCR

# Pull latest images
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:latest
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:latest

# Recreate containers with new images
podman-compose -f docker-compose.prod.yml up -d backend frontend

Update AI Services (Rebuild)

# Pull latest source code
git pull origin main

# Rebuild AI containers
podman-compose -f docker-compose.prod.yml build --no-cache ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

# Recreate containers
podman-compose -f docker-compose.prod.yml up -d ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

Use Specific Version (SHA Tag)

# Deploy specific commit version
SHA=abc123
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend:${SHA}
podman pull ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/frontend:${SHA}
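
If you are unsure which SHA tags have been published, you can query the registry directly. A sketch using skopeo, assuming it is installed and the package is publicly readable:

# List published tags for the backend image
skopeo list-tags docker://ghcr.io/mikesvoboda/nemotron-v3-home-security-intelligence/backend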

Health Verification

Quick Health Check

# All services
curl http://localhost:8000/api/system/health/ready  # Backend

# AI services individually
curl http://localhost:8095/health  # YOLO26
curl http://localhost:8091/health  # Nemotron
curl http://localhost:8092/health  # Florence-2
curl http://localhost:8093/health  # CLIP
curl http://localhost:8094/health  # Enrichment

Comprehensive Check Script

#!/bin/bash
echo "=== Health Check ==="

services=(
  "backend:8000/api/system/health/ready"
  "ai-yolo26:8095/health"
  "ai-llm:8091/health"
  "ai-florence:8092/health"
  "ai-clip:8093/health"
  "ai-enrichment:8094/health"
)

for svc in "${services[@]}"; do
  name="${svc%%:*}"
  url="http://localhost:${svc#*:}"
  if curl -sf "$url" > /dev/null 2>&1; then
    echo "[OK] $name"
  else
    echo "[FAIL] $name ($url)"
  fi
done

GPU Utilization Check

# Watch GPU memory and utilization
watch -n 1 nvidia-smi

# Per-process VRAM usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

Troubleshooting

Container Fails to Start

# Check container logs
podman-compose -f docker-compose.prod.yml logs ai-yolo26
podman-compose -f docker-compose.prod.yml logs ai-llm

# Check if model files exist
ls -la /export/ai_models/nemotron/
ls -la /export/ai_models/model-zoo/

GPU Not Available in Container

# Verify nvidia-container-toolkit is installed
nvidia-ctk --version

# Regenerate CDI spec (Podman)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Test GPU access
podman run --rm --device nvidia.com/gpu=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

CUDA Out of Memory

# Check current VRAM usage
nvidia-smi

# Reduce GPU layers for Nemotron (in .env or docker-compose.override.yml)
GPU_LAYERS=25  # Default is 35

# Stop optional AI services to free VRAM
podman-compose -f docker-compose.prod.yml stop ai-florence ai-clip ai-enrichment
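
If you prefer an override file to editing .env, a minimal sketch that lowers the offloaded layer count for ai-llm only (assuming the service reads GPU_LAYERS from its environment, per the variable table earlier in this guide):

# docker-compose.override.yml (sketch)
services:
  ai-llm:
    environment:
      GPU_LAYERS: "25"   # fewer layers on the GPU means less VRAM, slower inference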

Service Timeout on Startup

AI services have long startup times due to model loading:

| Service | Expected Startup Time |
| --- | --- |
| ai-yolo26 | 60-90 seconds |
| ai-llm | 120-180 seconds |
| ai-florence | 60-120 seconds |
| ai-clip | 30-60 seconds |
| ai-enrichment | 120-180 seconds |

Wait for health checks to pass before testing:

# Watch service logs during startup
podman-compose -f docker-compose.prod.yml logs -f ai-llm
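
To block until a service is actually ready, poll its health endpoint with a timeout; a sketch for ai-llm (adjust the port and the 300-second ceiling for slower hardware):

# Wait up to 5 minutes for the Nemotron health endpoint
elapsed=0
until curl -sf http://localhost:8091/health > /dev/null; do
  if [ "$elapsed" -ge 300 ]; then
    echo "ai-llm not ready after 300s"
    break
  fi
  sleep 5
  elapsed=$((elapsed + 5))
done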
