GPU Memory Limits Configuration

This guide documents the per-service GPU memory limits that prevent out-of-memory (OOM) errors and enable stable multi-container GPU sharing.


Overview

GPU memory limits are configured at the Docker/Podman container level using NVIDIA Container Runtime options. Combined with PyTorch memory allocator configuration, this allows multiple containers to safely share a single GPU without causing system crashes.

Key Benefits

  • OOM Prevention: Hard limits prevent one container from consuming all GPU memory
  • Fair Resource Sharing: Multiple models can run concurrently on shared GPUs
  • Predictable Performance: Applications behave consistently under memory pressure
  • Multi-GPU Support: Foundation for distributing workloads across multiple GPUs

Configuration Methods

1. Docker Compose Deploy Options

GPU memory limits are set in docker-compose.prod.yml under deploy.resources.reservations.devices:

services:
  ai-llm:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
              options:
                memory: 24g # GPU memory limit for this container

The options.memory field specifies the maximum GPU memory available to the container.
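On hosts with more than one GPU, a container can also be pinned to a specific device. The following is a minimal sketch using the Compose device_ids field, which replaces count when pinning (shown here for GPU 0, matching the ai-llm assignment in the allocation table below):

services:
  ai-llm:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0'] # pin to GPU 0 instead of any available GPU
              capabilities: [gpu]
              options:
                memory: 24g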

2. PyTorch Memory Allocator Configuration

PyTorch's CUDA allocator is configured via environment variable to reduce fragmentation:

environment:
  - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

This setting:

  • Prevents the allocator from splitting cached blocks larger than 512 MB, keeping large contiguous regions intact
  • Reduces memory fragmentation under concurrent loads
  • Helps prevent "out of memory" errors when models load incrementally
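The variable accepts additional comma-separated options. As a sketch, assuming a recent PyTorch release where the experimental expandable_segments option is available (the entrypoint script name is a placeholder), fragmentation-prone services can combine both knobs:

# One-off test run with a tuned allocator configuration.
# expandable_segments lets the allocator grow existing segments
# instead of splitting fixed-size blocks.
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,expandable_segments:True \
  python serve_model.py # placeholder for the service entrypoint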

Service Memory Allocations

All GPU-enabled services have memory limits configured. Here are the recommended allocations based on model sizes:

Service              Model(s)                                   Memory Limit   GPU   Notes
ai-llm               Nemotron-3-Nano-30B (Q4_K_M)               24G            0     Large LLM requires dedicated high-VRAM GPU
ai-yolo26            YOLO26                                     2G             Any   Object detection, runs frequently
ai-yolo26            YOLO26 TensorRT                            1.5G           1     Optional TensorRT-optimized variant
ai-florence          Florence-2-Large                           2G             1     Vision-language dense captioning
ai-clip              CLIP ViT-L                                 1.5G           1     Entity re-identification embeddings
ai-enrichment-light  Pose, Threat, ReID, Pet, Depth (~1.2GB)    2G             1     Lightweight models, efficient inference
ai-enrichment        Vehicle, Fashion, Demographics (~4.3GB)    3G             1     Large transformer models with lazy loading
backend              FastAPI application                        2G             Any   GPU for inference acceleration
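Note that the fixed allocations on GPU 1 sum to 10G (1.5 + 2 + 1.5 + 2 + 3), so that card needs at least 10G of VRAM plus headroom for any service scheduled with GPU "Any".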

Memory Management Best Practices

1. Monitor GPU Memory Usage

Use nvidia-smi to monitor real-time GPU memory:

# Real-time monitoring
watch -n 1 nvidia-smi

# Single snapshot
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free \
  --format=csv,noheader,nounits
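To attribute usage to individual containers, nvidia-smi can also report memory per CUDA process; the PIDs can then be matched back to services:

# Per-process GPU memory (one row per CUDA process)
nvidia-smi --query-compute-apps=pid,process_name,used_memory \
  --format=csv,noheader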

2. Troubleshooting

Container crashes with "CUDA out of memory": check current usage with nvidia-smi, then raise the service's memory limit in docker-compose.prod.yml and redeploy.
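After raising a limit, only the affected service needs to be recreated; ai-llm below stands in for whichever service hit the OOM:

# Recreate a single service with the updated memory limit
docker compose -f docker-compose.prod.yml up -d ai-llm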