AI Model Zoo Architecture¶
Overview¶
The AI model zoo provides comprehensive visual analysis for home security through multiple specialized models working together. The system uses VRAM-efficient on-demand loading to maximize capability while respecting GPU memory constraints.
The architecture separates models into two categories:
- Always-loaded services - Core detection and analysis models that run continuously
- On-demand models - Specialized enrichment models loaded dynamically based on detection type
Architecture Diagram¶
%%{init: {
'theme': 'dark',
'themeVariables': {
'primaryColor': '#3B82F6',
'primaryTextColor': '#FFFFFF',
'primaryBorderColor': '#60A5FA',
'secondaryColor': '#A855F7',
'tertiaryColor': '#009688',
'background': '#121212',
'mainBkg': '#1a1a2e',
'lineColor': '#666666'
}
}}%%
flowchart TB
subgraph Pipeline["Detection Pipeline"]
CAM[Camera Frame] --> YOLO[YOLO26]
YOLO --> DET[Detections]
DET --> ENR[Enrichment]
ENR --> NEM[Nemotron]
end
subgraph EnrichmentHeavy["Enrichment Heavy (ai-enrichment:8094) - GPU 0"]
subgraph HeavyModels["Heavy Models (~6.0GB budget)"]
VEHICLE[Vehicle Classification]
FASHION[Fashion Analysis]
DEMO[Demographics]
ACTION[Action Recognition]
end
end
subgraph EnrichmentLight["Enrichment Light (ai-enrichment-light:8096) - GPU 1"]
subgraph LightModels["Light Models (~1.2GB)"]
POSE[Pose Estimation]
THREAT[Threat Detection]
REID[Person Re-ID]
PET[Pet Classifier]
DEPTH[Depth Estimation]
end
end
Pipeline --> EnrichmentHeavy
Pipeline --> EnrichmentLight
Service Architecture¶
Core Services (Containerized)¶
| Service | Container | Port | Model | VRAM | Purpose |
|---|---|---|---|---|---|
| YOLO26 | ai-yolo26 | 8095 | YOLO26m (TensorRT FP16) | ~2GB | Primary object detection |
| Nemotron | ai-llm | 8091 | Nemotron-3-Nano-30B-A3B | ~14.7GB | Risk reasoning and analysis |
| Florence-2 | ai-florence | 8092 | Florence-2-Large | ~1.2GB | Scene understanding, OCR |
| CLIP | ai-clip | 8093 | CLIP ViT-L/14 | ~800MB | Embeddings, anomaly detection |
| Enrichment (Heavy) | ai-enrichment | 8094 | Transformers (on-demand) | ~6.0GB | Heavy models (vehicle, fashion, etc.) |
| Enrichment (Light) | ai-enrichment-light | 8096 | Small models (on-demand) | ~1.2GB | Light models (pose, threat, reid) |
Enrichment Service Split¶
The enrichment models are split across two services based on GPU requirements:
| Service | Port | Target GPU | Models |
|---|---|---|---|
| ai-enrichment | 8094 | GPU 0 (24GB) | Vehicle, fashion, age, gender, action |
| ai-enrichment-light | 8096 | GPU 1 (4GB) | Pose, threat, reid, pet, depth |
The backend routes requests to the appropriate service based on ENRICHMENT_*_SERVICE environment variables. See AI Enrichment Light Service for detailed documentation.
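For illustration, this routing might be implemented as a small lookup over those variables. The following is a sketch with hypothetical helper and constant names, not the backend's actual code:

```python
# Sketch only: hypothetical helper showing how ENRICHMENT_*_SERVICE
# variables could select between the heavy and light services.
import os

ENRICHMENT_SERVICES = {
    "heavy": "http://ai-enrichment:8094",
    "light": "http://ai-enrichment-light:8096",
}

def resolve_enrichment_url(model_name: str) -> str:
    """Resolve the target service URL for a model, e.g. 'pose'."""
    # ENRICHMENT_POSE_SERVICE=light sends pose requests to GPU 1
    tier = os.environ.get(f"ENRICHMENT_{model_name.upper()}_SERVICE", "heavy")
    return ENRICHMENT_SERVICES.get(tier, ENRICHMENT_SERVICES["heavy"])
```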
Enrichment Pipeline Split¶
The enrichment router in the backend directs each detection to either the heavy or light service based on the model requirements. Heavy models run on a GPU with larger VRAM, while light models run on a separate GPU with a smaller budget.
flowchart TB
DET[Detection from YOLO26] --> ROUTER{Enrichment Router}
ROUTER -->|Heavy models| HEAVY["Enrichment Heavy<br/>:8094"]
ROUTER -->|Light models| LIGHT["Enrichment Light<br/>:8096"]
subgraph HEAVY_SVC["Heavy Service (full GPU models)"]
FC[FashionCLIP]
VC[Vehicle Classifier]
WX[Weather Classification]
XC[X-CLIP Action Recognition]
end
subgraph LIGHT_SVC["Light Service (minimal GPU)"]
PR[Person Re-ID]
PE[Pose Estimation]
TD[Threat Detection]
end
HEAVY --> HEAVY_SVC
LIGHT --> LIGHT_SVC
Data Flow¶
%%{init: {
'theme': 'dark',
'themeVariables': {
'primaryColor': '#3B82F6',
'primaryTextColor': '#FFFFFF',
'primaryBorderColor': '#60A5FA',
'secondaryColor': '#A855F7',
'tertiaryColor': '#009688',
'background': '#121212',
'mainBkg': '#1a1a2e',
'lineColor': '#666666'
}
}}%%
flowchart TB
CAM[Camera Images]
CAM --> YOLO["YOLO26<br/>(8095)"]
YOLO --> ENR["Enrichment<br/>(8094)"]
YOLO -->|Detections| NEM
ENR -->|Classified Detections| NEM
NEM["Nemotron (8091)<br/>Risk Analysis & Scoring"]
NEM --> EVT[Risk Events]
Model Details¶
Always-Loaded Models¶
YOLO26 (ai-yolo26:8095)¶
- Model: YOLO26m (Ultralytics with TensorRT)
- Architecture: YOLO26 object detection with TensorRT FP16 optimization
- VRAM: ~2GB (with TensorRT FP16 engine)
- Inference Time: 10-20ms (FP16 TensorRT)
- Output: Bounding boxes with class labels and confidence scores
Security Classes (Filtered):
SECURITY_CLASSES = {
"person", "car", "truck", "dog", "cat",
"bird", "bicycle", "motorcycle", "bus"
}
API Endpoints:
- GET /health - Health status
- POST /detect - Single image detection
- POST /detect/batch - Batch detection
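A hedged usage sketch for the detection endpoint; the exact request and response field names are assumptions based on the schemas shown elsewhere on this page:

```python
# Hedged sketch: request/response field names are assumptions based on
# the schemas shown elsewhere on this page.
import base64
import requests

with open("frame.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode()}

resp = requests.post("http://ai-yolo26:8095/detect", json=payload, timeout=10)
for det in resp.json().get("detections", []):
    print(det.get("class_name"), det.get("confidence"), det.get("bbox"))
```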
Florence-2-Large (ai-florence:8092)¶
- Model: microsoft/Florence-2-large
- Architecture: Vision-language transformer
- VRAM: ~1.2GB
- Inference Time: 100-300ms per query
- Tasks: Caption, Dense Region Caption, OCR, Open Vocabulary Detection
Supported Prompts:
| Prompt | Output | Use Case |
|---|---|---|
| `<CAPTION>` | Brief description | Quick scene summary |
| `<DETAILED_CAPTION>` | Detailed paragraph | Event logging |
| `<OD>` | Objects with bboxes | Object localization |
| `<DENSE_REGION_CAPTION>` | Caption per region | Scene understanding |
| `<OCR>` | Detected text | License plates, signs |
| `<OCR_WITH_REGION>` | Text with bboxes | Text localization |
API Endpoints:
- GET /health - Health status
- POST /extract - Generic extraction with prompt
- POST /ocr - Text extraction
- POST /detect - Object detection
- POST /dense-caption - Region captioning
CLIP ViT-L/14 (ai-clip:8093)¶
- Model: openai/clip-vit-large-patch14
- Architecture: Vision-Language contrastive model
- VRAM: ~800MB
- Embedding Dimension: 768
- Output: Normalized image embeddings, similarity scores
Use Cases:
- Entity re-identification across cameras
- Scene anomaly detection via baseline comparison
- Zero-shot classification
API Endpoints:
- GET /health - Health status
- POST /embed - Generate 768-dim embedding
- POST /anomaly-score - Compare to baseline
- POST /classify - Zero-shot classification
- POST /similarity - Image-text similarity
- POST /batch-similarity - Batch similarity comparison
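A hedged sketch of zero-shot classification against the /classify endpoint; the request shape (base64 image plus candidate labels) is an assumption:

```python
# Hedged sketch: the /classify request shape is an assumption.
import base64
import requests

with open("crop.jpg", "rb") as f:
    payload = {
        "image": base64.b64encode(f.read()).decode(),
        "labels": ["delivery worker", "mail carrier", "unknown visitor"],
    }

resp = requests.post("http://ai-clip:8093/classify", json=payload, timeout=10)
print(resp.json())  # expected: a score per candidate label
```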
Nemotron (ai-llm:8091)¶
Production Model:
- Model: nvidia/Nemotron-3-Nano-30B-A3B-GGUF (Q4_K_M quantization)
- Architecture: Large Language Model via llama.cpp with Mixture-of-Experts (MoE) routing
- Parameters: 30 billion (A3B active routing variant)
- VRAM: ~14.7GB
- Context Window: 131,072 tokens (128K)
- Purpose: Risk reasoning, threat analysis, natural language generation
Development Model (resource-constrained environments):
- Model: bartowski/nemotron-mini-4b-instruct-GGUF (Q4_K_M quantization)
- Parameters: 4 billion
- VRAM: ~3GB
- Context Window: 4,096 tokens
Risk Score Ranges:
On-Demand Models (Enrichment Services)¶
The enrichment models are split across two services:
- ai-enrichment (port 8094): Heavy transformer models running on GPU 0 (~6.0GB budget)
- ai-enrichment-light (port 8096): Small efficient models running on GPU 1 (~1.2GB budget)
The backend routes requests to the appropriate service based on the ENRICHMENT_*_SERVICE environment variables (e.g., ENRICHMENT_POSE_SERVICE=light).
Threat Detection (CRITICAL Priority)¶
- Model: Subh775/Threat-Detection-YOLOv8n
- VRAM: ~300MB (per model_zoo.py)
- Purpose: Detect weapons (knives, guns, etc.)
- Trigger: All person detections in security-sensitive contexts
- Eviction: Never evicted if possible
Input: Cropped person image
Output:
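An illustrative output consistent with the ThreatDetectionResult schema documented under Enrichment Result Schema:

```json
{
  "threat_type": "none",
  "confidence": 0.0,
  "severity": "low",
  "bbox": null
}
```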
Pose Estimation (HIGH Priority)¶
- Model: ViTPose+ Small (primary, ~1.5GB), YOLOv8n-pose (fallback, ~200MB)
- VRAM: ~1.5GB (ViTPose) or ~200MB (YOLOv8n-pose)
- Purpose: Body pose detection, posture classification
- Trigger: Person detections
- Output: 17 COCO keypoints per person
COCO Keypoints:
COCO_KEYPOINT_NAMES = [
"nose", "left_eye", "right_eye", "left_ear", "right_ear",
"left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
"left_wrist", "right_wrist", "left_hip", "right_hip",
"left_knee", "right_knee", "left_ankle", "right_ankle"
]
Posture Classifications: standing, walking, running, sitting, crouching, lying_down
Security Alerts:
- crouching - Potential hiding/break-in behavior
- lying_down - Possible medical emergency
- hands_raised - Potential surrender/robbery scenario
- fighting_stance - Aggressive posture
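For intuition, a deliberately simplified heuristic for mapping keypoints to a posture label might look like the sketch below; the service's actual classifier is internal and more robust:

```python
# Deliberately simplified, illustrative heuristic; not the service's
# actual classifier. Keypoints follow the COCO order listed above.
def classify_posture(keypoints: list[tuple[float, float, float]]) -> str:
    """keypoints: 17 (x, y, score) tuples in COCO order."""
    visible = [(x, y) for x, y, score in keypoints if score > 0.3]
    if len(visible) < 4:
        return "unknown"
    xs, ys = zip(*visible)
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    if width > height:
        return "lying_down"  # body extent is mostly horizontal
    hip_y = (keypoints[11][1] + keypoints[12][1]) / 2
    ankle_y = (keypoints[15][1] + keypoints[16][1]) / 2
    if (ankle_y - hip_y) < 0.35 * height:
        return "crouching"  # hips have dropped toward the ankles
    return "standing"
```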
Demographics (HIGH Priority)¶
- Model: ViT Age Classifier (~200MB) + ViT Gender Classifier (~200MB)
- VRAM: ~400MB total (200MB each, loaded separately)
- Purpose: Age range and gender estimation from face crops
- Trigger: Person detections with visible face
Input: Cropped face image
Output:
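An illustrative output combining the age and gender fields documented under Enrichment Result Schema (grouping them into one object here is an assumption):

```json
{
  "age": {"age_range": "25-35", "confidence": 0.82, "raw_prediction": 28.5},
  "gender": {"gender": "male", "confidence": 0.94}
}
```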
Clothing Analysis (HIGH Priority)¶
- Model: Marqo/marqo-fashionSigLIP (FashionSigLIP), which replaced FashionCLIP for a 57% accuracy improvement
- VRAM: ~800MB
- Purpose: Identify clothing types, uniforms, suspicious attire
- Trigger: Person detections
Security-Focused Prompts:
SECURITY_CLOTHING_PROMPTS = [
"person wearing dark hoodie",
"person wearing face mask",
"person wearing ski mask or balaclava",
"delivery uniform", "Amazon delivery vest",
"FedEx uniform", "UPS uniform", "USPS postal worker uniform",
"casual clothing", "business attire or suit", ...
]
Output:
{
"clothing_type": "hoodie",
"color": "dark",
"style": "suspicious",
"confidence": 0.85,
"is_suspicious": true,
"is_service_uniform": false
}
Vehicle Classification (MEDIUM Priority)¶
- Model: lxyuan/vit-base-patch16-224-vehicle-segment-classification
- VRAM: ~1.5GB
- Purpose: Vehicle make, model, type identification
- Trigger: Vehicle detections (car, truck, bus, motorcycle, bicycle)
Vehicle Classes:
VEHICLE_SEGMENT_CLASSES = [
"articulated_truck", "background", "bicycle", "bus", "car",
"motorcycle", "non_motorized_vehicle", "pedestrian",
"pickup_truck", "single_unit_truck", "work_van"
]
Output:
{
"vehicle_type": "pickup_truck",
"display_name": "pickup truck",
"confidence": 0.92,
"is_commercial": false
}
Pet Classification (MEDIUM Priority)¶
- Model: microsoft/resnet-18
- VRAM: ~200MB
- Purpose: Cat/dog classification and breed identification
- Trigger: Animal detections (dog, cat, bird)
Output:
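An illustrative output sketch; these field names are assumptions rather than the service's confirmed schema:

```json
{
  "pet_type": "dog",
  "breed": "golden_retriever",
  "confidence": 0.78
}
```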
Person Re-ID (MEDIUM Priority)¶
- Model: OSNet-x0.25
- VRAM: ~100MB
- Purpose: Generate 512-dimensional embeddings for tracking individuals
- Trigger: Person detections requiring cross-camera tracking
Output:
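An illustrative output matching the PersonEmbeddingResult schema documented under Enrichment Result Schema:

```json
{
  "embedding": [0.123, -0.456, ...],
  "embedding_dimension": 512,
  "model": "osnet_x0_25"
}
```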
Vehicle Damage Detection (MEDIUM Priority)¶
- Model: harpreetsahota/car-dd-segmentation-yolov11 (YOLOv11x-seg)
- VRAM: ~2GB
- Purpose: Detect and segment vehicle damage for security analysis
- Trigger: Vehicle detections (car, truck, bus, motorcycle)
- License: AGPL-3.0
Damage Classes:
DAMAGE_CLASSES = [
"crack", # Surface cracks in paint/body
"dent", # Impact dents on body panels
"glass_shatter", # Broken/shattered glass (HIGH SECURITY)
"lamp_broken", # Damaged headlights/taillights (HIGH SECURITY)
"scratch", # Surface scratches on paint
"tire_flat", # Flat or damaged tires
]
High-Security Damage Types:
| Damage Type | Security Implication |
|---|---|
| glass_shatter | Possible break-in attempt, vandalism, or collision |
| lamp_broken | Possible vandalism, hit-and-run, or deliberate damage |
Input: Cropped vehicle image from YOLO26 detection
Output:
{
"detections": [
{
"damage_type": "glass_shatter",
"confidence": 0.87,
"bbox": {
"x1": 120.5,
"y1": 80.2,
"x2": 180.3,
"y2": 150.8
},
"has_mask": true,
"mask_area": 2450
}
],
"damage_types": ["glass_shatter", "dent"],
"has_high_security_damage": true,
"total_damage_count": 2,
"highest_confidence": 0.87
}
Security Alert Pattern Detection:
The model includes pattern analysis for identifying suspicious damage:
# Suspicious patterns that trigger elevated alerts
- glass_shatter + lamp_broken = "break-in pattern"
- Night time (22:00-06:00) + any damage = "elevated risk"
- Multiple damage types (>=2) = "likely incident"
- High-security damage at night = "critical alert"
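Expressed as code, these rules might look like the following sketch; the function and field names are illustrative, and the checks are ordered by severity:

```python
# Illustrative sketch of the four patterns above as a single check;
# ordering reflects severity, and field names mirror the output schema.
from datetime import datetime

HIGH_SECURITY = {"glass_shatter", "lamp_broken"}

def assess_damage_pattern(damage_types: list[str], when: datetime) -> str:
    types = set(damage_types)
    is_night = when.hour >= 22 or when.hour < 6
    if is_night and types & HIGH_SECURITY:
        return "critical alert"
    if HIGH_SECURITY <= types:
        return "break-in pattern"
    if len(types) >= 2:
        return "likely incident"
    if is_night and types:
        return "elevated risk"
    return "no pattern"
```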
Context String for LLM:
Vehicle Damage Detected (2 instances):
- glass_shatter: 1 instance(s) (avg conf: 87%)
- dent: 1 instance(s) (avg conf: 72%)
**HIGH SECURITY ALERT**: Suspicious damage types detected
Source Files:
- Loader: backend/services/vehicle_damage_loader.py
- Model Zoo Entry: backend/services/model_zoo.py
Depth Estimation (LOW Priority)¶
- Model: depth-anything/Depth-Anything-V2-Small-hf
- VRAM: ~150MB
- Purpose: Spatial reasoning, distance estimation
- Trigger: Detections requiring distance context
Output:
{
"depth_map_base64": "<base64-png>",
"estimated_distance_m": 3.5,
"relative_depth": 0.35,
"proximity_label": "close"
}
Action Recognition (LOW Priority)¶
- Model: microsoft/xclip-base-patch32 (X-CLIP)
- VRAM: ~2GB (with float16)
- Purpose: Video-based action classification
- Trigger: Person detected for >3 seconds with multiple frames available, or an unusual pose detected
Input: Multiple frames (video clip)
Output:
{
"action": "running",
"confidence": 0.87,
"action_scores": {
"running": 0.87,
"walking": 0.08,
"fighting": 0.03,
"falling": 0.02
}
}
VRAM Management¶
On-Demand Loading Architecture¶
The OnDemandModelManager class manages GPU memory efficiently:
from collections import OrderedDict

class OnDemandModelManager:
    def __init__(self, vram_budget_gb: float = 6.0):
        self.vram_budget = vram_budget_gb * 1024  # budget tracked in MB
        self.loaded_models = OrderedDict()  # insertion order doubles as LRU tracking
        self.model_registry = {}  # model name -> ModelConfig
Loading Algorithm¶
1. Request comes in requiring a model
2. If the model is loaded, use it and update its last-used timestamp
3. If not loaded, check the VRAM budget
4. If over budget, evict LRU models (respecting priority)
5. Load the requested model
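A condensed sketch of this algorithm as a continuation of the class above; the _vram_used, _pick_eviction_victim, and _unload helpers are hypothetical names:

```python
# Condensed, illustrative continuation of the class above; _vram_used,
# _pick_eviction_victim, and _unload are hypothetical helper names.
class OnDemandModelManager:
    ...

    def get_model(self, name: str):
        if name in self.loaded_models:
            self.loaded_models.move_to_end(name)  # refresh LRU position
            return self.loaded_models[name]
        config = self.model_registry[name]
        # Evict until the requested model fits within the VRAM budget
        while self._vram_used() + config.vram_mb > self.vram_budget:
            self._unload(self._pick_eviction_victim())
        self.loaded_models[name] = config.loader_fn()
        return self.loaded_models[name]
```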
Priority System¶
| Priority | Value | Models | Eviction Behavior |
|---|---|---|---|
| CRITICAL | 0 | Threat Detection | Never evicted unless absolutely necessary |
| HIGH | 1 | Pose, Demographics, Clothing | Evicted only when VRAM critical |
| MEDIUM | 2 | Vehicle, Pet, Re-ID | Standard LRU eviction |
| LOW | 3 | Depth, Action | Evicted first |
VRAM Allocation and Eviction Overview¶
The following diagram shows the two-tier VRAM architecture: always-loaded core models consume a fixed budget, while on-demand models share a dynamic budget and are evicted in LRU order respecting priority levels.
flowchart TB
subgraph Always["Always Loaded (~4GB)"]
Y["YOLO26<br/>~2GB"]
F["Florence-2<br/>~1.2GB"]
C["CLIP<br/>~800MB"]
end
subgraph Budget["On-Demand Budget (~6GB)"]
CRIT["CRITICAL<br/>Threat Detection"]
HIGH["HIGH<br/>Pose, Demographics"]
MED["MEDIUM<br/>Clothing, Vehicle, Re-ID"]
LOW["LOW<br/>Depth, Action Recognition"]
end
subgraph Eviction["LRU Eviction Order"]
LOW -.->|Evicted first| FREE[Free VRAM]
MED -.->|Evicted second| FREE
HIGH -.->|Evicted when necessary| FREE
CRIT -.->|Never evicted if possible| FREE
end
Eviction Strategy¶
Models are sorted by (priority descending, last_used ascending):
- Higher priority numbers (LOW=3) evicted before lower (CRITICAL=0)
- Among same priority, oldest (least recently used) evicted first
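As a Python sort key, this ordering might look like the sketch below; the priority.value and last_used attribute names are assumptions:

```python
# Sketch of the eviction ordering; priority.value and last_used are
# assumed attribute names. LOW (3) sorts before CRITICAL (0), and the
# oldest last_used sorts first within the same priority.
eviction_order = sorted(
    manager.loaded_models.values(),
    key=lambda m: (-m.priority.value, m.last_used),
)
```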
VRAM Budget Calculation¶
Based on actual values from backend/services/model_zoo.py:
| Category | Budget | Models |
|---|---|---|
| Always Loaded | ~21.7GB | Nemotron LLM (production) |
| Primary Detection | ~2GB | YOLO26 (TensorRT) |
| Model Zoo Budget | ~1.65GB | On-demand models (loaded sequentially, not concurrently) |
| Total (Production) | ~25GB+ | Full stack with all models |
On-Demand Model VRAM (from model_zoo.py):
| Model | VRAM |
|---|---|
| yolo11-license-plate | 300MB |
| yolo11-face | 200MB |
| paddleocr | 100MB |
| clip-vit-l | 800MB |
| florence-2-large | 1.2GB |
| yolo-world-s | 1.5GB |
| vitpose-small | 1.5GB |
| depth-anything-v2-small | 150MB |
| violence-detection | 500MB |
| weather-classification | 200MB |
| segformer-b2-clothes | 1.5GB |
| xclip-base | 2GB |
| fashion-clip (FashionSigLIP) | 800MB |
| brisque-quality | 0MB (CPU) |
| vehicle-segment-classification | 1.5GB |
| vehicle-damage-detection | 2GB |
| pet-classifier | 200MB |
| osnet-x0-25 | 100MB |
| threat-detection-yolov8n | 300MB |
| vit-age-classifier | 200MB |
| vit-gender-classifier | 200MB |
| yolov8n-pose | 200MB |
API Reference¶
Unified Enrichment Endpoint¶
Request (POST /enrich):
{
"image": "<base64>",
"detection_type": "person|vehicle|animal|object",
"bbox": {"x1": 0, "y1": 0, "x2": 100, "y2": 100},
"frames": ["<base64>", ...],
"options": {
"action_recognition": true,
"include_depth": true
}
}
Response:
{
"pose": {
"keypoints": [...],
"posture": "standing",
"alerts": []
},
"clothing": {
"type": "hoodie",
"color": "dark",
"is_suspicious": true
},
"demographics": {
"age_range": "25-35",
"gender": "male"
},
"threat": {
"detected": false
},
"reid_embedding": [...],
"vehicle": null,
"action": {
"action": "walking",
"confidence": 0.92
},
"depth": {
"estimated_distance_m": 3.5
},
"inference_time_ms": 245.6
}
Model Status Endpoint¶
Response:
{
"vram_budget_mb": 6963.2,
"vram_used_mb": 2300,
"vram_available_mb": 4663.2,
"vram_utilization_percent": 33.0,
"loaded_models": [
{
"name": "fashion_clip",
"vram_mb": 800,
"priority": "HIGH",
"last_used": "2024-01-15T10:30:00Z"
}
],
"registered_models": [
{
"name": "vehicle_classifier",
"vram_mb": 1500,
"priority": "MEDIUM",
"loaded": false
}
],
"pending_loads": []
}
Model Preload Endpoint¶
Response:
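The preload response is not documented here; the example below is an assumption modeled on the model status endpoint's field naming:

```json
{
  "requested": ["fashion_clip"],
  "loaded": ["fashion_clip"],
  "already_loaded": [],
  "vram_used_mb": 2300
}
```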
Individual Classification Endpoints¶
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Service health status |
| /vehicle-classify | POST | Vehicle type classification |
| /pet-classify | POST | Pet type classification |
| /clothing-classify | POST | Clothing analysis |
| /depth-estimate | POST | Full depth map |
| /object-distance | POST | Object distance estimation |
| /pose-analyze | POST | Human pose keypoints |
Enrichment Result Schema¶
The enrichment pipeline returns structured results for each detection. These results are stored alongside detections and used by Nemotron for comprehensive risk analysis.
EnrichmentResult Structure¶
The EnrichmentResult object aggregates all enrichment outputs for a batch of detections:
class EnrichmentResult:
threat_detection: ThreatDetectionResult | None
age_classifications: dict[str, AgeClassificationResult] # detection_id -> result
gender_classifications: dict[str, GenderClassificationResult] # detection_id -> result
person_embeddings: dict[str, PersonEmbeddingResult] # detection_id -> result
ThreatDetectionResult¶
Weapon and threat detection results from the YOLO threat detector model.
| Field | Type | Description |
|---|---|---|
| threat_type | string | Type of threat detected: knife, gun, weapon, none |
| confidence | float | Detection confidence score (0.0-1.0) |
| severity | string | Threat severity: low, medium, high, critical |
| bbox | array | Bounding box [x1, y1, x2, y2] of detected threat |
AgeClassificationResult¶
Age range estimation from ViT age classifier.
| Field | Type | Description |
|---|---|---|
| age_range | string | Estimated age range bracket |
| confidence | float | Classification confidence (0.0-1.0) |
| raw_prediction | float | Raw model prediction (estimated age in years) |
Age Range Brackets:
- 0-12 (child)
- 13-17 (teenager)
- 18-24 (young adult)
- 25-35 (adult)
- 36-50 (middle-aged)
- 51-65 (mature adult)
- 65+ (senior)
GenderClassificationResult¶
Gender classification from ViT gender classifier.
| Field | Type | Description |
|---|---|---|
| gender | string | Predicted gender: male, female, unknown |
| confidence | float | Classification confidence (0.0-1.0) |
PersonEmbeddingResult¶
512-dimensional embedding vector from OSNet for person re-identification across cameras.
| Field | Type | Description |
|---|---|---|
| embedding | array | 512-dimensional float vector (L2-normalized) |
| embedding_dimension | int | Embedding vector length (always 512) |
| model | string | Model used for embedding generation |
Use Cases:
- Cross-camera person tracking
- Person re-identification over time
- Similarity search for matching individuals
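Because the embeddings are L2-normalized, matching two sightings reduces to a dot product. A minimal sketch, where the 0.7 threshold is an illustrative assumption:

```python
# Minimal sketch: with L2-normalized vectors, cosine similarity is a
# dot product. The 0.7 threshold is an illustrative assumption.
import numpy as np

def same_person(emb_a: list[float], emb_b: list[float],
                threshold: float = 0.7) -> bool:
    similarity = float(np.dot(np.asarray(emb_a), np.asarray(emb_b)))
    return similarity >= threshold
```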
Complete Enrichment Response Example¶
Full response from the /enrich endpoint for a person detection:
{
"threat_detection": {
"threat_type": "none",
"confidence": 0.0,
"severity": "low",
"bbox": null
},
"age_classifications": {
"det_abc123": {
"age_range": "25-35",
"confidence": 0.82,
"raw_prediction": 28.5
}
},
"gender_classifications": {
"det_abc123": {
"gender": "male",
"confidence": 0.94
}
},
"person_embeddings": {
"det_abc123": {
"embedding": [0.123, -0.456, ...],
"embedding_dimension": 512,
"model": "osnet_x0_25"
}
},
"pose": {
"keypoints": [...],
"posture": "standing",
"alerts": []
},
"clothing": {
"type": "casual",
"color": "blue",
"is_suspicious": false,
"is_service_uniform": false
},
"inference_time_ms": 312.5
}
Integration with Risk Analysis¶
The enrichment results feed directly into Nemotron's risk analysis prompt:
- Threat Detection: Weapons trigger immediate critical risk elevation
- Demographics: Age/gender provide context for behavior analysis
- Person Embeddings: Enable tracking individuals across multiple cameras
- Pose/Clothing: Inform behavioral assessment (suspicious posture, face coverings)
The backend stores enrichment results in the detection record and passes the complete context to Nemotron for comprehensive risk scoring.
Environment Variables¶
YOLO26¶
| Variable | Default | Description |
|---|---|---|
| YOLO26_MODEL_PATH | /models/yolo26/exports/yolo26m_fp16.engine | TensorRT engine path |
| YOLO26_CONFIDENCE | 0.5 | Min confidence threshold |
| PORT | 8095 | Server port |
Nemotron¶
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | /models/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf | GGUF model path |
| PORT | 8091 | Server port |
| GPU_LAYERS | 35 | Layers on GPU |
| CTX_SIZE | 131072 | Context window |
Enrichment¶
| Variable | Default | Description |
|---|---|---|
| VEHICLE_MODEL_PATH | /models/vehicle-segment-classification | Vehicle classifier |
| PET_MODEL_PATH | /models/pet-classifier | Pet classifier |
| CLOTHING_MODEL_PATH | /models/fashion-clip | FashionCLIP |
| DEPTH_MODEL_PATH | /models/depth-anything-v2-small | Depth estimator |
| VITPOSE_MODEL_PATH | /models/vitpose-plus-small | ViTPose+ |
| ACTION_MODEL_PATH | microsoft/xclip-base-patch32 | X-CLIP |
| VRAM_BUDGET_GB | 6.0 | On-demand VRAM budget |
| PORT | 8094 | Server port |
Adding New Models¶
Step 1: Create Model Wrapper¶
Create a new file in ai/enrichment/models/:
# ai/enrichment/models/new_model.py
from typing import Any
import torch
from PIL import Image
class NewModelClassifier:
def __init__(self, model_path: str, device: str = "cuda:0"):
self.model_path = model_path
self.device = device
self.model = None
self.processor = None
def load_model(self) -> None:
"""Load model and processor."""
from transformers import AutoModelForXxx, AutoProcessor
self.processor = AutoProcessor.from_pretrained(self.model_path)
self.model = AutoModelForXxx.from_pretrained(self.model_path)
if "cuda" in self.device and torch.cuda.is_available():
self.model = self.model.to(self.device)
self.model.eval()
def predict(self, image: Image.Image) -> dict:
"""Run inference."""
inputs = self.processor(images=image, return_tensors="pt")
if "cuda" in self.device:
inputs = {k: v.to(self.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = self.model(**inputs)
return self._postprocess(outputs)
def _postprocess(self, outputs) -> dict:
"""Convert model outputs to API response."""
return {"result": "..."}
Step 2: Add to Model Registry¶
Update ai/enrichment/model_registry.py:
def create_model_registry(device: str = "cuda:0") -> dict[str, ModelConfig]:
registry = {}
# ... existing models ...
# New Model (~XGB)
new_model_path = os.environ.get("NEW_MODEL_PATH", "/models/new-model")
registry["new_model"] = ModelConfig(
name="new_model",
vram_mb=1000, # Estimated VRAM usage
priority=ModelPriority.MEDIUM, # Choose appropriate priority
loader_fn=lambda: _create_and_load_model(
NewModelClassifier, new_model_path, device
),
unloader_fn=_unload_model,
)
return registry
Step 3: Update Detection Type Mapping¶
Add trigger conditions in model_registry.py:
def get_models_for_detection_type(detection_type: str, ...) -> list[str]:
detection_model_mapping = {
# ... existing mappings ...
"new_type": ["new_model", "depth_estimator"],
}
Step 4: Add API Endpoint (Optional)¶
Update ai/enrichment/model.py if direct endpoint access is needed:
import base64
import io

from PIL import Image

@app.post("/new-classify")
async def new_classify(request: ImageRequest) -> NewClassifyResponse:
    model = await model_manager.get_model("new_model")
    # Decode the base64 payload into a PIL image before inference
    image = Image.open(io.BytesIO(base64.b64decode(request.image)))
    result = model.predict(image)
    return NewClassifyResponse(**result)
Step 5: Update Docker Compose¶
Add model volume mount in docker-compose.prod.yml:
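For example, a mount might look like the sketch below; the service name and host path are assumptions, and NEW_MODEL_PATH matches the registry entry from Step 2:

```yaml
# Illustrative sketch; service name and host path are assumptions.
services:
  ai-enrichment:
    volumes:
      - ./models/new-model:/models/new-model:ro
    environment:
      - NEW_MODEL_PATH=/models/new-model
```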
Step 6: Update Documentation¶
- Add model details to this file (docs/ai/model-zoo.md)
- Update ai/enrichment/AGENTS.md with endpoint documentation
- Add HuggingFace link to the model links table
Hardware Requirements¶
- GPU: NVIDIA with CUDA support (tested on RTX A5500 24GB)
- Minimum VRAM: 12GB for basic operation
- Recommended VRAM: 24GB for all models simultaneously
- Container Runtime: Docker or Podman with NVIDIA Container Toolkit
Related Documentation¶
- AI Pipeline AGENTS.md - Service overview
- Enrichment Service AGENTS.md - Detailed endpoint docs
- YOLO26 AGENTS.md - Detection service
- Florence-2 AGENTS.md - Vision-language service
- CLIP AGENTS.md - Embedding service