AI Services Management¶

Start, stop, verify, and monitor AI inference services.

Time to read: ~8 min

Prerequisites: AI Configuration

Starting Services¶

[!IMPORTANT] This doc covers host-run AI (useful for development) and containerized AI (recommended for production). In production, docker-compose.prod.yml defines all AI services (8091–8096).

To decide which URL to use (container DNS, host, or remote), start with Deployment Modes & AI Networking.

Unified Startup (Recommended)¶

Use the unified startup script to manage the core host-run AI services (YOLO26 + Nemotron):

./scripts/start-ai.sh start

Expected output:

==========================================
Starting AI Services
==========================================

[INFO] Checking prerequisites...
[OK] NVIDIA GPU detected: NVIDIA RTX A5500
[OK] CUDA available
[OK] llama-server found: /usr/bin/llama-server
[OK] Python found: /usr/bin/python3
[OK] Nemotron model found (2.5G)
[WARN] YOLO26 model not found (will auto-download)
[OK] All prerequisites satisfied

[INFO] Starting YOLO26 detection server...
[OK] YOLO26 detection server started successfully
  Port: 8095
  PID: 12345
  Log: /tmp/yolo26-detector.log
  Expected VRAM: ~4GB

[INFO] Starting Nemotron LLM server...
[OK] Nemotron LLM server started successfully
  Port: 8091
  PID: 12346
  Log: /tmp/nemotron-llm.log
  Expected VRAM: ~3GB

First startup takes longer (~2-3 minutes) due to:

Model loading into VRAM
CUDA initialization
GPU warmup inferences
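Because the first startup can take a few minutes, anything that depends on the AI services should wait for them rather than assume they are up. The following is a minimal sketch of such a wait loop. The function name `wait_for_health` and its default timeout and interval are illustrative assumptions, not part of `start-ai.sh`.

```shell
# Hypothetical helper: poll a health endpoint until it answers or a deadline passes.
# Usage: wait_for_health URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_health() {
  local url="$1"
  local timeout="${2:-180}"   # assumed default, sized for the ~2-3 minute first start
  local interval="${3:-5}"
  local deadline
  deadline=$(( $(date +%s) + timeout ))

  while [ "$(date +%s)" -lt "$deadline" ]; do
    # -f: treat HTTP status >= 400 as a failure; -sS: quiet, but keep error text
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      echo "ready: $url"
      return 0
    fi
    sleep "$interval"
  done

  echo "timed out after ${timeout}s: $url" >&2
  return 1
}
```

For example, `wait_for_health http://localhost:8095/health && wait_for_health http://localhost:8091/health` blocks until both core services respond, or returns non-zero once the timeout is reached.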

Individual Service Startup¶

Start services separately for debugging:

# YOLO26 detection server
./ai/start_detector.sh

# Nemotron LLM server (in separate terminal)
./ai/start_llm.sh

Production (containerized AI services)¶

Start the full stack (including Florence/CLIP/Enrichment):

docker compose -f docker-compose.prod.yml up -d

Start only the core services:

docker compose -f docker-compose.prod.yml up -d postgres redis backend frontend ai-yolo26 ai-llm

Start only AI services (all 5):

docker compose -f docker-compose.prod.yml up -d ai-yolo26 ai-llm ai-florence ai-clip ai-enrichment

Stop:

docker compose -f docker-compose.prod.yml down

Service Management¶

Check Status¶

./scripts/start-ai.sh status

Output shows:

Service status (RUNNING/STOPPED)
Process IDs
Health check results
GPU memory usage
Log file locations

Stop Services¶

./scripts/start-ai.sh stop

Performs a graceful shutdown with a 10-second timeout, then force-kills any service that has not exited.
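The stop behavior follows the common "SIGTERM, wait, then SIGKILL" pattern. The sketch below illustrates that pattern for a single PID. It is not the code from `start-ai.sh`, and the function name `stop_gracefully` is a hypothetical choice.

```shell
# Hypothetical helper: ask a process to exit, then force-kill it after a timeout.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
  local pid="$1"
  local timeout="${2:-10}"
  local waited=0

  # kill -0 sends no signal; it only tests whether the process exists.
  kill -0 "$pid" 2>/dev/null || return 0

  kill -TERM "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "PID $pid still running after ${timeout}s, sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
```

The PIDs to pass in are the ones printed at startup, or shown by `./scripts/start-ai.sh status`.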

Restart Services¶

./scripts/start-ai.sh restart

Useful after:

Model updates
Configuration changes
Service crashes

Health Check¶

./scripts/start-ai.sh health

Returns:

HTTP health check results
Service response times
Detailed status JSON

Verification¶

Test YOLO26 Detection¶

# Health check
curl http://localhost:8095/health

Expected response:

{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda:0",
  "cuda_available": true,
  "vram_used_gb": 4.2
}
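In a script, the expected response above can be checked instead of read by eye. The sketch below does a rough text match on the two fields that matter most for readiness. It assumes the field names shown in the expected response and is not a real JSON parser. The function name `health_ok` is hypothetical.

```shell
# Hypothetical helper: read a /health JSON body on stdin and succeed only if it
# reports status "healthy" and model_loaded true. Text match, not a JSON parser.
health_ok() {
  local body
  body=$(cat)
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' &&
    printf '%s' "$body" | grep -Eq '"model_loaded"[[:space:]]*:[[:space:]]*true'
}
```

A possible use is `curl -fsS http://localhost:8095/health | health_ok && echo "YOLO26 ready"`. If a JSON tool such as `jq` is available, a proper parse is more robust than this pattern match.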

Test detection (requires test image):

cd ai/yolo26
python example_client.py path/to/test/image.jpg

Test Nemotron LLM¶

# Health check
curl http://localhost:8091/health

# Test completion
curl -X POST http://localhost:8091/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Analyze: A person detected at front door at 14:30.",
    "temperature": 0.7,
    "max_tokens": 200
  }'

Test Florence / CLIP / Enrichment (production)¶

curl http://localhost:8092/health  # Florence-2
curl http://localhost:8093/health  # CLIP
curl http://localhost:8094/health  # Enrichment
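When the full containerized stack is running, every health endpoint can be probed in one pass. The sketch below uses the port assignments from the commands in this section (8091 Nemotron, 8092 Florence-2, 8093 CLIP, 8094 Enrichment, 8095 YOLO26). The function name `check_all_health` and its output format are illustrative assumptions.

```shell
# Hypothetical helper: probe the /health endpoint of each AI service.
# Usage: check_all_health [HOST]
check_all_health() {
  local host="${1:-localhost}"
  local failed=0
  local entry name port

  for entry in nemotron:8091 florence:8092 clip:8093 enrichment:8094 yolo26:8095; do
    name="${entry%%:*}"   # text before the colon
    port="${entry##*:}"   # text after the colon
    if curl -fsS --max-time 5 "http://${host}:${port}/health" > /dev/null 2>&1; then
      echo "[OK]   ${name} (${port})"
    else
      echo "[FAIL] ${name} (${port})"
      failed=$(( failed + 1 ))
    fi
  done

  # Non-zero exit status if any service failed, so the check can gate a later step.
  [ "$failed" -eq 0 ]
}
```

When only the core host-run services are started, the Florence-2, CLIP, and Enrichment lines are expected to report `[FAIL]`.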

Integration Test¶

Test the full pipeline from the backend:

cd backend

# Run integration tests
pytest tests/integration/ -v -k "test_ai_pipeline"

# Run full test suite
pytest tests/ -v

Service Logs¶

View Logs¶

# YOLO26 logs
tail -f /tmp/yolo26-detector.log

# Nemotron LLM logs
tail -f /tmp/nemotron-llm.log

# Both logs at once (tail labels each chunk with its file name)
tail -f /tmp/yolo26-detector.log /tmp/nemotron-llm.log

Backend Integration Logging¶

The backend automatically monitors AI service health:

# Check backend logs for AI service status
cd backend
tail -f logs/app.log | grep -E "yolo26|nemotron"

Backend logs include:

Connection failures
Timeout errors
Health check results
Inference latencies

Production Deployment¶

Systemd Service Units¶

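Create systemd units for production. The heredocs below use an unquoted EOF delimiter, so your shell expands $(whoami) and ${PROJECT_ROOT} when it writes each file. Export PROJECT_ROOT as the absolute path to the project checkout before running them, and confirm that the binary paths match your system.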

YOLO26 Service:

sudo tee /etc/systemd/system/ai-yolo26.service > /dev/null << EOF
[Unit]
Description=YOLO26 Object Detection Service
After=network.target

[Service]
Type=simple
User=$(whoami)
WorkingDirectory=${PROJECT_ROOT}/ai/yolo26
ExecStart=/usr/bin/python3 model.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Nemotron LLM Service:

sudo tee /etc/systemd/system/ai-llm.service > /dev/null << EOF
[Unit]
Description=Nemotron LLM Service
After=network.target

[Service]
Type=simple
User=$(whoami)
ExecStart=/usr/bin/llama-server --model ${PROJECT_ROOT}/ai/nemotron/nemotron-mini-4b-instruct-q4_k_m.gguf --port 8091 --ctx-size 4096 --n-gpu-layers 99 --host 0.0.0.0 --parallel 2 --cont-batching
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable ai-yolo26 ai-llm
sudo systemctl start ai-yolo26 ai-llm
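After enabling the units, it is worth confirming that they actually stayed up, since `Restart=always` can hide a crash loop. The sketch below wraps `systemctl is-active` for a list of units. The function name `verify_units` is hypothetical, and the `journalctl` hint assumes the services log to the journal (the systemd default for stdout and stderr).

```shell
# Hypothetical helper: report whether each given systemd unit is active.
# Usage: verify_units UNIT...
verify_units() {
  local unit
  local rc=0

  for unit in "$@"; do
    # is-active --quiet prints nothing and signals the state via exit status.
    if systemctl is-active --quiet "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: not active (inspect with: journalctl -u $unit -n 50 --no-pager)"
      rc=1
    fi
  done
  return "$rc"
}
```

For the two units defined above, the call would be `verify_units ai-yolo26 ai-llm`.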

Auto-start on Boot¶

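If you are not using the systemd units above, a cron @reboot entry can start the services instead. Use one mechanism, not both, so the services are not started twice on the same ports. Add to crontab: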

crontab -e

# Add line:
@reboot /path/to/project/scripts/start-ai.sh start

Quick Reference¶

Common Commands¶

# Start all AI services
./scripts/start-ai.sh start

# Stop all AI services
./scripts/start-ai.sh stop

# Check status
./scripts/start-ai.sh status

# Health check
./scripts/start-ai.sh health

# Check GPU
nvidia-smi

# Download models
./ai/download_models.sh

Figure: Enrichment pipeline. Detection data flows through zone analysis, baseline comparison, and cross-camera correlation.

Service Endpoints¶

Service	Endpoint	Purpose
YOLO26	GET /health	Health check
YOLO26	POST /detect	Object detection
YOLO26	POST /detect/batch	Batch detection
Nemotron	GET /health	Health check
Nemotron	POST /completion	Text completion
Nemotron	POST /v1/chat/completions	Chat API

Expected Resource Usage¶

Service	VRAM	CPU	Latency	Throughput
YOLO26	~4GB	10-20%	30-50ms	20-30 img/s
Nemotron	~3GB	5-10%	2-5s	0.2-0.5 req/s
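The VRAM figures in the table can serve as a quick pre-flight budget: YOLO26 (~4GB) plus Nemotron (~3GB) needs roughly 7GB free. The sketch below does that arithmetic. The function name `vram_fits` is hypothetical, it accepts whole gigabytes only, and the table values are approximations, so treat the result as a rough guide.

```shell
# Hypothetical helper: compare free VRAM (MiB) against per-service needs (whole GB).
# Usage: vram_fits FREE_MIB REQUIRED_GB...
vram_fits() {
  local free_mib="$1"
  shift
  local need_mib=0
  local gb

  for gb in "$@"; do
    need_mib=$(( need_mib + gb * 1024 ))
  done

  echo "need ${need_mib} MiB, free ${free_mib} MiB"
  [ "$free_mib" -ge "$need_mib" ]
}
```

Free memory on the first GPU can be read with `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1`, so a full check for the two core services is `vram_fits "$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)" 4 3`.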

Next Steps¶

AI GHCR Deployment - Deploy AI services from GHCR
AI Troubleshooting - Common issues and solutions
AI Performance - Performance tuning
AI TLS - Secure communications