AI Services Management¶
Start, stop, verify, and monitor AI inference services.
Time to read: ~8 min · Prerequisites: AI Configuration
Starting Services¶
[!IMPORTANT] This doc covers host-run AI (useful for development) and containerized AI (recommended for production). In production,
`docker-compose.prod.yml` defines all AI services (ports 8091–8096). For "which URL should I use?" (container DNS vs host vs remote), start with: Deployment Modes & AI Networking.
Unified Startup (Recommended)¶
Use the unified startup script to manage the core host-run AI services (YOLO26 + Nemotron):
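Invoke the script with the `start` subcommand (path as listed in the Quick Reference):

```shell
./scripts/start-ai.sh start
```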
Expected output:
==========================================
Starting AI Services
==========================================
[INFO] Checking prerequisites...
[OK] NVIDIA GPU detected: NVIDIA RTX A5500
[OK] CUDA available
[OK] llama-server found: /usr/bin/llama-server
[OK] Python found: /usr/bin/python3
[OK] Nemotron model found (2.5G)
[WARN] YOLO26 model not found (will auto-download)
[OK] All prerequisites satisfied
[INFO] Starting YOLO26 detection server...
[OK] YOLO26 detection server started successfully
Port: 8095
PID: 12345
Log: /tmp/yolo26-detector.log
Expected VRAM: ~4GB
[INFO] Starting Nemotron LLM server...
[OK] Nemotron LLM server started successfully
Port: 8091
PID: 12346
Log: /tmp/nemotron-llm.log
Expected VRAM: ~3GB
First startup takes longer (~2-3 minutes) due to:
- Model loading into VRAM
- CUDA initialization
- GPU warmup inferences
Individual Service Startup¶
Start services separately for debugging:
# YOLO26 detection server
./ai/start_detector.sh
# Nemotron LLM server (in separate terminal)
./ai/start_llm.sh
Production (containerized AI services)¶
Start the full stack (including Florence/CLIP/Enrichment):
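A likely invocation, assuming the production compose file sits at the repository root:

```shell
docker compose -f docker-compose.prod.yml up -d
```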
Start only the core services:
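A sketch, assuming the core services are named `yolo26` and `nemotron` in the compose file (hypothetical names; check `docker-compose.prod.yml` for the actual ones):

```shell
docker compose -f docker-compose.prod.yml up -d yolo26 nemotron
```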
Start only AI services (all 5):
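Assuming the five AI services are named `yolo26`, `nemotron`, `florence`, `clip`, and `enrichment` (hypothetical names; verify against the compose file):

```shell
docker compose -f docker-compose.prod.yml up -d yolo26 nemotron florence clip enrichment
```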
Stop:
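Assuming the same compose file:

```shell
docker compose -f docker-compose.prod.yml down
```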
Service Management¶
Check Status¶
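Use the `status` subcommand of the unified script:

```shell
./scripts/start-ai.sh status
```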
Output shows:
- Service status (RUNNING/STOPPED)
- Process IDs
- Health check results
- GPU memory usage
- Log file locations
Stop Services¶
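Use the `stop` subcommand:

```shell
./scripts/start-ai.sh stop
```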
Performs a graceful shutdown (10-second timeout) and force-kills any service that does not respond.
Restart Services¶
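The Quick Reference lists only `start`, `stop`, `status`, and `health` subcommands, so a restart can be expressed by chaining the two documented operations:

```shell
./scripts/start-ai.sh stop && ./scripts/start-ai.sh start
```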
Useful after:
- Model updates
- Configuration changes
- Service crashes
Health Check¶
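Use the `health` subcommand:

```shell
./scripts/start-ai.sh health
```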
Returns:
- HTTP health check results
- Service response times
- Detailed status JSON
Verification¶
Test YOLO26 Detection¶
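Query the health endpoint on the YOLO26 port (8095):

```shell
curl http://localhost:8095/health
```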
Expected response:
{
"status": "healthy",
"model_loaded": true,
"device": "cuda:0",
"cuda_available": true,
"vram_used_gb": 4.2
}
Test detection (requires test image):
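A hedged sketch of a detection request; the exact payload format is an assumption (shown here as a multipart upload under a field named `file`), so check the detector's API for the actual contract:

```shell
# Assumes POST /detect accepts a multipart image upload in a "file" field
curl -X POST http://localhost:8095/detect -F "file=@test.jpg"
```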
Test Nemotron LLM¶
# Health check
curl http://localhost:8091/health
# Test completion
curl -X POST http://localhost:8091/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Analyze: A person detected at front door at 14:30.",
"temperature": 0.7,
"max_tokens": 200
}'
Test Florence / CLIP / Enrichment (production)¶
curl http://localhost:8092/health # Florence-2
curl http://localhost:8093/health # CLIP
curl http://localhost:8094/health # Enrichment
Integration Test¶
Test full pipeline from backend:
cd backend
# Run integration tests
pytest tests/integration/ -v -k "test_ai_pipeline"
# Run full test suite
pytest tests/ -v
Service Logs¶
View Logs¶
# YOLO26 logs
tail -f /tmp/yolo26-detector.log
# Nemotron LLM logs
tail -f /tmp/nemotron-llm.log
# Both logs (parallel)
tail -f /tmp/yolo26-detector.log /tmp/nemotron-llm.log
Backend Integration Logging¶
The backend automatically monitors AI service health:
# Check backend logs for AI service status
cd backend
tail -f logs/app.log | grep -E "yolo26|nemotron"
Backend logs include:
- Connection failures
- Timeout errors
- Health check results
- Inference latencies
Production Deployment¶
Systemd Service Units¶
Create systemd services for production. Replace placeholders with actual values.
YOLO26 Service:
sudo tee /etc/systemd/system/ai-yolo26.service > /dev/null << EOF
[Unit]
Description=YOLO26 Object Detection Service
After=network.target
[Service]
Type=simple
User=$(whoami)
WorkingDirectory=${PROJECT_ROOT}/ai/yolo26
ExecStart=/usr/bin/python3 model.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Nemotron LLM Service:
sudo tee /etc/systemd/system/ai-llm.service > /dev/null << EOF
[Unit]
Description=Nemotron LLM Service
After=network.target
[Service]
Type=simple
User=$(whoami)
ExecStart=/usr/bin/llama-server --model ${PROJECT_ROOT}/ai/nemotron/nemotron-mini-4b-instruct-q4_k_m.gguf --port 8091 --ctx-size 4096 --n-gpu-layers 99 --host 0.0.0.0 --parallel 2 --cont-batching
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable ai-yolo26 ai-llm
sudo systemctl start ai-yolo26 ai-llm
Auto-start on Boot¶
Add to crontab:
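A minimal crontab entry, assuming the project root path (replace it with your actual checkout); note this is an alternative to the systemd units above, which already start on boot:

```shell
# crontab -e, then add:
@reboot /path/to/project/scripts/start-ai.sh start
```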
Quick Reference¶
Common Commands¶
# Start all AI services
./scripts/start-ai.sh start
# Stop all AI services
./scripts/start-ai.sh stop
# Check status
./scripts/start-ai.sh status
# Health check
./scripts/start-ai.sh health
# Check GPU
nvidia-smi
# Download models
./ai/download_models.sh

Figure: Context enrichment pipeline showing how detection data flows through zone analysis, baseline comparison, and cross-camera correlation.
Service Endpoints¶
| Service | Endpoint | Purpose |
|---|---|---|
| YOLO26 | GET /health | Health check |
| YOLO26 | POST /detect | Object detection |
| YOLO26 | POST /detect/batch | Batch detection |
| Nemotron | GET /health | Health check |
| Nemotron | POST /completion | Text completion |
| Nemotron | POST /v1/chat/completions | Chat API |
Expected Resource Usage¶
| Service | VRAM | CPU | Latency | Throughput |
|---|---|---|---|---|
| YOLO26 | ~4GB | 10-20% | 30-50ms | 20-30 img/s |
| Nemotron | ~3GB | 5-10% | 2-5s | 0.2-0.5 req/s |
Next Steps¶
- AI GHCR Deployment - Deploy AI services from GHCR
- AI Troubleshooting - Common issues and solutions
- AI Performance - Performance tuning
- AI TLS - Secure communications
See Also¶
- GPU Setup - GPU driver and container configuration
- AI Issues (Troubleshooting) - Detailed problem-solving guide
- AI Configuration - Environment variables