AI Performance¶

The AI Performance page provides real-time monitoring of the AI pipeline that powers security event detection and risk analysis. It displays metrics for the YOLO26 object detection model, the Nemotron LLM risk analyzer, and the overall processing pipeline.
Overview¶
From this page you can track:
- Model health status (YOLO26 and Nemotron)
- Processing latency and queue depths
- Detection and event statistics
- Risk score distribution
- Model Zoo - 18+ specialized AI models for enhanced detection
Accessing the AI Performance Page¶
Click the Brain icon or AI Performance in the left sidebar, or navigate directly to /ai.
What You're Looking At¶
The AI Performance page embeds the HSI Consolidated Grafana dashboard in kiosk mode. This provides a unified monitoring experience with all AI metrics visualized through Grafana's powerful charting capabilities.
Key features displayed in the Grafana dashboard:
- Model Health Status - Real-time health checks for YOLO26 and Nemotron services
- Inference Latency - Average, P50, P95, and P99 latency statistics for each AI service
- Pipeline Throughput - Queue depths, detection counts, and event generation rates
- Historical Trends - Time-series charts showing performance over time
- GPU Utilization - VRAM usage and GPU performance metrics
Page controls:
- Refresh Button - Manually reload the Grafana iframe to get the latest data
- Open in Grafana - Opens the full dashboard in a new tab with editing capabilities (no kiosk mode)
The Grafana dashboard auto-refreshes based on its own configured interval. Use the "Refresh" button in the page header to force a reload.
Key Metrics Explained¶
The Grafana dashboard displays metrics from the AI pipeline. Here's what each metric means:
AI Model Health¶
YOLO26 (Object Detection)
- Detects people, vehicles, and animals in camera images
- Model: Real-Time Detection Transformer v2 (COCO + Objects365 pre-trained)
- Typical inference time: 30-50ms per image
- VRAM usage: ~4GB
- Health status: healthy, degraded, unhealthy, or unknown
Nemotron (Risk Analysis LLM)
- Analyzes detection batches to generate risk scores and explanations
- Production model: NVIDIA Nemotron-3-Nano-30B-A3B (~14.7GB VRAM, 128K context)
- Development model: Nemotron Mini 4B Instruct (~3GB VRAM, 4K context)
- Typical inference time: 2-5 seconds per batch
- Runs via llama.cpp inference server on port 8091
Latency Metrics¶
The dashboard tracks latency at multiple pipeline stages:
| Stage | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Detection Inference | Time for YOLO26 to process one image | 500ms | 2000ms |
| Analysis Inference | Time for Nemotron to analyze a batch | 5000ms | 30000ms |
| Watch to Detect | File detection to detection result | 100ms | 500ms |
| Detect to Batch | Detection result to batch assignment | 200ms | 1000ms |
| Batch to Analyze | Batch closure to analysis completion | 100ms | 500ms |
| Total Pipeline | End-to-end from file upload to event creation | Varies | Varies |
Expected end-to-end latency:
- Fast path (high-confidence person): ~3-6 seconds
- Normal path (batched): 30-95 seconds (dominated by batch window)
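These end-to-end figures follow from the nominal stage timings on this page. The sketch below sums them for each path; all numbers are illustrative upper bounds, not measured values:

```python
# Nominal stage timings taken from this page; illustrative, not measured.
YOLO_INFERENCE_S = 0.05       # upper end of 30-50 ms per image
NEMOTRON_INFERENCE_S = 5.0    # upper end of 2-5 s per batch
BATCH_WINDOW_S = 90.0         # BATCH_WINDOW_SECONDS default
BATCH_IDLE_S = 30.0           # BATCH_IDLE_TIMEOUT_SECONDS default

def fast_path_estimate() -> float:
    """High-confidence person detections skip batching entirely."""
    return YOLO_INFERENCE_S + NEMOTRON_INFERENCE_S

def batched_path_bounds() -> tuple[float, float]:
    """Normal detections wait somewhere between the idle timeout and the
    full batch window before analysis starts; analysis adds 2-5 s."""
    return (BATCH_IDLE_S + 2.0, BATCH_WINDOW_S + NEMOTRON_INFERENCE_S)

print(fast_path_estimate())    # ~5 s, within the 3-6 s fast-path range
print(batched_path_bounds())   # roughly (32, 95) s, matching 30-95 s
```

The batched path is dominated by the batch window, which is why reducing BATCH_WINDOW_SECONDS is the main lever on normal-path latency.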
Queue Health¶
Queue Depths indicate processing backlog:
- Detection Queue: Images waiting for YOLO26 processing
- Analysis Queue: Batches waiting for Nemotron analysis
| Queue Depth | Status | Meaning |
|---|---|---|
| < 10 items | Healthy (green) | Processing keeping up |
| 10-50 items | Moderate (yellow) | Some backlog forming |
| > 50 items | Backlog (red) | Processing cannot keep up |
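The queue-depth bands above can be expressed as a small helper (a sketch for clarity; the dashboard computes this server-side):

```python
def queue_status(depth: int) -> str:
    """Map a queue depth to the health bands shown on the dashboard."""
    if depth < 10:
        return "healthy"
    if depth <= 50:
        return "moderate"
    return "backlog"

print(queue_status(3))   # healthy: processing keeping up
print(queue_status(25))  # moderate: some backlog forming
print(queue_status(80))  # backlog: processing cannot keep up
```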
Throughput Counters:
- Total Detections: Objects detected by YOLO26
- Total Events: Security events generated by the pipeline
Error Monitoring:
- Pipeline Errors: Failures by type (detection errors, analysis errors, etc.)
- Queue Overflows: Items dropped due to queue capacity limits
- Dead Letter Queue (DLQ): Failed jobs awaiting manual review or reprocessing
Risk Score Distribution¶
Events are categorized by risk level:
| Risk Level | Score Range | Description |
|---|---|---|
| Low | 0-30 | Normal activity, no concern |
| Medium | 31-60 | Unusual but not threatening |
| High | 61-80 | Suspicious activity requiring attention |
| Critical | 81-100 | Potential security threat, immediate action needed |
Clickable Risk Score Bars¶
The risk score distribution chart is interactive. Each bar and count is clickable.
How It Works:
- Click any bar in the chart to navigate to the Event Timeline
- The Timeline automatically filters to show only events at that risk level
- You can quickly investigate all events of a specific severity
| Click Target | Navigates To |
|---|---|
| Low bar/count | /timeline?risk_level=low |
| Medium bar/count | /timeline?risk_level=medium |
| High bar/count | /timeline?risk_level=high |
| Critical bar/count | /timeline?risk_level=critical |
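The navigation targets in the table reduce to a simple URL template. A sketch of that mapping (the frontend implements this in TypeScript; this Python version just illustrates the pattern):

```python
RISK_LEVELS = ("low", "medium", "high", "critical")

def timeline_url(level: str) -> str:
    """Build the Timeline link a risk bar navigates to when clicked."""
    if level not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {level}")
    return f"/timeline?risk_level={level}"

print(timeline_url("critical"))  # /timeline?risk_level=critical
```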
Visual Feedback:
- Hover tooltip - Shows "Click to view X events" on hover
- Scale effect - Bar slightly enlarges on hover
- Pointer cursor - Indicates the bar is clickable
- Focus ring - Green outline when using keyboard navigation
The bars are implemented as buttons for full keyboard accessibility (Tab to navigate, Enter/Space to select).
Detection Class Distribution¶
Objects are detected in these security-relevant categories:
- People: person
- Vehicles: car, truck, bus, motorcycle, bicycle
- Animals: dog, cat, bird
Model Zoo Section¶
The Model Zoo contains 18+ specialized AI models that enhance your security detections beyond basic object detection. These models extract additional details like license plates, faces, clothing, and vehicle types.
Summary Bar¶
The Model Zoo summary bar at the top displays key statistics:
| Indicator | Description |
|---|---|
| Loaded | Models currently in GPU memory (green dot) |
| Unloaded | Available models not currently loaded (gray) |
| Disabled | Temporarily disabled models (yellow) |
| VRAM | GPU memory usage (used/budget) |
VRAM (Video RAM) is the GPU memory used by loaded models. The Model Zoo has a dedicated budget of 1650 MB separate from core AI models.
Latency Chart¶
The latency chart shows inference time trends for any Model Zoo model:
- Select a model using the dropdown menu at the top right
- View timing data displayed as three lines:
- Avg (ms) - Average inference time (emerald green)
- P50 (ms) - Median inference time (blue)
- P95 (ms) - 95th percentile time (amber)
- Time axis shows the last 60 minutes of data
Chart Legend:
| Line Color | Metric | Meaning |
|---|---|---|
| Emerald | Average | Typical inference time |
| Blue | P50/Median | Half of inferences are faster than this |
| Amber | P95 | 95% of inferences are faster than this |
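To see why the three lines diverge, here is a nearest-rank percentile computation over a hypothetical hour of latency samples. (This is an illustration only; Grafana and Prometheus estimate percentiles from histogram buckets, which can differ slightly.)

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over recent latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

# Hypothetical last-hour samples for one Model Zoo model, in ms
samples = [12, 15, 11, 14, 90, 13, 16, 12, 14, 13]

avg = sum(samples) / len(samples)   # 21.0 -- pulled up by the one outlier
p50 = percentile(samples, 50)       # 13 -- the median ignores the outlier
p95 = percentile(samples, 95)       # 90 -- exposes the worst-case tail
```

This is why P95 (amber) is the line to watch for tail-latency problems: the average and median can both look healthy while a fraction of inferences are slow.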
No data? If a model shows "No data available," it either has not been used recently or is disabled.
Model Status Cards¶
Below the chart, each Model Zoo model appears as a status card with:
| Element | Description |
|---|---|
| Model Name | Human-readable name of the model |
| Status Dot | Color-coded health indicator |
| Status Label | Current state (Loaded, Unloaded, Loading, etc.) |
| VRAM | GPU memory required when loaded |
| Last Used | Time since model was last used ("2h ago", "Never") |
| Category | Model type (Detection, Classification, etc.) |
Model Status Indicators¶
| Status | Dot Color | Meaning |
|---|---|---|
| Loaded | Green | Model is in GPU memory and ready |
| Loading | Blue (pulsing) | Model is currently being loaded |
| Unloaded | Gray | Model available but not loaded |
| Disabled | Yellow | Model is turned off |
| Error | Red | Model failed to load |
Active vs Disabled Models¶
Models are organized into two sections:
- Active Models - Enabled and available for use
- Disabled Models - Turned off (grayed out, appear at bottom)
Why are some models disabled?
- Incompatible with current software versions
- Moved to a dedicated service
- Not yet released
- Temporarily turned off for maintenance
Model Zoo Categories¶
Detection Models¶
| Model | VRAM | Purpose |
|---|---|---|
| YOLO11 License Plate | 300 MB | Find license plates on vehicles |
| YOLO11 Face | 200 MB | Detect faces on people |
| YOLO World S | 1500 MB | Open vocabulary detection |
| Vehicle Damage Detection | 2000 MB | Find damage on vehicles |
Classification Models¶
| Model | VRAM | Purpose |
|---|---|---|
| Violence Detection | 500 MB | Identify violent activity |
| Weather Classification | 200 MB | Determine weather conditions |
| Fashion CLIP | 500 MB | Classify clothing types |
| Vehicle Segment Classifier | 1500 MB | Identify vehicle types |
| Pet Classifier | 200 MB | Distinguish cats and dogs |
Other Specialized Models¶
| Model | VRAM | Category | Purpose |
|---|---|---|---|
| SegFormer Clothes | 1500 MB | Segmentation | Clothing segmentation |
| ViTPose Small | 1500 MB | Pose | Human pose estimation |
| Depth Anything V2 | 150 MB | Depth | Distance estimation |
| CLIP ViT-L | 800 MB | Embedding | Visual embeddings |
| PaddleOCR | 100 MB | OCR | Read text from plates |
| X-CLIP Base | 2000 MB | Action Recognition | Recognize activities |
Understanding Model Memory (VRAM)¶
Models load into your GPU's video memory (VRAM) when needed:
- VRAM Budget: 1650 MB for the Model Zoo
- Loading Strategy: One model loads at a time (sequential)
- Automatic Management: Models load/unload based on demand
Why does this matter?
- Loaded models respond instantly
- Unloaded models need time to load before first use
- VRAM constraints limit how many models can be loaded simultaneously
Note: The core YOLO26 (~4 GB) and Nemotron (~14.7 GB production, ~3 GB development) models have separate VRAM allocations and are always loaded.
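A budget check like the one the Model Zoo performs can be sketched as follows (function and variable names are illustrative; the real manager also handles sequential loading and demand-based eviction):

```python
MODEL_ZOO_VRAM_BUDGET_MB = 1650

def fits_budget(model_vram_mb: int, loaded_mb: list[int]) -> bool:
    """True if a model fits in the remaining Model Zoo budget.
    Core YOLO26/Nemotron allocations are separate and not counted here."""
    return sum(loaded_mb) + model_vram_mb <= MODEL_ZOO_VRAM_BUDGET_MB

print(fits_budget(300, [200]))        # True: 500 MB used of 1650 MB
print(fits_budget(1500, [200, 300]))  # False: would need 2000 MB
```

When a requested model does not fit, the manager must first unload something, which is why a cold model adds load time before its first inference.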
Model Zoo Analytics¶
Below the Model Zoo status cards, you see additional analytics:
Model Contribution Chart¶
A bar chart showing which models contribute most to event enrichment:
- Higher bars = More frequently used models
- Sorted by contribution = Most useful models at top
- Hover for details = See exact percentage
Model Leaderboard¶
A sortable table ranking models by contribution:
| Column | Description |
|---|---|
| Rank | Position (top 3 have badges) |
| Model | Model name |
| Contribution | Percentage of events this model enriched |
| Events | Number of events processed |
| Quality | Correlation with good AI assessments |
Click column headers to sort by that metric.
Settings & Configuration¶
Grafana Configuration¶
The AI Performance page embeds a Grafana dashboard. The dashboard URL is fetched from the backend configuration API.
| Setting | Environment Variable | Default | Description |
|---|---|---|---|
| Grafana URL | GRAFANA_URL | http://localhost:3002 | URL of the Grafana instance |
| Dashboard UID | N/A | hsi-consolidated | The dashboard to display |
The page loads the dashboard at: `{grafana_url}/d/hsi-consolidated?orgId=1&kiosk`
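The embed URL is assembled from the configured Grafana URL and the dashboard UID. A Python sketch of that composition (the frontend does the equivalent in TypeScript):

```python
def embed_url(grafana_url: str, dashboard_uid: str = "hsi-consolidated") -> str:
    """Compose the kiosk-mode URL loaded into the dashboard iframe."""
    # kiosk hides Grafana chrome; orgId=1 selects the default organization
    return f"{grafana_url.rstrip('/')}/d/{dashboard_uid}?orgId=1&kiosk"

print(embed_url("http://localhost:3002"))
# http://localhost:3002/d/hsi-consolidated?orgId=1&kiosk
```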
To access Grafana directly with full editing capabilities, click the "Open in Grafana" button in the page header. This opens the same dashboard without kiosk mode, allowing you to:
- Adjust time ranges
- Modify queries
- Create alerts
- Export data
AI Service Configuration¶
| Setting | Environment Variable | Default | Description |
|---|---|---|---|
| YOLO26 URL | YOLO26_URL | http://localhost:8095 | Detection service endpoint |
| Nemotron URL | NEMOTRON_URL | http://localhost:8091 | LLM analysis endpoint |
| Detection Confidence | DETECTION_CONFIDENCE_THRESHOLD | 0.5 | Minimum confidence to store detection |
| Batch Window | BATCH_WINDOW_SECONDS | 90 | Maximum batch duration |
| Idle Timeout | BATCH_IDLE_TIMEOUT_SECONDS | 30 | Close batch after this idle period |
| Fast Path Confidence | FAST_PATH_CONFIDENCE_THRESHOLD | 0.90 | Confidence for immediate analysis |
| Fast Path Types | FAST_PATH_OBJECT_TYPES | ["person"] | Object types eligible for fast path |
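The three threshold settings above interact to route each detection. A simplified sketch of that decision, using the defaults from the table (the actual routing lives in the backend and handles more cases):

```python
# Defaults from the configuration table above
DETECTION_CONFIDENCE = 0.5
FAST_PATH_CONFIDENCE = 0.90
FAST_PATH_TYPES = {"person"}

def route(object_type: str, confidence: float) -> str:
    """Decide whether a detection is discarded, fast-pathed, or batched."""
    if confidence < DETECTION_CONFIDENCE:
        return "discard"      # below the minimum confidence to store
    if object_type in FAST_PATH_TYPES and confidence >= FAST_PATH_CONFIDENCE:
        return "fast_path"    # immediate analysis, skips the batch window
    return "batch"            # waits for the batch window / idle timeout

print(route("person", 0.95))  # fast_path
print(route("person", 0.70))  # batch
print(route("car", 0.99))     # batch: type not fast-path eligible
print(route("dog", 0.30))     # discard
```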
Refresh Settings¶
The AI Performance page relies on Grafana's built-in refresh mechanism. The dashboard's refresh interval is configured within Grafana itself.
| Setting | Location | Description |
|---|---|---|
| Dashboard Refresh | Grafana dashboard settings | Auto-refresh interval for the embedded dashboard |
| Manual Refresh | Page header button | Reloads the Grafana iframe on demand |
Note: The standalone AI metrics components (available for other pages) use a 5-second polling interval with a 60-minute latency history window.
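A 5-second poll with a 60-minute window implies a fixed-size rolling buffer. A sketch of how such a history could be kept (illustrative only; the actual hook is frontend/src/hooks/useAIMetrics.ts in TypeScript):

```python
from collections import deque

POLL_INTERVAL_S = 5
HISTORY_MINUTES = 60
MAX_SAMPLES = HISTORY_MINUTES * 60 // POLL_INTERVAL_S  # 720 samples

latency_history: deque = deque(maxlen=MAX_SAMPLES)

def record(latency_ms: float) -> None:
    """Append one poll result; the oldest sample falls off automatically
    once the 60-minute window is full."""
    latency_history.append(latency_ms)

# Simulate 800 polls: only the most recent 720 are retained
for i in range(800):
    record(float(i))
```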
Troubleshooting¶
Grafana Dashboard Shows "Failed to Load"¶
- Verify Grafana is running: `curl http://localhost:3002/api/health`
- Check the backend config endpoint returns `grafana_url`: `curl http://localhost:8000/api/system/config`
- Ensure network/firewall allows iframe embedding from Grafana
- Check browser console for CORS or content security policy errors
Model Shows "Unhealthy" Status¶
YOLO26:
- Check if the detection server is running: `curl http://localhost:8095/health`
- Verify GPU is available: `nvidia-smi`
- Check container logs: `docker logs yolo26`
- VRAM exhaustion may require restarting the service
Nemotron:
- Check if the llama.cpp server is running: `curl http://localhost:8091/health`
- Verify the model file exists and is valid
- Check VRAM availability (Nemotron needs ~14.7GB for the production model)
- Review llama.cpp logs for memory allocation errors
High Latency Detected¶
Detection Latency > 500ms:
- GPU may be under thermal throttling
- Check for other GPU workloads
- Verify CUDA is being used (not CPU fallback)
Analysis Latency > 30s:
- LLM context may be too large
- Consider reducing batch size
- Check if model is fully loaded in VRAM
Queue Backlogs:
- Processing cannot keep up with incoming images
- Consider scaling camera upload frequency
- Check for downstream service bottlenecks
Dead Letter Queue Items Appearing¶
Items in the DLQ indicate failed processing jobs. To investigate:
- Check the DLQ Monitor in Settings or use the API
- Review the error messages for each failed job (includes error type, stack trace, HTTP status)
- Common causes:
- Temporary service unavailability
- Malformed image files
- LLM response parsing failures
- Network timeouts
DLQ API Endpoints:
View DLQ statistics:
curl http://localhost:8000/api/dlq/stats
List jobs in a specific DLQ:
```bash
# Detection queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue?start=0&limit=10"

# Analysis queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:analysis_queue?start=0&limit=10"
```
Requeue all jobs from a DLQ back to processing (requires API key if authentication enabled):
```bash
# Requeue detection queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue

# Requeue analysis queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue
```
Clear a DLQ (permanently removes all jobs):
Metrics Not Updating¶
- Verify backend is healthy: `curl http://localhost:8000/api/system/health`
- Check the Prometheus metrics endpoint: `curl http://localhost:8000/api/metrics`
- Verify Redis is connected (used for queue depth metrics)
- Try a manual refresh using the "Refresh" button
Model Zoo Troubleshooting¶
Model Showing "Error" Status¶
Symptoms: Model card shows red dot and "Error" label.
Possible causes:
- Model file is missing or corrupted
- Insufficient GPU memory
- Model incompatible with current GPU
What to do:
- Check the System page for GPU memory status
- Note the model name and check system logs
- Restart the AI service if multiple models show errors
Model Never Loads¶
Symptoms: Model stays "Unloaded" even when its function should trigger.
Possible causes:
- No detections that require this model (e.g., no license plates seen)
- Model is disabled in configuration
- Queue is backed up with other processing
What to do:
- Check if the model is in the "Disabled Models" section
- Wait for normal detection activity
- Check the Pipeline Health panel for queue issues
High Latency on a Model¶
Symptoms: Latency chart shows consistently high times (P95 above 500ms for detection models).
Possible causes:
- GPU under heavy load
- Model being loaded/unloaded frequently
- Large number of objects in images
What to do:
- Check GPU utilization on the System page
- Look for patterns in the latency chart
- Normal during high-activity periods
"No Data Available" for Model Latency¶
Symptoms: Latency chart shows "No data available for [model name]"
This is normal when:
- The model has not been used in the last hour
- The model is disabled
- No detections have triggered this model type
What to do: Nothing - this is informational. Data appears when the model is used.
Technical Deep Dive¶
For developers wanting to understand the underlying systems.
Architecture¶
- AI Pipeline Architecture: AI Pipeline - Complete flow from file upload to event creation
- YOLO26 Integration: Object detection with Real-Time Detection Transformer v2
- Nemotron LLM: Risk analysis using NVIDIA Nemotron-3-Nano-30B-A3B via llama.cpp
- Batch Aggregation: Time-window based grouping of related detections
Data Flow¶
```
Camera FTP Upload
        |
        v
FileWatcher (inotify/FSEvents)
        |
        v
detection_queue (Redis)
        |
        v
YOLO26 Server (Port 8095)
        |
        v
BatchAggregator
        |
   +----+----+
   |         |
   v         v
Fast Path   Normal Batching (30-90s windows)
   |         |
   +----+----+
        |
        v
analysis_queue (Redis)
        |
        v
Nemotron LLM (Port 8091)
        |
        v
Event Creation + WebSocket Broadcast
```
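The BatchAggregator step closes a batch on whichever comes first: the full window elapsing or the idle timeout firing. A minimal sketch of that rule, using the configured defaults (function and parameter names are illustrative, not the backend's actual API):

```python
# Defaults from BATCH_WINDOW_SECONDS and BATCH_IDLE_TIMEOUT_SECONDS
BATCH_WINDOW_S = 90.0
IDLE_TIMEOUT_S = 30.0

def should_close(opened_at: float, last_detection_at: float, now: float) -> bool:
    """A batch closes when its window expires or its input goes idle.
    All arguments are timestamps in seconds."""
    window_expired = now - opened_at >= BATCH_WINDOW_S
    went_idle = now - last_detection_at >= IDLE_TIMEOUT_S
    return window_expired or went_idle

print(should_close(0.0, 95.0, 100.0))  # True: 100 s elapsed, window expired
print(should_close(0.0, 10.0, 45.0))   # True: 35 s since last detection
print(should_close(0.0, 40.0, 50.0))   # False: batch still open
```

Once closed, the batch is pushed onto analysis_queue for Nemotron.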
API Endpoints¶
| Endpoint | Description |
|---|---|
| /api/metrics | Prometheus metrics in exposition format |
| /api/system/health | Overall system and AI service health |
| /api/system/telemetry | Queue depths and basic stats |
| /api/system/pipeline-latency | Detailed pipeline latency percentiles |
| /api/detections/stats | Detection class distribution |
| /api/events/stats | Risk level distribution |
| /api/dlq/stats | Dead letter queue counts |
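The /api/metrics endpoint returns Prometheus exposition format: one sample per line, with `#` comment lines for HELP/TYPE metadata. A minimal parsing sketch (metric names below are hypothetical; the real parser is frontend/src/services/metricsParser.ts and handles labels and timestamps):

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Minimal Prometheus exposition parser: skips comments and blank
    lines, splits each sample on its last space into name and value."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            continue  # ignore malformed lines
    return out

# Hypothetical excerpt of an exposition-format response
sample = """\
# HELP detection_queue_depth Items waiting for YOLO26
# TYPE detection_queue_depth gauge
detection_queue_depth 4
analysis_queue_depth 1
"""
print(parse_metrics(sample))  # {'detection_queue_depth': 4.0, 'analysis_queue_depth': 1.0}
```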
Related Code¶
AI Performance Page (Grafana Embed):
| Component | File Path |
|---|---|
| AI Performance Page | frontend/src/components/ai/AIPerformancePage.tsx |
| Config API Service | frontend/src/services/api.ts (fetchConfig) |
Standalone AI Metrics Components (available for use elsewhere, not currently used on AI Performance page):
| Component | File Path | Purpose |
|---|---|---|
| AI Metrics Hook | frontend/src/hooks/useAIMetrics.ts | Fetches and combines AI metrics from multiple endpoints |
| Metrics Parser | frontend/src/services/metricsParser.ts | Parses Prometheus metrics format |
| Model Status Cards | frontend/src/components/ai/ModelStatusCards.tsx | YOLO26 and Nemotron status badges |
| Latency Panel | frontend/src/components/ai/LatencyPanel.tsx | Latency histograms with percentiles |
| Pipeline Health Panel | frontend/src/components/ai/PipelineHealthPanel.tsx | Queue depths and error counts |
| Insights Charts | frontend/src/components/ai/InsightsCharts.tsx | Detection and risk distribution charts |
Backend Services:
| Component | File Path |
|---|---|
| Backend Metrics | backend/core/metrics.py |
| System Routes | backend/api/routes/system.py |
| DLQ Routes | backend/api/routes/dlq.py |
| Detector Client | backend/services/detector_client.py |
| Nemotron Analyzer | backend/services/nemotron_analyzer.py |
| Batch Aggregator | backend/services/batch_aggregator.py |
Note: The standalone AI metrics components are exported from frontend/src/components/ai/index.ts and can be used on other pages that need to display AI metrics directly (without Grafana).
GPU Requirements¶
| Service | Model | VRAM | Port | Context Window |
|---|---|---|---|---|
| YOLO26 | yolo26_v2_r101vd | ~4 GB | 8095 | N/A |
| Nemotron (Prod) | Nemotron-3-Nano-30B-A3B | ~14.7 GB | 8091 | 128K tokens |
| Nemotron (Dev) | Nemotron Mini 4B Instruct | ~3 GB | 8091 | 4K tokens |
| Total (Production) | | ~18.7 GB | | |
| Total (Development) | | ~7 GB | | |
The system is optimized for NVIDIA RTX A5500 (24GB VRAM) in production. For development, a GPU with 8GB+ VRAM is recommended.
Production model advantages:
- 128K context window enables analyzing hours of detection history in a single prompt
- Better reasoning quality for complex security scenarios
- Supports detailed enrichment data (clothing, vehicles, behavior patterns)
Development model trade-offs:
- Faster inference (~100-200 tokens/second vs ~50-100)
- Limited context requires more aggressive summarization
- Lower VRAM footprint allows running on consumer GPUs