AI Performance

The AI Performance page provides real-time monitoring of the AI pipeline that powers security event detection and risk analysis. It displays metrics for the YOLO26 object detection model, the Nemotron LLM risk analyzer, and the overall processing pipeline.

Overview

From this page you can track:

  • Model health status (YOLO26 and Nemotron)
  • Processing latency and queue depths
  • Detection and event statistics
  • Risk score distribution
  • Model Zoo - 18+ specialized AI models for enhanced detection

Accessing the AI Performance Page

Click the Brain icon or AI Performance in the left sidebar, or navigate directly to /ai.

What You're Looking At

The AI Performance page embeds the HSI Consolidated Grafana dashboard in kiosk mode. This provides a unified monitoring experience with all AI metrics visualized through Grafana's powerful charting capabilities.

Key features displayed in the Grafana dashboard:

  • Model Health Status - Real-time health checks for YOLO26 and Nemotron services
  • Inference Latency - Average, P50, P95, and P99 latency statistics for each AI service
  • Pipeline Throughput - Queue depths, detection counts, and event generation rates
  • Historical Trends - Time-series charts showing performance over time
  • GPU Utilization - VRAM usage and GPU performance metrics

Page controls:

  • Refresh Button - Manually reload the Grafana iframe to get the latest data
  • Open in Grafana - Opens the full dashboard in a new tab with editing capabilities (no kiosk mode)

The Grafana dashboard auto-refreshes based on its own configured interval. Use the "Refresh" button in the page header to force a reload.

Key Metrics Explained

The Grafana dashboard displays metrics from the AI pipeline. Here's what each metric means:

AI Model Health

YOLO26 (Object Detection)

  • Detects people, vehicles, and animals in camera images
  • Model: Real-Time Detection Transformer v2 (COCO + Objects365 pre-trained)
  • Typical inference time: 30-50ms per image
  • VRAM usage: ~4GB
  • Health status: healthy, degraded, unhealthy, or unknown

Nemotron (Risk Analysis LLM)

  • Analyzes detection batches to generate risk scores and explanations
  • Production model: NVIDIA Nemotron-3-Nano-30B-A3B (~14.7GB VRAM, 128K context)
  • Development model: Nemotron Mini 4B Instruct (~3GB VRAM, 4K context)
  • Typical inference time: 2-5 seconds per batch
  • Runs via llama.cpp inference server on port 8091

Latency Metrics

The dashboard tracks latency at multiple pipeline stages:

| Stage | Description | Warning Threshold | Critical Threshold |
| --- | --- | --- | --- |
| Detection Inference | Time for YOLO26 to process one image | 500ms | 2000ms |
| Analysis Inference | Time for Nemotron to analyze a batch | 5000ms | 30000ms |
| Watch to Detect | File detection to detection result | 100ms | 500ms |
| Detect to Batch | Detection result to batch assignment | 200ms | 1000ms |
| Batch to Analyze | Batch closure to analysis completion | 100ms | 500ms |
| Total Pipeline | End-to-end from file upload to event creation | Varies | Varies |
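
The threshold bands above can be expressed as a small helper. This is an illustrative sketch, not the monitoring code itself; the stage keys and function name are hypothetical.

```python
# Hypothetical helper: classify a measured latency against the
# warning/critical thresholds from the table above (values in ms).
THRESHOLDS_MS = {
    "detection_inference": (500, 2000),
    "analysis_inference": (5000, 30000),
    "watch_to_detect": (100, 500),
    "detect_to_batch": (200, 1000),
    "batch_to_analyze": (100, 500),
}

def latency_status(stage: str, latency_ms: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a pipeline stage."""
    warning, critical = THRESHOLDS_MS[stage]
    if latency_ms >= critical:
        return "critical"
    if latency_ms >= warning:
        return "warning"
    return "ok"

latency_status("detection_inference", 42)     # "ok": a healthy YOLO26 inference
latency_status("analysis_inference", 31000)   # "critical": Nemotron far too slow
```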

Expected end-to-end latency:

  • Fast path (high-confidence person): ~3-6 seconds
  • Normal path (batched): 30-95 seconds (dominated by batch window)

Queue Health

Queue Depths indicate processing backlog:

  • Detection Queue: Images waiting for YOLO26 processing
  • Analysis Queue: Batches waiting for Nemotron analysis

| Queue Depth | Status | Meaning |
| --- | --- | --- |
| < 10 items | Healthy (green) | Processing keeping up |
| 10-50 items | Moderate (yellow) | Some backlog forming |
| > 50 items | Backlog (red) | Processing cannot keep up |
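
As a sketch, the queue-depth bands map to a status like this (the function name is hypothetical; the thresholds come from the table above):

```python
# Hypothetical helper mirroring the dashboard's queue-depth bands.
def queue_status(depth: int) -> str:
    """Map a queue depth to the dashboard's health band."""
    if depth > 50:
        return "backlog"   # red: processing cannot keep up
    if depth >= 10:
        return "moderate"  # yellow: some backlog forming
    return "healthy"       # green: processing keeping up

queue_status(5)   # "healthy"
queue_status(30)  # "moderate"
queue_status(75)  # "backlog"
```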

Throughput Counters:

  • Total Detections: Objects detected by YOLO26
  • Total Events: Security events generated by the pipeline

Error Monitoring:

  • Pipeline Errors: Failures by type (detection errors, analysis errors, etc.)
  • Queue Overflows: Items dropped due to queue capacity limits
  • Dead Letter Queue (DLQ): Failed jobs awaiting manual review or reprocessing

Risk Score Distribution

Events are categorized by risk level:

| Risk Level | Score Range | Description |
| --- | --- | --- |
| Low | 0-30 | Normal activity, no concern |
| Medium | 31-60 | Unusual but not threatening |
| High | 61-80 | Suspicious activity requiring attention |
| Critical | 81-100 | Potential security threat, immediate action needed |
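
The score bands translate directly into a lookup, sketched here (the helper is illustrative, not the pipeline's actual code; the link format matches the Timeline URLs this page links to):

```python
# Hypothetical helper mapping a 0-100 risk score to its level band.
def risk_level(score: int) -> str:
    if score >= 81:
        return "critical"
    if score >= 61:
        return "high"
    if score >= 31:
        return "medium"
    return "low"

# The same levels appear in the Timeline filter links, e.g.:
link = f"/timeline?risk_level={risk_level(72)}"  # "/timeline?risk_level=high"
```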

Clickable Risk Score Bars

The risk score distribution chart is interactive. Each bar and count is clickable.

How It Works:

  1. Click any bar in the chart to navigate to the Event Timeline
  2. The Timeline automatically filters to show only events at that risk level
  3. You can quickly investigate all events of a specific severity

| Click Target | Navigates To |
| --- | --- |
| Low bar/count | /timeline?risk_level=low |
| Medium bar/count | /timeline?risk_level=medium |
| High bar/count | /timeline?risk_level=high |
| Critical bar/count | /timeline?risk_level=critical |

Visual Feedback:

  • Hover tooltip - Shows "Click to view X events" on hover
  • Scale effect - Bar slightly enlarges on hover
  • Pointer cursor - Indicates the bar is clickable
  • Focus ring - Green outline when using keyboard navigation

The bars are implemented as buttons for full keyboard accessibility (Tab to navigate, Enter/Space to select).

Detection Class Distribution

Objects are detected in these security-relevant categories:

  • People: person
  • Vehicles: car, truck, bus, motorcycle, bicycle
  • Animals: dog, cat, bird

Model Zoo Section

The Model Zoo contains 18+ specialized AI models that enhance your security detections beyond basic object detection. These models extract additional details like license plates, faces, clothing, and vehicle types.

Summary Bar

The Model Zoo summary bar at the top displays key statistics:

| Indicator | Description |
| --- | --- |
| Loaded | Models currently in GPU memory (green dot) |
| Unloaded | Available models not currently loaded (gray) |
| Disabled | Temporarily disabled models (yellow) |
| VRAM | GPU memory usage (used/budget) |

VRAM (Video RAM) is the GPU memory used by loaded models. The Model Zoo has a dedicated budget of 1650 MB separate from core AI models.

Latency Chart

The latency chart shows inference time trends for any Model Zoo model:

  1. Select a model using the dropdown menu at the top right
  2. View timing data displayed as three lines:
       • Avg (ms) - Average inference time (emerald green)
       • P50 (ms) - Median inference time (blue)
       • P95 (ms) - 95th percentile time (amber)

The time axis shows the last 60 minutes of data.

Chart Legend:

| Line Color | Metric | Meaning |
| --- | --- | --- |
| Emerald | Average | Typical inference time |
| Blue | P50/Median | Half of inferences are faster than this |
| Amber | P95 | 95% of inferences are faster than this |
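
To make the three lines concrete, here is how average, P50, and P95 relate on a small sample of latencies. The nearest-rank percentile helper is illustrative (Grafana computes these server-side); note how a few slow inferences pull the average well above the median, which is why the dashboard shows percentiles at all.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value p% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 35, 90, 140]
avg = sum(latencies_ms) / len(latencies_ms)  # 37.7 — dragged up by outliers
p50 = percentile(latencies_ms, 50)           # 16 — the typical inference
p95 = percentile(latencies_ms, 95)           # 140 — the slow tail
```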

No data? If a model shows "No data available," it either has not been used recently or is disabled.

Model Status Cards

Below the chart, each Model Zoo model appears as a status card with:

| Element | Description |
| --- | --- |
| Model Name | Human-readable name of the model |
| Status Dot | Color-coded health indicator |
| Status Label | Current state (Loaded, Unloaded, Loading, etc.) |
| VRAM | GPU memory required when loaded |
| Last Used | Time since model was last used ("2h ago", "Never") |
| Category | Model type (Detection, Classification, etc.) |

Model Status Indicators

| Status | Dot Color | Meaning |
| --- | --- | --- |
| Loaded | Green | Model is in GPU memory and ready |
| Loading | Blue (pulsing) | Model is currently being loaded |
| Unloaded | Gray | Model available but not loaded |
| Disabled | Yellow | Model is turned off |
| Error | Red | Model failed to load |

Active vs Disabled Models

Models are organized into two sections:

  • Active Models - Enabled and available for use
  • Disabled Models - Turned off (grayed out, appear at bottom)

Why are some models disabled?

  • Incompatible with current software versions
  • Moved to a dedicated service
  • Not yet released
  • Temporarily turned off for maintenance

Model Zoo Categories

Detection Models

| Model | VRAM | Purpose |
| --- | --- | --- |
| YOLO11 License Plate | 300 MB | Find license plates on vehicles |
| YOLO11 Face | 200 MB | Detect faces on people |
| YOLO World S | 1500 MB | Open vocabulary detection |
| Vehicle Damage Detection | 2000 MB | Find damage on vehicles |

Classification Models

| Model | VRAM | Purpose |
| --- | --- | --- |
| Violence Detection | 500 MB | Identify violent activity |
| Weather Classification | 200 MB | Determine weather conditions |
| Fashion CLIP | 500 MB | Classify clothing types |
| Vehicle Segment Classifier | 1500 MB | Identify vehicle types |
| Pet Classifier | 200 MB | Distinguish cats and dogs |

Other Specialized Models

| Model | VRAM | Category | Purpose |
| --- | --- | --- | --- |
| SegFormer Clothes | 1500 MB | Segmentation | Clothing segmentation |
| ViTPose Small | 1500 MB | Pose | Human pose estimation |
| Depth Anything V2 | 150 MB | Depth | Distance estimation |
| CLIP ViT-L | 800 MB | Embedding | Visual embeddings |
| PaddleOCR | 100 MB | OCR | Read text from plates |
| X-CLIP Base | 2000 MB | Action Recognition | Recognize activities |

Understanding Model Memory (VRAM)

Models load into your GPU's video memory (VRAM) when needed:

  • VRAM Budget: 1650 MB for the Model Zoo
  • Loading Strategy: One model loads at a time (sequential)
  • Automatic Management: Models load/unload based on demand

Why does this matter?

  • Loaded models respond instantly
  • Unloaded models need time to load before first use
  • VRAM constraints limit how many models can be loaded simultaneously

Note: The core YOLO26 (~650 MB) and Nemotron (~21,700 MB) models have separate VRAM allocations and are always loaded.
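
The budgeted, demand-driven loading described above can be sketched as an LRU cache over VRAM. This is an illustration under assumed behavior, not the actual loader; the class and method names are hypothetical.

```python
from collections import OrderedDict

BUDGET_MB = 1650  # the Model Zoo's dedicated VRAM budget

class ModelZoo:
    """Sketch: load models on demand, evicting least-recently-used ones
    when the budget would be exceeded."""

    def __init__(self) -> None:
        self.loaded: "OrderedDict[str, int]" = OrderedDict()  # name -> VRAM MB

    def ensure_loaded(self, name: str, vram_mb: int) -> None:
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return
        # Evict LRU models until the new one fits within the budget.
        while self.loaded and sum(self.loaded.values()) + vram_mb > BUDGET_MB:
            self.loaded.popitem(last=False)
        self.loaded[name] = vram_mb

zoo = ModelZoo()
zoo.ensure_loaded("yolo11_license_plate", 300)
zoo.ensure_loaded("yolo11_face", 200)
zoo.ensure_loaded("yolo_world_s", 1500)  # forces eviction of both older models
```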

Model Zoo Analytics

Below the Model Zoo status cards, you see additional analytics:

Model Contribution Chart

A bar chart showing which models contribute most to event enrichment:

  • Higher bars = More frequently used models
  • Sorted by contribution = Most useful models at top
  • Hover for details = See exact percentage

Model Leaderboard

A sortable table ranking models by contribution:

| Column | Description |
| --- | --- |
| Rank | Position (top 3 have badges) |
| Model | Model name |
| Contribution | Percentage of events this model enriched |
| Events | Number of events processed |
| Quality | Correlation with good AI assessments |

Click column headers to sort by that metric.

Settings & Configuration

Grafana Configuration

The AI Performance page embeds a Grafana dashboard. The dashboard URL is fetched from the backend configuration API.

| Setting | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| Grafana URL | GRAFANA_URL | http://localhost:3002 | URL of the Grafana instance |
| Dashboard UID | N/A | hsi-consolidated | The dashboard to display |

The page loads the dashboard at: {grafana_url}/d/hsi-consolidated?orgId=1&kiosk
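
Assembling that URL from the config API response might look like this sketch (the fetch itself is omitted, and the helper is hypothetical; `config` stands in for the parsed JSON response):

```python
DASHBOARD_UID = "hsi-consolidated"

def kiosk_url(config: dict) -> str:
    """Build the embedded dashboard URL from the backend config response."""
    grafana_url = config.get("grafana_url", "http://localhost:3002")
    return f"{grafana_url}/d/{DASHBOARD_UID}?orgId=1&kiosk"

kiosk_url({"grafana_url": "http://localhost:3002"})
# "http://localhost:3002/d/hsi-consolidated?orgId=1&kiosk"
```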

To access Grafana directly with full editing capabilities, click the "Open in Grafana" button in the page header. This opens the same dashboard without kiosk mode, allowing you to:

  • Adjust time ranges
  • Modify queries
  • Create alerts
  • Export data

AI Service Configuration

| Setting | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| YOLO26 URL | YOLO26_URL | http://localhost:8095 | Detection service endpoint |
| Nemotron URL | NEMOTRON_URL | http://localhost:8091 | LLM analysis endpoint |
| Detection Confidence | DETECTION_CONFIDENCE_THRESHOLD | 0.5 | Minimum confidence to store detection |
| Batch Window | BATCH_WINDOW_SECONDS | 90 | Maximum batch duration |
| Idle Timeout | BATCH_IDLE_TIMEOUT_SECONDS | 30 | Close batch after this idle period |
| Fast Path Confidence | FAST_PATH_CONFIDENCE_THRESHOLD | 0.90 | Confidence for immediate analysis |
| Fast Path Types | FAST_PATH_OBJECT_TYPES | ["person"] | Object types eligible for fast path |
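
A minimal sketch of the fast-path decision implied by these settings. Only the environment variable names and defaults come from the table; the helper itself is illustrative, not the actual backend code.

```python
import json
import os

# Defaults match the configuration table above.
FAST_PATH_CONFIDENCE = float(os.getenv("FAST_PATH_CONFIDENCE_THRESHOLD", "0.90"))
FAST_PATH_TYPES = json.loads(os.getenv("FAST_PATH_OBJECT_TYPES", '["person"]'))

def takes_fast_path(object_type: str, confidence: float) -> bool:
    """High-confidence detections of eligible types skip the batch window."""
    return object_type in FAST_PATH_TYPES and confidence >= FAST_PATH_CONFIDENCE

takes_fast_path("person", 0.95)  # True: analyzed immediately
takes_fast_path("car", 0.99)     # False: waits for the normal batch window
```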

Refresh Settings

The AI Performance page relies on Grafana's built-in refresh mechanism. The dashboard's refresh interval is configured within Grafana itself.

| Setting | Location | Description |
| --- | --- | --- |
| Dashboard Refresh | Grafana dashboard settings | Auto-refresh interval for the embedded dashboard |
| Manual Refresh | Page header button | Reloads the Grafana iframe on demand |

Note: The standalone AI metrics components (available for other pages) use a 5-second polling interval with a 60-minute latency history window.

Troubleshooting

Grafana Dashboard Shows "Failed to Load"

  1. Verify Grafana is running: curl http://localhost:3002/api/health
  2. Check the backend config endpoint returns grafana_url: curl http://localhost:8000/api/system/config
  3. Ensure network/firewall allows iframe embedding from Grafana
  4. Check browser console for CORS or content security policy errors

Model Shows "Unhealthy" Status

YOLO26:

  1. Check if the detection server is running: curl http://localhost:8095/health
  2. Verify GPU is available: nvidia-smi
  3. Check container logs: docker logs yolo26
  4. VRAM exhaustion may require restarting the service

Nemotron:

  1. Check if llama.cpp server is running: curl http://localhost:8091/health
  2. Verify model file exists and is valid
  3. Check VRAM availability (Nemotron needs ~14.7GB for production model)
  4. Review llama.cpp logs for memory allocation errors

High Latency Detected

Detection Latency > 500ms:

  • GPU may be under thermal throttling
  • Check for other GPU workloads
  • Verify CUDA is being used (not CPU fallback)

Analysis Latency > 30s:

  • LLM context may be too large
  • Consider reducing batch size
  • Check if model is fully loaded in VRAM

Queue Backlogs:

  • Processing cannot keep up with incoming images
  • Consider reducing camera upload frequency
  • Check for downstream service bottlenecks

Dead Letter Queue Items Appearing

Items in the DLQ indicate failed processing jobs. To investigate:

  1. Check the DLQ Monitor in Settings or use the API
  2. Review the error messages for each failed job (includes error type, stack trace, HTTP status)
  3. Common causes:
       • Temporary service unavailability
       • Malformed image files
       • LLM response parsing failures
       • Network timeouts

DLQ API Endpoints:

View DLQ statistics:

curl http://localhost:8000/api/dlq/stats

List jobs in a specific DLQ:

# Detection queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue?start=0&limit=10"

# Analysis queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:analysis_queue?start=0&limit=10"

Requeue all jobs from a DLQ back to processing (requires API key if authentication enabled):

# Requeue detection queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue

# Requeue analysis queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue

Clear a DLQ (permanently removes all jobs):

curl -X DELETE http://localhost:8000/api/dlq/dlq:detection_queue

Metrics Not Updating

  1. Verify backend is healthy: curl http://localhost:8000/api/system/health
  2. Check Prometheus metrics endpoint: curl http://localhost:8000/api/metrics
  3. Verify Redis is connected (used for queue depth metrics)
  4. Try manual refresh using the "Refresh" button
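
If you need to inspect `/api/metrics` by hand, the Prometheus exposition format is line-oriented and easy to parse. This is a minimal illustrative parser; the metric names in the sample are examples, not necessarily the service's actual names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus exposition lines into a name -> value map."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP detection_queue_depth Images awaiting YOLO26 processing
# TYPE detection_queue_depth gauge
detection_queue_depth 3
analysis_queue_depth 0
"""
parse_metrics(sample)["detection_queue_depth"]  # 3.0
```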

Model Zoo Troubleshooting

Model Showing "Error" Status

Symptoms: Model card shows red dot and "Error" label.

Possible causes:

  • Model file is missing or corrupted
  • Insufficient GPU memory
  • Model incompatible with current GPU

What to do:

  1. Check the System page for GPU memory status
  2. Note the model name and check system logs
  3. Restart the AI service if multiple models show errors

Model Never Loads

Symptoms: Model stays "Unloaded" even when its function should trigger.

Possible causes:

  • No detections that require this model (e.g., no license plates seen)
  • Model is disabled in configuration
  • Queue is backed up with other processing

What to do:

  1. Check if the model is in the "Disabled Models" section
  2. Wait for normal detection activity
  3. Check the Pipeline Health panel for queue issues

High Latency on a Model

Symptoms: Latency chart shows consistently high times (P95 above 500ms for detection models).

Possible causes:

  • GPU under heavy load
  • Model being loaded/unloaded frequently
  • Large number of objects in images

What to do:

  1. Check GPU utilization on the System page
  2. Look for patterns in the latency chart
  3. Remember that elevated latency is normal during high-activity periods

"No Data Available" for Model Latency

Symptoms: Latency chart shows "No data available for [model name]"

This is normal when:

  • The model has not been used in the last hour
  • The model is disabled
  • No detections have triggered this model type

What to do: Nothing - this is informational. Data appears when the model is used.

Technical Deep Dive

For developers wanting to understand the underlying systems.

Architecture

  • AI Pipeline Architecture: AI Pipeline - Complete flow from file upload to event creation
  • YOLO26 Integration: Object detection with Real-Time Detection Transformer v2
  • Nemotron LLM: Risk analysis using NVIDIA Nemotron-3-Nano-30B-A3B via llama.cpp
  • Batch Aggregation: Time-window based grouping of related detections

Data Flow

Camera FTP Upload
        |
        v
FileWatcher (inotify/FSEvents)
        |
        v
detection_queue (Redis)
        |
        v
YOLO26 Server (Port 8095)
        |
        v
BatchAggregator
        |
   +----+----+
   |         |
   v         v
Fast Path   Normal Batching (30-90s windows)
   |         |
   +----+----+
        |
        v
analysis_queue (Redis)
        |
        v
Nemotron LLM (Port 8091)
        |
        v
Event Creation + WebSocket Broadcast

API Endpoints

| Endpoint | Description |
| --- | --- |
| /api/metrics | Prometheus metrics in exposition format |
| /api/system/health | Overall system and AI service health |
| /api/system/telemetry | Queue depths and basic stats |
| /api/system/pipeline-latency | Detailed pipeline latency percentiles |
| /api/detections/stats | Detection class distribution |
| /api/events/stats | Risk level distribution |
| /api/dlq/stats | Dead letter queue counts |

AI Performance Page (Grafana Embed):

| Component | File Path |
| --- | --- |
| AI Performance Page | frontend/src/components/ai/AIPerformancePage.tsx |
| Config API Service | frontend/src/services/api.ts (fetchConfig) |

Standalone AI Metrics Components (available for use elsewhere, not currently used on AI Performance page):

| Component | File Path | Purpose |
| --- | --- | --- |
| AI Metrics Hook | frontend/src/hooks/useAIMetrics.ts | Fetches and combines AI metrics from multiple endpoints |
| Metrics Parser | frontend/src/services/metricsParser.ts | Parses Prometheus metrics format |
| Model Status Cards | frontend/src/components/ai/ModelStatusCards.tsx | YOLO26 and Nemotron status badges |
| Latency Panel | frontend/src/components/ai/LatencyPanel.tsx | Latency histograms with percentiles |
| Pipeline Health Panel | frontend/src/components/ai/PipelineHealthPanel.tsx | Queue depths and error counts |
| Insights Charts | frontend/src/components/ai/InsightsCharts.tsx | Detection and risk distribution charts |

Backend Services:

| Component | File Path |
| --- | --- |
| Backend Metrics | backend/core/metrics.py |
| System Routes | backend/api/routes/system.py |
| DLQ Routes | backend/api/routes/dlq.py |
| Detector Client | backend/services/detector_client.py |
| Nemotron Analyzer | backend/services/nemotron_analyzer.py |
| Batch Aggregator | backend/services/batch_aggregator.py |

Note: The standalone AI metrics components are exported from frontend/src/components/ai/index.ts and can be used on other pages that need to display AI metrics directly (without Grafana).

GPU Requirements

| Service | Model | VRAM | Port | Context Window |
| --- | --- | --- | --- | --- |
| YOLO26 | yolo26_v2_r101vd | ~4 GB | 8095 | N/A |
| Nemotron (Prod) | Nemotron-3-Nano-30B-A3B | ~14.7 GB | 8091 | 128K tokens |
| Nemotron (Dev) | Nemotron Mini 4B Instruct | ~3 GB | 8091 | 4K tokens |
| Total (Production) | | ~18.7 GB | | |
| Total (Development) | | ~7 GB | | |

The system is optimized for NVIDIA RTX A5500 (24GB VRAM) in production. For development, a GPU with 8GB+ VRAM is recommended.

Production model advantages:

  • 128K context window enables analyzing hours of detection history in a single prompt
  • Better reasoning quality for complex security scenarios
  • Supports detailed enrichment data (clothing, vehicles, behavior patterns)

Development model trade-offs:

  • Faster inference (~100-200 tokens/second vs ~50-100)
  • Limited context requires more aggressive summarization
  • Lower VRAM footprint allows running on consumer GPUs