AI Performance

The AI Performance page provides real-time monitoring of the AI pipeline that powers security event detection and risk analysis. It displays metrics for the YOLO26 object detection model, the Nemotron LLM risk analyzer, and the overall processing pipeline.

Overview

From this page you can track:

  • Model health status (YOLO26 and Nemotron)
  • Processing latency and queue depths
  • Detection and event statistics
  • Risk score distribution
  • Model Zoo - 18+ specialized AI models for enhanced detection

Accessing the AI Performance Page

Click the Brain icon or AI Performance in the left sidebar, or navigate directly to /ai.

What You're Looking At

The AI Performance page embeds the HSI Consolidated Grafana dashboard in kiosk mode. This provides a unified monitoring experience with all AI metrics visualized through Grafana's powerful charting capabilities.

Key features displayed in the Grafana dashboard:

  • Model Health Status - Real-time health checks for YOLO26 and Nemotron services
  • Inference Latency - Average, P50, P95, and P99 latency statistics for each AI service
  • Pipeline Throughput - Queue depths, detection counts, and event generation rates
  • Historical Trends - Time-series charts showing performance over time
  • GPU Utilization - VRAM usage and GPU performance metrics

Page controls:

  • Refresh Button - Manually reload the Grafana iframe to get the latest data
  • Open in Grafana - Opens the full dashboard in a new tab with editing capabilities (no kiosk mode)

The Grafana dashboard auto-refreshes based on its own configured interval. Use the "Refresh" button in the page header to force a reload.

Key Metrics Explained

The Grafana dashboard displays metrics from the AI pipeline. Here's what each metric means:

AI Model Health

YOLO26 (Object Detection)

  • Detects people, vehicles, and animals in camera images
  • Model: Real-Time Detection Transformer v2 (COCO + Objects365 pre-trained)
  • Typical inference time: 30-50ms per image
  • VRAM usage: ~4GB
  • Health status: healthy, degraded, unhealthy, or unknown

Nemotron (Risk Analysis LLM)

  • Analyzes detection batches to generate risk scores and explanations
  • Production model: NVIDIA Nemotron-3-Nano-30B-A3B (~14.7GB VRAM, 128K context)
  • Development model: Nemotron Mini 4B Instruct (~3GB VRAM, 4K context)
  • Typical inference time: 2-5 seconds per batch
  • Runs via llama.cpp inference server on port 8091

Latency Metrics

The dashboard tracks latency at multiple pipeline stages:

| Stage | Description | Warning Threshold | Critical Threshold |
| --- | --- | --- | --- |
| Detection Inference | Time for YOLO26 to process one image | 500ms | 2000ms |
| Analysis Inference | Time for Nemotron to analyze a batch | 5000ms | 30000ms |
| Watch to Detect | File detection to detection result | 100ms | 500ms |
| Detect to Batch | Detection result to batch assignment | 200ms | 1000ms |
| Batch to Analyze | Batch closure to analysis completion | 100ms | 500ms |
| Total Pipeline | End-to-end from file upload to event creation | Varies | Varies |
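
The threshold bands above can be expressed as a small helper. This is an illustrative sketch, not the monitoring code itself; the stage keys and function name are hypothetical.

```python
# Hypothetical helper: classify a measured latency against the
# warning/critical thresholds from the table above (values in ms).
THRESHOLDS_MS = {
    "detection_inference": (500, 2000),
    "analysis_inference": (5000, 30000),
    "watch_to_detect": (100, 500),
    "detect_to_batch": (200, 1000),
    "batch_to_analyze": (100, 500),
}

def latency_status(stage: str, latency_ms: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a pipeline stage."""
    warning, critical = THRESHOLDS_MS[stage]
    if latency_ms >= critical:
        return "critical"
    if latency_ms >= warning:
        return "warning"
    return "ok"

latency_status("detection_inference", 42)     # "ok": a healthy YOLO26 inference
latency_status("analysis_inference", 31000)   # "critical": Nemotron far too slow
```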

Expected end-to-end latency:

  • Fast path (high-confidence person): ~3-6 seconds
  • Normal path (batched): 30-95 seconds (dominated by batch window)

Queue Health

Queue Depths indicate processing backlog:

  • Detection Queue: Images waiting for YOLO26 processing
  • Analysis Queue: Batches waiting for Nemotron analysis

| Queue Depth | Status | Meaning |
| --- | --- | --- |
| < 10 items | Healthy (green) | Processing keeping up |
| 10-50 items | Moderate (yellow) | Some backlog forming |
| > 50 items | Backlog (red) | Processing cannot keep up |
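
As a sketch, the queue-depth bands map to a status like this (the function name is hypothetical; the thresholds come from the table above):

```python
# Hypothetical helper mirroring the dashboard's queue-depth bands.
def queue_status(depth: int) -> str:
    """Map a queue depth to the dashboard's health band."""
    if depth > 50:
        return "backlog"   # red: processing cannot keep up
    if depth >= 10:
        return "moderate"  # yellow: some backlog forming
    return "healthy"       # green: processing keeping up

queue_status(5)   # "healthy"
queue_status(30)  # "moderate"
queue_status(75)  # "backlog"
```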

Throughput Counters:

  • Total Detections: Objects detected by YOLO26
  • Total Events: Security events generated by the pipeline

Error Monitoring:

  • Pipeline Errors: Failures by type (detection errors, analysis errors, etc.)
  • Queue Overflows: Items dropped due to queue capacity limits
  • Dead Letter Queue (DLQ): Failed jobs awaiting manual review or reprocessing

Risk Score Distribution

Events are categorized by risk level:

| Risk Level | Score Range | Description |
| --- | --- | --- |
| Low | 0-30 | Normal activity, no concern |
| Medium | 31-60 | Unusual but not threatening |
| High | 61-80 | Suspicious activity requiring attention |
| Critical | 81-100 | Potential security threat, immediate action needed |
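
The score bands translate directly into a lookup, sketched here (the helper is illustrative, not the pipeline's actual code; the link format matches the Timeline URLs this page links to):

```python
# Hypothetical helper mapping a 0-100 risk score to its level band.
def risk_level(score: int) -> str:
    if score >= 81:
        return "critical"
    if score >= 61:
        return "high"
    if score >= 31:
        return "medium"
    return "low"

# The same levels appear in the Timeline filter links, e.g.:
link = f"/timeline?risk_level={risk_level(72)}"  # "/timeline?risk_level=high"
```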

Clickable Risk Score Bars

The risk score distribution chart is interactive. Each bar and count is clickable.

How It Works:

  1. Click any bar in the chart to navigate to the Event Timeline
  2. The Timeline automatically filters to show only events at that risk level
  3. You can quickly investigate all events of a specific severity

| Click Target | Navigates To |
| --- | --- |
| Low bar/count | /timeline?risk_level=low |
| Medium bar/count | /timeline?risk_level=medium |
| High bar/count | /timeline?risk_level=high |
| Critical bar/count | /timeline?risk_level=critical |

Visual Feedback:

  • Hover tooltip - Shows "Click to view X events" on hover
  • Scale effect - Bar slightly enlarges on hover
  • Pointer cursor - Indicates the bar is clickable
  • Focus ring - Green outline when using keyboard navigation

The bars are implemented as buttons for full keyboard accessibility (Tab to navigate, Enter/Space to select).

Detection Class Distribution

Objects are detected in these security-relevant categories:

  • People: person
  • Vehicles: car, truck, bus, motorcycle, bicycle
  • Animals: dog, cat, bird

Model Zoo Section

The Model Zoo contains 18+ specialized AI models that enhance your security detections beyond basic object detection. These models extract additional details like license plates, faces, clothing, and vehicle types.

Summary Bar

The Model Zoo summary bar at the top displays key statistics:

| Indicator | Description |
| --- | --- |
| Loaded | Models currently in GPU memory (green dot) |
| Unloaded | Available models not currently loaded (gray) |
| Disabled | Temporarily disabled models (yellow) |
| VRAM | GPU memory usage (used/budget) |

VRAM (Video RAM) is the GPU memory used by loaded models. The Model Zoo has a dedicated budget of 1650 MB separate from core AI models.

Latency Chart

The latency chart shows inference time trends for any Model Zoo model:

  1. Select a model using the dropdown menu at the top right
  2. View timing data displayed as three lines:
       • Avg (ms) - Average inference time (emerald green)
       • P50 (ms) - Median inference time (blue)
       • P95 (ms) - 95th percentile time (amber)

The time axis shows the last 60 minutes of data.

Chart Legend:

| Line Color | Metric | Meaning |
| --- | --- | --- |
| Emerald | Average | Typical inference time |
| Blue | P50/Median | Half of inferences are faster than this |
| Amber | P95 | 95% of inferences are faster than this |
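
To make the three lines concrete, here is how average, P50, and P95 relate on a small sample of latencies. The nearest-rank percentile helper is illustrative (Grafana computes these server-side); note how a few slow inferences pull the average well above the median, which is why the dashboard shows percentiles at all.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value p% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 35, 90, 140]
avg = sum(latencies_ms) / len(latencies_ms)  # 37.7 — dragged up by outliers
p50 = percentile(latencies_ms, 50)           # 16 — the typical inference
p95 = percentile(latencies_ms, 95)           # 140 — the slow tail
```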

No data? If a model shows "No data available," it either has not been used recently or is disabled.

Model Status Cards

Below the chart, each Model Zoo model appears as a status card with:

| Element | Description |
| --- | --- |
| Model Name | Human-readable name of the model |
| Status Dot | Color-coded health indicator |
| Status Label | Current state (Loaded, Unloaded, Loading, etc.) |
| VRAM | GPU memory required when loaded |
| Last Used | Time since model was last used ("2h ago", "Never") |
| Category | Model type (Detection, Classification, etc.) |

Model Status Indicators

| Status | Dot Color | Meaning |
| --- | --- | --- |
| Loaded | Green | Model is in GPU memory and ready |
| Loading | Blue (pulsing) | Model is currently being loaded |
| Unloaded | Gray | Model available but not loaded |
| Disabled | Yellow | Model is turned off |
| Error | Red | Model failed to load |

Active vs Disabled Models

Models are organized into two sections:

  • Active Models - Enabled and available for use
  • Disabled Models - Turned off (grayed out, appear at bottom)

Why are some models disabled?

  • Incompatible with current software versions
  • Moved to a dedicated service
  • Not yet released
  • Temporarily turned off for maintenance

Model Zoo Categories

Detection Models

| Model | VRAM | Purpose |
| --- | --- | --- |
| YOLO11 License Plate | 300 MB | Find license plates on vehicles |
| YOLO11 Face | 200 MB | Detect faces on people |
| YOLO World S | 1500 MB | Open vocabulary detection |
| Vehicle Damage Detection | 2000 MB | Find damage on vehicles |

Classification Models

| Model | VRAM | Purpose |
| --- | --- | --- |
| Violence Detection | 500 MB | Identify violent activity |
| Weather Classification | 200 MB | Determine weather conditions |
| Fashion CLIP | 500 MB | Classify clothing types |
| Vehicle Segment Classifier | 1500 MB | Identify vehicle types |
| Pet Classifier | 200 MB | Distinguish cats and dogs |

Other Specialized Models

| Model | VRAM | Category | Purpose |
| --- | --- | --- | --- |
| SegFormer Clothes | 1500 MB | Segmentation | Clothing segmentation |
| ViTPose Small | 1500 MB | Pose | Human pose estimation |
| Depth Anything V2 | 150 MB | Depth | Distance estimation |
| CLIP ViT-L | 800 MB | Embedding | Visual embeddings |
| PaddleOCR | 100 MB | OCR | Read text from plates |
| X-CLIP Base | 2000 MB | Action Recognition | Recognize activities |

Understanding Model Memory (VRAM)

Models load into your GPU's video memory (VRAM) when needed:

  • VRAM Budget: 1650 MB for the Model Zoo
  • Loading Strategy: One model loads at a time (sequential)
  • Automatic Management: Models load/unload based on demand

Why does this matter?

  • Loaded models respond instantly
  • Unloaded models need time to load before first use
  • VRAM constraints limit how many models can be loaded simultaneously

Note: The core YOLO26 (~650 MB) and Nemotron (~21,700 MB) models have separate VRAM allocations and are always loaded.
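
The budgeted, demand-driven loading described above can be sketched as an LRU cache over VRAM. This is an illustration under assumed behavior, not the actual loader; the class and method names are hypothetical.

```python
from collections import OrderedDict

BUDGET_MB = 1650  # the Model Zoo's dedicated VRAM budget

class ModelZoo:
    """Sketch: load models on demand, evicting least-recently-used ones
    when the budget would be exceeded."""

    def __init__(self) -> None:
        self.loaded: "OrderedDict[str, int]" = OrderedDict()  # name -> VRAM MB

    def ensure_loaded(self, name: str, vram_mb: int) -> None:
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return
        # Evict LRU models until the new one fits within the budget.
        while self.loaded and sum(self.loaded.values()) + vram_mb > BUDGET_MB:
            self.loaded.popitem(last=False)
        self.loaded[name] = vram_mb

zoo = ModelZoo()
zoo.ensure_loaded("yolo11_license_plate", 300)
zoo.ensure_loaded("yolo11_face", 200)
zoo.ensure_loaded("yolo_world_s", 1500)  # forces eviction of both older models
```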

Model Zoo Analytics

Below the Model Zoo status cards, you see additional analytics:

Model Contribution Chart

A bar chart showing which models contribute most to event enrichment:

  • Higher bars = More frequently used models
  • Sorted by contribution = Most useful models at top
  • Hover for details = See exact percentage

Model Leaderboard

A sortable table ranking models by contribution:

| Column | Description |
| --- | --- |
| Rank | Position (top 3 have badges) |
| Model | Model name |
| Contribution | Percentage of events this model enriched |
| Events | Number of events processed |
| Quality | Correlation with good AI assessments |

Click column headers to sort by that metric.

Settings & Configuration

Grafana Configuration

The AI Performance page embeds a Grafana dashboard. The dashboard URL is fetched from the backend configuration API.

| Setting | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| Grafana URL | GRAFANA_URL | http://localhost:3002 | URL of the Grafana instance |
| Dashboard UID | N/A | hsi-consolidated | The dashboard to display |

The page loads the dashboard at: {grafana_url}/d/hsi-consolidated?orgId=1&kiosk
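
Assembling that URL from the config API response might look like this sketch (the fetch itself is omitted, and the helper is hypothetical; `config` stands in for the parsed JSON response):

```python
DASHBOARD_UID = "hsi-consolidated"

def kiosk_url(config: dict) -> str:
    """Build the embedded dashboard URL from the backend config response."""
    grafana_url = config.get("grafana_url", "http://localhost:3002")
    return f"{grafana_url}/d/{DASHBOARD_UID}?orgId=1&kiosk"

kiosk_url({"grafana_url": "http://localhost:3002"})
# "http://localhost:3002/d/hsi-consolidated?orgId=1&kiosk"
```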

To access Grafana directly with full editing capabilities, click the "Open in Grafana" button in the page header. This opens the same dashboard without kiosk mode, allowing you to:

  • Adjust time ranges
  • Modify queries
  • Create alerts
  • Export data

AI Service Configuration

| Setting | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| YOLO26 URL | YOLO26_URL | http://localhost:8095 | Detection service endpoint |
| Nemotron URL | NEMOTRON_URL | http://localhost:8091 | LLM analysis endpoint |
| Detection Confidence | DETECTION_CONFIDENCE_THRESHOLD | 0.5 | Minimum confidence to store detection |
| Batch Window | BATCH_WINDOW_SECONDS | 90 | Maximum batch duration |
| Idle Timeout | BATCH_IDLE_TIMEOUT_SECONDS | 30 | Close batch after this idle period |
| Fast Path Confidence | FAST_PATH_CONFIDENCE_THRESHOLD | 0.90 | Confidence for immediate analysis |
| Fast Path Types | FAST_PATH_OBJECT_TYPES | ["person"] | Object types eligible for fast path |
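
A minimal sketch of the fast-path decision implied by these settings. Only the environment variable names and defaults come from the table; the helper itself is illustrative, not the actual backend code.

```python
import json
import os

# Defaults match the configuration table above.
FAST_PATH_CONFIDENCE = float(os.getenv("FAST_PATH_CONFIDENCE_THRESHOLD", "0.90"))
FAST_PATH_TYPES = json.loads(os.getenv("FAST_PATH_OBJECT_TYPES", '["person"]'))

def takes_fast_path(object_type: str, confidence: float) -> bool:
    """High-confidence detections of eligible types skip the batch window."""
    return object_type in FAST_PATH_TYPES and confidence >= FAST_PATH_CONFIDENCE

takes_fast_path("person", 0.95)  # True: analyzed immediately
takes_fast_path("car", 0.99)     # False: waits for the normal batch window
```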

Refresh Settings

The AI Performance page relies on Grafana's built-in refresh mechanism. The dashboard's refresh interval is configured within Grafana itself.

| Setting | Location | Description |
| --- | --- | --- |
| Dashboard Refresh | Grafana dashboard settings | Auto-refresh interval for the embedded dashboard |
| Manual Refresh | Page header button | Reloads the Grafana iframe on demand |

Note: The standalone AI metrics components (available for other pages) use a 5-second polling interval with a 60-minute latency history window.

Troubleshooting

Grafana Dashboard Shows "Failed to Load"

  1. Verify Grafana is running: curl http://localhost:3002/api/health
  2. Check the backend config endpoint returns grafana_url: curl http://localhost:8000/api/system/config
  3. Ensure network/firewall allows iframe embedding from Grafana
  4. Check browser console for CORS or content security policy errors

Model Shows "Unhealthy" Status

YOLO26:

  1. Check if the detection server is running: curl http://localhost:8095/health
  2. Verify GPU is available: nvidia-smi
  3. Check container logs: docker logs yolo26
  4. VRAM exhaustion may require restarting the service

Nemotron:

  1. Check if llama.cpp server is running: curl http://localhost:8091/health
  2. Verify model file exists and is valid
  3. Check VRAM availability (Nemotron needs ~14.7GB for production model)
  4. Review llama.cpp logs for memory allocation errors

High Latency Detected

Detection Latency > 500ms:

  • GPU may be under thermal throttling
  • Check for other GPU workloads
  • Verify CUDA is being used (not CPU fallback)

Analysis Latency > 30s:

  • LLM context may be too large
  • Consider reducing batch size
  • Check if model is fully loaded in VRAM

Queue Backlogs:

  • Processing cannot keep up with incoming images
  • Consider reducing camera upload frequency
  • Check for downstream service bottlenecks

Dead Letter Queue Items Appearing

Items in the DLQ indicate failed processing jobs. To investigate:

  1. Check the DLQ Monitor in Settings or use the API
  2. Review the error messages for each failed job (includes error type, stack trace, HTTP status)
  3. Common causes:
       • Temporary service unavailability
       • Malformed image files
       • LLM response parsing failures
       • Network timeouts

DLQ API Endpoints:

View DLQ statistics:

curl http://localhost:8000/api/dlq/stats

List jobs in a specific DLQ:

# Detection queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:detection_queue?start=0&limit=10"

# Analysis queue DLQ
curl "http://localhost:8000/api/dlq/jobs/dlq:analysis_queue?start=0&limit=10"

Requeue all jobs from a DLQ back to processing (requires API key if authentication enabled):

# Requeue detection queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:detection_queue

# Requeue analysis queue DLQ items
curl -X POST http://localhost:8000/api/dlq/requeue-all/dlq:analysis_queue

Clear a DLQ (permanently removes all jobs):

curl -X DELETE http://localhost:8000/api/dlq/dlq:detection_queue

Metrics Not Updating

  1. Verify backend is healthy: curl http://localhost:8000/api/system/health
  2. Check Prometheus metrics endpoint: curl http://localhost:8000/api/metrics
  3. Verify Redis is connected (used for queue depth metrics)
  4. Try manual refresh using the "Refresh" button
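
If you need to inspect `/api/metrics` by hand, the Prometheus exposition format is line-oriented and easy to parse. This is a minimal illustrative parser; the metric names in the sample are examples, not necessarily the service's actual names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus exposition lines into a name -> value map."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP detection_queue_depth Images awaiting YOLO26 processing
# TYPE detection_queue_depth gauge
detection_queue_depth 3
analysis_queue_depth 0
"""
parse_metrics(sample)["detection_queue_depth"]  # 3.0
```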

Model Zoo Troubleshooting

Model Showing "Error" Status

Symptoms: Model card shows red dot and "Error" label.

Possible causes:

  • Model file is missing or corrupted
  • Insufficient GPU memory
  • Model incompatible with current GPU

What to do:

  1. Check the System page for GPU memory status
  2. Note the model name and check system logs
  3. Restart the AI service if multiple models show errors

Model Never Loads

Symptoms: Model stays "Unloaded" even when its function should trigger.

Possible causes:

  • No detections that require this model (e.g., no license plates seen)
  • Model is disabled in configuration
  • Queue is backed up with other processing

What to do:

  1. Check if the model is in the "Disabled Models" section
  2. Wait for normal detection activity
  3. Check the Pipeline Health panel for queue issues

High Latency on a Model

Symptoms: Latency chart shows consistently high times (P95 above 500ms for detection models).

Possible causes:

  • GPU under heavy load
  • Model being loaded/unloaded frequently
  • Large number of objects in images

What to do:

  1. Check GPU utilization on the System page
  2. Look for patterns in the latency chart
  3. Remember that elevated latency is normal during high-activity periods

"No Data Available" for Model Latency

Symptoms: Latency chart shows "No data available for [model name]"

This is normal when:

  • The model has not been used in the last hour
  • The model is disabled
  • No detections have triggered this model type

What to do: Nothing - this is informational. Data appears when the model is used.

Technical Deep Dive

For developers wanting to understand the underlying systems.

Architecture

  • AI Pipeline Architecture: AI Pipeline - Complete flow from file upload to event creation
  • YOLO26 Integration: Object detection with Real-Time Detection Transformer v2
  • Nemotron LLM: Risk analysis using NVIDIA Nemotron-3-Nano-30B-A3B via llama.cpp
  • Batch Aggregation: Time-window based grouping of related detections

Data Flow

Camera FTP Upload
        |
        v
FileWatcher (inotify/FSEvents)
        |
        v
detection_queue (Redis)
        |
        v
YOLO26 Server (Port 8095)
        |
        v
BatchAggregator
        |
   +----+----+
   |         |
   v         v
Fast Path   Normal Batching (30-90s windows)
   |         |
   +----+----+
        |
        v
analysis_queue (Redis)
        |
        v
Nemotron LLM (Port 8091)
        |
        v
Event Creation + WebSocket Broadcast

API Endpoints

| Endpoint | Description |
| --- | --- |
| /api/metrics | Prometheus metrics in exposition format |
| /api/system/health | Overall system and AI service health |
| /api/system/telemetry | Queue depths and basic stats |
| /api/system/pipeline-latency | Detailed pipeline latency percentiles |
| /api/detections/stats | Detection class distribution |
| /api/events/stats | Risk level distribution |
| /api/dlq/stats | Dead letter queue counts |

AI Performance Page (Grafana Embed):

| Component | File Path |
| --- | --- |
| AI Performance Page | frontend/src/components/ai/AIPerformancePage.tsx |
| Config API Service | frontend/src/services/api.ts (fetchConfig) |

Standalone AI Metrics Components (available for use elsewhere, not currently used on AI Performance page):

| Component | File Path | Purpose |
| --- | --- | --- |
| AI Metrics Hook | frontend/src/hooks/useAIMetrics.ts | Fetches and combines AI metrics from multiple endpoints |
| Metrics Parser | frontend/src/services/metricsParser.ts | Parses Prometheus metrics format |
| Model Status Cards | frontend/src/components/ai/ModelStatusCards.tsx | YOLO26 and Nemotron status badges |
| Latency Panel | frontend/src/components/ai/LatencyPanel.tsx | Latency histograms with percentiles |
| Pipeline Health Panel | frontend/src/components/ai/PipelineHealthPanel.tsx | Queue depths and error counts |
| Insights Charts | frontend/src/components/ai/InsightsCharts.tsx | Detection and risk distribution charts |

Backend Services:

| Component | File Path |
| --- | --- |
| Backend Metrics | backend/core/metrics.py |
| System Routes | backend/api/routes/system.py |
| DLQ Routes | backend/api/routes/dlq.py |
| Detector Client | backend/services/detector_client.py |
| Nemotron Analyzer | backend/services/nemotron_analyzer.py |
| Batch Aggregator | backend/services/batch_aggregator.py |

Note: The standalone AI metrics components are exported from frontend/src/components/ai/index.ts and can be used on other pages that need to display AI metrics directly (without Grafana).

GPU Requirements

| Service | Model | VRAM | Port | Context Window |
| --- | --- | --- | --- | --- |
| YOLO26 | yolo26_v2_r101vd | ~4 GB | 8095 | N/A |
| Nemotron (Prod) | Nemotron-3-Nano-30B-A3B | ~14.7 GB | 8091 | 128K tokens |
| Nemotron (Dev) | Nemotron Mini 4B Instruct | ~3 GB | 8091 | 4K tokens |
| Total (Production) | | ~18.7 GB | | |
| Total (Development) | | ~7 GB | | |

The system is optimized for NVIDIA RTX A5500 (24GB VRAM) in production. For development, a GPU with 8GB+ VRAM is recommended.

Production model advantages:

  • 128K context window enables analyzing hours of detection history in a single prompt
  • Better reasoning quality for complex security scenarios
  • Supports detailed enrichment data (clothing, vehicles, behavior patterns)

Development model trade-offs:

  • Faster inference (~100-200 tokens/second vs ~50-100)
  • Limited context requires more aggressive summarization
  • Lower VRAM footprint allows running on consumer GPUs