LLM Analysis Flow¶
This document describes the Nemotron LLM analysis process, including prompt construction, request handling, retry logic, and response parsing.

Analysis Flow Overview¶
Source: backend/services/nemotron_analyzer.py:1-28
# backend/services/nemotron_analyzer.py:1-28
"""Nemotron analyzer service for LLM-based risk assessment.
This service analyzes batches of detections using the Nemotron LLM
via llama.cpp server to generate risk scores and natural language summaries.
Analysis Flow:
1. Fetch batch detections from Redis/database
2. Enrich context with zones, baselines, and cross-camera activity
3. Run enrichment pipeline for license plates, faces, OCR (optional)
4. Format prompt with enriched detection details
5. Acquire shared AI inference semaphore (NEM-1463)
6. POST to llama.cpp completion endpoint (with retry on transient failures)
7. Release semaphore
8. Parse JSON response
9. Create Event with risk assessment
10. Store Event in database
11. Broadcast via WebSocket (if available)
"""
Analysis Sequence Diagram¶
sequenceDiagram
participant AQ as Analysis Queue
participant AW as Analysis Worker
participant NA as NemotronAnalyzer
participant CE as ContextEnricher
participant EP as EnrichmentPipeline
participant Sem as AI Semaphore
participant LLM as Nemotron LLM
participant DB as PostgreSQL
participant EB as EventBroadcaster
AQ->>AW: Dequeue batch
AW->>NA: analyze_batch(batch_id)
NA->>DB: Fetch Detection records
NA->>CE: get_enriched_context()
CE-->>NA: Zone, baseline, cross-camera data
opt Enrichment enabled
NA->>EP: enrich_detections()
EP-->>NA: License plates, faces, OCR
end
NA->>NA: Format prompt
rect rgb(255, 240, 240)
Note over NA,LLM: Concurrency-limited section
NA->>Sem: Acquire (max 4 concurrent)
NA->>LLM: POST /completion
Note over LLM: 2-10s typical, 120s timeout
LLM-->>NA: JSON response
NA->>Sem: Release
end
NA->>NA: Parse response
NA->>NA: Validate risk data
NA->>DB: INSERT Event
NA->>EB: broadcast_event()
EB-->>AW: Broadcast complete NemotronAnalyzer Class¶
Source: backend/services/nemotron_analyzer.py:135-237
# backend/services/nemotron_analyzer.py:135-237
class NemotronAnalyzer:
"""Analyzes detection batches using Nemotron LLM for risk assessment.
Features:
- Retry logic with exponential backoff for transient failures (NEM-1343)
- Configurable timeouts and retry attempts via settings
- Context enrichment with zone, baseline, and cross-camera data
- Enrichment pipeline for license plates, faces, and OCR
"""
def __init__(
self,
redis_client: RedisClient | None = None,
context_enricher: ContextEnricher | None = None,
enrichment_pipeline: EnrichmentPipeline | None = None,
use_enriched_context: bool = True,
use_enrichment_pipeline: bool = True,
max_retries: int | None = None,
service_facade: AnalyzerServiceFacade | None = None,
):
Configuration¶
| Parameter | Default | Source |
|---|---|---|
| Connect timeout | 10s | NEMOTRON_CONNECT_TIMEOUT (line 130) |
| Read timeout | 120s | NEMOTRON_READ_TIMEOUT (line 131) |
| Health timeout | 5s | NEMOTRON_HEALTH_TIMEOUT (line 132) |
| Max retries | 3 | nemotron_max_retries setting |
| Max concurrent | 4 | AI_MAX_CONCURRENT_INFERENCES setting |
Concurrency Control¶
Source: backend/services/nemotron_analyzer.py:19-22
# backend/services/nemotron_analyzer.py:19-22
# Concurrency Control (NEM-1463):
# Uses a shared asyncio.Semaphore to limit concurrent AI inference operations.
# This prevents GPU/AI service overload under high traffic. The limit is
# configurable via AI_MAX_CONCURRENT_INFERENCES setting (default: 4).
Semaphore Acquisition Flow¶
Request 1: Acquire semaphore (count: 4 -> 3) -> Process -> Release (count: 3 -> 4)
Request 2: Acquire semaphore (count: 4 -> 3) -> Process -> Release
Request 3: Acquire semaphore (count: 4 -> 3) -> Process -> Release
Request 4: Acquire semaphore (count: 4 -> 3) -> Process -> Release
Request 5: WAIT (count: 0) -> Acquire when available -> Process -> Release
Retry Logic¶
Source: backend/services/nemotron_analyzer.py:24-27
# backend/services/nemotron_analyzer.py:24-27
# Retry Logic (NEM-1343):
# - Configurable max retries via NEMOTRON_MAX_RETRIES setting (default: 3)
# - Exponential backoff: 2^attempt seconds between retries (capped at 30s)
# - Only retries transient failures (connection, timeout, HTTP 5xx)
Retry Timing¶
| Attempt | Backoff | Cumulative Wait |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 2s | 2s |
| 3 | 4s | 6s |
| 4 | 8s (capped at 30s) | 14s |
Retriable Errors¶
| Error Type | Retriable | Reason |
|---|---|---|
httpx.ConnectError | Yes | Transient network issue |
httpx.TimeoutException | Yes | Server overload |
| HTTP 5xx | Yes | Server error |
| HTTP 4xx | No | Client error |
| JSON parse error | No | Response format issue |
| Validation error | No | Invalid response data |
Prompt Construction¶

Prompt Templates¶
Source: backend/services/nemotron_analyzer.py:95-114
# backend/services/nemotron_analyzer.py:95-114
from backend.services.prompts import (
ENRICHED_RISK_ANALYSIS_PROMPT,
FULL_ENRICHED_RISK_ANALYSIS_PROMPT,
MODEL_ZOO_ENHANCED_RISK_ANALYSIS_PROMPT,
RISK_ANALYSIS_PROMPT,
VISION_ENHANCED_RISK_ANALYSIS_PROMPT,
format_action_recognition_context,
format_camera_health_context,
format_clothing_analysis_context,
format_depth_context,
format_detections_with_all_enrichment,
format_household_context,
format_image_quality_context,
format_pet_classification_context,
format_pose_analysis_context,
format_vehicle_classification_context,
format_vehicle_damage_context,
format_violence_context,
format_weather_context,
)
Prompt Components¶
| Component | Source | Content |
|---|---|---|
| Detection details | Database | Object labels, bounding boxes, confidence |
| Zone context | ContextEnricher | Zone definitions, baseline deviations |
| Cross-camera activity | ContextEnricher | Related detections on other cameras |
| License plates | EnrichmentPipeline | OCR results from vehicles |
| Face detections | EnrichmentPipeline | Face locations in person detections |
| Weather context | EnrichmentPipeline | Current weather conditions |
| Time context | System | Time of day, day of week |
Example Prompt Structure¶
<|im_start|>system
You are a security analysis AI. Analyze the following detections and provide a risk assessment.
Current time: 2024-12-23 22:15:00 (Night, Monday)
Weather: Clear, 45°F
Zone: Front Yard (Entry Zone)
Baseline: Typically 0-2 person detections per hour at this time
Current: 3 person detections (above baseline)
Cross-camera activity:
- Side Gate camera: 1 person detection 2 minutes ago
- Driveway camera: 1 vehicle detection 5 minutes ago
<|im_end|>
<|im_start|>user
Detections in this batch:
1. person (confidence: 0.92) at [120, 340, 280, 580]
2. person (confidence: 0.87) at [400, 320, 520, 560]
3. car (confidence: 0.95) at [50, 100, 350, 300]
License plate detected: ABC 123 (California)
Unknown vehicle - not in household database
Provide your risk assessment as JSON:
{
"risk_score": <0-100>,
"risk_level": "<low|medium|high|critical>",
"summary": "<brief description>",
"reasoning": "<detailed explanation>"
}
<|im_end|>
LLM Request¶
Source: backend/services/nemotron_analyzer.py:455-487
# backend/services/nemotron_analyzer.py:455-487
async def _call_llm_with_version(
self,
context: str,
prompt_version: str = "v1_original",
) -> dict[str, Any]:
"""Call LLM with a specific prompt version."""
settings = get_settings()
max_output_tokens = settings.nemotron_max_output_tokens
payload = {
"prompt": context,
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": max_output_tokens,
"stop": ["<|im_end|>", "<|im_start|>"],
}
headers = {"Content-Type": "application/json"}
headers.update(self._get_auth_headers())
async with httpx.AsyncClient(timeout=self._timeout) as client:
response = await client.post(
f"{self._llm_url}/completion",
json=payload,
headers=headers,
)
response.raise_for_status()
llm_result = response.json()
Request Parameters¶
| Parameter | Value | Purpose | | ------------- | ------------ | -------------------------- | ------ | -------- | ---- | ------------------------------- | | temperature | 0.7 | Moderate creativity | | top_p | 0.95 | Nucleus sampling threshold | | max_tokens | Configurable | Limit response length | | stop | ["< | im_end | >", "< | im_start | >"] | Stop generation at chat markers |
Response Parsing¶
Source: backend/services/nemotron_analyzer.py:116-119
# backend/services/nemotron_analyzer.py:116-119
# Pre-compiled regex patterns for LLM response parsing
# These are compiled once at module load time for better performance
_THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)
_JSON_PATTERN = re.compile(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}", re.DOTALL)
Parsing Steps¶
- Remove
<think>blocks (reasoning scaffolding) - Extract JSON object from response text
- Validate required fields
- Normalize risk score to 0-100 range
- Derive risk_level if not provided
Expected Response Format¶
{
"risk_score": 75,
"risk_level": "high",
"summary": "Multiple unknown individuals detected at front entrance after dark",
"reasoning": "Three persons detected at entry zone during nighttime hours (22:15). This exceeds the typical baseline of 0-2 detections. An unfamiliar vehicle with out-of-state plates is also present. The combination of multiple unknown individuals and an unrecognized vehicle at this hour warrants elevated concern."
}
A/B Testing Support¶
Source: backend/services/nemotron_analyzer.py:223-259
# backend/services/nemotron_analyzer.py:223-259
# A/B Testing support (NEM-1667)
self._ab_tester: Any | None = None # PromptABTester when configured
self._ab_config: Any | None = None # ABTestConfig when configured
# Prompt Experiment support (NEM-3023)
self._experiment_config: Any | None = None # PromptExperimentConfig when configured
# A/B Rollout Manager support (NEM-3338)
self._rollout_manager: Any | None = None # ABRolloutManager when configured
Shadow Mode Analysis¶
Source: backend/services/nemotron_analyzer.py:348-423
# backend/services/nemotron_analyzer.py:348-423
async def run_shadow_analysis(
self,
camera_id: str,
context: str,
) -> dict[str, Any]:
"""Run shadow mode analysis with both prompt versions.
In shadow mode, runs both V1 and V2 prompts but returns V1 results
as the primary output. V2 results are logged for comparison analysis.
"""
Shadow mode enables safe prompt experimentation:
- Run both V1 (control) and V2 (treatment) prompts
- Return V1 result as primary output
- Log V2 result for comparison
- Track score differences and latency
Cold Start and Warmup¶
Source: backend/services/nemotron_analyzer.py:216-221
# backend/services/nemotron_analyzer.py:216-221
# Cold start and warmup tracking (NEM-1670)
self._last_inference_time: float | None = None
self._is_warming: bool = False
self._warmup_enabled = settings.ai_warmup_enabled
self._cold_start_threshold = settings.ai_cold_start_threshold_seconds
self._warmup_prompt = settings.nemotron_warmup_prompt
Cold Start Detection¶
Model is considered "cold" if:
- Never used (
_last_inference_time is None) - Not used within threshold (default: 300s)
Cold starts may have higher latency due to:
- Model loading into GPU memory
- CUDA context initialization
- JIT compilation of kernels
Error Handling¶
Error Categories¶
| Error | Handling | User Impact |
|---|---|---|
| Connection error | Retry 3x with backoff | Delayed analysis |
| Timeout | Retry 3x with backoff | Delayed analysis |
| HTTP 5xx | Retry 3x with backoff | Delayed analysis |
| HTTP 4xx | Fail immediately | Batch skipped |
| Parse error | Fail immediately | Batch skipped |
| Validation error | Fail immediately | Batch skipped |
| Semaphore timeout | Wait or fail | Queued behind others |
Graceful Degradation¶
If LLM is unavailable:
- Batch remains in analysis queue
- Will be retried on next worker cycle
- After max retries, batch is moved to DLQ
- Frontend shows detections without risk assessment
Metrics and Observability¶
Source: backend/services/nemotron_analyzer.py:56-66
# backend/services/nemotron_analyzer.py:56-66
from backend.core.metrics import (
observe_ai_request_duration,
observe_risk_score,
observe_stage_duration,
record_event_by_camera,
record_event_by_risk_level,
record_event_created,
record_nemotron_tokens,
record_pipeline_error,
record_prompt_template_used,
)
Recorded Metrics¶
| Metric | Type | Labels |
|---|---|---|
hsi_ai_request_duration_seconds | Histogram | service=nemotron |
hsi_risk_score | Histogram | camera_id |
hsi_events_total | Counter | risk_level, camera_id |
hsi_nemotron_tokens_total | Counter | type=input/output |
hsi_pipeline_errors_total | Counter | stage=nemotron_analysis |
Timing Summary¶
| Phase | Typical Duration | Max Duration |
|---|---|---|
| Fetch detections | <100ms | 500ms |
| Context enrichment | 100-500ms | 2s |
| Enrichment pipeline | 500ms-5s | 30s |
| Prompt formatting | <50ms | 200ms |
| Semaphore wait | 0-10s | 60s |
| LLM inference | 2-10s | 120s |
| Response parsing | <10ms | 100ms |
| Event creation | <100ms | 500ms |
| WebSocket broadcast | <10ms | 100ms |
Total: 3-25s typical (dominated by LLM inference)
Related Documents¶
- image-to-event.md - Complete pipeline context
- batch-aggregation-flow.md - What triggers analysis
- enrichment-pipeline.md - Pre-analysis enrichment
- error-recovery-flow.md - Retry patterns