AI Audit¶

The AI Audit page provides transparency into how the AI models make decisions, allowing you to understand, evaluate, and improve the AI-powered security analysis.

Overview¶

The AI Audit Dashboard helps you understand how well the AI system is performing. It shows quality metrics, consistency scores, and recommendations for improving the AI's analysis. This page is useful for power users who want to fine-tune their security system or verify the AI is working correctly.

Accessing the AI Audit Dashboard¶

Click AI Audit in the left sidebar
The dashboard loads with data from the last 7 days by default
Use the period selector to change the time range

What You're Looking At¶

Across its tabs, the AI Audit page provides:

Quality Metrics - How well the AI is performing (1-5 scale scores)
Model Contributions - Which AI models contributed to analyses
Recommendations - AI-generated suggestions for improving prompt templates
Prompt Playground - Interactive environment for testing prompt modifications
Version History - Track and restore previous prompt configurations

This page is essential for maintaining and improving AI accuracy over time. By analyzing the AI's self-evaluation, you can identify patterns where the AI struggles and make targeted improvements.

Key Components¶

The AI Audit page uses a tabbed interface with four tabs: Dashboard, Prompt Playground, Batch Audit, and Version History. The page title displays as "AI Audit Dashboard", and model contribution rates appear alongside the quality metrics on the Dashboard tab.

Dashboard Tab¶

The default view showing aggregate quality metrics and recommendations.

Quality Score Metrics¶

Four stat cards display aggregate performance over the selected time period:

Metric	Description
Average Quality Score	Overall quality of AI analyses (1-5 scale, higher is better). Displayed as "X.X / 5" with a progress bar. Calculated only from fully evaluated events.
Consistency Rate	How consistent the AI is when re-analyzing the same events (1-5 scale). Measures risk score consistency on re-evaluation.
Enrichment Utilization	Percentage of AI models contributing to analyses (0-100%). Indicates how many enrichment sources were available and used.
Evaluation Coverage	Percentage of events that have been fully evaluated. Shows "X of Y events evaluated".

Score Interpretation:

4.0-5.0 (Green): Excellent - AI is performing well
3.0-3.9 (Yellow): Good - Room for improvement
1.0-2.9 (Red): Needs attention - Consider prompt adjustments
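The color bands can be expressed as a small lookup. A minimal sketch, assuming scores arrive as floats on the 1-5 scale; `score_band` is a hypothetical helper, not a function from the frontend code, and it assigns values that fall between the displayed ranges (such as 3.95) to the lower band:

```python
def score_band(score: float) -> str:
    """Map a 1-5 quality or consistency score to its dashboard color band."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    if score >= 4.0:
        return "green"   # Excellent: AI is performing well
    if score >= 3.0:
        return "yellow"  # Good: room for improvement
    return "red"         # Needs attention: consider prompt adjustments


for s in (4.6, 3.4, 2.1):
    print(f"{s:.1f} -> {score_band(s)}")
```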

Interpreting Your Results¶

Signs of Good AI Performance:

Average Quality Score: 4.0 or higher
Consistency Rate: 4.0 or higher
Enrichment Utilization: 70% or higher
Evaluation Coverage: 80% or higher
Recommendations: Mostly low priority items

Warning Signs to Watch For:

Low quality scores - The AI may need configuration adjustments
Low consistency - Results vary too much; investigate why
Low enrichment - Some AI models may not be contributing
High-priority recommendations - Address these for better accuracy
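The thresholds above can be turned into a quick checklist. This sketch assumes you have already read the four stat values and the high-priority count off the dashboard; `AuditSnapshot` and `warning_signs` are hypothetical names, and the coverage warning is derived from the 80% guideline rather than taken from the UI:

```python
from dataclasses import dataclass


@dataclass
class AuditSnapshot:
    avg_quality: float             # 1-5 scale
    consistency_rate: float        # 1-5 scale
    enrichment_utilization: float  # percent, 0-100
    evaluation_coverage: float     # percent, 0-100
    high_priority_count: int       # from the Recommendations panel header


def warning_signs(snap: AuditSnapshot) -> list[str]:
    """Compare a snapshot against the 'good performance' thresholds."""
    warnings = []
    if snap.avg_quality < 4.0:
        warnings.append("Low quality scores: the AI may need configuration adjustments")
    if snap.consistency_rate < 4.0:
        warnings.append("Low consistency: results vary too much")
    if snap.enrichment_utilization < 70:
        warnings.append("Low enrichment: some AI models may not be contributing")
    if snap.evaluation_coverage < 80:
        warnings.append("Low coverage: consider running a batch audit")
    if snap.high_priority_count > 0:
        warnings.append("High-priority recommendations to address")
    return warnings


print(warning_signs(AuditSnapshot(4.3, 3.6, 82.0, 64.0, 2)))
```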

Model Contribution Breakdown¶

A horizontal bar chart showing the contribution rate of each AI model to event analyses. Each bar shows:

Model name with icon
Number of events the model contributed to
Percentage contribution rate (0-100%)

Model	Description
YOLO26	Object detection (always active)
Florence-2	Visual question-answering for scene details
X-CLIP	Action recognition (walking, running, etc.)
Violence Detection	Violence classifier for suspicious behavior
Clothing Analysis	FashionCLIP clothing identification
Vehicle Detection	Vehicle type and color classification
Pet Detection	Pet vs. wildlife classification
Weather Analysis	Environmental condition assessment
Image Quality	Camera image quality scoring
Zone Analysis	Entry point and security zone context
Baseline	Historical activity pattern comparison
Cross-Camera	Correlation with other camera detections

Models are sorted by contribution rate in descending order. A higher rate means the model's data was available and used in a larger share of analyses.
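A contribution rate can be read as the share of audited events in which a model's data was present. The sketch below assumes exactly that definition and uses per-event flags named after the `event_audits` columns (`has_yolo26`, `has_florence`, and so on); the `/api/ai-audit/leaderboard` endpoint may compute or label rates differently:

```python
def contribution_rates(audits: list[dict[str, bool]]) -> list[tuple[str, int, float]]:
    """Return (flag, event_count, rate_percent) rows, highest rate first."""
    total = len(audits)
    if total == 0:
        return []
    counts: dict[str, int] = {}
    for audit in audits:
        for flag, contributed in audit.items():
            counts.setdefault(flag, 0)
            if contributed:
                counts[flag] += 1
    rows = [(flag, n, 100.0 * n / total) for flag, n in counts.items()]
    # Descending by rate, as in the chart.
    return sorted(rows, key=lambda row: row[2], reverse=True)


audits = [
    {"has_yolo26": True, "has_florence": True, "has_vehicle": False},
    {"has_yolo26": True, "has_florence": False, "has_vehicle": True},
    {"has_yolo26": True, "has_florence": True, "has_vehicle": False},
    {"has_yolo26": True, "has_florence": False, "has_vehicle": False},
]
for flag, n, rate in contribution_rates(audits):
    print(f"{flag:<14} {n:>3} events  {rate:5.1f}%")
```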

Prompt Improvement Recommendations¶

AI-generated suggestions for improving prompt templates, displayed in an accordion grouped by category. The panel header shows the total "High Priority" count and how many events were analyzed.

Category	Description	Icon
Missing Context	Information that would help the AI make better assessments	Warning triangle
Unused Data	Provided data that was not useful for analysis	Info circle
Model Gaps	AI models that should have provided data but did not	Warning triangle
Format Suggestions	Ways to improve the prompt structure	Lightbulb
Confusing Sections	Parts of the prompt that were unclear or contradictory	Info circle

Each category accordion shows:

Item count - Number of suggestions in that category
High priority count - Shown when the category contains any high-priority suggestions

Each recommendation item shows:

Suggestion text - What to improve
Category badge - Color-coded by category type
Priority badge - High (red), Medium (yellow), or Low (gray)
Frequency count - How many events mentioned this (e.g., "5x")
Edit Prompt button - Opens the Prompt Playground with that recommendation context

Recommendations are sorted by priority (high first), then by frequency within each category.
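The ordering rule amounts to grouping by category and then sorting on a two-part key. A sketch, assuming each recommendation is a dict with hypothetical `category`, `priority` ("high", "medium", "low"), `frequency`, and `suggestion` fields:

```python
from collections import defaultdict

# Lower value sorts first, so high-priority items lead each category.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def group_recommendations(recs: list[dict]) -> dict[str, list[dict]]:
    """Group by category, then sort each group by priority and frequency."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for rec in recs:
        grouped[rec["category"]].append(rec)
    for items in grouped.values():
        # Priority first (high to low), then most frequently mentioned first.
        items.sort(key=lambda r: (PRIORITY_ORDER[r["priority"]], -r["frequency"]))
    return dict(grouped)


recs = [
    {"category": "missing_context", "priority": "low", "frequency": 9, "suggestion": "Add zone names"},
    {"category": "missing_context", "priority": "high", "frequency": 2, "suggestion": "Include time of day"},
    {"category": "missing_context", "priority": "high", "frequency": 5, "suggestion": "Include weather"},
    {"category": "unused_data", "priority": "medium", "frequency": 3, "suggestion": "Drop raw bbox list"},
]
for category, items in group_recommendations(recs).items():
    high = sum(r["priority"] == "high" for r in items)
    print(f"{category} ({len(items)} items, {high} high priority)")
    for r in items:
        print(f"  [{r['priority']}] {r['frequency']}x {r['suggestion']}")
```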

Prompt Playground Tab¶

An interactive slide-out panel (80% of the viewport width) for editing, testing, and refining AI model prompts. It opens when you click "Open Prompt Playground" or a recommendation's Edit Prompt button; press Escape to close it.

Supported Models¶

Each model has an accordion-style editor. The first model (Nemotron) is expanded by default. A "(modified)" indicator with a pulsing dot appears when you have unsaved changes.

Model	Editor Type	What You Can Configure
Nemotron	Full text editor with syntax highlighting	System prompt with highlighted variables like `{detections}`, `{cross_camera_data}`, `{weather}`, `{time_context}`. Also includes Temperature slider (0-2) and Max Tokens input (100-8192).
Florence-2	Multi-line text (one per line)	VQA queries for visual scene analysis
YOLO-World	Multi-line text + slider	Object classes (one per line) + confidence threshold slider (0-1)
X-CLIP	Multi-line text	Action recognition classes (one per line)
Fashion-CLIP	Two text areas	Clothing categories + suspicious indicators (one per line each)

Syntax Highlighting: The Nemotron editor highlights prompt variables like {variable_name} in green with a subtle background, and includes line numbers.
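The editor's exact highlighting rule is not documented here. Assuming variables are single-brace placeholders whose names are Python-style identifiers, a regular expression can list them, which also gives a handy check that an edited prompt still references every variable the original did (`prompt_variables` and `dropped_variables` are hypothetical helpers):

```python
import re

# Matches {identifier}-style placeholders such as {detections} or {weather}.
VARIABLE_PATTERN = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def prompt_variables(prompt: str) -> list[str]:
    """Return each distinct variable name in order of first appearance."""
    seen: list[str] = []
    for name in VARIABLE_PATTERN.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


def dropped_variables(original: str, edited: str) -> list[str]:
    """Variables present in the original prompt but missing from the edit."""
    kept = set(prompt_variables(edited))
    return [v for v in prompt_variables(original) if v not in kept]


system_prompt = (
    "Detections: {detections}\n"
    "Weather: {weather} at {time_context}\n"
    "Other cameras: {cross_camera_data}\n"
    "Repeat detections: {detections}"
)
print(prompt_variables(system_prompt))
```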

Diff Preview Flow¶

When the playground is opened from a recommendation that includes an enriched suggestion:

Preview Changes - See a side-by-side diff view showing original vs. modified prompt
Suggestion Explanation - View why this change is recommended with event links
Apply or Dismiss - Apply the suggestion to the editor or dismiss it
Applied Banner - After applying, shows "Suggestion applied. Test it or save to keep your changes."
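The Diff Preview is a side-by-side view; a plain-text approximation of the same comparison can be produced with Python's standard `difflib` module. This illustrates the idea and is not the playground's implementation:

```python
import difflib


def prompt_diff(original: str, modified: str) -> str:
    """Unified diff of two prompt versions (a text stand-in for the side-by-side view)."""
    lines = difflib.unified_diff(
        original.splitlines(),
        modified.splitlines(),
        fromfile="original",
        tofile="modified",
        lineterm="",
    )
    return "\n".join(lines)


before = "You are a security analyst.\nRate the risk from 0-100."
after = (
    "You are a security analyst.\n"
    "Consider {weather} when judging visibility.\n"
    "Rate the risk from 0-100."
)
print(prompt_diff(before, after))
```

For these two prompts the diff shows a single added line prefixed with `+`; identical prompts produce an empty string.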

A/B Testing Workflow¶

After applying a suggestion or making changes, the A/B Test section appears:

Run A/B Test - Tests the modified prompt against a real event (uses the Event ID field, or picks a random recent event)
View Results - Shows test count (e.g., "3 tests completed")
Run More Tests - Requires at least 3 tests before promoting
Promote B as Default - Opens a confirmation dialog showing:
Average score change (green when negative, indicating improvement; red when positive, indicating regression)
Improvement rate percentage
Confirm Promote - Saves the modified prompt as the new default

Note: The "Promote B" button shows a warning "Run at least 3 tests before promoting" if fewer than 3 tests have been run.
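The promotion gate and the two figures in the confirmation dialog can be sketched as follows. The page does not say which score the change is measured on, so this sketch assumes each test yields one delta (B's score minus A's) and, matching the dialog's colors, counts a negative delta as an improvement; the improvement rate is assumed to be the percentage of tests with a negative delta:

```python
from statistics import mean

MIN_TESTS_BEFORE_PROMOTE = 3


def ab_summary(score_deltas: list[float]) -> dict:
    """Summarize A/B runs; each delta is (B score - A score) for one test event.

    Following the dialog's colors, a negative delta counts as an improvement.
    """
    if len(score_deltas) < MIN_TESTS_BEFORE_PROMOTE:
        return {"can_promote": False,
                "warning": "Run at least 3 tests before promoting"}
    improved = sum(d < 0 for d in score_deltas)
    return {
        "can_promote": True,
        "avg_score_change": mean(score_deltas),
        "improvement_rate": 100.0 * improved / len(score_deltas),
    }


print(ab_summary([-8.0, 3.0]))
print(ab_summary([-8.0, -4.0, 3.0]))
```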

Test Configuration¶

A separate "Test Configuration" section at the bottom allows you to:

Enter an Event ID to test against
Click Run Test to compare before/after results
View side-by-side results showing:
Score and risk level (before and after)
Summary text
Whether the configuration "improved results" (green) or "did not improve results" (yellow)
Inference time in milliseconds

Save, Export, and Import¶

Each model has its own action buttons:

Reset - Revert to the original saved configuration (shows a toast notification)
Save - Persist changes to the database (creates a new version and shows "Saved!" with a checkmark)

Footer buttons:

Import JSON - Load configurations from a JSON file (validates format)
Export JSON - Download all prompt configurations as JSON (filename: prompt-configs-YYYY-MM-DD.json)

Toast Notifications: Success, error, and info messages appear in the bottom-right corner and auto-dismiss after 3 seconds.
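Because Import JSON validates the file before loading it, it can help to check a hand-edited export first. The JSON layout below (an object keyed by model name, with `system_prompt`, `temperature`, and `max_tokens` for Nemotron) is an assumption, as are the helper names; the numeric ranges come from the Supported Models table:

```python
import json
from datetime import date


def export_filename(day: date) -> str:
    """Filename pattern used by Export JSON: prompt-configs-YYYY-MM-DD.json."""
    return f"prompt-configs-{day.isoformat()}.json"


def validate_nemotron(config: dict) -> list[str]:
    """Check Nemotron settings against the ranges the editor enforces."""
    errors = []
    if not isinstance(config.get("system_prompt"), str):
        errors.append("system_prompt must be a string")
    temperature = config.get("temperature")
    if not isinstance(temperature, (int, float)) or not 0 <= temperature <= 2:
        errors.append("temperature must be between 0 and 2")
    max_tokens = config.get("max_tokens")
    if not isinstance(max_tokens, int) or not 100 <= max_tokens <= 8192:
        errors.append("max_tokens must be an integer from 100 to 8192")
    return errors


def load_configs(text: str) -> dict:
    """Parse an import file and reject it if the Nemotron entry is invalid."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by model name")
    if "nemotron" in data:
        errors = validate_nemotron(data["nemotron"])
        if errors:
            raise ValueError("; ".join(errors))
    return data


print(export_filename(date(2025, 1, 15)))
```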

Best Practices for Prompt Editing¶

Before Making Changes:

Export current configurations as a backup
Choose representative events for testing - include low, medium, and high risk events
Document your changes using the change description field

When Editing Prompts:

Make small, incremental changes - Small steps make it easier to identify what works
Use the variables provided - They ensure the AI has the context it needs
Be specific - Clear instructions produce more consistent results
Define output format - Specify the exact structure you expect

When Testing:

Test against multiple events - At least 3-5 different scenarios
Include edge cases - Test with challenging events
Compare scores - Look for consistency, not just improvement
Run A/B tests before promoting major changes

After Making Changes:

Monitor the AI Audit Dashboard for quality changes
Check the consistency rate - It should remain stable
Review new events to ensure the changes work in production
Roll back quickly if you see problems

Batch Audit Tab¶

Trigger batch evaluation of events to generate quality metrics and recommendations.

When to Use Batch Audit¶

After making changes to the AI configuration
When you want to analyze high-risk events more thoroughly
To generate fresh recommendations based on recent patterns
To verify AI consistency by re-evaluating the same events

The tab shows summary stats:

Total Events - Total events in the selected time period
Audited Events - Events that have at least partial audit data
Fully Evaluated - Events that have completed full evaluation

Batch Audit Modal¶

Click "Trigger Batch Audit" to open the configuration modal:

Option	Default	Range	Description
Event Limit	50	1-1000	Maximum number of events to process
Minimum Risk Score	50	0-100	Only process events with risk score at or above this value
Force Re-evaluate	Off	On/Off	Re-process events that have already been evaluated
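The modal's options correspond to the `limit`, `min_risk_score`, and `force_reevaluate` fields of the `POST /api/ai-audit/batch` body. The sketch below checks the documented ranges and shows, in simplified form, which events those options would select; the event dict shape and the input-order selection are assumptions, not the service's actual query:

```python
def batch_audit_body(limit: int = 50, min_risk_score: int = 50,
                     force_reevaluate: bool = False) -> dict:
    """Build the POST /api/ai-audit/batch body, enforcing the modal's ranges."""
    if not 1 <= limit <= 1000:
        raise ValueError("Event Limit must be between 1 and 1000")
    if not 0 <= min_risk_score <= 100:
        raise ValueError("Minimum Risk Score must be between 0 and 100")
    return {
        "limit": limit,
        "min_risk_score": min_risk_score,
        "force_reevaluate": force_reevaluate,
    }


def select_events(events: list[dict], body: dict) -> list[dict]:
    """Simplified selection: risk at or above the minimum, skip evaluated unless forced."""
    eligible = [
        e for e in events
        if e["risk_score"] >= body["min_risk_score"]
        and (body["force_reevaluate"] or not e["is_fully_evaluated"])
    ]
    # An empty result corresponds to "No events found matching criteria".
    return eligible[: body["limit"]]


events = [
    {"id": 1, "risk_score": 80, "is_fully_evaluated": True},
    {"id": 2, "risk_score": 65, "is_fully_evaluated": False},
    {"id": 3, "risk_score": 20, "is_fully_evaluated": False},
]
print([e["id"] for e in select_events(events, batch_audit_body())])
print([e["id"] for e in select_events(events, batch_audit_body(force_reevaluate=True))])
```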

Batch Audit Process¶

Click "Trigger Batch Audit" button
Configure options in the modal
Click "Start Batch Audit" (button shows "Processing..." while submitting)
Modal closes and a success banner appears: "Queued X events for evaluation"
The batch runs asynchronously in the background
Use the /api/ai-audit/batch/{job_id} endpoint to track progress (see API Endpoints)
Refresh the Dashboard tab to see updated metrics

Note: If no events match the criteria, the job completes immediately with "No events found matching criteria".

Self-Evaluation Modes¶

The audit service runs 4 evaluation passes on each event:

Mode	What It Does
Self-Critique	AI critiques its own previous analysis
Rubric Scoring	Scores on context usage, reasoning coherence, risk justification (1-5 scale each)
Consistency Check	Re-analyzes the event with a clean prompt and compares risk scores
Prompt Improvement	Identifies missing context, unused data, model gaps, format suggestions, and confusing sections

Version History Tab¶

View and restore previous prompt configurations.

Version History Features¶

Model Filter Dropdown - Filter versions by specific model or view "All Models"
Available options: All Models, Nemotron, Florence-2, YOLO-World, X-CLIP, Fashion-CLIP
Refresh Button - Reload the version history
Version Table - Shows version number, model, date, changes, status, and actions

Version Table Columns¶

Column	Description
Version	Sequential version identifier displayed as "vX" (e.g., v1, v2)
Model	Which model this version applies to (formatted display name)
Date	When created, shown as relative time (e.g., "2h ago", "3d ago") with full date on hover
Changes	Description of what changed, or "No description" if none provided
Status	"Active" (green badge) for current version, "Previous" (gray badge) for older versions
Actions	"Restore" button (only shown for non-active versions)
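Only the "2h ago" and "3d ago" forms are documented. The sketch below assumes the Date column truncates to whole units and adds hypothetical minute and "just now" forms for completeness; the full timestamp shown on hover is represented by `isoformat()`:

```python
from datetime import datetime, timedelta, timezone


def relative_time(created: datetime, now: datetime) -> str:
    """Render an age like the Date column ("2h ago", "3d ago"), truncating to whole units."""
    seconds = int((now - created).total_seconds())
    if seconds < 60:
        return "just now"               # hypothetical form
    if seconds < 3600:
        return f"{seconds // 60}m ago"  # hypothetical form
    if seconds < 86400:
        return f"{seconds // 3600}h ago"
    return f"{seconds // 86400}d ago"


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
created = now - timedelta(hours=2, minutes=10)
print(relative_time(created, now), "| hover:", created.isoformat())
```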

Restore Process¶

Click "Restore" on a previous version
Button shows loading spinner during restore
On success, a green banner appears: "Restored [Model] to version X (new version: Y)"
The table refreshes to show the new active version
On error, a red banner appears with the error message

Note: Restoring a version creates a new version entry (it doesn't overwrite). The restored content becomes the new active version with an incremented version number.
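Append-only restore is easy to model: a restore copies old content into a fresh entry instead of rewinding the history. A minimal in-memory sketch (the class names and the generated change description are hypothetical; the real service persists versions in the database):

```python
from dataclasses import dataclass, field


@dataclass
class PromptVersion:
    version: int
    content: str
    description: str = ""


@dataclass
class PromptHistory:
    """Append-only version list for one model; the newest entry is active."""
    versions: list[PromptVersion] = field(default_factory=list)

    @property
    def active(self) -> PromptVersion:
        return self.versions[-1]

    def save(self, content: str, description: str = "") -> PromptVersion:
        """Every save appends a new entry with the next version number."""
        number = self.versions[-1].version + 1 if self.versions else 1
        entry = PromptVersion(number, content, description)
        self.versions.append(entry)
        return entry

    def restore(self, version: int) -> PromptVersion:
        """Copy an older version's content into a new entry; nothing is overwritten."""
        old = next((v for v in self.versions if v.version == version), None)
        if old is None:
            raise KeyError(f"no version v{version}")
        return self.save(old.content, f"Restored from v{version}")


history = PromptHistory()
history.save("prompt A", "Initial prompt")
history.save("prompt B", "Add weather context")
restored = history.restore(1)
print(f"Restored to version 1 (new version: {restored.version})")
```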

Settings & Configuration¶

Period Selector¶

The time period dropdown in the header controls which data is displayed:

Period	Description
Last 24h	Audit data from the past 24 hours
Last 7 days	Rolling week (default)
Last 14 days	Two week window
Last 30 days	Monthly view
Last 90 days	Quarterly view
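Each period presumably maps to the `days` query parameter (1-90, default 7) accepted by `/api/ai-audit/stats` and the other dashboard endpoints. Assuming that mapping and a hypothetical base URL, the request URL for a selection can be built like this:

```python
from typing import Optional
from urllib.parse import urlencode

# Period selector label -> value for the `days` query parameter (assumed mapping).
PERIOD_DAYS = {
    "Last 24h": 1,
    "Last 7 days": 7,
    "Last 14 days": 14,
    "Last 30 days": 30,
    "Last 90 days": 90,
}


def stats_url(base_url: str, period: str = "Last 7 days",
              camera_id: Optional[str] = None) -> str:
    """Build the /api/ai-audit/stats URL for a period selector value."""
    params: dict = {"days": PERIOD_DAYS[period]}
    if camera_id is not None:
        params["camera_id"] = camera_id
    return f"{base_url.rstrip('/')}/api/ai-audit/stats?{urlencode(params)}"


print(stats_url("http://localhost:8000", "Last 30 days"))
```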

Refresh¶

Click the Refresh button to re-fetch all data for the selected period without reloading the page.

Troubleshooting¶

"No Events Have Been Audited Yet"¶

This appears when no events have been processed through the audit system.

Solution: Click "Trigger Batch Audit" to start evaluating events. The audit requires:

Events exist in the database
Events have AI analysis (risk scores)
Nemotron LLM service is running

Quality Scores Show "N/A"¶

Quality scores are only available for fully evaluated events.

Possible causes:

Batch audit hasn't run yet
Nemotron service was unavailable during evaluation
Events don't have LLM prompts stored

Solution: Run a batch audit with "Force Re-evaluate" enabled.

Model Contribution Rates Are Low¶

Low contribution rates indicate certain AI models aren't being used.

Possible causes:

Model service is offline
No relevant detections (e.g., no vehicles for vehicle classification)
Model is disabled in configuration

Check: Go to the AI Performance page to verify model health status.

Recommendations Are Empty¶

Recommendations are generated from fully evaluated events.

Solution:

Verify events exist in the time period
Run a batch audit to evaluate events
Check that events have LLM prompts (older events may not)

Prompt Playground Test Fails¶

Failures can occur both in the Test Configuration section and in the A/B Test workflow.

Common causes:

Nemotron LLM service is offline
Event no longer exists in database
Network timeout during inference
Invalid Event ID entered (must be a positive integer)
No events available for A/B testing (when no Event ID is provided)

Solution:

Check AI Performance page for Nemotron health
Verify the Event ID exists in the Timeline page
Try a different event ID
Wait and retry (inference typically takes 2-5 seconds)
Check the error message displayed in the red error banner

A/B Test Specific:

If no Event ID is entered, the system picks a random event from the 5 most recent events
If no events exist at all, you'll see "No events available for A/B testing"
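The event-selection rules above can be sketched as a single function. It assumes the caller passes recent event IDs ordered newest first; `pick_ab_event` is a hypothetical name, and the validation mirrors the "positive integer" requirement listed under common causes:

```python
import random


def pick_ab_event(event_id_field: str, recent_event_ids: list[int],
                  rng: random.Random) -> int:
    """Choose the event for an A/B test run.

    recent_event_ids is assumed to be ordered newest first.
    """
    text = event_id_field.strip()
    if text:
        # An explicit Event ID must be a positive integer.
        if not text.isdecimal() or int(text) <= 0:
            raise ValueError("Event ID must be a positive integer")
        return int(text)
    candidates = recent_event_ids[:5]  # the 5 most recent events
    if not candidates:
        raise LookupError("No events available for A/B testing")
    return rng.choice(candidates)


rng = random.Random()
print(pick_ab_event("", [118, 117, 116, 115, 114, 113], rng))
```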

Version Restore Failed¶

Restoring a previous version creates a new version entry; if that step fails, the restore is aborted and an error banner appears.

Possible causes:

Database connection issue
Original version data is corrupted
Network error during API call

Solution:

Try the restore again (click the Restore button)
Check the red error banner for the specific error message
If it persists, check backend logs for PromptApiError details
Verify the prompt management API is accessible

Technical Deep Dive¶

For developers wanting to understand the underlying systems.

Architecture¶

AI Pipeline: AI Pipeline Architecture
Self-Evaluation Service: backend/services/pipeline_quality_audit_service.py
API Routes: backend/api/routes/ai_audit.py
Prompt Management: backend/api/routes/prompt_management.py (consolidated from ai_audit.py in NEM-2695)

Self-Evaluation Modes¶

The audit service runs 4 evaluation passes on each event:

Mode 1: Self-Critique
  - Prompt: "Critique your own previous analysis"
  - Output: Text critique identifying strengths and weaknesses
  - Stored in: self_eval_critique field

Mode 2: Rubric Scoring (1-5 scale)
  - context_usage: Did analysis reference all relevant data?
  - reasoning_coherence: Is reasoning logical and well-structured?
  - risk_justification: Does evidence support the risk score?
  - Overall: Average of the three scores above

Mode 3: Consistency Check
  - Re-analyze the same event with clean prompt
  - Compare new risk score to original
  - Consistency score: 5 if diff <= 5, down to 1 if diff >= 25
  - Stored in: consistency_risk_score, consistency_diff

Mode 4: Prompt Improvement
  - Identify: missing_context, unused_data, model_gaps, format_suggestions, confusing_sections
  - Stored as JSON arrays
  - Aggregated across events into recommendations

Quality Score Calculation¶

overall_quality = (
  context_usage_score
  + reasoning_coherence_score
  + risk_justification_score
) / 3

consistency_score = max(1.0, 5.0 - (risk_score_diff / 5))

Note: The consistency_score is separate from the overall_quality score. The UI displays them as separate metrics.
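The two formulas translate directly to Python. This sketch assumes `risk_score_diff` is the absolute difference between the original and re-check risk scores. Applied as written, a difference of 0 gives 5.0, a difference of 10 gives 3.0, and any difference of 20 or more reaches the 1.0 floor:

```python
from statistics import mean


def overall_quality(context_usage: float, reasoning_coherence: float,
                    risk_justification: float) -> float:
    """Average of the three rubric scores, each on a 1-5 scale."""
    return mean([context_usage, reasoning_coherence, risk_justification])


def consistency_score(original_risk: float, recheck_risk: float) -> float:
    """consistency_score = max(1.0, 5.0 - diff / 5), with diff taken as absolute."""
    risk_score_diff = abs(original_risk - recheck_risk)
    return max(1.0, 5.0 - risk_score_diff / 5)


print(overall_quality(4.0, 5.0, 3.0), consistency_score(72, 62))
```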

API Endpoints¶

Endpoint	Method	Description	Query Parameters
`/api/ai-audit/stats`	GET	Aggregate audit statistics	`days` (1-90, default 7), `camera_id` (optional)
`/api/ai-audit/leaderboard`	GET	Model leaderboard by contribution	`days` (1-90, default 7)
`/api/ai-audit/recommendations`	GET	Aggregated prompt improvement suggestions	`days` (1-90, default 7)
`/api/ai-audit/events/{id}`	GET	Get audit for specific event	-
`/api/ai-audit/events/{id}/evaluate`	POST	Trigger evaluation for specific event	`force` (boolean, default false)
`/api/ai-audit/batch`	POST	Trigger batch audit processing (async)	Body: `{ limit, min_risk_score, force_reevaluate }`
`/api/ai-audit/batch/{job_id}`	GET	Get batch audit job status and progress	-

Batch Audit Response (202 Accepted):

{
  "job_id": "uuid-string",
  "status": "pending",
  "message": "Batch audit job created. Use GET /api/ai-audit/batch/{job_id} to track progress.",
  "total_events": 50
}

Batch Job Status Response:

{
  "job_id": "uuid-string",
  "status": "running|completed|failed",
  "progress": 45,
  "message": "Processing event 23 of 50",
  "total_events": 50,
  "processed_events": 22,
  "failed_events": 1,
  "created_at": "2025-01-15T10:00:00Z",
  "started_at": "2025-01-15T10:00:01Z",
  "completed_at": null,
  "error": null
}
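A client can start a batch and poll the status endpoint until the job reaches `completed` or `failed`. This is a minimal standard-library sketch: the base URL, timeouts, and polling interval are assumptions, and `fetch` is injectable so the loop can be exercised without a running server:

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def start_batch(base_url: str, body: dict) -> dict:
    """POST /api/ai-audit/batch; the 202 response carries the job_id."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/ai-audit/batch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def wait_for_batch(base_url: str, job_id: str,
                   fetch: Callable[[str], dict] = fetch_json,
                   interval: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll GET /api/ai-audit/batch/{job_id} until the job completes or fails."""
    url = f"{base_url.rstrip('/')}/api/ai-audit/batch/{job_id}"
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(url)
        print(f"{status['status']}: {status.get('message', '')}")
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch job {job_id} is still {status['status']}")
        time.sleep(interval)
```

Against a running backend (the address here is hypothetical), `job = start_batch("http://localhost:8000", {"limit": 50, "min_risk_score": 50, "force_reevaluate": False})` followed by `wait_for_batch("http://localhost:8000", job["job_id"])` prints each status until the job finishes.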

Database Model¶

The event_audits table stores:

Model contribution flags: has_yolo26, has_florence, has_clip, has_violence, has_clothing, has_vehicle, has_pet, has_weather, has_image_quality, has_zones, has_baseline, has_cross_camera
Quality scores: context_usage_score, reasoning_coherence_score, risk_justification_score, consistency_score, overall_quality_score
Consistency check results: consistency_risk_score, consistency_diff
Prompt metadata: prompt_length, prompt_token_estimate, enrichment_utilization
Prompt improvements (JSON): missing_context, unused_data, model_gaps, format_suggestions, confusing_sections
Self-evaluation text: self_eval_critique
Status: is_fully_evaluated, audited_at

Related Code¶

Component	Location
Main Page	`frontend/src/components/ai/AIAuditPage.tsx`
Dashboard Component	`frontend/src/components/ai-audit/AIAuditDashboard.tsx`
Quality Score Trends	`frontend/src/components/ai/QualityScoreTrends.tsx`
Model Contribution Chart	`frontend/src/components/ai-audit/ModelContributionChart.tsx`
Recommendations Panel	`frontend/src/components/ai/RecommendationsPanel.tsx`
Prompt Playground	`frontend/src/components/ai/PromptPlayground.tsx`
Batch Audit Modal	`frontend/src/components/ai/BatchAuditModal.tsx`
Version History	`frontend/src/components/ai-audit/PromptVersionHistory.tsx`
Audit Progress Bar	`frontend/src/components/ai-audit/AuditProgressBar.tsx`
Audit Results Table	`frontend/src/components/ai-audit/AuditResultsTable.tsx`
Backend Service	`backend/services/pipeline_quality_audit_service.py`
API Routes	`backend/api/routes/ai_audit.py`
API Schemas	`backend/api/schemas/ai_audit.py`
Audit API (Frontend)	`frontend/src/services/auditApi.ts`
