# Prompt Evaluation Results

This document tracks pre/post metrics for Nemotron prompt quality improvements through synthetic data evaluation.
## Overview

The evaluation harness compares five prompt templates against synthetic scenarios generated by NeMo Data Designer, recording metrics before and after prompt optimization.
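For orientation, here is a minimal sketch of what the harness's evaluation loop could look like. `run_template` and `score_output` are hypothetical stand-ins for the harness's own model-call and scoring functions, not actual API names:

```python
from statistics import mean

def evaluate_templates(templates, scenarios, run_template, score_output):
    """Run every template against every synthetic scenario and average the scores.

    `run_template` and `score_output` are placeholders for the harness's own
    model-call and scoring functions; they are illustrative, not real APIs.
    """
    results = {}
    for template in templates:
        per_scenario = [
            score_output(run_template(template, s), s)  # compare output to ground truth
            for s in scenarios
        ]
        results[template] = mean(per_scenario)
    return results
```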
## Baseline Metrics

**Status:** To be filled after the initial evaluation run.
### Risk Score Variance

Measures how consistently the model produces risk scores within the expected ground-truth ranges.
| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Normal | TBD | TBD | TBD | TBD | TBD |
| Suspicious | TBD | TBD | TBD | TBD | TBD |
| Threat | TBD | TBD | TBD | TBD | TBD |
| Edge Case | TBD | TBD | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD | TBD | TBD |
Values are the mean absolute deviation from the midpoint of the ground-truth risk range; lower is better.
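A minimal sketch of how this deviation could be computed, assuming each scenario's ground truth is a `(low, high)` risk range (the representation is illustrative):

```python
def risk_score_deviation(predicted_scores, ground_truth_ranges):
    """Mean absolute deviation of predicted risk scores from range midpoints."""
    deviations = [
        abs(score - (low + high) / 2)  # distance from the range midpoint
        for score, (low, high) in zip(predicted_scores, ground_truth_ranges)
    ]
    return sum(deviations) / len(deviations)

# Example: scores 40 and 75 vs. ranges (30, 50) and (60, 80) have midpoints
# 40 and 70, so the mean absolute deviation is (0 + 5) / 2 = 2.5.
```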
### Reasoning Key Point Coverage

Measures how many of the expected reasoning key points appear in the model's output.
| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Normal | TBD% | TBD% | TBD% | TBD% | TBD% |
| Suspicious | TBD% | TBD% | TBD% | TBD% | TBD% |
| Threat | TBD% | TBD% | TBD% | TBD% | TBD% |
| Edge Case | TBD% | TBD% | TBD% | TBD% | TBD% |
| **Average** | TBD% | TBD% | TBD% | TBD% | TBD% |
Values are the percentage of expected key points found in the model's reasoning; higher is better.
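A sketch of the coverage computation, under the simplifying assumption that a key point counts as present when its text appears (case-insensitively) in the reasoning; a production harness might use fuzzier matching:

```python
def key_point_coverage(reasoning: str, expected_points: list[str]) -> float:
    """Percentage of expected key points found verbatim in the reasoning."""
    text = reasoning.lower()
    found = sum(1 for point in expected_points if point.lower() in text)
    return 100.0 * found / len(expected_points) if expected_points else 0.0

# key_point_coverage("Person loitering near the rear door at 2 AM",
#                    ["loitering", "rear door", "unusual hour"])  # -> 66.7
```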
### Context Utilization Scores

Measures how well the model incorporates enrichment context in its reasoning.
| Enrichment Level | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| None | TBD | TBD | TBD | TBD | TBD |
| Basic | TBD | TBD | TBD | TBD | TBD |
| Full | TBD | TBD | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD | TBD | TBD |
Values are the LLM-Judge `context_usage` rubric score on a 1-4 scale.
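A sketch of how the judge could be prompted for this score; the rubric wording and JSON response shape below are assumptions, not the actual judge prompt:

```python
import json

# Hypothetical rubric prompt; the real judge prompt may differ.
RUBRIC_PROMPT = """Rate how well the response incorporates the provided
enrichment context, on a 1-4 scale (1 = ignored, 4 = fully integrated).
Respond with JSON only: {{"context_usage": <int>}}

Context: {context}
Response: {response}"""

def context_usage_score(judge, context: str, response: str) -> int:
    """`judge` is any callable that takes a prompt string and returns the
    judge model's raw text reply (a stand-in, not a real API). Assumes the
    reply is pure JSON."""
    reply = judge(RUBRIC_PROMPT.format(context=context, response=response))
    return int(json.loads(reply)["context_usage"])
```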
### Edge Case Success Rate

Measures correct handling of ambiguous scenarios.
| Edge Case Category | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Contractor at night | TBD% | TBD% | TBD% | TBD% | TBD% |
| Costume/disguise | TBD% | TBD% | TBD% | TBD% | TBD% |
| Wildlife | TBD% | TBD% | TBD% | TBD% | TBD% |
| Weather occlusion | TBD% | TBD% | TBD% | TBD% | TBD% |
| Low-light/IR | TBD% | TBD% | TBD% | TBD% | TBD% |
| **Average** | TBD% | TBD% | TBD% | TBD% | TBD% |
Values are the percentage of edge cases whose risk score falls within the acceptable range.
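The pass/fail check itself is simple; a sketch assuming each fixture carries an acceptable `(min, max)` risk range:

```python
def edge_case_success_rate(results):
    """`results`: list of (predicted_score, (min_ok, max_ok)) tuples.
    Returns the percentage whose score lands inside the acceptable range."""
    passed = sum(1 for score, (lo, hi) in results if lo <= score <= hi)
    return 100.0 * passed / len(results) if results else 0.0
```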
### Multimodal Alignment (IoU)

Measures alignment between local pipeline detections and NVIDIA vision model ground truth.
| Detection Type | YOLO26 IoU | Florence-2 Desc | Combined Score |
|---|---|---|---|
| Person | TBD | TBD | TBD |
| Vehicle | TBD | TBD | TBD |
| Animal | TBD | TBD | TBD |
| Package | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD |
IoU = Intersection over Union for bounding boxes. Desc = description similarity.
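The IoU column follows the standard bounding-box formula: intersection area divided by union area. A reference implementation for `(x1, y1, x2, y2)` boxes:

```python
def bbox_iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) bounding boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# bbox_iou((0, 0, 10, 10), (5, 5, 15, 15))  # -> 25 / 175 ≈ 0.143
```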
## Post-Implementation Metrics

**Status:** To be filled after prompt optimization.
### Risk Score Variance (Post)

| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Normal | TBD | TBD | TBD | TBD | TBD |
| Suspicious | TBD | TBD | TBD | TBD | TBD |
| Threat | TBD | TBD | TBD | TBD | TBD |
| Edge Case | TBD | TBD | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD | TBD | TBD |
### Reasoning Key Point Coverage (Post)

| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Normal | TBD% | TBD% | TBD% | TBD% | TBD% |
| Suspicious | TBD% | TBD% | TBD% | TBD% | TBD% |
| Threat | TBD% | TBD% | TBD% | TBD% | TBD% |
| Edge Case | TBD% | TBD% | TBD% | TBD% | TBD% |
| **Average** | TBD% | TBD% | TBD% | TBD% | TBD% |
### Context Utilization Scores (Post)

| Enrichment Level | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| None | TBD | TBD | TBD | TBD | TBD |
| Basic | TBD | TBD | TBD | TBD | TBD |
| Full | TBD | TBD | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD | TBD | TBD |
### Edge Case Success Rate (Post)

| Edge Case Category | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Contractor at night | TBD% | TBD% | TBD% | TBD% | TBD% |
| Costume/disguise | TBD% | TBD% | TBD% | TBD% | TBD% |
| Wildlife | TBD% | TBD% | TBD% | TBD% | TBD% |
| Weather occlusion | TBD% | TBD% | TBD% | TBD% | TBD% |
| Low-light/IR | TBD% | TBD% | TBD% | TBD% | TBD% |
| **Average** | TBD% | TBD% | TBD% | TBD% | TBD% |
### Multimodal Alignment (Post)

| Detection Type | YOLO26 IoU | Florence-2 Desc | Combined Score |
|---|---|---|---|
| Person | TBD | TBD | TBD |
| Vehicle | TBD | TBD | TBD |
| Animal | TBD | TBD | TBD |
| Package | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD |
## Metrics Comparison Summary

### Comparison Table

| Metric | Baseline | Post-Implementation | Delta | Target |
|---|---|---|---|---|
| Risk Score Deviation | TBD | TBD | TBD | <15 |
| Key Point Coverage | TBD% | TBD% | TBD% | >80% |
| Context Utilization (3+) | TBD% | TBD% | TBD% | >90% |
| Edge Case Success | TBD% | TBD% | TBD% | >70% |
| Multimodal IoU | TBD | TBD | TBD | >0.70 |
### Success Criteria

From the design document:
| Criterion | Target | Status |
|---|---|---|
| Risk score consistency | <15 point deviation for 90%+ of scenarios | TBD |
| Context utilization | Full enrichment scenarios score 3+ on the rubric | TBD |
| Edge case coverage | All 4 scenario types have test fixtures | TBD |
| Multimodal alignment | 70%+ IoU with NVIDIA vision detection | TBD |
## Template Ranking Results

### Overall Template Rankings

| Rank | Template | Composite Score | Best For | Weaknesses |
|---|---|---|---|---|
| 1 | TBD | TBD | TBD | TBD |
| 2 | TBD | TBD | TBD | TBD |
| 3 | TBD | TBD | TBD | TBD |
| 4 | TBD | TBD | TBD | TBD |
| 5 | TBD | TBD | TBD | TBD |
Composite score = weighted average of all metrics.
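As a sketch of how the composite could be formed: each metric is first normalized to the 0-1 range (with deviation inverted, since lower is better) and then combined with per-metric weights. The weights below are illustrative placeholders, not the final weighting:

```python
# Illustrative weights only; the actual weighting is still TBD.
WEIGHTS = {
    "risk_deviation": 0.25,       # pre-inverted: lower raw deviation -> higher value
    "key_point_coverage": 0.25,
    "context_utilization": 0.20,
    "edge_case_success": 0.20,
    "multimodal_iou": 0.10,
}

def composite_score(normalized: dict[str, float]) -> float:
    """Weighted average over metrics already normalized to the 0-1 range."""
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)
```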
### Per-Scenario Type Rankings

#### Normal Scenarios

| Rank | Template | Score | Notes |
|---|---|---|---|
| 1 | TBD | TBD | TBD |
| 2 | TBD | TBD | TBD |
| 3 | TBD | TBD | TBD |
#### Suspicious Scenarios

| Rank | Template | Score | Notes |
|---|---|---|---|
| 1 | TBD | TBD | TBD |
| 2 | TBD | TBD | TBD |
| 3 | TBD | TBD | TBD |
#### Threat Scenarios

| Rank | Template | Score | Notes |
|---|---|---|---|
| 1 | TBD | TBD | TBD |
| 2 | TBD | TBD | TBD |
| 3 | TBD | TBD | TBD |
#### Edge Case Scenarios

| Rank | Template | Score | Notes |
|---|---|---|---|
| 1 | TBD | TBD | TBD |
| 2 | TBD | TBD | TBD |
| 3 | TBD | TBD | TBD |
## Edge Case Analysis

### Failure Case Breakdown

| Edge Case | Failure Mode | Frequency | Template(s) Affected |
|---|---|---|---|
| TBD | TBD | TBD | TBD |
| TBD | TBD | TBD | TBD |
| TBD | TBD | TBD | TBD |
### Root Cause Categories

| Root Cause | Count | Percentage | Mitigation |
|---|---|---|---|
| Missing context | TBD | TBD% | TBD |
| Misinterpreted time | TBD | TBD% | TBD |
| Object misclassification | TBD | TBD% | TBD |
| Baseline deviation error | TBD | TBD% | TBD |
### Recommended Prompt Improvements

Based on the edge case analysis:
1. TBD - Description of improvement
2. TBD - Description of improvement
3. TBD - Description of improvement

## LLM-Judge Rubric Scores

### Average Rubric Scores by Template

| Rubric | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---|---|---|---|---|---|
| Relevance | TBD | TBD | TBD | TBD | TBD |
| Risk Calibration | TBD | TBD | TBD | TBD | TBD |
| Context Usage | TBD | TBD | TBD | TBD | TBD |
| Reasoning Quality | TBD | TBD | TBD | TBD | TBD |
| Threat Identification | TBD | TBD | TBD | TBD | TBD |
| Actionability | TBD | TBD | TBD | TBD | TBD |
| **Average** | TBD | TBD | TBD | TBD | TBD |
### Rubric Score Distribution

| Score | Relevance | Risk Cal. | Context | Reasoning | Threat ID | Action |
|---|---|---|---|---|---|---|
| 4 | TBD% | TBD% | TBD% | TBD% | TBD% | TBD% |
| 3 | TBD% | TBD% | TBD% | TBD% | TBD% | TBD% |
| 2 | TBD% | TBD% | TBD% | TBD% | TBD% | TBD% |
| 1 | TBD% | TBD% | TBD% | TBD% | TBD% | TBD% |
## Evaluation History

| Date | Scenarios | Templates | Best Template | Notes |
|---|---|---|---|---|
| TBD | TBD | TBD | TBD | Initial baseline |
| TBD | TBD | TBD | TBD | Post-optimization |