Prompt Evaluation Results

This document tracks pre/post metrics for Nemotron prompt quality improvements through synthetic data evaluation.

Overview

The evaluation harness compares 5 prompt templates against synthetic scenarios generated by NeMo Data Designer. This document tracks metrics before and after prompt optimization.

Baseline Metrics

Status: To be filled after initial evaluation run

Risk Score Variance

Measures how consistently the model produces risk scores within expected ground truth ranges.

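| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------|------------|------------|------------|------------|------------|
| Normal        | TBD        | TBD        | TBD        | TBD        | TBD        |
| Suspicious    | TBD        | TBD        | TBD        | TBD        | TBD        |
| Threat        | TBD        | TBD        | TBD        | TBD        | TBD        |
| Edge Case     | TBD        | TBD        | TBD        | TBD        | TBD        |
| Average       | TBD        | TBD        | TBD        | TBD        | TBD        |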

Values represent mean absolute deviation from ground truth range midpoint.
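As a rough illustration of the deviation metric described above, the sketch below computes the mean absolute deviation of each risk score from the midpoint of its ground truth range. The function name, the 0-100 score scale, and the sample values are assumptions made for this example, not part of the evaluation harness.

```python
from statistics import mean


def risk_score_deviation(scores, truth_ranges):
    """Mean absolute deviation of each score from its range midpoint.

    scores: model risk scores (assumed to be on a 0-100 scale).
    truth_ranges: (low, high) ground truth range for each score.
    """
    deviations = [
        abs(score - (low + high) / 2)
        for score, (low, high) in zip(scores, truth_ranges)
    ]
    return mean(deviations)


# Hypothetical scores and ranges, for illustration only.
scores = [20, 55, 90]
truth_ranges = [(0, 30), (40, 60), (70, 100)]
deviation = risk_score_deviation(scores, truth_ranges)
print(deviation)  # 5.0
```

A lower value means the scores sit closer to the center of their expected ranges.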

Reasoning Key Point Coverage

Measures how many expected reasoning points appear in the model's output.

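| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------|------------|------------|------------|------------|------------|
| Normal        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Suspicious    | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Threat        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Edge Case     | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Average       | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |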

Values represent percentage of expected key points found in reasoning.
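One simple way to compute this percentage is sketched below. This document does not say how key points are matched against the reasoning text, so the case-insensitive substring match used here is an assumption, as are the function name and the sample data. A real harness might use fuzzy or semantic matching instead.

```python
def key_point_coverage(reasoning, expected_points):
    """Percentage of expected key points found in the reasoning text.

    Matching is a case-insensitive substring check (an assumption).
    """
    if not expected_points:
        raise ValueError("expected_points must not be empty")
    text = reasoning.lower()
    found = sum(1 for point in expected_points if point.lower() in text)
    return 100.0 * found / len(expected_points)


# Hypothetical reasoning output and expected key points.
reasoning = "Unknown person near the back door at 2 AM; no vehicle present."
expected = ["unknown person", "back door", "forced entry", "2 am"]
coverage = key_point_coverage(reasoning, expected)
print(coverage)  # 75.0
```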

Context Utilization Scores

Measures how well the model incorporates enrichment context in its reasoning.

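| Enrichment Level | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|------------------|------------|------------|------------|------------|------------|
| None             | TBD        | TBD        | TBD        | TBD        | TBD        |
| Basic            | TBD        | TBD        | TBD        | TBD        | TBD        |
| Full             | TBD        | TBD        | TBD        | TBD        | TBD        |
| Average          | TBD        | TBD        | TBD        | TBD        | TBD        |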

Values represent LLM-Judge context_usage rubric score (1-4 scale).

Edge Case Success Rate

Measures correct handling of ambiguous scenarios.

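| Edge Case Category  | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------------|------------|------------|------------|------------|------------|
| Contractor at night | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Costume/disguise    | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Wildlife            | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Weather occlusion   | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Low-light/IR        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Average             | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |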

Values represent percentage of edge cases with risk score within acceptable range.
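The success rate above can be sketched as a simple range check. The function name, the tuple layout, and the inclusive bounds are assumptions for illustration; the document does not state whether the acceptable range includes its endpoints.

```python
def edge_case_success_rate(results):
    """Percentage of edge cases whose score falls in its acceptable range.

    results: list of (score, low, high) tuples; bounds are treated as
    inclusive, which is an assumption.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for score, low, high in results if low <= score <= high)
    return 100.0 * hits / len(results)


# Hypothetical edge case results: (score, acceptable low, acceptable high).
results = [(25, 10, 40), (70, 20, 50), (35, 30, 60), (10, 15, 45)]
success_rate = edge_case_success_rate(results)
print(success_rate)  # 50.0
```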

Multimodal Alignment (IoU)

Measures alignment between local pipeline detections and NVIDIA vision model ground truth.

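| Detection Type | YOLO26 IoU | Florence-2 Desc | Combined Score |
|----------------|------------|-----------------|----------------|
| Person         | TBD        | TBD             | TBD            |
| Vehicle        | TBD        | TBD             | TBD            |
| Animal         | TBD        | TBD             | TBD            |
| Package        | TBD        | TBD             | TBD            |
| Average        | TBD        | TBD             | TBD            |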

IoU = Intersection over Union for bounding boxes. Desc = description similarity.
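For reference, a minimal sketch of the standard IoU calculation for two axis-aligned bounding boxes follows. The `(x1, y1, x2, y2)` corner format and the function name are assumptions; the pipeline may store boxes differently.

```python
def bbox_iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical detection vs. ground truth box.
iou = bbox_iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(iou, 3))  # 0.143
```

Identical boxes give 1.0 and non-overlapping boxes give 0.0, so the >0.70 target requires substantial overlap.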


Post-Implementation Metrics

Status: To be filled after prompt optimization

Risk Score Variance (Post)

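| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------|------------|------------|------------|------------|------------|
| Normal        | TBD        | TBD        | TBD        | TBD        | TBD        |
| Suspicious    | TBD        | TBD        | TBD        | TBD        | TBD        |
| Threat        | TBD        | TBD        | TBD        | TBD        | TBD        |
| Edge Case     | TBD        | TBD        | TBD        | TBD        | TBD        |
| Average       | TBD        | TBD        | TBD        | TBD        | TBD        |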

Reasoning Key Point Coverage (Post)

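| Scenario Type | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------|------------|------------|------------|------------|------------|
| Normal        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Suspicious    | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Threat        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Edge Case     | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Average       | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |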

Context Utilization Scores (Post)

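| Enrichment Level | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|------------------|------------|------------|------------|------------|------------|
| None             | TBD        | TBD        | TBD        | TBD        | TBD        |
| Basic            | TBD        | TBD        | TBD        | TBD        | TBD        |
| Full             | TBD        | TBD        | TBD        | TBD        | TBD        |
| Average          | TBD        | TBD        | TBD        | TBD        | TBD        |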

Edge Case Success Rate (Post)

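| Edge Case Category  | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|---------------------|------------|------------|------------|------------|------------|
| Contractor at night | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Costume/disguise    | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Wildlife            | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Weather occlusion   | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Low-light/IR        | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |
| Average             | TBD%       | TBD%       | TBD%       | TBD%       | TBD%       |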

Multimodal Alignment (Post)

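| Detection Type | YOLO26 IoU | Florence-2 Desc | Combined Score |
|----------------|------------|-----------------|----------------|
| Person         | TBD        | TBD             | TBD            |
| Vehicle        | TBD        | TBD             | TBD            |
| Animal         | TBD        | TBD             | TBD            |
| Package        | TBD        | TBD             | TBD            |
| Average        | TBD        | TBD             | TBD            |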

Metrics Comparison

Summary Comparison Table

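| Metric                   | Baseline | Post-Implementation | Delta | Target |
|--------------------------|----------|---------------------|-------|--------|
| Risk Score Deviation     | TBD      | TBD                 | TBD   | <15    |
| Key Point Coverage       | TBD%     | TBD%                | TBD%  | >80%   |
| Context Utilization (3+) | TBD%     | TBD%                | TBD%  | >90%   |
| Edge Case Success        | TBD%     | TBD%                | TBD%  | >70%   |
| Multimodal IoU           | TBD      | TBD                 | TBD   | >0.70  |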

Success Criteria

From the design document:

| Criterion              | Target                                   | Status |
|------------------------|------------------------------------------|--------|
| Risk score consistency | <15 point deviation for 90%+ scenarios   | TBD    |
| Context utilization    | Full enrichment scenarios score 3+ rubric | TBD    |
| Edge case coverage     | All 4 scenario types have test fixtures  | TBD    |
| Multimodal alignment   | 70%+ IoU with NVIDIA vision detection    | TBD    |
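The risk score consistency criterion combines a per-scenario threshold with a share of scenarios, so it may help to see it written out. The sketch below assumes the criterion means "at least 90% of scenarios have a deviation strictly under 15 points"; the function name and sample deviations are hypothetical.

```python
def meets_risk_consistency(deviations, max_deviation=15.0, min_share=0.90):
    """True if enough scenarios deviate less than max_deviation points.

    deviations: per-scenario absolute deviation from the ground truth
    range midpoint. The strict "<" comparison is an assumption.
    """
    if not deviations:
        raise ValueError("deviations must not be empty")
    within = sum(1 for d in deviations if d < max_deviation)
    return within / len(deviations) >= min_share


# Hypothetical per-scenario deviations: 9 of 10 are under 15 points.
passing = meets_risk_consistency([3, 5, 8, 2, 12, 14, 1, 7, 9, 30])
# Here only 8 of 10 are under 15 points.
failing = meets_risk_consistency([3, 5, 8, 2, 12, 14, 1, 7, 22, 30])
print(passing, failing)  # True False
```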

Template Ranking Results

Overall Template Rankings

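| Rank | Template | Composite Score | Best For | Weaknesses |
|------|----------|-----------------|----------|------------|
| 1    | TBD      | TBD             | TBD      | TBD        |
| 2    | TBD      | TBD             | TBD      | TBD        |
| 3    | TBD      | TBD             | TBD      | TBD        |
| 4    | TBD      | TBD             | TBD      | TBD        |
| 5    | TBD      | TBD             | TBD      | TBD        |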

Composite score = weighted average of all metrics.
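A weighted average like this could be computed as in the sketch below. The metric names, the weights, and the assumption that every metric has already been normalized to a 0-1 scale where higher is better are all hypothetical; the actual weighting is not specified in this document.

```python
def composite_score(metrics, weights):
    """Weighted average of normalized metric values.

    metrics: metric name -> value on a 0-1 scale, higher is better
             (normalization is assumed to happen before this call).
    weights: metric name -> non-negative weight.
    """
    total_weight = sum(weights[name] for name in metrics)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(metrics[name] * weights[name] for name in metrics)
    return weighted_sum / total_weight


# Hypothetical normalized metrics and weights for one template.
metrics = {"risk": 0.8, "coverage": 0.6, "context": 0.9, "edge": 0.5, "iou": 0.7}
weights = {"risk": 2.0, "coverage": 1.0, "context": 1.0, "edge": 1.0, "iou": 1.0}
score = composite_score(metrics, weights)
print(round(score, 3))
```

Because deviation is a lower-is-better metric, it would need to be inverted during normalization before it is combined with the others.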

Per-Scenario Type Rankings

Normal Scenarios

| Rank | Template | Score | Notes |
|------|----------|-------|-------|
| 1    | TBD      | TBD   | TBD   |
| 2    | TBD      | TBD   | TBD   |
| 3    | TBD      | TBD   | TBD   |

Suspicious Scenarios

| Rank | Template | Score | Notes |
|------|----------|-------|-------|
| 1    | TBD      | TBD   | TBD   |
| 2    | TBD      | TBD   | TBD   |
| 3    | TBD      | TBD   | TBD   |

Threat Scenarios

| Rank | Template | Score | Notes |
|------|----------|-------|-------|
| 1    | TBD      | TBD   | TBD   |
| 2    | TBD      | TBD   | TBD   |
| 3    | TBD      | TBD   | TBD   |

Edge Case Scenarios

| Rank | Template | Score | Notes |
|------|----------|-------|-------|
| 1    | TBD      | TBD   | TBD   |
| 2    | TBD      | TBD   | TBD   |
| 3    | TBD      | TBD   | TBD   |

Edge Case Analysis

Failure Case Breakdown

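| Edge Case | Failure Mode | Frequency | Template(s) Affected |
|-----------|--------------|-----------|----------------------|
| TBD       | TBD          | TBD       | TBD                  |
| TBD       | TBD          | TBD       | TBD                  |
| TBD       | TBD          | TBD       | TBD                  |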

Root Cause Categories

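| Root Cause               | Count | Percentage | Mitigation |
|--------------------------|-------|------------|------------|
| Missing context          | TBD   | TBD%       | TBD        |
| Misinterpreted time      | TBD   | TBD%       | TBD        |
| Object misclassification | TBD   | TBD%       | TBD        |
| Baseline deviation error | TBD   | TBD%       | TBD        |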

Recommended improvements, based on the edge case analysis:

  1. TBD - Description of improvement
  2. TBD - Description of improvement
  3. TBD - Description of improvement

LLM-Judge Rubric Scores

Average Rubric Scores by Template

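| Rubric                | Template 1 | Template 2 | Template 3 | Template 4 | Template 5 |
|-----------------------|------------|------------|------------|------------|------------|
| Relevance             | TBD        | TBD        | TBD        | TBD        | TBD        |
| Risk Calibration      | TBD        | TBD        | TBD        | TBD        | TBD        |
| Context Usage         | TBD        | TBD        | TBD        | TBD        | TBD        |
| Reasoning Quality     | TBD        | TBD        | TBD        | TBD        | TBD        |
| Threat Identification | TBD        | TBD        | TBD        | TBD        | TBD        |
| Actionability         | TBD        | TBD        | TBD        | TBD        | TBD        |
| Average               | TBD        | TBD        | TBD        | TBD        | TBD        |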

Rubric Score Distribution

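| Score | Relevance | Risk Cal. | Context | Reasoning | Threat ID | Action |
|-------|-----------|-----------|---------|-----------|-----------|--------|
| 4     | TBD%      | TBD%      | TBD%    | TBD%      | TBD%      | TBD%   |
| 3     | TBD%      | TBD%      | TBD%    | TBD%      | TBD%      | TBD%   |
| 2     | TBD%      | TBD%      | TBD%    | TBD%      | TBD%      | TBD%   |
| 1     | TBD%      | TBD%      | TBD%    | TBD%      | TBD%      | TBD%   |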
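Each column of the distribution table can be derived from the raw judge scores for one rubric. The sketch below shows one way to do that; the function name and the sample scores are hypothetical, and the 1-4 scale is taken from the context utilization section above.

```python
from collections import Counter


def score_distribution(scores, levels=(4, 3, 2, 1)):
    """Percentage of judge scores at each rubric level (1-4 scale)."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    # Counter returns 0 for levels that never occur.
    return {level: 100.0 * counts[level] / len(scores) for level in levels}


# Hypothetical judge scores for a single rubric.
distribution = score_distribution([4, 4, 3, 3, 3, 2, 1, 4])
print(distribution)  # {4: 37.5, 3: 37.5, 2: 12.5, 1: 12.5}
```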

Evaluation History

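| Date | Scenarios | Templates | Best Template | Notes             |
|------|-----------|-----------|---------------|-------------------|
| TBD  | TBD       | TBD       | TBD           | Initial baseline  |
| TBD  | TBD       | TBD       | TBD           | Post-optimization |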