# NeMo Data Designer Integration Design

- **Date:** 2026-01-21
- **Status:** Draft
- **Author:** AI-assisted design session
## Overview

Integration of NVIDIA NeMo Data Designer to improve testing coverage and Nemotron prompt quality through synthetic data generation and systematic evaluation.
## Goals

- Improve testing coverage across all layers (baseline/anomaly, enrichment pipeline, batch aggregation, end-to-end integration)
- Improve Nemotron prompts by addressing:
  - Inconsistent risk scores
  - Poor reasoning quality
  - Context underutilization
  - No quantitative template comparison
  - Edge case failures
- Validate the full pipeline through multimodal evaluation against NVIDIA vision models

## Non-Goals

- Production runtime integration (developer-only tooling)
- Replacing real camera feeds during development
- Fine-tuning production Nemotron on synthetic data

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│ Development Workflow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ NeMo Data │ │ NVIDIA API │ │
│ │ Designer │─────▶│ (Nemotron 49B) │ │
│ │ (Generation) │ │ (Generation + │ │
│ └────────┬─────────┘ │ LLM-as-Judge) │ │
│ │ └──────────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Synthetic │ Versioned fixtures checked into repo │
│ │ Scenario │ - scenarios.parquet │
│ │ Fixtures │ - ground_truth_scores.json │
│ └────────┬─────────┘ - evaluation_rubrics.yaml │
│ │ │
├───────────┼─────────────────────────────────────────────────────┤
│ │ CI / Test Workflow │
│ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Prompt │ │ Local Nemotron │ │
│ │ Evaluation │─────▶│ (Your 5 │ │
│ │ Harness │ │ Templates) │ │
│ └────────┬─────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Comparison │ Compare outputs against ground truth │
│ │ Reports │ - Risk score deviation │
│ └──────────────────┘ - Reasoning quality metrics │
│ - Template performance rankings │
└─────────────────────────────────────────────────────────────────┘
```

## Scenario Taxonomy

### Sampler Columns (Statistical Control)

| Column | Values | Purpose |
|---|---|---|
| time_of_day | morning, midday, evening, night, late_night | Test time-based risk calibration |
| day_type | weekday, weekend, holiday | Baseline deviation testing |
| camera_location | front_door, backyard, driveway, side_gate | Zone-based context |
| detection_count | 1, 2-3, 4-6, 7+ | Batch complexity |
| primary_object | person, vehicle, animal, package | Core detection types |
| scenario_type | normal, suspicious, threat, edge_case | Ground truth classification |
| enrichment_level | none, basic, full | Context utilization testing |
### Scenario Type Definitions

- **normal** - Expected activity (family arriving, delivery, pets)
- **suspicious** - Unusual but not threatening (unknown person lingering, vehicle idling)
- **threat** - Clear security concern (weapon detected, forced entry attempt, prowler)
- **edge_case** - Ambiguous situations (contractor at odd hours, costume, wildlife)

### Ground Truth Risk Ranges

| Scenario Type | Risk Range | Risk Level |
|---|---|---|
| normal | 0-25 | low |
| suspicious | 30-55 | medium |
| threat | 70-100 | high/critical |
| edge_case | 20-60 | varies |
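These ranges are reused by the evaluation harness as the pass/fail band for risk scores. A minimal sketch of that check; the `GROUND_TRUTH_RISK_RANGES` constant and `risk_score_in_range` helper are illustrative names, not existing code:

```python
# Hypothetical constant mirroring the table above; names are illustrative only.
GROUND_TRUTH_RISK_RANGES: dict[str, tuple[int, int]] = {
    "normal": (0, 25),
    "suspicious": (30, 55),
    "threat": (70, 100),
    "edge_case": (20, 60),
}


def risk_score_in_range(scenario_type: str, risk_score: int) -> bool:
    """Check whether a model-produced risk score falls in the expected band."""
    low, high = GROUND_TRUTH_RISK_RANGES[scenario_type]
    return low <= risk_score <= high
```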
## Column Inventory

### Complete Column Structure (26 columns across 7 types)

```
# SAMPLERS (7) - Statistical control
- time_of_day: [morning, midday, evening, night, late_night]
- day_type: [weekday, weekend, holiday]
- camera_location: [front_door, backyard, driveway, side_gate]
- detection_count: [1, 2-3, 4-6, 7+]
- primary_object: [person, vehicle, animal, package]
- scenario_type: [normal, suspicious, threat, edge_case]
- enrichment_level: [none, basic, full]

# LLM-STRUCTURED (3) - Pydantic-validated generation
- detections: list[Detection]
- enrichment_context: EnrichmentContext | None
- ground_truth: GroundTruth  # risk_range, reasoning, expected_models

# LLM-TEXT (3) - Narrative generation
- scenario_narrative: str
- expected_summary: str
- reasoning_key_points: str

# LLM-JUDGE (6) - Quality rubrics
- relevance: 1-4
- risk_calibration: 1-4
- context_usage: 1-4
- reasoning_quality: 1-4
- threat_identification: 1-4
- actionability: 1-4

# EMBEDDING (2) - Semantic search
- scenario_embedding: vector[768]
- reasoning_embedding: vector[768]

# EXPRESSION (3) - Derived fields
- formatted_prompt_input: str  # Pre-rendered for each template
- complexity_score: float
- scenario_hash: str

# VALIDATION (2) - Quality gates
- detection_schema_valid: bool
- temporal_consistency: bool
```
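The expression columns are derived from sampler values rather than generated. One plausible derivation, shown only to make the intent concrete; the hash fields and the complexity weighting are assumptions, not a fixed spec:

```python
import hashlib

SAMPLER_COLUMNS = ("time_of_day", "day_type", "camera_location",
                   "detection_count", "primary_object", "scenario_type",
                   "enrichment_level")


def scenario_hash(row: dict) -> str:
    """Deterministic ID so regenerated fixtures can be diffed across versions."""
    key = "|".join(str(row[name]) for name in sorted(SAMPLER_COLUMNS))
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def complexity_score(row: dict) -> float:
    """Rough 0-1 weight: more detections and richer enrichment score higher (assumed weights)."""
    count_weight = {"1": 0.1, "2-3": 0.3, "4-6": 0.6, "7+": 1.0}[row["detection_count"]]
    enrichment_weight = {"none": 0.0, "basic": 0.5, "full": 1.0}[row["enrichment_level"]]
    return round(0.6 * count_weight + 0.4 * enrichment_weight, 2)
```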
### Pydantic Models

```python
from typing import Literal

from pydantic import BaseModel, Field


class Detection(BaseModel):
    object_type: Literal["person", "car", "truck", "dog", "cat", "bicycle"]
    confidence: float = Field(ge=0.5, le=1.0)
    bbox: tuple[int, int, int, int]  # x, y, width, height
    timestamp_offset_seconds: int = Field(ge=0, le=90)


class EnrichmentContext(BaseModel):
    zone_name: str | None
    is_entry_point: bool
    baseline_expected_count: int
    baseline_deviation_score: float = Field(ge=-3.0, le=3.0)
    cross_camera_matches: int = Field(ge=0, le=5)


class GroundTruth(BaseModel):
    risk_range: tuple[int, int]
    reasoning_key_points: list[str]
    expected_enrichment_models: list[str]
    should_trigger_alert: bool
```
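A quick usage example of the models above, validating one generated record before it lands in the fixtures. The field values here are made up for illustration:

```python
# Validate one LLM-generated detection against the schema above.
detection = Detection(
    object_type="person",
    confidence=0.87,
    bbox=(120, 40, 260, 480),
    timestamp_offset_seconds=12,
)

truth = GroundTruth(
    risk_range=(70, 100),
    reasoning_key_points=["unknown person", "forced entry attempt", "night"],
    expected_enrichment_models=["face_recognition", "pose_estimation"],  # illustrative names
    should_trigger_alert=True,
)

# Out-of-range values raise pydantic.ValidationError, which is what the
# detection_schema_valid quality gate relies on.
```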
## Evaluation Harness

### Workflow

```
1. LOAD FIXTURES
   scenarios.parquet → DataFrame with all columns

2. FOR EACH PROMPT TEMPLATE (5 templates)
   ├─ Render prompt using formatted_prompt_input
   ├─ Call local Nemotron → get risk_score, reasoning
   └─ Store in results DataFrame

3. COMPUTE METRICS
   ├─ Risk score deviation from ground_truth range
   ├─ Reasoning similarity to expected_summary (cosine)
   ├─ Key point coverage (reasoning_key_points)
   └─ Aggregate by scenario_type, enrichment_level

4. GENERATE REPORTS
   ├─ Template ranking by overall score
   ├─ Failure cases (score outside ground_truth range)
   ├─ Context utilization gaps
   └─ Edge case performance breakdown
```
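A minimal sketch of step 3, assuming results are collected into a pandas DataFrame with one row per (template, scenario) pair; the `gt_min`/`gt_max` column names and the helpers are illustrative, not part of the fixture schema:

```python
import pandas as pd


def add_risk_deviation(results: pd.DataFrame) -> pd.DataFrame:
    """Distance of each risk_score from its ground-truth range (0 if inside the range)."""
    below = (results["gt_min"] - results["risk_score"]).clip(lower=0)
    above = (results["risk_score"] - results["gt_max"]).clip(lower=0)
    return results.assign(risk_deviation=below + above)


def key_point_coverage(reasoning: str, key_points: list[str]) -> float:
    """Fraction of expected key points mentioned verbatim in the model's reasoning."""
    text = reasoning.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points) if key_points else 1.0
```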
### LLM-Judge Rubrics

| Dimension | Scale | Evaluates |
|---|---|---|
| relevance | 1-4 | Does output address the actual security concern? |
| risk_calibration | 1-4 | Is score appropriate for scenario severity? |
| context_usage | 1-4 | Are enrichment inputs reflected in reasoning? |
| reasoning_quality | 1-4 | Is the explanation logical and complete? |
| threat_identification | 1-4 | Did it correctly identify/miss the actual threat? |
| actionability | 1-4 | Is the output useful for a homeowner to act on? |
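A sketch of how the rubric scores and the deviation metric could roll up into the template ranking report in step 4. The column names assume judge scores have been joined onto each (template, scenario) result row, and the overall weighting is illustrative only:

```python
import pandas as pd

RUBRIC_COLUMNS = ["relevance", "risk_calibration", "context_usage",
                  "reasoning_quality", "threat_identification", "actionability"]


def rank_templates(results: pd.DataFrame) -> pd.DataFrame:
    """Average judge scores and risk deviation per template, best template first."""
    summary = results.groupby("template")[RUBRIC_COLUMNS + ["risk_deviation"]].mean()
    # Higher rubric scores are better; lower deviation is better (weighting is a placeholder).
    summary["overall"] = summary[RUBRIC_COLUMNS].mean(axis=1) - summary["risk_deviation"] / 25
    return summary.sort_values("overall", ascending=False)
```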
## File Structure

```
backend/
├── tests/
│ ├── fixtures/
│ │ └── synthetic/ # Generated fixtures
│ │ ├── scenarios.parquet # Main scenario dataset
│ │ ├── ground_truth.json # Risk ranges, key points
│ │ ├── embeddings.npy # Pre-computed vectors
│ │ ├── images/ # Multimodal test images
│ │ │ ├── normal/
│ │ │ ├── suspicious/
│ │ │ ├── threat/
│ │ │ └── edge_case/
│ │ └── multimodal_ground_truth.parquet
│ │
│ ├── integration/
│ │ ├── test_nemotron_prompts.py # Prompt evaluation tests
│ │ └── test_multimodal_pipeline.py # Vision comparison tests
│ │
│ └── conftest.py # Fixture loaders
│
├── evaluation/ # Evaluation tooling
│ ├── __init__.py
│ ├── harness.py # Prompt evaluation runner
│ ├── metrics.py # Score calculation, comparisons
│ └── reports.py # Report generation (JSON, HTML)
tools/
└── nemo_data_designer/ # Generation scripts
├── __init__.py
├── config.py # Column definitions, Pydantic models
├── generate_scenarios.py # Main generation script
├── multimodal/
│ ├── __init__.py
│ ├── image_analyzer.py # NVIDIA vision API wrapper
│ ├── ground_truth_generator.py # Generate GT from images
│ └── pipeline_comparator.py # Compare local vs NVIDIA outputs
├── notebooks/
│ ├── 01_scenario_generation.ipynb
│ ├── 02_evaluation_analysis.ipynb
│ └── 03_multimodal_evaluation.ipynb
└── README.md
```

## Testing Integration

### Pytest Fixtures

```python
# backend/tests/conftest.py additions
from pathlib import Path

import pandas as pd
import pytest

SYNTHETIC_FIXTURES_DIR = Path(__file__).parent / "fixtures" / "synthetic"


@pytest.fixture(scope="session")
def synthetic_scenarios() -> pd.DataFrame:
    """Load pre-generated NeMo Data Designer scenarios."""
    return pd.read_parquet(SYNTHETIC_FIXTURES_DIR / "scenarios.parquet")


@pytest.fixture(scope="session")
def scenario_by_type(synthetic_scenarios):
    """Group scenarios for targeted testing."""
    return {
        "normal": synthetic_scenarios[synthetic_scenarios.scenario_type == "normal"],
        "suspicious": synthetic_scenarios[synthetic_scenarios.scenario_type == "suspicious"],
        "threat": synthetic_scenarios[synthetic_scenarios.scenario_type == "threat"],
        "edge_case": synthetic_scenarios[synthetic_scenarios.scenario_type == "edge_case"],
    }
```
### Test Patterns

```python
# backend/tests/integration/test_nemotron_prompts.py
import pytest

# evaluate_prompt and the template constants are assumed to be provided by the
# evaluation harness (see backend/evaluation/).


@pytest.mark.parametrize("template", PROMPT_TEMPLATES)
def test_risk_score_within_ground_truth_range(template, scenario_by_type):
    """Each template should produce scores within expected ranges."""
    for scenario in scenario_by_type["threat"].itertuples():
        result = evaluate_prompt(template, scenario.formatted_prompt_input)
        min_score, max_score = scenario.ground_truth_risk_range
        assert min_score <= result.risk_score <= max_score


def test_enrichment_context_reflected_in_reasoning(synthetic_scenarios):
    """Full enrichment scenarios should reference context in reasoning."""
    full_enrichment = synthetic_scenarios[
        synthetic_scenarios.enrichment_level == "full"
    ]
    for scenario in full_enrichment.itertuples():
        result = evaluate_prompt(ENRICHED_TEMPLATE, scenario.formatted_prompt_input)
        for key_point in scenario.reasoning_key_points:
            assert key_point.lower() in result.reasoning.lower()
```
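A third pattern worth covering is the false-positive side: normal scenarios should stay out of the alert band. A sketch in the same style, reusing the helpers assumed above; the 50-point threshold is an assumption to be aligned with the production alert rule:

```python
@pytest.mark.parametrize("template", PROMPT_TEMPLATES)
def test_normal_scenarios_stay_below_alert_band(template, scenario_by_type):
    """Normal activity should not be scored into the alert band."""
    for scenario in scenario_by_type["normal"].itertuples():
        result = evaluate_prompt(template, scenario.formatted_prompt_input)
        # Assumed alert threshold; adjust to match the production alert rule.
        assert result.risk_score < 50
```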
### CI Integration

```yaml
# .github/workflows/prompt-evaluation.yml
prompt-evaluation:
  runs-on: ubuntu-latest
  if: github.event_name == 'schedule'  # Nightly only
  steps:
    - uses: actions/checkout@v4
    - name: Run prompt evaluation suite
      run: |
        uv run pytest backend/tests/integration/test_nemotron_prompts.py \
          --tb=short -v --json-report
    - name: Upload evaluation report
      uses: actions/upload-artifact@v4
      with:
        name: prompt-evaluation-report
        path: reports/prompt_evaluation.json
```
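Note: the `--json-report` flag comes from the pytest-json-report plugin, so that plugin would need to be in the dev dependencies for this job to produce the report artifact.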
## Multimodal Evaluation (Phase 6)

### Pipeline

```
Sample Images (from curated test set)
│
├──────────────────┬───────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ NVIDIA API │ │ Your Pipeline │ │ │
│ (Vision Model) │ │ YOLO26 + │ │ Compare │
│ │ │ Florence-2 + │ │ Outputs │
│ Ground Truth │ │ Enrichment │ │ │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Comparison Metrics: │
│ • Detection accuracy (did YOLO26 find what NVIDIA saw?) │
│ • Enrichment quality (Florence-2 vs NVIDIA vision desc) │
│ • End-to-end risk score alignment │
└─────────────────────────────────────────────────────────────┘
```

### Additional Columns

| Column | Type | Purpose |
|---|---|---|
| image_path | Seed | Reference to test image |
| nvidia_vision_description | LLM-Structured | What NVIDIA's vision model sees |
| nvidia_detected_objects | LLM-Structured | Objects with bounding boxes |
| nvidia_risk_assessment | LLM-Structured | Full risk analysis from image |
| vision_alignment_score | LLM-Judge | How well does local pipeline match? |
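The detection-accuracy comparison hinges on box overlap between the local YOLO26 output and NVIDIA's detections. A minimal IoU helper, assuming both sides are normalized to the same (x, y, width, height) convention used by the Detection model:

```python
def iou(box_a: tuple[int, int, int, int], box_b: tuple[int, int, int, int]) -> float:
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(ax, bx)
    iy = max(ay, by)
    ix2 = min(ax + aw, bx + bw)
    iy2 = min(ay + ah, by + bh)
    inter = max(0, ix2 - ix) * max(0, iy2 - iy)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0
```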
### Image Curation

| Category | Count | Source |
|---|---|---|
| Normal activity | 25 | Stock footage, staged captures |
| Suspicious | 15 | Staged scenarios |
| Threat simulation | 10 | Controlled/synthetic (no real weapons) |
| Edge cases | 15 | Weather, lighting, occlusion, costumes |
| **Total** | **65** | Curated test set |
## Implementation Phases

| Phase | Scope | Deliverables | Effort |
|---|---|---|---|
| 1. Foundation | NeMo setup + basic generation | tools/nemo_data_designer/, initial 100 scenarios | 3-4 days |
| 2. Evaluation Harness | Metrics + comparison engine | backend/evaluation/, prompt comparison reports | 2-3 days |
| 3. Testing Integration | Pytest + CI | test_nemotron_prompts.py, GitHub Actions workflow | 2 days |
| 4. Full Coverage | Scenario expansion | 1,500+ scenarios, embeddings, coverage analysis | 3-4 days |
| 5. Enrichment Pipeline | Model zoo edge cases | Circuit breaker tests, VRAM eviction scenarios | 2-3 days |
| 6. Multimodal Evaluation | Image-based ground truth | Vision comparison pipeline, image test fixtures | 4-5 days |
Total estimated effort: 16-21 days
## Dependencies

### Python Packages (dev dependencies)

```toml
[project.optional-dependencies]
nemo = [
    "data-designer>=0.1.0",
    "pandas>=2.0.0",
    "pyarrow>=14.0.0",
    "numpy>=1.24.0",
]
```
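Assuming the project is managed with uv (as the CI job suggests), this extra could be pulled in with something like `uv sync --extra nemo`.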
### External Services

- NVIDIA API key (`NVIDIA_API_KEY` environment variable)
- Access to Nemotron 49B via build.nvidia.com

## Success Criteria

- **Prompt template ranking** - Quantitative comparison showing the best-performing template
- **Risk score consistency** - <15 point deviation from ground truth for 90%+ of scenarios
- **Context utilization** - Full enrichment scenarios score 3+ on the context_usage rubric
- **Edge case coverage** - All 4 scenario types have dedicated test fixtures
- **Multimodal alignment** - Local pipeline achieves 70%+ IoU with NVIDIA vision detection

## Open Questions

- Should we version fixtures in Git LFS or generate them on demand?
- What's the budget for NVIDIA API usage during generation?
- How often should we regenerate fixtures (quarterly, on prompt changes)?

## References