Automated Risk Score Validation Test Suite

Date: 2026-01-31
Status: Implemented
Related Issues: NEM-4533, NEM-4529, NEM-4527

Context

The AI pipeline's risk scoring accuracy needed automated validation against synthetic test scenarios. Previously, validation was manual and inconsistent, making it difficult to catch regressions in risk assessment quality.

Decision

Implemented a comprehensive automated validation test suite with three components:

1. Integration Test Suite (NEM-4533)

File: backend/tests/integration/test_risk_score_validation.py

Features:

  • Loads synthetic scenarios from data/synthetic/ with expected_labels.json
  • Compares actual risk scores to expected ranges
  • Reports pass/fail based on score being within expected range
  • Calculates gap rate (should be <20%)
  • Validates per-class detection accuracy (precision, recall, F1)
  • Generates scenario-type breakdown (normal/suspicious/threats)
  • Analyzes confidence distribution percentiles

Key Metrics:

  1. Gap Rate:

     • Percentage of scenarios where the actual risk score falls outside the expected range
     • Target: < 20%
     • Calculation: (scenarios with gaps / total scenarios) × 100%

  2. Per-Class Metrics (NEM-4529):

     • Precision, recall, and F1 for each object class (Person, Car, Dog, etc.)
     • Identifies which detection types are underperforming

  3. Per-Scenario-Type Metrics (NEM-4529):

     • Metrics aggregated by category (normal, suspicious, threats)
     • Highlights which scenario types have the highest error rates

  4. Confidence Distribution (NEM-4529):

     • P50, P90, P95, P99 percentiles of detection confidence
     • Indicates model certainty patterns
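The gap-rate and percentile metrics can be sketched with the standard library alone. The record fields used here (`score`, `expected_min`, `expected_max`) are illustrative names, not the suite's actual schema:

```python
from statistics import quantiles

def gap_rate(results: list[dict]) -> float:
    """Percentage of scenarios whose score falls outside [expected_min, expected_max]."""
    gaps = sum(
        1 for r in results
        if not (r["expected_min"] <= r["score"] <= r["expected_max"])
    )
    return gaps / len(results) * 100.0

def confidence_percentiles(confidences: list[float]) -> dict[str, float]:
    """P50/P90/P95/P99 of detection confidence values."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(confidences, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

results = [
    {"score": 50, "expected_min": 35, "expected_max": 60},  # within range
    {"score": 75, "expected_min": 35, "expected_max": 60},  # gap
]
print(gap_rate(results))  # 50.0
```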

Test Methods:

class TestRiskScoreValidation:
    async def test_load_synthetic_scenarios(self): ...
    async def test_risk_score_ranges_match_category(self): ...
    async def test_gap_rate_below_threshold(self): ...
    async def test_per_class_detection_accuracy(self): ...
    async def test_scenario_type_breakdown(self): ...
    async def test_confidence_distribution_percentiles(self): ...

2. Enhanced Validation Script (NEM-4529)

File: scripts/validate_detections.py (improvements documented, to be applied in separate PR)

Planned Enhancements:

  • Per-class precision/recall/F1 calculation
  • Scenario-type breakdown analysis
  • Confidence distribution percentiles
  • Enhanced JSON export with detailed metrics

Note: Script enhancements are documented but not applied in this PR to maintain separation of concerns. They will be implemented after the test suite is merged and validated.

3. Coverage Documentation (NEM-4527)

File: docs/guides/detection-validation-coverage.md

Contents:

  • How to process synthetic scenarios through the pipeline
  • Three methods: camera directory, API upload, bulk processing script
  • Interpreting validation results
  • Troubleshooting guide
  • Best practices for continuous validation

Implementation Details

Gap Calculation

if expected_min <= actual_score <= expected_max:
    gap = 0.0  # Within range
elif actual_score < expected_min:
    gap = float(expected_min - actual_score)  # Below range
else:
    gap = float(actual_score - expected_max)  # Above range
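Wrapped as a helper (the function name is illustrative), the branching above behaves as follows:

```python
def score_gap(actual_score: float, expected_min: float, expected_max: float) -> float:
    """Distance from the expected range; 0.0 when the score is inside it."""
    if expected_min <= actual_score <= expected_max:
        return 0.0  # within range
    if actual_score < expected_min:
        return float(expected_min - actual_score)  # below range
    return float(actual_score - expected_max)  # above range

# Matches the prowling example in the sample report: 75 vs [35, 60] → gap 15
print(score_gap(75, 35, 60))  # 15.0
```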

Per-Class Metrics

For each object class:

precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
f1_score = (
    2 * (precision * recall) / (precision + recall)
    if (precision + recall) > 0
    else 0.0
)
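As a self-contained sketch, the per-class computation from true-positive, false-positive, and false-negative counts (the counts below are made-up example values):

```python
def per_class_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision/recall/F1 from TP, FP, FN counts, guarding empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# e.g. Person: 8 correct detections, 2 spurious, 2 missed
# → precision and recall are both 8/10, so F1 is also 0.8
print(per_class_metrics(8, 2, 2))
```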

Scenario Type Inference

def _infer_scenario_type(scenario_name: str) -> str:
    scenario_lower = scenario_name.lower()
    if any(marker in scenario_lower for marker in ["suspicious", "casing", "loiter"]):
        return "suspicious"
    if any(marker in scenario_lower for marker in ["threat", "break", "intrude"]):
        return "threats"
    return "normal"

Consequences

Positive

  1. Automated Quality Gates:

     • CI/CD can fail builds if the gap rate exceeds 20%
     • Prevents regressions in risk scoring accuracy

  2. Detailed Diagnostics:

     • Per-class metrics identify specific detection issues
     • Scenario-type breakdown shows which categories need improvement
     • Confidence analysis reveals model certainty patterns

  3. Continuous Improvement:

     • Track metrics over time to measure improvements
     • A/B test prompt changes with quantitative metrics
     • Validate model updates before deployment

  4. Documentation:

     • Comprehensive guide for expanding validation coverage
     • Clear troubleshooting steps
     • Best practices for synthetic scenario generation

Negative

  1. Database Dependency:

     • Tests require a database with processed scenarios
     • Can't run in pure unit test mode
     • Mitigated by: clear fixtures and good error messages

  2. Synthetic Data Quality:

     • Results depend on the quality of the synthetic scenarios
     • Some scenarios may be miscategorized
     • Mitigated by: lenient category checks and clear reporting

  3. Processing Time:

     • Processing all scenarios takes time, so the validation suite runs slower than unit tests
     • Mitigated by: running on a subset of scenarios and parallel processing

Testing

Running Tests

# Run all validation tests
uv run pytest backend/tests/integration/test_risk_score_validation.py -v

# Run specific test
uv run pytest backend/tests/integration/test_risk_score_validation.py::TestRiskScoreValidation::test_gap_rate_below_threshold -v

# Run without database (only fixture tests)
uv run pytest backend/tests/integration/test_risk_score_validation.py::TestRiskScoreValidation::test_load_synthetic_scenarios -v

Expected Output

================================================================================
RISK SCORE VALIDATION REPORT
================================================================================
Total Scenarios: 150
Scenarios with Gaps: 12
Overall Gap Rate: 8.0%
Threshold: 20.0%

Per-Category Gap Rates:
--------------------------------------------------------------------------------
  normal      :   2/ 80 (  2.5%)
  suspicious  :   7/ 50 ( 14.0%)
  threats     :   3/ 20 ( 15.0%)

Top 10 Largest Gaps:
--------------------------------------------------------------------------------
  prowling_20260125_180257                 | Expected: [35, 60] | Actual:  75 | Gap:  15
  ...
================================================================================
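The per-category rows in the report above can be produced with plain string formatting; the helper name and the sample numbers are illustrative:

```python
def format_category_line(name: str, gaps: int, total: int) -> str:
    """One row of the per-category gap-rate table, matching the report layout."""
    rate = gaps / total * 100.0
    return f"  {name:<12}: {gaps:3d}/{total:3d} ({rate:5.1f}%)"

print(format_category_line("normal", 2, 80))      # "  normal      :   2/ 80 (  2.5%)"
print(format_category_line("suspicious", 7, 50))  # "  suspicious  :   7/ 50 ( 14.0%)"
```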

Future Enhancements

  1. Temporal Analysis:

     • Track gap rate trends over time
     • Alert on significant regressions
     • Visualize improvements in a dashboard

  2. Confidence Calibration:

     • Compare predicted vs actual risk levels
     • Measure calibration error
     • Identify overconfident/underconfident predictions

  3. Scenario Generation Automation:

     • Auto-generate scenarios from production events
     • Create edge cases based on historical failures
     • Expand coverage systematically

  4. Integration with CI/CD:

     • Automated scenario processing on PRs
     • Comment on PRs with validation results
     • Block merge if the gap rate exceeds the threshold

References