Automated Risk Score Validation Test Suite

Date: 2026-01-31
Status: Implemented
Related Issues: NEM-4533, NEM-4529, NEM-4527

Context

The AI pipeline's risk scoring accuracy needed automated validation against synthetic test scenarios. Previously, validation was manual and inconsistent, making it difficult to catch regressions in risk assessment quality.

Decision

Implemented a comprehensive automated validation test suite with three components:

1. Integration Test Suite (NEM-4533)

File: backend/tests/integration/test_risk_score_validation.py

Features:

  • Loads synthetic scenarios from data/synthetic/ with expected_labels.json
  • Compares actual risk scores to expected ranges
  • Reports pass/fail based on score being within expected range
  • Calculates gap rate (should be <20%)
  • Validates per-class detection accuracy (precision, recall, F1)
  • Generates scenario-type breakdown (normal/suspicious/threats)
  • Analyzes confidence distribution percentiles

Key Metrics:

  1. Gap Rate:

     • Percentage of scenarios where the actual risk score falls outside the expected range
     • Target: < 20%
     • Calculation: (scenarios with gaps / total scenarios) × 100%

  2. Per-Class Metrics (NEM-4529):

     • Precision, recall, and F1 for each object class (Person, Car, Dog, etc.)
     • Identifies which detection types are underperforming

  3. Per-Scenario-Type Metrics (NEM-4529):

     • Metrics aggregated by category (normal, suspicious, threats)
     • Highlights which scenario types have the highest error rates

  4. Confidence Distribution (NEM-4529):

     • P50, P90, P95, P99 percentiles of detection confidence
     • Indicates model certainty patterns
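The gap-rate and percentile metrics can be sketched with the standard library alone. The record fields used here (`score`, `expected_min`, `expected_max`) are illustrative names, not the suite's actual schema:

```python
from statistics import quantiles

def gap_rate(results: list[dict]) -> float:
    """Percentage of scenarios whose score falls outside [expected_min, expected_max]."""
    gaps = sum(
        1 for r in results
        if not (r["expected_min"] <= r["score"] <= r["expected_max"])
    )
    return gaps / len(results) * 100.0

def confidence_percentiles(confidences: list[float]) -> dict[str, float]:
    """P50/P90/P95/P99 of detection confidence values."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(confidences, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

results = [
    {"score": 50, "expected_min": 35, "expected_max": 60},  # within range
    {"score": 75, "expected_min": 35, "expected_max": 60},  # gap
]
print(gap_rate(results))  # 50.0
```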

Test Methods:

class TestRiskScoreValidation:
    async def test_load_synthetic_scenarios(self): ...
    async def test_risk_score_ranges_match_category(self): ...
    async def test_gap_rate_below_threshold(self): ...
    async def test_per_class_detection_accuracy(self): ...
    async def test_scenario_type_breakdown(self): ...
    async def test_confidence_distribution_percentiles(self): ...

2. Enhanced Validation Script (NEM-4529)

File: scripts/validate_detections.py (improvements documented, to be applied in separate PR)

Planned Enhancements:

  • Per-class precision/recall/F1 calculation
  • Scenario-type breakdown analysis
  • Confidence distribution percentiles
  • Enhanced JSON export with detailed metrics

Note: Script enhancements are documented but not applied in this PR to maintain separation of concerns. They will be implemented after the test suite is merged and validated.

3. Coverage Documentation (NEM-4527)

File: docs/guides/detection-validation-coverage.md

Contents:

  • How to process synthetic scenarios through the pipeline
  • Three methods: camera directory, API upload, bulk processing script
  • Interpreting validation results
  • Troubleshooting guide
  • Best practices for continuous validation

Implementation Details

Gap Calculation

if expected_min <= actual_score <= expected_max:
    gap = 0.0  # Within range
elif actual_score < expected_min:
    gap = float(expected_min - actual_score)  # Below range
else:
    gap = float(actual_score - expected_max)  # Above range
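Wrapped as a helper (the function name is illustrative), the branching above behaves as follows:

```python
def score_gap(actual_score: float, expected_min: float, expected_max: float) -> float:
    """Distance from the expected range; 0.0 when the score is inside it."""
    if expected_min <= actual_score <= expected_max:
        return 0.0  # within range
    if actual_score < expected_min:
        return float(expected_min - actual_score)  # below range
    return float(actual_score - expected_max)  # above range

# Matches the prowling example in the sample report: 75 vs [35, 60] → gap 15
print(score_gap(75, 35, 60))  # 15.0
```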

Per-Class Metrics

For each object class:

precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
f1_score = (
    2 * (precision * recall) / (precision + recall)
    if (precision + recall) > 0
    else 0.0
)
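As a self-contained sketch, the per-class computation from true-positive, false-positive, and false-negative counts (the counts below are made-up example values):

```python
def per_class_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision/recall/F1 from TP, FP, FN counts, guarding empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# e.g. Person: 8 correct detections, 2 spurious, 2 missed
# → precision and recall are both 8/10, so F1 is also 0.8
print(per_class_metrics(8, 2, 2))
```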

Scenario Type Inference

def _infer_scenario_type(scenario_name: str) -> str:
    scenario_lower = scenario_name.lower()
    if any(marker in scenario_lower for marker in ["suspicious", "casing", "loiter"]):
        return "suspicious"
    if any(marker in scenario_lower for marker in ["threat", "break", "intrude"]):
        return "threats"
    return "normal"

Consequences

Positive

  1. Automated Quality Gates:

     • CI/CD can fail builds if the gap rate exceeds 20%
     • Prevents regressions in risk scoring accuracy

  2. Detailed Diagnostics:

     • Per-class metrics identify specific detection issues
     • Scenario-type breakdown shows which categories need improvement
     • Confidence analysis reveals model certainty patterns

  3. Continuous Improvement:

     • Track metrics over time to measure improvements
     • A/B test prompt changes with quantitative metrics
     • Validate model updates before deployment

  4. Documentation:

     • Comprehensive guide for expanding validation coverage
     • Clear troubleshooting steps
     • Best practices for synthetic scenario generation

Negative

  1. Database Dependency:

     • Tests require a database with processed scenarios
     • Can't run in pure unit test mode
     • Mitigated by: clear fixtures and good error messages

  2. Synthetic Data Quality:

     • Results depend on the quality of the synthetic scenarios
     • Some scenarios may be miscategorized
     • Mitigated by: lenient category checks and clear reporting

  3. Processing Time:

     • Processing all scenarios takes time, so the validation suite runs slower than unit tests
     • Mitigated by: running on a subset of scenarios and parallel processing

Testing

Running Tests

# Run all validation tests
uv run pytest backend/tests/integration/test_risk_score_validation.py -v

# Run specific test
uv run pytest backend/tests/integration/test_risk_score_validation.py::TestRiskScoreValidation::test_gap_rate_below_threshold -v

# Run without database (only fixture tests)
uv run pytest backend/tests/integration/test_risk_score_validation.py::TestRiskScoreValidation::test_load_synthetic_scenarios -v

Expected Output

================================================================================
RISK SCORE VALIDATION REPORT
================================================================================
Total Scenarios: 150
Scenarios with Gaps: 12
Overall Gap Rate: 8.0%
Threshold: 20.0%

Per-Category Gap Rates:
--------------------------------------------------------------------------------
  normal      :   2/ 80 (  2.5%)
  suspicious  :   7/ 50 ( 14.0%)
  threats     :   3/ 20 ( 15.0%)

Top 10 Largest Gaps:
--------------------------------------------------------------------------------
  prowling_20260125_180257                 | Expected: [35, 60] | Actual:  75 | Gap:  15
  ...
================================================================================
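The per-category rows in the report above can be produced with plain string formatting; the helper name and the sample numbers are illustrative:

```python
def format_category_line(name: str, gaps: int, total: int) -> str:
    """One row of the per-category gap-rate table, matching the report layout."""
    rate = gaps / total * 100.0
    return f"  {name:<12}: {gaps:3d}/{total:3d} ({rate:5.1f}%)"

print(format_category_line("normal", 2, 80))      # "  normal      :   2/ 80 (  2.5%)"
print(format_category_line("suspicious", 7, 50))  # "  suspicious  :   7/ 50 ( 14.0%)"
```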

Future Enhancements

  1. Temporal Analysis:

     • Track gap rate trends over time
     • Alert on significant regressions
     • Visualize improvements in a dashboard

  2. Confidence Calibration:

     • Compare predicted vs actual risk levels
     • Measure calibration error
     • Identify overconfident/underconfident predictions

  3. Scenario Generation Automation:

     • Auto-generate scenarios from production events
     • Create edge cases based on historical failures
     • Expand coverage systematically

  4. Integration with CI/CD:

     • Automated scenario processing on PRs
     • Comment on PRs with validation results
     • Block merge if the gap rate exceeds the threshold

References