Test Performance Metrics¶
This document tracks test performance baselines and CI parallelization configuration for the project.
Last updated: 2026-01-05 Related issue: NEM-1375
Test Suite Overview¶
| Suite | Test Count | Parallelization | CI Configuration |
|---|---|---|---|
| Backend Unit | 8,229 | pytest-xdist (-n auto --dist=worksteal) | 1 job, parallel workers |
| Backend Integration | 1,556 | Serial (-n0) - 4 domain shards | 4 parallel jobs (API, WebSocket, Services, Models) |
| Frontend Vitest | ~2,000+ | Thread pool | 8 shards |
| Frontend E2E (Chromium) | ~500+ | Playwright workers | 4 shards |
| Frontend E2E (Firefox/WebKit) | ~500+ | Playwright workers | 1 job each (non-blocking) |
CI Parallelization Strategy¶
Backend Tests¶
Unit Tests¶
- Configuration: Single job with
pytest-xdist - Flags:
-n auto --dist=worksteal - Rationale: Worksteal scheduling optimizes for uneven test durations
- Expected runtime: ~60-90 seconds
Integration Tests (4 Domain Shards)¶
Domain-based sharding to maximize parallelism while avoiding database contention:
| Shard | Job Name | Test Domains | Filter Pattern |
|---|---|---|---|
| 1 | integration-tests-api | API routes | test_*_api or test_api_* |
| 2 | integration-tests-websocket | WebSocket/PubSub | test_websocket or test_*_broadcaster or test_*_pubsub |
| 3 | integration-tests-services | Business logic | test_*_integration or test_batch* or test_detector* |
| 4 | integration-tests-models | Database models | test_models or test_database or test_baseline* |
- Configuration: Serial execution within each shard (
-n0) - Rationale: Avoids database deadlocks from concurrent schema creation
- Expected runtime per shard: ~60-90 seconds (runs in parallel)
- Total wall-clock time: ~2-3 minutes (down from ~6-8 minutes serial)
Frontend Tests¶
Vitest (Component/Hook Tests)¶
- Configuration: 8-shard matrix strategy
- CI command:
npx vitest run --shard=${{ matrix.shard }}/8 - Timeout: 3 minutes per shard
- Coverage: Collected per-shard, merged in
frontend-coverage-mergejob - Expected runtime per shard: ~30-60 seconds
Playwright E2E Tests¶
- Primary (Chromium): 4 shards with
--shard=1/4through--shard=4/4 - Secondary (Firefox/WebKit): 1 job each, non-blocking (
continue-on-error: true) - Workers: 4 parallel workers in CI
- Retries: 2 retries on CI for flaky test handling
- Expected runtime per shard: ~2-4 minutes
Performance Thresholds¶
The scripts/audit-test-durations.py script enforces these thresholds:
| Test Category | Max Duration | Warning at (80%) |
|---|---|---|
| Unit tests | 1.0s | 0.8s |
| Integration tests | 5.0s | 4.0s |
| E2E tests | 5.0s | 4.0s |
| Known slow tests | 60.0s | 48.0s |
| Benchmarks | Excluded | N/A |
Known Slow Test Patterns¶
Tests matching these patterns use the extended 60s threshold:
- Pipeline worker manager tests (real stop timeouts)
- Health monitor subprocess tests
- System broadcaster reconnection tests
- Property-based tests with text generation
- TLS certificate generation (RSA key gen)
- E2E error state tests (API retry exhaustion)
- Accessibility tests (axe-core analysis)
- ReID service retry/backoff tests
See scripts/audit-test-durations.py for the full pattern list.
Baseline Metrics (2026-01-05)¶
Test Counts (Minimum Thresholds)¶
These baselines prevent accidental test deletion:
| Suite | Minimum Count | Current Count |
|---|---|---|
| Backend Unit | 2,900 | 8,229 |
| Backend Integration | 600 | 1,556 |
| Frontend Test Files | 50 | 135 |
| Frontend E2E Specs | 15 | 26 |
Expected CI Duration¶
| Stage | Expected Duration | Notes |
|---|---|---|
| Lint + Typecheck | ~1-2 min | Runs in parallel |
| Backend Unit Tests | ~1-2 min | pytest-xdist parallel |
| Backend Integration Tests | ~2-3 min | 4 shards in parallel |
| Frontend Vitest | ~1-2 min | 8 shards in parallel |
| Frontend E2E (Chromium) | ~3-5 min | 4 shards in parallel |
| Total CI time | ~8-12 min | All jobs parallelized |
Improvement from Parallelization¶
| Configuration | Estimated Time | Improvement |
|---|---|---|
| Serial (no parallelization) | ~25-35 min | Baseline |
| Current parallelized | ~8-12 min | 60-70% reduction |
Caching Strategy¶
GitHub Actions Cache¶
| Cache | Key Pattern | Purpose |
|---|---|---|
| uv dependencies | uv-${{ runner.os }}-${{ hashFiles('uv.lock') }} | Python packages |
| npm dependencies | npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }} | Node packages |
| Playwright browsers | playwright-*-${{ runner.os }}-${{ hashFiles('package-lock.json') }} | Browser binaries |
| Docker layers | gha,scope=backend / gha,scope=frontend | Image build layers |
Test Performance Audit Job¶
The test-performance-audit CI job:
- Downloads JUnit XML results from all test jobs
- Parses test durations from XML files
- Flags tests exceeding thresholds
- Reports warnings for tests approaching limits
- Excludes benchmark tests from analysis
Running Locally¶
# After running tests with JUnit output
uv run python scripts/audit-test-durations.py test-results/
# Environment variable overrides
UNIT_TEST_THRESHOLD=2.0 \
INTEGRATION_TEST_THRESHOLD=10.0 \
uv run python scripts/audit-test-durations.py test-results/
Monitoring and Maintenance¶
Weekly Checks¶
- Review CI run times in GitHub Actions
- Check for new slow tests in audit reports
- Verify test count thresholds remain accurate
- Update slow test patterns if needed
When to Update Baselines¶
Update this document when:
- Test count baselines change significantly (add tests)
- New slow test patterns are identified
- CI configuration changes (shards, parallelization)
- Performance improvements are made
Alerting¶
The CI workflow includes:
test-count-verificationjob - fails if test counts drop below thresholdstest-performance-auditjob - warns on slow tests, fails if threshold exceeded
References¶
- CI workflow:
.github/workflows/ci.yml - Test duration audit:
scripts/audit-test-durations.py - Vitest config:
frontend/vite.config.ts - Playwright config:
frontend/playwright.config.ts - pytest config:
pyproject.toml