Prompt Management Architecture

Technical documentation for the AI prompt management system, including A/B testing, version history, and the Prompt Playground UI.

Time to read: ~20 min
Prerequisites: Architecture Overview, AI Pipeline


Overview

The prompt management system provides a complete solution for managing AI model configurations across the security pipeline. It supports:

  • Version Control: Every change creates a new version with full history
  • A/B Testing: Compare prompt variations on real events before deployment
  • Shadow Mode: Run experimental prompts in parallel without affecting production
  • Automatic Rollback: Detect performance degradation and revert automatically
  • Import/Export: Backup and share configurations across instances

Architecture

Figure: Prompt Management Full Stack. Full-stack prompt management flow from UI to database to AI inference.

Components

┌─────────────────────────────────────────────────────────────────────┐
│                         Frontend                                     │
├─────────────────────────────────────────────────────────────────────┤
│  PromptPlayground.tsx    │  PromptABTest.tsx    │  ABTestStats.tsx  │
│  - Model editors         │  - Split-view UI     │  - Statistics     │
│  - Diff preview          │  - Delta indicators  │  - Comparison     │
│  - Test configuration    │  - Result panels     │  - Charts         │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                         Backend API                                  │
├─────────────────────────────────────────────────────────────────────┤
│  /api/ai-audit/prompts                                               │
│  - GET    /           - Get all prompts                              │
│  - GET    /{model}    - Get prompt for model                         │
│  - PUT    /{model}    - Update prompt (creates new version)          │
│  - GET    /export     - Export all configurations                    │
│  - POST   /import     - Import configurations                        │
│  - POST   /import/preview - Preview import changes                   │
│  - GET    /history    - Get version history                          │
│  - POST   /history/{id} - Restore version                            │
│  - POST   /test       - Test prompt against event                    │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                         Services                                     │
├─────────────────────────────────────────────────────────────────────┤
│  PromptService          │  PromptABTester       │  PromptEvaluator  │
│  - CRUD operations      │  - Traffic splitting  │  - Batch eval     │
│  - Version management   │  - A/B selection      │  - Comparison     │
│  - Import/Export        │  - Metrics recording  │  - Correlation    │
│                         │                       │                   │
│  PromptShadowRunner     │  PromptRollbackChecker                    │
│  - Shadow execution     │  - Performance check                      │
│  - Result comparison    │  - Auto rollback                          │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                         Database                                     │
├─────────────────────────────────────────────────────────────────────┤
│  prompt_versions                                                     │
│  - id, model, version, config_json, created_at, created_by          │
│  - change_description, is_active, row_version                        │
│  - Indexes: model, model+version, model+is_active, created_at       │
│  - Unique constraint: (model, version)                               │
└─────────────────────────────────────────────────────────────────────┘

Data Flow

  1. User edits prompt in PromptPlayground UI
  2. Test request sent to /api/ai-audit/prompts/test
  3. PromptService runs LLM inference with modified config
  4. Results returned and displayed in before/after comparison
  5. User saves via PUT /api/ai-audit/prompts/{model}
  6. New version created in prompt_versions table
  7. Old version deactivated (is_active = false)
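
A minimal sketch of the save portion of this flow (steps 5-7) from a client's perspective, using httpx. The base URL and the shape of the GET response are assumptions; the PUT body fields follow the PromptUpdateRequest schema described later in this document.

import httpx

BASE = "http://localhost:8000/api/ai-audit/prompts"  # assumed host and port

async def lower_nemotron_temperature() -> None:
    async with httpx.AsyncClient() as client:
        # Fetch the active configuration (response shape assumed)
        current = (await client.get(f"{BASE}/nemotron")).json()
        config = current["config"]

        # Edit locally, then save: the PUT creates a new version server-side
        config["temperature"] = 0.5
        resp = await client.put(
            f"{BASE}/nemotron",
            json={
                "config": config,
                "change_description": "Lower temperature for steadier scoring",
            },
        )
        resp.raise_for_status()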

Database Schema

PromptVersion Model

Source: backend/models/prompt_version.py

class PromptVersion(Base):
    __tablename__ = "prompt_versions"

    id: Mapped[int]                    # Primary key
    model: Mapped[str]                 # Enum: nemotron, florence2, yolo_world, xclip, fashion_clip
    version: Mapped[int]               # Version number (auto-incremented per model)
    created_at: Mapped[datetime]       # When this version was created
    created_by: Mapped[str | None]     # Optional user identifier
    config_json: Mapped[str]           # JSON configuration (see below)
    change_description: Mapped[str | None]  # Optional description of changes
    is_active: Mapped[bool]            # True for current active version
    row_version: Mapped[int]           # Optimistic locking counter

Configuration JSON Formats

Nemotron:

{
  "system_prompt": "You are a security analysis AI...",
  "temperature": 0.7,
  "max_tokens": 2048
}

Florence-2:

{
  "vqa_queries": ["What is the person doing?", "What objects are they carrying?"]
}

YOLO-World:

{
  "object_classes": ["knife", "gun", "package"],
  "confidence_threshold": 0.35
}

X-CLIP:

{
  "action_classes": ["loitering", "running away", "fighting"]
}

Fashion-CLIP:

{
  "clothing_categories": ["dark hoodie", "face mask"],
  "suspicious_indicators": ["all black", "gloves at night"]
}

Indexes

| Index Name | Columns | Purpose |
|---|---|---|
| idx_prompt_versions_model | model | Filter by model |
| idx_prompt_versions_model_version | model, version | Get specific version |
| idx_prompt_versions_model_active | model, is_active | Find active version |
| idx_prompt_versions_created_at | created_at | Order by creation date |

Unique Constraint

  • uq_prompt_version_model_version: Ensures (model, version) combinations are unique
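
For reference, a sketch of how these indexes and the unique constraint could be declared with SQLAlchemy. This is illustrative only; the actual declaration lives in backend/models/prompt_version.py.

from datetime import datetime

from sqlalchemy import Index, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class PromptVersion(Base):
    __tablename__ = "prompt_versions"
    __table_args__ = (
        UniqueConstraint("model", "version", name="uq_prompt_version_model_version"),
        Index("idx_prompt_versions_model", "model"),
        Index("idx_prompt_versions_model_version", "model", "version"),
        Index("idx_prompt_versions_model_active", "model", "is_active"),
        Index("idx_prompt_versions_created_at", "created_at"),
    )

    id: Mapped[int] = mapped_column(primary_key=True)
    model: Mapped[str]
    version: Mapped[int]
    is_active: Mapped[bool]
    created_at: Mapped[datetime]
    # ... remaining columns as shown in the PromptVersion model above ...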

API Endpoints

Core Endpoints

Source: backend/api/routes/prompt_management.py

| Method | Endpoint | Description | Rate Limited |
|---|---|---|---|
| GET | /api/ai-audit/prompts | Get all model configurations | No |
| GET | /api/ai-audit/prompts/{model} | Get specific model configuration | No |
| PUT | /api/ai-audit/prompts/{model} | Update model configuration | No |
| GET | /api/ai-audit/prompts/export | Export all configurations | No |
| POST | /api/ai-audit/prompts/import | Import configurations | No |
| POST | /api/ai-audit/prompts/import/preview | Preview import changes | No |
| GET | /api/ai-audit/prompts/history | Get version history | No |
| POST | /api/ai-audit/prompts/history/{id} | Restore specific version | No |
| POST | /api/ai-audit/prompts/test | Test prompt against event | Yes (10/min) |

Request/Response Schemas

Source: backend/api/schemas/prompt_management.py

PromptUpdateRequest:

class PromptUpdateRequest(BaseModel):
    config: dict[str, Any]                    # New configuration
    change_description: str | None = None     # Optional change description
    expected_version: int | None = None       # For optimistic locking

Optimistic Locking: If expected_version is provided and does not match the current version, the API returns 409 Conflict. This prevents race conditions when multiple users edit the same prompt.
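
A hedged sketch of a client-side retry on conflict. save_with_retry is a hypothetical helper, and the GET response is assumed to expose the current version number.

import httpx

BASE = "http://localhost:8000/api/ai-audit/prompts"  # assumed host and port

async def save_with_retry(client: httpx.AsyncClient, model: str, config: dict, attempts: int = 3) -> None:
    for _ in range(attempts):
        # Refetch so expected_version reflects the latest saved version
        current = (await client.get(f"{BASE}/{model}")).json()
        resp = await client.put(
            f"{BASE}/{model}",
            json={"config": config, "expected_version": current.get("version")},
        )
        if resp.status_code != 409:
            resp.raise_for_status()
            return
        # 409 Conflict: another editor saved first; loop and try again
    raise RuntimeError(f"Could not save prompt for {model} after {attempts} conflicts")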

Validation

Each model has a dedicated validation schema:

def validate_config_for_model(model: AIModelEnum, config: dict) -> list[str]:
    """Validate configuration for a specific model."""

Validation checks:

  • Required fields are present
  • Field types are correct
  • Values are within valid ranges
  • Lists are not empty
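
A minimal sketch of how one of these checks could be expressed with Pydantic. The YoloWorldConfig model below is illustrative, not the project's actual schema.

from typing import Any

from pydantic import BaseModel, Field, ValidationError

class YoloWorldConfig(BaseModel):
    object_classes: list[str] = Field(..., min_length=1)       # required, non-empty list
    confidence_threshold: float = Field(..., ge=0.0, le=1.0)   # required, within valid range

def validate_config(config: dict[str, Any]) -> list[str]:
    """Return human-readable validation errors; an empty list means the config is valid."""
    try:
        YoloWorldConfig(**config)
    except ValidationError as exc:
        return [f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}" for err in exc.errors()]
    return []

# Example: a missing field and an out-of-range threshold both surface as errors
print(validate_config({"confidence_threshold": 1.5}))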

Service Layer

PromptService

Source: backend/services/prompt_service.py

The main service for CRUD operations on prompts.

class PromptService:
    async def get_all_prompts(self, session: AsyncSession) -> dict[str, dict]:
        """Get active configurations for all models."""

    async def get_prompt_for_model(self, session: AsyncSession, model: str) -> dict:
        """Get active configuration for a specific model."""

    async def update_prompt_for_model(
        self,
        session: AsyncSession,
        model: str,
        config: dict,
        change_description: str | None = None,
        created_by: str | None = None,
        expected_version: int | None = None,
    ) -> PromptVersion:
        """Create new version with updated configuration."""

    async def get_version_history(
        self,
        session: AsyncSession,
        model: str | None = None,
        limit: int = 50,
        offset: int = 0,
    ) -> tuple[list[PromptVersion], int]:
        """Get paginated version history."""

    async def restore_version(
        self,
        session: AsyncSession,
        version_id: int,
    ) -> PromptVersion:
        """Restore a previous version (creates new version with old config)."""

PromptABTester

Manages A/B testing traffic splitting.

@dataclass
class ABTestConfig:
    control_version: int       # Current production version
    treatment_version: int     # New version being tested
    traffic_split: float       # 0.0 to 1.0 - fraction to treatment
    model: str                 # Model name
    enabled: bool = True

class PromptABTester:
    def select_prompt_version(self) -> tuple[int, bool]:
        """Select version based on traffic split. Returns (version, is_treatment)."""

    async def record_prompt_execution(
        self,
        version: int,
        latency_seconds: float,
        risk_score: int,
    ) -> None:
        """Record metrics for analysis."""

A/B Testing Traffic Split Flowchart

flowchart TD
    A[Incoming Request] --> B{A/B Test<br>enabled?}
    B -->|No| C[Use Active<br>Production Version]
    B -->|Yes| D[Generate Random<br>Value 0.0-1.0]

    D --> E{random <=<br>traffic_split?}
    E -->|Yes| F[Select TREATMENT<br>New Version]
    E -->|No| G[Select CONTROL<br>Current Version]

    F --> H[Execute with<br>Treatment Prompt]
    G --> I[Execute with<br>Control Prompt]
    C --> J[Execute with<br>Production Prompt]

    H --> K[Record Metrics:<br>latency, risk_score, version]
    I --> K
    J --> L[Standard Metrics]

    K --> M[Aggregate for<br>Statistical Analysis]

    style F fill:#3B82F6,color:#fff
    style G fill:#76B900,color:#000
    style C fill:#76B900,color:#000

Traffic Split Examples:

  • traffic_split: 0.1 - 10% traffic to treatment, 90% to control
  • traffic_split: 0.5 - 50/50 split for balanced comparison
  • traffic_split: 0.0 - All traffic to control (test disabled)
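
A minimal sketch of the selection logic described above (not the project's actual implementation; the config dataclass is repeated here to keep the example self-contained):

import random
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    control_version: int
    treatment_version: int
    traffic_split: float  # 0.0 to 1.0 - fraction routed to treatment
    model: str
    enabled: bool = True

def select_prompt_version(config: ABTestConfig) -> tuple[int, bool]:
    """Return (version, is_treatment) according to the traffic split."""
    if not config.enabled or config.traffic_split <= 0.0:
        return config.control_version, False
    if random.random() <= config.traffic_split:
        return config.treatment_version, True
    return config.control_version, False

# Example: roughly 10% of requests should land on the treatment version
config = ABTestConfig(control_version=3, treatment_version=4, traffic_split=0.1, model="nemotron")
print(select_prompt_version(config))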

PromptShadowRunner

Runs experimental prompts in parallel without affecting production.

@dataclass
class ShadowModeConfig:
    enabled: bool
    control_version: int
    shadow_version: int
    model: str
    log_comparisons: bool = True

@dataclass
class ShadowComparisonResult:
    control_result: dict
    shadow_result: dict | None
    risk_score_diff: float
    control_latency_ms: float
    shadow_latency_ms: float
    shadow_error: str | None

class PromptShadowRunner:
    async def run_shadow_comparison(self, context: str) -> ShadowComparisonResult:
        """Run both prompts and compare results."""

PromptRollbackChecker

Monitors performance and triggers automatic rollback.

@dataclass
class RollbackConfig:
    enabled: bool = True
    max_latency_increase_pct: float = 50.0   # Rollback if latency increases 50%+
    max_score_variance: float = 15.0         # Rollback if score variance > 15
    min_samples: int = 100                   # Minimum samples before checking
    evaluation_window_hours: int = 1

class PromptRollbackChecker:
    async def check_rollback_needed(self, metrics) -> RollbackCheckResult:
        """Check if rollback should be triggered."""

    async def execute_rollback(
        self,
        session: AsyncSession,
        ab_config,
        reason: str,
    ) -> RollbackExecutionResult:
        """Execute rollback to control version."""

Rollback Trigger Detection Flowchart

flowchart TD
    A[Collect Metrics<br>from Treatment] --> B{samples >=<br>min_samples?}
    B -->|No| C[Continue Collecting<br>Wait for more data]
    B -->|Yes| D[Calculate Treatment<br>Statistics]

    D --> E{latency increase ><br>max_latency_increase_pct?}
    E -->|Yes| F[TRIGGER ROLLBACK<br>Reason: Latency Degradation]
    E -->|No| G{score_variance ><br>max_score_variance?}

    G -->|Yes| H[TRIGGER ROLLBACK<br>Reason: Score Variance]
    G -->|No| I{Within evaluation<br>window?}

    I -->|Yes| J[Continue A/B Test<br>No Issues Detected]
    I -->|No| K[Evaluation Complete<br>Promote or Reject]

    F --> L[Execute Rollback]
    H --> L

    L --> M[Deactivate Treatment]
    M --> N[Restore Control as Active]
    N --> O[Log Rollback Event]
    O --> P[Alert Operators]

    style F fill:#EF4444,color:#fff
    style H fill:#EF4444,color:#fff
    style J fill:#22C55E,color:#fff
    style K fill:#3B82F6,color:#fff

Rollback Thresholds:

| Metric | Default | Description |
|---|---|---|
| max_latency_increase_pct | 50% | Rollback if latency increases 50%+ |
| max_score_variance | 15 | Rollback if score variance > 15 |
| min_samples | 100 | Minimum samples before checking |
| evaluation_window_hours | 1 | Time window for metrics collection |
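
A sketch of the trigger logic these thresholds imply. The metric inputs (per-arm latency and score samples) are assumptions about how the data is aggregated.

from dataclasses import dataclass
from statistics import pvariance

@dataclass
class RollbackConfig:
    max_latency_increase_pct: float = 50.0
    max_score_variance: float = 15.0
    min_samples: int = 100

def check_rollback_needed(
    config: RollbackConfig,
    control_latencies: list[float],
    treatment_latencies: list[float],
    treatment_scores: list[int],
) -> str | None:
    """Return a rollback reason, or None if the treatment looks healthy so far."""
    if len(treatment_latencies) < config.min_samples:
        return None  # not enough samples yet; keep collecting

    control_avg = sum(control_latencies) / len(control_latencies)
    treatment_avg = sum(treatment_latencies) / len(treatment_latencies)
    latency_increase_pct = (treatment_avg - control_avg) / control_avg * 100.0
    if latency_increase_pct > config.max_latency_increase_pct:
        return "latency_degradation"

    if pvariance(treatment_scores) > config.max_score_variance:
        return "score_variance"

    return None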

PromptEvaluator

Evaluates prompts against historical events.

@dataclass
class EvaluationResults:
    total_events: int
    average_score_diff: float | None
    score_variance: float | None
    average_latency_ms: float
    score_correlation: float | None

class PromptEvaluator:
    async def create_evaluation_batch(
        self,
        session: AsyncSession,
        hours_back: int = 24,
        sample_size: int = 100,
    ) -> EvaluationBatch:
        """Create batch of historical events for evaluation."""

    async def evaluate_prompt_version(
        self,
        session: AsyncSession,
        prompt_version: int,
        batch: EvaluationBatch,
    ) -> EvaluationResults:
        """Evaluate prompt version against event batch."""

    async def compare_prompt_versions(
        self,
        session: AsyncSession,
        version_a: int,
        version_b: int,
        batch: EvaluationBatch,
    ) -> VersionComparisonResult:
        """Compare two versions and recommend best."""

Frontend Components

PromptPlayground

Source: frontend/src/components/ai/PromptPlayground.tsx

Main slide-out panel for prompt editing and testing.

Props:

interface PromptPlaygroundProps {
  isOpen: boolean;
  onClose: () => void;
  recommendation?: AiAuditRecommendationItem | null;
  sourceEventId?: number | null;
  enrichedSuggestion?: EnrichedSuggestion | null;
  initialShowDiffPreview?: boolean;
}

Key Features:

  • Model Editors: Accordion panels for each AI model
  • Syntax Highlighting: Variables highlighted in green ({variable_name})
  • Line Numbers: Synchronized scrolling with editor content
  • Diff Preview: Side-by-side comparison when applying suggestions
  • Toast Notifications: Success/error feedback

State Management:

// Prompts and editing
const [prompts, setPrompts] = useState<AllPromptsResponse | null>(null);
const [editedConfigs, setEditedConfigs] = useState<Record<string, Record<string, unknown>>>({});

// A/B testing
const [showABTest, setShowABTest] = useState(false);
const [abTestResults, setAbTestResults] = useState<ABTestResult[]>([]);
const [isRunningABTest, setIsRunningABTest] = useState(false);

PromptABTest

Source: frontend/src/components/ai/PromptABTest.tsx

Split-view A/B testing component.

interface PromptABTestProps {
  originalPrompt: string;
  modifiedPrompt: string;
  results: ABTestResult[];
  isRunning: boolean;
  onRunTest: (eventId: number) => void;
  onRunRandomTests: (count: number) => void;
  onPromoteB: () => void;
}

Delta Indicator Logic:

// Green (improvement): B is less alarming (delta <= -5)
// Red (regression): B is more alarming (delta >= +5)
// Gray (neutral): Within threshold (-4 to +4)

ABTestStats

Source: frontend/src/components/ai/ABTestStats.tsx

Calculate and display A/B test statistics.

export function calculateStats(results: ABTestResult[]): ABTestStats {
  return {
    totalTests: results.length,
    avgScoreDelta: /* average of all score deltas */,
    improvementRate: /* percentage where B was better */,
  };
}

Prompt Diff Utilities

Source: frontend/src/utils/promptDiff.ts

Functions for generating diffs and applying suggestions.

// Apply a suggestion to a prompt
export function applySuggestion(
  originalPrompt: string,
  suggestion: EnrichedSuggestion
): InsertionResult;

// Generate visual diff between prompts
export function generateDiff(
  originalPrompt: string,
  modifiedPrompt: string,
  contextLines?: number
): DiffLine[];

// Check if suggestion is already applied
export function isSuggestionApplied(prompt: string, suggestion: EnrichedSuggestion): boolean;

A/B Testing Service

Source: frontend/src/services/abTestService.ts

Frontend service for executing A/B tests.

export interface ABTestService {
  runTest(
    eventId: number,
    originalPrompt: string,
    modifiedPrompt: string,
    modelConfig: ModelConfig
  ): Promise<ABTestResult>;

  runRandomTests(
    count: number,
    originalPrompt: string,
    modifiedPrompt: string,
    modelConfig: ModelConfig
  ): Promise<ABTestResult[]>;

  getAvailableEvents(limit?: number): Promise<EventSummary[]>;
}

Implementing New Prompt Templates

Adding a New AI Model

  1. Add enum value:

# backend/models/prompt_version.py
class AIModel(str, Enum):
    # ... existing models ...
    NEW_MODEL = "new_model"

  2. Add validation schema:

# backend/api/schemas/prompt_management.py
class NewModelConfig(BaseModel):
    custom_field: str = Field(..., min_length=1)
    threshold: float = Field(0.5, ge=0.0, le=1.0)

  3. Add to validators map:

validators: dict[AIModelEnum, tuple[type[BaseModel], dict[str, str]]] = {
    # ... existing models ...
    AIModelEnum.NEW_MODEL: (NewModelConfig, {"custom_field": "custom_field"}),
}

  4. Add default configuration:

# backend/services/prompt_service.py
DEFAULT_CONFIGS: dict[str, dict[str, Any]] = {
    # ... existing models ...
    AIModelEnum.NEW_MODEL.value: {
        "custom_field": "default value",
        "threshold": 0.5,
    },
}

  5. Add frontend editor:

// frontend/src/components/ai/PromptPlayground.tsx
const MODEL_CONFIGS: ModelConfig[] = [
  // ... existing models ...
  {
    name: 'new_model',
    displayName: 'New Model',
    description: 'Description of the model',
  },
];

// Add case in renderModelEditor()
case 'new_model':
  return (
    <div className="space-y-4">
      {/* Custom editor UI */}
    </div>
  );

  6. Add migration:

alembic revision --autogenerate -m "add new_model enum value"

Creating Prompt Templates

Effective prompts should:

  1. Be specific: Define exact output format
  2. Include context variables: Use {detections}, {time_context}, etc.
  3. Handle edge cases: Instructions for ambiguous situations
  4. Define risk levels: Clear criteria for each level

Example template structure:

## Role
You are a security analysis AI for a home monitoring system.

## Input Context
- Detection data: {detections}
- Time context: {time_context}
- Weather: {weather}
- Cross-camera data: {cross_camera_data}

## Analysis Instructions
1. Analyze all detections in context
2. Consider time of day and normal patterns
3. Evaluate for security concerns

## Output Format
Respond with valid JSON:
{
  "risk_score": 0-100,
  "risk_level": "low|medium|high|critical",
  "reasoning": "explanation",
  "recommended_actions": ["action1", "action2"]
}

## Risk Level Criteria
- LOW (0-29): Normal activity
- MEDIUM (30-59): Unusual but not threatening
- HIGH (60-84): Concerning activity
- CRITICAL (85-100): Potential threat
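
Because the template mixes {variable} placeholders with literal JSON braces in the output-format block, a plain str.format call would misread the braces. A hedged approach is targeted replacement of the known variables:

def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Substitute {name} placeholders without disturbing literal braces elsewhere."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace("{" + name + "}", value)
    return rendered

prompt = render_prompt(
    "Detection data: {detections}\nTime context: {time_context}",
    {"detections": "person at front door", "time_context": "02:14, outside normal hours"},
)
print(prompt)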

Metrics and Monitoring

Prometheus Metrics

Prompt-related metrics are exposed at /api/metrics:

# Recorded on each prompt execution
record_prompt_latency(version: str, latency_seconds: float)

# Recorded on shadow mode comparison
record_shadow_comparison(model: str)

# Recorded on rollback
record_prompt_rollback(model: str, reason: str)
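
These helpers are project functions; underneath, they would typically wrap prometheus_client instruments along these lines. The metric names below are illustrative, not the names actually exported at /api/metrics.

from prometheus_client import Counter, Histogram

PROMPT_LATENCY = Histogram(
    "prompt_execution_latency_seconds",   # illustrative name
    "Latency of prompt executions by prompt version",
    labelnames=["version"],
)
PROMPT_ROLLBACKS = Counter(
    "prompt_rollbacks_total",              # illustrative name
    "Automatic prompt rollbacks by model and reason",
    labelnames=["model", "reason"],
)

def record_prompt_latency(version: str, latency_seconds: float) -> None:
    PROMPT_LATENCY.labels(version=version).observe(latency_seconds)

def record_prompt_rollback(model: str, reason: str) -> None:
    PROMPT_ROLLBACKS.labels(model=model, reason=reason).inc()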

Monitoring Recommendations

  1. Latency: Alert if prompt execution time increases significantly
  2. Consistency: Track score variance across similar events
  3. Rollback rate: Monitor automatic rollback frequency
  4. A/B test duration: Ensure tests run long enough for statistical validity

Testing

Unit Tests

Backend:

# Run prompt service tests
uv run pytest backend/tests/unit/services/test_prompt_service.py -v
uv run pytest backend/tests/unit/services/test_prompt_ab_testing.py -v
uv run pytest backend/tests/unit/services/test_prompt_version_service.py -v

# Run API route tests
uv run pytest backend/tests/unit/api/routes/test_prompt_management.py -v
uv run pytest backend/tests/unit/api/routes/test_prompt_config.py -v

Frontend:

cd frontend
npm test -- --grep "PromptPlayground"
npm test -- --grep "promptDiff"
npm test -- --grep "abTestService"

Test Coverage

Key test scenarios:

  • Version creation and history
  • Optimistic locking conflicts
  • A/B test traffic splitting
  • Shadow mode comparison
  • Automatic rollback triggers
  • Import/export validation
  • Diff generation accuracy


Source Files

| File | Description |
|---|---|
| backend/models/prompt_version.py | Database model |
| backend/api/routes/prompt_management.py | API routes |
| backend/api/schemas/prompt_management.py | Pydantic schemas |
| backend/services/prompt_service.py | Service layer |
| frontend/src/components/ai/PromptPlayground.tsx | Main UI component |
| frontend/src/components/ai/PromptABTest.tsx | A/B test component |
| frontend/src/services/abTestService.ts | A/B test service |
| frontend/src/utils/promptDiff.ts | Diff utilities |

Back to Developer Hub