Distributed Tracing¶

Distributed tracing dashboard for tracking requests through the AI pipeline.
What You're Looking At¶
The Distributed Tracing page provides end-to-end visibility into how requests flow through the security monitoring pipeline. Powered by Jaeger and visualized through Grafana, this page helps you understand latency, identify bottlenecks, and debug issues across services.
Layout Overview¶
```
+------------------------------------------------------------+
| HEADER: Activity Icon | "Distributed Tracing" | Buttons    |
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  |                                                      |  |
|  |                GRAFANA DASHBOARD EMBED               |  |
|  |                                                      |  |
|  |  +------------+  +------------+  +------------+      |  |
|  |  |  Service   |  | Operation  |  |    Time    |      |  |
|  |  |   Filter   |  |   Filter   |  |   Range    |      |  |
|  |  +------------+  +------------+  +------------+      |  |
|  |                                                      |  |
|  |  +------------------------------------------+        |  |
|  |  |                                          |        |  |
|  |  |          TRACE LIST / TIMELINE           |        |  |
|  |  |                                          |        |  |
|  |  +------------------------------------------+        |  |
|  |                                                      |  |
|  |  +------------------------------------------+        |  |
|  |  |                                          |        |  |
|  |  |            TRACE DETAIL VIEW             |        |  |
|  |  |                                          |        |  |
|  |  +------------------------------------------+        |  |
|  |                                                      |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+
```
The page embeds the HSI Distributed Tracing dashboard from Grafana, which provides:
- Service Filter - Select which service(s) to view traces from
- Operation Filter - Filter by specific operations (e.g., /api/events, detect)
- Time Range - Select the time period to analyze
- Trace List - Searchable list of traces with duration and status
- Trace Detail - Detailed span breakdown for selected traces
Key Components¶
Header Controls¶
| Button | Function |
|---|---|
| Open in Grafana | Opens the full Grafana dashboard in a new tab for advanced features |
| Open Jaeger | Opens the native Jaeger UI at localhost:16686 |
| Refresh | Reloads the embedded dashboard |
Understanding Traces¶
A trace represents a single request as it flows through the system. Each trace contains multiple spans:
Trace: "Process Detection"
├── Span: "receive_image" (backend) - 5ms
├── Span: "detect_objects" (yolo26) - 150ms
│ ├── Span: "preprocess" - 10ms
│ ├── Span: "inference" - 130ms
│ └── Span: "postprocess" - 10ms
├── Span: "batch_detection" (backend) - 2ms
└── Span: "analyze_batch" (nemotron) - 800ms
├── Span: "build_prompt" - 5ms
├── Span: "llm_inference" - 790ms
└── Span: "parse_response" - 5ms
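The nesting shown above comes from starting spans while a parent span is active. A minimal sketch of how the yolo26 service might produce the detect_objects subtree (the helper functions are illustrative, not the project's actual code):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def detect_objects(image):
    # Each nested "with" block starts a child of the enclosing span.
    with tracer.start_as_current_span("detect_objects"):
        with tracer.start_as_current_span("preprocess"):
            tensor = preprocess(image)       # hypothetical helper
        with tracer.start_as_current_span("inference"):
            raw = run_model(tensor)          # hypothetical helper
        with tracer.start_as_current_span("postprocess"):
            return postprocess(raw)          # hypothetical helper
```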
Trace Timeline¶
The timeline view shows spans as horizontal bars:
| Element | Meaning |
|---|---|
| Bar width | Duration of the span |
| Bar position | When the span started relative to trace start |
| Bar color | Service that executed the span |
| Nesting | Parent-child relationships between spans |
Timeline Patterns:
| Pattern | Meaning |
|---|---|
| Sequential bars | Operations happening one after another |
| Overlapping bars | Concurrent/parallel operations |
| Long gaps | Time waiting (network, queue, etc.) |
| One very long bar | Bottleneck in that operation |
Span Details¶
Click a span to see detailed information:
| Field | Description |
|---|---|
| Service | Which service executed this span |
| Operation | The operation name (e.g., detect, analyze) |
| Duration | How long the span took |
| Start Time | Absolute timestamp |
| Tags | Key-value metadata (e.g., camera.name, model.name) |
| Logs | Events that occurred during the span |
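Tags and logs map to OpenTelemetry span attributes and events. A hedged sketch of how a service might attach them (the attribute keys and values are examples only):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("analyze_batch") as span:
    # Tags -> OpenTelemetry attributes: key-value metadata on the span
    span.set_attribute("camera.name", "front_door")
    span.set_attribute("model.name", "nemotron")
    # Logs -> OpenTelemetry events: timestamped records within the span
    span.add_event("prompt_built", {"prompt.tokens": 412})
```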
Service Identification¶
Traces are tagged with service names:
| Service | Description | Typical Operations |
|---|---|---|
| hsi-backend | FastAPI backend | API requests, batch processing |
| hsi-yolo26 | YOLO26 detector | Object detection, image processing |
| hsi-nemotron | Nemotron LLM | Risk analysis, prompt processing |
| hsi-frontend | React frontend | User interactions (if instrumented) |
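The service name on each span comes from the OpenTelemetry resource configured when the process starts. A minimal sketch, assuming the standard OpenTelemetry SDK setup (the actual wiring lives in backend/core/telemetry.py):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span emitted by this process carries this service name,
# which is what Jaeger and Grafana group traces by.
resource = Resource.create({SERVICE_NAME: "hsi-backend"})
trace.set_tracer_provider(TracerProvider(resource=resource))
```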
Understanding the AI Pipeline¶
Detection Flow¶
A typical detection flows through the system like this:
```mermaid
sequenceDiagram
    participant C as Camera
    participant B as Backend
    participant R as YOLO26
    participant N as Nemotron
    participant DB as Database

    C->>B: Upload Image
    activate B
    Note over B: Span: receive_image
    B->>R: Detect Objects
    activate R
    Note over R: Span: detect_objects
    R-->>B: Detections
    deactivate R
    B->>B: Add to Batch
    Note over B: Span: batch_detection
    B->>N: Analyze Batch
    activate N
    Note over N: Span: analyze_batch
    N-->>B: Risk Assessment
    deactivate N
    B->>DB: Save Event
    Note over B: Span: save_event
    deactivate B
```
Latency Breakdown¶
Typical latency distribution for a single detection:
| Stage | Typical Duration | Notes |
|---|---|---|
| Image Upload | 10-50ms | Network dependent |
| Object Detection | 100-200ms | GPU dependent |
| Batch Aggregation | 0-90s | Waits for batch window |
| LLM Analysis | 500-2000ms | Model dependent |
| Database Save | 5-20ms | Disk I/O |
Finding Issues¶
Slow Requests¶
To find slow requests:
- Set a time range covering the issue period
- Sort traces by duration (descending)
- Click on the slowest traces
- Look for spans with unusually long durations
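The same search can be scripted against Jaeger's HTTP query API (the endpoint its own UI uses; it is not a stable public contract, and the service name and port below are assumptions based on the defaults described on this page):

```python
import requests

# Pull the last hour of hsi-backend traces and rank them by duration.
resp = requests.get(
    "http://localhost:16686/api/traces",
    params={"service": "hsi-backend", "lookback": "1h", "limit": 100},
    timeout=10,
)
traces = resp.json().get("data", [])

def trace_duration_us(t):
    # A trace's duration spans its earliest span start to its latest span end.
    starts = [s["startTime"] for s in t["spans"]]
    ends = [s["startTime"] + s["duration"] for s in t["spans"]]
    return max(ends) - min(starts)

for t in sorted(traces, key=trace_duration_us, reverse=True)[:5]:
    print(t["traceID"], f"{trace_duration_us(t) / 1000:.1f} ms")
```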
Common Bottlenecks:
| Long Span | Likely Cause | Solution |
|---|---|---|
| llm_inference | LLM processing | Normal for complex analyses |
| detect_objects | GPU saturation | Check GPU utilization |
| db_query | Database performance | Add indexes, optimize queries |
| http_request | Network latency | Check connectivity |
Error Traces¶
Traces with errors are typically highlighted in red or orange:
- Filter by the error=true tag
- Look at span logs for error messages
- Check the stack trace in span details
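On the instrumentation side, the error tag and the exception details in span logs usually come from OpenTelemetry's standard error-recording calls. A minimal sketch (detector.detect is an illustrative call, mirroring the example later on this page):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("detect_objects") as span:
    try:
        result = detector.detect(image)            # illustrative call
    except Exception as exc:
        span.record_exception(exc)                 # appears in the span's logs
        span.set_status(Status(StatusCode.ERROR))  # marks the span as errored
        raise
```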
Common Error Patterns:
| Error | Location | Common Cause |
|---|---|---|
| timeout | yolo26 spans | GPU overloaded |
| connection_refused | backend spans | Service down |
| out_of_memory | nemotron spans | Model too large for GPU |
| validation_error | API spans | Invalid request data |
Missing Spans¶
If traces seem incomplete:
- Check if all services are instrumented
- Verify trace context is propagated between services
- Check if sampling is dropping traces
Correlation Features¶
Trace to Logs¶
Click "View Logs" in a span to see correlated log entries:
- Logs are filtered to the span's time range
- Log entries include the trace ID for cross-reference
- Use this to see detailed debug output during the span
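The cross-reference works because log lines carry the active trace ID. A hedged sketch of one way a service could attach it (the logger name and extra field are illustrative; the project's actual log wiring may differ):

```python
import logging

from opentelemetry import trace

logger = logging.getLogger("hsi.backend")  # illustrative logger name

def log_with_trace_id(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    # trace_id is a 128-bit integer; render it as the 32-hex-digit form Jaeger shows.
    logger.info(message, extra={"trace_id": format(ctx.trace_id, "032x")})
```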
Trace to Profiling¶
Correlate traces with continuous profiling:
- Note the time range of a slow trace
- Open the Profiling page
- Select the same time range
- See which code paths consumed resources during that trace
Trace to Events¶
Security events include trace IDs:
- Find an event in the Timeline
- Note the event timestamp
- Search for traces in that time window
- Find the trace that created that event
Settings & Configuration¶
Grafana URL¶
The Grafana URL is automatically configured from the backend. If the embedded dashboard fails to load:
- Verify Grafana is running
- Check the grafana_url config setting
- Verify network connectivity
Jaeger Configuration¶
Jaeger must be configured as a data source in Grafana:
```yaml
# Grafana provisioning (monitoring/grafana/provisioning/datasources/prometheus.yml)
- name: Jaeger
  type: jaeger
  url: http://jaeger:16686
  access: proxy
```
Sampling¶
Trace sampling is configured per-service:
| Setting | Default | Description |
|---|---|---|
| Sample Rate | 1.0 | Fraction of traces to keep (1.0 = 100%) |
| Rate Limiting | None | Max traces per second |
For production, consider reducing sampling to 10-20% to reduce storage.
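As a sketch of what a reduced sample rate looks like with the OpenTelemetry Python SDK (the real sampler configuration lives in backend/core/telemetry.py; the 0.1 value is only an example):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of root traces; child spans follow the parent's decision,
# so traces stay complete rather than losing individual spans.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```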
Retention¶
Trace data retention is configured in Jaeger:
| Setting | Default | Description |
|---|---|---|
| Retention Period | 7 days | How long traces are kept |
| Max Traces | Unlimited | Maximum traces to store |
Troubleshooting¶
Dashboard Shows "No Data"¶
- Check Jaeger is running: docker ps | grep jaeger
- Verify services are instrumented: Check OpenTelemetry configuration
- Check time range: Ensure the selected time range has traces
- Verify datasource: Confirm Jaeger is configured in Grafana
Traces are Missing Spans¶
- Check service connectivity: Ensure all services can reach Jaeger
- Verify trace propagation: Check that trace headers are passed between services
- Check sampling: Traces might be sampled out
High Latency in Tracing¶
Tracing overhead should be minimal (<1%), but if you notice impact:
- Reduce the number of spans per trace
- Lower the sample rate (keep fewer traces)
- Use asynchronous span export
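Asynchronous export is what OpenTelemetry's BatchSpanProcessor provides. A minimal sketch, assuming OTLP export to Alloy as shown in the architecture diagram below (the endpoint value is an assumption, not a confirmed setting):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# BatchSpanProcessor queues spans and exports them from a background thread,
# so exporting never blocks the request path.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://alloy:4317"))
)
trace.set_tracer_provider(provider)
```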
"Failed to load configuration" Error¶
The frontend couldn't fetch the Grafana URL:
- Verify the backend is running
- Check network connectivity
- The dashboard will fall back to /grafana
Technical Deep Dive¶
Architecture¶
```mermaid
flowchart LR
    subgraph Services["Instrumented Services"]
        B[Backend]
        R[YOLO26]
        N[Nemotron]
    end
    subgraph Collection["Trace Collection"]
        A[Alloy]
        J[Jaeger]
    end
    subgraph Visualization["Visualization"]
        G[Grafana]
        F[Frontend]
    end
    B -->|OTLP| A
    R -->|OTLP| A
    N -->|OTLP| A
    A -->|push| J
    J -->|query| G
    G -->|iframe| F
    style Services fill:#e0f2fe
    style Collection fill:#fef3c7
    style Visualization fill:#dcfce7
```
Related Code¶
Frontend:
- Tracing Page: frontend/src/components/tracing/TracingPage.tsx
- Grafana URL Utility: frontend/src/utils/grafanaUrl.ts
Backend:
- Tracing Configuration: backend/core/telemetry.py
Infrastructure:
- Jaeger Container: docker-compose.prod.yml (jaeger service)
- Grafana Dashboard: monitoring/grafana/dashboards/tracing.json
- Alloy Configuration: monitoring/alloy/config.alloy
OpenTelemetry Integration¶
The system uses OpenTelemetry for distributed tracing:
```python
# Example: Creating a span in Python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("detect_objects") as span:
    span.set_attribute("camera.name", camera_name)
    span.set_attribute("image.size", image_size)
    result = detector.detect(image)
    span.set_attribute("detection.count", len(result))
```
Trace Context Propagation¶
Trace context is propagated between services using W3C Trace Context headers (traceparent and tracestate).
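A minimal sketch of how that propagation looks with the OpenTelemetry Python API (the URL and handler names are illustrative, not the backend's actual client code):

```python
import httpx

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Caller side: inject traceparent/tracestate headers into the outgoing request.
def call_detector(image_bytes: bytes) -> httpx.Response:
    headers: dict = {}
    inject(headers)
    return httpx.post("http://yolo26:8000/detect", content=image_bytes, headers=headers)

# Callee side: extract the incoming context so new spans join the caller's trace.
def handle_detect(request_headers: dict) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("detect_objects", context=ctx):
        ...  # detection work continues the propagated trace
```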
Quick Reference¶
When to Use Tracing¶
| Scenario | What to Look For |
|---|---|
| Slow API response | Long spans in the trace |
| Failed request | Error tags and span logs |
| Intermittent issues | Compare fast vs slow traces |
| Service debugging | Spans from specific service |
Common Actions¶
| I want to... | Do this... |
|---|---|
| Find slow requests | Sort by duration, click longest |
| Find errors | Filter by error=true |
| See request flow | Expand trace to see all spans |
| Debug a specific event | Search by time range of event |
| Compare performance | Select two traces, use compare view |
| Get more details | Open in Jaeger for full UI |