Grafana Dashboards¶
Pre-configured Grafana dashboards for monitoring pipeline health, AI service performance, and system observability.
Key Files:
- monitoring/grafana/dashboards/consolidated.json - Main operations dashboard
- monitoring/grafana/dashboards/tracing.json - Distributed tracing dashboard
- monitoring/grafana/dashboards/logs.json - Log aggregation dashboard
- monitoring/grafana/provisioning/dashboards/dashboard.yml - Dashboard provisioning
- monitoring/grafana/provisioning/datasources/prometheus.yml - Datasource configuration
Overview¶
Grafana provides unified visualization across all observability data types: Prometheus metrics, Loki logs, Jaeger traces, and Pyroscope profiles. Dashboards are provisioned automatically via configuration files, ensuring consistent deployment across environments.
The system includes three primary dashboards: an operations dashboard consolidating all metrics, a distributed tracing dashboard for request flow analysis, and a logs dashboard for centralized log viewing with error pattern detection.
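Datasources are configured with cross-correlation features (trace-to-metrics queries on Jaeger, a TraceID derived field on Loki, and trace-to-profiles linking on Pyroscope), enabling navigation from metrics to traces to logs.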
Architecture¶
graph TD
subgraph "Datasources"
PROM[Prometheus<br/>prometheus.yml:6-16]
LOKI[Loki<br/>prometheus.yml:198-214]
JAEG[Jaeger<br/>prometheus.yml:41-101]
PYRO[Pyroscope<br/>prometheus.yml:216-231]
API[Backend API<br/>prometheus.yml:30-40]
end
subgraph "Dashboards"
CONS[Consolidated<br/>monitoring/grafana/dashboards/consolidated.json]
TRAC[Tracing<br/>monitoring/grafana/dashboards/tracing.json]
LOGS[Logs<br/>monitoring/grafana/dashboards/logs.json]
end
subgraph "Panels"
STAT[Stat Panels]
GRAPH[Time Series]
TABLE[Tables]
PIE[Pie Charts]
GAUGE[Gauges]
end
PROM --> CONS
PROM --> TRAC
LOKI --> LOGS
JAEG --> TRAC
PYRO --> CONS
API --> CONS
CONS --> STAT
CONS --> GRAPH
CONS --> GAUGE
LOGS --> TABLE
LOGS --> PIE
TRAC --> TABLE
Datasource Configuration¶
Prometheus¶
Primary metrics datasource (monitoring/grafana/provisioning/datasources/prometheus.yml:6-16):
- name: Prometheus
  uid: PBFA97CFB590B2093
  type: prometheus
  access: proxy
  url: http://prometheus:9090
  isDefault: true
  jsonData:
    timeInterval: '15s'
    httpMethod: POST
Jaeger with Trace-to-Metrics¶
Distributed tracing with metric correlation (monitoring/grafana/provisioning/datasources/prometheus.yml:41-101):
- name: Jaeger
  uid: PC9A941E8F2E49454
  type: jaeger
  url: http://jaeger:16686
  jsonData:
    nodeGraph:
      enabled: true
    tracesToMetrics:
      datasourceUid: PBFA97CFB590B2093
      spanStartTimeShift: '-5m'
      spanEndTimeShift: '5m'
      queries:
        - name: 'Pipeline Errors/min'
          query: 'rate(hsi_pipeline_errors_total[1m]) * 60'
        - name: 'Detection Queue Depth'
          query: 'hsi_detection_queue_depth'
        - name: 'YOLO26 Latency (p95)'
          query: 'histogram_quantile(0.95, rate(yolo26_inference_latency_seconds_bucket[5m]))'
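The two time shifts widen the metrics query range around a span so that the linked panels show context before and after the request. A minimal Python sketch of that arithmetic follows; the function name `metrics_window` and the sample span are hypothetical, and this illustrates the idea rather than Grafana's own code.

```python
from datetime import datetime, timedelta, timezone

# Shifts copied from the tracesToMetrics block above.
SPAN_START_SHIFT = timedelta(minutes=-5)
SPAN_END_SHIFT = timedelta(minutes=5)


def metrics_window(span_start, span_end):
    """Return the (start, end) range used for the linked metrics query."""
    return span_start + SPAN_START_SHIFT, span_end + SPAN_END_SHIFT


# Hypothetical span: starts at 12:00:00 UTC and lasts 42 seconds.
span_start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
span_end = span_start + timedelta(seconds=42)

window_start, window_end = metrics_window(span_start, span_end)
print(window_start.isoformat(), window_end.isoformat())
```

With these settings a 42-second span produces a query range of 10 minutes and 42 seconds, which is wide enough for the `[5m]` rate window in the latency query to be populated.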
Loki with Trace Correlation¶
Log aggregation with trace linking (monitoring/grafana/provisioning/datasources/prometheus.yml:198-214):
- name: Loki
  uid: loki
  type: loki
  url: http://loki:3100
  jsonData:
    maxLines: 1000
    derivedFields:
      - name: TraceID
        matcherRegex: 'trace_id=([a-f0-9]{32})'
        url: '${__value.raw}'
        datasourceUid: PC9A941E8F2E49454
        urlDisplayLabel: 'View Trace'
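To see which log lines would receive a "View Trace" link, the `matcherRegex` can be tried against sample lines. The sketch below uses Python's `re` module as a stand-in for Grafana's matcher; the log lines are made up for illustration.

```python
import re

# Pattern copied from the derivedFields entry above.
TRACE_ID_PATTERN = re.compile(r"trace_id=([a-f0-9]{32})")


def extract_trace_id(log_line):
    """Return the 32-character lowercase hex trace ID, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical log lines for illustration only.
good = "level=ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg=timeout"
bad = "level=ERROR trace_id=4BF92F35 msg=timeout"

print(extract_trace_id(good))
print(extract_trace_id(bad))
```

Because the character class is `[a-f0-9]` and the length is fixed at 32, uppercase or shortened IDs produce no link. Log formatters therefore need to emit the full lowercase hex form.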
Pyroscope for Profiling¶
Continuous profiling with trace correlation (monitoring/grafana/provisioning/datasources/prometheus.yml:216-231):
- name: Pyroscope
  uid: pyroscope
  type: grafana-pyroscope-datasource
  url: http://pyroscope:4040
  jsonData:
    tracesToProfiles:
      datasourceUid: PC9A941E8F2E49454
      tags:
        - key: service.name
          value: service
      profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
Consolidated Operations Dashboard¶
The main dashboard (monitoring/grafana/dashboards/consolidated.json) provides system-wide visibility.
Executive Summary Row¶
| Panel | Metric | Thresholds |
|---|---|---|
| GPU Utilization | hsi_gpu_utilization | Green <70%, Yellow 70-90%, Red >90% |
| Inference FPS | hsi_inference_fps | Red <0.1, Yellow 0.1-0.5, Green >0.5 |
| Detection Queue | hsi_detection_queue_depth | Green <10, Yellow 10-50, Red >50 |
| Pipeline P95 Latency | hsi_detect_latency_p95_ms / 1000 | Green <30s, Yellow 30-60s, Red >60s |
| GPU Temperature | hsi_gpu_temperature | Green <70C, Yellow 70-85C, Red >85C |
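Thresholds like those in the table are typically stored in a panel's field configuration as an ordered list of steps, each giving the lower bound at which a color starts. The Python sketch below builds such a block for the GPU Utilization row and shows how a value maps to a color. The helper names are hypothetical, and treating the boundary values 70 and 90 as belonging to the higher band is an assumption of this sketch.

```python
import json


def threshold_steps(yellow_from, red_from):
    """Build an absolute-mode threshold block: green base, then yellow, then red."""
    return {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},  # base step covers everything below
            {"color": "yellow", "value": yellow_from},
            {"color": "red", "value": red_from},
        ],
    }


def color_for(value, thresholds):
    """Pick the color of the last step whose lower bound is <= value."""
    color = thresholds["steps"][0]["color"]
    for step in thresholds["steps"][1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


gpu_utilization = threshold_steps(70, 90)
print(json.dumps(gpu_utilization["steps"]))
print(color_for(65, gpu_utilization), color_for(70, gpu_utilization), color_for(95, gpu_utilization))
```

For a metric where low values are bad, such as Inference FPS, the same structure applies with the colors reversed: a red base step, yellow from 0.1, and green from 0.5.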
Example PromQL (monitoring/grafana/dashboards/consolidated.json:98-99):
GPU Memory Gauge¶
Displays GPU VRAM usage (monitoring/grafana/dashboards/consolidated.json:276-362):
Thresholds:
- Green: <80%
- Yellow: 80-95%
- Red: >95%
Event Rate Panel¶
Time series of event creation rate (monitoring/grafana/dashboards/consolidated.json):
Shows events per minute with sparkline.
Detection Throughput Panel¶
Detection processing rate:
Risk Score Distribution¶
Histogram of LLM-assigned risk scores:
Cache Hit Rate Panel¶
Cache effectiveness visualization:
sum(rate(hsi_cache_hits_total[5m])) / (sum(rate(hsi_cache_hits_total[5m])) + sum(rate(hsi_cache_misses_total[5m])))
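The query above divides the hit rate by the combined hit and miss rate. The same arithmetic in Python, with made-up per-second rates, makes the zero-traffic edge case explicit; the function name is hypothetical.

```python
def cache_hit_rate(hits_per_sec, misses_per_sec):
    """Return hits / (hits + misses), or None when there was no traffic."""
    total = hits_per_sec + misses_per_sec
    if total == 0:
        return None  # the PromQL expression evaluates 0/0, which is NaN
    return hits_per_sec / total


# Hypothetical rates: 45 hits/s and 5 misses/s.
print(cache_hit_rate(45.0, 5.0))
print(cache_hit_rate(0.0, 0.0))
```

An idle cache is therefore shown as "no data" or NaN rather than 0%, which is worth keeping in mind when setting thresholds on this panel.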
Worker Status Panel¶
Pipeline worker states:
Tracing Dashboard¶
The tracing dashboard (monitoring/grafana/dashboards/tracing.json:1-332) provides distributed trace exploration.
Pipeline Analysis Traces¶
Full pipeline traces with duration thresholds (monitoring/grafana/dashboards/tracing.json:40-77):
| Threshold | Duration | Color |
|---|---|---|
| Normal | <60s | Green |
| Warning | 60-180s | Yellow |
| Critical | >180s | Red |
Query configuration:
{
"datasource": { "type": "jaeger", "uid": "PC9A941E8F2E49454" },
"queryType": "search",
"service": "nemotron-backend",
"operation": "analysis_processing",
"limit": 15
}
Detection Processing Panel¶
YOLO26 detection traces (monitoring/grafana/dashboards/tracing.json:86-143):
| Threshold | Duration | Meaning |
|---|---|---|
| Green | <5s | Normal detection time |
| Yellow | 5-30s | Slow detection |
| Red | >30s | Detection timeout risk |
LLM Inference Panel¶
Nemotron LLM traces (monitoring/grafana/dashboards/tracing.json:144-200):
| Threshold | Duration | Meaning |
|---|---|---|
| Green | <30s | Normal inference |
| Yellow | 30-120s | Slow inference |
| Red | >120s | Inference timeout risk |
Error Traces Panel¶
Traces with error tags (monitoring/grafana/dashboards/tracing.json:210-253):
All Recent Traces Panel¶
General trace view (monitoring/grafana/dashboards/tracing.json:262-318):
Logs Dashboard¶
The logs dashboard (monitoring/grafana/dashboards/logs.json) provides centralized log analysis.
Error Rate Stat¶
Error percentage from logs (monitoring/grafana/dashboards/logs.json:67-73):
sum(count_over_time({container=~"$service", level=~"ERROR|CRITICAL"} [5m])) / (sum(count_over_time({container=~"$service"} [5m])) > 0)
Thresholds:
- Green: <1%
- Yellow: 1-5%
- Red: >5%
Log Throughput Stat¶
Log entries per second (monitoring/grafana/dashboards/logs.json:112-118):
Log Volume by Level¶
Stacked bar chart by log level (monitoring/grafana/dashboards/logs.json:173-248):
sum by (level) (count_over_time({container=~"$service", level=~"$level"} |~ "$search" [$__interval]))
Color mapping:
- DEBUG: Blue
- INFO: Green
- WARNING: Yellow
- ERROR: Orange
- CRITICAL: Red
Level Distribution Pie Chart¶
Percentage breakdown (monitoring/grafana/dashboards/logs.json:250-311):
Top Error Patterns Table¶
Most common errors (monitoring/grafana/dashboards/logs.json:358-365):
topk(10, sum by (level, container) (count_over_time({container=~"$service", level=~"ERROR|CRITICAL"} [15m])))
Live Log Stream Panel¶
Real-time log viewing (monitoring/grafana/dashboards/logs.json):
Dashboard Variables¶
Service Variable¶
Filter by container/service (monitoring/grafana/dashboards/logs.json):
| Variable | Type | Values |
|---|---|---|
| $service | Custom | backend\|ai-yolo26\|ai-llm\|ai-florence\|redis\|postgres |
Level Variable¶
Filter by log level:
| Variable | Type | Values |
|---|---|---|
| $level | Custom | DEBUG\|INFO\|WARNING\|ERROR\|CRITICAL |
Search Variable¶
Free-text log search:
| Variable | Type | Default |
|---|---|---|
| $search | Textbox | "" |
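These three variables are substituted into the LogQL queries before they are sent to Loki. The Python sketch below approximates that substitution for the Log Volume by Level query. It is plain string replacement, not Grafana's interpolation engine, and the selected values and the `1m` interval are assumptions chosen for illustration.

```python
def render_query(template, variables):
    """Substitute $name placeholders, longest names first to avoid prefix clashes."""
    rendered = template
    for name in sorted(variables, key=len, reverse=True):
        rendered = rendered.replace("$" + name, variables[name])
    return rendered


# Query copied from the Log Volume by Level panel.
TEMPLATE = (
    'sum by (level) (count_over_time({container=~"$service", level=~"$level"} '
    '|~ "$search" [$__interval]))'
)

query = render_query(TEMPLATE, {
    "service": "backend|ai-llm",  # two services selected, joined for a regex matcher
    "level": "ERROR|CRITICAL",
    "search": "",                 # empty textbox: the line filter matches every line
    "__interval": "1m",           # normally computed by Grafana from the time range
})
print(query)
```

Because the selectors use `=~`, multiple selections can be joined with `|`, and the empty default for `$search` leaves the result set unfiltered.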
Dashboard Provisioning¶
Dashboards are automatically loaded (monitoring/grafana/provisioning/dashboards/dashboard.yml):
apiVersion: 1
providers:
  - name: 'HSI Dashboards'
    orgId: 1
    folder: 'Home Security Intelligence'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
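Since the file provider rescans the dashboard directory every 10 seconds, a malformed JSON file can surface quickly as a missing dashboard. A small pre-deployment check can catch this earlier. The Python sketch below is one possible approach: the `check_dashboards` helper is hypothetical, and requiring both `uid` and `title` is this sketch's own convention rather than a documented Grafana rule.

```python
import json
import tempfile
from pathlib import Path


def check_dashboards(directory):
    """Return {filename: problem} for dashboard files that look unloadable."""
    problems = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[path.name] = f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(dashboard, dict):
            problems[path.name] = "top level is not a JSON object"
            continue
        missing = [key for key in ("uid", "title") if not dashboard.get(key)]
        if missing:
            problems[path.name] = "missing " + ", ".join(missing)
    return problems


# Demo against a throwaway directory with hypothetical dashboard files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.json").write_text(json.dumps({"uid": "ops", "title": "Operations"}))
    Path(tmp, "no_uid.json").write_text(json.dumps({"title": "Logs"}))
    Path(tmp, "broken.json").write_text("{not json")
    problems = check_dashboards(tmp)
print(problems)
```

In this repository the function would be pointed at monitoring/grafana/dashboards, for example from a CI step.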
Alert Rules¶
Grafana alert rules are provisioned in monitoring/grafana/provisioning/alerting/log-alerts.yml:
High Error Rate Alert¶
- alert: HighErrorRate
  expr: |
    sum(count_over_time({container=~"backend|ai-.*"} |~ "ERROR|CRITICAL" [5m]))
    / sum(count_over_time({container=~"backend|ai-.*"} [5m])) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'High error rate in logs'
    description: 'Error rate exceeds 5% for 5 minutes'
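The `for: 5m` clause means a brief spike above 5% does not fire the alert; the condition has to hold across consecutive evaluations for the full duration. The Python sketch below simulates that pending-to-firing behavior with one made-up sample per minute. It is a simplified model, not the rule evaluator itself.

```python
def alert_states(samples, threshold=0.05, hold_seconds=300):
    """Yield (timestamp, state) for (timestamp, error_ratio) samples in time order."""
    breach_started = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if breach_started is None:
                breach_started = timestamp
            held = timestamp - breach_started
            yield timestamp, "firing" if held >= hold_seconds else "pending"
        else:
            breach_started = None  # any recovery resets the timer
            yield timestamp, "inactive"


# One sample per minute: a short spike, a recovery, then a sustained breach.
ratios = [0.01, 0.08, 0.02, 0.06, 0.07, 0.09, 0.06, 0.08, 0.07]
samples = [(i * 60, ratio) for i, ratio in enumerate(ratios)]
states = [state for _, state in alert_states(samples)]
print(states)
```

The spike at minute 1 only reaches "pending" and is reset by the recovery at minute 2. The breach that starts at minute 3 reaches "firing" at minute 8, once it has held for 300 seconds.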
No Logs Alert¶
- alert: NoLogsReceived
  expr: |
    sum(count_over_time({container="backend"} [5m])) == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: 'No logs received from backend'
PromQL Query Examples¶
SLI Queries¶
# API Availability (from recording rules)
hsi:api_availability:ratio_rate1h
# Detection P95 Latency
hsi:detection_latency:p95_5m
# Error Budget Remaining
hsi:error_budget:api_availability_remaining * 100
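The recording rule definitions are not shown here, so the exact formula behind `hsi:error_budget:api_availability_remaining` is unknown. One common definition of remaining error budget is sketched below in Python; the 99.5% SLO target and the 99.8% measured availability are assumptions for illustration only.

```python
def error_budget_remaining(availability, slo_target):
    """Fraction of the error budget left: 1 - (observed errors / allowed errors)."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return 1.0 - (1.0 - availability) / allowed


# Assumed values: 99.5% SLO target, 99.8% measured availability.
remaining = error_budget_remaining(0.998, 0.995)
print(f"{remaining * 100:.1f}% of the error budget remains")
```

Under this definition the value goes negative once availability drops below the target, which is a useful signal that the budget is exhausted.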
Infrastructure Queries¶
# GPU Memory Pressure
hsi_gpu_memory_used_mb / hsi_gpu_memory_total_mb > 0.9
# Worker Pool Utilization
hsi_worker_busy_count / hsi_worker_active_count
# Queue Backpressure
hsi_detection_queue_depth > 100 or hsi_analysis_queue_depth > 50
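Queries like these can also be run outside Grafana against Prometheus's instant-query endpoint, `/api/v1/query`, which is handy when debugging a panel. The Python sketch below only builds the request URL. The base URL is the datasource address from the provisioning file and is assumed to resolve only from inside the same container network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PROMETHEUS_URL = "http://prometheus:9090"  # datasource URL from the provisioning file


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build a GET URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


expr = "hsi_gpu_memory_used_mb / hsi_gpu_memory_total_mb > 0.9"
url = instant_query_url(expr)
print(url)

# Round-trip check: decoding the query string returns the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == expr)
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client. URL-encoding matters here because PromQL operators such as `/` and `>` are not safe in a raw query string.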
AI Service Queries¶
# LLM Token Throughput
rate(hsi_nemotron_tokens_input_total[5m]) + rate(hsi_nemotron_tokens_output_total[5m])
# Enrichment Model Error Rate
sum by (model) (rate(hsi_enrichment_model_errors_total[5m])) / sum by (model) (rate(hsi_enrichment_model_calls_total[5m]))
# Detection Confidence Distribution
histogram_quantile(0.95, rate(hsi_detection_confidence_bucket[5m]))
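`histogram_quantile` estimates a quantile from cumulative bucket counts by locating the bucket that contains the target rank and interpolating linearly inside it. The Python sketch below reproduces that idea for classic histograms in simplified form; the bucket bounds and counts are made up, and in the real query the counts would be per-second rates.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # rank falls in +Inf bucket: report highest finite bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative confidence buckets (upper bound, count).
confidence_buckets = [(0.5, 10), (0.7, 40), (0.9, 90), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, confidence_buckets)
print(round(p95, 4))
```

The estimate can only be as precise as the bucket layout allows: here any rank between the 90th and 100th observation lands somewhere between 0.9 and 1.0, so finer buckets near the values of interest give tighter results.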
Best Practices¶
Panel Design¶
- Use appropriate visualization types: Stats for current values, time series for trends, tables for detailed data
- Set meaningful thresholds: Based on SLOs and operational experience
- Include units: Percent, seconds, bytes, etc.
- Limit queries: Avoid heavy aggregations; use recording rules
Query Optimization¶
- Use recording rules for frequently-used aggregations
- Limit time ranges to what's necessary
- Avoid regex where possible
- Use rate() over increase() for better resolution
Dashboard Organization¶
- Executive summary first: Key metrics at top
- Progressive detail: General to specific as you scroll
- Related panels grouped: Use rows and collapsible sections
- Consistent time ranges: Align panel refresh intervals
Related Documents¶
- Prometheus Metrics - Metric definitions
- Distributed Tracing - Trace data sources
- Structured Logging - Log format for Loki
- Alertmanager - Alert routing from Grafana