# Profiling Operations Runbook
Operational procedures for managing Pyroscope continuous profiling in production.
## Quick Reference
| Task | Command |
|---|---|
| Check Pyroscope health | curl http://localhost:4040/ready |
| View Pyroscope UI | Open http://localhost:4040 |
| Restart Pyroscope | podman-compose -f docker-compose.prod.yml restart pyroscope |
| View profiler logs (AI svc) | podman exec ai-yolo26 cat /tmp/profiler.log |
| Check backend profiling | podman logs backend 2>&1 \| grep -i pyroscope |
| Disable profiling globally | Set PYROSCOPE_ENABLED=false in .env |
## Automated Regression Alert Response Procedures
This section covers response procedures for automated regression detection alerts (NEM-4133).
### ALERT-REG-001: ServiceCPUSpike / ServiceCPUSpikeCritical
Alert Condition: CPU usage >50% (warning) or >100% (critical) above 24-hour average for 15+ minutes.
Symptoms:
- Service consuming significantly more CPU than historical baseline
- Increased response times
- Higher infrastructure costs
Diagnosis:
# 1. Check current CPU regression ratio
curl -s "http://localhost:9090/api/v1/query?query=job:service_cpu_regression_ratio:5m_vs_24h" | jq '.data.result'
# 2. View CPU profile in Grafana Pyroscope
# Open: http://localhost:3002/d/hsi-profiling
# Select the affected service from dropdown
# 3. Compare current vs baseline flame graphs
# Enable "Comparison" mode in the dashboard
# Look for new hot functions or significantly increased function times
# 4. Check for recent deployments
git log --oneline --since="24 hours ago"
# 5. Check if workload increased
curl -s "http://localhost:9090/api/v1/query?query=rate(hsi_detections_processed_total[1h])" | jq
Resolution:
- If caused by code regression:
  # Identify the problematic commit using flame graph comparison
  # Roll back to previous version if needed
  podman-compose -f docker-compose.prod.yml pull [service]
  podman-compose -f docker-compose.prod.yml up -d [service]
- If caused by increased workload:
  - Scale the service if possible (see the scaling sketch after this list)
  - Implement rate limiting
  - Optimize hot code paths identified in flame graph
- If caused by memory pressure (GC overhead):
  - Check memory alerts alongside CPU
  - Increase memory allocation
  - Investigate memory leaks
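Where scaling is an option, Compose-style tooling can usually run extra replicas. A minimal sketch, assuming your podman-compose version supports the --scale flag (verify before relying on it):
# Run two replicas of the affected service (flag support varies by podman-compose version)
podman-compose -f docker-compose.prod.yml up -d --scale [service]=2
# Confirm the replicas are running
podman ps --filter "name=[service]"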
Escalation: If unresolved after 30 minutes, escalate to on-call engineer.
### ALERT-REG-002: ServiceMemoryGrowth / ServiceMemoryGrowthCritical
Alert Condition: Memory usage >25% (warning) or >50% (critical) above 6-hour average for 30+ minutes.
Symptoms:
- Gradual memory increase over time
- Service restarts due to OOM
- Degraded performance
Diagnosis:
# 1. Check current memory regression ratio
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_regression_ratio:current_vs_6h" | jq '.data.result'
# 2. Check memory growth rate
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:deriv1h" | jq '.data.result'
# 3. Check memory profile in Pyroscope
# Select "Memory Bytes" or "Memory Allocations" profile type
# Look for functions allocating large amounts
# 4. Check container memory limits
podman stats --no-stream [container_name]
# 5. For Python services, check for common leak patterns with tracemalloc
# (tracemalloc must be started inside the service process to be useful;
#  see ALERT-REG-003 for a snippet that can be adapted into the service code)
podman exec [container] python -c "import tracemalloc; tracemalloc.start()"
Resolution:
- If memory leak suspected:
  # Restart service as immediate mitigation
  podman-compose -f docker-compose.prod.yml restart [service]
  # Schedule investigation of leak source
- If caused by caching:
  - Review cache eviction policies
  - Reduce cache size limits
  - Add cache entry TTLs
- If caused by large request buffers:
  - Implement streaming for large responses
  - Add request size limits
### ALERT-REG-003: PotentialMemoryLeak
Alert Condition: Memory projected to double within 24 hours based on current growth rate.
Symptoms:
- Steadily increasing memory usage
- Linear growth pattern visible in monitoring
- No correlation with workload
Diagnosis:
# 1. Check projected memory
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:predicted_24h" | jq '.data.result'
# 2. Check growth rate (bytes/hour)
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:deriv1h" | jq '.data.result'
# 3. Analyze memory allocation profile over time
# In Grafana Pyroscope, compare memory profiles from:
# - 6 hours ago
# - Current
# Look for functions with significantly more allocations
# 4. For Python: enable memory profiling
podman exec [container] python -c "
import tracemalloc
tracemalloc.start()
# ... run suspect code ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)
"
Resolution:
- Immediate mitigation (a non-interactive crontab install sketch follows this list):
  # Set up scheduled restarts until fix is deployed
  # Add to crontab or systemd timer:
  # 0 */4 * * * podman-compose -f docker-compose.prod.yml restart [service]
- Investigation:
  - Use memory profiler to identify leak source
  - Check for unclosed database connections
  - Check for unbounded caches or queues
  - Review recent code changes for retained references
- Long-term fix:
  - Deploy code fix
  - Add memory monitoring to CI/CD pipeline
  - Implement memory pressure alerts
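One way to install that scheduled restart non-interactively (the compose file path and service name are placeholders to adjust):
# Append the scheduled restart to the current user's crontab (path and service are placeholders)
(crontab -l 2>/dev/null; echo '0 */4 * * * podman-compose -f /path/to/docker-compose.prod.yml restart [service]') | crontab -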
### ALERT-REG-004: BackendHighLatency / BackendHighLatencyCritical
Alert Condition: Backend API P99 latency >2s (warning) or >5s (critical) for 10+ minutes.
Symptoms:
- Slow API responses
- UI timeouts
- WebSocket disconnections
Diagnosis:
# 1. Check current latency
curl -s "http://localhost:9090/api/v1/query?query=job:backend_api_latency:p99_5m" | jq '.data.result'
# 2. Check database query latency
curl -s "http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,rate(hsi_db_query_duration_seconds_bucket[5m]))" | jq
# 3. Check Redis latency
curl -s "http://localhost:9090/api/v1/query?query=redis_slowlog_length" | jq
# 4. Check CPU usage (may be contention)
curl -s "http://localhost:9090/api/v1/query?query=job:backend_cpu_seconds:rate5m" | jq
# 5. View backend flame graph for hot paths
# Open http://localhost:3002/d/hsi-profiling
# Select "nemotron-backend" service
Resolution:
- If database is slow:
# Check for long-running queries
podman exec postgres psql -U hsi -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
# Check for missing indexes
podman exec postgres psql -U hsi -c "EXPLAIN ANALYZE [slow_query];"
- If Redis is slow:
# Check slow log
podman exec redis redis-cli slowlog get 10
# Check memory usage
podman exec redis redis-cli info memory
- If CPU contention:
- Scale backend instances
- Optimize hot code paths from flame graph
- Add caching for expensive operations
### ALERT-REG-005: YOLO26LatencyRegression
Alert Condition: YOLO26 inference P95 latency increased >50% compared to 1-hour average.
Symptoms:
- Object detection taking longer
- Real-time detection pipeline backing up
- Detection queue growing
Diagnosis:
# 1. Check YOLO26 latency
curl -s "http://localhost:9090/api/v1/query?query=job:yolo26_inference_latency:p95_5m" | jq
# 2. Check GPU utilization
curl -s "http://localhost:9090/api/v1/query?query=yolo26_gpu_utilization" | jq
# 3. Check GPU temperature (throttling?)
curl -s "http://localhost:9090/api/v1/query?query=yolo26_gpu_temperature" | jq
# 4. Check if model is loaded
curl -s "http://localhost:9090/api/v1/query?query=yolo26_model_loaded" | jq
# 5. View YOLO26 flame graph
# Open http://localhost:3002/d/hsi-profiling
# Select "ai-yolo26" service
Resolution:
- If GPU throttling (see the nvidia-smi sketch after this list):
  - Improve cooling
  - Reduce batch size
  - Lower power limit
- If model not optimally loaded:
- If input resolution changed:
  - Verify input preprocessing
  - Check for larger than expected images
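To confirm throttling at the host level (in addition to the exported GPU metrics), nvidia-smi can report temperature, power draw, and active throttle reasons. A minimal sketch, assuming an NVIDIA GPU with nvidia-smi installed on the host:
# Host-level check for thermal/power throttling (assumes NVIDIA GPU + nvidia-smi)
nvidia-smi -q -d TEMPERATURE,POWER,PERFORMANCE | grep -iE "temp|power draw|throttle|reason"
# Watch temperature, utilization and clocks while inference runs (refresh every 5s)
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,clocks.sm,power.draw --format=csv -l 5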
### ALERT-REG-006: MultiServiceCPURegression
Alert Condition: 2 or more services showing >30% CPU increase simultaneously.
Symptoms:
- System-wide slowdown
- Multiple services affected
- Infrastructure-level issue likely
Diagnosis:
# 1. Check which services are affected
curl -s "http://localhost:9090/api/v1/query?query=job:service_cpu_regression_ratio:5m_vs_24h>1.3" | jq '.data.result[].metric.job'
# 2. Check host-level metrics
podman stats --no-stream
# 3. Check for noisy neighbor (other processes)
top -b -n 1 | head -20
# 4. Check disk I/O (may cause CPU wait)
iostat -x 1 5
# 5. Check network issues
netstat -s | grep -i error
Resolution:
- If host resource exhaustion:
  - Identify and stop non-essential processes
  - Scale out to additional hosts
  - Increase host resources
- If shared dependency issue:
  - Check database/Redis health
  - Check network connectivity
  - Verify shared storage performance
- If coordinated attack/abuse:
  - Implement rate limiting
  - Block abusive traffic
  - Scale defensive capacity
## Incident Response Procedures
### INC-PROF-001: Pyroscope Server Unavailable
Symptoms:
- No new profiling data in Grafana dashboard
- "Connection refused" errors in service logs
- Pyroscope UI not accessible at port 4040
Diagnosis:
# Check if Pyroscope container is running
podman ps | grep pyroscope
# Check container health
podman inspect pyroscope --format='{{.State.Health.Status}}'
# Check container logs
podman logs pyroscope --tail 100
# Test internal connectivity
podman exec backend curl -s http://pyroscope:4040/ready
Resolution:
# Restart Pyroscope
podman-compose -f docker-compose.prod.yml restart pyroscope
# If restart fails, recreate container
podman-compose -f docker-compose.prod.yml up -d --force-recreate pyroscope
# Verify recovery
curl http://localhost:4040/ready
# Expected: "ready"
Impact: Profiling data is lost during outage but services continue operating normally.
### INC-PROF-002: High CPU Overhead from Profiling
Symptoms:
- Higher than expected CPU usage (>5% overhead)
- Services responding slower than normal
- py-spy processes consuming excessive CPU
Diagnosis:
# Check py-spy processes
podman exec ai-yolo26 ps aux | grep py-spy
# Check profiler log for errors
podman exec ai-yolo26 cat /tmp/profiler.log
# Check profile interval
podman exec ai-yolo26 env | grep PROFILE_INTERVAL
Resolution:
Option 1: Increase profile interval (less frequent profiling)
# Edit docker-compose.prod.yml or create override
# Set PROFILE_INTERVAL=60 (default is 30)
podman-compose -f docker-compose.prod.yml up -d ai-yolo26
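One way to apply this without editing docker-compose.prod.yml in place is a small override file layered on with a second -f flag; the file name below is illustrative:
# Illustrative override file (name it per your deployment conventions)
cat > docker-compose.profiling.yml << 'EOF'
services:
  ai-yolo26:
    environment:
      - PROFILE_INTERVAL=60
EOF
# Pass both files so the override is merged on top of the production config
podman-compose -f docker-compose.prod.yml -f docker-compose.profiling.yml up -d ai-yolo26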
Option 2: Disable profiling on specific service
# Add to docker-compose override
# PYROSCOPE_ENABLED=false for the affected service
podman-compose -f docker-compose.prod.yml up -d ai-yolo26
Option 3: Disable profiling globally
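A minimal sketch of the global disable, assuming PYROSCOPE_ENABLED is already defined in .env (as in the Quick Reference):
# Flip the flag in .env (assumes the variable is already present in the file)
sed -i 's/^PYROSCOPE_ENABLED=.*/PYROSCOPE_ENABLED=false/' .env
# Recreate services so the new value takes effect
podman-compose -f docker-compose.prod.yml up -d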
Impact: Reduced profiling coverage but improved service performance.
### INC-PROF-003: Profiler Not Collecting Data for Service
Symptoms:
- Specific service missing from Pyroscope UI
- Service running but no profiles being collected
- Profiler log shows errors
Diagnosis:
# Check if SERVICE_NAME is set
podman exec ai-yolo26 env | grep SERVICE_NAME
# Check profiler script is running
podman exec ai-yolo26 pgrep -a pyroscope
# Check profiler log
podman exec ai-yolo26 cat /tmp/profiler.log
# For backend (SDK-based), check initialization
podman logs backend 2>&1 | grep -i "pyroscope profiling"
Resolution:
For AI services (py-spy based):
# Restart the service to reinitialize profiler
podman-compose -f docker-compose.prod.yml restart ai-yolo26
# Verify profiler started
podman exec ai-yolo26 cat /tmp/profiler.log
# Should see: "Starting profiler for ai-yolo26"
For backend (SDK based):
# Check SDK is installed
podman exec backend pip show pyroscope-io
# Restart backend
podman-compose -f docker-compose.prod.yml restart backend
# Verify initialization
podman logs backend 2>&1 | grep -i "pyroscope profiling initialized"
### INC-PROF-004: Pyroscope Storage Full (NEM-3928)
Symptoms:
- Pyroscope queries becoming slow
- Disk usage growing rapidly in pyroscope volume
- "disk full" or "quota exceeded" errors in logs
Diagnosis:
# Check volume usage
podman volume inspect pyroscope_data
# Check container disk usage
podman exec pyroscope df -h /data
# Check retention settings (NEM-3928)
podman exec pyroscope cat /etc/pyroscope/config.yml
# Check compactor status
podman logs pyroscope 2>&1 | grep -i "compactor\|retention\|cleanup"
Retention Configuration (NEM-3928):
The Pyroscope retention policy is configured in monitoring/pyroscope/pyroscope-config.yml:
| Setting | Value | Description |
|---|---|---|
| limits.compactor_blocks_retention_period | 720h | Maximum block age (30 days) |
| pyroscopedb.retention_policy_min_free_disk_gb | 10 GB | Delete oldest blocks when free space falls below this |
| pyroscopedb.retention_policy_min_disk_available_percentage | 5% | Secondary disk space threshold |
| compactor.cleanup_interval | 15m | How often retention is enforced |
| compactor.deletion_delay | 2h | Delay before permanent deletion |
Resolution:
# Option 1: Wait for automatic cleanup (recommended)
# The compactor runs every 15 minutes and will delete old blocks
# when disk space drops below thresholds. Check logs:
podman logs pyroscope 2>&1 | grep -i "deleting\|cleanup\|retention"
# Option 2: Reduce retention period for faster cleanup
# Edit monitoring/pyroscope/pyroscope-config.yml:
# Change: compactor_blocks_retention_period: 720h
# To: compactor_blocks_retention_period: 168h # 7 days
# Restart Pyroscope to apply new config
podman-compose -f docker-compose.prod.yml restart pyroscope
# Option 3: Force immediate compaction
# Trigger a compaction cycle by restarting Pyroscope
podman-compose -f docker-compose.prod.yml restart pyroscope
# Option 4: If urgent, clear all data (last resort)
podman-compose -f docker-compose.prod.yml stop pyroscope
podman volume rm pyroscope_data # WARNING: Deletes all profiling history
podman-compose -f docker-compose.prod.yml up -d pyroscope
Prevention:
- Monitor Pyroscope disk usage with the "Storage growth per day" metric
- Alert when disk usage exceeds 80% of expected maximum (~16GB for 30-day retention); a simple check script sketch follows this list
- Consider reducing compactor_blocks_retention_period for disk-constrained environments
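A minimal check that could back the 80% alert above; the threshold and alert action are illustrative and should be adapted to your alerting setup:
#!/bin/bash
# Sketch: warn when the Pyroscope data directory passes an illustrative 80% threshold
THRESHOLD=80
USAGE=$(podman exec pyroscope df -h /data | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "${USAGE:-0}" -ge "$THRESHOLD" ]; then
  echo "ALERT: Pyroscope /data at ${USAGE}% used (threshold ${THRESHOLD}%)"
  # Add alerting integration here
fi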
Impact: Automatic retention prevents disk exhaustion. Manual intervention only needed if retention policy is insufficient for workload.
## Maintenance Procedures
### MAINT-PROF-001: Updating Pyroscope Version
Pre-flight Checks:
# Check current version
podman exec pyroscope pyroscope --version
# Review release notes for breaking changes
# https://github.com/grafana/pyroscope/releases
Procedure:
# 1. Pull new image
podman pull grafana/pyroscope:latest
# 2. Stop Pyroscope
podman-compose -f docker-compose.prod.yml stop pyroscope
# 3. Backup configuration
cp monitoring/pyroscope/pyroscope-config.yml monitoring/pyroscope/pyroscope-config.yml.bak
# 4. Update version in docker-compose.prod.yml if pinned
# image: grafana/pyroscope:1.18.0 -> grafana/pyroscope:1.19.0
# 5. Recreate container
podman-compose -f docker-compose.prod.yml up -d pyroscope
# 6. Verify health
curl http://localhost:4040/ready
Rollback:
podman-compose -f docker-compose.prod.yml stop pyroscope
# Revert docker-compose.prod.yml version
podman-compose -f docker-compose.prod.yml up -d pyroscope
### MAINT-PROF-002: Adding Profiling to New Service
Procedure:
- Add py-spy to Dockerfile:
# Install py-spy for profiling
RUN uv tool install py-spy && \
cp /root/.local/bin/py-spy /usr/local/bin/py-spy && \
chmod +x /usr/local/bin/py-spy
# Copy profiler scripts
COPY --chmod=755 scripts/pyroscope-profiler.sh /usr/local/bin/pyroscope-profiler.sh
COPY --chmod=755 scripts/ai-entrypoint.sh /usr/local/bin/ai-entrypoint.sh
# Install procps for pgrep
RUN apt-get update && apt-get install -y procps && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/local/bin/ai-entrypoint.sh"]
- Add environment variables in docker-compose.prod.yml:
my-new-service:
  labels:
    pyroscope.profile: 'true'
    pyroscope.service: 'my-new-service'
  environment:
    - SERVICE_NAME=my-new-service
    - PYROSCOPE_ENABLED=${PYROSCOPE_ENABLED:-true}
    - PYROSCOPE_URL=http://pyroscope:4040
- Rebuild and deploy:
podman-compose -f docker-compose.prod.yml build --no-cache my-new-service
podman-compose -f docker-compose.prod.yml up -d my-new-service
- Verify profiling:
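The same checks used in INC-PROF-003 apply here; for a py-spy based service, verification might look like:
# Confirm the profiler started inside the new container
podman exec my-new-service cat /tmp/profiler.log
# Expected: "Starting profiler for my-new-service"
# Confirm the service appears in the Pyroscope UI
# Open http://localhost:4040 and check that my-new-service is listed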
### MAINT-PROF-003: Configuring Grafana Datasource
Procedure:
The Pyroscope datasource is auto-provisioned. If manual setup is needed:
# 1. Check if datasource exists
curl -s http://admin:admin@localhost:3002/api/datasources | jq '.[].name' # pragma: allowlist secret
# 2. If missing, add via provisioning
cat > monitoring/grafana/provisioning/datasources/pyroscope.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Pyroscope
    type: pyroscope
    url: http://pyroscope:4040
    access: proxy
    isDefault: false
EOF
# 3. Restart Grafana
podman-compose -f docker-compose.prod.yml restart grafana
## Health Monitoring
### Pyroscope Health Check
#!/bin/bash
# Check Pyroscope health and alert if down
if ! curl -sf http://localhost:4040/ready > /dev/null 2>&1; then
echo "ALERT: Pyroscope is not responding"
# Add alerting integration here
exit 1
fi
echo "OK: Pyroscope is healthy"
### Profile Data Freshness
#!/bin/bash
# Check if profiles are being collected (data within last 5 minutes)
# Query Pyroscope API for recent data
SERVICES="nemotron-backend ai-yolo26 ai-florence ai-clip"
for service in $SERVICES; do
  # Check for data in last 5 minutes
  RESULT=$(curl -s "http://localhost:4040/pyroscope/render?query=${service}&from=now-5m&until=now&format=json" | jq '.flamebearer.numTicks // 0')
  if [ "$RESULT" -eq 0 ]; then
    echo "WARNING: No recent profiles for $service"
  else
    echo "OK: $service has recent profile data"
  fi
done
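Both scripts are suitable for periodic scheduling; an illustrative crontab entry (script paths and log destination are placeholders):
# Illustrative schedule; adjust paths to your deployment
*/5 * * * * /opt/hsi/scripts/pyroscope-health-check.sh >> /var/log/pyroscope-health.log 2>&1
*/15 * * * * /opt/hsi/scripts/profile-freshness-check.sh >> /var/log/pyroscope-health.log 2>&1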
## Performance Baselines
| Metric | Expected | Alert Threshold |
|---|---|---|
| Pyroscope CPU usage | < 5% | > 10% |
| Pyroscope memory usage | < 500MB | > 1GB |
| Profile push latency | < 1s | > 5s |
| Profiler overhead per service | 1-3% | > 5% |
| Storage growth per day | ~350-700MB | > 1GB |
| Total storage (30-day) | ~10-20GB | > 30GB |
## Related Documentation
| Document | Purpose |
|---|---|
| Profiling Guide | User-facing profiling documentation |
| Monitoring Guide | Full observability stack |
| Pyroscope UI | Frontend dashboard documentation |
| AI Performance | AI service performance tuning |
## Appendix: Configuration Files
### Pyroscope Server Configuration (NEM-3928)
Location: monitoring/pyroscope/pyroscope-config.yml
# Pyroscope 1.18.0 configuration (NEM-3928)
# Comprehensive retention policy to prevent disk bloat
# Storage configuration
storage:
  backend: filesystem
  filesystem:
    dir: /data
# Server configuration
server:
  http_listen_port: 4040
# PyroscopeDB retention policy
# Disk-based retention to prevent storage exhaustion
pyroscopedb:
  retention_policy_min_free_disk_gb: 10 # Delete oldest when below 10GB free
  retention_policy_min_disk_available_percentage: 0.05 # Or below 5% free
  retention_policy_enforcement_interval: 5m # Check every 5 minutes
  max_block_duration: 3h # Max block size before compaction
# Compactor configuration
# Compacts blocks and enforces time-based retention
compactor:
  data_dir: /data/compactor
  compaction_interval: 2h # Compact every 2 hours
  cleanup_interval: 15m # Apply retention every 15 minutes
  deletion_delay: 2h # Safety buffer before deletion
  block_cleanup_enabled: true
# Limits configuration
# Time-based retention and ingestion limits
limits:
  compactor_blocks_retention_period: 720h # 30 days max retention
  ingestion_rate_mb: 4 # Max ingestion rate
  ingestion_burst_size_mb: 8 # Burst allowance
### Retention Tuning Guide
| Scenario | Recommended Changes |
|---|---|
| Limited disk space (<50GB) | compactor_blocks_retention_period: 168h (7 days) |
| High-volume profiling | Increase ingestion_rate_mb and ingestion_burst_size_mb |
| Faster cleanup | Reduce cleanup_interval to 5m |
| More disk safety margin | Increase retention_policy_min_free_disk_gb to 20 |
| Development/testing | compactor_blocks_retention_period: 24h (1 day) |
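As an example of applying the first row, the retention change from INC-PROF-004 can be scripted (the sed edit assumes the value is still at the 720h default):
# Example: switch to 7-day retention on a disk-constrained host (assumes the default 720h value)
sed -i 's/compactor_blocks_retention_period: 720h/compactor_blocks_retention_period: 168h/' monitoring/pyroscope/pyroscope-config.yml
# Restart Pyroscope to apply the new config
podman-compose -f docker-compose.prod.yml restart pyroscope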
### AI Entrypoint Script
Location: scripts/ai-entrypoint.sh
#!/bin/bash
# Starts profiler in background if enabled, then runs main command
set -e
if [ "${PYROSCOPE_ENABLED:-true}" = "true" ] && [ -n "$SERVICE_NAME" ]; then
nohup /usr/local/bin/pyroscope-profiler.sh "$SERVICE_NAME" \
"${PYROSCOPE_URL:-http://pyroscope:4040}" \
"${PROFILE_INTERVAL:-30}" >> /tmp/profiler.log 2>&1 &
fi
exec "$@"
### Profiler Script
Location: scripts/pyroscope-profiler.sh
Captures CPU profiles using py-spy and pushes to Pyroscope in Speedscope format. Key parameters:
- --nonblocking: Minimizes impact on the profiled process
- --duration: Profile capture duration (default 30s)
- --format speedscope: Compatible with Pyroscope ingestion
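For orientation, the core loop of such a profiler roughly amounts to the sketch below. This is a simplified illustration rather than the actual script: it assumes py-spy attaches to the service's main process (PID 1 in the container) and that the Pyroscope server accepts Speedscope uploads on its /ingest endpoint; verify the endpoint and query parameters against your Pyroscope version.
#!/bin/bash
# Simplified sketch of a py-spy -> Pyroscope push loop (not the production script)
SERVICE_NAME="$1"
PYROSCOPE_URL="${2:-http://pyroscope:4040}"
INTERVAL="${3:-30}"
while true; do
  FROM=$(date +%s)
  # Capture a CPU profile of the main process in Speedscope format
  py-spy record --pid 1 --duration "$INTERVAL" --format speedscope \
    --nonblocking --output /tmp/profile.speedscope.json
  UNTIL=$(date +%s)
  # Push to Pyroscope's ingest endpoint (endpoint/parameters assumed; verify for your version)
  curl -sf --data-binary @/tmp/profile.speedscope.json \
    "${PYROSCOPE_URL}/ingest?name=${SERVICE_NAME}&from=${FROM}&until=${UNTIL}&format=speedscope"
done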