
Profiling Operations Runbook

Operational procedures for managing Pyroscope continuous profiling in production.

Quick Reference

| Task | Command |
|------|---------|
| Check Pyroscope health | `curl http://localhost:4040/ready` |
| View Pyroscope UI | Open http://localhost:4040 |
| Restart Pyroscope | `podman-compose -f docker-compose.prod.yml restart pyroscope` |
| View profiler logs (AI svc) | `podman exec ai-yolo26 cat /tmp/profiler.log` |
| Check backend profiling | `podman logs backend 2>&1 \| grep -i pyroscope` |
| Disable profiling globally | Set `PYROSCOPE_ENABLED=false` in `.env` |

Automated Regression Alert Response Procedures

This section covers response procedures for automated regression detection alerts (NEM-4133).

ALERT-REG-001: ServiceCPUSpike / ServiceCPUSpikeCritical

Alert Condition: CPU usage >50% (warning) or >100% (critical) above 24-hour average for 15+ minutes.

Symptoms:

  • Service consuming significantly more CPU than historical baseline
  • Increased response times
  • Higher infrastructure costs

Diagnosis:

# 1. Check current CPU regression ratio
curl -s "http://localhost:9090/api/v1/query?query=job:service_cpu_regression_ratio:5m_vs_24h" | jq '.data.result'

# 2. View CPU profile in Grafana Pyroscope
# Open: http://localhost:3002/d/hsi-profiling
# Select the affected service from dropdown

# 3. Compare current vs baseline flame graphs
# Enable "Comparison" mode in the dashboard
# Look for new hot functions or significantly increased function times

# 4. Check for recent deployments
git log --oneline --since="24 hours ago"

# 5. Check if workload increased
curl -s "http://localhost:9090/api/v1/query?query=rate(hsi_detections_processed_total[1h])" | jq
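
The ratio from step 1 maps directly onto the alert thresholds (>50% above the 24h baseline is a ratio >1.5, >100% is >2.0). A minimal triage sketch, assuming a POSIX shell with awk; `classify_cpu_regression` is a hypothetical helper, not part of the runbook tooling:

```shell
# Hypothetical helper: map a CPU regression ratio onto the
# ALERT-REG-001 severity levels (>1.5 = warning, >2.0 = critical).
classify_cpu_regression() {
  # $1: ratio of current 5m CPU rate to the 24h average, e.g. 1.8
  awk -v r="$1" 'BEGIN {
    if (r > 2.0)      print "critical"
    else if (r > 1.5) print "warning"
    else              print "ok"
  }'
}

classify_cpu_regression 1.8   # warning
```

Feed it the value returned by the recording-rule query above.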

Resolution:

  1. If caused by code regression:

# Identify the problematic commit using flame graph comparison
# Roll back to the previous version if needed
podman-compose -f docker-compose.prod.yml pull [service]
podman-compose -f docker-compose.prod.yml up -d [service]

  2. If caused by increased workload:

  • Scale the service if possible
  • Implement rate limiting
  • Optimize hot code paths identified in the flame graph

  3. If caused by memory pressure (GC overhead):

  • Check memory alerts alongside CPU
  • Increase memory allocation
  • Investigate memory leaks

Escalation: If unresolved after 30 minutes, escalate to on-call engineer.


ALERT-REG-002: ServiceMemoryGrowth / ServiceMemoryGrowthCritical

Alert Condition: Memory usage >25% (warning) or >50% (critical) above 6-hour average for 30+ minutes.

Symptoms:

  • Gradual memory increase over time
  • Service restarts due to OOM
  • Degraded performance

Diagnosis:

# 1. Check current memory regression ratio
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_regression_ratio:current_vs_6h" | jq '.data.result'

# 2. Check memory growth rate
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:deriv1h" | jq '.data.result'

# 3. Check memory profile in Pyroscope
# Select "Memory Bytes" or "Memory Allocations" profile type
# Look for functions allocating large amounts

# 4. Check container memory limits
podman stats --no-stream [container_name]

# 5. For Python services, check for common leak patterns with tracemalloc.
# Note: tracemalloc must run inside the service process; a one-off
# `python -c` starts a fresh interpreter and reveals nothing about the
# running service (see the tracemalloc snippet under ALERT-REG-003)
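
As with CPU, the ratio from step 1 maps onto the alert thresholds (>25% above the 6h baseline is a ratio >1.25, >50% is >1.5). A triage sketch assuming awk; `classify_mem_regression` is a hypothetical helper:

```shell
# Hypothetical helper: map a memory regression ratio onto the
# ALERT-REG-002 severity levels (>1.25 = warning, >1.5 = critical).
classify_mem_regression() {
  # $1: ratio of current memory to the 6h average, e.g. 1.3
  awk -v r="$1" 'BEGIN {
    if (r > 1.5)       print "critical"
    else if (r > 1.25) print "warning"
    else               print "ok"
  }'
}

classify_mem_regression 1.3   # warning
```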

Resolution:

  1. If a memory leak is suspected:

# Restart the service as immediate mitigation
podman-compose -f docker-compose.prod.yml restart [service]

# Schedule investigation of the leak source

  2. If caused by caching:

  • Review cache eviction policies
  • Reduce cache size limits
  • Add cache entry TTLs

  3. If caused by large request buffers:

  • Implement streaming for large responses
  • Add request size limits

ALERT-REG-003: PotentialMemoryLeak

Alert Condition: Memory projected to double within 24 hours based on current growth rate.

Symptoms:

  • Steadily increasing memory usage
  • Linear growth pattern visible in monitoring
  • No correlation with workload

Diagnosis:

# 1. Check projected memory
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:predicted_24h" | jq '.data.result'

# 2. Check growth rate (bytes/hour)
curl -s "http://localhost:9090/api/v1/query?query=job:service_memory_bytes:deriv1h" | jq '.data.result'

# 3. Analyze memory allocation profile over time
# In Grafana Pyroscope, compare memory profiles from:
# - 6 hours ago
# - Current
# Look for functions with significantly more allocations

# 4. For Python: enable memory profiling
podman exec [container] python -c "
import tracemalloc
tracemalloc.start()
# ... run suspect code ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)
"
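
The growth rate from step 2 also gives a quick doubling-time estimate; the alert fires when this falls under 24 hours. A sketch assuming awk; `hours_to_double` is a hypothetical helper:

```shell
# Hypothetical helper: hours until memory doubles at the current
# linear growth rate (current bytes / bytes-per-hour growth).
hours_to_double() {
  # $1: current memory in bytes; $2: growth in bytes/hour (deriv1h)
  awk -v c="$1" -v g="$2" 'BEGIN {
    if (g <= 0) { print "not growing"; exit }
    printf "%.1f\n", c / g
  }'
}

# 1 GiB resident, growing 128 MiB/hour:
hours_to_double 1073741824 134217728   # 8.0
```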

Resolution:

  1. Immediate mitigation:

# Set up scheduled restarts until a fix is deployed
# Add to crontab or a systemd timer:
# 0 */4 * * * podman-compose -f docker-compose.prod.yml restart [service]

  2. Investigation:

  • Use a memory profiler to identify the leak source
  • Check for unclosed database connections
  • Check for unbounded caches or queues
  • Review recent code changes for retained references

  3. Long-term fix:

  • Deploy the code fix
  • Add memory monitoring to the CI/CD pipeline
  • Implement memory pressure alerts

ALERT-REG-004: BackendHighLatency / BackendHighLatencyCritical

Alert Condition: Backend API P99 latency >2s (warning) or >5s (critical) for 10+ minutes.

Symptoms:

  • Slow API responses
  • UI timeouts
  • WebSocket disconnections

Diagnosis:

# 1. Check current latency
curl -s "http://localhost:9090/api/v1/query?query=job:backend_api_latency:p99_5m" | jq '.data.result'

# 2. Check database query latency
curl -s "http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,rate(hsi_db_query_duration_seconds_bucket[5m]))" | jq

# 3. Check Redis latency
curl -s "http://localhost:9090/api/v1/query?query=redis_slowlog_length" | jq

# 4. Check CPU usage (may be contention)
curl -s "http://localhost:9090/api/v1/query?query=job:backend_cpu_seconds:rate5m" | jq

# 5. View backend flame graph for hot paths
# Open http://localhost:3002/d/hsi-profiling
# Select "nemotron-backend" service
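
The queries above all follow the same pattern, so a small wrapper cuts the repetition. A sketch assuming Prometheus on localhost:9090 and `jq` installed; `promq` is a hypothetical convenience function, not part of the runbook tooling:

```shell
# Hypothetical helper: run a PromQL instant query and print
# "job<TAB>value" for each result series.
promq() {
  curl -s "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=$1" \
    | jq -r '.data.result[] | [(.metric.job // "all"), .value[1]] | @tsv'
}

# Example usage:
# promq 'job:backend_api_latency:p99_5m'
# promq 'histogram_quantile(0.99, rate(hsi_db_query_duration_seconds_bucket[5m]))'
```

Using `--data-urlencode` also avoids hand-escaping PromQL metacharacters in the URL.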

Resolution:

  1. If the database is slow:

# Check for long-running queries
podman exec postgres psql -U hsi -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"

# Check for missing indexes
podman exec postgres psql -U hsi -c "EXPLAIN ANALYZE [slow_query];"

  2. If Redis is slow:

# Check the slow log
podman exec redis redis-cli slowlog get 10

# Check memory usage
podman exec redis redis-cli info memory

  3. If CPU contention:

  • Scale backend instances
  • Optimize hot code paths from the flame graph
  • Add caching for expensive operations

ALERT-REG-005: YOLO26LatencyRegression

Alert Condition: YOLO26 inference P95 latency increased >50% compared to 1-hour average.

Symptoms:

  • Object detection taking longer
  • Real-time detection pipeline backing up
  • Detection queue growing

Diagnosis:

# 1. Check YOLO26 latency
curl -s "http://localhost:9090/api/v1/query?query=job:yolo26_inference_latency:p95_5m" | jq

# 2. Check GPU utilization
curl -s "http://localhost:9090/api/v1/query?query=yolo26_gpu_utilization" | jq

# 3. Check GPU temperature (throttling?)
curl -s "http://localhost:9090/api/v1/query?query=yolo26_gpu_temperature" | jq

# 4. Check if model is loaded
curl -s "http://localhost:9090/api/v1/query?query=yolo26_model_loaded" | jq

# 5. View YOLO26 flame graph
# Open http://localhost:3002/d/hsi-profiling
# Select "ai-yolo26" service
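
Step 3's temperature reading can be triaged against rough thermal limits. A sketch assuming awk; the 75/85°C cutoffs are illustrative assumptions, so check your GPU's actual throttle point:

```shell
# Hypothetical helper: flag GPU temperatures that commonly indicate
# thermal throttling (thresholds are illustrative, not vendor specs).
gpu_throttle_risk() {
  # $1: GPU temperature in degrees Celsius
  awk -v t="$1" 'BEGIN {
    if (t >= 85)      print "throttling-likely"
    else if (t >= 75) print "watch"
    else              print "ok"
  }'
}

gpu_throttle_risk 83   # watch
```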

Resolution:

  1. If the GPU is throttling:

  • Improve cooling
  • Reduce batch size
  • Lower the power limit

  2. If the model is not optimally loaded:

# Restart to reinitialize TensorRT
podman-compose -f docker-compose.prod.yml restart ai-yolo26

  3. If the input resolution changed:

  • Verify input preprocessing
  • Check for larger-than-expected images

ALERT-REG-006: MultiServiceCPURegression

Alert Condition: 2 or more services showing >30% CPU increase simultaneously.

Symptoms:

  • System-wide slowdown
  • Multiple services affected
  • Infrastructure-level issue likely

Diagnosis:

# 1. Check which services are affected
curl -s "http://localhost:9090/api/v1/query?query=job:service_cpu_regression_ratio:5m_vs_24h>1.3" | jq '.data.result[].metric.job'

# 2. Check host-level metrics
podman stats --no-stream

# 3. Check for noisy neighbor (other processes)
top -b -n 1 | head -20

# 4. Check disk I/O (may cause CPU wait)
iostat -x 1 5

# 5. Check network issues
netstat -s | grep -i error
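
Step 1's output can be reduced to the alert's trigger condition (2 or more services above the 1.3x ratio). A sketch reading `job ratio` pairs from stdin; `count_regressing` is a hypothetical helper:

```shell
# Hypothetical helper: count services whose regression ratio exceeds
# 1.3, the MultiServiceCPURegression threshold (alert fires at 2+).
count_regressing() {
  # stdin: one "job ratio" pair per line
  awk '$2 > 1.3 { n++ } END { print n + 0 }'
}

printf 'backend 1.5\nai-yolo26 1.1\nai-florence 1.4\n' | count_regressing   # 2
```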

Resolution:

  1. If host resource exhaustion:

  • Identify and stop non-essential processes
  • Scale out to additional hosts
  • Increase host resources

  2. If a shared dependency issue:

  • Check database/Redis health
  • Check network connectivity
  • Verify shared storage performance

  3. If a coordinated attack or abuse:

  • Implement rate limiting
  • Block abusive traffic
  • Scale defensive capacity

Incident Response Procedures

INC-PROF-001: Pyroscope Server Unavailable

Symptoms:

  • No new profiling data in Grafana dashboard
  • "Connection refused" errors in service logs
  • Pyroscope UI not accessible at port 4040

Diagnosis:

# Check if Pyroscope container is running
podman ps | grep pyroscope

# Check container health
podman inspect pyroscope --format='{{.State.Health.Status}}'

# Check container logs
podman logs pyroscope --tail 100

# Test internal connectivity
podman exec backend curl -s http://pyroscope:4040/ready

Resolution:

# Restart Pyroscope
podman-compose -f docker-compose.prod.yml restart pyroscope

# If restart fails, recreate container
podman-compose -f docker-compose.prod.yml up -d --force-recreate pyroscope

# Verify recovery
curl http://localhost:4040/ready
# Expected: "ready"
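
A restarted Pyroscope can take a few seconds to become ready, so the verification step is more robust as a poll with a timeout. A sketch assuming curl; `wait_for_ready` is a hypothetical helper:

```shell
# Hypothetical helper: poll a readiness URL until it responds or the
# timeout (in seconds) expires; returns non-zero on timeout.
wait_for_ready() {
  url="${1:-http://localhost:4040/ready}"
  timeout="${2:-60}"
  elapsed=0
  until curl -sf "$url" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 2
    elapsed=$((elapsed + 2))
  done
}

# wait_for_ready http://localhost:4040/ready 60 || echo "Pyroscope still not ready"
```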

Impact: Profiling data is lost during outage but services continue operating normally.


INC-PROF-002: High CPU Overhead from Profiling

Symptoms:

  • Higher than expected CPU usage (>5% overhead)
  • Services responding slower than normal
  • py-spy processes consuming excessive CPU

Diagnosis:

# Check py-spy processes
podman exec ai-yolo26 ps aux | grep py-spy

# Check profiler log for errors
podman exec ai-yolo26 cat /tmp/profiler.log

# Check profile interval
podman exec ai-yolo26 env | grep PROFILE_INTERVAL

Resolution:

Option 1: Increase profile interval (less frequent profiling)

# Edit docker-compose.prod.yml or create override
# Set PROFILE_INTERVAL=60 (default is 30)
podman-compose -f docker-compose.prod.yml up -d ai-yolo26

Option 2: Disable profiling on specific service

# Add to docker-compose override
# PYROSCOPE_ENABLED=false for the affected service
podman-compose -f docker-compose.prod.yml up -d ai-yolo26

Option 3: Disable profiling globally

echo "PYROSCOPE_ENABLED=false" >> .env
podman-compose -f docker-compose.prod.yml up -d

Impact: Reduced profiling coverage but improved service performance.


INC-PROF-003: Profiler Not Collecting Data for Service

Symptoms:

  • Specific service missing from Pyroscope UI
  • Service running but no profiles being collected
  • Profiler log shows errors

Diagnosis:

# Check if SERVICE_NAME is set
podman exec ai-yolo26 env | grep SERVICE_NAME

# Check profiler script is running
podman exec ai-yolo26 pgrep -a pyroscope

# Check profiler log
podman exec ai-yolo26 cat /tmp/profiler.log

# For backend (SDK-based), check initialization
podman logs backend 2>&1 | grep -i "pyroscope profiling"

Resolution:

For AI services (py-spy based):

# Restart the service to reinitialize profiler
podman-compose -f docker-compose.prod.yml restart ai-yolo26

# Verify profiler started
podman exec ai-yolo26 cat /tmp/profiler.log
# Should see: "Starting profiler for ai-yolo26"

For backend (SDK based):

# Check SDK is installed
podman exec backend pip show pyroscope-io

# Restart backend
podman-compose -f docker-compose.prod.yml restart backend

# Verify initialization
podman logs backend 2>&1 | grep -i "pyroscope profiling initialized"

INC-PROF-004: Pyroscope Storage Full (NEM-3928)

Symptoms:

  • Pyroscope queries becoming slow
  • Disk usage growing rapidly in pyroscope volume
  • "disk full" or "quota exceeded" errors in logs

Diagnosis:

# Check volume usage
podman volume inspect pyroscope_data

# Check container disk usage
podman exec pyroscope df -h /data

# Check retention settings (NEM-3928)
podman exec pyroscope cat /etc/pyroscope/config.yml

# Check compactor status
podman logs pyroscope 2>&1 | grep -i "compactor\|retention\|cleanup"

Retention Configuration (NEM-3928):

The Pyroscope retention policy is configured in monitoring/pyroscope/pyroscope-config.yml:

| Setting | Value | Description |
|---------|-------|-------------|
| `limits.compactor_blocks_retention_period` | 720h | Maximum block age (30 days) |
| `pyroscopedb.retention_policy_min_free_disk_gb` | 10 GB | Delete oldest blocks when free space drops below this |
| `pyroscopedb.retention_policy_min_disk_available_percentage` | 5% | Secondary disk-space threshold |
| `compactor.cleanup_interval` | 15m | How often retention is enforced |
| `compactor.deletion_delay` | 2h | Delay before permanent deletion |

Resolution:

# Option 1: Wait for automatic cleanup (recommended)
# The compactor runs every 15 minutes and will delete old blocks
# when disk space drops below thresholds. Check logs:
podman logs pyroscope 2>&1 | grep -i "deleting\|cleanup\|retention"

# Option 2: Reduce retention period for faster cleanup
# Edit monitoring/pyroscope/pyroscope-config.yml:
# Change: compactor_blocks_retention_period: 720h
# To:     compactor_blocks_retention_period: 168h  # 7 days

# Restart Pyroscope to apply new config
podman-compose -f docker-compose.prod.yml restart pyroscope

# Option 3: Force immediate compaction
# Trigger a compaction cycle by restarting Pyroscope
podman-compose -f docker-compose.prod.yml restart pyroscope

# Option 4: If urgent, clear all data (last resort)
podman-compose -f docker-compose.prod.yml stop pyroscope
podman volume rm pyroscope_data  # WARNING: Deletes all profiling history
podman-compose -f docker-compose.prod.yml up -d pyroscope

Prevention:

  • Monitor Pyroscope disk usage with the "Storage growth per day" metric
  • Alert when disk usage exceeds 80% of expected maximum (~16GB for 30-day retention)
  • Consider reducing compactor_blocks_retention_period for disk-constrained environments
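
The "expected maximum" above can be estimated from the observed growth rate. A sketch assuming awk; `expected_storage_gb` is a hypothetical helper:

```shell
# Hypothetical helper: rough steady-state storage estimate,
# growth (MB/day) x retention (days), reported in GB.
expected_storage_gb() {
  awk -v g="$1" -v d="$2" 'BEGIN { printf "%.1f\n", g * d / 1024 }'
}

# 500 MB/day at the default 30-day retention:
expected_storage_gb 500 30   # 14.6
```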

Impact: Automatic retention prevents disk exhaustion. Manual intervention only needed if retention policy is insufficient for workload.


Maintenance Procedures

MAINT-PROF-001: Updating Pyroscope Version

Pre-flight Checks:

# Check current version
podman exec pyroscope pyroscope --version

# Review release notes for breaking changes
# https://github.com/grafana/pyroscope/releases

Procedure:

# 1. Pull new image
podman pull grafana/pyroscope:latest

# 2. Stop Pyroscope
podman-compose -f docker-compose.prod.yml stop pyroscope

# 3. Backup configuration
cp monitoring/pyroscope/pyroscope-config.yml monitoring/pyroscope/pyroscope-config.yml.bak

# 4. Update version in docker-compose.prod.yml if pinned
# image: grafana/pyroscope:1.18.0 -> grafana/pyroscope:1.19.0

# 5. Recreate container
podman-compose -f docker-compose.prod.yml up -d pyroscope

# 6. Verify health
curl http://localhost:4040/ready

Rollback:

podman-compose -f docker-compose.prod.yml stop pyroscope
# Revert docker-compose.prod.yml version
podman-compose -f docker-compose.prod.yml up -d pyroscope

MAINT-PROF-002: Adding Profiling to New Service

Procedure:

  1. Add py-spy to Dockerfile:
# Install py-spy for profiling
RUN uv tool install py-spy && \
    cp /root/.local/bin/py-spy /usr/local/bin/py-spy && \
    chmod +x /usr/local/bin/py-spy

# Copy profiler scripts
COPY --chmod=755 scripts/pyroscope-profiler.sh /usr/local/bin/pyroscope-profiler.sh
COPY --chmod=755 scripts/ai-entrypoint.sh /usr/local/bin/ai-entrypoint.sh

# Install procps for pgrep
RUN apt-get update && apt-get install -y procps && rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["/usr/local/bin/ai-entrypoint.sh"]
  2. Add labels and environment variables in docker-compose.prod.yml:
my-new-service:
  labels:
    pyroscope.profile: 'true'
    pyroscope.service: 'my-new-service'
  environment:
    - SERVICE_NAME=my-new-service
    - PYROSCOPE_ENABLED=${PYROSCOPE_ENABLED:-true}
    - PYROSCOPE_URL=http://pyroscope:4040
  3. Rebuild and deploy:
podman-compose -f docker-compose.prod.yml build --no-cache my-new-service
podman-compose -f docker-compose.prod.yml up -d my-new-service
  4. Verify profiling:
    podman exec my-new-service cat /tmp/profiler.log
    # Should see: "Starting profiler for my-new-service"
    

MAINT-PROF-003: Configuring Grafana Datasource

Procedure:

The Pyroscope datasource is auto-provisioned. If manual setup is needed:

# 1. Check if datasource exists
curl -s http://admin:admin@localhost:3002/api/datasources | jq '.[].name' # pragma: allowlist secret

# 2. If missing, add via provisioning
cat > monitoring/grafana/provisioning/datasources/pyroscope.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Pyroscope
    type: pyroscope
    url: http://pyroscope:4040
    access: proxy
    isDefault: false
EOF

# 3. Restart Grafana
podman-compose -f docker-compose.prod.yml restart grafana

Health Monitoring

Pyroscope Health Check

#!/bin/bash
# Check Pyroscope health and alert if down

if ! curl -sf http://localhost:4040/ready > /dev/null 2>&1; then
    echo "ALERT: Pyroscope is not responding"
    # Add alerting integration here
    exit 1
fi

echo "OK: Pyroscope is healthy"

Profile Data Freshness

#!/bin/bash
# Check if profiles are being collected (data within last 5 minutes)

# Query Pyroscope API for recent data
SERVICES="nemotron-backend ai-yolo26 ai-florence ai-clip"

for service in $SERVICES; do
    # Check for data in last 5 minutes
    RESULT=$(curl -s "http://localhost:4040/pyroscope/render?query=${service}&from=now-5m&until=now&format=json" | jq '.flamebearer.numTicks // 0')

    if [ "$RESULT" -eq 0 ]; then
        echo "WARNING: No recent profiles for $service"
    else
        echo "OK: $service has recent profile data"
    fi
done

Performance Baselines

| Metric | Expected | Alert Threshold |
|--------|----------|-----------------|
| Pyroscope CPU usage | < 5% | > 10% |
| Pyroscope memory usage | < 500MB | > 1GB |
| Profile push latency | < 1s | > 5s |
| Profiler overhead per service | 1-3% | > 5% |
| Storage growth per day | ~350-700MB | > 1GB |
| Total storage (30-day) | ~10-20GB | > 30GB |

| Document | Purpose |
|----------|---------|
| Profiling Guide | User-facing profiling documentation |
| Monitoring Guide | Full observability stack |
| Pyroscope UI | Frontend dashboard documentation |
| AI Performance | AI service performance tuning |

Appendix: Configuration Files

Pyroscope Server Configuration (NEM-3928)

Location: monitoring/pyroscope/pyroscope-config.yml

# Pyroscope 1.18.0 configuration (NEM-3928)
# Comprehensive retention policy to prevent disk bloat

# Storage configuration
storage:
  backend: filesystem
  filesystem:
    dir: /data

# Server configuration
server:
  http_listen_port: 4040

# PyroscopeDB retention policy
# Disk-based retention to prevent storage exhaustion
pyroscopedb:
  retention_policy_min_free_disk_gb: 10 # Delete oldest when below 10GB free
  retention_policy_min_disk_available_percentage: 0.05 # Or below 5% free
  retention_policy_enforcement_interval: 5m # Check every 5 minutes
  max_block_duration: 3h # Max block size before compaction

# Compactor configuration
# Compacts blocks and enforces time-based retention
compactor:
  data_dir: /data/compactor
  compaction_interval: 2h # Compact every 2 hours
  cleanup_interval: 15m # Apply retention every 15 minutes
  deletion_delay: 2h # Safety buffer before deletion
  block_cleanup_enabled: true

# Limits configuration
# Time-based retention and ingestion limits
limits:
  compactor_blocks_retention_period: 720h # 30 days max retention
  ingestion_rate_mb: 4 # Max ingestion rate
  ingestion_burst_size_mb: 8 # Burst allowance

Retention Tuning Guide

| Scenario | Recommended Changes |
|----------|---------------------|
| Limited disk space (<50GB) | `compactor_blocks_retention_period: 168h` (7 days) |
| High-volume profiling | Increase `ingestion_rate_mb` and `ingestion_burst_size_mb` |
| Faster cleanup | Reduce `cleanup_interval` to 5m |
| More disk safety margin | Increase `retention_policy_min_free_disk_gb` to 20 |
| Development/testing | `compactor_blocks_retention_period: 24h` (1 day) |

AI Entrypoint Script

Location: scripts/ai-entrypoint.sh

#!/bin/bash
# Starts profiler in background if enabled, then runs main command
set -e

if [ "${PYROSCOPE_ENABLED:-true}" = "true" ] && [ -n "$SERVICE_NAME" ]; then
    nohup /usr/local/bin/pyroscope-profiler.sh "$SERVICE_NAME" \
        "${PYROSCOPE_URL:-http://pyroscope:4040}" \
        "${PROFILE_INTERVAL:-30}" >> /tmp/profiler.log 2>&1 &
fi

exec "$@"

Profiler Script

Location: scripts/pyroscope-profiler.sh

Captures CPU profiles using py-spy and pushes to Pyroscope in Speedscope format. Key parameters:

  • --nonblocking: Minimizes impact on profiled process
  • --duration: Profile capture duration (default 30s)
  • --format speedscope: Compatible with Pyroscope ingestion
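
Under those parameters, the capture loop looks roughly like this sketch. The `/ingest` endpoint path, the PID handling, and the `profile_loop` wrapper are assumptions for illustration; the shipped scripts/pyroscope-profiler.sh may differ:

```shell
# Sketch of the capture-and-push loop the parameters above imply.
# Endpoint path and PID 1 target are assumptions, not the real script.
profile_loop() {
  service="$1"
  url="$2"
  interval="${3:-30}"
  while true; do
    # Capture a CPU profile of PID 1 (the container's main process)
    py-spy record --pid 1 --nonblocking \
      --duration "$interval" --format speedscope \
      --output /tmp/profile.json
    # Push the capture to Pyroscope's ingestion API
    curl -s --data-binary @/tmp/profile.json \
      "${url}/ingest?name=${service}&format=speedscope"
  done
}

# profile_loop ai-yolo26 http://pyroscope:4040 30
```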