
This runbook covers four cost anomaly scenarios: unexpected runtime spikes, budget exhaustion, runaway retries, and high-cost sessions. Use the Cloud Billing budget alerts as the primary entry point.

1. Detecting Cost Anomalies

Budget alert channels

Alert                 Threshold                                  Action
gpu-budget-50pct      50 % of monthly budget consumed            Review trend; no immediate action
gpu-budget-90pct      90 % of monthly budget consumed            Notify Engineering and Finance
gpu-budget-exceeded   100 % consumed or 120 % forecast           Activate containment (this runbook)
gpu-daily-spike       Daily spend > 2× 7-day rolling average     Investigate within 1 hour
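
For reference, the three budget thresholds above can be provisioned with gcloud billing budgets create; a minimal sketch, with the billing account ID and budget amount as placeholders (gpu-daily-spike compares against a rolling average and would live in Cloud Monitoring instead):

# Sketch: create the monthly GPU budget with the threshold alerts above
# (BILLING_ACCOUNT_ID and the amount are placeholders)
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="gpu-runtime-monthly" \
  --budget-amount=10000USD \
  --filter-projects=projects/impulse-gpu-runtime \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --threshold-rule=percent=1.2,basis=forecasted-spend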

Manual cost review

# View GPU cost breakdown for the current month (BigQuery billing export)
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     sku.description,
     SUM(cost) AS total_cost,
     SUM(usage.amount) AS total_usage,
     usage.unit
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
   WHERE
     DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
     AND project.id = 'impulse-gpu-runtime'
   GROUP BY 1, 4
   ORDER BY 2 DESC"
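
The gpu-daily-spike comparison can also be reproduced by hand; a sketch against the same billing export (the table suffix remains a placeholder, as above):

# Compare yesterday's Vertex AI spend to the prior 7-day daily average
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "WITH daily AS (
     SELECT DATE(usage_start_time) AS day, SUM(cost) AS cost
     FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
     WHERE service.description = 'Vertex AI'
       AND project.id = 'impulse-gpu-runtime'
       AND DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)
     GROUP BY 1
   )
   SELECT
     (SELECT cost FROM daily
      WHERE day = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS yesterday_cost,
     (SELECT AVG(cost) FROM daily
      WHERE day < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS avg_7d_cost"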

2. Unexpected Runtime Spikes

Symptoms: GPU accelerator-seconds billed in a given hour are significantly above baseline; Cloud Monitoring shows vertex_ai/custom_jobs/running_jobs elevated.

Investigate

# List all currently running Vertex AI custom jobs
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING" \
  --format="table(name,displayName,createTime)"

# Check for jobs that have been running unusually long
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="table(name,displayName,createTime)"

Correlate with job submissions

# Count job submissions per hour for the last 24 hours (scheduler logs)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --timestamps | \
  grep "job_submitted" | \
  awk '{print $1}' | cut -d: -f1 | sort | uniq -c

Contain

If you identify an unexpected surge of long-running jobs:
# Pause new intake immediately
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true
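
Before moving on, confirm the scheduler actually picked up the flag; one way, assuming PAUSE_JOB_INTAKE is a container env var as set above:

# Confirm PAUSE_JOB_INTAKE is set on the scheduler container (expect "true")
kubectl get deployment gpu-scheduler -n gpu-runtime \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="PAUSE_JOB_INTAKE")].value}'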

# Cancel jobs that exceed the expected maximum runtime (e.g. 2 hours)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="value(name)" | \
  while read -r job; do
    gcloud ai custom-jobs cancel "$job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done

3. Budget Exhaustion

Symptoms: The gpu-budget-exceeded alert fires; any configured programmatic budget enforcement may start restricting project spend; new Vertex AI jobs may be rejected.
GCP budget alerts are informational by default and do not automatically stop resource usage, so you must take the containment steps below manually.

Immediate containment

  1. Pause job intake:
    kubectl set env deployment/gpu-scheduler \
      -n gpu-runtime \
      PAUSE_JOB_INTAKE=true
    
  2. Cancel all non-critical running jobs (coordinate with Product):
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="state=JOB_STATE_RUNNING" \
      --format="value(name)" | \
      while read -r job; do
        gcloud ai custom-jobs cancel "$job" \
          --project=impulse-gpu-runtime \
          --region=us-central1
      done
    
  3. Notify Finance and Engineering with current spend, forecast, and containment actions taken (see the webhook sketch after this list).
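
A minimal notification sketch, assuming the teams share a Slack incoming webhook (the webhook URL below is a placeholder, not part of this runbook's infrastructure):

# Post a containment summary to the shared incident channel (URL is a placeholder)
curl -s -X POST "https://hooks.slack.com/services/T000/B000/XXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "gpu-budget-exceeded: job intake paused, non-critical running jobs cancelled. Current spend and forecast to follow."}'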

Request emergency budget increase

  1. Log into the GCP Console → Billing → Budgets & Alerts.
  2. Select the impulse-gpu-runtime budget.
  3. Click Edit and increase the budget amount.
  4. Notify the Finance team of the temporary increase and the expected overage.
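
The same increase can be applied from the CLI; a sketch, with the billing account and budget IDs (and the new amount) as placeholders:

# List budgets to find the budget ID, then raise the amount
gcloud billing budgets list --billing-account=BILLING_ACCOUNT_ID
gcloud billing budgets update BUDGET_ID \
  --billing-account=BILLING_ACCOUNT_ID \
  --budget-amount=15000USD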

Resume job intake

Once Finance approves the temporary budget increase:
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false
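
kubectl set env triggers a new rollout, so you can watch it complete before announcing that intake is open again:

# Wait for the scheduler rollout to finish after the env change
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime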

4. Runaway Retries

Symptoms: A single job_id or small set of jobs appears repeatedly in Vertex AI job history; DLQ message count grows; billing for the same logical job is abnormally high.

Identify runaway jobs

# Find job_ids with > 5 submission attempts in the last 24 hours
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h | \
  grep "job_submitted" | \
  jq -r '.job_id' | sort | uniq -c | sort -rn | awk '$1 > 5' | head -20

Block a specific job from further retries

# Mark job as permanently failed to stop retry loop
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MAX_RETRIES_EXCEEDED", "retry_blocked": true}'

Verify retry circuit-breaker configuration

The scheduler enforces a maximum of 5 total attempts per job_id. Verify this setting:
kubectl get configmap gpu-scheduler-config -n gpu-runtime -o yaml | \
  grep -E "max_retries|retry"
If max_retries is set higher than 5 or is missing, update the ConfigMap and restart the scheduler:
kubectl patch configmap gpu-scheduler-config -n gpu-runtime \
  --type=merge \
  -p '{"data":{"max_retries":"5"}}'

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
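
Then confirm the patched value is live:

# Confirm the ConfigMap now carries max_retries=5
kubectl get configmap gpu-scheduler-config -n gpu-runtime \
  -o jsonpath='{.data.max_retries}'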

5. High-Cost Sessions

Symptoms: A small number of user sessions account for a disproportionate share of GPU spend; per-session cost exceeds configured caps.

Identify high-cost sessions

# Query BigQuery billing for top sessions this month
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     labels.value AS session_id,
     SUM(cost) AS total_cost,
     COUNT(*) AS job_count
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`,
     UNNEST(labels) AS labels
   WHERE
     labels.key = 'session_id'
     AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 20"

Terminate and cap a high-cost session

# Get all running jobs for a session
curl -s "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/jobs?status=RUNNING" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.[].vertex_job_name' | \
  while read -r vertex_job; do
    gcloud ai custom-jobs cancel "$vertex_job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done

# Set a per-session GPU spend cap (update billing enforcement config)
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/limits" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_gpu_cost_usd": 50.0, "action_on_limit": "terminate"}'

Long-term prevention

Action                                                                  Owner                Timeline
Implement per-session GPU spend caps at the scheduler level             Backend Engineering  2 weeks
Add session_id label to all Vertex AI custom job submissions            Backend Engineering  1 week
Create per-customer spend anomaly detection alert in Cloud Monitoring   SRE                  1 week
Review and enforce max_runtime_seconds per job tier                     Product + Backend    1 week

Post-Incident Checklist

  • Root cause of cost anomaly identified and documented
  • Cost impact quantified and reported to Finance
  • Runaway jobs cancelled and retry loops blocked
  • Budget restored to normal limits (or increase formally approved)
  • Preventive controls (caps, alerts, circuit-breakers) reviewed and updated
  • Incident postmortem scheduled within 48 hours for spend anomalies > $500