
This runbook covers four cost anomaly scenarios: unexpected runtime spikes, budget exhaustion, runaway retries, and high-cost sessions. Use the Cloud Billing budget alerts as the primary entry point.

1. Detecting Cost Anomalies

Budget alert channels

Alert                 Threshold                                  Action
gpu-budget-50pct      50 % of monthly budget consumed            Review trend; no immediate action
gpu-budget-90pct      90 % of monthly budget consumed            Notify Engineering and Finance
gpu-budget-exceeded   100 % consumed or 120 % forecast           Activate containment (this runbook)
gpu-daily-spike       Daily spend > 2× 7-day rolling average     Investigate within 1 hour
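
For reference, the three budget thresholds above can be provisioned with gcloud billing budgets create; a minimal sketch, with the billing account ID and budget amount as placeholders (gpu-daily-spike compares against a rolling average and would live in Cloud Monitoring instead):

# Sketch: create the monthly GPU budget with the threshold alerts above
# (BILLING_ACCOUNT_ID and the amount are placeholders)
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="gpu-runtime-monthly" \
  --budget-amount=10000USD \
  --filter-projects=projects/impulse-gpu-runtime \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --threshold-rule=percent=1.2,basis=forecasted-spend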

Manual cost review

# View GPU cost breakdown for the current month (BigQuery billing export)
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     sku.description,
     SUM(cost) AS total_cost,
     SUM(usage.amount) AS total_usage,
     usage.unit
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
   WHERE
     DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
     AND project.id = 'impulse-gpu-runtime'
   GROUP BY 1, 4
   ORDER BY 2 DESC"
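
The gpu-daily-spike comparison can also be reproduced by hand; a sketch against the same billing export (the table suffix remains a placeholder, as above):

# Compare yesterday's Vertex AI spend to the prior 7-day daily average
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "WITH daily AS (
     SELECT DATE(usage_start_time) AS day, SUM(cost) AS cost
     FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
     WHERE service.description = 'Vertex AI'
       AND project.id = 'impulse-gpu-runtime'
       AND DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)
     GROUP BY 1
   )
   SELECT
     (SELECT cost FROM daily
      WHERE day = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS yesterday_cost,
     (SELECT AVG(cost) FROM daily
      WHERE day < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS avg_7d_cost"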

2. Unexpected Runtime Spikes

Symptoms: GPU accelerator-seconds billed in a given hour are significantly above baseline; Cloud Monitoring shows vertex_ai/custom_jobs/running_jobs elevated.

Investigate

# List all currently running Vertex AI custom jobs
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING" \
  --format="table(name,displayName,createTime)"

# Check for jobs that have been running unusually long
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="table(name,displayName,createTime)"

Correlate with job submissions

# Count job submissions per hour for the last 24 hours (scheduler logs)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --timestamps | \
  grep "job_submitted" | \
  awk '{print $1}' | cut -d: -f1 | sort | uniq -c

Contain

If you identify an unexpected surge of long-running jobs:
# Pause new intake immediately
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true
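
Before moving on, confirm the scheduler actually picked up the flag; one way, assuming PAUSE_JOB_INTAKE is a container env var as set above:

# Confirm PAUSE_JOB_INTAKE is set on the scheduler container (expect "true")
kubectl get deployment gpu-scheduler -n gpu-runtime \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="PAUSE_JOB_INTAKE")].value}'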

# Cancel jobs that exceed the expected maximum runtime (e.g. 2 hours)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="value(name)" | \
  while read -r job; do
    gcloud ai custom-jobs cancel "$job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done

3. Budget Exhaustion

Symptoms: The gpu-budget-exceeded alert fires; any configured programmatic budget enforcement may start restricting project spend; new Vertex AI jobs may be rejected.
GCP budget alerts are informational by default and do not automatically stop resource usage, so you must take the containment steps below manually.

Immediate containment

  1. Pause job intake:
    kubectl set env deployment/gpu-scheduler \
      -n gpu-runtime \
      PAUSE_JOB_INTAKE=true
    
  2. Cancel all non-critical running jobs (coordinate with Product):
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="state=JOB_STATE_RUNNING" \
      --format="value(name)" | \
      while read -r job; do
        gcloud ai custom-jobs cancel "$job" \
          --project=impulse-gpu-runtime \
          --region=us-central1
      done
    
  3. Notify Finance and Engineering with current spend, forecast, and containment actions taken (see the webhook sketch after this list).
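
A minimal notification sketch, assuming the teams share a Slack incoming webhook (the webhook URL below is a placeholder, not part of this runbook's infrastructure):

# Post a containment summary to the shared incident channel (URL is a placeholder)
curl -s -X POST "https://hooks.slack.com/services/T000/B000/XXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "gpu-budget-exceeded: job intake paused, non-critical running jobs cancelled. Current spend and forecast to follow."}'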

Request emergency budget increase

  1. Log into the GCP Console → Billing → Budgets & Alerts.
  2. Select the impulse-gpu-runtime budget.
  3. Click Edit and increase the budget amount.
  4. Notify the Finance team of the temporary increase and the expected overage.
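
The same increase can be applied from the CLI; a sketch, with the billing account and budget IDs (and the new amount) as placeholders:

# List budgets to find the budget ID, then raise the amount
gcloud billing budgets list --billing-account=BILLING_ACCOUNT_ID
gcloud billing budgets update BUDGET_ID \
  --billing-account=BILLING_ACCOUNT_ID \
  --budget-amount=15000USD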

Resume job intake

Once Finance approves the temporary budget increase:
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false
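
kubectl set env triggers a new rollout, so you can watch it complete before announcing that intake is open again:

# Wait for the scheduler rollout to finish after the env change
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime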

4. Runaway Retries

Symptoms: A single job_id or small set of jobs appears repeatedly in Vertex AI job history; DLQ message count grows; billing for the same logical job is abnormally high.

Identify runaway jobs

# Find job_ids with > 5 submission attempts in the last 24 hours
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h | \
  grep "job_submitted" | \
  jq -r '.job_id' | sort | uniq -c | sort -rn | awk '$1 > 5' | head -20

Block a specific job from further retries

# Mark job as permanently failed to stop retry loop
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MAX_RETRIES_EXCEEDED", "retry_blocked": true}'

Verify retry circuit-breaker configuration

The scheduler enforces a maximum of 5 total attempts per job_id. Verify this setting:
kubectl get configmap gpu-scheduler-config -n gpu-runtime -o yaml | \
  grep -E "max_retries|retry"
If max_retries is set higher than 5 or is missing, update the ConfigMap and restart the scheduler:
kubectl patch configmap gpu-scheduler-config -n gpu-runtime \
  --type=merge \
  -p '{"data":{"max_retries":"5"}}'

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
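
Then confirm the patched value is live:

# Confirm the ConfigMap now carries max_retries=5
kubectl get configmap gpu-scheduler-config -n gpu-runtime \
  -o jsonpath='{.data.max_retries}'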

5. High-Cost Sessions

Symptoms: A small number of user sessions account for a disproportionate share of GPU spend; per-session cost exceeds configured caps.

Identify high-cost sessions

# Query BigQuery billing for top sessions this month
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     labels.value AS session_id,
     SUM(cost) AS total_cost,
     COUNT(*) AS job_count
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`,
     UNNEST(labels) AS labels
   WHERE
     labels.key = 'session_id'
     AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 20"

Terminate and cap a high-cost session

# Get all running jobs for a session
curl -s "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/jobs?status=RUNNING" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.[].vertex_job_name' | \
  while read -r vertex_job; do
    gcloud ai custom-jobs cancel "$vertex_job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done

# Set a per-session GPU spend cap (update billing enforcement config)
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/limits" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_gpu_cost_usd": 50.0, "action_on_limit": "terminate"}'

Long-term prevention

Action                                                                  Owner                Timeline
Implement per-session GPU spend caps at the scheduler level             Backend Engineering  2 weeks
Add session_id label to all Vertex AI custom job submissions            Backend Engineering  1 week
Create per-customer spend anomaly detection alert in Cloud Monitoring   SRE                  1 week
Review and enforce max_runtime_seconds per job tier                     Product + Backend    1 week

Post-Incident Checklist

  • Root cause of cost anomaly identified and documented
  • Cost impact quantified and reported to Finance
  • Runaway jobs cancelled and retry loops blocked
  • Budget restored to normal limits (or increase formally approved)
  • Preventive controls (caps, alerts, circuit-breakers) reviewed and updated
  • Incident postmortem scheduled within 48 hours for spend anomalies > $500