# GPU Runtime Cost Anomaly Runbook

> Detection, investigation, and remediation procedures for unexpected GPU cost events

This runbook covers four cost anomaly scenarios: unexpected runtime spikes, budget exhaustion, runaway retries, and high-cost sessions. Use the Cloud Billing budget alerts as the primary entry point.

***

## 1. Detecting Cost Anomalies

### Budget alert channels

| Alert                 | Threshold                              | Action                              |
| --------------------- | -------------------------------------- | ----------------------------------- |
| `gpu-budget-50pct`    | 50 % of monthly budget consumed        | Review trend; no immediate action   |
| `gpu-budget-90pct`    | 90 % of monthly budget consumed        | Notify Engineering and Finance      |
| `gpu-budget-exceeded` | 100 % actual or 120 % forecast         | Activate containment (this runbook) |
| `gpu-daily-spike`     | Daily spend > 2× 7-day rolling average | Investigate within 1 hour           |

### Manual cost review

```bash
# View GPU cost breakdown for the current month (BigQuery billing export)
# --project_id is a global bq flag, so it goes before the query command
bq --project_id=impulse-gpu-runtime query --use_legacy_sql=false \
  "SELECT
     sku.description,
     SUM(cost) AS total_cost,
     SUM(usage.amount) AS total_usage,
     usage.unit
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
   WHERE DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
     AND project.id = 'impulse-gpu-runtime'
   GROUP BY 1, 4
   ORDER BY 2 DESC"
```

***

## 2. Unexpected Runtime Spikes

**Symptoms:** GPU accelerator-seconds billed in a given hour are significantly higher than baseline; Cloud Monitoring shows `vertex_ai/custom_jobs/running_jobs` elevated.
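
One way to make "significantly higher than baseline" concrete is to reuse the rule behind the `gpu-daily-spike` alert: flag spend that exceeds 2× the 7-day rolling average. The sketch below only illustrates that comparison and is not the alert's actual implementation. The function name `is_daily_spike` and the sample figures are hypothetical; in practice the daily totals would come from the billing export.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any existing tooling).
# Usage: is_daily_spike TODAY DAY1 [DAY2 ... DAY7]
# Returns 0 if TODAY is strictly greater than 2x the mean of the
# trailing days, 1 if it is not, and 2 if no trailing days are given.
is_daily_spike() {
  local today="$1"
  shift
  [ "$#" -gt 0 ] || return 2
  # awk does the floating-point arithmetic that plain bash lacks
  printf '%s\n' "$@" | awk -v today="$today" '
    { sum += $1; n++ }
    END { exit (today > 2 * sum / n) ? 0 : 1 }'
}

# Example with made-up figures: trailing mean is 100, so the threshold is 200
if is_daily_spike 250 100 110 90 105 95 100 100; then
  echo "spike detected"
fi
```

Because the comparison is strict, a day that lands exactly on twice the average is not flagged, which matches the `>` in the alert threshold.
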
### Investigate

```bash
# List all currently running Vertex AI custom jobs
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING" \
  --format="table(name,displayName,createTime)"

# Check for jobs that have been running unusually long
# (date -d requires GNU date)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="table(name,displayName,createTime)"
```

### Correlate with job submissions

```bash
# Count job submissions per hour for the last 24 hours (scheduler logs)
# --tail=-1 overrides the 10-line default that kubectl applies with -l;
# --timestamps prefixes each line with an RFC 3339 timestamp, and the
# first 13 characters of it (YYYY-MM-DDTHH) identify the hour
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --tail=-1 --timestamps | \
  grep "job_submitted" | \
  cut -c1-13 | sort | uniq -c
```

### Contain

If you identify an unexpected surge of long-running jobs:

```bash
# Pause new intake immediately
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

# Cancel jobs that exceed the expected maximum runtime (e.g. 2 hours)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="value(name)" | \
while read -r job; do
  gcloud ai custom-jobs cancel "$job" \
    --project=impulse-gpu-runtime \
    --region=us-central1
done
```

***

## 3. Budget Exhaustion

**Symptoms:** The `gpu-budget-exceeded` alert fires; a budget enforcer, if one is configured, may start restricting project spend; new Vertex AI jobs may be rejected.

GCP Budget alerts are informational by default and do **not** automatically stop resource usage. You must take containment steps manually.

### Immediate containment

1. **Pause job intake:**

   ```bash
   kubectl set env deployment/gpu-scheduler \
     -n gpu-runtime \
     PAUSE_JOB_INTAKE=true
   ```

2. **Cancel all non-critical running jobs** (coordinate with Product):

   ```bash
   # Cancels every running job; narrow the filter first if critical jobs must keep running
   gcloud ai custom-jobs list \
     --project=impulse-gpu-runtime \
     --region=us-central1 \
     --filter="state=JOB_STATE_RUNNING" \
     --format="value(name)" | \
   while read -r job; do
     gcloud ai custom-jobs cancel "$job" \
       --project=impulse-gpu-runtime \
       --region=us-central1
   done
   ```

3. **Notify Finance and Engineering** with current spend, forecast, and containment actions taken.

### Request emergency budget increase

1. Log in to the [GCP Console → Billing → Budgets & Alerts](https://console.cloud.google.com/billing/budgets).
2. Select the `impulse-gpu-runtime` budget.
3. Click **Edit** and increase the budget amount.
4. Notify the Finance team of the temporary increase and the expected overage.

### Resume job intake

Once Finance approves the temporary budget increase:

```bash
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false
```

***

## 4. Runaway Retries

**Symptoms:** A single `job_id` or small set of jobs appears repeatedly in Vertex AI job history; DLQ message count grows; billing for the same logical job is abnormally high.

### Identify runaway jobs

```bash
# Find job_ids with > 5 submission attempts in the last 24 hours
# (--tail=-1 overrides the 10-line default that kubectl applies with -l)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --tail=-1 | \
  grep "job_submitted" | \
  jq -r '.job_id' | sort | uniq -c | sort -rn | \
  awk '$1 > 5' | head -20
```

### Block a specific job from further retries

```bash
# Mark job as permanently failed to stop retry loop
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MAX_RETRIES_EXCEEDED", "retry_blocked": true}'
```

### Verify retry circuit-breaker configuration

The scheduler enforces a maximum of 5 total attempts per `job_id`.
Verify this setting:

```bash
kubectl get configmap gpu-scheduler-config -n gpu-runtime -o yaml | \
  grep -E "max_retries|retry"
```

If `max_retries` is set higher than 5 or is missing, update the ConfigMap and restart the scheduler:

```bash
kubectl patch configmap gpu-scheduler-config -n gpu-runtime \
  --type=merge \
  -p '{"data":{"max_retries":"5"}}'

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
```

***

## 5. High-Cost Sessions

**Symptoms:** A small number of user sessions account for a disproportionate share of GPU spend; per-session cost exceeds configured caps.

### Identify high-cost sessions

```bash
# Query BigQuery billing for top sessions this month
# (only jobs submitted with a session_id label appear; COUNT(*) counts
# billing export rows, not jobs)
bq --project_id=impulse-gpu-runtime query --use_legacy_sql=false \
  "SELECT
     label.value AS session_id,
     SUM(cost) AS total_cost,
     COUNT(*) AS billing_rows
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`,
     UNNEST(labels) AS label
   WHERE label.key = 'session_id'
     AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 20"
```

### Terminate and cap a high-cost session

```bash
# Get all running jobs for a session and cancel each one
curl -s "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/jobs?status=RUNNING" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.[].vertex_job_name' | \
while read -r vertex_job; do
  gcloud ai custom-jobs cancel "$vertex_job" \
    --project=impulse-gpu-runtime \
    --region=us-central1
done

# Set a per-session GPU spend cap (update billing enforcement config)
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/limits" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_gpu_cost_usd": 50.0, "action_on_limit": "terminate"}'
```

### Long-term prevention

| Action                                                                | Owner               | Timeline |
| --------------------------------------------------------------------- | ------------------- | -------- |
| Implement per-session GPU spend caps at the scheduler level           | Backend Engineering | 2 weeks  |
| Add `session_id` label to all Vertex AI custom job submissions        | Backend Engineering | 1 week   |
| Create per-customer spend anomaly detection alert in Cloud Monitoring | SRE                 | 1 week   |
| Review and enforce `max_runtime_seconds` per job tier                 | Product + Backend   | 1 week   |

***

## Post-Incident Checklist

* [ ] Root cause of cost anomaly identified and documented
* [ ] Cost impact quantified and reported to Finance
* [ ] Runaway jobs cancelled and retry loops blocked
* [ ] Budget restored to normal limits (or increase formally approved)
* [ ] Preventive controls (caps, alerts, circuit-breakers) reviewed and updated
* [ ] Incident postmortem scheduled within 48 hours for spend anomalies > \$500