# GPU Runtime Cost Anomaly Runbook

> Detection, investigation, and remediation procedures for unexpected GPU cost events

This runbook covers four cost anomaly scenarios: unexpected runtime spikes, budget exhaustion, runaway retries, and high-cost sessions. Use the Cloud Billing budget alerts as the primary entry point.

***

## 1. Detecting Cost Anomalies

### Budget alert channels

| Alert                 | Threshold                              | Action                              |
| --------------------- | -------------------------------------- | ----------------------------------- |
| `gpu-budget-50pct`    | 50 % of monthly budget consumed        | Review trend; no immediate action   |
| `gpu-budget-90pct`    | 90 % of monthly budget consumed        | Notify Engineering and Finance      |
| `gpu-budget-exceeded` | 100 % consumed, or forecast at 120 %   | Activate containment (this runbook) |
| `gpu-daily-spike`     | Daily spend > 2× 7-day rolling average | Investigate within 1 hour           |
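
The `gpu-daily-spike` threshold in the table can be re-checked by hand when the alert state is in doubt. The following is a minimal sketch, not the alerting policy itself: `is_daily_spike` is a hypothetical helper, and the spend figures passed to it are assumed to come from your own billing export query.

```bash
#!/usr/bin/env bash
# Sketch of the gpu-daily-spike rule: today's spend > 2x the rolling average
# of the preceding days (seven values expected, per the table above).
# Usage: is_daily_spike TODAY_SPEND DAY1 [DAY2 ... DAY7]
# Prints "spike" or "normal". All amounts are plain numbers in one currency.
is_daily_spike() {
  local today="$1"
  shift
  if [ "$#" -eq 0 ]; then
    echo "usage: is_daily_spike TODAY DAY1 [DAY2 ...]" >&2
    return 2
  fi
  # awk does the floating-point arithmetic that the shell lacks.
  printf '%s\n' "$@" | awk -v today="$today" '
    { sum += $1; n++ }
    END {
      avg = sum / n
      if ((today + 0) > 2 * avg) print "spike"; else print "normal"
    }'
}
```

For example, `is_daily_spike 250 100 100 100 100 100 100 100` prints `spike`, because 250 exceeds twice the average of 100. A value exactly equal to twice the average prints `normal`, matching the strict `>` in the table.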

### Manual cost review

```bash theme={null}
# View GPU cost breakdown for the current month (BigQuery billing export)
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     sku.description,
     SUM(cost) AS total_cost,
     SUM(usage.amount) AS total_usage,
     usage.unit
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`
   WHERE
     DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
     AND project.id = 'impulse-gpu-runtime'
   GROUP BY 1, 4
   ORDER BY 2 DESC"
```

***

## 2. Unexpected Runtime Spikes

**Symptoms:** GPU accelerator-seconds billed in a given hour are significantly higher than baseline; Cloud Monitoring shows `vertex_ai/custom_jobs/running_jobs` elevated.

### Investigate

```bash theme={null}
# List all currently running Vertex AI custom jobs
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING" \
  --format="table(name,displayName,createTime)"

# Check for jobs that have been running unusually long
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="table(name,displayName,createTime)"
```

### Correlate with job submissions

```bash theme={null}
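# Count job submissions per hour for the last 24 hours (scheduler logs).
# --timestamps prefixes each line with an RFC 3339 timestamp; --tail=-1 lifts
# the 10-line-per-pod default that applies when a label selector is used.
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --timestamps --tail=-1 | \
  grep "job_submitted" | \
  awk '{print substr($1, 1, 13)}' | sort | uniq -c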
```

### Contain

If you identify an unexpected surge of long-running jobs:

```bash theme={null}
# Pause new intake immediately
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

# Cancel jobs that exceed the expected maximum runtime (e.g. 2 hours)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '2 hours ago' -Iseconds)" \
  --format="value(name)" | \
  while read -r job; do
    gcloud ai custom-jobs cancel "$job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done
```
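
Before running the cancel loop, it can help to preview which jobs it would hit. The sketch below assumes GNU `date` (as the commands above already do) and tab-separated input such as `gcloud ai custom-jobs list ... --format="value(name,createTime)"` would produce; `stale_jobs` is a hypothetical helper name, not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Dry-run helper: print the names of jobs created before a cutoff time.
# Usage: stale_jobs CUTOFF_TIMESTAMP < "name<TAB>createTime" lines
# Requires GNU date for the -d option.
stale_jobs() {
  local cutoff_epoch name created created_epoch tab
  tab="$(printf '\t')"
  cutoff_epoch="$(date -u -d "$1" +%s)" || return 2
  while IFS="$tab" read -r name created; do
    [ -n "$name" ] || continue
    # Skip rows whose timestamp cannot be parsed instead of cancelling blindly.
    created_epoch="$(date -u -d "$created" +%s 2>/dev/null)" || continue
    if [ "$created_epoch" -lt "$cutoff_epoch" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Piping the listing through `stale_jobs "$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)"` gives a list that can be reviewed, and then fed to the cancel loop once it looks right.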

***

## 3. Budget Exhaustion

**Symptoms:** The `gpu-budget-exceeded` alert fires; GCP Budget enforcer may start restricting project spend; new Vertex AI jobs may be rejected.

<Warning>
  GCP Budget alerts are informational by default and do **not** automatically stop resource usage. You must manually take containment steps.
</Warning>

### Immediate containment

1. **Pause job intake:**
   ```bash theme={null}
   kubectl set env deployment/gpu-scheduler \
     -n gpu-runtime \
     PAUSE_JOB_INTAKE=true
   ```

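2. **Cancel all non-critical running jobs** (coordinate with Product). The command below cancels every running job, so narrow the `--filter` first if critical jobs must keep running: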
   ```bash theme={null}
   gcloud ai custom-jobs list \
     --project=impulse-gpu-runtime \
     --region=us-central1 \
     --filter="state=JOB_STATE_RUNNING" \
     --format="value(name)" | \
     while read -r job; do
       gcloud ai custom-jobs cancel "$job" \
         --project=impulse-gpu-runtime \
         --region=us-central1
     done
   ```

3. **Notify Finance and Engineering** with current spend, forecast, and containment actions taken.
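
The spend and forecast figures for the notification in step 3 can be approximated quickly. This is a rough sketch under a stated assumption: it uses a naive linear projection (spend to date, scaled to the full month), which is not the forecast that Cloud Billing computes. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Naive month-end projection: spend so far / days elapsed * days in month.
# Usage: forecast_month_end SPEND_TO_DATE DAY_OF_MONTH DAYS_IN_MONTH
forecast_month_end() {
  awk -v spend="$1" -v day="$2" -v days="$3" 'BEGIN {
    if ((day + 0) <= 0 || (days + 0) < (day + 0)) exit 2
    printf "%.2f\n", spend / day * days
  }'
}

# Share of the monthly budget consumed, as a percentage.
# Usage: budget_pct SPEND_TO_DATE MONTHLY_BUDGET
budget_pct() {
  awk -v spend="$1" -v budget="$2" 'BEGIN {
    if ((budget + 0) <= 0) exit 2
    printf "%.1f\n", spend / budget * 100
  }'
}
```

For example, `forecast_month_end 6000 12 30` prints `15000.00` and `budget_pct 6000 10000` prints `60.0`. Treat the projection as an upper-level estimate only; a spike early in the month skews it heavily.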

### Request emergency budget increase

1. Log in to the [GCP Console → Billing → Budgets & Alerts](https://console.cloud.google.com/billing/budgets).
2. Select the `impulse-gpu-runtime` budget.
3. Click **Edit** and increase the budget amount.
4. Notify the Finance team of the temporary increase and the expected overage.

### Resume job intake

Once Finance approves the temporary budget increase:

```bash theme={null}
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false
```

***

## 4. Runaway Retries

**Symptoms:** A single `job_id` or small set of jobs appears repeatedly in Vertex AI job history; the dead-letter queue (DLQ) message count grows; billing for the same logical job is abnormally high.

### Identify runaway jobs

```bash theme={null}
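# Find job_ids with > 5 submission attempts in the last 24 hours.
# --tail=-1 lifts the 10-line-per-pod default that applies with a label selector.
kubectl logs -n gpu-runtime -l app=gpu-scheduler --since=24h --tail=-1 | \
  grep "job_submitted" | \
  jq -r '.job_id' | sort | uniq -c | awk '$1 > 5' | sort -rn | head -20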
```

### Block a specific job from further retries

```bash theme={null}
# Mark job as permanently failed to stop retry loop
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MAX_RETRIES_EXCEEDED", "retry_blocked": true}'
```

### Verify retry circuit-breaker configuration

The scheduler enforces a maximum of 5 total attempts per `job_id`. Verify this setting:

```bash theme={null}
kubectl get configmap gpu-scheduler-config -n gpu-runtime -o yaml | \
  grep -E "max_retries|retry"
```

If `max_retries` is set higher than 5 or is missing, update the ConfigMap and restart the scheduler:

```bash theme={null}
kubectl patch configmap gpu-scheduler-config -n gpu-runtime \
  --type=merge \
  -p '{"data":{"max_retries":"5"}}'

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
```
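
The "higher than 5 or missing" condition can be scripted so the patch is applied only when needed. This is a minimal sketch: `needs_retry_patch` is a hypothetical helper, and treating a non-numeric value as needing a patch is an assumption added here, not something the scheduler documents.

```bash
#!/usr/bin/env bash
# Decide whether max_retries needs to be patched.
# Usage: needs_retry_patch VALUE [LIMIT]
# Exit status 0 means "patch needed": VALUE is empty, is not a plain
# non-negative integer (assumption), or is greater than LIMIT (default 5).
needs_retry_patch() {
  local value="$1" limit="${2:-5}"
  case "$value" in
    ''|*[!0-9]*) return 0 ;;  # missing or malformed value
  esac
  [ "$value" -gt "$limit" ]
}

# Example wiring (not executed here):
#   current="$(kubectl get configmap gpu-scheduler-config -n gpu-runtime \
#     -o jsonpath='{.data.max_retries}')"
#   if needs_retry_patch "$current"; then echo "patch required"; fi
```

Reading the single key with `-o jsonpath` avoids the false matches that a broad `grep` for `retry` can produce.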

***

## 5. High-Cost Sessions

**Symptoms:** A small number of user sessions account for a disproportionate share of GPU spend; per-session cost exceeds configured caps.

### Identify high-cost sessions

```bash theme={null}
# Query BigQuery billing for top sessions this month
bq query --use_legacy_sql=false \
  --project=impulse-gpu-runtime \
  "SELECT
     labels.value AS session_id,
     SUM(cost) AS total_cost,
     COUNT(*) AS billing_rows  -- billing export rows, not distinct jobs
   FROM \`billing_export.gcp_billing_export_v1_XXXXX\`,
     UNNEST(labels) AS labels
   WHERE
     labels.key = 'session_id'
     AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
     AND service.description = 'Vertex AI'
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 20"
```

### Terminate and cap a high-cost session

```bash theme={null}
# Cancel all running jobs for a session
curl -s "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/jobs?status=RUNNING" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.[].vertex_job_name' | \
  while read -r vertex_job; do
    gcloud ai custom-jobs cancel "$vertex_job" \
      --project=impulse-gpu-runtime \
      --region=us-central1
  done

# Set a per-session GPU spend cap (update billing enforcement config)
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/sessions/$SESSION_ID/limits" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_gpu_cost_usd": 50.0, "action_on_limit": "terminate"}'
```
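
To decide which sessions need a cap, the per-session totals can be compared against the limit in bulk. The sketch below assumes the BigQuery query from the previous subsection is run with `bq --format=csv`, so that the first column is the session ID and the second is the total cost; `sessions_over_cap` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# List sessions whose total cost exceeds a cap.
# Usage: sessions_over_cap CAP_USD < CSV with a header row and
#        "session_id,total_cost[,...]" data rows.
# Prints "session_id cost" for each row strictly above the cap.
sessions_over_cap() {
  awk -F',' -v cap="$1" '
    NR == 1 { next }  # skip the CSV header row
    ($2 + 0) > (cap + 0) { printf "%s %.2f\n", $1, $2 }'
}
```

With the 50 USD cap used above, a row of `s-1,120.50` is printed as `s-1 120.50`, while a session at exactly 50.00 is not listed. Each reported ID can then be passed as `SESSION_ID` to the cancel and limit commands.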

### Long-term prevention

| Action                                                                | Owner               | Timeline |
| --------------------------------------------------------------------- | ------------------- | -------- |
| Implement per-session GPU spend caps at the scheduler level           | Backend Engineering | 2 weeks  |
| Add `session_id` label to all Vertex AI custom job submissions        | Backend Engineering | 1 week   |
| Create per-customer spend anomaly detection alert in Cloud Monitoring | SRE                 | 1 week   |
| Review and enforce `max_runtime_seconds` per job tier                 | Product + Backend   | 1 week   |

***

## Post-Incident Checklist

* [ ] Root cause of cost anomaly identified and documented
* [ ] Cost impact quantified and reported to Finance
* [ ] Runaway jobs cancelled and retry loops blocked
* [ ] Budget restored to normal limits (or increase formally approved)
* [ ] Preventive controls (caps, alerts, circuit-breakers) reviewed and updated
* [ ] Incident postmortem scheduled within 48 hours for spend anomalies > \$500
