# GPU Runtime Quota Exhaustion Runbook

> Detection, containment, and recovery procedures for Vertex AI GPU quota exhaustion events

GPU quota exhaustion occurs when the number of GPUs in use or requested for an accelerator type in a GCP region reaches the project's quota limit. New Vertex AI Custom Jobs are then rejected with `RESOURCE_EXHAUSTED` until quota is freed or increased.

***

## 1. Detection

### Alerts

The following Cloud Monitoring alert policies should fire before quota is fully exhausted:

| Alert                              | Threshold                       | Channel      |
| ---------------------------------- | ------------------------------- | ------------ |
| `gpu-quota-utilization-high`       | > 80 % for 5 min                | PagerDuty P2 |
| `gpu-quota-exhausted`              | 100 % (job submissions failing) | PagerDuty P1 |
| `vertex-job-submission-error-rate` | > 10 % for 3 min                | PagerDuty P1 |

### Manual check

```bash theme={null}
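# View current GPU quota per region.
# Each quota entry prints limit, metric, usage in that order, so show one
# line of context on either side of the metric name.
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"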

# Check recent Vertex AI job submission failures (status code 8 = RESOURCE_EXHAUSTED)
gcloud logging read \
  'resource.type="aiplatform.googleapis.com/CustomJob" AND protoPayload.status.code=8' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.resourceName, protoPayload.status.message)"
```
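
To compare the raw quota output with the alert thresholds above, the limit and usage for a single metric can be extracted and turned into a percentage. The following is a minimal sketch, not part of the existing tooling: `quota_utilization` is a hypothetical helper, and it assumes each quota entry is printed as `limit`, then `metric`, then `usage`.

```bash
# Hypothetical helper: print "<usage> <limit> <percent>" for one quota metric.
# Reads `gcloud compute regions describe ... --format="yaml(quotas)"` on stdin.
# Assumption: each entry lists limit, then metric, then usage.
quota_utilization() {
  awk -v metric="$1" '
    $1 == "-" && $2 == "limit:" { limit = $3 + 0 }      # start of a quota entry
    $1 == "metric:"             { hit = ($2 == metric) } # exact metric name match
    $1 == "usage:" && hit {
      pct = (limit > 0) ? 100 * $2 / limit : 0
      printf "%g %g %.0f\n", $2, limit, pct
      hit = 0
    }
  '
}
```

Piping the `gcloud compute regions describe` output from the manual check into `quota_utilization NVIDIA_T4_GPUS` prints usage, limit, and percent on one line. A percent above 80 corresponds to the `gpu-quota-utilization-high` threshold.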

***

## 2. Immediate Containment

<Warning>
  Pause new job intake as soon as quota exhaustion is confirmed to prevent cascading DLQ growth.
</Warning>

### Option A — Environment variable pause (preferred, zero-downtime)

```bash theme={null}
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

The scheduler stops pulling from `gpu-job-request-sub` for as long as the flag is set. Unacknowledged messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during the pause.
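
Because the pause only takes effect once the pod template carries the variable, it can be worth reading the value back after the rollout. The following is a minimal sketch: `intake_paused` is a hypothetical helper name, and it assumes the scheduler container is the only one in the deployment that sets `PAUSE_JOB_INTAKE`.

```bash
# Hypothetical helper: succeed only if the deployment's pod template
# sets PAUSE_JOB_INTAKE to "true".
# Assumption: no other container in the deployment defines this variable.
intake_paused() {
  local value
  value="$(kubectl get deployment gpu-scheduler -n gpu-runtime \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="PAUSE_JOB_INTAKE")].value}')"
  [ "$value" = "true" ]
}
```

For example, `intake_paused && echo "intake paused"` prints a confirmation only when the flag is in place.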

### Option B — Suspend subscription delivery

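```bash theme={null}
# Seek the subscription to the current time. This marks every message
# published before that timestamp as acknowledged, so the existing backlog
# is no longer delivered. Messages published afterwards are still delivered.
gcloud pubsub subscriptions seek gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```

<Note>
  Option B acts on the subscription itself, so it affects all consumers, and it discards the current backlog rather than holding it. Restoring that backlog means seeking back to an earlier timestamp, which only works if acknowledged-message retention is enabled on the subscription or message retention is enabled on the topic. Use Option A if only the GPU scheduler should be paused.
</Note>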

***

## 3. Quota Increase Request

### Via GCP Console (fastest for emergency increases)

1. Go to **IAM & Admin → Quotas** in the [GCP Console](https://console.cloud.google.com).
2. Filter by **Service: Vertex AI API** and **Region: us-central1**.
3. Select the affected GPU quota metric (e.g. `NVIDIA_T4_GPUS`).
4. Click **Edit Quotas**, enter the new limit, and add a justification referencing this incident.
5. For P1 incidents, select **Urgent** and include the on-call contact.

### Via `gcloud` (for scripted requests)

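```bash theme={null}
# Set NEW_LIMIT to the requested limit first. The unit and dimension are the usual
# per-region form; confirm both with `gcloud alpha services quota list`.
gcloud alpha services quota update \
  --consumer=projects/impulse-gpu-runtime \
  --service=aiplatform.googleapis.com \
  --metric=aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus \
  --unit='1/{project}/{region}' \
  --dimensions=region=us-central1 \
  --value="${NEW_LIMIT:?set NEW_LIMIT to the requested limit}"
```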

### Expected approval times

| Request type        | Typical approval time |
| ------------------- | --------------------- |
| \< 2× current limit | 1–4 hours             |
| 2–5× current limit  | 4–24 hours            |
| > 5× current limit  | 1–3 business days     |
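
The table can be applied mechanically when deciding whether to start region failover while a request is pending. The following is a minimal sketch: `approval_window` is a hypothetical helper. The table does not say which band a request of exactly 2× or 5× falls into, so placing both in the middle band is an assumption.

```bash
# Hypothetical helper: map a requested increase to the typical approval
# window from the table above. Arguments: current limit, requested limit
# (positive integers). Exactly 2x and exactly 5x are treated as the
# 2-5x band (assumption).
approval_window() {
  local current="$1" requested="$2"
  if [ "$requested" -lt $((2 * current)) ]; then
    echo "1-4 hours"
  elif [ "$requested" -le $((5 * current)) ]; then
    echo "4-24 hours"
  else
    echo "1-3 business days"
  fi
}
```

For example, `approval_window 8 12` falls in the first band, while `approval_window 8 48` falls in the last. These are typical times, not guarantees.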

***

## 4. Alternative Region Failover

If quota cannot be increased quickly, route jobs to a secondary region:

```bash theme={null}
# Verify GPU quota availability in the backup region before switching
gcloud compute regions describe us-east4 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS"

# Update scheduler to target the backup region
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  VERTEX_REGION=us-east4

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

<Warning>
  Cross-region failover incurs higher data-transfer costs and slightly higher latency. Document the failover in the incident ticket and revert as soon as the primary region quota is restored.
</Warning>

***

## 5. Resuming Job Intake

Once quota is confirmed available (utilization \< 70 %):

```bash theme={null}
# Resume intake via environment variable
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

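# Monitor submission success rate for 5 minutes.
# `gcloud monitoring` has no `read` command, so query the Cloud Monitoring API directly.
watch -n30 'curl -sG \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode "filter=metric.type=\"custom.googleapis.com/gpu_scheduler/job_submission_success_rate\"" \
  --data-urlencode "interval.startTime=$(date -u -d "5 minutes ago" +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries"'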
```
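
The 70 % resume condition can be checked with integer arithmetic before intake is switched back on. The following is a minimal sketch: `below_threshold` is a hypothetical helper that takes the usage and limit figures read from the quota output.

```bash
# Hypothetical helper: succeed when utilization is strictly below a threshold.
# Arguments: usage, limit, optional threshold percent (default 70).
# Integer inputs only; a zero limit is treated as "not safe to resume".
below_threshold() {
  local usage="$1" limit="$2" threshold="${3:-70}"
  [ "$limit" -gt 0 ] && [ $((usage * 100)) -lt $((threshold * limit)) ]
}
```

For example, `below_threshold 5 8` succeeds (62.5 %), while `below_threshold 6 8` fails (75 %), so intake would stay paused in the second case.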

***

## 6. DLQ Replay After Recovery

Jobs that failed with `VERTEX_QUOTA_EXCEEDED` during the exhaustion window must be replayed:

```bash theme={null}
# Pull DLQ messages for inspection. Without --auto-ack they are not acknowledged
# and stay in the subscription for the replay procedure.
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=500 \
  --format=json > /tmp/dlq-quota-failures.json

# Count the messages whose payload reports a quota failure
jq '[.[] | select((.message.data | @base64d | fromjson | .error_code) == "VERTEX_QUOTA_EXCEEDED")] | length' \
  /tmp/dlq-quota-failures.json
```

Follow the full replay procedure in the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery).

***

## 7. Long-Term Remediation

After the incident is resolved, take the following preventive actions:

| Action                                                           | Owner                | Timeline       |
| ---------------------------------------------------------------- | -------------------- | -------------- |
| Review quota headroom (maintain ≥ 30 % buffer)                   | Platform Engineering | Within 1 week  |
| Implement per-customer GPU concurrency caps                      | Backend Engineering  | Within 2 weeks |
| Enable preemptible/Spot GPU jobs to reduce committed quota usage | Platform Engineering | Within 1 week  |
| Schedule quarterly quota reviews                                 | On-call SRE          | Recurring      |
| Lower `gpu-quota-utilization-high` threshold from 80 % to 70 %   | SRE                  | Immediate      |
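
The 30 % buffer target translates into a minimum quota limit for a given peak usage. The following is a minimal sketch: `min_limit_for_buffer` is a hypothetical helper, and it reads "buffer" as the share of the limit that stays unused at peak, which is an assumption about how the target is measured.

```bash
# Hypothetical helper: smallest integer limit that keeps at least
# <buffer> percent of the quota unused at the given peak usage.
# Arguments: peak GPU usage (integer), optional buffer percent (default 30).
min_limit_for_buffer() {
  local peak="$1" buffer="${2:-30}"
  local used=$((100 - buffer))
  # Ceiling of peak * 100 / used, using integer arithmetic.
  echo $(( (peak * 100 + used - 1) / used ))
}
```

For example, a peak of 14 GPUs needs a limit of at least 20 to keep 30 % free, so `min_limit_for_buffer 14` prints `20`.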

***

## Post-Incident Checklist

* [ ] Quota increased or failover region confirmed operational
* [ ] Job intake resumed and submission success rate > 95 %
* [ ] DLQ drained and quota-failure messages replayed
* [ ] No customer-visible jobs permanently lost
* [ ] Incident postmortem scheduled within 48 hours
* [ ] Quota buffer alert threshold updated if necessary
