GPU Runtime Quota Exhaustion Runbook

GPU quota exhaustion occurs when the number of GPUs requested in a GCP region reaches the project's quota limit for that accelerator type. New Vertex AI Custom Jobs are then rejected with RESOURCE_EXHAUSTED until quota is freed or increased.

1. Detection

Alerts

The following Cloud Monitoring alert policies should fire before quota is fully exhausted:

Alert	Threshold	Channel
`gpu-quota-utilization-high`	> 80 % for 5 min	PagerDuty P2
`gpu-quota-exhausted`	100 % (job submissions failing)	PagerDuty P1
`vertex-job-submission-error-rate`	> 10 % for 3 min	PagerDuty P1

Manual check

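# View current GPU quota per region.
# Each YAML entry lists limit, metric, usage in that order, so grep needs
# one line of context on each side of the metric name.
# Note: these are Compute Engine quotas; Vertex AI custom training quotas
# are tracked separately under the Vertex AI API service.
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"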

# Check recent Vertex AI job submission failures
gcloud logging read \
  'resource.type="aiplatform.googleapis.com/CustomJob" AND protoPayload.status.code=8' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.resourceName, protoPayload.status.message)"
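
The quota listing shows raw limit and usage values, while the alerts are keyed on a percentage. A minimal sketch of that conversion follows, assuming the YAML layout that `gcloud` prints for `quotas` (one `limit`, `metric`, `usage` triple per entry). The helper name `quota_utilization` is hypothetical and not part of `gcloud`.

```shell
# quota_utilization METRIC
# Reads `gcloud compute regions describe ... --format="yaml(quotas)"` output
# on stdin and prints "<metric> <usage>/<limit> <percent>% <level>".
# Levels mirror the alert table: HIGH above 80 %, EXHAUSTED at 100 %.
quota_utilization() {
  awk -v want="$1" '
    /limit:/  { limit = $NF }      # limit precedes metric within an entry
    /metric:/ { metric = $NF }
    /usage:/  {
      if (metric == want && limit > 0) {
        pct = 100 * $NF / limit
        level = "OK"
        if (pct > 80)   level = "HIGH"
        if (pct >= 100) level = "EXHAUSTED"
        printf "%s %d/%d %.0f%% %s\n", metric, $NF, limit, pct, level
      }
    }
  '
}
```

To use it, pipe the `gcloud compute regions describe` command above into `quota_utilization NVIDIA_T4_GPUS` instead of `grep`.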

2. Immediate Containment

Pause new job intake as soon as quota exhaustion is confirmed to prevent cascading DLQ growth.

Option A — Environment variable pause (preferred, zero-downtime)

kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

The scheduler will stop pulling from gpu-job-request-sub and wait. Messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during this pause.
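
Before relying on the pause, it can help to confirm that the variable is actually set on the Deployment. The sketch below interprets the output of `kubectl set env ... --list`. The helper name `intake_state` is hypothetical, and it assumes the scheduler treats only the literal value `true` as paused.

```shell
# intake_state
# Reads `kubectl set env deployment/gpu-scheduler -n gpu-runtime --list`
# output on stdin and prints "paused" or "running".
# Assumption: only PAUSE_JOB_INTAKE=true pauses intake; unset means running.
intake_state() {
  awk -F= '
    $1 == "PAUSE_JOB_INTAKE" { value = $2 }   # keep the last assignment seen
    END { print (value == "true" ? "paused" : "running") }
  '
}
```

Usage: `kubectl set env deployment/gpu-scheduler -n gpu-runtime --list | intake_state`.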

Option B — Suspend subscription delivery

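# CAUTION: seek does not pause delivery. Seeking to the current time marks
# every message published before that time as acknowledged, so the existing
# backlog is dropped; messages published afterwards are still delivered.
gcloud pubsub subscriptions seek gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Option B affects all consumers of the subscription and discards pending job requests. Dropped messages can only be recovered by seeking back to an earlier timestamp, which requires message retention to be configured on the subscription or topic. Use Option A whenever jobs must be held rather than discarded, or when only the GPU scheduler should be paused.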

3. Quota Increase Request

Via GCP Console (fastest for emergency increases)

Go to IAM & Admin → Quotas in the GCP Console.
Filter by Service: Vertex AI API and Region: us-central1.
Select the affected GPU quota metric (e.g. NVIDIA_T4_GPUS).
Click Edit Quotas, enter the new limit, and add a justification referencing this incident.
For P1 incidents, select Urgent and include the on-call contact.

Via `gcloud` (for scripted requests)

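# --unit must match the quota unit reported for this metric by
# `gcloud alpha services quota list`.
gcloud alpha services quota update \
  --consumer=projects/impulse-gpu-runtime \
  --service=aiplatform.googleapis.com \
  --metric=aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus \
  --unit=<QUOTA_UNIT> \
  --value=<NEW_LIMIT>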

Expected approval times

Request type	Typical approval time
< 2× current limit	1–4 hours
2–5× current limit	4–24 hours
> 5× current limit	1–3 business days

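The approval bands above depend only on the ratio between the requested and current limits, so the expected wait can be estimated before filing the request. This is a small sketch of that lookup; the helper name `approval_band` is hypothetical, and the boundaries (exactly 2× and exactly 5× fall in the middle band) are one reading of the table.

```shell
# approval_band CURRENT_LIMIT NEW_LIMIT
# Prints the typical approval time for a quota increase request.
approval_band() {
  current=$1
  requested=$2
  if [ "$current" -le 0 ] || [ "$requested" -le "$current" ]; then
    echo "requested limit must exceed a positive current limit" >&2
    return 1
  fi
  if [ "$requested" -lt $(( 2 * current )) ]; then
    echo "1-4 hours"
  elif [ "$requested" -le $(( 5 * current )) ]; then
    echo "4-24 hours"
  else
    echo "1-3 business days"
  fi
}
```

For example, `approval_band 16 24` falls in the first band, which helps decide early whether regional failover is needed while the request is pending.
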
4. Alternative Region Failover

If quota cannot be increased quickly, route jobs to a secondary region:

# Verify GPU quota availability in the backup region first
gcloud compute regions describe us-east4 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS"

# Then point the scheduler at the backup region
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  VERTEX_REGION=us-east4

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

Cross-region failover incurs higher data-transfer costs and slightly higher latency. Document the failover in the incident ticket and revert as soon as the primary region quota is restored.
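
Failover only helps if the backup region has enough free quota for the jobs being moved. The following sketch turns the limit and usage values read from the backup region's quota listing into a go/no-go check. The helper name `region_has_headroom` and the idea of passing the number of GPUs needed as an argument are assumptions for illustration.

```shell
# region_has_headroom LIMIT USAGE NEEDED
# Prints the free GPU count and succeeds only if NEEDED GPUs fit.
# LIMIT and USAGE may carry the ".0" suffix that gcloud prints.
region_has_headroom() {
  limit=${1%.*}      # strip any fractional part, e.g. 16.0 -> 16
  usage=${2%.*}
  needed=$3
  free=$(( limit - usage ))
  echo "free=$free needed=$needed"
  [ "$free" -ge "$needed" ]
}
```

For example, `region_has_headroom 16.0 4.0 8 && echo "safe to fail over"` proceeds only when at least 8 GPUs are free.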

5. Resuming Job Intake

Once quota is confirmed available (utilization < 70 %):

# Resume intake via environment variable
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

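# Monitor submission success rate: query the last 5 minutes of the metric
# through the Cloud Monitoring API and re-run every 30 s. Uses GNU date.
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
  --data-urlencode 'filter=metric.type="custom.googleapis.com/gpu_scheduler/job_submission_success_rate"' \
  --data-urlencode "interval.startTime=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"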

6. DLQ Replay After Recovery

Jobs that failed with VERTEX_QUOTA_EXCEEDED during the exhaustion window must be replayed:

# Identify quota-failure DLQ messages
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=500 \
  --format=json > /tmp/dlq-quota-failures.json

# Count messages whose payload carries the quota error code
jq '[.[] | select((.message.data | @base64d | fromjson | .error_code) == "VERTEX_QUOTA_EXCEEDED")] | length' \
  /tmp/dlq-quota-failures.json

Follow the full replay procedure in the DLQ Recovery Runbook.
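
Before replaying, it can be useful to see what else is sitting in the DLQ besides quota failures. The sketch below groups the pulled messages by `error_code`. It assumes, as the filter above does, that each payload is base64-encoded JSON with an `error_code` field; the helper name `dlq_error_summary` and the `UNKNOWN` fallback label are hypothetical. It needs jq 1.6 or later for `@base64d`.

```shell
# dlq_error_summary FILE
# Prints one "<count> <error_code>" line per distinct error_code found in a
# file produced by `gcloud pubsub subscriptions pull ... --format=json`.
dlq_error_summary() {
  jq -r '
    map(.message.data | @base64d | fromjson | .error_code // "UNKNOWN")
    | group_by(.)                 # groups are sorted by error_code
    | map("\(length) \(.[0])")
    | .[]
  ' "$1"
}
```

Usage: `dlq_error_summary /tmp/dlq-quota-failures.json`.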

7. Long-Term Remediation

After the incident is resolved, take the following preventive actions:

Action	Owner	Timeline
Review quota headroom (maintain ≥ 30 % buffer)	Platform Engineering	Within 1 week
Implement per-customer GPU concurrency caps	Backend Engineering	Within 2 weeks
Enable preemptible/Spot GPU jobs to reduce on-demand GPU quota usage	Platform Engineering	Within 1 week
Schedule quarterly quota reviews	On-call SRE	Recurring
Lower the `gpu-quota-utilization-high` alert threshold from 80 % to 70 %	SRE	Immediate

Post-Incident Checklist

Quota increased or failover region confirmed operational
Job intake resumed and submission success rate > 95 %
DLQ drained and quota-failure messages replayed
No customer-visible jobs permanently lost
Incident postmortem scheduled within 48 hours
Quota buffer alert threshold updated if necessary
