GPU quota exhaustion occurs when the number of GPUs requested in a GCP region reaches the project's quota limit. New Vertex AI Custom Jobs are then rejected with RESOURCE_EXHAUSTED until quota is freed or increased.
1. Detection
Alerts
The following Cloud Monitoring alert policies should fire before quota is fully exhausted:
| Alert | Threshold | Channel |
|---|---|---|
| gpu-quota-utilization-high | > 80 % for 5 min | PagerDuty P2 |
| gpu-quota-exhausted | 100 % (job submissions failing) | PagerDuty P1 |
| vertex-job-submission-error-rate | > 10 % for 3 min | PagerDuty P1 |
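The quota thresholds in the table can be expressed as a small shell helper for ad hoc checks when the alerting pipeline itself is in doubt. This is a minimal sketch, not the alert policy: the function name `quota_alert_level` is hypothetical, it takes usage and limit as whole numbers, and it ignores the 5-minute duration condition.

```shell
# Hypothetical helper: map GPU quota usage to the severity listed in the alert table.
# Arguments: $1 = GPUs in use, $2 = quota limit (both whole numbers).
quota_alert_level() {
  usage=$1
  limit=$2
  if [ "$limit" -le 0 ]; then
    echo "invalid limit" >&2
    return 1
  fi
  if [ "$usage" -ge "$limit" ]; then
    echo "P1"   # gpu-quota-exhausted: 100 %
  elif [ $(( usage * 100 )) -gt $(( limit * 80 )) ]; then
    echo "P2"   # gpu-quota-utilization-high: strictly above 80 %
  else
    echo "OK"
  fi
}
```

For example, `quota_alert_level 9 10` prints `P2`, while `quota_alert_level 8 10` prints `OK` because exactly 80 % does not exceed the threshold. Comparing `usage * 100` against `limit * 80` avoids the rounding error of an integer percentage.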
Manual check
# View current GPU quota per region
# (in the YAML output, limit precedes metric and usage follows it, hence -B1 -A1)
gcloud compute regions describe us-central1 \
--project=impulse-gpu-runtime \
--format="yaml(quotas)" | \
grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"
# Check recent Vertex AI job submission failures
gcloud logging read \
'resource.type="aiplatform.googleapis.com/CustomJob" AND protoPayload.status.code=8' \
--project=impulse-gpu-runtime \
--limit=20 \
--format="table(timestamp, protoPayload.resourceName, protoPayload.status.message)"
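To turn the raw quota listing into utilization figures, the YAML can be piped through a short awk filter. The sketch below assumes each quota entry is printed as three lines in the order `limit`, `metric`, `usage`; the function name `gpu_quota_utilization` is hypothetical.

```shell
# Hypothetical helper: read `yaml(quotas)` output on stdin and print
# "<metric> <usage>/<limit> <percent>%" for each NVIDIA_*_GPUS quota.
# Assumes each entry lists limit, then metric, then usage.
gpu_quota_utilization() {
  awk '
    $1 == "-" && $2 == "limit:" { limit = $3 }
    $1 == "metric:"             { metric = $2 }
    $1 == "usage:" && metric ~ /^NVIDIA_.*_GPUS$/ && limit + 0 > 0 {
      printf "%s %d/%d %.0f%%\n", metric, $2, limit, 100 * $2 / limit
    }
  '
}
```

It would be used as `gcloud compute regions describe us-central1 --project=impulse-gpu-runtime --format="yaml(quotas)" | gpu_quota_utilization`. Quotas with a limit of zero are skipped to avoid division by zero.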
2. Pausing Job Intake
Pause new job intake as soon as quota exhaustion is confirmed to prevent cascading DLQ growth.
Option A — Environment variable pause (preferred, zero-downtime)
kubectl set env deployment/gpu-scheduler \
-n gpu-runtime \
PAUSE_JOB_INTAKE=true
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
The scheduler will stop pulling from gpu-job-request-sub and wait. Messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during this pause.
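Because paused messages are only safe inside the 7-day retention window, it can help to know how much pause time is left. The following sketch does that arithmetic from the age of the oldest unacknowledged message, which Cloud Monitoring exposes as the `subscription/oldest_unacked_message_age` metric. The function name `pause_hours_remaining` is hypothetical.

```shell
# 7-day retention window stated in the runbook, in seconds.
RETENTION_SECONDS=$(( 7 * 24 * 3600 ))

# Hypothetical helper: whole hours left before the oldest message expires.
# Argument: $1 = age of the oldest unacknowledged message, in seconds.
pause_hours_remaining() {
  age=$1
  remaining=$(( RETENTION_SECONDS - age ))
  if [ "$remaining" -le 0 ]; then
    echo 0
  else
    echo $(( remaining / 3600 ))
  fi
}
```

For a backlog whose oldest message is one day old, `pause_hours_remaining 86400` prints `144`.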
Option B — Suspend subscription delivery
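# Caution: seek does not pause delivery. Seeking to a timestamp marks every message
# published before it as acknowledged, so snapshot the backlog first.
gcloud pubsub snapshots create <SNAPSHOT_NAME> \
--subscription=gpu-job-request-sub \
--project=impulse-gpu-runtime
gcloud pubsub subscriptions seek gpu-job-request-sub \
--project=impulse-gpu-runtime \
--time="$(date -Iseconds)"
Option B affects all consumers of the subscription, and messages published after the seek are still delivered. To restore the skipped backlog, seek the subscription back to the snapshot with --snapshot=<SNAPSHOT_NAME>. Use Option A if only the GPU scheduler should be paused.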
3. Quota Increase Request
Via GCP Console (fastest for emergency increases)
- Go to IAM & Admin → Quotas in the GCP Console.
- Filter by Service: Vertex AI API and Region: us-central1.
- Select the affected GPU quota metric (e.g. NVIDIA_T4_GPUS).
- Click Edit Quotas, enter the new limit, and add a justification referencing this incident.
- For P1 incidents, select Urgent and include the on-call contact.
Via gcloud (for scripted requests)
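# The consumer uses the plural "projects/" prefix. --unit is required; the value shown is the
# usual unit for per-region quotas, so confirm it with `gcloud alpha services quota list`.
gcloud alpha services quota update \
--consumer=projects/impulse-gpu-runtime \
--service=aiplatform.googleapis.com \
--metric=aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus \
--unit='1/{project}/{region}' \
--dimensions=region=us-central1 \
--value=<NEW_LIMIT>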
Expected approval times
| Request type | Typical approval time |
|---|---|
| < 2× current limit | 1–4 hours |
| 2–5× current limit | 4–24 hours |
| > 5× current limit | 1–3 business days |
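The table can be turned into a quick lookup when deciding whether to wait for approval or fail over. This is a sketch under the table's own bucket boundaries (exactly 5× is treated as part of the 2–5× row); the function name `expected_approval_time` is hypothetical and the returned strings are estimates, not guarantees.

```shell
# Hypothetical helper: estimate approval time from the size of the increase.
# Arguments: $1 = current quota limit, $2 = requested limit (whole numbers).
expected_approval_time() {
  current=$1
  requested=$2
  if [ "$current" -le 0 ] || [ "$requested" -le "$current" ]; then
    echo "not an increase" >&2
    return 1
  fi
  if [ "$requested" -lt $(( current * 2 )) ]; then
    echo "1-4 hours"
  elif [ "$requested" -le $(( current * 5 )) ]; then
    echo "4-24 hours"
  else
    echo "1-3 business days"
  fi
}
```

For example, raising a limit of 16 to 24 gives `expected_approval_time 16 24`, which prints `1-4 hours`.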
4. Alternative Region Failover
If quota cannot be increased quickly, route jobs to a secondary region:
# Verify GPU quota availability in the backup region first
gcloud compute regions describe us-east4 \
--project=impulse-gpu-runtime \
--format="yaml(quotas)" | \
grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"
# Then update the scheduler to target the backup region
kubectl set env deployment/gpu-scheduler \
-n gpu-runtime \
VERTEX_REGION=us-east4 # backup region
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
Cross-region failover incurs higher data-transfer costs and slightly higher latency. Document the failover in the incident ticket and revert as soon as the primary region quota is restored.
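Whether failover is worthwhile depends on the backup region having enough free quota for the queued work. The check below is a minimal sketch of that comparison; the function name `backup_region_can_absorb` is hypothetical, and the number of GPUs needed is an input the operator has to estimate from the backlog.

```shell
# Hypothetical helper: succeed if the backup region has enough free GPU quota.
# Arguments: $1 = GPUs in use in the backup region, $2 = its quota limit,
#            $3 = GPUs the queued jobs are expected to need.
backup_region_can_absorb() {
  usage=$1
  limit=$2
  needed=$3
  free=$(( limit - usage ))
  [ "$free" -ge "$needed" ]
}
```

A call such as `backup_region_can_absorb 4 16 8 && echo "failover viable"` prints the message, because 12 GPUs are free and 8 are needed.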
5. Resuming Job Intake
Once quota is confirmed available (utilization < 70 %):
# Resume intake via environment variable
kubectl set env deployment/gpu-scheduler \
-n gpu-runtime \
PAUSE_JOB_INTAKE=false
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
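# Monitor submission success rate for 5 minutes
# (gcloud has no command for reading time series, so this queries the Monitoring API timeSeries.list method; requires GNU date)
watch -n30 'curl -s -G \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode "filter=metric.type=\"custom.googleapis.com/gpu_scheduler/job_submission_success_rate\"" \
--data-urlencode "interval.startTime=$(date -u -d "5 minutes ago" +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
"https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries"'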
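The 70 % precondition for resuming can be checked mechanically before flipping PAUSE_JOB_INTAKE. The following is a sketch only: the function name `safe_to_resume` is hypothetical, and usage and limit are assumed to come from the quota listing used in the manual check.

```shell
# Resume threshold from this section: utilization must be strictly below 70 %.
RESUME_THRESHOLD_PCT=70

# Hypothetical helper: succeed only if utilization is below the resume threshold.
# Arguments: $1 = GPUs in use, $2 = quota limit (whole numbers).
safe_to_resume() {
  usage=$1
  limit=$2
  [ "$limit" -gt 0 ] && [ $(( usage * 100 )) -lt $(( limit * RESUME_THRESHOLD_PCT )) ]
}
```

With 11 of 16 GPUs in use (68.75 %), `safe_to_resume 11 16` succeeds; with 12 of 16 (75 %) it fails, so intake should stay paused.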
6. DLQ Replay After Recovery
Jobs that failed with VERTEX_QUOTA_EXCEEDED during the exhaustion window must be replayed:
# Identify quota-failure DLQ messages
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
--project=impulse-gpu-runtime \
--limit=500 \
--format=json > /tmp/dlq-quota-failures.json
# Count messages whose payload carries the quota error code (@base64d requires jq 1.6+)
jq '[.[] | select((.message.data | @base64d | fromjson | .error_code) == "VERTEX_QUOTA_EXCEEDED")] | length' \
/tmp/dlq-quota-failures.json
Follow the full replay procedure in the DLQ Recovery Runbook.
After the incident is resolved, take the following preventive actions:
| Action | Owner | Timeline |
|---|---|---|
| Review quota headroom (maintain ≥ 30 % buffer) | Platform Engineering | Within 1 week |
| Implement per-customer GPU concurrency caps | Backend Engineering | Within 2 weeks |
| Enable preemptible/Spot GPU jobs to reduce committed quota usage | Platform Engineering | Within 1 week |
| Schedule quarterly quota reviews | On-call SRE | Recurring |
| Lower the gpu-quota-utilization-high alert threshold from 80 % to 70 % | SRE | Immediate |
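The 30 % headroom target translates into a minimum quota limit for a given peak usage: the limit must be at least peak / 0.7, rounded up. The sketch below does that calculation in integer arithmetic; the function name `min_limit_for_buffer` is hypothetical.

```shell
# Headroom target from the preventive actions table.
BUFFER_PCT=30

# Hypothetical helper: smallest quota limit that keeps BUFFER_PCT free at peak.
# Argument: $1 = peak number of GPUs in use (whole number).
min_limit_for_buffer() {
  peak=$1
  usable=$(( 100 - BUFFER_PCT ))
  # Ceiling division: round peak * 100 / usable up to the next whole GPU.
  echo $(( (peak * 100 + usable - 1) / usable ))
}
```

For a peak of 14 GPUs, `min_limit_for_buffer 14` prints `20`; for a peak of 15 it prints `22`, since a limit of 21 would leave only about 28.6 % free.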
Post-Incident Checklist