GPU quota exhaustion occurs when the number of requested GPU accelerator cores in a GCP region reaches the project limit. New Vertex AI Custom Jobs are rejected with
RESOURCE_EXHAUSTED until quota is freed or increased.
1. Detection
Alerts
The following Cloud Monitoring alert policies should fire before quota is fully exhausted:

| Alert | Threshold | Channel |
|---|---|---|
| gpu-quota-utilization-high | > 80 % for 5 min | PagerDuty P2 |
| gpu-quota-exhausted | 100 % (job submissions failing) | PagerDuty P1 |
| vertex-job-submission-error-rate | > 10 % for 3 min | PagerDuty P1 |
Manual check
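A hedged sketch for checking quota headroom from the CLI. The project ID is a placeholder, the alpha quota surface may vary by gcloud version, and note that Vertex AI custom-training GPU quotas live under the aiplatform.googleapis.com service, while the regions API shows the Compute Engine GPU quotas:

```bash
# Consumer quota metrics for the Vertex AI API (alpha surface; verify it is
# available in your gcloud version). my-gcp-project is a placeholder.
gcloud alpha services quota list \
  --service=aiplatform.googleapis.com \
  --consumer=projects/my-gcp-project

# Compute Engine GPU quotas for the region, as a quick cross-check.
# Matches entries such as NVIDIA_T4_GPUS (limit/metric/usage triplets).
gcloud compute regions describe us-central1 \
  --project=my-gcp-project --format=yaml | grep -B1 -A1 NVIDIA
```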
2. Immediate Containment
Option A — Environment variable pause (preferred, zero-downtime)
Set the GPU scheduler's pause flag so it stops pulling from gpu-job-request-sub, and wait. Messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during this pause.
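A minimal sketch, assuming the scheduler runs as a Kubernetes Deployment that honors a pause flag; the deployment name, namespace, and PAUSE_JOB_INTAKE variable are hypothetical placeholders, not names confirmed by this runbook:

```bash
# Hypothetical names: deployment/gpu-scheduler, namespace ml-platform, and
# PAUSE_JOB_INTAKE are illustrative. Substitute your scheduler's actual config.
kubectl set env deployment/gpu-scheduler -n ml-platform PAUSE_JOB_INTAKE=true

# Wait for the new pods to roll out before assuming intake is paused.
kubectl rollout status deployment/gpu-scheduler -n ml-platform
```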
Option B — Suspend subscription delivery
Option B affects all consumers. Use Option A if only the GPU scheduler should be paused.
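If gpu-job-request-sub is a push subscription, one way to suspend delivery is to clear its push endpoint, which converts it to pull and halts pushes until the endpoint is restored. A sketch under that assumption; the restore URL is a placeholder:

```bash
# Suspend: clearing the push endpoint converts the subscription to pull,
# halting push delivery to ALL consumers (hence the caveat above).
gcloud pubsub subscriptions update gpu-job-request-sub --push-endpoint=""

# Resume: restore the original endpoint (placeholder URL shown).
gcloud pubsub subscriptions update gpu-job-request-sub \
  --push-endpoint="https://gpu-scheduler.example.internal/push"
```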
3. Quota Increase Request
Via GCP Console (fastest for emergency increases)
- Go to IAM & Admin → Quotas in the GCP Console.
- Filter by Service: Vertex AI API and Region: us-central1.
- Select the affected GPU quota metric (e.g. NVIDIA_T4_GPUS).
- Click Edit Quotas, enter the new limit, and add a justification referencing this incident.
- For P1 incidents, select Urgent and include the on-call contact.
Via gcloud (for scripted requests)
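One scripted route is the Cloud Quotas API (cloudquotas.googleapis.com), called here with a token minted by gcloud. The project ID, quota ID, preferred value, and contact email are placeholders; verify the field names against the current QuotaPreference reference before use:

```bash
# Sketch: request a higher limit via the Cloud Quotas API.
# QUOTA_ID and values below are placeholders, not confirmed by this runbook.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://cloudquotas.googleapis.com/v1/projects/my-gcp-project/locations/global/quotaPreferences" \
  -d '{
    "service": "aiplatform.googleapis.com",
    "quotaId": "CustomModelTrainingNvidiaT4GpusPerProjectPerRegion",
    "quotaConfig": {"preferredValue": 32},
    "dimensions": {"region": "us-central1"},
    "justification": "Incident mitigation: GPU quota exhausted in us-central1",
    "contactEmail": "oncall@example.com"
  }'
```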
Expected approval times
| Request type | Typical approval time |
|---|---|
| < 2× current limit | 1–4 hours |
| 2–5× current limit | 4–24 hours |
| > 5× current limit | 1–3 business days |
4. Alternative Region Failover
If quota cannot be increased quickly, route jobs to a secondary region:
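A hedged sketch of failover, assuming us-west1 is the approved secondary region and jobs are defined by a YAML spec (both assumptions). Quotas are per-region, so confirm the secondary region has its own headroom first:

```bash
# Point the scheduler at the secondary region (hypothetical env var and
# deployment; same placeholder names as the Option A sketch above).
kubectl set env deployment/gpu-scheduler -n ml-platform VERTEX_REGION=us-west1

# Or resubmit an individual job directly to the failover region.
# job_spec.yaml holds the workerPoolSpecs; the display name is illustrative.
gcloud ai custom-jobs create \
  --region=us-west1 \
  --display-name="failover-retrain-job" \
  --config=job_spec.yaml
```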
5. Resuming Job Intake
Once quota is confirmed available (utilization < 70 %):
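Assuming the Option A pause was used, resuming is the reverse flag flip (same hypothetical names as above):

```bash
# Re-enable intake (reverses the Option A sketch; names are placeholders).
kubectl set env deployment/gpu-scheduler -n ml-platform PAUSE_JOB_INTAKE=false
kubectl rollout status deployment/gpu-scheduler -n ml-platform

# If Option B was used instead, restore the subscription's original
# push endpoint as shown in that section.
```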
6. DLQ Replay After Recovery
Jobs that failed with VERTEX_QUOTA_EXCEEDED during the exhaustion window must be replayed:
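A minimal replay loop, assuming the DLQ is drained via a subscription named gpu-job-request-dlq-sub and payloads are republished to a gpu-job-request topic (both names are assumptions). Note this sketch replays one message per pull and does not preserve message attributes:

```bash
# Drain the DLQ and republish each payload to the request topic.
# Subscription/topic names are assumptions; attributes are NOT preserved.
while true; do
  msg=$(gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
          --auto-ack --limit=1 \
          --format="value(message.data.decode(base64))")
  [ -z "$msg" ] && break   # empty pull: DLQ is drained
  gcloud pubsub topics publish gpu-job-request --message="$msg"
done
```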
7. Long-Term Remediation
After the incident is resolved, take the following preventive actions:

| Action | Owner | Timeline |
|---|---|---|
| Review quota headroom (maintain ≥ 30 % buffer) | Platform Engineering | Within 1 week |
| Implement per-customer GPU concurrency caps | Backend Engineering | Within 2 weeks |
| Enable preemptible/Spot GPU jobs to reduce committed quota usage | Platform Engineering | Within 1 week |
| Schedule quarterly quota reviews | On-call SRE | Recurring |
| Configure gpu-quota-utilization-high alert at 70 % | SRE | Immediate |
Post-Incident Checklist
- Quota increased or failover region confirmed operational
- Job intake resumed and submission success rate > 95 %
- DLQ drained and quota-failure messages replayed
- No customer-visible jobs permanently lost
- Incident postmortem scheduled within 48 hours
- Quota buffer alert threshold updated if necessary