# GPU Runtime Quota Exhaustion Runbook

> Detection, containment, and recovery procedures for Vertex AI GPU quota exhaustion events

GPU quota exhaustion occurs when the number of GPUs requested in a GCP region reaches the project's quota limit. New Vertex AI Custom Jobs are rejected with `RESOURCE_EXHAUSTED` until quota is freed or increased.

***

## 1. Detection

### Alerts

The following Cloud Monitoring alert policies cover quota exhaustion. The first should fire before quota is fully exhausted; the other two fire once job submissions start failing:

| Alert                              | Threshold                       | Channel      |
| ---------------------------------- | ------------------------------- | ------------ |
| `gpu-quota-utilization-high`       | > 80 % for 5 min                | PagerDuty P2 |
| `gpu-quota-exhausted`              | 100 % (job submissions failing) | PagerDuty P1 |
| `vertex-job-submission-error-rate` | > 10 % for 3 min                | PagerDuty P1 |

### Manual check

```bash
# View current GPU quota per region.
# Each quota entry lists limit, metric, usage in that order,
# so print one line on either side of the matching metric line.
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"

# Check recent Vertex AI job submission failures
# (status code 8 is RESOURCE_EXHAUSTED)
gcloud logging read \
  'resource.type="aiplatform.googleapis.com/CustomJob" AND protoPayload.status.code=8' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.resourceName, protoPayload.status.message)"
```

***

## 2. Immediate Containment

Pause new job intake as soon as quota exhaustion is confirmed to prevent cascading DLQ growth.

### Option A: Environment variable pause (preferred, zero-downtime)

```bash
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

The scheduler will stop pulling from `gpu-job-request-sub` and wait. Messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during this pause.

### Option B: Seek the subscription to the current time

```bash
# Seek the subscription to the current time.
# This marks every message published before now as acknowledged,
# so the existing backlog is no longer delivered.
gcloud pubsub subscriptions seek gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --time="$(date -Iseconds)"
```

Option B affects every consumer of `gpu-job-request-sub`, and it does not stop delivery of messages published after the seek. The acknowledged backlog can be replayed later only by seeking back to an earlier timestamp, which requires `retain-acked-messages` to be enabled on the subscription, or by seeking to a snapshot taken beforehand. Use Option A if only the GPU scheduler should be paused.

***

## 3. Quota Increase Request

### Via GCP Console (fastest for emergency increases)

1. Go to **IAM & Admin → Quotas** in the [GCP Console](https://console.cloud.google.com).
2. Filter by **Service: Vertex AI API** and **Region: us-central1**.
3. Select the affected GPU quota metric (e.g. `NVIDIA_T4_GPUS`).
4. Click **Edit Quotas**, enter the new limit, and add a justification referencing this incident.
5. For P1 incidents, select **Urgent** and include the on-call contact.

### Via `gcloud` (for scripted requests)

```bash
# Append the requested limit to --value= before running
gcloud alpha services quota update \
  --consumer=projects/impulse-gpu-runtime \
  --service=aiplatform.googleapis.com \
  --metric=aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus \
  --value=
```

### Expected approval times

| Request type        | Typical approval time |
| ------------------- | --------------------- |
| < 2× current limit  | 1–4 hours             |
| 2–5× current limit  | 4–24 hours            |
| > 5× current limit  | 1–3 business days     |

***

## 4. Alternative Region Failover

If quota cannot be increased quickly, route jobs to a secondary region. Confirm that the backup region has GPU quota available before switching the scheduler:

```bash
# Verify GPU quota availability in the backup region
gcloud compute regions describe us-east4 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -B1 -A1 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS"

# Update scheduler to target the backup region
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  VERTEX_REGION=us-east4

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

Cross-region failover incurs higher data-transfer costs and slightly higher latency. Document the failover in the incident ticket and revert as soon as the primary region quota is restored.

***

## 5. Resuming Job Intake

Once quota is confirmed available (utilization < 70 %):

```bash
# Resume intake via environment variable
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

# Monitor submission success rate over the last 5 minutes.
# gcloud has no time-series read command, so this queries the
# Cloud Monitoring API (timeSeries.list) directly. Requires GNU date and jq.
watch -n30 '
  curl -sG \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"custom.googleapis.com/gpu_scheduler/job_submission_success_rate\"" \
    --data-urlencode "interval.startTime=$(date -u -d "-5 minutes" +%Y-%m-%dT%H:%M:%SZ)" \
    --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
  | jq ".timeSeries[]?.points[0]"
'
```

***

## 6. DLQ Replay After Recovery

Jobs that failed with `VERTEX_QUOTA_EXCEEDED` during the exhaustion window must be replayed:

```bash
# Pull DLQ messages for inspection (not acknowledged, so they stay in the DLQ)
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=500 \
  --format=json > /tmp/dlq-quota-failures.json

# Count the messages whose decoded payload carries the quota error code
jq '[.[] | select((.message.data | @base64d | fromjson | .error_code) == "VERTEX_QUOTA_EXCEEDED")] | length' \
  /tmp/dlq-quota-failures.json
```

Follow the full replay procedure in the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery).

***

## 7. Long-Term Remediation

After the incident is resolved, take the following preventive actions:

| Action                                                           | Owner                | Timeline       |
| ---------------------------------------------------------------- | -------------------- | -------------- |
| Review quota headroom (maintain ≥ 30 % buffer)                   | Platform Engineering | Within 1 week  |
| Implement per-customer GPU concurrency caps                      | Backend Engineering  | Within 2 weeks |
| Enable preemptible/Spot GPU jobs to reduce committed quota usage | Platform Engineering | Within 1 week  |
| Schedule quarterly quota reviews                                 | On-call SRE          | Recurring      |
| Configure `gpu-quota-utilization-high` alert at 70 %             | SRE                  | Immediate      |

***

## Post-Incident Checklist

* [ ] Quota increased or failover region confirmed operational
* [ ] Job intake resumed and submission success rate > 95 %
* [ ] DLQ drained and quota-failure messages replayed
* [ ] No customer-visible jobs permanently lost
* [ ] Incident postmortem scheduled within 48 hours
* [ ] Quota buffer alert threshold updated if necessary
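
The thresholds used throughout this runbook (P2 alert above 80 %, exhaustion at 100 %, resume intake only below 70 %) can be checked by hand during an incident. The following is a minimal sketch, not part of any existing tooling: `quota_status` is a hypothetical helper name, and it assumes the usage and limit are passed as whole numbers of GPUs.

```shell
#!/usr/bin/env bash
# quota_status USAGE LIMIT
# Hypothetical helper: classifies GPU quota utilization against the
# thresholds named in this runbook and prints "<state> <percent>%".
#   exhausted : utilization >= 100 %
#   high      : utilization >  80 %   (P2 alert threshold)
#   hold      : utilization >= 70 %   (too high to resume intake)
#   ok        : utilization <  70 %
quota_status() {
  local usage="$1" limit="$2"

  # A zero or negative limit makes the ratio meaningless.
  if [ "$limit" -le 0 ]; then
    echo "unknown"
    return 1
  fi

  # Percentage for display only, rounded down.
  local pct=$(( usage * 100 / limit ))

  # Compare by cross-multiplication so integer rounding cannot
  # push a value across a threshold.
  if   [ $(( usage * 100 )) -ge $(( limit * 100 )) ]; then echo "exhausted ${pct}%"
  elif [ $(( usage * 100 )) -gt $(( limit * 80 ))  ]; then echo "high ${pct}%"
  elif [ $(( usage * 100 )) -ge $(( limit * 70 ))  ]; then echo "hold ${pct}%"
  else                                                     echo "ok ${pct}%"
  fi
}
```

For example, `quota_status 6 8` reports `hold 75%`, meaning intake should stay paused. To feed it live values, one option (an assumption about the JSON shape, requiring `jq`) is to take the `usage` and `limit` fields for the relevant metric from `gcloud compute regions describe us-central1 --project=impulse-gpu-runtime --format=json` and round them down to integers first, since bash arithmetic does not accept decimals.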