GPU quota exhaustion occurs when the number of GPU accelerators requested in a GCP region reaches the project's quota limit for that region. New Vertex AI Custom Jobs are then rejected with RESOURCE_EXHAUSTED (status code 8) until quota is freed or increased.

1. Detection

Alerts

The following Cloud Monitoring alert policies should fire before quota is fully exhausted:
Alert                            | Threshold                        | Channel
gpu-quota-utilization-high       | > 80 % for 5 min                 | PagerDuty P2
gpu-quota-exhausted              | 100 % (job submissions failing)  | PagerDuty P1
vertex-job-submission-error-rate | > 10 % for 3 min                 | PagerDuty P1
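
To confirm these policies exist and are enabled, they can be listed via the alpha Monitoring surface (assuming the policy display names match the alert names above):

# List quota-related alert policies and whether they are enabled
gcloud alpha monitoring policies list \
  --project=impulse-gpu-runtime \
  --filter='displayName~"gpu-quota" OR displayName~"vertex-job"' \
  --format="table(displayName, enabled)"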

Manual check

# View current GPU quota per region
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -A3 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS|NVIDIA_L4_GPUS"

# Check recent Vertex AI job submission failures
gcloud logging read \
  'resource.type="aiplatform.googleapis.com/CustomJob" AND protoPayload.status.code=8' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.resourceName, protoPayload.status.message)"

2. Immediate Containment

Pause new job intake as soon as quota exhaustion is confirmed to prevent cascading DLQ growth.

Option A — Environment variable pause (preferred, zero-downtime)

kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=true

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
The scheduler will stop pulling from gpu-job-request-sub and wait. Messages remain in the subscription (up to the 7-day retention window) and are not moved to the DLQ during this pause.
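
Before moving on, confirm the pause actually took effect (the label selector below is an assumption; adjust it to the deployment's labels):

# Confirm PAUSE_JOB_INTAKE is rendered on the deployment spec
kubectl get deployment gpu-scheduler -n gpu-runtime \
  -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="PAUSE_JOB_INTAKE")].value}'

# Pods should have rolled to the new revision
kubectl get pods -n gpu-runtime -l app=gpu-scheduler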

Option B — Suspend subscription delivery

# Seeking the subscription to the current time acknowledges the messages
# currently in the backlog, so queued job requests stop being delivered
gcloud pubsub subscriptions seek gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --time="$(date -Iseconds)"
Option B affects all consumers of the subscription and acknowledges the queued messages rather than holding them. Use Option A if only the GPU scheduler should be paused or if the backlog must be preserved.
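
If Option B is used and the backlog may need to be redelivered later, take a snapshot before the seek and restore from it afterwards; a minimal sketch (the snapshot name gpu-job-pause-snap is illustrative):

# Preserve the current backlog before pausing
gcloud pubsub snapshots create gpu-job-pause-snap \
  --project=impulse-gpu-runtime \
  --subscription=gpu-job-request-sub

# After recovery, restore the backlog by seeking back to the snapshot
gcloud pubsub subscriptions seek gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --snapshot=gpu-job-pause-snap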

3. Quota Increase Request

Via GCP Console (fastest for emergency increases)

  1. Go to IAM & Admin → Quotas in the GCP Console.
  2. Filter by Service: Vertex AI API and Region: us-central1.
  3. Select the affected GPU quota metric (e.g. NVIDIA_T4_GPUS).
  4. Click Edit Quotas, enter the new limit, and add a justification referencing this incident.
  5. For P1 incidents, select Urgent and include the on-call contact.

Via gcloud (for scripted requests)

gcloud alpha services quota update \
  --consumer=projects/impulse-gpu-runtime \
  --service=aiplatform.googleapis.com \
  --metric=aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus \
  --value=<NEW_LIMIT>
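
To check the current limit and usage for the same metric before and after the request (the alpha quota surface; availability depends on the installed gcloud version):

# Show configured limits and current usage for the T4 training quota
gcloud alpha services quota list \
  --consumer=projects/impulse-gpu-runtime \
  --service=aiplatform.googleapis.com | \
  grep -B2 -A6 custom_model_training_nvidia_t4_gpus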

Expected approval times

Request type        | Typical approval time
< 2× current limit  | 1–4 hours
2–5× current limit  | 4–24 hours
> 5× current limit  | 1–3 business days

4. Alternative Region Failover

If quota cannot be increased quickly, route jobs to a secondary region:
# Update scheduler to target backup region
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  VERTEX_REGION=us-east4        # backup region

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

# Verify GPU quota availability in backup region
gcloud compute regions describe us-east4 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | \
  grep -A3 -E "NVIDIA_T4_GPUS|NVIDIA_A100_GPUS"
Cross-region failover incurs higher data-transfer costs and slightly higher latency. Document the failover in the incident ticket and revert as soon as the primary region quota is restored.
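
When picking a backup region, it can help to compare headroom across several candidates in one pass; a sketch (regions other than us-east4 are illustrative, requires jq):

# Compare NVIDIA_T4_GPUS headroom across candidate failover regions
for region in us-east4 us-west1 europe-west4; do
  gcloud compute regions describe "$region" \
    --project=impulse-gpu-runtime \
    --format=json | \
  jq -r --arg region "$region" \
     '.quotas[] | select(.metric == "NVIDIA_T4_GPUS")
      | "\($region): \(.limit - .usage) of \(.limit) T4 GPUs free"'
done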

5. Resuming Job Intake

Once quota is confirmed available (utilization < 70 %):
# Resume intake via environment variable
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  PAUSE_JOB_INTAKE=false

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

# Monitor submission success rate over the last 5 minutes
# (queries the Cloud Monitoring API directly; re-run or wrap in `watch` as needed)
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
  --data-urlencode 'filter=metric.type="custom.googleapis.com/gpu_scheduler/job_submission_success_rate"' \
  --data-urlencode "interval.startTime=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"

6. DLQ Replay After Recovery

Jobs that failed with VERTEX_QUOTA_EXCEEDED during the exhaustion window must be replayed:
# Identify quota-failure DLQ messages
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=500 \
  --format=json > /tmp/dlq-quota-failures.json

jq '[.[] | select((.message.data | @base64d | fromjson | .error_code) == "VERTEX_QUOTA_EXCEEDED")] | length' \
  /tmp/dlq-quota-failures.json
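
To see how the DLQ backlog breaks down before replaying (same JSON message-body assumption as the filter above):

# Count DLQ messages per error_code
jq -r '[.[] | .message.data | @base64d | fromjson | .error_code]
       | group_by(.) | map("\(.[0]): \(length)") | .[]' \
  /tmp/dlq-quota-failures.json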
Follow the full replay procedure in the DLQ Recovery Runbook.

7. Long-Term Remediation

After the incident is resolved, take the following preventive actions:
Action                                                           | Owner                | Timeline
Review quota headroom (maintain ≥ 30 % buffer)                   | Platform Engineering | Within 1 week
Implement per-customer GPU concurrency caps                      | Backend Engineering  | Within 2 weeks
Enable preemptible/Spot GPU jobs to reduce committed quota usage | Platform Engineering | Within 1 week
Schedule quarterly quota reviews                                 | On-call SRE          | Recurring
Configure gpu-quota-utilization-high alert at 70 %               | SRE                  | Immediate

Post-Incident Checklist

  • Quota increased or failover region confirmed operational
  • Job intake resumed and submission success rate > 95 %
  • DLQ drained and quota-failure messages replayed
  • No customer-visible jobs permanently lost
  • Incident postmortem scheduled within 48 hours
  • Quota buffer alert threshold updated if necessary