# GPU Runtime Failure Recovery Runbook

> Recovery procedures for scheduler outages, Vertex quota exhaustion, artifact upload failures, preemption storms, and Pub/Sub consumer failures

This runbook provides step-by-step recovery procedures for the most common GPU Runtime failure modes. Each section is self-contained and can be followed independently.

***

## 1. Scheduler Outage

**Symptoms:** Jobs remain `QUEUED` indefinitely; no new Vertex AI jobs are being submitted; `gpu-scheduler` pods are in `CrashLoopBackOff` or `Error` state.

### Diagnose

```bash theme={null}
# Check pod status
kubectl get pods -n gpu-runtime -l app=gpu-scheduler

# Review recent crash logs
kubectl logs -n gpu-runtime -l app=gpu-scheduler --previous --tail=100

# Check Kubernetes events for OOM kills or probe failures
kubectl describe pod -n gpu-runtime -l app=gpu-scheduler | \
  grep -A5 "Events:"
```

### Recover

Restart the scheduler deployment and wait for the rollout to complete:

```bash theme={null}
kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

Verify the health endpoint from inside the pod:

```bash theme={null}
kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
  curl -s http://localhost:8080/healthz
# Expected: {"status":"ok","version":"..."}
```

Check the request subscription for a backlog that built up during the outage:

```bash theme={null}
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"
```

If the command returns these fields empty, read the same values from the Cloud Monitoring metrics `pubsub.googleapis.com/subscription/num_undelivered_messages` and `pubsub.googleapis.com/subscription/oldest_unacked_message_age`.

If `oldestUnackedMessageAge` exceeds the `ackDeadlineSeconds` (600 s), messages have likely been redelivered and may have already moved to the DLQ. Follow the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery).

If the restart does not restore the scheduler, roll back to the previous revision:

```bash theme={null}
kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
```

See the [Deployment Runbook](/gpu-operations/deployment) for full rollback procedures.

***

## 2. Vertex AI Quota Exhaustion

**Symptoms:** New custom jobs fail immediately with `RESOURCE_EXHAUSTED`; Cloud Monitoring shows quota utilization at 100%.

See the dedicated [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion) for full recovery steps. Quick summary:

1. Pause new job intake (modify the subscription or set the scheduler env var `PAUSE_JOB_INTAKE=true`).
2. Request an emergency quota increase via the GCP Console or Support.
3. Resume intake once quota headroom is confirmed.
4. Replay DLQ messages that failed due to quota errors.

***

## 3. Artifact Upload Failure

**Symptoms:** Jobs complete on Vertex AI but output artifacts are missing from GCS; jobs show `ARTIFACT_UPLOAD_FAILED` in the result message.

### Diagnose

```bash theme={null}
# Check GCS bucket permissions for the runtime SA
gsutil iam get gs://impulse-gpu-artifacts

# List recent failed uploads (look for 403 / 429 in GCS audit logs)
gcloud logging read \
  'resource.type="gcs_bucket" AND protoPayload.status.code!=0 AND resource.labels.bucket_name="impulse-gpu-artifacts"' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.status.message)"
```

### Recover

List the members that hold `roles/storage.objectAdmin` on the bucket:

```bash theme={null}
gcloud storage buckets get-iam-policy gs://impulse-gpu-artifacts \
  --format="json" | jq '.bindings[] | select(.role == "roles/storage.objectAdmin")'
```

The `gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com` SA must be listed. If missing, add it:

```bash theme={null}
gsutil iam ch \
  serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com:objectAdmin \
  gs://impulse-gpu-artifacts
```

Check the bucket's total storage usage:

```bash theme={null}
gsutil du -s gs://impulse-gpu-artifacts
```

Confirm the bucket has not hit a soft storage cap. If 429 rate-limit errors appear, implement exponential backoff in the runtime upload code.
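The runtime's upload code is not shown in this runbook, so the following is only a minimal shell sketch of the exponential backoff pattern. The helper names `backoff_delay` and `retry_with_backoff`, and the `MAX_ATTEMPTS`, `BACKOFF_BASE`, and `BACKOFF_CAP` variables, are hypothetical and not part of the GPU Runtime.

```bash
#!/usr/bin/env bash
# Sketch only: exponential backoff around an arbitrary command.
# All function and variable names here are illustrative assumptions.

# Print the delay in seconds before retry number $1 (0-based):
# base * 2^attempt, capped at $cap.
backoff_delay() {
  local attempt=$1 base=${2:-1} cap=${3:-60}
  local delay=$(( base * (1 << attempt) ))
  if (( delay > cap )); then
    delay=$cap
  fi
  echo "$delay"
}

# Run "$@" until it succeeds, at most MAX_ATTEMPTS times (default 5),
# sleeping for an exponentially growing delay between failures.
retry_with_backoff() {
  local max=${MAX_ATTEMPTS:-5} attempt=0 delay
  until "$@"; do
    attempt=$(( attempt + 1 ))
    if (( attempt >= max )); then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    delay=$(backoff_delay "$(( attempt - 1 ))" "${BACKOFF_BASE:-1}" "${BACKOFF_CAP:-60}")
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
}

# Hypothetical usage ($LOCAL_ARTIFACT and $JOB_ID are placeholders):
#   retry_with_backoff gsutil cp "$LOCAL_ARTIFACT" "gs://impulse-gpu-artifacts/$JOB_ID/"
```

With the defaults, the waits between attempts are 1, 2, 4, and 8 seconds. A production implementation would usually add random jitter to each delay so that many jobs retrying at once do not hit the bucket in lockstep.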
Once uploads succeed again, retry the upload for every affected job. The query returns at most 50 jobs per call, so rerun it until it returns none:

```bash theme={null}
# Re-queue jobs with ARTIFACT_UPLOAD_FAILED status
curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=ARTIFACT_UPLOAD_FAILED&limit=50" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  | jq -r '.[].job_id' \
  | while read -r job_id; do
      curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
        -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
    done
```

***

## 4. Runtime Preemption Storms

**Symptoms:** Large numbers of Vertex AI custom jobs move to `JOB_STATE_FAILED` in a short window with error `Preempted`; the DLQ message count spikes; the customer-facing error rate increases.

Preemption storms occur when GCP reclaims many preemptible/Spot GPU VMs in a region at the same time, typically during high-demand periods.

### Immediate containment

```bash theme={null}
# Stop submitting preemptible jobs by switching the scheduler to on-demand VMs temporarily
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  GPU_VM_TYPE=on-demand

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

### Assess scope

```bash theme={null}
# Count failed jobs in the last 30 minutes (uses GNU date syntax)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '30 minutes ago' -Iseconds)" \
  --format="value(name)" | wc -l

# Check if failures are isolated to a single accelerator type
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '30 minutes ago' -Iseconds)" \
  --format="table(displayName,jobSpec.workerPoolSpecs[0].machineSpec.acceleratorType,error.message)"
```

### Recovery

Follow the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery) to replay messages with the `JOB_PREEMPTED` error code.

Watch the `gpu_job_success_rate` metric in Cloud Monitoring until it returns above 95%.
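This runbook does not specify how the `gpu_job_success_rate` value is read. As a rough sketch, assuming the metric is a percentage and that a few recent samples are collected by hand or from a dashboard, a small guard like the one below can confirm the recovery is sustained before proceeding. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: decide whether the success rate has recovered.
# Assumes samples are percentages (e.g. 96.5); names are illustrative.

# Succeed when a single sample is strictly above the threshold (default 95).
success_rate_recovered() {
  local rate=$1 threshold=${2:-95}
  # awk handles the floating-point comparison; exit status 0 means "above".
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 > t + 0) }'
}

# Succeed only when every sample given as an argument is above 95,
# so that one good reading in the middle of a storm is not enough.
sustained_recovery() {
  local sample
  (( $# > 0 )) || return 1
  for sample in "$@"; do
    success_rate_recovered "$sample" || return 1
  done
}
```

For example, `sustained_recovery 96.2 97.0 98.4 && echo "safe to proceed"` prints the message, while a sample list containing `94` does not.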
Once the success rate has recovered, switch the scheduler back to preemptible VMs:

```bash theme={null}
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  GPU_VM_TYPE=preemptible

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

Preemption storms usually resolve within 30–90 minutes. If they persist for more than 2 hours, keep the scheduler on on-demand VMs and file a GCP support ticket.

***

## 5. Pub/Sub Consumer Failure

**Symptoms:** Messages accumulate on `gpu-job-request-sub`; `oldestUnackedMessageAge` grows continuously; no new Vertex AI jobs are submitted even though the scheduler pod is `Running`.

### Diagnose

```bash theme={null}
# Check subscription backlog
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"

# Confirm the scheduler is actively pulling
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100 | \
  grep -E "pubsub|pull|subscription"

# Check pod CPU and memory usage; a resource-starved pod can stall message pulls
kubectl top pod -n gpu-runtime -l app=gpu-scheduler
```

### Recover

Restart the consumer and wait for the rollout to complete:

```bash theme={null}
kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

Confirm that the backlog is draining:

```bash theme={null}
# Confirm messages are being acknowledged (the backlog should decrease)
watch -n10 "gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format='value(numUndeliveredMessages)'"
```

If the backlog does not decrease, check the scheduler's IAM binding:

```bash theme={null}
# Confirm the scheduler SA still has the pubsub.subscriber role
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com AND bindings.role:roles/pubsub.subscriber" \
  --format="table(bindings.role)"
```

Review the [GCP Status Dashboard](https://status.cloud.google.com) for active Pub/Sub incidents in the deployment region.

The default message retention is 7 days.
If the outage is expected to exceed this window, escalate to Engineering to consider manual message republishing from Datastore backups.

***

## Escalation Matrix

| Failure Mode             | First Responder | Escalate To           | SLA     |
| ------------------------ | --------------- | --------------------- | ------- |
| Scheduler pod crash      | On-call SRE     | Platform Engineering  | 30 min  |
| Vertex quota exhaustion  | On-call SRE     | Cloud Account Team    | 2 hours |
| Artifact upload failure  | On-call SRE     | Storage/Infra Team    | 1 hour  |
| Preemption storm         | On-call SRE     | On-call SRE (monitor) | 2 hours |
| Pub/Sub consumer failure | On-call SRE     | Platform Engineering  | 30 min  |