
This runbook provides step-by-step recovery procedures for the most common GPU Runtime failure modes. Each section is self-contained and can be followed independently.

1. Scheduler Outage

Symptoms: Jobs remain QUEUED indefinitely; no new Vertex AI jobs are being submitted; gpu-scheduler pods in CrashLoopBackOff or Error state.

Diagnose

# Check pod status
kubectl get pods -n gpu-runtime -l app=gpu-scheduler

# Review recent crash logs
kubectl logs -n gpu-runtime -l app=gpu-scheduler --previous --tail=100

# Check Kubernetes events for OOM kills or probe failures
kubectl describe pod -n gpu-runtime -l app=gpu-scheduler | \
  grep -A5 "Events:"

Recover

1. Restart the scheduler deployment

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

2. Verify scheduler health

kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
  curl -s http://localhost:8080/healthz
# Expected: {"status":"ok","version":"..."}

3. Check Pub/Sub subscription backlog

gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"
If oldestUnackedMessageAge exceeds the ackDeadlineSeconds (600 s), messages may have already moved to the DLQ. Follow the DLQ Recovery Runbook.
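
To confirm the configured deadline and dead-letter wiring before switching to that runbook, describe the same subscription for both fields (standard Pub/Sub subscription fields):

gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(ackDeadlineSeconds,deadLetterPolicy)"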

4. Roll back if restart does not resolve the issue

kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
See the Deployment Runbook for full rollback procedures.

2. Vertex AI Quota Exhaustion

Symptoms: New custom jobs fail immediately with RESOURCE_EXHAUSTED; Cloud Monitoring shows quota utilization at 100%.

See the dedicated Quota Exhaustion Runbook for full recovery steps. Quick summary:
  1. Pause new job intake (modify the subscription or set the scheduler env var PAUSE_JOB_INTAKE=true; see the sketch after this list).
  2. Request an emergency quota increase via the GCP Console or Support.
  3. Resume intake once quota headroom is confirmed.
  4. Replay DLQ messages that failed due to quota errors.
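
A minimal sketch of the pause/resume toggle from step 1, assuming the scheduler honors the PAUSE_JOB_INTAKE environment variable as described above:

# Pause intake while quota is exhausted (per step 1)
kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=true
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

# Resume intake once quota headroom is confirmed (per step 3)
kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime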

3. Artifact Upload Failure

Symptoms: Jobs complete on Vertex AI but output artifacts are missing from GCS; jobs show ARTIFACT_UPLOAD_FAILED in the result message.

Diagnose

# Check GCS bucket permissions for the runtime SA
gsutil iam get gs://impulse-gpu-artifacts

# List recent failed uploads (look for 403 / 429 in GCS audit logs)
gcloud logging read \
  'resource.type="gcs_bucket" AND protoPayload.status.code!=0 AND resource.labels.bucket_name="impulse-gpu-artifacts"' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.status.message)"

Recover

1. Verify runtime service account permissions

gcloud storage buckets get-iam-policy gs://impulse-gpu-artifacts \
  --format="json" | jq '.bindings[] | select(.role == "roles/storage.objectAdmin")'
The gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com SA must be listed. If missing, add it:
gsutil iam ch \
  serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com:objectAdmin \
  gs://impulse-gpu-artifacts

2. Check bucket quota and storage class

gsutil du -s gs://impulse-gpu-artifacts
Confirm the bucket has not hit a soft storage cap. If 429 rate-limit errors appear, implement exponential backoff in the runtime upload code.
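
If a one-off manual re-upload is needed while the code fix lands, the same backoff idea can be sketched as a shell retry wrapper; the function name, local path, and destination prefix below are illustrative only:

# Illustrative only: retry a single upload with exponential backoff.
upload_with_backoff() {
  local src="$1" dst="$2" delay=2
  for attempt in 1 2 3 4 5; do
    if gcloud storage cp "$src" "$dst"; then
      return 0
    fi
    echo "Upload attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))   # 2s, 4s, 8s, 16s between attempts
  done
  return 1
}

# Example (placeholder paths):
upload_with_backoff ./outputs/model.ckpt \
  gs://impulse-gpu-artifacts/manual-reupload/model.ckpt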

3. Trigger artifact re-upload for failed jobs

# Re-queue jobs with ARTIFACT_UPLOAD_FAILED status
curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=ARTIFACT_UPLOAD_FAILED&limit=50" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  | jq -r '.[].job_id' \
  | while read -r job_id; do
      curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
        -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
    done
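
After the loop completes, the same internal endpoint can be polled to confirm the backlog is draining (this assumes the response is a JSON array, as in the loop above):

curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=ARTIFACT_UPLOAD_FAILED&limit=50" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq 'length'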

4. Runtime Preemption Storms

Symptoms: Large numbers of Vertex AI custom jobs move to JOB_STATE_FAILED within a short window with the error Preempted; the DLQ message count spikes; the customer-facing error rate increases.

Preemption storms occur when GCP reclaims preemptible/Spot GPU VMs across a region simultaneously, typically during high-demand periods.

Immediate containment

# Pause new preemptible job submission
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  GPU_VM_TYPE=on-demand   # Switch to on-demand temporarily

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
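
To confirm the switch is live in the restarted pods (GPU_VM_TYPE is the variable set above; this assumes the container image provides printenv):

kubectl exec -n gpu-runtime deploy/gpu-scheduler -- printenv GPU_VM_TYPE
# Expected: on-demand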

Assess scope

# Count failed jobs in the last 30 minutes
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '30 minutes ago' -Iseconds)" \
  --format="value(name)" | wc -l

# Review failed jobs' display names and error messages to see whether failures cluster on one accelerator type
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '30 minutes ago' -Iseconds)" \
  --format="table(displayName,error.message)"

Recovery

1. Replay preempted jobs from the DLQ

Follow the DLQ Recovery Runbook for replaying messages with the JOB_PREEMPTED error code.

2. Monitor on-demand job success rate

Watch the gpu_job_success_rate metric in Cloud Monitoring until it returns above 95%.
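
If a CLI check is more convenient than the console, the metric can also be read via the Monitoring API's timeSeries.list method; the metric type custom.googleapis.com/gpu_job_success_rate below is an assumption and should be replaced with the actual metric descriptor:

# Read the last 30 minutes of the (assumed) success-rate metric
curl -s --get \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
  --data-urlencode 'filter=metric.type = "custom.googleapis.com/gpu_job_success_rate"' \
  --data-urlencode "interval.startTime=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"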

3. Re-enable preemptible jobs after storm clears

kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  GPU_VM_TYPE=preemptible
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
Preemption storms usually resolve within 30–90 minutes. If they persist for more than 2 hours, switch permanently to on-demand and file a GCP support ticket.

5. Pub/Sub Consumer Failure

Symptoms: Messages accumulate on gpu-job-request-sub; oldestUnackedMessageAge grows continuously; no new Vertex AI jobs are submitted even though the scheduler pod is Running.

Diagnose

# Check subscription backlog
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"

# Confirm the scheduler is actively pulling
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100 | \
  grep -E "pubsub|pull|subscription"

# Check scheduler resource usage (CPU/memory pressure can stall the pull loop)
kubectl top pod -n gpu-runtime -l app=gpu-scheduler

Recover

1. Restart the scheduler to re-establish the Pub/Sub stream

kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

2. Verify the subscription is delivering

# Check that ackIds are being processed (backlog should decrease)
watch -n10 "gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format='value(numUndeliveredMessages)'"

3. If backlog does not decrease after 5 minutes, check IAM

# Confirm the scheduler SA still has pubsub.subscriber role
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com AND bindings.role:roles/pubsub.subscriber" \
  --format="table(bindings.role)"

4. Check for Pub/Sub service health incidents

Review the GCP Status Dashboard for active Pub/Sub incidents in the deployment region.
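
The same check can be scripted against the public JSON feed behind the status dashboard; treat the jq filter as a sketch, since the feed's schema is not guaranteed to stay stable:

# Entries without an "end" timestamp are still ongoing; look for Pub/Sub in
# the incident descriptions.
curl -s https://status.cloud.google.com/incidents.json | \
  jq '[.[] | select(.end == null)]'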

5. Escalate if backlog exceeds retention window

The default message retention is 7 days. If the outage is expected to exceed this window, escalate to Engineering to consider manual message republishing from Datastore backups.
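
If Engineering approves manual republishing, the mechanics are a loop over the recovered payloads with gcloud pubsub topics publish. This is a hypothetical sketch only: the topic name gpu-job-request and the export file are assumptions, and the payload format must match what the scheduler expects:

# Hypothetical sketch: replay exported job-request payloads (one JSON object
# per line in jobs-export.jsonl) onto the assumed request topic.
while IFS= read -r payload; do
  gcloud pubsub topics publish gpu-job-request \
    --project=impulse-gpu-runtime \
    --message="$payload"
done < jobs-export.jsonl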

Escalation Matrix

Failure Mode              First Responder   Escalate To             SLA
Scheduler pod crash       On-call SRE       Platform Engineering    30 min
Vertex quota exhaustion   On-call SRE       Cloud Account Team      2 hours
Artifact upload failure   On-call SRE       Storage/Infra Team      1 hour
Preemption storm          On-call SRE       On-call SRE (monitor)   2 hours
Pub/Sub consumer failure  On-call SRE       Platform Engineering    30 min