# GPU Runtime Failure Recovery Runbook

> Recovery procedures for scheduler outages, Vertex quota exhaustion, artifact upload failures, preemption storms, and Pub/Sub consumer failures

This runbook provides step-by-step recovery procedures for the most common GPU Runtime failure modes. Each section is self-contained and can be followed independently.

***

## 1. Scheduler Outage

**Symptoms:** Jobs remain `QUEUED` indefinitely; no new Vertex AI jobs are being submitted; `gpu-scheduler` pods are in `CrashLoopBackOff` or `Error` state.

### Diagnose

```bash theme={null}
# Check pod status
kubectl get pods -n gpu-runtime -l app=gpu-scheduler

# Review recent crash logs
kubectl logs -n gpu-runtime -l app=gpu-scheduler --previous --tail=100

# Check Kubernetes events for OOM kills or probe failures
kubectl describe pod -n gpu-runtime -l app=gpu-scheduler | \
  grep -A5 "Events:"
```
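
To triage the `describe` output quickly, the text can be scanned for the usual Kubernetes markers of the two causes named above: `OOMKilled` for out-of-memory kills, and `Unhealthy` or "probe failed" for probe failures. The sketch below is a minimal illustration; the function name `classify_pod_failure` is hypothetical and not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Sketch: classify a scheduler crash from `kubectl describe pod` text on stdin.
# classify_pod_failure is a hypothetical helper name.
classify_pod_failure() {
  local text
  text="$(cat)"
  if grep -q 'OOMKilled' <<<"$text"; then
    echo "oom"        # container exceeded its memory limit
  elif grep -Eq 'Unhealthy|probe failed' <<<"$text"; then
    echo "probe"      # liveness/readiness probe failures
  else
    echo "unknown"    # fall back to reading the crash logs
  fi
}

# Usage:
#   kubectl describe pod -n gpu-runtime -l app=gpu-scheduler | classify_pod_failure
```

An `oom` result points at the pod's memory limits, while `probe` points at the health endpoint or its probe settings.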

### Recover

<Steps>
  <Step title="Restart the scheduler deployment">
    ```bash theme={null}
    kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    ```
  </Step>

  <Step title="Verify scheduler health">
    ```bash theme={null}
    kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
      curl -s http://localhost:8080/healthz
    # Expected: {"status":"ok","version":"..."}
    ```
  </Step>

  <Step title="Check Pub/Sub subscription backlog">
    ```bash theme={null}
    gcloud pubsub subscriptions describe gpu-job-request-sub \
      --project=impulse-gpu-runtime \
      --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"
    ```

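    If `oldestUnackedMessageAge` exceeds the `ackDeadlineSeconds` (600 s), messages have likely been redelivered at least once, and any that exhausted the subscription's maximum delivery attempts may have already moved to the DLQ. Follow the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery).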
  </Step>

  <Step title="Roll back if restart does not resolve the issue">
    ```bash theme={null}
    kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
    ```

    See the [Deployment Runbook](/gpu-operations/deployment) for full rollback procedures.
  </Step>
</Steps>
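
After a restart or rollback, the health check can be polled instead of run once. The sketch below reuses the `/healthz` command from the verification step and treats a body containing `"status":"ok"` as healthy. The helper names `healthz_ok` and `wait_for_scheduler`, and the attempt and interval values, are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch: poll the scheduler health endpoint until it reports "ok".
# healthz_ok and wait_for_scheduler are hypothetical helper names.

# Return 0 when the response body on stdin reports status "ok".
healthz_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# $1: maximum attempts, $2: seconds to wait between attempts
wait_for_scheduler() {
  local attempts=$1 interval=$2 i
  for (( i = 1; i <= attempts; i++ )); do
    if kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
         curl -s http://localhost:8080/healthz | healthz_ok; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "scheduler not healthy after ${attempts} attempts" >&2
  return 1
}

# Usage (values are examples only):
#   wait_for_scheduler 12 5
```

A non-zero exit from `wait_for_scheduler` is a reasonable trigger for the rollback step.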

***

## 2. Vertex AI Quota Exhaustion

**Symptoms:** New custom jobs fail immediately with `RESOURCE_EXHAUSTED`; Cloud Monitoring shows quota utilization at 100%.

See the dedicated [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion) for full recovery steps. Quick summary:

1. Pause new job intake (modify the subscription or set the scheduler environment variable `PAUSE_JOB_INTAKE=true`).
2. Request an emergency quota increase via the GCP Console or Support.
3. Resume intake once quota headroom is confirmed.
4. Replay DLQ messages that failed due to quota errors.
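
Steps 1 and 3 can be sketched as a single toggle around `PAUSE_JOB_INTAKE`, following the same `kubectl set env` pattern used elsewhere in this runbook. This sketch assumes that setting the variable to `false` resumes intake, which is not confirmed here. The function name `set_job_intake` and the `DRY_RUN` switch are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: pause or resume scheduler job intake.
# Assumption: PAUSE_JOB_INTAKE=false resumes intake (only =true is documented).
# set_job_intake and DRY_RUN are hypothetical names.
set_job_intake() {
  local value
  case "$1" in
    pause)  value=true ;;
    resume) value=false ;;
    *) echo "usage: set_job_intake pause|resume" >&2; return 2 ;;
  esac
  local cmd=(kubectl set env deployment/gpu-scheduler -n gpu-runtime "PAUSE_JOB_INTAKE=${value}")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"    # print the command without running it
  else
    "${cmd[@]}" && kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
  fi
}

# Usage:
#   DRY_RUN=1 set_job_intake pause   # preview
#   set_job_intake pause             # apply and wait for the rollout
```

Running with `DRY_RUN=1` first makes it easy to confirm the exact command before applying it during an incident.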

***

## 3. Artifact Upload Failure

**Symptoms:** Jobs complete on Vertex AI but output artifacts are missing from GCS; jobs show `ARTIFACT_UPLOAD_FAILED` in the result message.

### Diagnose

```bash theme={null}
# Check GCS bucket permissions for the runtime SA
gsutil iam get gs://impulse-gpu-artifacts

# List recent failed uploads (look for 403 / 429 in GCS audit logs)
gcloud logging read \
  'resource.type="gcs_bucket" AND protoPayload.status.code!=0 AND resource.labels.bucket_name="impulse-gpu-artifacts"' \
  --project=impulse-gpu-runtime \
  --limit=20 \
  --format="table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.status.message)"
```

### Recover

<Steps>
  <Step title="Verify runtime service account permissions">
    ```bash theme={null}
    gcloud storage buckets get-iam-policy gs://impulse-gpu-artifacts \
      --format="json" | jq '.bindings[] | select(.role == "roles/storage.objectAdmin")'
    ```

    The `gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com` SA must be listed. If missing, add it:

    ```bash theme={null}
    gsutil iam ch \
      serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com:objectAdmin \
      gs://impulse-gpu-artifacts
    ```
  </Step>

  <Step title="Check bucket quota and storage class">
    ```bash theme={null}
    gsutil du -s gs://impulse-gpu-artifacts
    ```

    Confirm the bucket has not hit a soft storage cap. If 429 rate-limit errors appear, implement exponential backoff in the runtime upload code.
  </Step>

  <Step title="Trigger artifact re-upload for failed jobs">
    ```bash theme={null}
    # Re-queue jobs with ARTIFACT_UPLOAD_FAILED status
    curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=ARTIFACT_UPLOAD_FAILED&limit=50" \
      -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
      | jq -r '.[].job_id' \
      | while read -r job_id; do
          curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
            -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
        done
    ```
  </Step>
</Steps>
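
The exponential backoff recommended for 429 responses can be sketched as a generic retry wrapper. This is a minimal illustration in shell, not the runtime's actual upload code; the function name `retry_with_backoff` and the attempt and delay values are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: retry a command with exponential backoff.
# retry_with_backoff is a hypothetical helper, not the runtime's upload code.
# $1: maximum attempts, $2: initial delay in seconds, remaining args: command to run
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))          # double the wait before the next attempt
    attempt=$(( attempt + 1 ))
  done
}

# Usage (the local file and object path are hypothetical):
#   retry_with_backoff 5 2 gsutil cp ./output.tar "gs://impulse-gpu-artifacts/${JOB_ID}/output.tar"
```

With 5 attempts and a 2-second initial delay, the waits are 2, 4, 8, and 16 seconds. Production code would typically also add random jitter and cap the maximum delay.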

***

## 4. Runtime Preemption Storms

**Symptoms:** Large numbers of Vertex AI custom jobs moving to `JOB_STATE_FAILED` in a short window with error `Preempted`; DLQ message count spikes; customer-facing error rate increases.

Preemption storms occur when GCP reclaims preemptible/Spot GPU VMs in a region simultaneously, typically during high-demand periods.

### Immediate containment

```bash theme={null}
# Route new submissions to on-demand VMs (stops launching preemptible jobs)
kubectl set env deployment/gpu-scheduler \
  -n gpu-runtime \
  GPU_VM_TYPE=on-demand   # Switch to on-demand temporarily

kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

### Assess scope

```bash theme={null}
# Timestamp 30 minutes ago, in UTC (GNU date syntax)
SINCE="$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)"

# Count failed jobs in the last 30 minutes
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>\"${SINCE}\"" \
  --format="value(name)" | wc -l

# Check if failures are isolated to a single accelerator type
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_FAILED AND updateTime>\"${SINCE}\"" \
  --format="table(displayName,jobSpec.workerPoolSpecs[0].machineSpec.acceleratorType,error.message)"
```
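
When the failed-job list is long, grouping by error message shows quickly whether `Preempted` dominates or other failures are mixed in. The sketch below tallies lines read from stdin; the helper name `tally_errors` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count occurrences of each error message read from stdin,
# most frequent first. tally_errors is a hypothetical helper name.
tally_errors() {
  sort | uniq -c | sort -rn |
    awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count "\t" $0 }'
}

# Usage: feed it one error message per line, for example the output of the
# custom-jobs listing above with --format="value(error.message)":
#   gcloud ai custom-jobs list ... --format="value(error.message)" | tally_errors
```

If the non-`Preempted` share is large, the incident is probably not only a preemption storm and the other sections of this runbook may apply.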

### Recovery

<Steps>
  <Step title="Replay preempted jobs from the DLQ">
    Follow the [DLQ Recovery Runbook](/gpu-operations/dlq-recovery) for replaying `JOB_PREEMPTED` error-code messages.
  </Step>

  <Step title="Monitor on-demand job success rate">
    Watch the `gpu_job_success_rate` metric in Cloud Monitoring until it returns above 95%.
  </Step>

  <Step title="Re-enable preemptible jobs after storm clears">
    ```bash theme={null}
    kubectl set env deployment/gpu-scheduler \
      -n gpu-runtime \
      GPU_VM_TYPE=preemptible
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    ```
  </Step>
</Steps>

<Note>
  Preemption storms usually resolve within 30–90 minutes. If they persist for more than 2 hours, switch permanently to on-demand and file a GCP support ticket.
</Note>

***

## 5. Pub/Sub Consumer Failure

**Symptoms:** Messages accumulate on `gpu-job-request-sub`; `oldestUnackedMessageAge` grows continuously; no new Vertex AI jobs are submitted even though the scheduler pod is `Running`.

### Diagnose

```bash theme={null}
# Check subscription backlog
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"

# Confirm the scheduler is actively pulling
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100 | \
  grep -E "pubsub|pull|subscription"

# Check scheduler CPU and memory usage (kubectl top does not report pull errors)
kubectl top pod -n gpu-runtime -l app=gpu-scheduler
```
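
Backlog size and oldest unacked message age are also published as the Cloud Monitoring metrics `pubsub.googleapis.com/subscription/num_undelivered_messages` and `pubsub.googleapis.com/subscription/oldest_unacked_message_age`. If the `describe` output shows only subscription configuration, the sketch below reads those metrics from the Monitoring API `timeSeries` endpoint instead. The helper names `backlog_filter` and `fetch_backlog_metric` and the 5-minute window are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch: read Pub/Sub backlog metrics from the Cloud Monitoring API.
# backlog_filter and fetch_backlog_metric are hypothetical helper names.
PROJECT="impulse-gpu-runtime"
SUBSCRIPTION="gpu-job-request-sub"

# Build the Monitoring filter for one subscription metric.
# $1: metric suffix, e.g. num_undelivered_messages
backlog_filter() {
  printf 'metric.type="pubsub.googleapis.com/subscription/%s" AND resource.labels.subscription_id="%s"' \
    "$1" "$SUBSCRIPTION"
}

# Fetch the last 5 minutes of data points for a metric as JSON.
# Requires curl, gcloud credentials, and GNU date.
fetch_backlog_metric() {
  curl -sS -G "https://monitoring.googleapis.com/v3/projects/${PROJECT}/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=$(backlog_filter "$1")" \
    --data-urlencode "interval.startTime=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Usage:
#   fetch_backlog_metric num_undelivered_messages | jq '.timeSeries[0].points[0].value'
#   fetch_backlog_metric oldest_unacked_message_age | jq '.timeSeries[0].points[0].value'
```

Monitoring metrics lag real time by a minute or more, so a single reading is less informative than the trend across several.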

### Recover

<Steps>
  <Step title="Restart the scheduler to re-establish the Pub/Sub stream">
    ```bash theme={null}
    kubectl rollout restart deployment/gpu-scheduler -n gpu-runtime
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    ```
  </Step>

  <Step title="Verify the subscription is delivering">
    ```bash theme={null}
    # Check that ackIds are being processed (backlog should decrease)
    watch -n10 "gcloud pubsub subscriptions describe gpu-job-request-sub \
      --project=impulse-gpu-runtime \
      --format='value(numUndeliveredMessages)'"
    ```
  </Step>

  <Step title="If backlog does not decrease after 5 minutes, check IAM">
    ```bash theme={null}
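    # Confirm the scheduler SA still has the pubsub.subscriber role at the project level.
    # If the role is granted on the subscription itself, check it with:
    #   gcloud pubsub subscriptions get-iam-policy gpu-job-request-sub --project=impulse-gpu-runtime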
    gcloud projects get-iam-policy impulse-gpu-runtime \
      --flatten="bindings[].members" \
      --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com AND bindings.role:roles/pubsub.subscriber" \
      --format="table(bindings.role)"
    ```
  </Step>

  <Step title="Check for Pub/Sub service health incidents">
    Review the [GCP Status Dashboard](https://status.cloud.google.com) for active Pub/Sub incidents in the deployment region.
  </Step>

  <Step title="Escalate if backlog exceeds retention window">
    The default message retention is 7 days. If the outage is expected to exceed this window, escalate to Engineering to consider manual message republishing from Datastore backups.
  </Step>
</Steps>
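
The "backlog does not decrease after 5 minutes" check can be made explicit by sampling the backlog at a fixed interval and comparing the first and last readings. The sketch below takes the sampling command as an argument, so it works with whichever backlog query is available. The helper name `backlog_is_draining` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: decide whether the subscription backlog is shrinking.
# backlog_is_draining is a hypothetical helper name.
# $1: number of samples, $2: seconds between samples,
# remaining args: a command that prints the current backlog as an integer
backlog_is_draining() {
  local samples=$1 interval=$2
  shift 2
  local first="" last="" i
  for (( i = 0; i < samples; i++ )); do
    last="$("$@")"
    if [ -z "$first" ]; then
      first="$last"
    fi
    if (( i < samples - 1 )); then
      sleep "$interval"
    fi
  done
  echo "backlog: ${first} -> ${last}"
  (( last < first ))   # exit 0 only if the backlog shrank
}

# Usage: 31 samples 10 seconds apart span 5 minutes.
#   current_backlog() { <command that prints numUndeliveredMessages>; }
#   backlog_is_draining 31 10 current_backlog || echo "backlog not draining, check IAM"
```

Comparing only the first and last samples tolerates short spikes while still catching a consumer that has stopped acknowledging messages.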

***

## Escalation Matrix

| Failure Mode             | First Responder | Escalate To           | SLA     |
| ------------------------ | --------------- | --------------------- | ------- |
| Scheduler pod crash      | On-call SRE     | Platform Engineering  | 30 min  |
| Vertex quota exhaustion  | On-call SRE     | Cloud Account Team    | 2 hours |
| Artifact upload failure  | On-call SRE     | Storage/Infra Team    | 1 hour  |
| Preemption storm         | On-call SRE     | On-call SRE (monitor) | 2 hours |
| Pub/Sub consumer failure | On-call SRE     | Platform Engineering  | 30 min  |
