# GPU Runtime Incident Response Checklist

> Production operations checklist for GPU Runtime incidents — containment, triage, service recovery, artifact validation, and postmortem requirements

This checklist is the authoritative guide for responding to GPU Runtime production incidents. Work through each phase sequentially. All actions should be logged in the active incident ticket.

Create an incident ticket in Linear before beginning this checklist. Assign an **Incident Commander (IC)** and a **Communications Lead (CL)** for any P1 or P2 incident.

***

## Severity Definitions

| Severity          | Criteria                                                   | Response Time         |
| ----------------- | ---------------------------------------------------------- | --------------------- |
| **P1 — Critical** | > 25 % of GPU jobs failing; budget exceeded; data loss     | Immediate (\< 15 min) |
| **P2 — High**     | 5–25 % job failure rate; quota > 90 %; DLQ > 50 messages   | \< 30 min             |
| **P3 — Medium**   | Elevated error rate (\< 5 %); single-component degradation | \< 2 hours            |
| **P4 — Low**      | Non-customer-impacting anomaly; cost warning               | Next business day     |

***

## Phase 1: Containment

*Goal: Stop the bleeding — prevent further impact from spreading.*

* [ ] **Acknowledge the alert** in PagerDuty and mark yourself as responder.
* [ ] **Open an incident ticket** in Linear with severity, symptoms, and time of detection.
* [ ] **Join the incident channel** (`#incident-gpu-runtime` in Slack) and post the Linear ticket link.
* [ ] **Assess scope** — how many jobs are affected? Which region(s)?
  ```bash
  # Count jobs that failed in the last 15 minutes (date -d requires GNU date)
  gcloud ai custom-jobs list \
    --project=impulse-gpu-runtime \
    --region=us-central1 \
    --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '15 minutes ago' -Iseconds)" \
    --format="value(name)" | wc -l
  ```

* [ ] **Pause job intake** if failure rate > 25 % or budget is exhausted:

  ```bash
  kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=true
  ```

* [ ] **Notify customers** (via status page) if P1/P2 with customer-visible impact. CL owns this step.
* [ ] **Verify containment** — confirm that no new job failures are appearing.

***

## Phase 2: Triage

*Goal: Identify the root cause so the right recovery action can be selected.*

* [ ] **Check scheduler pod health:**

  ```bash
  kubectl get pods -n gpu-runtime -l app=gpu-scheduler
  kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100
  ```

* [ ] **Check Pub/Sub subscription backlog:**

  ```bash
  gcloud pubsub subscriptions describe gpu-job-request-sub \
    --project=impulse-gpu-runtime \
    --format="yaml(numUndeliveredMessages,oldestUnackedMessageAge)"
  ```

* [ ] **Check DLQ message count:**

  ```bash
  gcloud pubsub subscriptions describe gpu-job-request-dlq-sub \
    --project=impulse-gpu-runtime \
    --format="yaml(numUndeliveredMessages)"
  ```

* [ ] **Check Vertex AI quota utilization:**

  ```bash
  # Each quota entry prints as limit, metric, usage, so include the line
  # before and the line after the metric name
  gcloud compute regions describe us-central1 \
    --project=impulse-gpu-runtime \
    --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS"
  ```

* [ ] **Check GPU cost and budget status:**

  ```bash
  # Review current vs. monthly budget in GCP Console → Billing
  ```

* [ ] **Check for GCP incidents** at [status.cloud.google.com](https://status.cloud.google.com) (Vertex AI, Pub/Sub).
* [ ] **Determine failure category** and select the appropriate runbook:
  * Scheduler crash → [Failure Recovery Runbook § 1](/gpu-operations/failure-recovery)
  * Quota exhaustion → [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion)
  * DLQ backlog → [DLQ Recovery Runbook](/gpu-operations/dlq-recovery)
  * Cost anomaly → [Cost Anomaly Runbook](/gpu-operations/cost-anomaly)
  * Artifact upload failure → [Failure Recovery Runbook § 3](/gpu-operations/failure-recovery)
  * Preemption storm → [Failure Recovery Runbook § 4](/gpu-operations/failure-recovery)
  * Pub/Sub consumer failure → [Failure Recovery Runbook § 5](/gpu-operations/failure-recovery)
* [ ] **Post triage summary** to incident channel with root cause hypothesis and recovery plan.

***

## Phase 3: Service Recovery

*Goal: Restore GPU job processing to the normal operating state.*

* [ ] **Execute the relevant runbook** from the triage phase above.
* [ ] **Validate scheduler restart** (if applicable):

  ```bash
  kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
  kubectl exec -n gpu-runtime deploy/gpu-scheduler -- curl -s http://localhost:8080/healthz
  ```

* [ ] **Resume job intake** once the root cause is resolved and health checks pass:

  ```bash
  kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
  ```

* [ ] **Monitor job submission success rate** for 10 minutes after resuming:

  ```bash
  # \$(...) is escaped so the 5-minute window is recomputed on every refresh
  # instead of being fixed when watch starts (date -d requires GNU date)
  watch -n30 "gcloud ai custom-jobs list \
    --project=impulse-gpu-runtime --region=us-central1 \
    --filter=\"state=JOB_STATE_FAILED AND updateTime>\$(date -d '5 minutes ago' -Iseconds)\" \
    --format='value(name)' | wc -l"
  ```

* [ ] **Replay DLQ messages** once the scheduler is healthy. See [DLQ Recovery Runbook](/gpu-operations/dlq-recovery).
* [ ] **Update status page** to "Monitoring" and then "Resolved" as service recovers.
* [ ] **Notify customers** of resolution (CL step).
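
The 10-minute monitoring step above can also be scripted so that it exits as soon as failures reappear. The following is a minimal sketch, not part of the official tooling: the function names `monitor_failures` and `count_recent_failures`, the sample count, and the zero-failure threshold are all assumptions made for illustration. Only the `gcloud` query itself comes from this checklist.

```bash
#!/usr/bin/env bash
# Sketch only: monitor_failures and count_recent_failures are hypothetical
# helper names, and the thresholds are illustrative assumptions.

# Counts jobs that failed in the last 5 minutes (same query as the checklist;
# date -d requires GNU date).
count_recent_failures() {
  gcloud ai custom-jobs list \
    --project=impulse-gpu-runtime --region=us-central1 \
    --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '5 minutes ago' -Iseconds)" \
    --format='value(name)' | wc -l
}

# monitor_failures COUNT_CMD SAMPLES INTERVAL_SECONDS MAX_FAILURES
# Runs COUNT_CMD once per sample. Returns 0 if every sample stays at or below
# MAX_FAILURES, 1 as soon as one sample exceeds it, and 2 if COUNT_CMD fails
# or prints something that is not a non-negative integer.
monitor_failures() {
  local count_cmd=$1 samples=$2 interval=$3 max_failures=$4
  local i count
  for ((i = 1; i <= samples; i++)); do
    count=$("$count_cmd") || return 2
    count=${count//[[:space:]]/}          # wc -l may pad its output
    [[ $count =~ ^[0-9]+$ ]] || return 2
    echo "sample $i/$samples: $count failed job(s)"
    if ((10#$count > max_failures)); then
      return 1
    fi
    if ((i < samples)); then
      sleep "$interval"
    fi
  done
  return 0
}

# Example invocation (not executed here): 21 samples taken 30 s apart span
# 10 minutes, tolerating no new failures.
#   monitor_failures count_recent_failures 21 30 0 || echo "failures detected"
```

Passing the counting command as a function name keeps the loop independent of `gcloud`, so it can be exercised against a stub before it is needed during an incident.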
***

## Phase 4: Artifact Validation

*Goal: Ensure no training artifacts were lost or corrupted during the incident.*

* [ ] **Identify jobs that ran during the incident window:**

  ```bash
  # Replace INCIDENT_START and INCIDENT_END with ISO 8601 timestamps
  # (for example, the output of `date -Iseconds`)
  gcloud ai custom-jobs list \
    --project=impulse-gpu-runtime \
    --region=us-central1 \
    --filter="updateTime>INCIDENT_START AND updateTime<INCIDENT_END"
  ```

* [ ] **Identify jobs that completed successfully but lack artifacts** (potential upload failures):

  ```bash
  # Prints the number of succeeded jobs that have no artifacts
  curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true&since=INCIDENT_START" \
    -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq length
  ```

* [ ] **Trigger artifact re-upload** for any jobs with missing artifacts:

  ```bash
  # List the affected job IDs, then request a retry for each one
  curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true" \
    -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
    | jq -r '.[].job_id' \
    | while read -r job_id; do
        curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
          -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
      done
  ```

* [ ] **Confirm artifact counts match** expected volumes (jobs × output files per job).
* [ ] **Document any permanently lost artifacts** in the incident ticket.

***

## Phase 5: Postmortem Requirements

A postmortem is **required** for all P1 and P2 incidents, and is recommended for recurring P3 incidents.

### Postmortem must be completed within:

* **P1:** 48 hours
* **P2:** 5 business days

### Required sections

* [ ] **Incident summary** — brief description, duration, severity, customer impact
* [ ] **Timeline** — minute-by-minute log from first alert to resolution (use incident ticket history)
* [ ] **Root cause analysis** — technical root cause (not "human error")
* [ ] **Contributing factors** — monitoring gaps, deployment process issues, etc.
* [ ] **Impact quantification** — number of affected jobs, customers, estimated cost, data loss (if any)
* [ ] **Action items** — specific, owner-assigned, time-bound remediation tasks:
  * At least one action item to **detect the issue faster**
  * At least one action item to **reduce impact**
  * At least one action item to **prevent recurrence**
* [ ] **Postmortem reviewed** by IC, CL, and Engineering Lead
* [ ] **Postmortem linked** in the Linear incident ticket

### Postmortem template location

Postmortems are stored in Notion under **Engineering → Incident Postmortems → GPU Runtime**. Use the standard template at the top of that page.

***

## Quick Reference

| Runbook                                                          | Use When                                                                    |
| ---------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [Deployment Runbook](/gpu-operations/deployment)                 | Deploying or rolling back the GPU Runtime stack                             |
| [DLQ Recovery Runbook](/gpu-operations/dlq-recovery)             | DLQ has undelivered messages; replaying failed jobs                         |
| [Failure Recovery Runbook](/gpu-operations/failure-recovery)     | Scheduler crash, preemption storm, upload failure, Pub/Sub consumer failure |
| [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion)     | Vertex AI GPU quota at or near 100 %                                        |
| [Cost Anomaly Runbook](/gpu-operations/cost-anomaly)             | Budget exceeded; runaway retries; high-cost sessions                        |
| [Incident Response Checklist](/gpu-operations/incident-response) | *This document* — use for all P1/P2 incidents                               |

***

## Key Contacts

| Role                          | Responsibility                             |
| ----------------------------- | ------------------------------------------ |
| On-call SRE                   | First responder for all GPU Runtime alerts |
| Platform Engineering Lead     | Escalation for infrastructure issues       |
| Backend Engineering Lead      | Escalation for scheduler and API issues    |
| Finance                       | Budget increase approvals                  |
| GCP Technical Account Manager | Emergency quota increase requests          |
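
As a companion to the Severity Definitions table at the top of this checklist, the numeric thresholds can be expressed as a small shell function for use in alert scripts. This is a rough sketch under stated assumptions: the function name `classify_severity` and its argument order are hypothetical, percentages are whole numbers, any non-zero failure rate below 5 % is treated as "elevated" (P3), and everything else falls through to P4. Criteria that need human judgment, such as single-component degradation or a cost warning, are not modeled.

```bash
#!/usr/bin/env bash
# Sketch only: classify_severity is a hypothetical helper, not existing tooling.
# It encodes just the numeric criteria from the Severity Definitions table.

# classify_severity FAILURE_PCT QUOTA_PCT DLQ_COUNT BUDGET_EXCEEDED DATA_LOSS
#   FAILURE_PCT, QUOTA_PCT  whole-number percentages (assumption)
#   DLQ_COUNT               undelivered messages in the DLQ
#   BUDGET_EXCEEDED, DATA_LOSS  the literal strings "true" or "false"
classify_severity() {
  local failure_pct=$1 quota_pct=$2 dlq_count=$3 budget_exceeded=$4 data_loss=$5
  if ((failure_pct > 25)) || [[ $budget_exceeded == true || $data_loss == true ]]; then
    echo P1   # > 25 % of jobs failing, budget exceeded, or data loss
  elif ((failure_pct >= 5)) || ((quota_pct > 90)) || ((dlq_count > 50)); then
    echo P2   # 5-25 % failure rate, quota > 90 %, or DLQ > 50 messages
  elif ((failure_pct > 0)); then
    echo P3   # assumption: any non-zero failure rate below 5 % is "elevated"
  else
    echo P4   # assumption: nothing numeric tripped, so treat as low severity
  fi
}

# Example: 12 % failure rate, 40 % quota, 3 DLQ messages, no budget or data issue
classify_severity 12 40 3 false false   # prints P2
```

The result is a starting point for the on-call SRE; the Incident Commander still sets the final severity.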