This checklist is the authoritative guide for responding to GPU Runtime production incidents. Work through each phase sequentially. All actions should be logged in the active incident ticket.
Documentation Index
Fetch the complete documentation index at: https://docs.impulselabs.ai/llms.txt
Use this file to discover all available pages before exploring further.
Create an incident ticket in Linear before beginning this checklist. Assign an Incident Commander (IC) and a Communications Lead (CL) for any P1 or P2 incident.
Severity Definitions
| Severity | Criteria | Response Time |
|---|---|---|
| P1 — Critical | > 25 % of GPU jobs failing; budget exceeded; data loss | Immediate (< 15 min) |
| P2 — High | 5–25 % job failure rate; quota > 90 %; DLQ > 50 messages | < 30 min |
| P3 — Medium | Elevated error rate (< 5 %); single-component degradation | < 2 hours |
| P4 — Low | Non-customer-impacting anomaly; cost warning | Next business day |
Phase 1: Containment
Goal: Stop the bleeding — prevent further impact from spreading.
- Acknowledge the alert in PagerDuty and mark yourself as responder.
- Open an incident ticket in Linear with severity, symptoms, and time of detection.
- Join the incident channel (#incident-gpu-runtime in Slack) and post the Linear ticket link.
- Assess scope — how many jobs are affected? Which region(s)?
- Pause job intake if failure rate > 25 % or budget is exhausted:
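  A minimal sketch, assuming job intake runs as a Kubernetes Deployment named `job-intake` in a `gpu-runtime` namespace (both names are assumptions; substitute your real resources):

  ```bash
  # Scale the (assumed) intake deployment to zero so no new jobs are accepted.
  kubectl scale deployment/job-intake --replicas=0 -n gpu-runtime

  # Confirm no intake pods remain.
  kubectl get pods -n gpu-runtime -l app=job-intake
  ```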
- Notify customers (via status page) if P1/P2 with customer-visible impact. CL owns this step.
- Verify containment — confirm no new failures are being created.
Phase 2: Triage
Goal: Identify the root cause so the right recovery action can be selected.
- Check scheduler pod health:
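  For example, if the scheduler runs in Kubernetes (namespace and label are assumptions):

  ```bash
  # List scheduler pods, their status, and restart counts.
  kubectl get pods -n gpu-runtime -l app=scheduler

  # If a pod is crash-looping, pull logs from the previous container instance.
  kubectl logs -n gpu-runtime -l app=scheduler --previous --tail=100
  ```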
- Check Pub/Sub subscription backlog:
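  Backlog is exposed as the Cloud Monitoring gauge `pubsub.googleapis.com/subscription/num_undelivered_messages`; a sketch querying the last 10 minutes via the Monitoring API (`PROJECT_ID` is a placeholder; `date -d` is GNU syntax):

  ```bash
  # One time series per subscription; a growing value means consumers are behind.
  curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
    --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages"' \
    --data-urlencode "interval.startTime=$(date -u -d '-10 min' +%FT%TZ)" \
    --data-urlencode "interval.endTime=$(date -u +%FT%TZ)"
  ```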
- Check DLQ message count:
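  The same metric covers the DLQ (read the series whose `subscription_id` is the dead-letter subscription). To inspect a sample without acknowledging anything (subscription name is an assumption):

  ```bash
  # Peek at up to 5 DLQ messages; without --auto-ack nothing is consumed.
  gcloud pubsub subscriptions pull gpu-jobs-dlq-sub --limit=5 --format=json
  ```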
- Check Vertex AI quota utilization:
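  One option, assuming your gcloud release ships the alpha Service Usage quota surface (otherwise use the IAM & Admin → Quotas page in the console):

  ```bash
  # List Vertex AI quotas for the project; PROJECT_ID is a placeholder.
  gcloud alpha services quota list \
    --service=aiplatform.googleapis.com \
    --consumer=projects/${PROJECT_ID}
  ```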
- Check GPU cost and budget status:
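  A sketch using the budgets CLI (billing account ID is a placeholder; older gcloud releases may need the beta component):

  ```bash
  # Show configured budgets and their threshold rules.
  gcloud billing budgets list --billing-account=${BILLING_ACCOUNT_ID}
  ```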
- Check for GCP incidents at status.cloud.google.com (Vertex AI, Pub/Sub).
- Determine failure category and select the appropriate runbook:
  - Scheduler crash → Failure Recovery Runbook § 1
  - Quota exhaustion → Quota Exhaustion Runbook
  - DLQ backlog → DLQ Recovery Runbook
  - Cost anomaly → Cost Anomaly Runbook
  - Artifact upload failure → Failure Recovery Runbook § 3
  - Preemption storm → Failure Recovery Runbook § 4
  - Pub/Sub consumer failure → Failure Recovery Runbook § 5
- Post triage summary to incident channel with root cause hypothesis and recovery plan.
Phase 3: Service Recovery
Goal: Restore GPU job processing to the normal operating state.
- Execute the relevant runbook from the triage phase above.
- Validate scheduler restart (if applicable):
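  Assuming the Kubernetes names from the triage phase:

  ```bash
  # Block until the rollout finishes, then confirm pods are Ready
  # with no fresh restarts.
  kubectl rollout status deployment/scheduler -n gpu-runtime
  kubectl get pods -n gpu-runtime -l app=scheduler
  ```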
- Resume job intake once the root cause is resolved and health checks pass:
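  The reverse of the containment step; the replica count of 3 is a placeholder for your normal operating value:

  ```bash
  # Restore the (assumed) intake deployment and wait for it to become Ready.
  kubectl scale deployment/job-intake --replicas=3 -n gpu-runtime
  kubectl rollout status deployment/job-intake -n gpu-runtime
  ```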
- Monitor job submission success rate for 10 minutes after resuming:
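  A minimal polling loop; the metrics endpoint below is hypothetical, so substitute your real dashboard or metric query:

  ```bash
  # Sample the success rate once a minute for 10 minutes.
  for i in $(seq 1 10); do
    date -u +%FT%TZ
    curl -s "https://gpu-runtime.internal/metrics/job_submission_success_rate"  # hypothetical endpoint
    sleep 60
  done
  ```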
- Replay DLQ messages once the scheduler is healthy. See DLQ Recovery Runbook.
- Update status page to “Monitoring” and then “Resolved” as service recovers.
- Notify customers of resolution (CL step).
Phase 4: Artifact Validation
Goal: Ensure no training artifacts were lost or corrupted during the incident.
- Identify jobs that ran during the incident window:
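  If jobs run as Vertex AI custom jobs, one way to list those created in the window (region and timestamps are placeholders):

  ```bash
  # List custom jobs created during the incident window.
  gcloud ai custom-jobs list --region=us-central1 \
    --filter='createTime>"2025-01-01T10:00:00Z" AND createTime<"2025-01-01T12:30:00Z"' \
    --format="table(name, state, createTime)"
  ```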
- Check for jobs in terminal states without GCS artifacts:
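  A sketch assuming an artifact layout of `gs://impulse-gpu-artifacts/<job-id>/` (bucket and layout are assumptions) and the Vertex AI job-state enums:

  ```bash
  # Collect jobs that reached a terminal state during the incident.
  gcloud ai custom-jobs list --region=us-central1 \
    --filter='state="JOB_STATE_SUCCEEDED" OR state="JOB_STATE_FAILED" OR state="JOB_STATE_CANCELLED"' \
    --format="value(name)" > terminal-jobs.txt

  # Flag any terminal job with no objects under its artifact prefix
  # (gsutil ls exits non-zero when nothing matches).
  while read -r job; do
    job_id="${job##*/}"   # strip the resource-name prefix to the bare job ID
    if ! gsutil -q ls "gs://impulse-gpu-artifacts/${job_id}/" > /dev/null 2>&1; then
      echo "${job}" | tee -a missing-artifacts.txt
    fi
  done < terminal-jobs.txt
  ```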
- Identify jobs that completed successfully but lack artifacts (potential upload failures):
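  The same check restricted to `JOB_STATE_SUCCEEDED` isolates likely upload failures, since a job that reports success should always have artifacts:

  ```bash
  # Succeeded jobs with an empty artifact prefix are upload-failure candidates.
  gcloud ai custom-jobs list --region=us-central1 \
    --filter='state="JOB_STATE_SUCCEEDED"' --format="value(name)" |
  while read -r job; do
    gsutil -q ls "gs://impulse-gpu-artifacts/${job##*/}/" > /dev/null 2>&1 \
      || echo "UPLOAD FAILURE candidate: ${job}"
  done
  ```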
- Trigger artifact re-upload for any jobs with missing artifacts:
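  The re-upload path is internal to the GPU Runtime stack; the admin command below is purely hypothetical and stands in for whatever tooling actually performs the re-upload:

  ```bash
  # Hypothetical admin CLI; replace with the real re-upload mechanism.
  while read -r job; do
    gpu-runtime-admin artifacts reupload --job-id "${job##*/}"
  done < missing-artifacts.txt
  ```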
- Confirm artifact counts match expected volumes (jobs × output files per job).
- Document any permanently lost artifacts in the incident ticket.
Phase 5: Postmortem Requirements
A postmortem is required for all P1 and P2 incidents and is recommended for recurring P3 incidents. The postmortem must be completed within:
- P1: 48 hours
- P2: 5 business days
Required sections
- Incident summary — brief description, duration, severity, customer impact
- Timeline — minute-by-minute log from first alert to resolution (use incident ticket history)
- Root cause analysis — technical root cause (not “human error”)
- Contributing factors — monitoring gaps, deployment process issues, etc.
- Impact quantification — number of affected jobs, customers, estimated cost, data loss (if any)
- Action items — specific, owner-assigned, time-bound remediation tasks:
  - At least one action item to detect the issue faster
  - At least one action item to reduce impact
  - At least one action item to prevent recurrence
- Postmortem reviewed by IC, CL, and Engineering Lead
- Postmortem linked in the Linear incident ticket
Postmortem template location
Postmortems are stored in Notion under Engineering → Incident Postmortems → GPU Runtime. Use the standard template at the top of that page.
Quick Reference
| Runbook | Use When |
|---|---|
| Deployment Runbook | Deploying or rolling back the GPU Runtime stack |
| DLQ Recovery Runbook | DLQ has undelivered messages; replaying failed jobs |
| Failure Recovery Runbook | Scheduler crash, preemption storm, upload failure, Pub/Sub consumer failure |
| Quota Exhaustion Runbook | Vertex AI GPU quota at or near 100 % |
| Cost Anomaly Runbook | Budget exceeded; runaway retries; high-cost sessions |
| Incident Response Checklist | This document — use for all P1/P2 incidents |
Key Contacts
| Role | Responsibility |
|---|---|
| On-call SRE | First responder for all GPU Runtime alerts |
| Platform Engineering Lead | Escalation for infrastructure issues |
| Backend Engineering Lead | Escalation for scheduler and API issues |
| Finance | Budget increase approvals |
| GCP Technical Account Manager | Emergency quota increase requests |