Incident Response Checklist

This checklist is the authoritative guide for responding to GPU Runtime production incidents. Work through each phase sequentially. All actions should be logged in the active incident ticket.
Create an incident ticket in Linear before beginning this checklist. Assign an Incident Commander (IC) and a Communications Lead (CL) for any P1 or P2 incident.

Severity Definitions

| Severity | Criteria | Response Time |
| --- | --- | --- |
| P1 — Critical | > 25% of GPU jobs failing; budget exceeded; data loss | Immediate (< 15 min) |
| P2 — High | 5–25% job failure rate; quota > 90%; DLQ > 50 messages | < 30 min |
| P3 — Medium | Elevated error rate (< 5%); single-component degradation | < 2 hours |
| P4 — Low | Non-customer-impacting anomaly; cost warning | Next business day |

Phase 1: Containment

Goal: Stop the bleeding — prevent further impact from spreading.
  • Acknowledge the alert in PagerDuty and mark yourself as responder.
  • Open an incident ticket in Linear with severity, symptoms, and time of detection.
  • Join the incident channel (#incident-gpu-runtime in Slack) and post the Linear ticket link.
  • Assess scope — how many jobs are affected? Which region(s)?
    # Check failure rate
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '15 minutes ago' -Iseconds)" \
      --format="value(name)" | wc -l
    
  • Pause job intake if the failure rate exceeds 25% or the budget is exhausted (a flag-verification sketch follows this list):
    kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=true
    
  • Notify customers (via status page) if P1/P2 with customer-visible impact. CL owns this step.
  • Verify containment — confirm no new failures are being created.
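
Before declaring containment, it can help to confirm that the pause flag actually landed on the scheduler deployment. A minimal check, using kubectl set env in --list mode (the flag name comes from the step above):

    # List the scheduler's env vars and confirm the intake pause is in place
    kubectl set env deployment/gpu-scheduler -n gpu-runtime --list | grep PAUSE_JOB_INTAKE
    # Expected: PAUSE_JOB_INTAKE=true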

Phase 2: Triage

Goal: Identify the root cause so the right recovery action can be selected.
  • Check scheduler pod health:
    kubectl get pods -n gpu-runtime -l app=gpu-scheduler
    kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100
    
  • Check Pub/Sub subscription backlog. Backlog depth and oldest-unacked age are Cloud Monitoring metrics, not fields on the subscription resource, so query the Monitoring API:
    # num_undelivered_messages over the last 10 minutes; also check
    # subscription/oldest_unacked_message_age for stalled consumers
    curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
      --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-sub"' \
      --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' -Iseconds)" \
      --data-urlencode "interval.endTime=$(date -u -Iseconds)"
    
  • Check DLQ message count — the same Monitoring query against the DLQ subscription:
    curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
      --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-dlq-sub"' \
      --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' -Iseconds)" \
      --data-urlencode "interval.endTime=$(date -u -Iseconds)"
    
  • Check GPU quota utilization. Note that Vertex AI custom-training GPU quota (under aiplatform.googleapis.com in IAM & Admin → Quotas) is tracked separately from Compute Engine GPU quota; the command below covers the Compute Engine side:
    gcloud compute regions describe us-central1 \
      --project=impulse-gpu-runtime \
      --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS"
    
  • Check GPU cost and budget status:
    # Review current spend vs. the monthly budget: GCP Console → Billing → Budgets & alerts
    
  • Check for GCP incidents at status.cloud.google.com (Vertex AI, Pub/Sub).
  • Determine the failure category and select the matching runbook — the Quick Reference table at the end of this document maps symptoms to runbooks.
  • Post a triage summary to the incident channel with the root-cause hypothesis and recovery plan (a snapshot helper is sketched below this list).
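
The checks above can be bundled into a one-shot snapshot for the triage post. This is a minimal sketch, assuming an authenticated gcloud and kubectl session; triage-snapshot.txt is an arbitrary local filename:

    # Gather the Phase 2 signals into a single paste-able snapshot
    {
      echo "=== scheduler pods ==="
      kubectl get pods -n gpu-runtime -l app=gpu-scheduler
      echo "=== failed jobs, last 15 min ==="
      gcloud ai custom-jobs list \
        --project=impulse-gpu-runtime --region=us-central1 \
        --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '15 minutes ago' -Iseconds)" \
        --format="value(name)" | wc -l
      echo "=== Compute Engine T4 quota (us-central1) ==="
      gcloud compute regions describe us-central1 --project=impulse-gpu-runtime \
        --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS"
    } | tee triage-snapshot.txt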

Phase 3: Service Recovery

Goal: Restore GPU job processing to the normal operating state.
  • Execute the relevant runbook from the triage phase above.
  • Validate scheduler restart (if applicable):
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    kubectl exec -n gpu-runtime deploy/gpu-scheduler -- curl -s http://localhost:8080/healthz
    
  • Resume job intake once the root cause is resolved and health checks pass (a gated version is sketched after this list):
    kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
    
  • Monitor the job failure count for 10 minutes after resuming. The escaped \$ defers the date expansion so the five-minute window slides on each refresh:
    watch -n 30 "gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime --region=us-central1 \
      --filter=\"state=JOB_STATE_FAILED AND updateTime>\$(date -d '5 minutes ago' -Iseconds)\" \
      --format='value(name)' | wc -l"
    
  • Replay DLQ messages once the scheduler is healthy. See DLQ Recovery Runbook.
  • Update status page to “Monitoring” and then “Resolved” as service recovers.
  • Notify customers of resolution (CL step).
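
Resuming intake against a still-unhealthy scheduler only rebuilds the failure backlog, so the resume step can be gated on the health check. A minimal sketch combining the two commands above:

    # Flip PAUSE_JOB_INTAKE off only if the health endpoint responds successfully
    if kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
        curl -sf http://localhost:8080/healthz >/dev/null; then
      kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
      echo "job intake resumed"
    else
      echo "healthz failing; leaving intake paused" >&2
    fi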

Phase 4: Artifact Validation

Goal: Ensure no training artifacts were lost or corrupted during the incident.
  • Identify jobs that ran during the incident window:
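    # INCIDENT_START / INCIDENT_END are RFC 3339 timestamps, e.g. 2025-06-01T00:00:00Z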
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="updateTime>INCIDENT_START AND updateTime<INCIDENT_END" \
      --format="table(displayName,state,createTime,updateTime)"
    
  • Check for jobs in terminal states without GCS artifacts (an automated sweep is sketched at the end of this list):
    # For each SUCCEEDED job, verify artifacts exist
    gsutil ls gs://impulse-gpu-artifacts/<session_id>/<job_id>/
    
  • Identify jobs that completed successfully but lack artifacts (potential upload failures):
    curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true&since=INCIDENT_START" \
      -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq length
    
  • Trigger artifact re-upload for any jobs with missing artifacts:
    curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true" \
      -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
      | jq -r '.[].job_id' \
      | while read -r job_id; do
          curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
            -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
        done
    
  • Confirm artifact counts match expected volumes (jobs × output files per job).
  • Document any permanently lost artifacts in the incident ticket.
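
Checking artifacts job by job does not scale during a wide incident, so the sweep can be scripted. A rough sketch, assuming the gs://impulse-gpu-artifacts/<session_id>/<job_id>/ layout shown above and gsutil wildcard matching; treat it as a starting point rather than the authoritative validation:

    # Flag SUCCEEDED jobs in the incident window that have no artifact objects.
    # Substitute INCIDENT_START / INCIDENT_END as in the earlier steps.
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime --region=us-central1 \
      --filter="state=JOB_STATE_SUCCEEDED AND updateTime>INCIDENT_START AND updateTime<INCIDENT_END" \
      --format="value(name)" \
    | while read -r job; do
        job_id="${job##*/}"  # resource name ends in .../customJobs/<job_id>
        gsutil -q stat "gs://impulse-gpu-artifacts/*/${job_id}/*" \
          || echo "MISSING artifacts for job ${job_id}"
      done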

Phase 5: Postmortem Requirements

A postmortem is required for all P1 and P2 incidents, and is recommended for recurring P3 incidents.

The postmortem must be completed within:

  • P1: 48 hours
  • P2: 5 business days

Required sections and sign-offs

  • Incident summary — brief description, duration, severity, customer impact
  • Timeline — minute-by-minute log from first alert to resolution (use incident ticket history)
  • Root cause analysis — technical root cause (not “human error”)
  • Contributing factors — monitoring gaps, deployment process issues, etc.
  • Impact quantification — number of affected jobs, customers, estimated cost, data loss (if any)
  • Action items — specific, owner-assigned, time-bound remediation tasks:
    • At least one action item to detect the issue faster
    • At least one action item to reduce impact
    • At least one action item to prevent recurrence
  • Postmortem reviewed by IC, CL, and Engineering Lead
  • Postmortem linked in the Linear incident ticket

Postmortem template location

Postmortems are stored in Notion under Engineering → Incident Postmortems → GPU Runtime. Use the standard template at the top of that page.

Quick Reference

| Runbook | Use When |
| --- | --- |
| Deployment Runbook | Deploying or rolling back the GPU Runtime stack |
| DLQ Recovery Runbook | DLQ has undelivered messages; replaying failed jobs |
| Failure Recovery Runbook | Scheduler crash, preemption storm, upload failure, Pub/Sub consumer failure |
| Quota Exhaustion Runbook | Vertex AI GPU quota at or near 100% |
| Cost Anomaly Runbook | Budget exceeded; runaway retries; high-cost sessions |
| Incident Response Checklist | This document — use for all P1/P2 incidents |

Key Contacts

| Role | Responsibility |
| --- | --- |
| On-call SRE | First responder for all GPU Runtime alerts |
| Platform Engineering Lead | Escalation for infrastructure issues |
| Backend Engineering Lead | Escalation for scheduler and API issues |
| Finance | Budget increase approvals |
| GCP Technical Account Manager | Emergency quota increase requests |