Production operations checklist for GPU Runtime incidents — containment, triage, service recovery, artifact validation, and postmortem requirements
This checklist is the authoritative guide for responding to GPU Runtime production incidents. Work through the phases in order, and log every action in the active incident ticket.
Create an incident ticket in Linear before beginning this checklist. Assign an Incident Commander (IC) and a Communications Lead (CL) for any P1 or P2 incident.
Goal: Confirm that no training artifacts were lost or corrupted during the incident.
Identify jobs that ran during the incident window:
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="updateTime>INCIDENT_START AND updateTime<INCIDENT_END" \
  --format="table(displayName,state,createTime,updateTime)"
Check for jobs in terminal states without GCS artifacts:
# For each SUCCEEDED job, verify artifacts exist
gsutil ls gs://impulse-gpu-artifacts/<session_id>/<job_id>/
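When many jobs ran during the incident window, the per-job check above can be looped. The following is a minimal sketch, not established tooling. It assumes you have already collected the `<session_id>/<job_id>` prefixes of SUCCEEDED jobs, one per line; how to derive them from the job listing is not covered here. The function name `find_missing_artifacts` and the `LIST_CMD` override are hypothetical and exist only so the loop can be dry-run without GCS access.

```shell
#!/usr/bin/env bash
# Sketch: report SUCCEEDED jobs whose artifact prefix is empty or missing.
# Assumption: stdin holds one "<session_id>/<job_id>" per line.
# LIST_CMD is a hypothetical override for dry runs; it defaults to gsutil ls.

BUCKET="gs://impulse-gpu-artifacts"
LIST_CMD="${LIST_CMD:-gsutil ls}"

find_missing_artifacts() {
  local prefix
  # The "|| [ -n ... ]" clause handles a final line with no trailing newline.
  while IFS= read -r prefix || [ -n "$prefix" ]; do
    # Skip blank lines.
    [ -z "$prefix" ] && continue
    # gsutil ls exits non-zero when the URL matches no objects.
    # stdin is redirected so the listing command cannot consume the job list.
    if ! $LIST_CMD "${BUCKET}/${prefix}/" </dev/null >/dev/null 2>&1; then
      echo "MISSING ${prefix}"
    fi
  done
}
```

Example use, with `succeeded_jobs.txt` as a hypothetical input file: `find_missing_artifacts < succeeded_jobs.txt`. Because any non-zero exit is reported as `MISSING`, an authentication or network error looks the same as an absent artifact. Re-run `gsutil ls` by hand on each flagged prefix before recording it as lost in the incident ticket.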
Identify jobs that completed successfully but lack artifacts (potential upload failures):