Incident Response Checklist

This checklist is the authoritative guide for responding to GPU Runtime production incidents. Work through each phase sequentially. All actions should be logged in the active incident ticket.
Create an incident ticket in Linear before beginning this checklist. Assign an Incident Commander (IC) and a Communications Lead (CL) for any P1 or P2 incident.

Severity Definitions

| Severity | Criteria | Response Time |
| --- | --- | --- |
| P1 — Critical | > 25% of GPU jobs failing; budget exceeded; data loss | Immediate (< 15 min) |
| P2 — High | 5–25% job failure rate; quota > 90%; DLQ > 50 messages | < 30 min |
| P3 — Medium | Elevated error rate (< 5%); single-component degradation | < 2 hours |
| P4 — Low | Non-customer-impacting anomaly; cost warning | Next business day |

Phase 1: Containment

Goal: Stop the bleeding — prevent further impact from spreading.
  • Acknowledge the alert in PagerDuty and mark yourself as responder.
  • Open an incident ticket in Linear with severity, symptoms, and time of detection.
  • Join the incident channel (#incident-gpu-runtime in Slack) and post the Linear ticket link.
  • Assess scope — how many jobs are affected? Which region(s)?
    # Check failure rate
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '15 minutes ago' -Iseconds)" \
      --format="value(name)" | wc -l
    
  • Pause job intake if the failure rate exceeds 25% or the budget is exhausted (a flag-verification sketch follows this list):
    kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=true
    
  • Notify customers (via status page) if P1/P2 with customer-visible impact. CL owns this step.
  • Verify containment — confirm no new failures are being created.
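
Before declaring containment, it can help to confirm that the pause flag actually landed on the scheduler deployment. A minimal check, using kubectl set env in --list mode (the flag name comes from the step above):

    # List the scheduler's env vars and confirm the intake pause is in place
    kubectl set env deployment/gpu-scheduler -n gpu-runtime --list | grep PAUSE_JOB_INTAKE
    # Expected: PAUSE_JOB_INTAKE=true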

Phase 2: Triage

Goal: Identify the root cause so the right recovery action can be selected.
  • Check scheduler pod health:
    kubectl get pods -n gpu-runtime -l app=gpu-scheduler
    kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=100
    
  • Check Pub/Sub subscription backlog. Backlog depth and oldest-unacked age are Cloud Monitoring metrics, not fields on the subscription resource, so query the Monitoring API:
    # num_undelivered_messages over the last 10 minutes; also check
    # subscription/oldest_unacked_message_age for stalled consumers
    curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
      --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-sub"' \
      --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' -Iseconds)" \
      --data-urlencode "interval.endTime=$(date -u -Iseconds)"
    
  • Check DLQ message count — the same Monitoring query against the DLQ subscription:
    curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
      --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-dlq-sub"' \
      --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' -Iseconds)" \
      --data-urlencode "interval.endTime=$(date -u -Iseconds)"
    
  • Check GPU quota utilization. Note that Vertex AI custom-training GPU quota (under aiplatform.googleapis.com in IAM & Admin → Quotas) is tracked separately from Compute Engine GPU quota; the command below covers the Compute Engine side:
    gcloud compute regions describe us-central1 \
      --project=impulse-gpu-runtime \
      --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS"
    
  • Check GPU cost and budget status:
    # Review current spend vs. the monthly budget: GCP Console → Billing → Budgets & alerts
    
  • Check for GCP incidents at status.cloud.google.com (Vertex AI, Pub/Sub).
  • Determine the failure category and select the matching runbook — the Quick Reference table at the end of this document maps symptoms to runbooks.
  • Post a triage summary to the incident channel with the root-cause hypothesis and recovery plan (a snapshot helper is sketched below this list).
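
The checks above can be bundled into a one-shot snapshot for the triage post. This is a minimal sketch, assuming an authenticated gcloud and kubectl session; triage-snapshot.txt is an arbitrary local filename:

    # Gather the Phase 2 signals into a single paste-able snapshot
    {
      echo "=== scheduler pods ==="
      kubectl get pods -n gpu-runtime -l app=gpu-scheduler
      echo "=== failed jobs, last 15 min ==="
      gcloud ai custom-jobs list \
        --project=impulse-gpu-runtime --region=us-central1 \
        --filter="state=JOB_STATE_FAILED AND updateTime>$(date -d '15 minutes ago' -Iseconds)" \
        --format="value(name)" | wc -l
      echo "=== Compute Engine T4 quota (us-central1) ==="
      gcloud compute regions describe us-central1 --project=impulse-gpu-runtime \
        --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS"
    } | tee triage-snapshot.txt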

Phase 3: Service Recovery

Goal: Restore GPU job processing to the normal operating state.
  • Execute the relevant runbook from the triage phase above.
  • Validate scheduler restart (if applicable):
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    kubectl exec -n gpu-runtime deploy/gpu-scheduler -- curl -s http://localhost:8080/healthz
    
  • Resume job intake once the root cause is resolved and health checks pass (a gated version is sketched after this list):
    kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
    
  • Monitor the job failure count for 10 minutes after resuming. The escaped \$ defers the date expansion so the five-minute window slides on each refresh:
    watch -n 30 "gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime --region=us-central1 \
      --filter=\"state=JOB_STATE_FAILED AND updateTime>\$(date -d '5 minutes ago' -Iseconds)\" \
      --format='value(name)' | wc -l"
    
  • Replay DLQ messages once the scheduler is healthy. See DLQ Recovery Runbook.
  • Update status page to “Monitoring” and then “Resolved” as service recovers.
  • Notify customers of resolution (CL step).
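
Resuming intake against a still-unhealthy scheduler only rebuilds the failure backlog, so the resume step can be gated on the health check. A minimal sketch combining the two commands above:

    # Flip PAUSE_JOB_INTAKE off only if the health endpoint responds successfully
    if kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
        curl -sf http://localhost:8080/healthz >/dev/null; then
      kubectl set env deployment/gpu-scheduler -n gpu-runtime PAUSE_JOB_INTAKE=false
      echo "job intake resumed"
    else
      echo "healthz failing; leaving intake paused" >&2
    fi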

Phase 4: Artifact Validation

Goal: Ensure no training artifacts were lost or corrupted during the incident.
  • Identify jobs that ran during the incident window:
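    # INCIDENT_START / INCIDENT_END are RFC 3339 timestamps, e.g. 2025-06-01T00:00:00Z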
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime \
      --region=us-central1 \
      --filter="updateTime>INCIDENT_START AND updateTime<INCIDENT_END" \
      --format="table(displayName,state,createTime,updateTime)"
    
  • Check for jobs in terminal states without GCS artifacts (an automated sweep is sketched at the end of this list):
    # For each SUCCEEDED job, verify artifacts exist
    gsutil ls gs://impulse-gpu-artifacts/<session_id>/<job_id>/
    
  • Identify jobs that completed successfully but lack artifacts (potential upload failures):
    curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true&since=INCIDENT_START" \
      -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq length
    
  • Trigger artifact re-upload for any jobs with missing artifacts:
    curl -s "https://api.impulselabs.ai/internal/gpu/jobs?status=SUCCEEDED&artifact_missing=true" \
      -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
      | jq -r '.[].job_id' \
      | while read -r job_id; do
          curl -s -X POST "https://api.impulselabs.ai/internal/gpu/jobs/$job_id/retry-upload" \
            -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN"
        done
    
  • Confirm artifact counts match expected volumes (jobs × output files per job).
  • Document any permanently lost artifacts in the incident ticket.
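
Checking artifacts job by job does not scale during a wide incident, so the sweep can be scripted. A rough sketch, assuming the gs://impulse-gpu-artifacts/<session_id>/<job_id>/ layout shown above and gsutil wildcard matching; treat it as a starting point rather than the authoritative validation:

    # Flag SUCCEEDED jobs in the incident window that have no artifact objects.
    # Substitute INCIDENT_START / INCIDENT_END as in the earlier steps.
    gcloud ai custom-jobs list \
      --project=impulse-gpu-runtime --region=us-central1 \
      --filter="state=JOB_STATE_SUCCEEDED AND updateTime>INCIDENT_START AND updateTime<INCIDENT_END" \
      --format="value(name)" \
    | while read -r job; do
        job_id="${job##*/}"  # resource name ends in .../customJobs/<job_id>
        gsutil -q stat "gs://impulse-gpu-artifacts/*/${job_id}/*" \
          || echo "MISSING artifacts for job ${job_id}"
      done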

Phase 5: Postmortem Requirements

A postmortem is required for all P1 and P2 incidents, and is recommended for recurring P3 incidents.

The postmortem must be completed within:

  • P1: 48 hours
  • P2: 5 business days

Required sections and sign-offs

  • Incident summary — brief description, duration, severity, customer impact
  • Timeline — minute-by-minute log from first alert to resolution (use incident ticket history)
  • Root cause analysis — technical root cause (not “human error”)
  • Contributing factors — monitoring gaps, deployment process issues, etc.
  • Impact quantification — number of affected jobs, customers, estimated cost, data loss (if any)
  • Action items — specific, owner-assigned, time-bound remediation tasks:
    • At least one action item to detect the issue faster
    • At least one action item to reduce impact
    • At least one action item to prevent recurrence
  • Postmortem reviewed by IC, CL, and Engineering Lead
  • Postmortem linked in the Linear incident ticket

Postmortem template location

Postmortems are stored in Notion under Engineering → Incident Postmortems → GPU Runtime. Use the standard template at the top of that page.

Quick Reference

| Runbook | Use When |
| --- | --- |
| Deployment Runbook | Deploying or rolling back the GPU Runtime stack |
| DLQ Recovery Runbook | DLQ has undelivered messages; replaying failed jobs |
| Failure Recovery Runbook | Scheduler crash, preemption storm, upload failure, Pub/Sub consumer failure |
| Quota Exhaustion Runbook | Vertex AI GPU quota at or near 100% |
| Cost Anomaly Runbook | Budget exceeded; runaway retries; high-cost sessions |
| Incident Response Checklist | This document — use for all P1/P2 incidents |

Key Contacts

| Role | Responsibility |
| --- | --- |
| On-call SRE | First responder for all GPU Runtime alerts |
| Platform Engineering Lead | Escalation for infrastructure issues |
| Backend Engineering Lead | Escalation for scheduler and API issues |
| Finance | Budget increase approvals |
| GCP Technical Account Manager | Emergency quota increase requests |