
When a GPU job message exceeds its maximum delivery attempts, it is forwarded to the dead-letter queue (DLQ). This runbook explains how to inspect DLQ contents, safely replay messages, validate idempotency, and recover stuck jobs.
The DLQ subscription is gpu-job-request-dlq-sub on the topic gpu-job-request-dlq in the impulse-gpu-runtime GCP project.

1. Dead-Letter Queue Inspection

View undelivered message count

The backlog size is not a field on the Subscription resource itself; it is exposed as the Cloud Monitoring metric pubsub.googleapis.com/subscription/num_undelivered_messages. Query the most recent data point:

# Read num_undelivered_messages for the DLQ subscription from Cloud Monitoring
curl -s -G "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-dlq-sub"' \
  --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  | jq '.timeSeries[0].points[0].value.int64Value'

Pull and inspect DLQ messages (non-destructive)

# Pull up to 50 messages without acknowledging them; unacked messages
# stay in the DLQ and are redelivered after the ack deadline expires
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=50 \
  --format=json > /tmp/dlq-messages.json

cat /tmp/dlq-messages.json | jq '.[] | {ackId, messageId: .message.messageId, publishTime: .message.publishTime, data: (.message.data | @base64d | fromjson)}'

Inspect delivery attempt metadata

Messages forwarded to a dead-letter topic carry attributes added by Pub/Sub, including CloudPubSubDeadLetterSourceDeliveryCount (the number of delivery attempts on the source subscription, which will have reached maxDeliveryAttempts, default: 5) and CloudPubSubDeadLetterSourceTopicPublishTime (when the message was originally published):
cat /tmp/dlq-messages.json | jq '.[] | {
  job_id: (.message.data | @base64d | fromjson | .job_id),
  delivery_count: .message.attributes.CloudPubSubDeadLetterSourceDeliveryCount,
  original_publish_time: .message.attributes.CloudPubSubDeadLetterSourceTopicPublishTime
}'

Categorise failures

Review the error_code and error_message fields in the message body to determine root cause before replaying:
error_code             | Likely Cause                     | Action
-----------------------|----------------------------------|------------------------------------------------
VERTEX_QUOTA_EXCEEDED  | GPU quota exhausted              | See Quota Exhaustion Runbook
ARTIFACT_UPLOAD_FAILED | GCS write error                  | Verify GCS permissions; see Failure Recovery
JOB_TIMEOUT            | Job exceeded wall-clock limit    | Investigate job payload; contact requester
SCHEDULER_INTERNAL     | Scheduler bug or transient error | Safe to replay; check scheduler logs
INVALID_PAYLOAD        | Malformed message                | Do not replay; fix upstream producer
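
To see which failure modes dominate before deciding what to replay, tally the pulled messages by error_code (plain jq over the file from the pull step; messages without an error_code are bucketed as UNKNOWN):
# Count DLQ messages per error_code
cat /tmp/dlq-messages.json \
  | jq -r '.[].message.data | @base64d | fromjson | .error_code // "UNKNOWN"' \
  | sort | uniq -c | sort -rn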

2. Message Replay Procedures

Only replay messages with error codes that are safe to retry (e.g. SCHEDULER_INTERNAL, VERTEX_QUOTA_EXCEEDED after quota is restored). Never replay INVALID_PAYLOAD messages — they will fail again and consume quota.

Replay a single message

# Extract the raw data from the DLQ message
MESSAGE_DATA=$(jq -r '.[0].message.data' /tmp/dlq-messages.json)
JOB_ID=$(echo "$MESSAGE_DATA" | base64 -d | jq -r '.job_id')

echo "Replaying job: $JOB_ID"

# Re-publish to the main job-request topic
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message="$(echo $MESSAGE_DATA | base64 -d)" \
  --attribute="replayed_from_dlq=true,original_job_id=$JOB_ID"

Bulk replay (filtered)

To replay all messages matching a given error_code, filter the pulled file and re-publish each match:
# Replay all SCHEDULER_INTERNAL failures
cat /tmp/dlq-messages.json \
  | jq -r '.[] | select((.message.data | @base64d | fromjson | .error_code) == "SCHEDULER_INTERNAL") | .message.data' \
  | while read -r data; do
      gcloud pubsub topics publish gpu-job-request \
        --project=impulse-gpu-runtime \
        --message="$(echo "$data" | base64 -d)" \
        --attribute="replayed_from_dlq=true"
    done

Acknowledge replayed messages from the DLQ

After confirming the replayed jobs succeeded, acknowledge the DLQ messages to remove them. Note that this acks every message in the pulled file, and that ack IDs are only valid until the subscription's ack deadline expires; if the ack fails, pull again to obtain fresh ack IDs:
ACK_IDS=$(jq -r '[.[].ackId] | join(",")' /tmp/dlq-messages.json)
gcloud pubsub subscriptions ack gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --ack-ids="$ACK_IDS"

3. Idempotent Replay Validation

The GPU Scheduler implements idempotency checks keyed on job_id. Before replaying, verify the job is not already active or completed:
# Query the job status API
curl -s "https://api.impulselabs.ai/gpu/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq .

# Expected states that are safe to replay
# "FAILED", "EXPIRED", "UNKNOWN"

# States that must NOT be replayed
# "RUNNING", "SUCCEEDED", "QUEUED"
The scheduler will reject duplicate job_id submissions with a 409 Conflict response and drop the message from the queue automatically. This prevents double-billing.
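
Putting the check and the replay together, a minimal guard loop might look like the following sketch. It assumes the status endpoint returns the state in a top-level status field (the exact field name is an assumption; adjust to the real response shape):
# Replay only jobs whose current status is safe to retry
for data in $(jq -r '.[].message.data' /tmp/dlq-messages.json); do
  job_id=$(echo "$data" | base64 -d | jq -r '.job_id')
  # ".status" is an assumed field name; verify against the actual API response
  status=$(curl -s "https://api.impulselabs.ai/gpu/jobs/$job_id/status" \
    -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.status')
  case "$status" in
    FAILED|EXPIRED|UNKNOWN)
      gcloud pubsub topics publish gpu-job-request \
        --project=impulse-gpu-runtime \
        --message="$(echo "$data" | base64 -d)" \
        --attribute="replayed_from_dlq=true,original_job_id=$job_id"
      ;;
    *)
      echo "skipping $job_id (status: $status)" >&2
      ;;
  esac
done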

Verify replay idempotency in scheduler logs

kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=200 | \
  grep -E "duplicate_job|idempotency_check|$JOB_ID"
Look for log entries such as:
{"level":"info","msg":"idempotency check passed","job_id":"...","prior_status":"FAILED"}
{"level":"warn","msg":"duplicate job rejected","job_id":"...","prior_status":"RUNNING"}

4. Stuck Job Recovery

A job is considered “stuck” if it remains in RUNNING state for longer than its configured max_runtime_seconds (default: 3600 s).

Identify stuck jobs

# List Vertex AI custom jobs running for > 1 hour
# (date -d requires GNU date; on macOS use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<\"$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)\"" \
  --format="table(name,displayName,createTime,state)"

Cancel a stuck Vertex AI job

VERTEX_JOB_NAME="projects/impulse-gpu-runtime/locations/us-central1/customJobs/JOB_NUMBER"

gcloud ai custom-jobs cancel $VERTEX_JOB_NAME \
  --project=impulse-gpu-runtime \
  --region=us-central1
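
Cancellation is asynchronous: the job passes through JOB_STATE_CANCELLING before reaching JOB_STATE_CANCELLED. Confirm it has reached a terminal state before updating the scheduler record:
gcloud ai custom-jobs describe "$VERTEX_JOB_NAME" \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --format="value(state)"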

Force-complete the scheduler record

After cancelling the Vertex job, mark the corresponding scheduler record as FAILED to release the lock:
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MANUALLY_CANCELLED", "error_message": "Stuck job cancelled by on-call operator"}'

Re-queue the job (optional)

If the job should be retried after recovery, publish a fresh message:
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message='{"job_id":"'$JOB_ID'","session_id":"'$SESSION_ID'","type":"gpu_sandbox"}' \
  --attribute="retried_by_operator=true"

Post-Recovery Checklist

  • DLQ message count returned to 0 (or known-good baseline; re-check with the sketch after this list)
  • Replayed jobs completed successfully in Vertex AI
  • No duplicate billings recorded for replayed job_ids
  • Scheduler logs show no new DLQ escalations within 30 minutes
  • Root cause documented in incident postmortem (if > 10 messages in DLQ)
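
To watch the backlog drain for the first checklist item, poll the same Cloud Monitoring metric used in section 1:
# Poll the DLQ backlog once a minute until it returns to baseline (Ctrl-C to stop)
while true; do
  curl -s -G "https://monitoring.googleapis.com/v3/projects/impulse-gpu-runtime/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="gpu-job-request-dlq-sub"' \
    --data-urlencode "interval.startTime=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    | jq '.timeSeries[0].points[0].value.int64Value'
  sleep 60
done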