# GPU Runtime DLQ Recovery Runbook

> Procedures for inspecting, replaying, and recovering messages from the GPU Runtime dead-letter queue

When a GPU job message exceeds its maximum delivery attempts, it is forwarded to the dead-letter queue (DLQ). This runbook explains how to inspect DLQ contents, safely replay messages, validate idempotency, and recover stuck jobs.

The DLQ subscription is `gpu-job-request-dlq-sub` on the topic `gpu-job-request-dlq` in the `impulse-gpu-runtime` GCP project.

***

## 1. Dead-Letter Queue Inspection

### View undelivered message count

```bash
gcloud pubsub subscriptions describe gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(name,numUndeliveredMessages)"
```

If `numUndeliveredMessages` is missing from the output, read the backlog from the Cloud Monitoring metric `pubsub.googleapis.com/subscription/num_undelivered_messages` instead.

### Pull and inspect DLQ messages (non-destructive)

```bash
# Pull up to 50 messages without acknowledging them
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=50 \
  --format=json > /tmp/dlq-messages.json

# Decode each payload and show it alongside its ack ID and publish metadata
jq '.[] | {ackId, messageId: .message.messageId, publishTime: .message.publishTime, data: (.message.data | @base64d | fromjson)}' /tmp/dlq-messages.json
```

### Inspect delivery attempt metadata

Each DLQ message carries a `delivery_attempt` attribute. Messages forwarded to the DLQ will have `delivery_attempt` equal to `maxDeliveryAttempts` (default: 5).
```bash
# Show the job ID, delivery attempt count, and publish time of each message
jq '.[] | {
  job_id: (.message.data | @base64d | fromjson | .job_id),
  delivery_attempt: .message.attributes.delivery_attempt,
  original_publish_time: .message.publishTime
}' /tmp/dlq-messages.json
```

### Categorise failures

Review the `error_code` and `error_message` fields in the message body to determine the root cause before replaying:

| `error_code`             | Likely Cause                     | Action                                                                           |
| ------------------------ | -------------------------------- | -------------------------------------------------------------------------------- |
| `VERTEX_QUOTA_EXCEEDED`  | GPU quota exhausted              | See [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion)                 |
| `ARTIFACT_UPLOAD_FAILED` | GCS write error                  | Verify GCS permissions; see [Failure Recovery](/gpu-operations/failure-recovery) |
| `JOB_TIMEOUT`            | Job exceeded wall-clock limit    | Investigate job payload; contact requester                                       |
| `SCHEDULER_INTERNAL`     | Scheduler bug or transient error | Safe to replay; check scheduler logs                                             |
| `INVALID_PAYLOAD`        | Malformed message                | Do **not** replay; fix upstream producer                                         |

***

## 2. Message Replay Procedures

Only replay messages whose error codes are safe to retry (e.g. `SCHEDULER_INTERNAL`, or `VERTEX_QUOTA_EXCEEDED` after quota is restored). Never replay `INVALID_PAYLOAD` messages: they will fail again and consume quota.
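
The replay rule and the failure table above can be sketched as a small decision helper. This is only an illustration: `replay_decision` and its output labels are hypothetical names, not part of the runbook's tooling, and codes the text does not declare safe to retry are routed to manual investigation.

```bash
# Hypothetical helper (an assumption, not existing tooling): map an error_code
# to a replay decision based on the failure table and the replay rule.
replay_decision() {
  case "$1" in
    SCHEDULER_INTERNAL)    echo "replay" ;;
    VERTEX_QUOTA_EXCEEDED) echo "replay-after-quota-restored" ;;
    INVALID_PAYLOAD)       echo "do-not-replay" ;;
    # JOB_TIMEOUT, ARTIFACT_UPLOAD_FAILED, and any unknown code
    *)                     echo "investigate-first" ;;
  esac
}

# Demo on some of the codes from the table
for code in SCHEDULER_INTERNAL VERTEX_QUOTA_EXCEEDED INVALID_PAYLOAD JOB_TIMEOUT; do
  printf '%s -> %s\n' "$code" "$(replay_decision "$code")"
done
```

Against the pulled file, the same function could be fed from `jq -r '.[] | .message.data | @base64d | fromjson | .error_code' /tmp/dlq-messages.json`, one code per line.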
### Replay a single message

```bash
# Extract the base64-encoded payload of the first DLQ message
MESSAGE_DATA=$(jq -r '.[0].message.data' /tmp/dlq-messages.json)
JOB_ID=$(echo "$MESSAGE_DATA" | base64 -d | jq -r '.job_id')
echo "Replaying job: $JOB_ID"

# Re-publish the decoded payload to the main job-request topic
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message="$(echo "$MESSAGE_DATA" | base64 -d)" \
  --attribute="replayed_from_dlq=true,original_job_id=$JOB_ID"
```

### Bulk replay (filtered)

Use the following pipeline to replay all messages matching a given `error_code`:

```bash
# Replay all SCHEDULER_INTERNAL failures
jq -r '.[] | select((.message.data | @base64d | fromjson | .error_code) == "SCHEDULER_INTERNAL") | .message.data' /tmp/dlq-messages.json \
  | while read -r data; do
      gcloud pubsub topics publish gpu-job-request \
        --project=impulse-gpu-runtime \
        --message="$(echo "$data" | base64 -d)" \
        --attribute="replayed_from_dlq=true"
    done
```

### Acknowledge replayed messages from the DLQ

After confirming the replayed jobs succeeded, acknowledge the DLQ messages to remove them. Ack IDs are valid only until the subscription's acknowledgement deadline passes, so pull again to obtain fresh IDs if the saved ones have expired.

```bash
# Note: this acknowledges every message in the file, not only the replayed ones
ACK_IDS=$(jq -r '[.[].ackId] | join(",")' /tmp/dlq-messages.json)
gcloud pubsub subscriptions ack gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --ack-ids="$ACK_IDS"
```

***

## 3. Idempotent Replay Validation

The GPU Scheduler implements idempotency checks keyed on `job_id`. Before replaying, verify that the job is not already active or completed:

```bash
# Query the job status API
curl -s "https://api.impulselabs.ai/gpu/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq .

# States that are safe to replay:
#   "FAILED", "EXPIRED", "UNKNOWN"
# States that must NOT be replayed:
#   "RUNNING", "SUCCEEDED", "QUEUED"
```

The scheduler rejects duplicate `job_id` submissions with a `409 Conflict` response and drops the message from the queue automatically.
This prevents double-billing.

### Verify replay idempotency in scheduler logs

```bash
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=200 | \
  grep -E "duplicate_job|idempotency_check|$JOB_ID"
```

Look for log entries such as:

```json
{"level":"info","msg":"idempotency check passed","job_id":"...","prior_status":"FAILED"}
{"level":"warn","msg":"duplicate job rejected","job_id":"...","prior_status":"RUNNING"}
```

***

## 4. Stuck Job Recovery

A job is considered "stuck" if it remains in the `RUNNING` state for longer than its configured `max_runtime_seconds` (default: 3600 s).

### Identify stuck jobs

```bash
# List Vertex AI custom jobs running for > 1 hour (GNU date syntax)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '1 hour ago' -Iseconds)" \
  --format="table(name,displayName,createTime,state)"
```

### Cancel a stuck Vertex AI job

```bash
# Replace JOB_NUMBER with the ID of the stuck job
VERTEX_JOB_NAME="projects/impulse-gpu-runtime/locations/us-central1/customJobs/JOB_NUMBER"
gcloud ai custom-jobs cancel "$VERTEX_JOB_NAME" \
  --project=impulse-gpu-runtime \
  --region=us-central1
```

### Force-complete the scheduler record

After cancelling the Vertex job, mark the corresponding scheduler record as `FAILED` to release the lock:

```bash
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MANUALLY_CANCELLED", "error_message": "Stuck job cancelled by on-call operator"}'
```

### Re-queue the job (optional)

If the job should be retried after recovery, publish a fresh message:

```bash
# Build the JSON payload with jq so the IDs are quoted and escaped correctly
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message="$(jq -cn --arg job_id "$JOB_ID" --arg session_id "$SESSION_ID" \
      '{job_id: $job_id, session_id: $session_id, type: "gpu_sandbox"}')" \
  --attribute="retried_by_operator=true"
```

***

## Post-Recovery Checklist

* [ ] DLQ message count returned to 0 (or known-good baseline)
* [ ] Replayed jobs completed successfully in Vertex AI
* [ ] No duplicate billings recorded for replayed `job_id`s
* [ ] Scheduler logs show no new DLQ escalations within 30 minutes
* [ ] Root cause documented in incident postmortem (if more than 10 messages were in the DLQ)
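
The two count-based checklist items can be evaluated mechanically once the operator has the DLQ message counts from before and after recovery. The sketch below assumes those counts are supplied by hand; `post_recovery_check` is a hypothetical helper name, and it does not query Pub/Sub itself.

```bash
# Hypothetical helper (an assumption, not existing tooling).
# Arguments: DLQ count before recovery, DLQ count after recovery,
# known-good baseline (defaults to 0).
post_recovery_check() {
  before=$1
  after=$2
  baseline=${3:-0}
  status=0
  if [ "$after" -le "$baseline" ]; then
    echo "OK: DLQ count $after is at or below baseline $baseline"
  else
    echo "FAIL: DLQ count $after is above baseline $baseline"
    status=1
  fi
  # The checklist asks for a postmortem when more than 10 messages were in the DLQ
  if [ "$before" -gt 10 ]; then
    echo "ACTION: document root cause in the incident postmortem"
  fi
  return "$status"
}

# Example: 14 messages before recovery, 0 afterwards, baseline 0
post_recovery_check 14 0 0
```

The function returns a non-zero status when the count is still above the baseline, so it can gate a wrap-up script; the remaining checklist items still need manual confirmation.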