# GPU Runtime DLQ Recovery Runbook

> Procedures for inspecting, replaying, and recovering messages from the GPU Runtime dead-letter queue

When a GPU job message exceeds its maximum delivery attempts it is forwarded to the dead-letter queue (DLQ). This runbook explains how to inspect DLQ contents, safely replay messages, validate idempotency, and recover stuck jobs.

<Note>
  The DLQ subscription is `gpu-job-request-dlq-sub` on the topic `gpu-job-request-dlq` in the `impulse-gpu-runtime` GCP project.
</Note>

***

## 1. Dead-Letter Queue Inspection

### View undelivered message count

```bash theme={null}
gcloud pubsub subscriptions describe gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(name,numUndeliveredMessages)"
```
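
The `describe` output may not include a backlog field, because the Subscription resource does not normally carry a message count. Pub/Sub also exposes the backlog as the Cloud Monitoring metric `pubsub.googleapis.com/subscription/num_undelivered_messages`. The sketch below reads the most recent sample through the Monitoring `timeSeries` API. The function names are hypothetical, and it assumes GNU `date`, `curl`, `jq`, and an authenticated `gcloud` session with permission to read metrics.

```bash
# Build the Cloud Monitoring filter for a subscription's backlog metric.
backlog_filter() {
  local subscription="$1"
  printf 'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" AND resource.labels.subscription_id="%s"' "$subscription"
}

# Print the most recent backlog sample from the last 10 minutes.
# Defaults are the project and subscription named in this runbook.
dlq_backlog() {
  local project="${1:-impulse-gpu-runtime}"
  local subscription="${2:-gpu-job-request-dlq-sub}"
  local start end
  start="$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)"
  end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  curl -sG "https://monitoring.googleapis.com/v3/projects/${project}/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=$(backlog_filter "$subscription")" \
    --data-urlencode "interval.startTime=${start}" \
    --data-urlencode "interval.endTime=${end}" \
    | jq -r '.timeSeries[0].points[0].value.int64Value // "no data"'
}
```

Running `dlq_backlog` with no arguments queries the DLQ subscription. Monitoring samples can lag by a few minutes, so a `no data` result shortly after a burst is not necessarily an empty queue.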

### Pull and inspect DLQ messages (non-destructive)

```bash theme={null}
# Pull up to 50 messages without acknowledging them; unacknowledged messages
# become available for delivery again once the ack deadline expires
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --limit=50 \
  --format=json > /tmp/dlq-messages.json

cat /tmp/dlq-messages.json | jq '.[] | {ackId, messageId: .message.messageId, publishTime: .message.publishTime, data: (.message.data | @base64d | fromjson)}'
```

### Inspect delivery attempt metadata

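Each DLQ message carries a `delivery_attempt` attribute. Messages forwarded to the DLQ will have `delivery_attempt` equal to `maxDeliveryAttempts` (default: 5). Pub/Sub itself also stamps forwarded messages with `CloudPubSubDeadLetterSource*` attributes, including `CloudPubSubDeadLetterSourceDeliveryCount` and `CloudPubSubDeadLetterSourceTopicPublishTime`. The message's own `publishTime` is the time it was published to the DLQ topic, not to the original topic, so the query below prefers the source attributes when they are present.

```bash theme={null}
jq '.[] | {
  job_id: (.message.data | @base64d | fromjson | .job_id),
  delivery_attempt: (.message.attributes.delivery_attempt // .message.attributes.CloudPubSubDeadLetterSourceDeliveryCount),
  original_publish_time: (.message.attributes.CloudPubSubDeadLetterSourceTopicPublishTime // .message.publishTime),
  dlq_publish_time: .message.publishTime
}' /tmp/dlq-messages.json
```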

### Categorise failures

Review the `error_code` and `error_message` fields in the message body to determine root cause before replaying:

| `error_code`             | Likely Cause                     | Action                                                                           |
| ------------------------ | -------------------------------- | -------------------------------------------------------------------------------- |
| `VERTEX_QUOTA_EXCEEDED`  | GPU quota exhausted              | See [Quota Exhaustion Runbook](/gpu-operations/quota-exhaustion)                 |
| `ARTIFACT_UPLOAD_FAILED` | GCS write error                  | Verify GCS permissions; see [Failure Recovery](/gpu-operations/failure-recovery) |
| `JOB_TIMEOUT`            | Job exceeded wall-clock limit    | Investigate job payload; contact requester                                       |
| `SCHEDULER_INTERNAL`     | Scheduler bug or transient error | Safe to replay; check scheduler logs                                             |
| `INVALID_PAYLOAD`        | Malformed message                | Do **not** replay; fix upstream producer                                         |
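
Before working through the table row by row, it can help to see how many messages fall under each `error_code`. The following is a minimal sketch, assuming the dump was produced by `gcloud pubsub subscriptions pull --format=json` and that every payload is base64-encoded JSON. The function name `summarise_dlq_errors` is hypothetical.

```bash
# Summarise a pulled DLQ dump by error_code.
# Prints "<count><TAB><error_code>" lines, most frequent first.
summarise_dlq_errors() {
  local dump="${1:-/tmp/dlq-messages.json}"
  jq -r '
    map(.message.data | @base64d | fromjson | .error_code // "MISSING")
    | group_by(.)
    | map({code: .[0], count: length})
    | sort_by(-.count)
    | .[] | "\(.count)\t\(.code)"
  ' "$dump"
}
```

Messages whose body has no `error_code` are grouped under `MISSING`, which is a label chosen for this sketch and not a scheduler error code.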

***

## 2. Message Replay Procedures

<Warning>
  Only replay messages with error codes that are safe to retry (e.g. `SCHEDULER_INTERNAL`, `VERTEX_QUOTA_EXCEEDED` after quota is restored). Never replay `INVALID_PAYLOAD` messages — they will fail again and consume quota.
</Warning>
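
The replay policy above can be encoded as a small guard so that scripts refuse unsafe codes by default. This is a sketch only: the function name `is_replayable` and the `QUOTA_RESTORED` environment variable are hypothetical, and treating every unlisted code (including `JOB_TIMEOUT` and `ARTIFACT_UPLOAD_FAILED`) as not automatically replayable is an assumption of this example.

```bash
# Return 0 if an error_code is considered safe to replay, non-zero otherwise.
is_replayable() {
  case "$1" in
    SCHEDULER_INTERNAL)
      return 0 ;;
    VERTEX_QUOTA_EXCEEDED)
      # Only replay once an operator has confirmed that quota is restored.
      [ "${QUOTA_RESTORED:-false}" = "true" ] ;;
    *)
      # INVALID_PAYLOAD and any unrecognised code: fail closed.
      return 1 ;;
  esac
}
```

For example, `is_replayable "$ERROR_CODE" || echo "skipping $JOB_ID"` skips anything that has not been explicitly allowed.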

### Replay a single message

```bash theme={null}
# Extract the base64-encoded payload of the first DLQ message and decode it once
MESSAGE_DATA=$(jq -r '.[0].message.data' /tmp/dlq-messages.json)
PAYLOAD=$(printf '%s' "$MESSAGE_DATA" | base64 --decode)
JOB_ID=$(printf '%s' "$PAYLOAD" | jq -r '.job_id')

echo "Replaying job: $JOB_ID"

# Re-publish to the main job-request topic
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message="$PAYLOAD" \
  --attribute="replayed_from_dlq=true,original_job_id=$JOB_ID"
```

### Bulk replay (filtered)

Use the following pipeline to replay all messages matching a given `error_code`:

```bash theme={null}
# Replay all SCHEDULER_INTERNAL failures
cat /tmp/dlq-messages.json \
  | jq -r '.[] | select((.message.data | @base64d | fromjson | .error_code) == "SCHEDULER_INTERNAL") | .message.data' \
  | while read -r data; do
      gcloud pubsub topics publish gpu-job-request \
        --project=impulse-gpu-runtime \
        --message="$(echo "$data" | base64 -d)" \
        --attribute="replayed_from_dlq=true"
    done
```

### Acknowledge replayed messages from the DLQ

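After confirming the replayed jobs succeeded, acknowledge the corresponding DLQ messages to remove them. Ack IDs are only valid until the subscription's acknowledgement deadline expires, so if time has passed since the pull, pull again and use the fresh `ackId` values. Acknowledge only the messages you actually replayed; the command below acknowledges every message in `/tmp/dlq-messages.json`.

```bash theme={null}
# Collect the ackId of every message in the dump
# (filter the dump first if only some messages were replayed)
ACK_IDS=$(jq -r '[.[].ackId] | join(",")' /tmp/dlq-messages.json)
gcloud pubsub subscriptions ack gpu-job-request-dlq-sub \
  --project=impulse-gpu-runtime \
  --ack-ids="$ACK_IDS"
```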
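
One way to acknowledge only what was replayed is to record the `messageId` of each replayed message, pull again once the jobs have finished, and select the matching entries from the fresh pull. The sketch below assumes a hypothetical file of replayed message IDs, one per line. The function name `fresh_ack_ids` is also hypothetical.

```bash
# Print a comma-separated list of ackIds from a fresh pull, restricted to
# messages whose messageId appears in the replayed-IDs file.
#   $1: JSON dump from a recent `gcloud pubsub subscriptions pull --format=json`
#   $2: text file with one replayed messageId per line (hypothetical)
fresh_ack_ids() {
  local fresh_dump="$1" replayed_ids_file="$2"
  jq -r --arg ids "$(cat "$replayed_ids_file")" '
    ($ids | split("\n") | map(select(length > 0))) as $wanted
    | [ .[]
        | select(.message.messageId as $id | any($wanted[]; . == $id))
        | .ackId ]
    | join(",")
  ' "$fresh_dump"
}

# Example (not executed here):
#   gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
#     --project=impulse-gpu-runtime --limit=50 --format=json > /tmp/dlq-fresh.json
#   gcloud pubsub subscriptions ack gpu-job-request-dlq-sub \
#     --project=impulse-gpu-runtime \
#     --ack-ids="$(fresh_ack_ids /tmp/dlq-fresh.json /tmp/replayed-ids.txt)"
```

Matching on `messageId` works because that identifier stays the same across redeliveries on a subscription, whereas `ackId` changes with every pull.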

***

## 3. Idempotent Replay Validation

The GPU Scheduler implements idempotency checks keyed on `job_id`. Before replaying, verify the job is not already active or completed:

```bash theme={null}
# Query the job status API
curl -s "https://api.impulselabs.ai/gpu/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq .

# Expected states that are safe to replay
# "FAILED", "EXPIRED", "UNKNOWN"

# States that must NOT be replayed
# "RUNNING", "SUCCEEDED", "QUEUED"
```

The scheduler will reject duplicate `job_id` submissions with a `409 Conflict` response and drop the message from the queue automatically. This prevents double-billing.
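
The status check can be turned into a gate that fails closed, so a replay only proceeds for the states listed as safe. This is a rough sketch: the function names are hypothetical, and it assumes the status endpoint returns JSON with a top-level `status` field, which this runbook does not confirm.

```bash
# Decide whether a scheduler status permits a replay.
# Anything not explicitly listed as safe is rejected.
replay_allowed_for_status() {
  case "$1" in
    FAILED|EXPIRED|UNKNOWN) return 0 ;;
    *) return 1 ;;   # RUNNING, SUCCEEDED, QUEUED, empty, or unexpected values
  esac
}

# Fetch the status for a job and gate the replay on it.
# ASSUMPTION: the response body looks like {"status": "..."}.
check_before_replay() {
  local job_id="${1:?job_id required}" status
  status="$(curl -sf "https://api.impulselabs.ai/gpu/jobs/${job_id}/status" \
    -H "Authorization: Bearer ${IMPULSE_SERVICE_TOKEN}" | jq -r '.status // empty')"
  if replay_allowed_for_status "$status"; then
    echo "OK to replay ${job_id} (status: ${status})"
  else
    echo "Do NOT replay ${job_id} (status: ${status:-unavailable})" >&2
    return 1
  fi
}
```

If the API call fails, the status is empty and the gate refuses the replay, which is the conservative choice when the job state cannot be confirmed.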

### Verify replay idempotency in scheduler logs

```bash theme={null}
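# ${JOB_ID:?} aborts if JOB_ID is unset; an empty alternative would match every line
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=200 | \
  grep -E "duplicate[_ ]job|idempotency[_ ]check|${JOB_ID:?JOB_ID is not set}"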
```

Look for log entries such as:

```json theme={null}
{"level":"info","msg":"idempotency check passed","job_id":"...","prior_status":"FAILED"}
{"level":"warn","msg":"duplicate job rejected","job_id":"...","prior_status":"RUNNING"}
```
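
Since the entries are JSON, they can also be filtered structurally and not only by text match. The following sketch assumes each scheduler log line is a single JSON object with the `level`, `msg`, `job_id`, and `prior_status` fields shown above. The function name `filter_job_logs` is hypothetical.

```bash
# Read log lines on stdin and print "<level><TAB><msg><TAB><prior_status>"
# for entries that belong to the given job_id. Non-JSON lines are ignored.
filter_job_logs() {
  local job_id="${1:?job_id required}"
  jq -rR --arg id "$job_id" '
    fromjson?
    | select(type == "object")
    | select(.job_id == $id)
    | [.level, .msg, .prior_status] | @tsv
  '
}
```

Usage would look like `kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=200 | filter_job_logs "$JOB_ID"`.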

***

## 4. Stuck Job Recovery

A job is considered "stuck" if it remains in `RUNNING` state for longer than its configured `max_runtime_seconds` (default: 3600 s).
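
The definition above amounts to a simple elapsed-time comparison. As a minimal sketch, assuming GNU `date` and an RFC 3339 start timestamp, the hypothetical helper `is_stuck` below returns success when a job has been running longer than its limit:

```bash
# Return 0 if the job has been running longer than max_runtime_seconds.
#   $1: start time, e.g. 2024-01-01T00:00:00Z
#   $2: max_runtime_seconds (defaults to 3600)
#   $3: current time as a Unix epoch (defaults to now; overridable for testing)
is_stuck() {
  local started_at="$1" max_runtime="${2:-3600}" now="${3:-$(date -u +%s)}"
  local started_epoch
  started_epoch="$(date -u -d "$started_at" +%s)" || return 2
  [ $(( now - started_epoch )) -gt "$max_runtime" ]
}
```

A job that has run for exactly `max_runtime_seconds` is not yet treated as stuck; only a strictly longer runtime qualifies.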

### Identify stuck jobs

```bash theme={null}
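# List Vertex AI custom jobs that are still running and were created more than 1 hour ago
# (GNU date syntax; the cutoff is emitted in UTC so it compares cleanly with createTime)
CUTOFF=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="state=JOB_STATE_RUNNING AND createTime<\"$CUTOFF\"" \
  --format="table(name,displayName,createTime,state)"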
```

### Cancel a stuck Vertex AI job

```bash theme={null}
VERTEX_JOB_NAME="projects/impulse-gpu-runtime/locations/us-central1/customJobs/JOB_NUMBER"

gcloud ai custom-jobs cancel "$VERTEX_JOB_NAME" \
  --project=impulse-gpu-runtime \
  --region=us-central1
```

### Force-complete the scheduler record

After cancelling the Vertex job, mark the corresponding scheduler record as `FAILED` to release the lock:

```bash theme={null}
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
  -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "FAILED", "error_code": "MANUALLY_CANCELLED", "error_message": "Stuck job cancelled by on-call operator"}'
```

### Re-queue the job (optional)

If the job should be retried after recovery, publish a fresh message:

```bash theme={null}
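# Build the payload with jq so the values are quoted and JSON-escaped correctly
PAYLOAD=$(jq -cn --arg job_id "$JOB_ID" --arg session_id "$SESSION_ID" \
  '{job_id: $job_id, session_id: $session_id, type: "gpu_sandbox"}')

gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message="$PAYLOAD" \
  --attribute="retried_by_operator=true"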
```

***

## Post-Recovery Checklist

* [ ] DLQ message count returned to 0 (or known-good baseline)
* [ ] Replayed jobs completed successfully in Vertex AI
* [ ] No duplicate billings recorded for replayed `job_id`s
* [ ] Scheduler logs show no new DLQ escalations within 30 minutes
* [ ] Root cause documented in incident postmortem (if > 10 messages in DLQ)
