When a GPU job message exceeds its maximum delivery attempts, it is forwarded to the dead-letter queue (DLQ). This runbook explains how to inspect DLQ contents, safely replay messages, validate idempotency, and recover stuck jobs.
The DLQ subscription is gpu-job-request-dlq-sub on the topic gpu-job-request-dlq in the impulse-gpu-runtime GCP project.
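The forwarding threshold comes from the dead-letter policy on the source subscription. To see it (the source subscription name gpu-job-request-sub is an assumption; substitute the real one if it differs):
# Show the DLQ topic and maxDeliveryAttempts configured on the source subscription
# (gpu-job-request-sub is assumed here)
gcloud pubsub subscriptions describe gpu-job-request-sub \
--project=impulse-gpu-runtime \
--format="yaml(deadLetterPolicy)"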
1. Dead-Letter Queue Inspection
View undelivered message count
The backlog count is not a field on the subscription resource; it is exposed through the Cloud Monitoring metric pubsub.googleapis.com/subscription/num_undelivered_messages (chart it in Metrics Explorer, filtered to gpu-job-request-dlq-sub). To confirm the DLQ subscription's configuration:
gcloud pubsub subscriptions describe gpu-job-request-dlq-sub \
--project=impulse-gpu-runtime \
--format="yaml(name,topic,messageRetentionDuration,ackDeadlineSeconds)"
Pull and inspect DLQ messages (non-destructive)
# Pull up to 50 messages without acknowledging them
gcloud pubsub subscriptions pull gpu-job-request-dlq-sub \
--project=impulse-gpu-runtime \
--limit=50 \
--format=json > /tmp/dlq-messages.json
cat /tmp/dlq-messages.json | jq '.[] | {ackId, messageId: .message.messageId, publishTime: .message.publishTime, data: (.message.data | @base64d | fromjson)}'
Each DLQ message carries a delivery_attempt attribute. Messages forwarded to the DLQ will have delivery_attempt equal to maxDeliveryAttempts (default: 5).
cat /tmp/dlq-messages.json | jq '.[] | {
job_id: (.message.data | @base64d | fromjson | .job_id),
delivery_attempt: .message.attributes.delivery_attempt,
original_publish_time: .message.publishTime
}'
Categorise failures
Review the error_code and error_message fields in the message body to determine the root cause before replaying (a command to tally codes by frequency follows the table):
| error_code | Likely Cause | Action |
|---|---|---|
| VERTEX_QUOTA_EXCEEDED | GPU quota exhausted | See Quota Exhaustion Runbook |
| ARTIFACT_UPLOAD_FAILED | GCS write error | Verify GCS permissions; see Failure Recovery |
| JOB_TIMEOUT | Job exceeded wall-clock limit | Investigate job payload; contact requester |
| SCHEDULER_INTERNAL | Scheduler bug or transient error | Safe to replay; check scheduler logs |
| INVALID_PAYLOAD | Malformed message | Do not replay; fix upstream producer |
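Before deciding what to replay, a quick tally of the pulled messages by error_code shows the failure distribution:
# Count DLQ messages per error_code
cat /tmp/dlq-messages.json \
| jq -r '.[].message.data | @base64d | fromjson | .error_code' \
| sort | uniq -c | sort -rn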
2. Message Replay Procedures
Only replay messages with error codes that are safe to retry (e.g. SCHEDULER_INTERNAL, VERTEX_QUOTA_EXCEEDED after quota is restored). Never replay INVALID_PAYLOAD messages — they will fail again and consume quota.
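If you want to encode this guidance for use in the bulk-replay loop below, a minimal helper might look like this (a sketch; the function and its policy are illustrative, not an existing script):
# Illustrative helper: succeeds only for error codes this runbook treats as safe to retry
is_replay_safe() {
  case "$1" in
    SCHEDULER_INTERNAL) return 0 ;;
    VERTEX_QUOTA_EXCEEDED) return 0 ;;  # only after quota has been restored
    *) return 1 ;;
  esac
}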
Replay a single message
# Extract the raw data from the DLQ message
MESSAGE_DATA=$(cat /tmp/dlq-messages.json | jq -r '.[0].message.data')
JOB_ID=$(echo "$MESSAGE_DATA" | base64 -d | jq -r '.job_id')
echo "Replaying job: $JOB_ID"
# Re-publish to the main job-request topic
gcloud pubsub topics publish gpu-job-request \
--project=impulse-gpu-runtime \
--message="$(echo $MESSAGE_DATA | base64 -d)" \
--attribute="replayed_from_dlq=true,original_job_id=$JOB_ID"
Bulk replay (filtered)
To replay all messages matching a given error_code, filter the pulled messages and re-publish each one:
# Replay all SCHEDULER_INTERNAL failures
cat /tmp/dlq-messages.json \
| jq -r '.[] | select((.message.data | @base64d | fromjson | .error_code) == "SCHEDULER_INTERNAL") | .message.data' \
| while read -r data; do
gcloud pubsub topics publish gpu-job-request \
--project=impulse-gpu-runtime \
--message="$(echo "$data" | base64 -d)" \
--attribute="replayed_from_dlq=true"
done
Acknowledge replayed messages from the DLQ
After confirming the replayed jobs succeeded, acknowledge the DLQ messages to remove them. Note that ack IDs are only valid until the subscription's ack deadline expires, so if much time has passed since the original pull, re-pull the messages to obtain fresh ack IDs (or pull with --auto-ack once you are ready to drain).
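A minimal way to spot-check the replayed jobs, assuming the status API described in section 3 and that the response carries the state in a status field (the field name is an assumption):
# Print the current state of every job pulled from the DLQ ("status" field name assumed)
for job in $(cat /tmp/dlq-messages.json \
| jq -r '.[].message.data | @base64d | fromjson | .job_id'); do
  state=$(curl -s "https://api.impulselabs.ai/gpu/jobs/$job/status" \
    -H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq -r '.status')
  echo "$job $state"
done
Once every replayed job reports success, acknowledge the DLQ messages: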
ACK_IDS=$(cat /tmp/dlq-messages.json | jq -r '[.[].ackId] | join(",")')
gcloud pubsub subscriptions ack gpu-job-request-dlq-sub \
--project=impulse-gpu-runtime \
--ack-ids="$ACK_IDS"
3. Idempotent Replay Validation
The GPU Scheduler implements idempotency checks keyed on job_id. Before replaying, verify the job is not already active or completed:
# Query the job status API
curl -s "https://api.impulselabs.ai/gpu/jobs/$JOB_ID/status" \
-H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" | jq .
# Expected states that are safe to replay
# "FAILED", "EXPIRED", "UNKNOWN"
# States that must NOT be replayed
# "RUNNING", "SUCCEEDED", "QUEUED"
The scheduler will reject duplicate job_id submissions with a 409 Conflict response and drop the message from the queue automatically. This prevents double-billing.
Verify replay idempotency in scheduler logs
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=200 | \
grep -E "duplicate_job|idempotency_check|$JOB_ID"
Look for log entries such as:
{"level":"info","msg":"idempotency check passed","job_id":"...","prior_status":"FAILED"}
{"level":"warn","msg":"duplicate job rejected","job_id":"...","prior_status":"RUNNING"}
4. Stuck Job Recovery
A job is considered “stuck” if it remains in RUNNING state for longer than its configured max_runtime_seconds (default: 3600 s).
Identify stuck jobs
# List Vertex AI custom jobs running for > 1 hour
gcloud ai custom-jobs list \
--project=impulse-gpu-runtime \
--region=us-central1 \
--filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '1 hour ago' -Iseconds)" \
--format="table(name,displayName,createTime,state)"
Cancel a stuck Vertex AI job
VERTEX_JOB_NAME="projects/impulse-gpu-runtime/locations/us-central1/customJobs/JOB_NUMBER"
gcloud ai custom-jobs cancel "$VERTEX_JOB_NAME" \
--project=impulse-gpu-runtime \
--region=us-central1
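If several jobs are stuck, the listing above can feed the cancel command directly (same filter, cancelling each returned job):
# Cancel every custom job that has been RUNNING for more than an hour
gcloud ai custom-jobs list \
--project=impulse-gpu-runtime \
--region=us-central1 \
--filter="state=JOB_STATE_RUNNING AND createTime<$(date -d '1 hour ago' -Iseconds)" \
--format="value(name)" \
| while read -r job; do
  gcloud ai custom-jobs cancel "$job" \
    --project=impulse-gpu-runtime \
    --region=us-central1
done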
Force-complete the scheduler record
After cancelling the Vertex job, mark the corresponding scheduler record as FAILED to release the lock:
curl -s -X PATCH "https://api.impulselabs.ai/internal/gpu/jobs/$JOB_ID" \
-H "Authorization: Bearer $IMPULSE_SERVICE_TOKEN" \
-H "Content-Type: application/json" \
-d '{"status": "FAILED", "error_code": "MANUALLY_CANCELLED", "error_message": "Stuck job cancelled by on-call operator"}'
Re-queue the job (optional)
If the job should be retried after recovery, publish a fresh message:
gcloud pubsub topics publish gpu-job-request \
--project=impulse-gpu-runtime \
--message='{"job_id":"'$JOB_ID'","session_id":"'$SESSION_ID'","type":"gpu_sandbox"}' \
--attribute="retried_by_operator=true"
Post-Recovery Checklist