GPU Runtime Deployment Runbook

This runbook covers the full deployment lifecycle for the GPU Runtime stack, including Terraform infrastructure, scheduler services, supporting queues, and validation checkpoints.

Always perform deployments from the development branch in a maintenance window. Coordinate with on-call before proceeding to production.

Prerequisites

Terraform ≥ 1.6 installed, with Application Default Credentials configured (gcloud auth application-default login)
Access to the impulse-gpu-runtime GCP project with roles/editor or higher
kubectl configured for the target cluster (gke_impulse-gpu-runtime_<region>_gpu-scheduler)
Pub/Sub and Vertex AI APIs enabled in the target project

1. Terraform Rollout Steps

Pull latest Terraform state

cd infra/terraform/gpu-runtime
terraform init -reconfigure
terraform workspace select production   # or staging

Plan and review changes

terraform plan -out=gpu-runtime.tfplan

Review the plan output carefully. Expected resources include:

google_pubsub_topic — job-request and job-result topics
google_pubsub_subscription — scheduler subscriptions and DLQ subscriptions
google_vertex_ai_* — Vertex AI custom job configurations
google_service_account — runtime and scheduler service accounts
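The review step can be backed by a small guard that refuses to continue when the plan would destroy anything. The sketch below is an assumption layered on top of this runbook: the helper name plan_has_no_destroys is hypothetical, and it relies on the plain-text summary line that terraform plan -no-color prints ("Plan: N to add, M to change, K to destroy.").

```shell
# Reads `terraform plan -no-color` output on stdin. Succeeds only when the
# plan reports no changes at all or a summary line with zero destroys.
plan_has_no_destroys() {
  local plan_text destroys
  plan_text=$(cat)
  case "$plan_text" in
    *"No changes."*) return 0 ;;
  esac
  # Pull K out of "Plan: ... K to destroy."
  destroys=$(printf '%s\n' "$plan_text" |
    sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\.$/\1/p' | head -n 1)
  # A missing summary line counts as a failure rather than a guess.
  [ -n "$destroys" ] && [ "$destroys" -eq 0 ]
}

# Possible usage (not executed here):
#   terraform plan -no-color -out=gpu-runtime.tfplan | tee plan.log
#   plan_has_no_destroys < plan.log || echo "plan destroys resources; stop and review" >&2
```

A plan that legitimately removes resources still needs a human decision; the guard only makes that case impossible to miss.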

Apply the plan

terraform apply gpu-runtime.tfplan

Monitor the output for errors. A successful apply ends with:

Apply complete! Resources: N added, M changed, 0 destroyed.

Verify the updated state

Terraform remote state is stored in GCS (gs://impulse-tfstate/gpu-runtime). Verify the state file was updated:

gsutil stat gs://impulse-tfstate/gpu-runtime/terraform.tfstate

2. Scheduler Deployment Steps

Build and push the scheduler image

cd services/gpu-scheduler
docker build -t gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA .
docker push gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA
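The build and push commands depend on $GIT_SHA, which this runbook does not define. An empty value would produce an invalid image tag, so a small validation helper may be worth adding. This is a minimal sketch: the helper name scheduler_image is hypothetical, and deriving the value with git rev-parse is an assumption about how the tag is chosen.

```shell
# Prints the image reference used by the docker and kubectl steps.
# Fails when GIT_SHA is empty or contains anything but lowercase hex digits.
scheduler_image() {
  case "${GIT_SHA:-}" in
    "" | *[!0-9a-f]*)
      echo "GIT_SHA must be a lowercase hex commit id, got '${GIT_SHA:-}'" >&2
      return 1
      ;;
  esac
  printf 'gcr.io/impulse-gpu-runtime/gpu-scheduler:%s\n' "$GIT_SHA"
}

# Possible usage (not executed here; the git command is an assumption):
#   GIT_SHA=$(git rev-parse --short HEAD)
#   image=$(scheduler_image) && docker build -t "$image" . && docker push "$image"
```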

Update the Kubernetes deployment

kubectl set image deployment/gpu-scheduler \
  gpu-scheduler=gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA \
  -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime

Verify pod health

kubectl get pods -n gpu-runtime -l app=gpu-scheduler
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=50

All pods should reach the Running state within 2 minutes. Then probe the health endpoint, which should return {"status":"ok"}:

kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
  curl -s http://localhost:8080/healthz
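The 2-minute window can be enforced by polling rather than by checking once. The sketch below shows a generic retry loop with a deadline; the helper names retry_until and healthz_ok are hypothetical, and the 5-second interval is an arbitrary choice.

```shell
# Re-runs a command every INTERVAL seconds until it succeeds or TIMEOUT
# seconds have elapsed. Returns 0 on success and 1 on timeout.
retry_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Possible usage (not executed here):
#   healthz_ok() {
#     kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
#       curl -sf http://localhost:8080/healthz | grep -q '"status":"ok"'
#   }
#   retry_until 120 5 healthz_ok || echo "health check failed; start rollback" >&2
```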

3. Service Account Validation

After Terraform apply, confirm service accounts have the required roles:

# Scheduler SA: must have Pub/Sub subscriber and publisher, Vertex AI user, and log writer
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Runtime SA — must have GCS object admin + Artifact Registry reader
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

Expected roles for gpu-scheduler SA:

Role	Purpose
`roles/pubsub.subscriber`	Consume job-request messages
`roles/pubsub.publisher`	Publish job-result messages
`roles/aiplatform.user`	Submit Vertex AI custom jobs
`roles/logging.logWriter`	Emit structured logs

Expected roles for gpu-runtime SA:

Role	Purpose
`roles/storage.objectAdmin`	Write training artifacts to GCS
`roles/artifactregistry.reader`	Pull runtime container images
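Comparing the gcloud output against the two tables by eye is error-prone, so the comparison can be scripted. The following is a sketch under one assumption: the granted roles arrive one per line on stdin, as the table(bindings.role) or value(bindings.role) formats print them. The helper name missing_roles is hypothetical.

```shell
# Prints every expected role (given as arguments) that does not appear as
# a whole line on stdin. Exit status is 0 only when nothing is missing.
missing_roles() {
  local actual role missing=0
  actual=$(cat)
  for role in "$@"; do
    if ! printf '%s\n' "$actual" | grep -qxF -- "$role"; then
      echo "missing: $role"
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage (not executed here):
#   gcloud projects get-iam-policy impulse-gpu-runtime \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
#     --format="value(bindings.role)" |
#     missing_roles roles/storage.objectAdmin roles/artifactregistry.reader
```

The check only reports missing roles; extra roles beyond the expected set pass silently and still deserve a manual look.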

4. Queue and Topic Validation

# List all GPU-runtime Pub/Sub topics
gcloud pubsub topics list --project=impulse-gpu-runtime \
  --filter="name:gpu-"

# Confirm DLQ subscriptions exist
gcloud pubsub subscriptions list --project=impulse-gpu-runtime \
  --filter="name:gpu-*-dlq"

# Verify subscription ack deadline and retention
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)"

Expected subscription configuration:

Parameter	Expected Value
`ackDeadlineSeconds`	600
`messageRetentionDuration`	7 days (reported as `604800s`)
`deadLetterPolicy.maxDeliveryAttempts`	5
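The expected values in the table can be asserted directly against the describe output. This sketch assumes the YAML uses plain "key: value" lines and that the retention duration is printed in seconds; the helper names yaml_value and check_yaml are hypothetical.

```shell
# Prints the value of the first "key: value" line on stdin whose key matches.
yaml_value() {
  awk -v key="$1:" '$1 == key && !seen++ { print $2 }'
}

# Compares "key=expected" pairs against the YAML on stdin and reports every
# mismatch. Exit status is 0 only when all pairs match.
check_yaml() {
  local yaml pair key want got status=0
  yaml=$(cat)
  for pair in "$@"; do
    key=${pair%%=*}
    want=${pair#*=}
    got=$(printf '%s\n' "$yaml" | yaml_value "$key")
    if [ "$got" != "$want" ]; then
      echo "$key: expected $want, got ${got:-<unset>}"
      status=1
    fi
  done
  return "$status"
}

# Possible usage (not executed here):
#   gcloud pubsub subscriptions describe gpu-job-request-sub \
#     --project=impulse-gpu-runtime \
#     --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)" |
#     check_yaml ackDeadlineSeconds=600 messageRetentionDuration=604800s maxDeliveryAttempts=5
```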

5. Pub/Sub Validation

Send a synthetic test message and verify end-to-end flow:

# Publish a no-op probe message
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message='{"job_id":"deploy-probe","type":"noop","session_id":"probe-001"}'

# Watch scheduler logs for probe pickup (within 30 s)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow | grep "deploy-probe"
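A log stream opened with --follow never ends on its own, so the 30-second expectation is easier to enforce with a bounded wait. The sketch below assumes GNU coreutils timeout is available; the helper name watch_for is hypothetical.

```shell
# Runs a command for at most SECS seconds and prints the first output line
# containing PATTERN. Exit status is 0 only if such a line appeared.
# The call returns once the command exits or the timeout fires.
watch_for() {
  local secs=$1 pattern=$2
  shift 2
  # "|| true" keeps the timeout's exit code (124) from masking grep's result.
  { timeout "$secs" "$@" 2>/dev/null || true; } | grep -m 1 -F -- "$pattern"
}

# Possible usage (not executed here):
#   watch_for 30 deploy-probe \
#     kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow ||
#     echo "probe not picked up within 30 s" >&2
```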

Verify the corresponding result message appears on the result topic:

gcloud pubsub subscriptions pull gpu-job-result-sub \
  --project=impulse-gpu-runtime \
  --auto-ack --limit=10 | grep "deploy-probe"

6. Vertex AI Validation

# List recent custom jobs to confirm connectivity
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="displayName:gpu-runtime-*" \
  --limit=5

# Verify quota availability (each quota entry lists limit, metric, usage in that order)
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS\|NVIDIA_A100"
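Reading limit and usage off grep context lines is easy to get wrong, so the quota entry can be parsed instead. This sketch assumes each entry in the yaml(quotas) output starts with "- " and carries limit, metric, and usage keys; the helper name quota_headroom is hypothetical.

```shell
# Reads `--format="yaml(quotas)"` output on stdin and prints limit, usage,
# and remaining headroom for one metric. Fails if the metric is absent.
quota_headroom() {
  awk -v want="$1" '
    function flush() {
      if (name == want) {
        printf "%s limit=%d usage=%d free=%d\n", name, limit, usage, limit - usage
        found = 1
      }
      name = ""; limit = 0; usage = 0
    }
    { k = $1; v = $2 }
    $1 == "-" { flush(); k = $2; v = $3 }   # "- key: value" starts a new entry
    k == "limit:"  { limit = v }
    k == "metric:" { name = v }
    k == "usage:"  { usage = v }
    END { flush(); exit !found }
  '
}

# Possible usage (not executed here):
#   gcloud compute regions describe us-central1 \
#     --project=impulse-gpu-runtime --format="yaml(quotas)" |
#     quota_headroom NVIDIA_T4_GPUS
```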

7. Rollback Procedures

Initiate rollback as soon as health checks fail after a deployment. Do not let more than 5 minutes pass between the failure and the start of the rollback.

Kubernetes rollback

kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
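To keep within the 5-minute limit, the wait and the undo can be chained so that a stalled rollout triggers the rollback automatically. This is a sketch, not the team's established procedure: the helper name rollout_or_undo is hypothetical and the 120s default is an assumption taken from the 2-minute pod start-up expectation.

```shell
# Waits for the rollout to finish. On failure or timeout, rolls back to the
# previous revision and waits for the rollback to settle. Returns 0 only
# when the original rollout succeeded.
rollout_or_undo() {
  local deployment=$1 namespace=$2 limit=${3:-120s}
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$limit"; then
    return 0
  fi
  echo "rollout of $deployment failed; rolling back" >&2
  kubectl rollout undo "deployment/$deployment" -n "$namespace"
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$limit"
  return 1
}

# Possible usage (not executed here):
#   rollout_or_undo gpu-scheduler gpu-runtime || echo "deployment rolled back" >&2
```

The function returns non-zero even after a clean rollback, so a calling script still records the deployment as failed.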

Terraform rollback

If infrastructure changes need to be reverted:

# Identify the previous state version
gsutil ls -l gs://impulse-tfstate/gpu-runtime/ | sort -k2 | tail -5

# Restore previous state (replace TIMESTAMP with previous version)
gsutil cp gs://impulse-tfstate/gpu-runtime/terraform.tfstate.TIMESTAMP \
  gs://impulse-tfstate/gpu-runtime/terraform.tfstate

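# Re-apply from the restored state. Run this from a checkout of the configuration
# that matches that state: Terraform converges infrastructure to the configuration,
# not to the state file.
terraform apply -auto-approve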

Emergency: disable job intake

If a deployment causes runaway job submission, stop push delivery on the job-request subscription:

gcloud pubsub subscriptions modify-push-config gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --push-endpoint=""   # clears push config; reverts to pull mode

Post-Deployment Checklist

All gpu-scheduler pods are Running
Health endpoint returns {"status":"ok"}
Probe message processed successfully end-to-end
No error-rate increase in Cloud Monitoring dashboards
Service account roles match the expected table above
DLQ subscriptions have zero undelivered messages
On-call notified of deployment completion
