This runbook covers the full deployment lifecycle for the GPU Runtime stack, including Terraform infrastructure, scheduler services, supporting queues, and validation checkpoints.
Always perform deployments from the development branch in a maintenance window. Coordinate with on-call before proceeding to production.
Prerequisites
- Terraform ≥ 1.6 installed and authenticated (gcloud auth application-default login)
- Access to the impulse-gpu-runtime GCP project with roles/editor or higher
- kubectl configured for the target cluster (gke_impulse-gpu-runtime_<region>_gpu-scheduler)
- Pub/Sub and Vertex AI APIs enabled in the target project
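The prerequisites above can be checked before starting. The following is a minimal sketch, assuming bash and a `sort` that supports `-V` (GNU coreutils); `version_ge` and `missing_tools` are hypothetical helper names and are not part of the repository.

```shell
# Hypothetical preflight helpers; names are illustrative only.

# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# missing_tools TOOL...: prints each required CLI that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

One way to use these: run `missing_tools terraform gcloud kubectl` and expect no output, then run `version_ge "$(terraform version | head -n1 | sed 's/^Terraform v//')" 1.6` and check that it exits with status 0.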
1. Terraform Deployment Steps
Pull latest Terraform state
cd infra/terraform/gpu-runtime
terraform init -reconfigure
terraform workspace select production # or staging
Plan and review changes
terraform plan -out=gpu-runtime.tfplan
Review the plan output carefully. Expected resources include:
- google_pubsub_topic: job-request and job-result topics
- google_pubsub_subscription: scheduler subscriptions and DLQ subscriptions
- google_vertex_ai_*: Vertex AI custom job configurations
- google_service_account: runtime and scheduler service accounts
Apply the plan
terraform apply gpu-runtime.tfplan
Monitor the output for errors. A successful apply ends with:
Apply complete! Resources: N added, M changed, 0 destroyed.
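The summary line can also be checked mechanically so that an unexpected destroy does not go unnoticed. This is a sketch under the assumption that the apply output was captured without color codes (for example with Terraform's `-no-color` flag); `destroyed_count` is a hypothetical helper.

```shell
# destroyed_count: reads Terraform apply output on stdin and prints the
# number of destroyed resources from the "Apply complete!" summary line.
# Prints nothing and returns 1 if no summary line is found.
destroyed_count() {
  local n
  n=$(sed -n 's/.*Apply complete! Resources: .* \([0-9][0-9]*\) destroyed\..*/\1/p' | tail -n1)
  [ -n "$n" ] || return 1
  echo "$n"
}
```

For example, `terraform apply -no-color gpu-runtime.tfplan | tee apply.log` followed by `destroyed_count < apply.log` should print 0 for a deployment that is not expected to remove resources.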
Commit the updated state
Terraform remote state is stored in GCS (gs://impulse-tfstate/gpu-runtime). Verify the state file was updated:
gsutil stat gs://impulse-tfstate/gpu-runtime/terraform.tfstate
2. Scheduler Deployment Steps
Build and push the scheduler image
cd services/gpu-scheduler
docker build -t gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA .
docker push gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA
Update the Kubernetes deployment
kubectl set image deployment/gpu-scheduler \
gpu-scheduler=gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA \
-n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
Verify pod health
kubectl get pods -n gpu-runtime -l app=gpu-scheduler
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=50
All pods should reach Running state within 2 minutes. Probe the health endpoint:
kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
curl -s http://localhost:8080/healthz
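Because pods may take up to 2 minutes to become ready, the health probe can be wrapped in a bounded retry instead of being run once. A minimal sketch, assuming bash; `retry` is a hypothetical helper.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD...: runs CMD until it succeeds or
# ATTEMPTS is exhausted, sleeping DELAY_SECONDS between tries.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep only when another attempt is still to come.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `retry 12 10 kubectl exec -n gpu-runtime deploy/gpu-scheduler -- curl -sf http://localhost:8080/healthz` tries for roughly 2 minutes. The added `-f` makes curl exit non-zero on HTTP error responses, which the plain `-s` form does not.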
3. Service Account Validation
After Terraform apply, confirm service accounts have the required roles:
# Scheduler SA — must have Pub/Sub subscriber + Vertex AI user
gcloud projects get-iam-policy impulse-gpu-runtime \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
--format="table(bindings.role)"
# Runtime SA — must have GCS object admin + Artifact Registry reader
gcloud projects get-iam-policy impulse-gpu-runtime \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
--format="table(bindings.role)"
Expected roles for gpu-scheduler SA:
| Role | Purpose |
|---|---|
| roles/pubsub.subscriber | Consume job-request messages |
| roles/pubsub.publisher | Publish job-result messages |
| roles/aiplatform.user | Submit Vertex AI custom jobs |
| roles/logging.logWriter | Emit structured logs |
Expected roles for gpu-runtime SA:
| Role | Purpose |
|---|---|
| roles/storage.objectAdmin | Write training artifacts to GCS |
| roles/artifactregistry.reader | Pull runtime container images |
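Comparing the command output against the two tables by eye is error-prone, so the comparison can be scripted. The sketch below assumes bash and that the role names arrive one per line on stdin; `missing_roles` is a hypothetical helper.

```shell
# missing_roles EXPECTED...: reads actual role names on stdin (one per line)
# and prints each expected role that is absent.
# Exit status is 0 only when nothing is missing.
missing_roles() {
  local actual role missing=0
  actual=$(cat)
  for role in "$@"; do
    # -x matches the whole line, -F treats the role as a literal string.
    if ! grep -qxF "$role" <<<"$actual"; then
      echo "$role"
      missing=1
    fi
  done
  return "$missing"
}
```

For the scheduler SA, pipe the first get-iam-policy command into `missing_roles roles/pubsub.subscriber roles/pubsub.publisher roles/aiplatform.user roles/logging.logWriter`. Switching that command to `--format="value(bindings.role)"` drops the table header line, although the header is harmless here because it never matches a role name.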
4. Queue and Topic Validation
# List all GPU-runtime Pub/Sub topics
gcloud pubsub topics list --project=impulse-gpu-runtime \
--filter="name:gpu-"
# Confirm DLQ subscriptions exist
gcloud pubsub subscriptions list --project=impulse-gpu-runtime \
--filter="name~gpu-.*-dlq$"
# Verify subscription ack deadline and retention
gcloud pubsub subscriptions describe gpu-job-request-sub \
--project=impulse-gpu-runtime \
--format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)"
Expected subscription configuration:
| Parameter | Expected Value |
|---|---|
| ackDeadlineSeconds | 600 |
| messageRetentionDuration | 604800s (7 days) |
| deadLetterPolicy.maxDeliveryAttempts | 5 |
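The expected values above can be asserted against the describe output directly. This is a sketch, assuming bash and that gcloud renders the 7-day retention as `604800s`; `yaml_value` and `check_subscription` are hypothetical helpers, and the key lookup is a plain text match, not a YAML parser.

```shell
# yaml_value KEY: reads YAML on stdin and prints the value of the first
# "KEY: value" line, ignoring indentation.
yaml_value() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n1
}

# check_subscription: reads `gcloud pubsub subscriptions describe` YAML on
# stdin and prints one line per mismatch against the expected values.
check_subscription() {
  local yaml key want got status=0
  yaml=$(cat)
  while read -r key want; do
    got=$(yaml_value "$key" <<<"$yaml")
    if [ "$got" != "$want" ]; then
      echo "$key: expected $want, got ${got:-<unset>}"
      status=1
    fi
  done <<'EOF'
ackDeadlineSeconds 600
messageRetentionDuration 604800s
maxDeliveryAttempts 5
EOF
  return "$status"
}
```

Pipe the describe command for gpu-job-request-sub into `check_subscription`. No output and exit status 0 mean the subscription matches the table.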
5. Pub/Sub Validation
Send a synthetic test message and verify end-to-end flow:
# Publish a no-op probe message
gcloud pubsub topics publish gpu-job-request \
--project=impulse-gpu-runtime \
--message='{"job_id":"deploy-probe","type":"noop","session_id":"probe-001"}'
# Watch scheduler logs for probe pickup (within 30 s)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow | grep "deploy-probe"
Verify the corresponding result message appears on the result topic:
gcloud pubsub subscriptions pull gpu-job-result-sub \
--project=impulse-gpu-runtime \
--auto-ack --limit=10 | grep "deploy-probe"
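A fixed job_id such as deploy-probe can match log lines or result messages left over from an earlier deployment. One option is to give each probe a unique ID. The sketch below reuses the field names from the example message; `probe_payload` is a hypothetical helper, and it assumes the scheduler accepts an arbitrary job_id string.

```shell
# probe_payload JOB_ID: prints the no-op probe message with the given job_id.
# Field names are copied from the example message in this section.
probe_payload() {
  printf '{"job_id":"%s","type":"noop","session_id":"probe-001"}\n' "$1"
}
```

For example, set `PROBE_ID="deploy-probe-$(date +%s)"`, publish with `--message="$(probe_payload "$PROBE_ID")"`, and grep for `"$PROBE_ID"` in the log and pull commands. Note that `--auto-ack` acknowledges every message it pulls, so any real results returned alongside the probe are consumed as well.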
6. Vertex AI Validation
# List recent custom jobs to confirm connectivity
gcloud ai custom-jobs list \
--project=impulse-gpu-runtime \
--region=us-central1 \
--filter="displayName:gpu-runtime-*" \
--limit=5
# Verify quota availability
gcloud compute regions describe us-central1 \
--project=impulse-gpu-runtime \
--format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS\|NVIDIA_A100"
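To turn the quota listing into a single number, the remaining headroom (limit minus usage) can be computed per metric. A minimal sketch, assuming each quota entry in the YAML output starts with "- " and carries limit, metric, and usage keys on separate lines; `quota_headroom` is a hypothetical helper. Vertex AI custom jobs may also be subject to separate Vertex AI quotas, which this does not cover.

```shell
# quota_headroom METRIC: reads `--format="yaml(quotas)"` output on stdin and
# prints limit minus usage for METRIC. Returns 1 if METRIC is not present.
quota_headroom() {
  awk -v want="$1" '
    # Print the headroom of the entry collected so far if it is the one wanted.
    function emit() { if (metric == want) { print limit - usage; found = 1 } }
    /^[ \t]*- / { emit(); limit = 0; usage = 0; metric = "" }
    { sub(/^[ \t]*-?[ \t]*/, "") }
    $1 == "limit:"  { limit = $2 }
    $1 == "metric:" { metric = $2 }
    $1 == "usage:"  { usage = $2 }
    END { emit(); exit !found }
  '
}
```

Pipe the regions describe command (without the grep) into `quota_headroom NVIDIA_T4_GPUS`. A result of 0 means no further T4 GPUs can be allocated in that region.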
7. Rollback Procedures
Initiate rollback immediately if health checks fail after deployment. Do not leave a failed deployment in place for more than 5 minutes before rolling back.
Kubernetes rollback
kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
If infrastructure changes need to be reverted:
# Identify the previous state version
gsutil ls -l gs://impulse-tfstate/gpu-runtime/ | sort -k2 | tail -5
# Restore previous state (replace TIMESTAMP with previous version)
gsutil cp gs://impulse-tfstate/gpu-runtime/terraform.tfstate.TIMESTAMP \
gs://impulse-tfstate/gpu-runtime/terraform.tfstate
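# Re-apply from the restored state. Run this from the commit that matches the restored
# state; applying the newer configuration would re-create the changes being reverted.
terraform apply -auto-approve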
Emergency: disable job intake
If a deployment causes runaway job submission, stop push delivery on the request subscription:
gcloud pubsub subscriptions modify-push-config gpu-job-request-sub \
--project=impulse-gpu-runtime \
--push-endpoint="" # clears push config; reverts to pull mode
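Clearing the push config only affects push delivery. If the scheduler consumes gpu-job-request-sub by pulling (an assumption; this runbook does not state the delivery mode), intake can instead be stopped by scaling the scheduler to zero replicas. A sketch, with `stop_intake` and the `DRY_RUN` variable as hypothetical names:

```shell
# stop_intake: scales the scheduler deployment to zero so that no pod pulls
# from the subscription. With DRY_RUN=1 the command is printed, not run.
stop_intake() {
  local cmd=(kubectl scale deployment/gpu-scheduler --replicas=0 -n gpu-runtime)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Running `DRY_RUN=1 stop_intake` shows the command without touching the cluster. Unacknowledged messages stay on the subscription for its retention period, so intake resumes once the deployment is scaled back up.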
Post-Deployment Checklist