This runbook covers the full deployment lifecycle for the GPU Runtime stack, including Terraform infrastructure, scheduler services, supporting queues, and validation checkpoints.
Always perform deployments from the development branch during a maintenance window, and coordinate with on-call before proceeding to production.
Prerequisites

- Terraform ≥ 1.6 installed and authenticated (`gcloud auth application-default login`)
- Access to the `impulse-gpu-runtime` GCP project with `roles/editor` or higher
- `kubectl` configured for the target cluster (`gke_impulse-gpu-runtime_<region>_gpu-scheduler`)
- Pub/Sub and Vertex AI APIs enabled in the target project
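The prerequisites above can be checked mechanically before starting. A minimal preflight sketch, assuming a GNU userland (`sort -V`); the `version_ge` helper is an illustrative name, not a standard tool:

```shell
#!/usr/bin/env bash
set -u

# version_ge VERSION FLOOR: succeeds when VERSION >= FLOOR (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report any missing tools instead of failing on the first one.
missing=""
for tool in terraform gcloud kubectl gsutil; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
[ -z "$missing" ] || echo "missing tools:$missing" >&2

# Enforce the Terraform >= 1.6 floor when terraform is installed.
if command -v terraform >/dev/null 2>&1; then
  tf_version=$(terraform version -json | sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p')
  if version_ge "$tf_version" "1.6.0"; then
    echo "terraform $tf_version OK"
  else
    echo "terraform $tf_version is below the 1.6 floor" >&2
  fi
fi
```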
1. Terraform Infrastructure Steps

Pull the latest Terraform state

```shell
cd infra/terraform/gpu-runtime
terraform init -reconfigure
terraform workspace select production  # or staging
```
Plan and review changes

```shell
terraform plan -out=gpu-runtime.tfplan
```
Review the plan output carefully. Expected resources include:

- `google_pubsub_topic` — job-request and job-result topics
- `google_pubsub_subscription` — scheduler subscriptions and DLQ subscriptions
- `google_vertex_ai_*` — Vertex AI custom job configurations
- `google_service_account` — runtime and scheduler service accounts
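Plan review can also be partially automated. A sketch that parses the summary line Terraform prints at the end of a plan and flags any destroys for extra scrutiny; the helper names are illustrative:

```shell
#!/usr/bin/env bash
set -u

# plan_counts "Plan: X to add, Y to change, Z to destroy." -> "X Y Z"
plan_counts() {
  echo "$1" | sed -n 's/.*Plan: \([0-9]*\) to add, \([0-9]*\) to change, \([0-9]*\) to destroy.*/\1 \2 \3/p'
}

# destroys_anything SUMMARY: succeeds when the plan destroys any resources.
destroys_anything() {
  set -- $(plan_counts "$1")
  [ "${3:-0}" -gt 0 ]
}

# Only run against a real plan file when terraform is available.
if command -v terraform >/dev/null 2>&1 && [ -f gpu-runtime.tfplan ]; then
  summary=$(terraform show -no-color gpu-runtime.tfplan | grep '^Plan:' || true)
  if destroys_anything "$summary"; then
    echo "WARNING: plan destroys resources; review before applying" >&2
  fi
fi
```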
Apply the plan

```shell
terraform apply gpu-runtime.tfplan
```

Monitor the output for errors. A successful apply ends with:

```
Apply complete! Resources: N added, M changed, 0 destroyed.
```
Verify the updated state

Terraform remote state is stored in GCS (`gs://impulse-tfstate/gpu-runtime`). Verify the state file was updated:

```shell
gsutil stat gs://impulse-tfstate/gpu-runtime/terraform.tfstate
```
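The state document also carries a monotonically increasing `serial` field, which gives a stronger signal than the object timestamp. A sketch that pulls the current state and extracts the serial without requiring `jq`:

```shell
#!/usr/bin/env bash
set -u

# state_serial JSON: extracts the top-level "serial" field from a
# Terraform state document.
state_serial() {
  echo "$1" | sed -n 's/.*"serial": *\([0-9]*\).*/\1/p' | head -n1
}

# `terraform state pull` streams the current remote state to stdout.
if command -v terraform >/dev/null 2>&1; then
  current=$(terraform state pull 2>/dev/null || true)
  [ -n "$current" ] && echo "state serial: $(state_serial "$current")"
fi
```

Compare the serial before and after the apply; it must have increased.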
2. Scheduler Deployment Steps
Build and push the scheduler image

```shell
cd services/gpu-scheduler
docker build -t gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA .
docker push gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA
```
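The commands above assume `GIT_SHA` is already exported. A sketch that derives it from the current checkout when unset; the `image_ref` helper is an illustrative name:

```shell
#!/usr/bin/env bash
set -u

# image_ref REGISTRY NAME TAG: composes a fully qualified image reference.
image_ref() {
  echo "$1/$2:$3"
}

# Derive GIT_SHA from the current checkout when it is not already set.
if [ -z "${GIT_SHA:-}" ] && git rev-parse --short=12 HEAD >/dev/null 2>&1; then
  GIT_SHA=$(git rev-parse --short=12 HEAD)
fi

if [ -n "${GIT_SHA:-}" ]; then
  echo "building $(image_ref gcr.io/impulse-gpu-runtime gpu-scheduler "$GIT_SHA")"
fi
```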
Update the Kubernetes deployment

```shell
kubectl set image deployment/gpu-scheduler \
  gpu-scheduler=gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA \
  -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```
Verify pod health

```shell
kubectl get pods -n gpu-runtime -l app=gpu-scheduler
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=50
```

All pods should reach Running state within 2 minutes. Probe endpoints:

```shell
kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
  curl -s http://localhost:8080/healthz
```
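The 2-minute expectation can be enforced with a small retry loop rather than eyeballing it; `wait_for` is an illustrative helper, and the 24 × 5 s budget is an assumption matching the 2-minute window:

```shell
#!/usr/bin/env bash
set -u

# wait_for ATTEMPTS DELAY CMD...: retries CMD until it succeeds or
# ATTEMPTS are exhausted; returns non-zero on exhaustion.
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
  done
}

# Only probe when the deployment is actually reachable from here.
if command -v kubectl >/dev/null 2>&1 \
   && kubectl get deployment/gpu-scheduler -n gpu-runtime >/dev/null 2>&1; then
  if wait_for 24 5 kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
       curl -sf http://localhost:8080/healthz >/dev/null; then
    echo "scheduler healthy"
  else
    echo "scheduler failed health check within 2 minutes" >&2
  fi
fi
```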
3. Service Account Validation
After Terraform apply, confirm service accounts have the required roles:

```shell
# Scheduler SA — must have Pub/Sub subscriber + Vertex AI user
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Runtime SA — must have GCS object admin + Artifact Registry reader
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```
Expected roles for the gpu-scheduler SA:

| Role | Purpose |
|---|---|
| `roles/pubsub.subscriber` | Consume job-request messages |
| `roles/pubsub.publisher` | Publish job-result messages |
| `roles/aiplatform.user` | Submit Vertex AI custom jobs |
| `roles/logging.logWriter` | Emit structured logs |
Expected roles for the gpu-runtime SA:

| Role | Purpose |
|---|---|
| `roles/storage.objectAdmin` | Write training artifacts to GCS |
| `roles/artifactregistry.reader` | Pull runtime container images |
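The role tables can be turned into a mechanical check instead of a visual diff. A sketch, assuming the scheduler SA list above; `missing_roles` is an illustrative helper comparing expected roles against what `gcloud` reports:

```shell
#!/usr/bin/env bash
set -u

# missing_roles EXPECTED ACTUAL: both are whitespace-separated role lists;
# prints each expected role that is absent from the actual list.
missing_roles() {
  for role in $1; do
    case " $2 " in
      *" $role "*) ;;          # role present
      *) echo "$role" ;;       # role missing
    esac
  done
}

expected_scheduler="roles/pubsub.subscriber roles/pubsub.publisher roles/aiplatform.user roles/logging.logWriter"

if command -v gcloud >/dev/null 2>&1; then
  actual=$(gcloud projects get-iam-policy impulse-gpu-runtime \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
    --format="value(bindings.role)" | tr '\n' ' ')
  missing=$(missing_roles "$expected_scheduler" "$actual")
  [ -z "$missing" ] && echo "scheduler SA roles OK" || echo "missing:$missing" >&2
fi
```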
4. Queue and Topic Validation
```shell
# List all GPU-runtime Pub/Sub topics
gcloud pubsub topics list --project=impulse-gpu-runtime \
  --filter="name:gpu-"

# Confirm DLQ subscriptions exist
gcloud pubsub subscriptions list --project=impulse-gpu-runtime \
  --filter="name:gpu-*-dlq"

# Verify subscription ack deadline and retention
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)"
```
Expected subscription configuration:

| Parameter | Expected Value |
|---|---|
| `ackDeadlineSeconds` | 600 |
| `messageRetentionDuration` | 7 days (`604800s`) |
| `deadLetterPolicy.maxDeliveryAttempts` | 5 |
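Comparing the describe output against this table can also be scripted. A sketch; `expect_field` is an illustrative helper that matches a key/value line in gcloud's flat YAML output:

```shell
#!/usr/bin/env bash
set -u

# expect_field YAML KEY VALUE: succeeds when the YAML text contains a
# "KEY: VALUE" line (sufficient for gcloud's flat describe output).
expect_field() {
  echo "$1" | grep -q "^ *$2: *$3\$"
}

if command -v gcloud >/dev/null 2>&1; then
  cfg=$(gcloud pubsub subscriptions describe gpu-job-request-sub \
    --project=impulse-gpu-runtime \
    --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)" 2>/dev/null || true)
  expect_field "$cfg" ackDeadlineSeconds 600 || echo "ackDeadlineSeconds mismatch" >&2
  expect_field "$cfg" messageRetentionDuration 604800s || echo "retention mismatch" >&2
fi
```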
5. Pub/Sub Validation
Send a synthetic test message and verify end-to-end flow:

```shell
# Publish a no-op probe message
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message='{"job_id":"deploy-probe","type":"noop","session_id":"probe-001"}'

# Watch scheduler logs for probe pickup (within 30 s)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow | grep "deploy-probe"
```

Verify the corresponding result message appears on the result topic:

```shell
gcloud pubsub subscriptions pull gpu-job-result-sub \
  --project=impulse-gpu-runtime \
  --auto-ack --limit=10 | grep "deploy-probe"
```
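Reusing a fixed `job_id` can match stale probe messages from an earlier run. A sketch that generates a unique probe id and polls the result subscription with a bounded timeout; the 12 × 5 s loop bounds are assumptions:

```shell
#!/usr/bin/env bash
set -u

# probe_id: unique per run, so result matching never hits an old probe.
probe_id() {
  echo "deploy-probe-$(date +%s)"
}

id=$(probe_id)

if command -v gcloud >/dev/null 2>&1 \
   && gcloud pubsub topics publish gpu-job-request \
        --project=impulse-gpu-runtime \
        --message="{\"job_id\":\"$id\",\"type\":\"noop\",\"session_id\":\"probe-001\"}"; then
  # Poll the result subscription for up to ~60 s.
  seen=0
  for _ in $(seq 1 12); do
    if gcloud pubsub subscriptions pull gpu-job-result-sub \
        --project=impulse-gpu-runtime \
        --auto-ack --limit=10 | grep -q "$id"; then
      seen=1; break
    fi
    sleep 5
  done
  [ "$seen" -eq 1 ] && echo "probe $id round-tripped" \
    || echo "probe $id not seen within 60 s" >&2
fi
```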
6. Vertex AI Validation
```shell
# List recent custom jobs to confirm connectivity
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="displayName:gpu-runtime-*" \
  --limit=5

# Verify quota availability
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | grep -A2 "NVIDIA_T4_GPUS\|NVIDIA_A100"
```
7. Rollback Procedures
Initiate rollback immediately if post-deployment health checks fail; do not wait more than 5 minutes after a failed deployment before rolling back.
Kubernetes rollback

```shell
kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```
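After the undo completes, confirm the deployment is actually running the previous tag rather than trusting the rollout status alone. A sketch; `image_tag` is an illustrative helper extracting the tag from the running image reference:

```shell
#!/usr/bin/env bash
set -u

# image_tag IMAGE: prints the tag portion of a fully qualified image ref.
image_tag() {
  echo "${1##*:}"
}

if command -v kubectl >/dev/null 2>&1 \
   && kubectl get deployment/gpu-scheduler -n gpu-runtime >/dev/null 2>&1; then
  current=$(kubectl get deployment/gpu-scheduler -n gpu-runtime \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  echo "now running tag: $(image_tag "$current")"
fi
```

Compare the printed tag against the `$GIT_SHA` you deployed; after a rollback it should be the prior release's tag.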
If infrastructure changes need to be reverted:

```shell
# Identify the previous state version
gsutil ls -l gs://impulse-tfstate/gpu-runtime/ | sort -k2 | tail -5

# Restore previous state (replace TIMESTAMP with previous version)
gsutil cp gs://impulse-tfstate/gpu-runtime/terraform.tfstate.TIMESTAMP \
  gs://impulse-tfstate/gpu-runtime/terraform.tfstate

# Re-apply from the restored state
terraform apply -auto-approve
```
Emergency: disable job intake

If a deployment causes runaway job submission, pause the subscription:

```shell
gcloud pubsub subscriptions modify-push-config gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --push-endpoint=""  # clears push config; reverts to pull mode
```
Post-Deployment Checklist