

This runbook covers the full deployment lifecycle for the GPU Runtime stack, including Terraform infrastructure, scheduler services, supporting queues, and validation checkpoints.
Always deploy from the development branch during a scheduled maintenance window, and coordinate with on-call before promoting to production.

Prerequisites

  • Terraform ≥ 1.6 installed, with application-default credentials configured (gcloud auth application-default login)
  • Access to the impulse-gpu-runtime GCP project with roles/editor or higher
  • kubectl configured for the target cluster (gke_impulse-gpu-runtime_<region>_gpu-scheduler)
  • Pub/Sub and Vertex AI APIs enabled in the target project
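The prerequisites above can be checked with a quick preflight script before starting the rollout. A minimal sketch, assuming bash with terraform, gcloud, and kubectl on PATH; the version_ge helper and the context grep are illustrative, not official tooling:

```shell
#!/usr/bin/env bash
# Preflight sketch: verify Terraform version, gcloud auth, and kubectl context.
set -euo pipefail

# True if version $1 >= version $2, using version-aware sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

preflight() {
  local tf_version
  tf_version=$(terraform version -json | sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p')
  version_ge "$tf_version" "1.6.0" \
    || { echo "Terraform >= 1.6 required, found $tf_version" >&2; return 1; }

  gcloud auth application-default print-access-token >/dev/null 2>&1 \
    || { echo "Not authenticated: run 'gcloud auth application-default login'" >&2; return 1; }

  kubectl config current-context 2>/dev/null | grep -q 'gpu-scheduler' \
    || echo "WARNING: current kubectl context does not look like the gpu-scheduler cluster" >&2
}

# Uncomment to run all checks:
# preflight
```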

1. Terraform Rollout Steps

Step 1: Pull the latest Terraform state

```bash
cd infra/terraform/gpu-runtime
terraform init -reconfigure
terraform workspace select production   # or staging
```
Step 2: Plan and review changes

```bash
terraform plan -out=gpu-runtime.tfplan
```

Review the plan output carefully. Expected resources include:

  • google_pubsub_topic — job-request and job-result topics
  • google_pubsub_subscription — scheduler subscriptions and DLQ subscriptions
  • google_vertex_ai_* — Vertex AI custom job configurations
  • google_service_account — runtime and scheduler service accounts
Step 3: Apply the plan

```bash
terraform apply gpu-runtime.tfplan
```

Monitor the output for errors. A successful apply ends with:

```
Apply complete! Resources: N added, M changed, 0 destroyed.
```
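When the apply runs unattended (for example in CI), the summary line can be checked mechanically instead of eyeballed. A sketch, assuming the apply output is captured to a file first; check_apply is a hypothetical helper, not part of Terraform:

```shell
#!/usr/bin/env bash
# Sketch: parse "Apply complete! Resources: N added, M changed, K destroyed."
# from stdin and fail when K != 0 or when no summary line is found.
set -euo pipefail

check_apply() {
  local summary destroyed
  summary=$(grep -o 'Resources: [0-9]* added, [0-9]* changed, [0-9]* destroyed' || true)
  [ -n "$summary" ] || { echo "no apply summary found" >&2; return 1; }
  destroyed=$(echo "$summary" | grep -o '[0-9]* destroyed' | cut -d' ' -f1)
  [ "$destroyed" -eq 0 ] || { echo "unexpected destroys: $destroyed" >&2; return 1; }
  echo "apply ok: $summary"
}

# Usage:
# terraform apply gpu-runtime.tfplan | tee apply.log
# check_apply < apply.log
```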
Step 4: Verify the updated remote state

Terraform writes its remote state to GCS automatically (gs://impulse-tfstate/gpu-runtime). Confirm the state file was updated by the apply:

```bash
gsutil stat gs://impulse-tfstate/gpu-runtime/terraform.tfstate
```

2. Scheduler Deployment Steps

Step 1: Build and push the scheduler image

```bash
cd services/gpu-scheduler
# GIT_SHA identifies the commit being deployed, e.g. export GIT_SHA=$(git rev-parse --short HEAD)
docker build -t gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA .
docker push gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA
```
Step 2: Update the Kubernetes deployment

```bash
kubectl set image deployment/gpu-scheduler \
  gpu-scheduler=gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA \
  -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```
Step 3: Verify pod health

```bash
kubectl get pods -n gpu-runtime -l app=gpu-scheduler
kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=50
```

All pods should reach the Running state within 2 minutes. Probe the health endpoint:

```bash
kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
  curl -s http://localhost:8080/healthz
```
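Rather than probing once, a small retry loop gives the pods the full 2-minute budget to become healthy. A sketch; the retry helper itself is illustrative, while the kubectl/curl command in the usage comment is the probe from this step:

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempt budget runs out.
set -euo pipefail

# retry <attempts> <sleep_seconds> <command...>
retry() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  echo "giving up after $attempts attempts: $*" >&2
  return 1
}

# 24 attempts x 5 s covers the 2-minute budget from the runbook:
# retry 24 5 kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
#   curl -sf http://localhost:8080/healthz
```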

3. Service Account Validation

After Terraform apply, confirm service accounts have the required roles:
```bash
# Scheduler SA — must have Pub/Sub subscriber + Vertex AI user
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Runtime SA — must have GCS object admin + Artifact Registry reader
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```
Expected roles for the gpu-scheduler SA:

| Role | Purpose |
| --- | --- |
| roles/pubsub.subscriber | Consume job-request messages |
| roles/pubsub.publisher | Publish job-result messages |
| roles/aiplatform.user | Submit Vertex AI custom jobs |
| roles/logging.logWriter | Emit structured logs |

Expected roles for the gpu-runtime SA:

| Role | Purpose |
| --- | --- |
| roles/storage.objectAdmin | Write training artifacts to GCS |
| roles/artifactregistry.reader | Pull runtime container images |
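Comparing the gcloud output against these tables by eye is error-prone, so the diff can be scripted. A sketch; missing_roles is a hypothetical helper, and the usage comment swaps table() for value() formatting to get one role per line:

```shell
#!/usr/bin/env bash
# Sketch: report expected IAM roles missing from a service account's granted set.
set -euo pipefail

# missing_roles <granted-roles-newline-list> <expected-role>...
missing_roles() {
  local granted=$1; shift
  local role missing=0
  for role in "$@"; do
    if ! grep -qxF "$role" <<<"$granted"; then
      echo "MISSING: $role"
      missing=1
    fi
  done
  return "$missing"
}

# Usage against the live project:
# granted=$(gcloud projects get-iam-policy impulse-gpu-runtime \
#   --flatten="bindings[].members" \
#   --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
#   --format="value(bindings.role)")
# missing_roles "$granted" roles/pubsub.subscriber roles/pubsub.publisher \
#   roles/aiplatform.user roles/logging.logWriter
```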

4. Queue and Topic Validation

```bash
# List all GPU-runtime Pub/Sub topics
gcloud pubsub topics list --project=impulse-gpu-runtime \
  --filter="name:gpu-"

# Confirm DLQ subscriptions exist
gcloud pubsub subscriptions list --project=impulse-gpu-runtime \
  --filter="name:gpu-*-dlq"

# Verify subscription ack deadline and retention
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)"
```
Expected subscription configuration:

| Parameter | Expected Value |
| --- | --- |
| ackDeadlineSeconds | 600 |
| messageRetentionDuration | 7 days |
| deadLetterPolicy.maxDeliveryAttempts | 5 |
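The describe output can be asserted mechanically as well. A sketch, assuming the gcloud YAML field names shown above; note that 7 days appears as 604800s in the API's duration format, and yaml_value/check_subscription are illustrative helpers, not official tooling:

```shell
#!/usr/bin/env bash
# Sketch: assert settings in `gcloud pubsub subscriptions describe` YAML output.
set -euo pipefail

# yaml_value <key> : extract the value of "key: value" from stdin (any indent).
yaml_value() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n1
}

check_subscription() {
  local yaml=$1 ok=0
  [ "$(yaml_value ackDeadlineSeconds <<<"$yaml")" = "600" ] \
    || { echo "bad ackDeadlineSeconds"; ok=1; }
  [ "$(yaml_value messageRetentionDuration <<<"$yaml")" = "604800s" ] \
    || { echo "bad messageRetentionDuration"; ok=1; }
  [ "$(yaml_value maxDeliveryAttempts <<<"$yaml")" = "5" ] \
    || { echo "bad maxDeliveryAttempts"; ok=1; }
  return "$ok"
}

# Usage:
# check_subscription "$(gcloud pubsub subscriptions describe gpu-job-request-sub \
#   --project=impulse-gpu-runtime \
#   --format='yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)')"
```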

5. Pub/Sub Validation

Send a synthetic test message and verify end-to-end flow:
```bash
# Publish a no-op probe message
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message='{"job_id":"deploy-probe","type":"noop","session_id":"probe-001"}'

# Watch scheduler logs for probe pickup (within 30 s)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow | grep "deploy-probe"
```

Verify the corresponding result message appears on the result topic. Note that --auto-ack acknowledges every message it pulls, including real job results, so use it with care on a busy production subscription:

```bash
gcloud pubsub subscriptions pull gpu-job-result-sub \
  --project=impulse-gpu-runtime \
  --auto-ack --limit=10 | grep "deploy-probe"
```

6. Vertex AI Validation

```bash
# List recent custom jobs to confirm connectivity
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="displayName:gpu-runtime-*" \
  --limit=5

# Verify quota availability
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | grep -A2 "NVIDIA_T4_GPUS\|NVIDIA_A100"
```
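To turn the quota grep into an actual headroom number, the limit/metric/usage triples in the YAML list can be parsed. A rough sketch with awk, assuming the alphabetical key order gcloud emits (limit before metric before usage); gpu_headroom is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: compute available GPU quota (limit - usage) from regions describe YAML.
set -euo pipefail

# gpu_headroom <metric> : reads `--format="yaml(quotas)"` output on stdin.
gpu_headroom() {
  awk -v metric="$1" '
    /limit:/  { limit = $NF }           # entry limit, seen before metric
    /metric:/ { name = $NF }            # quota metric name
    /usage:/  { if (name == metric) printf "%s available: %g\n", metric, limit - $NF }
  '
}

# Usage:
# gcloud compute regions describe us-central1 \
#   --project=impulse-gpu-runtime --format="yaml(quotas)" \
#   | gpu_headroom NVIDIA_T4_GPUS
```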

7. Rollback Procedures

Initiate rollback immediately if health checks fail after deployment; do not wait more than 5 minutes before starting it.

Kubernetes rollback

```bash
kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

Terraform rollback

If infrastructure changes need to be reverted:

```bash
# Identify the previous state version
gsutil ls -l gs://impulse-tfstate/gpu-runtime/ | sort -k2 | tail -5

# Restore previous state (replace TIMESTAMP with previous version)
gsutil cp gs://impulse-tfstate/gpu-runtime/terraform.tfstate.TIMESTAMP \
  gs://impulse-tfstate/gpu-runtime/terraform.tfstate

# Re-apply from the restored state. Check out the commit that matches the
# restored state first, or the apply will re-create the changes being reverted.
terraform apply -auto-approve
```
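Before overwriting the live state object, copy it aside so the rollback itself can be undone. A sketch; the bucket path comes from this runbook, while the pre-rollback backup naming scheme is an assumption:

```shell
#!/usr/bin/env bash
# Sketch: copy the live state aside with a UTC timestamp before restoring.
set -euo pipefail

STATE_DIR="gs://impulse-tfstate/gpu-runtime"

# e.g. terraform.tfstate.pre-rollback-20240101T120000Z
backup_name() {
  echo "terraform.tfstate.pre-rollback-$(date -u +%Y%m%dT%H%M%SZ)"
}

# gsutil cp "$STATE_DIR/terraform.tfstate" "$STATE_DIR/$(backup_name)"
echo "would back up live state to: $STATE_DIR/$(backup_name)"
```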

Emergency: disable job intake

If a deployment causes runaway job submission, pause the subscription:
```bash
gcloud pubsub subscriptions modify-push-config gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --push-endpoint=""   # clears push config; reverts to pull mode
```

Post-Deployment Checklist

  • All gpu-scheduler pods are Running
  • Health endpoint returns {"status":"ok"}
  • Probe message processed successfully end-to-end
  • No error-rate increase in Cloud Monitoring dashboards
  • Service account roles match the expected table above
  • DLQ subscriptions have zero undelivered messages
  • On-call notified of deployment completion