# GPU Runtime Deployment Runbook

> Step-by-step procedures for deploying and rolling back the GPU Sandbox execution runtime

This runbook covers the full deployment lifecycle for the GPU Runtime stack, including Terraform infrastructure, scheduler services, supporting queues, and validation checkpoints.

<Warning>
  Always perform deployments from the `development` branch in a maintenance window. Coordinate with on-call before proceeding to production.
</Warning>

## Prerequisites

* Terraform ≥ 1.6 installed, with Google Cloud application-default credentials configured (`gcloud auth application-default login`)
* Access to the `impulse-gpu-runtime` GCP project with `roles/editor` or higher
* `kubectl` configured for the target cluster (`gke_impulse-gpu-runtime_<region>_gpu-scheduler`)
* Pub/Sub and Vertex AI APIs enabled in the target project
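
The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the official tooling: the helper names `version_ge` and `missing_tools` are hypothetical, and `version_ge` relies on the version sort (`sort -V`) from GNU coreutils.

```bash
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints the name of every listed CLI that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage (not run automatically):
#   missing_tools terraform gcloud kubectl docker
#   tf_version="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
#   version_ge "$tf_version" 1.6 || echo "Terraform >= 1.6 required" >&2
```

An empty result from `missing_tools` means every listed CLI was found.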

***

## 1. Terraform Rollout Steps

<Steps>
  <Step title="Pull latest Terraform state">
    ```bash theme={null}
    cd infra/terraform/gpu-runtime
    terraform init -reconfigure
    terraform workspace select production   # or staging
    ```
  </Step>

  <Step title="Plan and review changes">
    ```bash theme={null}
    terraform plan -out=gpu-runtime.tfplan
    ```

    Review the plan output carefully. Expected resources include:

    * `google_pubsub_topic` — job-request and job-result topics
    * `google_pubsub_subscription` — scheduler subscriptions and DLQ subscriptions
    * `google_vertex_ai_*` — Vertex AI custom job configurations
    * `google_service_account` — runtime and scheduler service accounts
  </Step>

  <Step title="Apply the plan">
    ```bash theme={null}
    terraform apply gpu-runtime.tfplan
    ```

    Monitor the output for errors. A successful apply ends with:

    ```
    Apply complete! Resources: N added, M changed, 0 destroyed.
    ```
  </Step>

  <Step title="Verify the updated state">
    Terraform remote state is stored in GCS (`gs://impulse-tfstate/gpu-runtime`), so there is nothing to commit. Verify that the state object was updated by the apply:

    ```bash theme={null}
    gsutil stat gs://impulse-tfstate/gpu-runtime/terraform.tfstate
    ```
  </Step>
</Steps>
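
The sample apply output in this section reports `0 destroyed`, which suggests that routine rollouts are not expected to remove resources. Assuming that holds, part of the plan review can be automated by reading the destroy count from the plan's summary line. The helper name `plan_destroy_count` is hypothetical, and the sketch only parses the human-readable `Plan:` line that Terraform prints.

```bash
# Prints the number of resources a plan would destroy, read from plan text on stdin.
# Prints 0 when there is no "Plan:" summary line (for example "No changes.").
plan_destroy_count() {
  local n
  n="$(sed -n 's/^Plan:.* \([0-9][0-9]*\) to destroy\..*$/\1/p' | head -n 1)"
  printf '%s\n' "${n:-0}"
}

# Example usage (not run automatically):
#   destroys="$(terraform show -no-color gpu-runtime.tfplan | plan_destroy_count)"
#   [ "$destroys" -eq 0 ] || echo "Plan destroys $destroys resource(s); review before applying" >&2
```

A non-zero count is not necessarily wrong, but it is a reasonable point to stop and confirm with on-call before applying.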

***

## 2. Scheduler Deployment Steps

<Steps>
  <Step title="Build and push the scheduler image">
    ```bash theme={null}
    cd services/gpu-scheduler
    docker build -t gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA .
    docker push gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA
    ```
  </Step>

  <Step title="Update the Kubernetes deployment">
    ```bash theme={null}
    kubectl set image deployment/gpu-scheduler \
      gpu-scheduler=gcr.io/impulse-gpu-runtime/gpu-scheduler:$GIT_SHA \
      -n gpu-runtime
    kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
    ```
  </Step>

  <Step title="Verify pod health">
    ```bash theme={null}
    kubectl get pods -n gpu-runtime -l app=gpu-scheduler
    kubectl logs -n gpu-runtime -l app=gpu-scheduler --tail=50
    ```

    All pods should reach the `Running` state within 2 minutes. Then probe the health endpoint, which should return `{"status":"ok"}`:

    ```bash theme={null}
    kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
      curl -s http://localhost:8080/healthz
    ```
  </Step>
</Steps>
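
The 2-minute health window can be enforced with a small retry loop rather than by watching the clock. This is a sketch under assumptions: `wait_until` and `scheduler_healthy` are hypothetical helper names, and the exact `{"status":"ok"}` match assumes the endpoint returns that body without extra whitespace.

```bash
# Retries a command until it succeeds or the timeout (in seconds) elapses.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -gt "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds when the scheduler health endpoint reports ok.
scheduler_healthy() {
  kubectl exec -n gpu-runtime deploy/gpu-scheduler -- \
    curl -fsS http://localhost:8080/healthz | grep -q '"status":"ok"'
}

# Example usage (not run automatically):
#   wait_until 120 5 scheduler_healthy || echo "Health check failed; start rollback" >&2
```

Use an interval of at least 1 second, because the loop measures elapsed time by summing intervals.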

***

## 3. Service Account Validation

After Terraform apply, confirm service accounts have the required roles:

```bash theme={null}
# Scheduler SA: must have Pub/Sub subscriber and publisher, Vertex AI user, and log writer
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-scheduler@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Runtime SA — must have GCS object admin + Artifact Registry reader
gcloud projects get-iam-policy impulse-gpu-runtime \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gpu-runtime@impulse-gpu-runtime.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```

Expected roles for `gpu-scheduler` SA:

| Role                      | Purpose                      |
| ------------------------- | ---------------------------- |
| `roles/pubsub.subscriber` | Consume job-request messages |
| `roles/pubsub.publisher`  | Publish job-result messages  |
| `roles/aiplatform.user`   | Submit Vertex AI custom jobs |
| `roles/logging.logWriter` | Emit structured logs         |

Expected roles for `gpu-runtime` SA:

| Role                            | Purpose                         |
| ------------------------------- | ------------------------------- |
| `roles/storage.objectAdmin`     | Write training artifacts to GCS |
| `roles/artifactregistry.reader` | Pull runtime container images   |

***

## 4. Queue and Topic Validation

```bash theme={null}
# List all GPU-runtime Pub/Sub topics
gcloud pubsub topics list --project=impulse-gpu-runtime \
  --filter="name:gpu-"

# Confirm DLQ subscriptions exist
gcloud pubsub subscriptions list --project=impulse-gpu-runtime \
  --filter='name ~ "gpu-.*-dlq$"'

# Verify subscription ack deadline and retention
gcloud pubsub subscriptions describe gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --format="yaml(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy)"
```

Expected subscription configuration:

| Parameter                              | Expected Value     |
| -------------------------------------- | ------------------ |
| `ackDeadlineSeconds`                   | 600                |
| `messageRetentionDuration`             | `604800s` (7 days) |
| `deadLetterPolicy.maxDeliveryAttempts` | 5                  |
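
The expected values can also be asserted in a script. This sketch assumes the three fields are fetched with `--format="value(...)"`, which prints them tab-separated on one line, and that the API reports the 7-day retention as `604800s`. The helper name `check_subscription_config` is hypothetical.

```bash
# Reads "ackDeadline<TAB>retention<TAB>maxDeliveryAttempts" from stdin and
# prints one line per value that differs from the expected configuration.
# Exits non-zero on any mismatch or when no input is received.
check_subscription_config() {
  awk -F '\t' '
    $1 != "600"     { print "ackDeadlineSeconds: got " $1 ", want 600"; bad = 1 }
    $2 != "604800s" { print "messageRetentionDuration: got " $2 ", want 604800s"; bad = 1 }
    $3 != "5"       { print "maxDeliveryAttempts: got " $3 ", want 5"; bad = 1 }
    END { if (NR == 0) { print "no input received"; bad = 1 }; exit bad + 0 }
  '
}

# Example usage (not run automatically):
#   gcloud pubsub subscriptions describe gpu-job-request-sub \
#     --project=impulse-gpu-runtime \
#     --format="value(ackDeadlineSeconds,messageRetentionDuration,deadLetterPolicy.maxDeliveryAttempts)" |
#     check_subscription_config
```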

***

## 5. Pub/Sub Validation

Send a synthetic test message and verify end-to-end flow:

```bash theme={null}
# Publish a no-op probe message
gcloud pubsub topics publish gpu-job-request \
  --project=impulse-gpu-runtime \
  --message='{"job_id":"deploy-probe","type":"noop","session_id":"probe-001"}'

# Watch scheduler logs for probe pickup (within 30 s)
kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow | grep "deploy-probe"
```

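Verify that the corresponding result message appears on the result topic. Note that `--auto-ack` acknowledges every message it pulls, not only the probe, so run this against a subscription that no live consumer depends on: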

```bash theme={null}
gcloud pubsub subscriptions pull gpu-job-result-sub \
  --project=impulse-gpu-runtime \
  --auto-ack --limit=10 | grep "deploy-probe"
```
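
Because the probe uses a fixed `job_id`, a result left over from an earlier deployment could be mistaken for the current one. One possible refinement is to give each probe a unique suffix. The sketch below assumes the scheduler accepts arbitrary `job_id` and `session_id` values for `noop` jobs, which this runbook does not confirm. The helper name `probe_message` and the suffix scheme are hypothetical.

```bash
# Builds a probe payload whose job_id and session_id carry a caller-supplied suffix.
# Field names match the probe message used in this section.
probe_message() {
  local suffix="$1"
  printf '{"job_id":"deploy-probe-%s","type":"noop","session_id":"probe-%s"}\n' \
    "$suffix" "$suffix"
}

# Example usage (not run automatically):
#   suffix="$(date +%s)"
#   gcloud pubsub topics publish gpu-job-request \
#     --project=impulse-gpu-runtime \
#     --message="$(probe_message "$suffix")"
#   # Stop watching after 30 s or on the first matching log line
#   timeout 30 kubectl logs -n gpu-runtime -l app=gpu-scheduler --follow --since=1m |
#     grep -m 1 "deploy-probe-$suffix"
```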

***

## 6. Vertex AI Validation

```bash theme={null}
# List recent custom jobs to confirm connectivity
gcloud ai custom-jobs list \
  --project=impulse-gpu-runtime \
  --region=us-central1 \
  --filter="displayName:gpu-runtime-*" \
  --limit=5

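# Verify quota availability (each YAML entry lists limit, metric, usage,
# so show one line of context on each side of the metric name)
gcloud compute regions describe us-central1 \
  --project=impulse-gpu-runtime \
  --format="yaml(quotas)" | grep -B1 -A1 "NVIDIA_T4_GPUS\|NVIDIA_A100"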
```
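
The quota output is easier to act on as a single headroom number (limit minus usage). The sketch below parses the YAML that the quota command prints. The helper name `quota_headroom` is hypothetical. Vertex AI custom training may be governed by separate Vertex AI quotas, so treat the result as a check of the Compute Engine regional quota only.

```bash
# Prints limit minus usage for one quota metric.
# Reads `gcloud compute regions describe ... --format="yaml(quotas)"` output on stdin.
# Exits non-zero when the metric is not present.
quota_headroom() {
  awk -v metric="$1" '
    function emit() { if (name == metric) { print limit - usage; found = 1 } }
    # A line starting with "- " opens a new quota entry: report the previous one.
    /^ *- / { emit(); name = ""; limit = 0; usage = 0; sub(/^ *- /, "") }
    $1 == "limit:"  { limit = $2 }
    $1 == "metric:" { name = $2 }
    $1 == "usage:"  { usage = $2 }
    END { emit(); exit !found }
  '
}

# Example usage (not run automatically):
#   gcloud compute regions describe us-central1 \
#     --project=impulse-gpu-runtime \
#     --format="yaml(quotas)" | quota_headroom NVIDIA_T4_GPUS
```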

***

## 7. Rollback Procedures

<Warning>
  Initiate rollback as soon as health checks fail after a deployment, and no later than 5 minutes after the failure is detected.
</Warning>

### Kubernetes rollback

```bash theme={null}
kubectl rollout undo deployment/gpu-scheduler -n gpu-runtime
kubectl rollout status deployment/gpu-scheduler -n gpu-runtime
```

### Terraform rollback

If infrastructure changes need to be reverted:

```bash theme={null}
# Identify the previous state version
gsutil ls -l gs://impulse-tfstate/gpu-runtime/ | sort -k2 | tail -5

# Restore previous state (replace TIMESTAMP with previous version)
gsutil cp gs://impulse-tfstate/gpu-runtime/terraform.tfstate.TIMESTAMP \
  gs://impulse-tfstate/gpu-runtime/terraform.tfstate

# Re-plan against the restored state and review the plan before applying
terraform plan -out=rollback.tfplan
terraform apply rollback.tfplan
```

### Emergency: disable job intake

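If a deployment causes runaway job submission, stop message delivery to the scheduler. Clearing the push endpoint only has an effect when `gpu-job-request-sub` uses push delivery; if the scheduler pulls from the subscription, scale the deployment to zero instead. Unacknowledged messages stay in the subscription until the retention period expires.

```bash theme={null}
gcloud pubsub subscriptions modify-push-config gpu-job-request-sub \
  --project=impulse-gpu-runtime \
  --push-endpoint=""   # push delivery only: clears push config; reverts to pull mode
# Pull delivery: stop the consumer instead
kubectl scale deployment/gpu-scheduler --replicas=0 -n gpu-runtime
```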

***

## Post-Deployment Checklist

* [ ] All `gpu-scheduler` pods are `Running`
* [ ] Health endpoint returns `{"status":"ok"}`
* [ ] Probe message processed successfully end-to-end
* [ ] No error-rate increase in Cloud Monitoring dashboards
* [ ] Service account roles match the expected table above
* [ ] DLQ subscriptions have zero undelivered messages
* [ ] On-call notified of deployment completion
