Google Cloud Guide¶

This document provides complete setup and implementation details for the Google Cloud pipeline (Cloud Batch + Workflows) using Terraform, Docker images, and lightweight scripts.

Table of Contents¶

Quick Reference
Design and Architecture
1) Repository Structure
2) Prerequisites
IAM Permissions
Setting up GitHub Personal Access Token
3) Terraform
4) Workflow YAML
5) Compute Resources and Instance Types
Understanding cpuMilli, memoryMib, and Machine Types
Current Resource Allocation
Tuning Compute Resources
Recommended Configuration: Dedicated VMs (task_count_per_node=1)
6) Docker Image
Multi-stage Build Architecture
Build & Push
Local Execution with Docker Compose
- Running Locally
- Docker Compose Configuration
7) Scripts
Stage A Wrapper: scripts/run_builder.sh
Stage A: scripts/main_builder.py
Stage B: scripts/main_runner.py
Stage C: scripts/main_output.py
8) Monitoring and Resource Groups
Label Structure
Monitoring Dashboards
Custom Filtering
9) epycloud CLI
10) Operational Notes
11) Billing and Cost Tracking
12) Implementation Summary

Quick Reference¶

Here are the essential commands to set up the pipeline. Please read the full documentation before running them.

# 1. Initialize configuration
epycloud config init
epycloud config edit         # Configure Google Cloud settings
epycloud config edit-secrets # Add GitHub PAT
epycloud config show         # Verify configuration

# 2. Create GitHub PAT and store in Secret Manager
PROJECT_ID=$(epycloud config show | grep 'project_id:' | awk '{print $2}')
echo -n "your_github_pat_here" | gcloud secrets create github-pat \
  --data-file=- \
  --project=${PROJECT_ID}

# 3. Initialize and deploy infrastructure
epycloud terraform init
epycloud terraform plan    # Review changes first
epycloud terraform apply

# 4. Build and push Docker image
epycloud build cloud

# 5. Add experiment file to your forecast repository and git push

# 6. Run your first workflow
epycloud run workflow --exp-id initial-test

# 7. Monitor execution
epycloud status
epycloud workflow list --exp-id initial-test

# 8. View logs and details
epycloud workflow describe <execution-id>
epycloud logs --exp-id initial-test

Next steps:
- Continue reading for detailed explanations of each component
- See operations.md for daily operational commands
- See variable-configuration.md for configuration reference

Design and architecture¶

The pipeline consists of the following components:

- Infrastructure-as-code (Terraform) for:
  - Artifact Registry (container repo)
  - GCS buckets/prefixes (builder-artifacts/runner-artifacts, optional logs)
  - Service Accounts & IAM (Workflows runner, Batch runtime)
  - Workflows (deployed from YAML)
- Container image (Dockerfile + requirements) with both Stage A / Stage B entrypoints
- Scripts:
  - run_builder.sh (sets up and runs Stage A)
  - main_builder.py (Stage A: produce N pickled inputs)
  - main_runner.py (Stage B: consume one pickle per task using BATCH_TASK_INDEX)
- Workflow YAML:
  - Orchestrates: Stage A → wait → list GCS → Stage B (taskCount=N) → wait
- epycloud CLI: unified command-line interface for all operations

1) Repository structure¶

epymodelingsuite-cloud/
├─ terraform/
│  ├─ main.tf                    # Google Cloud resources
│  ├─ variables.tf
│  ├─ outputs.tf
│  └─ workflow.yaml              # Workflows orchestration definition
├─ docker/
│  ├─ Dockerfile
│  ├─ pyproject.toml             # Cloud-specific dependencies
│  ├─ uv.lock                    # Locked dependency versions
│  ├─ container-structure-test.yaml  # Container tests
│  └─ scripts/                   # Docker runtime scripts
│     ├─ main_builder.py         # Stage A: Generate N input files
│     ├─ main_runner.py          # Stage B: Process individual tasks
│     ├─ main_output.py          # Stage C: Aggregate results
│     ├─ run_builder.sh          # Stage A wrapper for repo cloning
│     └─ run_output.sh           # Stage C wrapper
├─ src/epycloud/                 # CLI package
├─ cloudbuild.yaml               # Cloud Build configuration
├─ .gitignore
└─ README.md

2) Prerequisites¶

gcloud CLI authenticated to target project: gcloud auth login, gcloud config set project <PROJECT_ID>
Terraform ≥ 1.5
Docker
On macOS, OrbStack is recommended over Docker Desktop for a lighter and faster experience.
Python 3.11 (for local dev)
Make
GitHub Fine-Grained Personal Access Token (PAT) - required for accessing private repositories (epymodelingsuite and forecasting)

IAM Permissions¶

To deploy and run this infrastructure, you need the following IAM permissions in addition to the Editor role (roles/editor):

Required roles:
- Project IAM Admin (roles/resourcemanager.projectIamAdmin) - To manage project-level IAM bindings (Terraform)
- Secret Manager Admin (roles/secretmanager.admin) - To manage IAM policies for secrets (Terraform)
- Service Account Admin (roles/iam.serviceAccountAdmin) - To manage IAM policies on service accounts (Terraform)
- Cloud Build Editor (roles/cloudbuild.builds.editor) - To submit and manage Cloud Build jobs (Docker builds)

Grant permissions to your user account:

# Project IAM Admin (required for Terraform to create project-level role bindings)
gcloud projects add-iam-policy-binding your-gcp-project-id \
  --member="user:user@example.com" \
  --role="roles/resourcemanager.projectIamAdmin"

# Secret Manager Admin (required for Terraform to set IAM policies on secrets)
gcloud projects add-iam-policy-binding your-gcp-project-id \
  --member="user:user@example.com" \
  --role="roles/secretmanager.admin"

# Service Account Admin (required for Terraform to set IAM policies on service accounts)
gcloud projects add-iam-policy-binding your-gcp-project-id \
  --member="user:user@example.com" \
  --role="roles/iam.serviceAccountAdmin"

# Cloud Build Editor (required to build and push Docker images)
gcloud projects add-iam-policy-binding your-gcp-project-id \
  --member="user:user@example.com" \
  --role="roles/cloudbuild.builds.editor"

Common permission errors:
- Error 403: Policy update access denied → Need Project IAM Admin and Secret Manager Admin roles
- Permission 'iam.serviceAccounts.setIamPolicy' denied → Need Service Account Admin role
- The caller does not have permission (Cloud Build) → Need Cloud Build Editor role

Note: Ask your Google Cloud project administrator to grant these roles if you encounter permission errors during deployment.

Initialize and configure the project:

# Initialize configuration system
epycloud config init

# Edit configuration file (opens in $EDITOR)
epycloud config edit

# Configuration structure (config.yaml):
# google_cloud:
#   project_id: your-gcp-project-id
#   region: us-central1
#   bucket_name: your-bucket-name  # Must exist
# docker:
#   repo_name: epymodelingsuite-repo
# github:
#   forecast_repo: owner/forecasting-repo
#   modeling_suite_repo: owner/epymodelingsuite
#   modeling_suite_ref: main

# Edit secrets (opens in $EDITOR with 0600 permissions)
epycloud config edit-secrets
# Add: github.personal_access_token: your_github_pat_here

# Verify configuration
epycloud config show

Note: Terraform does not create a new bucket; it uses an existing GCS bucket, which must exist before you apply.

Setting up GitHub Personal Access Token¶

The pipeline requires a GitHub Fine-Grained Personal Access Token (PAT) to clone private repositories during Docker build and Batch job execution.

Create a fine-grained PAT:
1. Go to GitHub Settings → Developer settings → Personal access tokens → Fine-grained tokens
2. Click "Generate new token"
3. Set an appropriate name and expiration
4. Under "Repository access", select "Only select repositories" and add:
   - epymodelingsuite repository
   - Your forecast repository
5. Under "Repository permissions", grant:
   - Contents: Read-only access
6. Generate and copy the token

Store the PAT in Google Secret Manager:

# Store the PAT (replace with your actual token). For the first time, run:
 echo -n "github_pat_xxxxxxxxxxxxx" | gcloud secrets create github-pat \
  --data-file=- \
  --project=${PROJECT_ID}

# Update PAT (second time or later)
  echo -n "github_pat_xxxx" | gcloud secrets versions add github-pat \
  --data-file=- \
  --project=${PROJECT_ID}

# Verify it was created
gcloud secrets describe github-pat --project=${PROJECT_ID}

Important notes:
- The secret name must be github-pat to match the Terraform configuration
- Never commit the PAT to version control
- Set an appropriate expiration date and rotate the token regularly
- A single PAT provides access to multiple repositories with granular permissions

3) Terraform¶

We use Terraform to manage core cloud resources and workflow. See full implementation in terraform/main.tf, terraform/variables.tf, and terraform/outputs.tf.

Key resources:
- APIs: Enables batch.googleapis.com, workflows.googleapis.com, artifactregistry.googleapis.com, secretmanager.googleapis.com
- Artifact Registry: Docker repository for container images
- GCS Bucket: Uses existing bucket via data "google_storage_bucket"
- Secret Manager: Stores GitHub PAT (github-pat) for repository access
- Service Accounts:
  - batch_runtime_sa: For running Batch jobs
  - workflows_runner_sa: For executing workflows
- IAM:
  - Batch SA has roles/storage.objectAdmin on bucket
  - Batch SA has roles/secretmanager.secretAccessor for GitHub PAT
  - Workflows SA has roles/batch.jobsAdmin and roles/iam.serviceAccountUser
- Workflows: Deploys from terraform/workflow.yaml

Important notes:
- Uses existing GCS bucket (not creating new one)
- Uses batch.jobsAdmin role (more restrictive than batch.admin)
- Includes logging permissions for Batch jobs
- GitHub PAT secret must be created manually before applying Terraform

4) Workflow YAML¶

See full implementation in terraform/workflow.yaml.

Orchestration flow:
1. Stage A (Generator): Single Batch job that runs main_builder.py to generate N input files
2. List inputs: Counts generated files in GCS
3. Stage B (Runner): Parallel Batch jobs (N tasks) running main_runner.py, each processing one input
4. Stage C (Output): Single Batch job that runs main_output.py to aggregate results and generate CSV outputs
5. Wait helper: Polls Batch job status until completion

Key features:
- Takes runtime parameters: count, seed, bucket, dirPrefix, exp_id, batchSaEmail, githubForecastRepo, maxParallelism (optional, default: 100)
- Auto-generates run_id from workflow execution ID for unique identification
- Constructs paths: {dirPrefix}{exp_id}/{run_id}/builder-artifacts/, /runner-artifacts/, and /outputs/
- Stage A: 2 CPU, 4 GB RAM, includes repo cloning via run_builder.sh
- Stage B: 2 CPU, 8 GB RAM, configurable parallelism (default: 100, max: 5000 per Cloud Batch limits)
- Stage C: 2 CPU, 8 GB RAM, aggregates all Stage B results and generates formatted CSV outputs
- GitHub authentication via PAT from Secret Manager
- Logs to Cloud Logging
- Error handling for failed/deleted jobs
- Returns job names and task count
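The path layout listed above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not code from workflow.yaml; the function name run_paths and the example prefix "pipeline/" are hypothetical.

```python
def run_paths(dir_prefix: str, exp_id: str, run_id: str) -> dict[str, str]:
    """Build the three per-run prefixes: {dirPrefix}{exp_id}/{run_id}/<stage dir>/."""
    base = f"{dir_prefix}{exp_id}/{run_id}/"
    return {
        "builder": f"{base}builder-artifacts/",  # Stage A writes inputs here
        "runner": f"{base}runner-artifacts/",    # Stage B writes results here
        "outputs": f"{base}outputs/",            # Stage C writes CSV outputs here
    }


# Hypothetical values, for illustration only.
for stage, prefix in run_paths("pipeline/", "initial-test", "20251114-123045-a1b2c3").items():
    print(stage, prefix)
```

Note that dirPrefix is concatenated directly, so it is assumed to end with a trailing slash when it is non-empty.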

5) Compute Resources and Instance Types¶

Understanding cpuMilli, memoryMib, and Machine Types¶

Google Cloud Batch allows you to specify compute resources in two ways:

Option 1: Automatic VM selection (recommended for getting started)
- Set cpu_milli and memory_mib in config to define task requirements
- Leave machine_type empty (or omit it)
- Google Cloud automatically selects an appropriate VM type
- Example: cpu_milli: 2000, memory_mib: 4096 → Google may provision c4d-standard-2 (2 vCPU, 7 GB)

Option 2: Explicit machine type (recommended for production)
- Set machine_type: "c4d-standard-2" in config
- cpu_milli and memory_mib become task constraints (must fit within the VM)
- Ensures predictable VM provisioning and scaling behavior
- Better for avoiding task queueing with task_count_per_node: 1

cpuMilli represents thousandths of a vCPU:
- 1000 cpuMilli = 1 vCPU
- 2000 cpuMilli = 2 vCPUs
- 4000 cpuMilli = 4 vCPUs

Important relationship:

If STAGE_B_MACHINE_TYPE is set:
  - cpuMilli/memoryMib = task requirements (must fit in VM)
  - Machine type determines actual VM resources
  - Example: e2-standard-2 (2 vCPU, 8 GB) with cpuMilli=2000 → 1 task per VM

If STAGE_B_MACHINE_TYPE is empty:
  - cpuMilli/memoryMib = basis for automatic VM selection
  - Google chooses VM type that fits requirements
  - Less predictable scaling behavior

See Google Cloud Batch documentation for details.
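To make the "must fit in VM" constraint concrete, here is a minimal sketch that checks how many tasks fit on one VM by CPU and memory. The helper tasks_that_fit is hypothetical and simplified: it ignores any memory the VM reserves for the OS or the Batch agent, so treat the result as an upper bound rather than what Batch will actually schedule.

```python
def tasks_that_fit(vm_vcpus: int, vm_memory_mib: int, cpu_milli: int, memory_mib: int) -> int:
    """Upper bound on concurrent tasks per VM; 0 means a single task does not fit."""
    by_cpu = (vm_vcpus * 1000) // cpu_milli  # 1 vCPU = 1000 cpuMilli
    by_memory = vm_memory_mib // memory_mib
    return min(by_cpu, by_memory)


# e2-standard-2 (2 vCPU, 8 GB) with cpuMilli=2000, memoryMib=8192: one task per VM.
print(tasks_that_fit(2, 8192, cpu_milli=2000, memory_mib=8192))  # 1
# The same request on a 7 GB VM does not fit.
print(tasks_that_fit(2, 7168, cpu_milli=2000, memory_mib=8192))  # 0
```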

Current Resource Allocation¶

Cloud Build - cloudbuild.yaml:38

machineType: E2_HIGHCPU_8

- Predefined instance type for build operations. Note that the only options for Cloud Build are e2-medium, e2-standard-2, e2-highcpu-8, and e2-highcpu-32. See documentation for details.
- e2-highcpu-8: 8 vCPU, 8 GB RAM (provides faster builds with more CPU)

Stage A Job (Generator)
- Single task that generates input files
- Configurable in config.yaml: google_cloud.batch.stage_a section
- Default resources: 2 vCPUs (cpu_milli: 2000), 4 GB RAM (memory_mib: 4096)

Stage B Job (Runner)
- Parallel tasks processing individual simulations
- Configurable in config.yaml: google_cloud.batch.stage_b section
- Default resources: 2 vCPUs (cpu_milli: 2000), 4 GB RAM (memory_mib: 4096)
- Default timeout: 36000 seconds (10 hours) per task (max_run_duration: 36000)
- Default parallelism: 1 task per VM (task_count_per_node: 1)

Stage C Job (Output)
- Single task that aggregates all Stage B results
- Configurable in config.yaml: google_cloud.batch.stage_c section
- Default resources: 2 vCPUs (cpu_milli: 2000), 8 GB RAM (memory_mib: 8192)
- Default timeout: 7200 seconds (2 hours) (max_run_duration: 7200)

Tuning Compute Resources¶

Resources are configurable in config.yaml under google_cloud.batch.stage_b:

Example configurations:

# Lightweight tasks (1 vCPU, 2 GB) - config.yaml
google_cloud:
  batch:
    stage_b:
      cpu_milli: 1000
      memory_mib: 2048
      machine_type: ""  # Auto-select

# Standard tasks (2 vCPUs, 8 GB) - Default
google_cloud:
  batch:
    stage_b:
      cpu_milli: 2000
      memory_mib: 8192
      machine_type: "e2-standard-2"

# Compute-intensive tasks (2 vCPUs, 7 GB)
google_cloud:
  batch:
    stage_b:
      cpu_milli: 2000
      memory_mib: 7168
      machine_type: "c4d-standard-2"

# High-memory tasks (2 vCPUs, 15 GB)
google_cloud:
  batch:
    stage_b:
      cpu_milli: 2000
      memory_mib: 15360
      machine_type: "c4d-highmem-2"

Timeout configurations:

# Edit config.yaml (set exactly one max_run_duration value; alternatives are commented out)
google_cloud:
  batch:
    stage_b:
      max_run_duration: 36000    # Long (5-10 hours) - Default
      # max_run_duration: 3600   # Short (< 1 hour)
      # max_run_duration: 18000  # Medium (1-5 hours)
      # max_run_duration: 86400  # Very long (24 hours)
      # Note: Cloud Batch max limit is 604800s (7 days)

Important notes:
- Tasks exceeding max_run_duration will be terminated by Google Cloud Batch
- Set this value based on your longest expected simulation runtime
- Add buffer time (e.g., if max simulation is 8 hours, set to 10 hours)
- Monitor task completion times to optimize this setting
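The buffer rule above can be expressed as a small calculation. This is a sketch under assumptions: the helper name suggest_max_run_duration and the 25% default buffer are illustrative choices (picked so that 8 hours maps to the 10-hour example), and the cap uses the 604800-second limit quoted in this guide.

```python
import math

BATCH_MAX_RUN_DURATION_S = 604800  # 7-day limit quoted in this guide


def suggest_max_run_duration(longest_runtime_hours: float, buffer_fraction: float = 0.25) -> int:
    """Return a max_run_duration in seconds: longest runtime plus a buffer, capped at the limit."""
    seconds = math.ceil(longest_runtime_hours * 3600 * (1 + buffer_fraction))
    return min(seconds, BATCH_MAX_RUN_DURATION_S)


print(suggest_max_run_duration(8))  # 36000, i.e. 10 hours
```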

Recommended Configuration: Dedicated VMs (task_count_per_node=1)¶

For parallel execution with no task queueing, use one task per VM:

# Recommended configuration in config.yaml
google_cloud:
  batch:
    task_count_per_node: 1
    stage_b:
      cpu_milli: 2000
      memory_mib: 7168  # Must fit within c4d-standard-2 (7 GB)
      machine_type: "c4d-standard-2"  # Explicit type for predictable scaling

Benefits:
- No queueing: Each task gets its own VM immediately
- Efficient resource usage: VMs terminate when tasks finish
- Cost-effective for variable runtimes: Pay only for actual task duration
- Predictable scaling: With explicit machine type, Batch provisions expected number of VMs

How it works:

With task_count_per_node=1:
  - Google Cloud Batch creates up to N VMs for N tasks
  - Each VM runs exactly 1 task then terminates
  - If machine type is set, VM creation is predictable
  - If machine type is empty, Batch may create fewer VMs (hitting quota/availability limits)

Production recommendations:
1. Set explicit machine_type in config for predictable VM provisioning
2. Match cpu_milli to machine capacity: e.g. c4d-standard-2 (2 vCPU) → cpu_milli: 2000
3. Set task_count_per_node: 1 for parallel execution with variable task runtimes
4. Monitor first run in Cloud Console to verify expected number of VMs are created

Machine type options:

c4d-standard-2 provides the best balance of performance and cost for our workloads. A typical calibration task in Stage B for a single state uses approximately 4 GB of memory, so the 7 GB available in c4d-standard-2 is sufficient with headroom. Since epymodelingsuite computations are single-threaded, tasks only require 1 vCPU (CPU_MILLI=1000). However, setting CPU_MILLI=1000 without an explicit machine type causes Google Cloud to auto-select slower e2-standard instances. We use dedicated small VMs with TASK_COUNT_PER_NODE=1 and c4d-standard-2 because this gives predictable scaling and uses faster AMD EPYC Genoa processors for better single-thread performance.

While larger shared VMs could maximize vCPU utilization and reduce per-task costs, they introduce unpredictable queueing when simulation runtimes vary significantly. (Some tasks finish quickly while others run longer, extending billing duration until the slowest task on each VM completes.) Using a dedicated VM with 2 vCPUs, the average vCPU utilization will be around 50%, but this ensures consistent performance and faster overall job completion, making it the preferred choice despite slightly higher vCPU costs.

Machine Type	vCPU	Memory (GB)	CPU_MILLI	MEMORY_MIB	Price (us-central1)*	Notes
`e2-standard-2`	2	8	2000	8192	$0.06701142/hr	Most cost-effective, general purpose.
`n2-standard-2`	2	8	2000	8192	$0.097118/hr	Better CPU performance than E2. Intel Cascade Lake/Ice Lake.
`c4-standard-2`	2	8	2000	8192	$0.096866/hr	Intel compute-optimized. Intel Sapphire Rapids.
`c4d-standard-2`	2	7	2000	7168	$0.089876046/hr	AMD compute-optimized. AMD EPYC Genoa.
`c4d-highmem-2`	2	15	2000	15360	$0.11784067/hr	High memory + compute. AMD EPYC Genoa.

*Prices are as of Oct 23, 2025. See Google Cloud VM pricing for current rates, and the Google Cloud machine types documentation for details on each machine type.
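As a rough way to compare machine types, the hourly prices in the table can be turned into a Stage B compute estimate. This is a back-of-the-envelope sketch only: it assumes one task per VM, uses the Oct 23, 2025 us-central1 prices from the table, and ignores disk, VM startup time, and any other charges. The function name estimate_stage_b_cost is hypothetical.

```python
# Hourly prices copied from the table above (us-central1, Oct 23, 2025).
PRICE_PER_HOUR_USD = {
    "e2-standard-2": 0.06701142,
    "n2-standard-2": 0.097118,
    "c4-standard-2": 0.096866,
    "c4d-standard-2": 0.089876046,
    "c4d-highmem-2": 0.11784067,
}


def estimate_stage_b_cost(machine_type: str, num_tasks: int, avg_task_hours: float) -> float:
    """VM-hours times hourly price, assuming one dedicated VM per task."""
    return PRICE_PER_HOUR_USD[machine_type] * num_tasks * avg_task_hours


# Example: 100 tasks averaging 2 hours each on c4d-standard-2.
print(f"${estimate_stage_b_cost('c4d-standard-2', 100, 2.0):.2f}")
```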

After modifying resources:

epycloud config edit      # Edit configuration
epycloud terraform plan   # Review changes
epycloud terraform apply  # Deploy updated configuration

6) Docker image¶

In both production (cloud) and development/testing (local), all computation runs in a Docker container. See docker/Dockerfile.

Multi-stage build architecture¶

The Dockerfile uses multi-stage builds with three stages:

Build stages:

- base - Common dependencies shared by both local and cloud images
  - Base image: python:3.11-slim
  - Installs uv for fast dependency management
  - Clones and installs epymodelingsuite package from private GitHub repository (uses its uv.lock)
  - Installs cloud-specific dependencies from this repo's docker/pyproject.toml + docker/uv.lock
  - Copies scripts from scripts/ directory
- local - Minimal image for local development
  - Builds from base stage
  - Size: ~300-400 MB
  - Includes Python deps, git, scripts, and epymodelingsuite
  - No gcloud CLI (uses local filesystem instead of GCS)
  - Used by Docker Compose for local testing
- cloud - Production image for Google Cloud
  - Builds from base stage
  - Size: ~500-700 MB
  - Adds gcloud CLI for Secret Manager access
  - Default entrypoint: main_builder.py (Stage A)
  - Used by Cloud Batch jobs

Image naming:
- Image name: epymodelingsuite (configurable via docker.image_name in config)
- Image tag: latest (configurable via docker.image_tag)
- Full path: {region}-docker.pkg.dev/{project_id}/{repo_name}/epymodelingsuite:latest
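The full image path is assembled from four config values plus the tag. A minimal sketch of that assembly, assuming the defaults named above (the function image_uri is illustrative and not part of the epycloud CLI):

```python
def image_uri(
    region: str,
    project_id: str,
    repo_name: str,
    image_name: str = "epymodelingsuite",
    image_tag: str = "latest",
) -> str:
    """Compose the Artifact Registry path for the pipeline image."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo_name}/{image_name}:{image_tag}"


# Placeholder values taken from the example config earlier in this guide.
print(image_uri("us-central1", "your-gcp-project-id", "epymodelingsuite-repo"))
```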

Build & push¶

# Option 1: Cloud Build (recommended for production)
epycloud build cloud

# Option 2: Build cloud image locally and push to Artifact Registry
# Ensure GitHub PAT is configured
epycloud config show  # Verify github.personal_access_token is set

# Authenticate Docker (one-time setup)
REGION=$(epycloud config show | grep 'region:' | awk '{print $2}')
PROJECT_ID=$(epycloud config show | grep 'project_id:' | awk '{print $2}')
gcloud auth configure-docker ${REGION}-docker.pkg.dev --project=${PROJECT_ID}

epycloud build local

# Option 3: Build local dev image (for Docker Compose, no push)
epycloud build dev

Build targets:
- epycloud build cloud - Uses Cloud Build on GCP, builds cloud target, pushes to Artifact Registry
- epycloud build local - Builds cloud target locally, pushes to Artifact Registry (requires auth)
- epycloud build dev - Builds local target locally, tags as epymodelingsuite:local, no push

Secrets management:
- GitHub PAT stored in ~/.config/epymodelingsuite-cloud/secrets.yaml (with 0600 permissions)
- Configure via: epycloud config edit-secrets
- Used for local builds when accessing private repositories
- Cloud builds use Secret Manager instead

Cloud Build configuration in cloudbuild.yaml:
- Uses E2_HIGHCPU_8 machine type for faster builds
- Enables layer caching with --cache-from and BUILDKIT_INLINE_CACHE
- Logs to Cloud Logging only
- Automatically pushes to Artifact Registry
- Fetches GitHub PAT from Secret Manager for private repository access
- Passes GITHUB_MODELING_SUITE_REPO and GITHUB_MODELING_SUITE_REF as build arguments
- Submits builds asynchronously via epycloud build cloud for non-blocking operation

How dependencies are installed:

The Docker image installs dependencies from two sources, both using locked versions for reproducibility:

1. epymodelingsuite dependencies (from the epymodelingsuite repository):

# Clone repo and install using its uv.lock
git clone ... /opt/epymodelingsuite
cd /opt/epymodelingsuite && uv sync --frozen

- Creates venv at /opt/epymodelingsuite/.venv
- Uses uv.lock from the epymodelingsuite repository
- Installs at build time (baked into the Docker image)
- Uses GitHub PAT from Secret Manager (Cloud Build) or secrets.yaml (local builds)
- Supports specific branch/commit via github.modeling_suite_ref config setting

2. Cloud-specific dependencies (from this repo's docker/ directory):

# Export locked deps and install into the existing venv
COPY docker/pyproject.toml docker/uv.lock /app/
RUN uv export --frozen --no-dev | uv pip install --no-cache -r -

- Defined in docker/pyproject.toml (google-cloud-storage, dill, python-json-logger)
- Locked versions in docker/uv.lock
- Installed into the same venv as epymodelingsuite

Why this approach?
- uv sync is project-centric and creates its own .venv in the project directory
- uv pip install respects the VIRTUAL_ENV environment variable
- By using uv export | uv pip install, we install locked dependencies into the epymodelingsuite venv

Updating cloud dependencies:

cd docker
# Edit pyproject.toml to add/update dependencies
uv lock                    # Regenerate uv.lock
uv lock --upgrade          # Upgrade all deps to latest compatible versions

Testing the container:

# Run container structure tests
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v $(pwd)/docker/container-structure-test.yaml:/config.yaml \
  gcr.io/gcp-runtimes/container-structure-test:latest test \
  --image <image-name> --config /config.yaml

This approach ensures reproducible builds with locked dependencies from both repositories.

Local execution with Docker Compose¶

Running locally¶

For local execution, first build the local image, then use the epycloud CLI commands:

# Ensure configuration is set up (including GitHub PAT if using private repos)
epycloud config show

# Build local development image
epycloud build dev

# Run builder (Stage A) - this auto-generates RUN_ID
epycloud run job --local --stage builder --exp-id test-sim
# Note the RUN_ID from the output, e.g., "20251114-123045-a1b2c3"

# Run a single runner locally (Stage B)
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0

# Run multiple runners for different tasks
for i in {0..9}; do
  epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index $i &
done
wait

# Run output generation (Stage C)
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks 10

Note: The epycloud CLI wraps docker compose calls with additional features:
- Automatically includes --rm flag to remove containers after execution (prevents orphaned containers)
- Validates required environment variables (e.g., EXP_ID, RUN_ID)
- Loads config values and passes them to docker compose as environment variables
- Provides helpful output messages showing where files are read/written

What the epycloud CLI does:
- epycloud run job --local --stage builder → runs docker compose run --rm builder with environment variables
- epycloud run job --local --stage runner → runs docker compose run --rm runner with environment variables
- epycloud run job --local --stage output → runs docker compose run --rm output with environment variables

Alternative (NOT recommended): Direct docker compose commands

# If you must run docker compose directly, always include --rm
docker compose run --rm builder
TASK_INDEX=0 docker compose run --rm runner
docker compose run --rm output

Docker Compose configuration¶

The Docker Compose setup (defined in docker-compose.yml):

Volume bindings:
- Mounts ./local (host) to /data (container)
  - ./local/bucket/ → /data/bucket/ - Simulates GCS bucket for builder-artifacts/runner-artifacts
  - ./local/forecast/ → /data/forecast/ - Forecast repository data (alternative to git clone)

Configuration:
- Uses the local build target (smaller image, no gcloud CLI)
- Environment variables loaded from config.yaml by epycloud CLI
- Sets EXECUTION_MODE=local automatically (enables local filesystem instead of GCS)
- Runtime variables: EXP_ID, RUN_ID, TASK_INDEX

Services:
- builder - Runs run_builder.sh to generate input files
- runner - Runs main_runner.py to process individual tasks
- output - Runs Stage C output generation to aggregate results

Local directory structure:

./local/                    # Host directory (mounted as /data in container)
  bucket/                   # → /data/bucket/ (simulates GCS bucket)
    {exp_id}/
      {run_id}/
        builder-artifacts/  # Generated input files (input_0000.pkl, ...)
        runner-artifacts/   # Simulation results (result_0000.pkl, ...)
  forecast/                 # → /data/forecast/ (forecast repository)
    experiments/            # YAML experiment configurations

The storage abstraction layer in the scripts automatically detects EXECUTION_MODE=local and uses /data/bucket/ instead of gs://bucket-name/.

7) Scripts¶

Stage A Wrapper: scripts/run_builder.sh ¶

Shell wrapper that handles forecast repository setup before running the builder (main_builder.py).

Features:
- Cloud mode: Fetches GitHub PAT from Secret Manager and clones forecast repo
- Local mode: Uses mounted forecast data from /data/forecast
- Adds forecast data to PYTHONPATH
- Supports optional FORECAST_REPO_REF for branch/tag checkout (cloud only)

Environment variables:
- EXECUTION_MODE - "cloud" or "local" (required)
- GITHUB_FORECAST_REPO - Forecast repo to clone (cloud mode only)
- FORECAST_REPO_DIR - Where to clone repo (default: /data/forecast/, cloud only)
- GCLOUD_PROJECT_ID - GCP project for Secret Manager (cloud mode only)
- GITHUB_PAT_SECRET - Secret Manager secret name (default: github-pat, cloud only)
- FORECAST_REPO_REF - Optional branch/tag to checkout (cloud mode only)

Stage A: scripts/main_builder.py ¶

Generates input files based on configuration and uploads them to GCS.

Features:
- Environment variables: GCS_BUCKET, OUT_PREFIX, JOBID, EXP_ID, RUN_ID
- Automatically discovers and resolves config files by parsing YAML structure
- Creates pickled input files with model configs
- Output pattern: {OUT_PREFIX}input_{i:04d}.pkl
- Logging for monitoring progress

Usage:

python main_builder.py
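To illustrate the input_{i:04d}.pkl output pattern, here is a minimal sketch that pickles a list of configs into numbered files. It is not main_builder.py: it writes to a local directory instead of GCS, uses the standard-library pickle module rather than dill, and the write_inputs helper and sample configs are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def write_inputs(configs: list, out_dir: Path) -> list[str]:
    """Write one pickle per config, named input_0000.pkl, input_0001.pkl, ..."""
    names = []
    for i, config in enumerate(configs):
        name = f"input_{i:04d}.pkl"
        (out_dir / name).write_bytes(pickle.dumps(config))
        names.append(name)
    return names


# Stand-in configs; the real builder derives them from experiment YAML files.
with tempfile.TemporaryDirectory() as tmp:
    print(write_inputs([{"state": "A"}, {"state": "B"}], Path(tmp)))
```

The zero-padded index keeps file listings sorted in task order and lets Stage B derive its input name from BATCH_TASK_INDEX alone.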

Stage B: scripts/main_runner.py ¶

Processes individual tasks in parallel using BATCH_TASK_INDEX.

Features:
- Environment variables: BATCH_TASK_INDEX, GCS_BUCKET, IN_PREFIX, OUT_PREFIX
- Downloads input: {IN_PREFIX}input_{idx:04d}.pkl
- Runs simulation (currently placeholder logic)
- Uploads results: {OUT_PREFIX}result_{idx:04d}.pkl
- Error handling and logging

Note: BATCH_TASK_INDEX is automatically set by Cloud Batch (0-indexed).
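The mapping from task index to file names can be sketched as follows. This is an illustration of the naming convention only, not the actual main_runner.py; the task_file_names helper, the injectable env argument, and the fallback to index 0 when the variable is unset are assumptions.

```python
import os


def task_file_names(in_prefix: str, out_prefix: str, env=None) -> tuple[str, str]:
    """Return (input object, result object) for the task selected by BATCH_TASK_INDEX."""
    env = os.environ if env is None else env
    idx = int(env.get("BATCH_TASK_INDEX", "0"))  # Cloud Batch sets this, 0-indexed
    return f"{in_prefix}input_{idx:04d}.pkl", f"{out_prefix}result_{idx:04d}.pkl"


# Simulate task 7 without touching the real environment.
print(task_file_names("run/builder-artifacts/", "run/runner-artifacts/", env={"BATCH_TASK_INDEX": "7"}))
```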

Stage C: scripts/main_output.py ¶

Aggregates all Stage B results and generates formatted CSV outputs.

Features:
- Environment variables: GCS_BUCKET, EXP_ID, RUN_ID, NUM_TASKS
- Downloads all result files: {RUN_ID}/runner-artifacts/result_*.pkl
- Automatically discovers output configuration from experiment directory
- Generates formatted CSV outputs (quantiles, trajectories, posteriors, metadata)
- Uploads to: {RUN_ID}/outputs/*.csv.gz
- Error handling for missing result files
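One way to detect missing result files is to compare the names expected from NUM_TASKS with the names actually found. The sketch below shows that comparison on plain lists; missing_results is a hypothetical helper, and how main_output.py actually lists objects and reacts to gaps is not shown here.

```python
def missing_results(num_tasks: int, found_names: list[str]) -> list[str]:
    """Return expected result_XXXX.pkl names that are absent from found_names."""
    found = set(found_names)
    expected = [f"result_{i:04d}.pkl" for i in range(num_tasks)]
    return [name for name in expected if name not in found]


# Task 1 failed to upload its result in this made-up listing.
print(missing_results(3, ["result_0000.pkl", "result_0002.pkl"]))  # ['result_0001.pkl']
```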

Wrapper: scripts/run_output.sh
- Similar to run_builder.sh, handles cloud vs local mode
- No repo cloning needed (uses already-installed epymodelingsuite)

8) Monitoring and Resource Groups¶

The infrastructure uses resource labels to organize and monitor different stages of the pipeline. All resources are tagged with labels for easy filtering and monitoring.

Label Structure¶

All resources use a consistent labeling scheme:
- component: epymodelingsuite - Identifies all resources belonging to this system
- stage - Identifies the specific phase:
  - imagebuild - Cloud Build jobs that build Docker images
  - builder - Stage A Batch jobs (dispatcher that generates input files)
  - runner - Stage B Batch jobs (parallel simulation runners)
  - output - Stage C Batch jobs (output aggregation and formatting)
- exp_id - Dynamic label for experiment ID (Batch jobs only)
- run_id - Dynamic label for workflow execution/run ID (Batch jobs only)
- environment: production - Environment identifier
- managed-by - Shows which tool manages the resource (terraform, cloudbuild, workflows)

Monitoring Dashboards¶

After running epycloud terraform apply, four Cloud Monitoring dashboards are automatically created:

- Builder Dashboard - Monitors Stage A (builder) CPU/memory usage
  - Filter: component=epymodelingsuite AND stage=builder
  - Metrics: CPU %, Memory %, Memory MiB, CPU cores
- Runner Dashboard - Monitors Stage B (parallel runners) CPU/memory, parallelism
  - Filter: component=epymodelingsuite AND stage=runner
  - Metrics: CPU %, Memory %, Memory MiB, CPU cores, Active instances
- Output Dashboard - Monitors Stage C (output generation) CPU/memory usage
  - Filter: component=epymodelingsuite AND stage=output
  - Metrics: CPU %, Memory %, Memory MiB, CPU cores
- Overall System Dashboard - Monitors all stages combined
  - Filter: component=epymodelingsuite
  - Metrics: Aggregated CPU/memory by stage, Active instances by stage

Access dashboards:

# After terraform apply, get dashboard URLs:
epycloud terraform output | grep dashboard

Or navigate to: Cloud Console → Monitoring → Dashboards

Custom Filtering¶

You can create custom queries in Cloud Monitoring to filter by specific experiments or runs:

# View all resources for a specific experiment
component=epymodelingsuite AND exp_id="experiment-01"

# View specific run of an experiment
component=epymodelingsuite AND run_id="abc123-def456"

# Compare builder vs runner performance
component=epymodelingsuite AND (stage=builder OR stage=runner)

9) epycloud CLI¶

The epycloud CLI provides all operational commands for the pipeline. For detailed commands and workflows, see /docs/operations.md.

Quick reference:

# Infrastructure
epycloud terraform init        # Initialize Terraform
epycloud terraform plan        # Preview changes
epycloud terraform apply       # Deploy infrastructure
epycloud terraform destroy     # Destroy resources
epycloud terraform output      # View Terraform outputs

# Build
epycloud build cloud           # Cloud Build (recommended)
epycloud build local           # Build locally and push
epycloud build dev             # Build for local development
epycloud build status          # Check build status
epycloud build status --ongoing # Show only active builds

# Execute (Cloud)
epycloud run workflow --exp-id my-exp   # Run workflow on cloud

# Monitor
epycloud status                         # Quick status check
epycloud workflow list                  # List workflows
epycloud workflow list --exp-id my-exp  # Filter by experiment
epycloud workflow describe <id>         # Workflow details
epycloud logs --exp-id my-exp           # View logs
epycloud logs --exp-id my-exp --stage B # Filter by stage

# Local development
epycloud run job --local --stage builder --exp-id test-sim
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks 10

# Configuration
epycloud config show           # Show current config
epycloud config edit           # Edit configuration
epycloud config validate       # Validate configuration

10) Operational notes¶

- Reproducibility: tag images with immutable digests and store a run_metadata.json next to outputs (image digest, args, run time, counts).
- Quotas:
  - Cap parallelism in Workflows (default: 100, configurable via google_cloud.batch.max_parallelism in config)
  - Cloud Batch supports up to 5,000 parallel tasks per job
  - Adjust CPU/Memory per task based on your region's vCPU quota
  - Avoid queueing: Set explicit machine_type with task_count_per_node: 1 for predictable VM provisioning
- VM Allocation Best Practices:
  - Set task_count_per_node: 1 for parallel execution (one task per VM)
  - Set explicit machine_type (e.g., "c4d-standard-2") for predictable scaling
  - Match cpu_milli to machine capacity: c4d-standard-2 (2 vCPU) → cpu_milli: 2000
  - Monitor first job run in Cloud Console to verify expected number of VMs are created
- Security:
  - Principle of least privilege (scoped IAM on bucket, read-only PAT for repos)
  - Only unpickle trusted data produced by Stage A
  - GitHub authentication via fine-grained PAT with minimal permissions (Contents: read)
  - Never commit GitHub PAT - stored in Secret Manager only
  - Rotate PAT regularly and set appropriate expiration dates
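For the quota items above, a quick estimate of peak vCPU demand helps when checking a region's vCPU quota before a large run. The sketch assumes one task per VM (task_count_per_node: 1) and uses the default cap of 100 and the 5,000-task limit stated in this guide; both function names are hypothetical.

```python
BATCH_MAX_PARALLELISM = 5000  # per-job limit quoted in this guide


def effective_parallelism(num_tasks: int, max_parallelism: int = 100) -> int:
    """Number of tasks that can run at once after applying both caps."""
    return min(num_tasks, max_parallelism, BATCH_MAX_PARALLELISM)


def peak_vcpus(num_tasks: int, vcpus_per_vm: int, max_parallelism: int = 100) -> int:
    """Peak vCPU demand, assuming one dedicated VM per running task."""
    return effective_parallelism(num_tasks, max_parallelism) * vcpus_per_vm


# 250 tasks on 2-vCPU VMs with the default cap of 100 concurrent tasks.
print(peak_vcpus(250, vcpus_per_vm=2))  # 200
```

If the result exceeds the regional vCPU quota, either lower max_parallelism or request a quota increase before the run.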

11) Billing and Cost Tracking¶

All resources are labeled with component=epymodelingsuite for billing tracking.

View costs in GCP Console:
1. Go to Billing → Reports
2. Add filter: Labels → component = epymodelingsuite
3. Group by: Service (to see Cloud Build, Batch, Workflows, etc.)

Billable resources tracked:
- Cloud Build (image builds)
- Cloud Batch (compute for jobs)
- Cloud Workflows (orchestration)
- Artifact Registry (Docker image storage)
- Secret Manager (GitHub PAT storage)
- Cloud Logging (inherited from parent resources)

Note: Cloud Storage costs are not tracked by labels since the bucket is shared with other projects. Track storage by prefix (DIR_PREFIX) if needed.

Billing Project Label¶

Cloud Batch jobs can be labeled with a user-defined billing_project for cost grouping in GCP billing reports. Use this to categorize costs however makes sense for you (by contract, client, funding source, team, etc.).

Configuration (persistent):

Set in config.yaml or a profile file:

google_cloud:
  billing_project: "my-project-name"

CLI override (per-run):

epycloud run workflow --exp-id my-exp --billing-project my-project-name
epycloud run job --stage A --exp-id my-exp --billing-project my-project-name

The CLI flag overrides the config file value for that run.

Filtering costs by billing project:
1. Go to Billing → Reports
2. Add filter: Labels → billing_project = my-project-name

12) Implementation Summary¶

TODO (for production use):
- Set up result aggregation/analysis scripts
- Configure monitoring and alerting for workflow failures
- Implement result validation and quality checks