Operations Guide¶
This document covers common operational commands for running and monitoring the pipeline.
Table of Contents¶
- Prerequisites
- Quick Reference
- Building Docker Images
- Build Targets
- Cloud Build (Recommended for Production)
- Local Build and Push
- Local Development Build
- Running the Pipeline
- A) Cloud Execution
- B) Local Execution
- Monitoring and Debugging
- Quick Status Check
- Workflow Management
- Batch Jobs
- Cloud Storage
- Pipeline Logs
- Cloud Console Dashboards
- Terraform Operations
- Initialize Terraform
- Preview Changes
- Apply Infrastructure Changes
- Destroy Infrastructure
- Update Docker Image
Prerequisites¶
Before running any commands, ensure you have initialized the configuration:
# One-time setup: Initialize configuration
epycloud config init
# Edit configuration as needed
epycloud config edit
epycloud config edit-secrets # For GitHub PAT
# Activate a profile (e.g., flu)
epycloud profile use flu
# Verify configuration (should show all YAMLs)
epycloud config show
Note: The epycloud CLI automatically loads configuration from ~/.config/epymodelingsuite-cloud/. No need to source files in each terminal session.
Quick Reference¶
# ===============================================================
# Only required for initial setup / updating setup
# ===============================================================
# Initialize configuration (one-time)
# epycloud config init
# epycloud config edit
# epycloud config edit-secrets
# epycloud profile use flu
# epycloud config show
# Deploy infrastructure
# epycloud terraform init && epycloud terraform apply
# Build Docker image
# epycloud build cloud
# ===============================================================
# ========================== Production =========================
# 0. Add experiments (YAML files) in forecast repository. Make sure to git push.
# 1. Run workflow on cloud
epycloud run workflow --exp-id experiment-01
# 2. Monitor workflow
epycloud workflow list
epycloud workflow list --exp-id experiment-01
# ======================= Testing locally ========================
# 0. Setup local experiment configs in ./local/forecast/experiments/{EXP_ID}/config/
# 1. Test pipeline locally with Docker Compose
# Build Docker image for local testing
epycloud build dev
# Run pipeline stages (RUN_ID is auto-generated by builder)
epycloud run job --local --stage builder --exp-id test-sim
# After builder completes, get RUN_ID from output and run runner
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
# Generate outputs
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks 1
Building Docker Images¶
The pipeline runs in Docker containers. You can build images using Cloud Build (recommended for production) or locally for development and testing.
Build Targets¶
There are three build targets:
- epycloud build cloud - Cloud Build; the image is pushed to Artifact Registry.
- epycloud build local - Local build; the image is pushed to Artifact Registry.
- epycloud build dev - Development build for testing the pipeline locally; not pushed to the cloud.
Cloud Build (Recommended for Production)¶
Build and push the Docker image using Google Cloud Build:
epycloud build cloud
This submits the build asynchronously and returns immediately with a build ID. Monitor progress with:
# View build status
epycloud build status
epycloud build status --ongoing
# Or use gcloud commands for specific build
gcloud builds log <BUILD_ID> --region=$REGION --stream
gcloud builds describe <BUILD_ID> --region=$REGION
Local Build and Push¶
Build locally and push to Artifact Registry:
# Ensure GitHub PAT is configured in secrets
epycloud config show # Verify github.personal_access_token is set
# Authenticate Docker (one-time setup)
# Get REGION and PROJECT_ID from config
REGION=$(epycloud config show | grep 'region:' | awk '{print $2}')
PROJECT_ID=$(epycloud config show | grep 'project_id:' | awk '{print $2}')
gcloud auth configure-docker ${REGION}-docker.pkg.dev --project=${PROJECT_ID}
# Build and push
epycloud build local
Local Development Build¶
To run the pipeline locally, you will need a local development build image:
# Ensure configuration is set up (including GitHub PAT if using private repos)
epycloud config show
# Build local dev image
epycloud build dev
Running the Pipeline¶
The pipeline can be executed on Google Cloud for production runs or locally using Docker for development and testing.
A) Cloud Execution¶
Run the full pipeline on Google Cloud:
# Add experiments (YAML files) in forecast repository. Make sure to git push.
# Basic run
epycloud run workflow --exp-id my-experiment
# The workflow will:
# 1. Generate a unique RUN_ID automatically
# 2. Run Stage A (builder) to create input files
# 3. Run Stage B (runners) in parallel
# 4. Run Stage C (output) to aggregate results and generate CSV outputs
# 5. Store results in: gs://{bucket}/{DIR_PREFIX}{EXP_ID}/{RUN_ID}/
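The result path from step 5 can be assembled mechanically. A minimal sketch with placeholder values (the bucket name, prefix, and IDs below are illustrative, not real configuration):

```shell
# Assemble the GCS output prefix; all values here are placeholders.
BUCKET=my-bucket
DIR_PREFIX=forecasts/           # may be empty depending on configuration
EXP_ID=my-experiment
RUN_ID=20251114-103052-abc123
OUTPUT_URI="gs://${BUCKET}/${DIR_PREFIX}${EXP_ID}/${RUN_ID}/"
echo "$OUTPUT_URI"
```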
Monitoring:
# List workflows
epycloud workflow list
# List workflows for specific experiment
epycloud workflow list --exp-id my-experiment
# View workflow details
epycloud workflow describe <execution-id>
# View logs
epycloud logs --exp-id my-experiment
epycloud logs --exp-id my-experiment --stage B --tail 500
Note: Single-task reruns and independent Stage C runs are no longer supported via the CLI. Use the full workflow, or run individual stages locally for testing.
B) Local Execution¶
Run the pipeline locally for testing using Docker.
Overview¶
Key differences from cloud mode:
- Storage: Uses the ./local/ directory instead of Google Cloud Storage (GCS)
- Forecast repo: Reads from ./local/forecast/ instead of cloning from GitHub
- RUN_ID: Auto-generated by the builder, but must be passed explicitly to the runner and output stages
Local directory structure:
./local/
bucket/ # Simulates GCS bucket (created automatically)
{EXP_ID}/{RUN_ID}/
builder-artifacts/input_*.pkl # Generated by builder
runner-artifacts/result_*.pkl # Generated by runners
outputs/*.csv.gz # Generated by output stage
forecast/ # Experiment configurations and data
experiments/{EXP_ID}/config/*.yaml
experiments/{EXP_ID}/config/output.yaml # Output configuration
common-data/ # Shared data files (surveillance data, etc.)
functions/ # Custom Python modules (optional, user-defined functions)
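The layout above can be scaffolded in one step; "test-sim" below is a placeholder experiment ID:

```shell
# Create the minimal local directory layout for a placeholder experiment.
EXP_ID=test-sim
mkdir -p "./local/forecast/experiments/${EXP_ID}/config"
mkdir -p ./local/forecast/common-data ./local/forecast/functions
# basemodel.yaml is the minimum required experiment config (see Setup below).
touch "./local/forecast/experiments/${EXP_ID}/config/basemodel.yaml"
```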
Setup¶
One-time setup:
# 1. Initialize configuration
epycloud config init
epycloud config edit # Configure Google Cloud, Docker settings
epycloud config edit-secrets # Add GitHub PAT for building Docker image
epycloud profile use flu # Activate a profile
epycloud config show # Verify configuration (should show all YAMLs)
# 2. Create experiment directory
mkdir -p ./local/forecast/experiments/{EXP_ID}/config/
# Add config files: at minimum basemodel.yaml (optional: sampling.yaml, calibration.yaml)
# 3. Build local development image (takes a few minutes)
epycloud build dev
Running an Experiment¶
# 1. Run builder (Stage A) - this auto-generates RUN_ID
epycloud run job --local --stage builder --exp-id test-sim
# Note the RUN_ID from the output, e.g., "20251114-123045-a1b2c3"
ls -R ./local/bucket/test-sim/<run_id>/builder-artifacts/ # Verify: should show input_*.pkl files
# 2. Run tasks (Stage B)
# Single task:
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
# Or multiple in parallel (manual approach):
for i in {0..9}; do
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index $i &
done
wait
ls ./local/bucket/test-sim/<run_id>/runner-artifacts/ # Verify: should show result_*.pkl files
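The manual parallel loop above launches every task at once. A variant that bounds concurrency with xargs -P, demonstrated with echo standing in for the epycloud runner command:

```shell
# Run task indices 0..9 with at most 4 concurrent processes.
# Replace the echo with the epycloud runner invocation, using {} as --task-index.
seq 0 9 | xargs -P 4 -I{} echo "task {}"
```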
# 3. Run output generation (Stage C)
# First, determine NUM_TASKS from the number of result files
NUM_TASKS=$(ls ./local/bucket/test-sim/<run_id>/runner-artifacts/result_*.pkl | wc -l)
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks $NUM_TASKS
ls ./local/bucket/test-sim/<run_id>/outputs/ # Verify: should show *.csv.gz files
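One caveat on the NUM_TASKS step above: `ls … | wc -l` errors when no files match. A glob-safe count, shown here as a self-contained demo with throwaway files:

```shell
# Glob-safe count of matching files (yields 0 instead of an error when none match).
shopt -s nullglob
mkdir -p demo-artifacts
touch demo-artifacts/result_0.pkl demo-artifacts/result_1.pkl
files=(demo-artifacts/result_*.pkl)
NUM_TASKS=${#files[@]}
echo "$NUM_TASKS"   # 2 for this demo
```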
Important: The builder auto-generates a unique RUN_ID (format: YYYYMMDD-HHMMSS-<uuid>). Note this value from the builder output for use in runner and output stages. All local data is stored in ./local/bucket/{EXP_ID}/{RUN_ID}/.
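The RUN_ID shape can be sketched as follows; the suffix generation here is an assumption for illustration only (the builder's actual ID scheme may differ):

```shell
# Illustrate the RUN_ID shape: YYYYMMDD-HHMMSS-<6-char id>.
# The hex suffix below is an assumption; only the overall format matters.
STAMP=$(date +%Y%m%d-%H%M%S)
SUFFIX=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' | cut -c1-6)
RUN_ID="${STAMP}-${SUFFIX}"
echo "$RUN_ID"
```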
Monitoring and Debugging¶
Monitor pipeline execution using the epycloud CLI, Google Cloud Console dashboards, and command-line tools.
Quick Status Check¶
The epycloud status command provides a real-time overview of all active workflows and batch jobs:
# View all active workflows and batch jobs
epycloud status
# Filter by specific experiment
epycloud status --exp-id experiment-01
# Watch mode with auto-refresh (default: 3 seconds)
epycloud status --watch
# Custom refresh interval
epycloud status --watch --interval 5
Example output:
Pipeline Status
================================================================================
Active Workflows:
EXECUTION ID EXP_ID START TIME
--------------------------------------------------------------------------------
abc123-def456-ghi789 test-flu 2025-11-14 10:30:00
Active Batch Jobs:
JOB NAME STAGE STATUS TASKS
--------------------------------------------------------------------------------
epycloud-test-flu-20251114-103052-abc123-stage-b B RUNNING 45/100
Total active: 1 workflow(s), 1 batch job(s)
This command is ideal for:
- Quick status checks without navigating to the Cloud Console
- Monitoring multiple experiments simultaneously
- Watching job progress in real-time with --watch mode
- Identifying stuck or failed workflows
Workflow Management¶
https://console.cloud.google.com/workflows
List and manage workflow executions:
# List workflow executions (default: 20 most recent)
epycloud workflow list
# List with more results
epycloud workflow list --limit 50
# Filter by experiment ID
epycloud workflow list --exp-id test-flu
# Filter by status (ACTIVE, SUCCEEDED, FAILED, CANCELLED)
epycloud workflow list --status ACTIVE
# Describe workflow execution details
epycloud workflow describe <execution-id>
# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100
# Cancel running workflow
epycloud workflow cancel <execution-id>
# Retry failed workflow
epycloud workflow retry <execution-id>
Batch Jobs¶
https://console.cloud.google.com/batch
Monitor active batch jobs using the status command:
# View all active batch jobs (recommended)
epycloud status
# Filter by specific experiment
epycloud status --exp-id experiment-01
# Watch mode for real-time updates
epycloud status --watch
For detailed job inspection when needed:
# Describe specific job details
gcloud batch jobs describe <job-name> --location=$REGION
# List all tasks for a job
gcloud batch tasks list --job=<job-name> --location=$REGION
Note: The epycloud status command provides all the information needed for typical batch job monitoring. Use the raw gcloud commands above only when you need detailed task-level inspection or job metadata.
Cloud Storage¶
https://console.cloud.google.com/storage/browser
Pipeline Logs¶
https://console.cloud.google.com/logs/
View pipeline logs using the epycloud logs command:
# View logs for an experiment (default: last 100 entries)
epycloud logs --exp-id experiment-01
# View more logs
epycloud logs --exp-id experiment-01 --tail 500
# View all logs (no limit)
epycloud logs --exp-id experiment-01 --tail 0
# Filter by stage
epycloud logs --exp-id experiment-01 --stage A
epycloud logs --exp-id experiment-01 --stage B
epycloud logs --exp-id experiment-01 --stage C
# Filter by run ID
epycloud logs --exp-id experiment-01 --run-id 20251114-103052-abc123
# Filter by specific task
epycloud logs --exp-id experiment-01 --task-index 5
# Time-based filtering
epycloud logs --exp-id experiment-01 --since 1h
epycloud logs --exp-id experiment-01 --since 30m
# Stream live logs
epycloud logs --exp-id experiment-01 --follow
# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100
# Clean output for export (no colors)
epycloud --no-color logs --exp-id experiment-01 > logs.txt
Cloud Console Dashboards¶
Access the monitoring dashboards in the Google Cloud Console.
Four dashboards are available:
- Builder Dashboard - Stage A (builder) metrics
- Runner Dashboard - Stage B (parallel runners) metrics
- Output Dashboard - Stage C (output generation) metrics
- Overall System Dashboard - Combined metrics across all stages
Terraform Operations¶
Manage Google Cloud infrastructure using Terraform commands. The epycloud CLI automatically loads configuration.
Initialize Terraform¶
First-time setup:
epycloud terraform init
Preview Changes¶
See what will change before applying:
epycloud terraform plan
Apply Infrastructure Changes¶
Deploy or update infrastructure:
epycloud terraform apply
What gets created:
- Artifact Registry repository
- Service accounts with IAM permissions
- Cloud Workflows definition
- Monitoring dashboards
- Secret Manager references
Destroy Infrastructure¶
Remove all Terraform-managed resources:
epycloud terraform destroy
Warning: This deletes all infrastructure except:
- GCS bucket (pre-existing, managed separately)
- Secret Manager secrets (must be deleted manually)
- Stored data in GCS
Update Docker Image¶
After code changes:
# 1. Build new image
epycloud build cloud
# 2. Run workflow (uses latest tag)
epycloud run workflow --exp-id updated-experiment
For versioned images: