Operations Guide¶
This document covers common operational commands for running and monitoring the pipeline.
Table of Contents¶
- Prerequisites
- Quick Reference
- Building Docker Images
- Build Targets
- Cloud Build (Recommended for Production)
- Local Build and Push
- Local Development Build
- Running the Pipeline
- A) Cloud Execution
- B) Local Execution
- Monitoring and Debugging
- Quick Status Check
- Workflow Management
- Batch Jobs
- Cloud Storage
- Pipeline Logs
- Cloud Console Dashboards
- Terraform Operations
- Initialize Terraform
- Preview Changes
- Apply Infrastructure Changes
- Destroy Infrastructure
- Update Docker Image
Prerequisites¶
Before running any commands, ensure you have initialized the configuration:
# One-time setup: Initialize configuration
epycloud config init
# Edit configuration as needed
epycloud config edit
epycloud config edit-secrets # For GitHub PAT
# Activate a profile (e.g., flu)
epycloud profile use flu
# Verify configuration (should show all YAMLs)
epycloud config show
Note: The epycloud CLI automatically loads configuration from ~/.config/epymodelingsuite-cloud/. No need to source files in each terminal session.
Quick Reference¶
# ===============================================================
# Only required for initial setup / updating setup
# ===============================================================
# Initialize configuration (one-time)
# epycloud config init
# epycloud config edit
# epycloud config edit-secrets
# epycloud profile use flu
# epycloud config show
# Deploy infrastructure
# epycloud terraform init && epycloud terraform apply
# Build Docker image
# epycloud build cloud
# ===============================================================
# ========================== Production =========================
# 0. Add experiments (YAML files) in forecast repository. Make sure to git push.
# 1. Run workflow on cloud
epycloud run workflow --exp-id experiment-01
# 2. Monitor workflow
epycloud workflow list
epycloud workflow list --exp-id experiment-01
# ======================= Testing locally ========================
# 0. Setup local experiment configs in ./local/forecast/experiments/{EXP_ID}/config/
# 1. Test pipeline locally with Docker Compose
# Build Docker image for local testing
epycloud build dev
# Run pipeline stages (RUN_ID is auto-generated by builder)
epycloud run job --local --stage builder --exp-id test-sim
# After builder completes, get RUN_ID from output and run runner
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
# Generate outputs
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks 1
Building Docker Images¶
The pipeline runs in Docker containers. You can build images using Cloud Build (recommended for production) or locally for development and testing.
Build Targets¶
There are three build targets:
- epycloud build cloud - Cloud Build; the image is pushed to Artifact Registry.
- epycloud build local - Local build; the image is pushed to Artifact Registry.
- epycloud build dev - Development build for testing the pipeline locally; not pushed to the cloud.
Cloud Build (Recommended for Production)¶
Build and push the Docker image using Google Cloud Build:
epycloud build cloud
This submits the build asynchronously and returns immediately with a build ID. Monitor progress with:
# View build status
epycloud build status
epycloud build status --ongoing
# Or use gcloud commands for specific build
gcloud builds log <BUILD_ID> --region=$REGION --stream
gcloud builds describe <BUILD_ID> --region=$REGION
Local Build and Push¶
Build locally and push to Artifact Registry:
# Ensure GitHub PAT is configured in secrets
epycloud config show # Verify github.personal_access_token is set
# Authenticate Docker (one-time setup)
# Get REGION and PROJECT_ID from config
REGION=$(epycloud config show | grep 'region:' | awk '{print $2}')
PROJECT_ID=$(epycloud config show | grep 'project_id:' | awk '{print $2}')
gcloud auth configure-docker ${REGION}-docker.pkg.dev --project=${PROJECT_ID}
# Build and push
epycloud build local
Local Development Build¶
To run the pipeline locally, you will need a local development build image:
# Ensure configuration is set up (including GitHub PAT if using private repos)
epycloud config show
# Build local dev image
epycloud build dev
Running the Pipeline¶
The pipeline can be executed on Google Cloud for production runs or locally using Docker for development and testing.
A) Cloud Execution¶
Run the full pipeline on Google Cloud:
# Add experiments (YAML files) in forecast repository. Make sure to git push.
# Basic run
epycloud run workflow --exp-id my-experiment
# The workflow will:
# 1. Generate a unique RUN_ID automatically
# 2. Run Stage A (builder) to create input files
# 3. Run Stage B (runners) in parallel
# 4. Run Stage C (output) to aggregate results and generate CSV outputs
# 5. Store results in: gs://{bucket}/{DIR_PREFIX}{EXP_ID}/{RUN_ID}/
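The result path from step 5 can be assembled mechanically. A minimal sketch with placeholder values (the bucket name, prefix, and IDs below are illustrative, not real configuration):

```shell
# Assemble the GCS output prefix; all values here are placeholders.
BUCKET=my-bucket
DIR_PREFIX=forecasts/           # may be empty depending on configuration
EXP_ID=my-experiment
RUN_ID=20251114-103052-abc123
OUTPUT_URI="gs://${BUCKET}/${DIR_PREFIX}${EXP_ID}/${RUN_ID}/"
echo "$OUTPUT_URI"
```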
Monitoring:
# List workflows
epycloud workflow list
# List workflows for specific experiment
epycloud workflow list --exp-id my-experiment
# View workflow details
epycloud workflow describe <execution-id>
# View logs
epycloud logs --exp-id my-experiment
epycloud logs --exp-id my-experiment --stage B --tail 500
Note: Single-task reruns and independent Stage C runs are no longer supported via the CLI. Use the full workflow, or run individual stages locally for testing.
B) Local Execution¶
Run the pipeline locally for testing using Docker.
Overview¶
Key differences from cloud mode:
- Storage: Uses the ./local/ directory instead of Google Cloud Storage (GCS)
- Forecast repo: Reads from ./local/forecast/ instead of cloning from GitHub
- RUN_ID: Auto-generated by the builder, but must be passed explicitly to the runner and output stages
Local directory structure:
./local/
bucket/ # Simulates GCS bucket (created automatically)
{EXP_ID}/{RUN_ID}/
builder-artifacts/input_*.pkl # Generated by builder
runner-artifacts/result_*.pkl # Generated by runners
outputs/*.csv.gz # Generated by output stage
forecast/ # Experiment configurations and data
experiments/{EXP_ID}/config/*.yaml
experiments/{EXP_ID}/config/output.yaml # Output configuration
common-data/ # Shared data files (surveillance data, etc.)
functions/ # Custom Python modules (optional, user-defined functions)
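The layout above can be scaffolded in one step; "test-sim" below is a placeholder experiment ID:

```shell
# Create the minimal local directory layout for a placeholder experiment.
EXP_ID=test-sim
mkdir -p "./local/forecast/experiments/${EXP_ID}/config"
mkdir -p ./local/forecast/common-data ./local/forecast/functions
# basemodel.yaml is the minimum required experiment config (see Setup below).
touch "./local/forecast/experiments/${EXP_ID}/config/basemodel.yaml"
```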
Setup¶
One-time setup:
# 1. Initialize configuration
epycloud config init
epycloud config edit # Configure Google Cloud, Docker settings
epycloud config edit-secrets # Add GitHub PAT for building Docker image
epycloud profile use flu # Activate a profile
epycloud config show # Verify configuration (should show all YAMLs)
# 2. Create experiment directory
mkdir -p ./local/forecast/experiments/{EXP_ID}/config/
# Add config files: at minimum basemodel.yaml (optional: sampling.yaml, calibration.yaml)
# 3. Build local development image (takes a few minutes)
epycloud build dev
Running an Experiment¶
# 1. Run builder (Stage A) - this auto-generates RUN_ID
epycloud run job --local --stage builder --exp-id test-sim
# Note the RUN_ID from the output, e.g., "20251114-123045-a1b2c3"
ls -R ./local/bucket/test-sim/<run_id>/builder-artifacts/ # Verify: should show input_*.pkl files
# 2. Run tasks (Stage B)
# Single task:
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
# Or multiple in parallel (manual approach):
for i in {0..9}; do
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index $i &
done
wait
ls ./local/bucket/test-sim/<run_id>/runner-artifacts/ # Verify: should show result_*.pkl files
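The manual parallel loop above launches every task at once. A variant that bounds concurrency with xargs -P, demonstrated with echo standing in for the epycloud runner command:

```shell
# Run task indices 0..9 with at most 4 concurrent processes.
# Replace the echo with the epycloud runner invocation, using {} as --task-index.
seq 0 9 | xargs -P 4 -I{} echo "task {}"
```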
# 3. Run output generation (Stage C)
# First, determine NUM_TASKS from the number of result files
NUM_TASKS=$(ls ./local/bucket/test-sim/<run_id>/runner-artifacts/result_*.pkl | wc -l)
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks $NUM_TASKS
ls ./local/bucket/test-sim/<run_id>/outputs/ # Verify: should show *.csv.gz files
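One caveat on the NUM_TASKS step above: `ls … | wc -l` errors when no files match. A glob-safe count, shown here as a self-contained demo with throwaway files:

```shell
# Glob-safe count of matching files (yields 0 instead of an error when none match).
shopt -s nullglob
mkdir -p demo-artifacts
touch demo-artifacts/result_0.pkl demo-artifacts/result_1.pkl
files=(demo-artifacts/result_*.pkl)
NUM_TASKS=${#files[@]}
echo "$NUM_TASKS"   # 2 for this demo
```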
Important: The builder auto-generates a unique RUN_ID (format: YYYYMMDD-HHMMSS-<uuid>). Note this value from the builder output for use in runner and output stages. All local data is stored in ./local/bucket/{EXP_ID}/{RUN_ID}/.
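The RUN_ID shape can be sketched as follows; the suffix generation here is an assumption for illustration only (the builder's actual ID scheme may differ):

```shell
# Illustrate the RUN_ID shape: YYYYMMDD-HHMMSS-<6-char id>.
# The hex suffix below is an assumption; only the overall format matters.
STAMP=$(date +%Y%m%d-%H%M%S)
SUFFIX=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' | cut -c1-6)
RUN_ID="${STAMP}-${SUFFIX}"
echo "$RUN_ID"
```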
Monitoring and Debugging¶
Monitor pipeline execution using the epycloud CLI, Google Cloud Console dashboards, and command-line tools.
Quick Status Check¶
The epycloud status command provides a real-time overview of all active workflows and batch jobs:
# View all active workflows and batch jobs
epycloud status
# Filter by specific experiment
epycloud status --exp-id experiment-01
# Watch mode with auto-refresh (default: 3 seconds)
epycloud status --watch
# Custom refresh interval
epycloud status --watch --interval 5
Example output:
Pipeline Status
================================================================================
Active Workflows:
EXECUTION ID EXP_ID START TIME
--------------------------------------------------------------------------------
abc123-def456-ghi789 test-flu 2025-11-14 10:30:00
Active Batch Jobs:
JOB NAME STAGE STATUS TASKS
--------------------------------------------------------------------------------
epycloud-test-flu-20251114-103052-abc123-stage-b B RUNNING 45/100
Total active: 1 workflow(s), 1 batch job(s)
This command is ideal for:
- Quick status checks without navigating to the Cloud Console
- Monitoring multiple experiments simultaneously
- Watching job progress in real-time with --watch mode
- Identifying stuck or failed workflows
Workflow Management¶
https://console.cloud.google.com/workflows
List and manage workflow executions:
# List workflow executions (default: 20 most recent)
epycloud workflow list
# List with more results
epycloud workflow list --limit 50
# Filter by experiment ID
epycloud workflow list --exp-id test-flu
# Filter by status (ACTIVE, SUCCEEDED, FAILED, CANCELLED)
epycloud workflow list --status ACTIVE
# Describe workflow execution details
epycloud workflow describe <execution-id>
# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100
# Cancel running workflow
epycloud workflow cancel <execution-id>
# Retry failed workflow
epycloud workflow retry <execution-id>
Batch Jobs¶
https://console.cloud.google.com/batch
Monitor active batch jobs using the status command:
# View all active batch jobs (recommended)
epycloud status
# Filter by specific experiment
epycloud status --exp-id experiment-01
# Watch mode for real-time updates
epycloud status --watch
For detailed job inspection when needed:
# Describe specific job details
gcloud batch jobs describe <job-name> --location=$REGION
# List all tasks for a job
gcloud batch tasks list --job=<job-name> --location=$REGION
Note: The epycloud status command provides all the information needed for typical batch job monitoring. Use the raw gcloud commands above only when you need detailed task-level inspection or job metadata.
Cloud Storage¶
https://console.cloud.google.com/storage/browser
Pipeline Logs¶
https://console.cloud.google.com/logs/
View pipeline logs using the epycloud logs command:
# View logs for an experiment (default: last 100 entries)
epycloud logs --exp-id experiment-01
# View more logs
epycloud logs --exp-id experiment-01 --tail 500
# View all logs (no limit)
epycloud logs --exp-id experiment-01 --tail 0
# Filter by stage
epycloud logs --exp-id experiment-01 --stage A
epycloud logs --exp-id experiment-01 --stage B
epycloud logs --exp-id experiment-01 --stage C
# Filter by run ID
epycloud logs --exp-id experiment-01 --run-id 20251114-103052-abc123
# Filter by specific task
epycloud logs --exp-id experiment-01 --task-index 5
# Time-based filtering
epycloud logs --exp-id experiment-01 --since 1h
epycloud logs --exp-id experiment-01 --since 30m
# Stream live logs
epycloud logs --exp-id experiment-01 --follow
# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100
# Clean output for export (no colors)
epycloud --no-color logs --exp-id experiment-01 > logs.txt
Cloud Console Dashboards¶
Access the monitoring dashboards in the Google Cloud Console.
Four dashboards are available:
- Builder Dashboard - Stage A (builder) metrics
- Runner Dashboard - Stage B (parallel runners) metrics
- Output Dashboard - Stage C (output generation) metrics
- Overall System Dashboard - Combined metrics across all stages
Terraform Operations¶
Manage Google Cloud infrastructure using Terraform commands. The epycloud CLI automatically loads configuration.
Initialize Terraform¶
First-time setup:
epycloud terraform init
Preview Changes¶
See what will change before applying:
epycloud terraform plan
Apply Infrastructure Changes¶
Deploy or update infrastructure:
epycloud terraform apply
What gets created:
- Artifact Registry repository
- Service accounts with IAM permissions
- Cloud Workflows definition
- Monitoring dashboards
- Secret Manager references
Destroy Infrastructure¶
Remove all Terraform-managed resources:
epycloud terraform destroy
Warning: This deletes all infrastructure except:
- GCS bucket (pre-existing, managed separately)
- Secret Manager secrets (must be deleted manually)
- Stored data in GCS
Update Docker Image¶
After code changes:
# 1. Build new image
epycloud build cloud
# 2. Run workflow (uses latest tag)
epycloud run workflow --exp-id updated-experiment
For versioned images: