<!-- Documentation index: https://mobs-lab.github.io/epymodelingsuite-cloud/llms.txt -->
<!-- Full documentation: https://mobs-lab.github.io/epymodelingsuite-cloud/llms-full.txt -->

# Operations Guide

This document covers common operational commands for running and monitoring the pipeline.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Reference](#quick-reference)
- [Building Docker Images](#building-docker-images)
  - [Build Targets](#build-targets)
  - [Cloud Build (Recommended for Production)](#cloud-build-recommended-for-production)
  - [Local Build and Push](#local-build-and-push)
  - [Local Development Build](#local-development-build)
- [Running the Pipeline](#running-the-pipeline)
  - [A) Cloud Execution](#a-cloud-execution)
  - [B) Local Execution](#b-local-execution)
- [Monitoring and Debugging](#monitoring-and-debugging)
  - [Quick Status Check](#quick-status-check)
  - [Workflow Management](#workflow-management)
  - [Batch Jobs](#batch-jobs)
  - [Cloud Storage](#cloud-storage)
  - [Pipeline Logs](#pipeline-logs)
  - [Cloud Console Dashboards](#cloud-console-dashboards)
- [Terraform Operations](#terraform-operations)
  - [Initialize Terraform](#initialize-terraform)
  - [Preview Changes](#preview-changes)
  - [Apply Infrastructure Changes](#apply-infrastructure-changes)
  - [Destroy Infrastructure](#destroy-infrastructure)
  - [Update Docker Image](#update-docker-image)

## Prerequisites

Before running any commands, ensure you have initialized the configuration:

```bash
# One-time setup: Initialize configuration
epycloud config init

# Edit configuration as needed
epycloud config edit
epycloud config edit-secrets  # For GitHub PAT

# Activate a profile (e.g., flu)
epycloud profile use flu

# Verify configuration (should show all YAMLs)
epycloud config show
```

**Note:** The `epycloud` CLI automatically loads configuration from `~/.config/epymodelingsuite-cloud/`. No need to source files in each terminal session.
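If the CLI ever reports missing configuration, you can inspect that directory directly; a minimal sketch (the path comes from the note above, while the exact file names depend on what `config init` created):

```shell
# Check that the epycloud configuration directory exists and show its contents
CONFIG_DIR="${HOME}/.config/epymodelingsuite-cloud"
if [ -d "${CONFIG_DIR}" ]; then
  ls -la "${CONFIG_DIR}"
else
  echo "No configuration found; run: epycloud config init"
fi
```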

## Quick Reference

```bash
# ===============================================================
#        Only required for initial setup / updating setup
# ===============================================================
# Initialize configuration (one-time)
# epycloud config init
# epycloud config edit
# epycloud config edit-secrets
# epycloud profile use flu
# epycloud config show

# Deploy infrastructure
# epycloud terraform init && epycloud terraform apply

# Build Docker image
# epycloud build cloud
# ===============================================================

# ========================== Production =========================
# 0. Add experiments (YAML files) to the forecast repository. Make sure to git push.
# 1. Run workflow on cloud
epycloud run workflow --exp-id experiment-01

# 2. Monitor workflow
epycloud workflow list
epycloud workflow list --exp-id experiment-01

# ======================= Testing locally ========================
# 0. Setup local experiment configs in ./local/forecast/experiments/{EXP_ID}/config/

# 1. Test pipeline locally with Docker Compose
# Build Docker image for local testing
epycloud build dev
# Run pipeline stages (RUN_ID is auto-generated by builder)
epycloud run job --local --stage builder --exp-id test-sim
# After builder completes, get RUN_ID from output and run runner
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0
# Generate outputs
epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks 1
```

---

## Building Docker Images

The pipeline runs in Docker containers. You can build images using Cloud Build (recommended for production) or locally for development and testing.

### Build Targets
There are three build targets:
- `epycloud build cloud` - Builds on Cloud Build; image is pushed to Artifact Registry.
- `epycloud build local` - Builds locally; image is pushed to Artifact Registry.
- `epycloud build dev` - Development build for testing the pipeline locally; not pushed to the cloud.

### Cloud Build (Recommended for Production)

Build and push Docker image using Google Cloud Build:

```bash
# Build image on cloud (async - returns immediately)
epycloud build cloud
```

This submits the build asynchronously and returns immediately with a build ID. Monitor progress with:

```bash
# View build status
epycloud build status
epycloud build status --ongoing

# Or use gcloud commands for specific build
gcloud builds log <BUILD_ID> --region=$REGION --stream
gcloud builds describe <BUILD_ID> --region=$REGION
```

### Local Build and Push

Build locally and push to Artifact Registry:

```bash
# Ensure GitHub PAT is configured in secrets
epycloud config show  # Verify github.personal_access_token is set

# Authenticate Docker (one-time setup)
# Get REGION and PROJECT_ID from config
REGION=$(epycloud config show | grep 'region:' | awk '{print $2}' | head -n 1)
PROJECT_ID=$(epycloud config show | grep 'project_id:' | awk '{print $2}' | head -n 1)
gcloud auth configure-docker ${REGION}-docker.pkg.dev --project=${PROJECT_ID}

# Build and push
epycloud build local
```

### Local Development Build

To run the pipeline locally, you need a development image:

```bash
# Ensure configuration is set up (including GitHub PAT if using private repos)
epycloud config show

# Build local dev image
epycloud build dev
```



## Running the Pipeline

The pipeline can be executed on Google Cloud for production runs or locally using Docker for development and testing.

### A) Cloud Execution

Run the full pipeline on Google Cloud:

```bash
# Add experiments (YAML files) to the forecast repository. Make sure to git push.

# Basic run
epycloud run workflow --exp-id my-experiment

# The workflow will:
# 1. Generate a unique RUN_ID automatically
# 2. Run Stage A (builder) to create input files
# 3. Run Stage B (runners) in parallel
# 4. Run Stage C (output) to aggregate results and generate CSV outputs
# 5. Store results in: gs://{bucket}/{DIR_PREFIX}{EXP_ID}/{RUN_ID}/
```
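Once a workflow finishes, the results path from step 5 can be assembled and browsed with `gcloud storage`; a sketch in which `BUCKET`, `DIR_PREFIX`, and the IDs are placeholders you would take from your own config and the workflow output:

```shell
# Assemble the GCS results path for a finished run (all values are placeholders)
BUCKET="my-bucket"                   # your configured GCS bucket
DIR_PREFIX=""                        # optional prefix from your config (may be empty)
EXP_ID="my-experiment"
RUN_ID="20251114-103052-abc123"      # printed by the workflow / builder
RESULTS_URI="gs://${BUCKET}/${DIR_PREFIX}${EXP_ID}/${RUN_ID}/"
echo "${RESULTS_URI}"

# Browse the run's artifacts (requires gcloud auth):
# gcloud storage ls "${RESULTS_URI}"
# gcloud storage ls "${RESULTS_URI}outputs/"
```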

**Monitoring:**
```bash
# List workflows
epycloud workflow list

# List workflows for specific experiment
epycloud workflow list --exp-id my-experiment

# View workflow details
epycloud workflow describe <execution-id>

# View logs
epycloud logs --exp-id my-experiment
epycloud logs --exp-id my-experiment --stage B --tail 500
```

**Note:** Single-task reruns and independent Stage C runs are no longer supported via the CLI. Use the full workflow, or run stages locally for testing.


### B) Local Execution

Run the pipeline locally for testing using Docker.

#### Overview

**Key differences from cloud mode:**

- Storage: Uses `./local/` directory instead of Google Cloud Storage (GCS)
- Forecast repo: Reads from `./local/forecast/` instead of cloning from GitHub
- RUN_ID: Auto-generated by the builder (as in cloud mode), but must be passed explicitly to the runner and output stages, since no workflow orchestrates them locally

**Local directory structure:**

```
./local/
  bucket/                    # Simulates GCS bucket (created automatically)
    {EXP_ID}/{RUN_ID}/
      builder-artifacts/input_*.pkl     # Generated by builder
      runner-artifacts/result_*.pkl     # Generated by runners
      outputs/*.csv.gz                  # Generated by output stage
  forecast/                  # Experiment configurations and data
    experiments/{EXP_ID}/config/*.yaml
    experiments/{EXP_ID}/config/output.yaml  # Output configuration
    common-data/             # Shared data files (surveillance data, etc.)
    functions/               # Custom Python modules (optional, user-defined functions)
```

#### Setup

**One-time setup:**

```bash
# 1. Initialize configuration
epycloud config init
epycloud config edit          # Configure Google Cloud, Docker settings
epycloud config edit-secrets  # Add GitHub PAT for building Docker image
epycloud profile use flu      # Activate a profile
epycloud config show          # Verify configuration (should show all YAMLs)

# 2. Create experiment directory (replace {EXP_ID} with your experiment name)
mkdir -p ./local/forecast/experiments/{EXP_ID}/config/
# Add config files: at minimum basemodel.yaml (optional: sampling.yaml, calibration.yaml)

# 3. Build local development image (takes a few minutes)
epycloud build dev
```

#### Running an Experiment

```bash
# 1. Run builder (Stage A) - this auto-generates RUN_ID
epycloud run job --local --stage builder --exp-id test-sim
# Note the RUN_ID from the output, e.g., "20251114-123045-a1b2c3"

ls -R ./local/bucket/test-sim/<run_id>/builder-artifacts/  # Verify: should show input_*.pkl files

# 2. Run tasks (Stage B)
# Single task:
epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index 0

# Or multiple in parallel (manual approach):
for i in {0..9}; do
  epycloud run job --local --stage runner --exp-id test-sim --run-id <run_id> --task-index $i &
done
wait

ls ./local/bucket/test-sim/<run_id>/runner-artifacts/  # Verify: should show result_*.pkl files

# 3. Run output generation (Stage C)
# First, determine NUM_TASKS from the number of result files
NUM_TASKS=$(ls ./local/bucket/test-sim/<run_id>/runner-artifacts/result_*.pkl | wc -l)

epycloud run job --local --stage output --exp-id test-sim --run-id <run_id> --num-tasks $NUM_TASKS

ls ./local/bucket/test-sim/<run_id>/outputs/  # Verify: should show *.csv.gz files
```

**Important:** The builder auto-generates a unique `RUN_ID` (format: `YYYYMMDD-HHMMSS-<uuid>`). Note this value from the builder output for use in runner and output stages. All local data is stored in `./local/bucket/{EXP_ID}/{RUN_ID}/`.
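Rather than copying the `RUN_ID` between stages by hand, you can recover the most recent one from the local bucket layout; a sketch assuming the `./local/bucket/{EXP_ID}/{RUN_ID}/` structure described above:

```shell
# Pick the most recently created RUN_ID for an experiment from the local bucket
EXP_ID="test-sim"
RUN_ID="$(ls -t "./local/bucket/${EXP_ID}/" | head -n 1)"
echo "Latest RUN_ID: ${RUN_ID}"

# Then reuse it, e.g.:
# epycloud run job --local --stage runner --exp-id "${EXP_ID}" --run-id "${RUN_ID}" --task-index 0
```

Because run IDs start with a `YYYYMMDD-HHMMSS` timestamp, sorting directory names (`ls | sort -r | head -n 1`) works as well if modification times are unreliable.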


## Monitoring and Debugging

Monitor pipeline execution using the `epycloud` CLI, Google Cloud Console dashboards, and command-line tools.

### Quick Status Check

The `epycloud status` command provides a real-time overview of all active workflows and batch jobs:

```bash
# View all active workflows and batch jobs
epycloud status

# Filter by specific experiment
epycloud status --exp-id experiment-01

# Watch mode with auto-refresh (default: 3 seconds)
epycloud status --watch

# Custom refresh interval
epycloud status --watch --interval 5
```

**Example output:**

```
Pipeline Status
================================================================================

Active Workflows:
EXECUTION ID                              EXP_ID              START TIME
--------------------------------------------------------------------------------
abc123-def456-ghi789                      test-flu            2025-11-14 10:30:00

Active Batch Jobs:
JOB NAME                                          STAGE    STATUS       TASKS
--------------------------------------------------------------------------------
epycloud-test-flu-20251114-103052-abc123-stage-b  B        RUNNING      45/100

Total active: 1 workflow(s), 1 batch job(s)
```

This command is ideal for:
- Quick status checks without navigating to the Cloud Console
- Monitoring multiple experiments simultaneously
- Watching job progress in real-time with `--watch` mode
- Identifying stuck or failed workflows

### Workflow Management

https://console.cloud.google.com/workflows

List and manage workflow executions:

```bash
# List workflow executions (default: 20 most recent)
epycloud workflow list

# List with more results
epycloud workflow list --limit 50

# Filter by experiment ID
epycloud workflow list --exp-id test-flu

# Filter by status (ACTIVE, SUCCEEDED, FAILED, CANCELLED)
epycloud workflow list --status ACTIVE

# Describe workflow execution details
epycloud workflow describe <execution-id>

# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100

# Cancel running workflow
epycloud workflow cancel <execution-id>

# Retry failed workflow
epycloud workflow retry <execution-id>
```

### Batch Jobs

https://console.cloud.google.com/batch

Monitor active batch jobs using the status command:

```bash
# View all active batch jobs (recommended)
epycloud status

# Filter by specific experiment
epycloud status --exp-id experiment-01

# Watch mode for real-time updates
epycloud status --watch
```

For detailed job inspection when needed:

```bash
# Describe specific job details
gcloud batch jobs describe <job-name> --location=$REGION

# List all tasks for a job
gcloud batch tasks list --job=<job-name> --location=$REGION
```

**Note:** The `epycloud status` command provides all the information needed for typical batch job monitoring. Use the raw `gcloud` commands above only when you need detailed task-level inspection or job metadata.

### Cloud Storage

https://console.cloud.google.com/storage/browser


### Pipeline Logs

https://console.cloud.google.com/logs/

View pipeline logs using the `epycloud logs` command:

```bash
# View logs for an experiment (default: last 100 entries)
epycloud logs --exp-id experiment-01

# View more logs
epycloud logs --exp-id experiment-01 --tail 500

# View all logs (no limit)
epycloud logs --exp-id experiment-01 --tail 0

# Filter by stage
epycloud logs --exp-id experiment-01 --stage A
epycloud logs --exp-id experiment-01 --stage B
epycloud logs --exp-id experiment-01 --stage C

# Filter by run ID
epycloud logs --exp-id experiment-01 --run-id 20251114-103052-abc123

# Filter by specific task
epycloud logs --exp-id experiment-01 --task-index 5

# Time-based filtering
epycloud logs --exp-id experiment-01 --since 1h
epycloud logs --exp-id experiment-01 --since 30m

# Stream live logs
epycloud logs --exp-id experiment-01 --follow

# View workflow-specific logs
epycloud workflow logs <execution-id>
epycloud workflow logs <execution-id> --follow
epycloud workflow logs <execution-id> --tail 100

# Clean output for export (no colors)
epycloud --no-color logs --exp-id experiment-01 > logs.txt
```


### Cloud Console Dashboards

Access the monitoring dashboards in the [Cloud Console](https://console.cloud.google.com/monitoring/dashboards).

Four dashboards are available:
- **Builder Dashboard** - Stage A (builder) metrics
- **Runner Dashboard** - Stage B (parallel runners) metrics
- **Output Dashboard** - Stage C (output generation) metrics
- **Overall System Dashboard** - Combined metrics across all stages


## Terraform Operations

Manage Google Cloud infrastructure using Terraform commands. The `epycloud` CLI automatically loads configuration.

### Initialize Terraform

First-time setup:

```bash
epycloud terraform init
```

### Preview Changes

See what will change before applying:

```bash
epycloud terraform plan
```

### Apply Infrastructure Changes

Deploy or update infrastructure:

```bash
epycloud terraform apply
```

**What gets created:**
- Artifact Registry repository
- Service accounts with IAM permissions
- Cloud Workflows definition
- Monitoring dashboards
- Secret Manager references

### Destroy Infrastructure

Remove all Terraform-managed resources:

```bash
epycloud terraform destroy
```

**Warning:** This deletes all infrastructure except:
- GCS bucket (pre-existing, managed separately)
- Secret Manager secrets (require manual deletion)
- Stored data in GCS

### Update Docker Image

After code changes:

```bash
# 1. Build new image
epycloud build cloud

# 2. Run workflow (uses latest tag)
epycloud run workflow --exp-id updated-experiment
```

For versioned images:

```bash
# Update image tag in configuration
epycloud config edit  # Update docker.image_tag

# Rebuild and redeploy
epycloud build cloud
epycloud terraform apply  # Update workflow to reference new tag
```