Pipeline Stages¶
The pipeline orchestrates computational workloads through three sequential stages. Each stage serves a specific purpose in transforming experiment configurations into final results.
Overview¶
```mermaid
graph LR
    subgraph "Stage A: Builder"
        A[Single Task<br/>5-15 min]
    end
    subgraph "Stage B: Runner"
        B[N Parallel Tasks<br/>Minutes to Hours]
    end
    subgraph "Stage C: Output"
        C[Single Task<br/>5-30 min]
    end
    YAML[Config YAMLs] --> A
    A -->|N .pkl files| STORAGE1[(Storage)]
    STORAGE1 --> B
    B -->|N .pkl files| STORAGE2[(Storage)]
    STORAGE2 --> C
    C -->|CSV/PDF| RESULTS[Results]
    style A fill:#4285f4,color:#fff
    style B fill:#34a853,color:#fff
    style C fill:#fbbc04,color:#000
```

- Stage A (Builder): Reads YAML configs, builds EpiModel objects for each population, and generates N task input files
- Stage B (Runner): Executes N simulations/calibrations in parallel
- Stage C (Output): Aggregates results into formatted outputs
Stage A: Builder¶
Purpose¶
Reads experiment YAML configuration files, constructs model objects for each population, and packages them into self-contained task inputs that Stage B can execute in parallel.
What It Does¶
Input Files Read:
- `basemodel.yaml` - Model structure (compartments, transitions, parameters, populations)
- `modelset.yaml` - Population definitions and scenario variants; contains calibration or sampling settings depending on workflow type (optional)
- `output.yaml` - Output generation settings (optional)
Processing:
- Loads and validates YAML configuration files
- Dispatches to the appropriate builder via `epymodelingsuite.dispatcher.dispatch_builder()` based on which configs are provided:
    - Basemodel only → builds a single EpiModel with simulation settings
    - Basemodel + sampling config → builds one EpiModel per parameter combination defined by the sampling strategy
    - Basemodel + calibration config → builds one ABCSampler per population, each containing the model, priors, observed data, and calibration strategy
- Saves N input files to storage, one self-contained BuilderOutput per file
For example, if your calibration config defines 52 US state/territory populations, Stage A produces 52 input files (one per population). Each file contains everything needed to run the full calibration (including all particles) for that population.
Output Files:
builder-artifacts/
├── input_00000.pkl.gz
├── input_00001.pkl.gz
├── input_00002.pkl.gz
└── input_{N-1}.pkl.gz
Each `.pkl.gz` file is a gzip-compressed pickle containing a BuilderOutput with everything a single Stage B task needs to run: the constructed model, execution settings, and (for calibration) the sampler with priors and observed data.
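The serialization format and file-naming convention can be illustrated with a minimal sketch. The dict here is a hypothetical stand-in for a real BuilderOutput, whose actual contents are described above:

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical payload standing in for a BuilderOutput; the real object
# carries the constructed model, execution settings, and (for calibration)
# the sampler with priors and observed data.
payload = {"population": "US-CA", "task_index": 0}

# Write a gzip-compressed pickle, following the input_{i:05d}.pkl.gz convention
path = os.path.join(tempfile.mkdtemp(), f"input_{0:05d}.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(payload, f)

# A Stage B task would load its assigned file the same way
with gzip.open(path, "rb") as f:
    loaded = pickle.load(f)
```

Because each file is self-contained, a Stage B task needs nothing beyond its one assigned input file.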
Execution Details¶
- Concurrency: Single task (1 container)
- Typical Duration: 5-15 minutes
- Resources: Low CPU, moderate memory (configuration parsing and object creation)
- Entry Point: `run_builder.sh` (wrapper that clones the experiment repository), calling `main_builder.py`
What Determines N?¶
The number of input files (N) depends on your configuration:
- Calibration Mode: N = number of populations. Each task runs the full calibration (all particles) for one population (e.g., 52 US states = 52 tasks)
- Sampling Mode: N = number of parameter combinations determined by the sampling strategy (grid, LHS, etc.)
- Simple Simulation (basemodel only): N = 1
The builder outputs NUM_TASKS=N as a job label, which Stage B uses to determine parallelism.
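The three rules above can be sketched as a small function. This is an illustration of the logic, not the library's API; the function and argument names are hypothetical:

```python
def num_tasks(populations=None, parameter_combinations=None):
    """Hypothetical sketch of how N is determined by the configuration.

    - Calibration mode: one task per population
    - Sampling mode: one task per parameter combination
    - Simple simulation (basemodel only): a single task
    """
    if populations is not None:
        return len(populations)
    if parameter_combinations is not None:
        return len(parameter_combinations)
    return 1
```

For example, a calibration config listing 52 US state/territory populations yields `num_tasks(populations=states) == 52`, matching the 52 input files described above.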
Stage B: Runner¶
Purpose¶
Executes simulations or calibrations in parallel. This is where the actual computational work happens.
What It Does¶
Input: Single .pkl.gz file assigned to this task (input_{TASK_INDEX}.pkl.gz)
Processing:
- Loads task configuration from assigned input file
- Executes workload using `epymodelingsuite.dispatcher.dispatch_runner()` (simulation, calibration, or projection)
- Serializes result to storage
Output Files:
runner-artifacts/
├── result_00000.pkl.gz
├── result_00001.pkl.gz
├── result_00002.pkl.gz
└── result_{N-1}.pkl.gz
Each result file contains simulation trajectories, fitted parameters, or projection outputs.
Execution Details¶
- Concurrency: N parallel tasks (configurable max parallelism, default: 100)
- Typical Duration: Highly variable
- Simple simulations: 5-30 minutes
- Complex calibrations: 1-4 hours
- Large-scale projections: 30 minutes - 2 hours
- Resources: High CPU, high memory (intensive computation)
- Entry Point: `main_runner.py`
Task Assignment¶
Each container receives a unique task index. In cloud mode, Cloud Batch automatically sets BATCH_TASK_INDEX (0-indexed). In local mode, users set TASK_INDEX manually. The script checks both variables (TASK_INDEX takes precedence) and uses the index to load the corresponding input file.
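A minimal sketch of this precedence rule (the helper name `resolve_task_index` is hypothetical; the environment variable names come from the description above):

```python
import os

def resolve_task_index() -> int:
    """Resolve this container's task index.

    TASK_INDEX (set manually in local mode) takes precedence over
    BATCH_TASK_INDEX (set automatically by Cloud Batch in cloud mode).
    """
    raw = os.environ.get("TASK_INDEX") or os.environ.get("BATCH_TASK_INDEX")
    if raw is None:
        raise RuntimeError("Neither TASK_INDEX nor BATCH_TASK_INDEX is set")
    return int(raw)

# The index selects this task's input file, e.g. input_00007.pkl.gz:
# input_name = f"input_{resolve_task_index():05d}.pkl.gz"
```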
Independence & Parallelism¶
Key Property: Tasks are completely independent - no inter-task communication, can run in any order, and failures in one task don't affect others. This enables massive parallelism: cloud mode can scale to thousands of tasks, while local mode is limited by your machine's cores. Google Cloud Batch automatically schedules tasks across available VMs.
Stage C: Output Generator¶
Purpose¶
Aggregates results from all Stage B tasks and generates formatted outputs (tables, plots, formatted CSVs).
What It Does¶
Input: All result files from Stage B (result_00000.pkl.gz through result_{N-1}.pkl.gz)
Processing:
- Loads all result files
- Auto-detects result type (simulation vs calibration vs projection)
- Loads output configuration (auto-detects `output.yaml` or uses specified config)
- Uses `epymodelingsuite.dispatcher.dispatch_output_generator()` to aggregate trajectories, compute quantiles, format for submission, and generate metadata
- Saves to timestamped output directory
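The quantile-aggregation step can be illustrated with a small NumPy sketch over synthetic trajectories. The array shape and quantile levels here are assumptions for illustration, not the pipeline's actual settings:

```python
import numpy as np

# Synthetic stand-in for trajectories collected from N result files:
# N tasks x T time steps for one compartment.
rng = np.random.default_rng(0)
trajectories = rng.poisson(lam=100, size=(52, 30))

# Pointwise quantiles across tasks: one row per time step,
# one column per quantile level.
levels = [0.025, 0.25, 0.5, 0.75, 0.975]
quantiles = np.quantile(trajectories, levels, axis=0).T  # shape (30, 5)
```

In the real pipeline the trajectories come from the unpickled `result_*.pkl.gz` files, and the aggregated table is written as a compressed CSV like those listed below.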
Output Files (varies by result type):
outputs/YYYYMMDD-HHMMSS/
├── quantiles_compartments.csv.gz # Simulation
├── quantiles_transitions.csv.gz # Simulation
├── quantiles_calibration.csv.gz # Calibration
├── quantiles_projection_compartments.csv.gz # Calibration
├── quantiles_projection_transitions.csv.gz # Calibration
├── trajectories_compartments.csv.gz # If requested
├── trajectories_transitions.csv.gz # If requested
├── posteriors.csv.gz # Calibration only
├── model_metadata.csv.gz # If requested
└── output_hub_formatted.csv.gz # If hub format requested
Execution Details¶
- Concurrency: Single task (1 container)
- Typical Duration: 5-30 minutes (depends on N and output complexity)
- Resources: Moderate CPU, high memory (loading all N result files)
- Entry Point: `run_output.sh` (wrapper for optional repository cloning), calling `main_output.py`
Timestamped Outputs¶
Output directories are timestamped in UTC (YYYYMMDD-HHMMSS/) to prevent overwriting. This allows running Stage C multiple times with different output configurations without losing previous results.
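A minimal sketch of generating a directory name in this UTC `YYYYMMDD-HHMMSS` format:

```python
from datetime import datetime, timezone

# UTC timestamp matching the outputs/YYYYMMDD-HHMMSS/ convention
stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
out_dir = f"outputs/{stamp}/"
```

Since each run gets a fresh timestamp, rerunning Stage C never clobbers an earlier output directory (barring two runs starting within the same second).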
Data flow between stages¶
For details on the experiment data and artifacts saved at each stage, see Data and Artifacts.
Stage Orchestration¶
In cloud mode, Cloud Workflows automates the full Stage A → B → C sequence. In local mode, stages run sequentially via the epycloud CLI. See Execution Modes for the complete comparison.
Next Steps¶
- Data and Artifacts: Experiment data and artifact organization
- Running Experiments: How to run the pipeline (cloud and local)
- Docker Images: How stages are containerized
- Storage Abstraction: How artifacts are stored
- Workflows Orchestration: How stages are coordinated
- Monitoring: How to monitor pipeline execution
- Troubleshooting: Common issues and solutions