Pipeline Stages¶
The pipeline orchestrates computational workloads through three sequential stages. Each stage serves a specific purpose in transforming experiment configurations into final results.
Overview¶
```mermaid
graph LR
    subgraph "Stage A: Builder"
        A[Single Task<br/>5-15 min]
    end
    subgraph "Stage B: Runner"
        B[N Parallel Tasks<br/>Minutes to Hours]
    end
    subgraph "Stage C: Output"
        C[Single Task<br/>5-30 min]
    end
    YAML[Config YAMLs] --> A
    A -->|N .pkl files| STORAGE1[(Storage)]
    STORAGE1 --> B
    B -->|N .pkl files| STORAGE2[(Storage)]
    STORAGE2 --> C
    C -->|CSV/PDF| RESULTS[Results]
    style A fill:#4285f4,color:#fff
    style B fill:#34a853,color:#fff
    style C fill:#fbbc04,color:#000
```

- Stage A (Builder): Reads YAML configs, builds EpiModel objects for each population, and generates N task input files
- Stage B (Runner): Executes N simulations/calibrations in parallel
- Stage C (Output): Aggregates results into formatted outputs
Stage A: Builder¶
Purpose¶
Reads experiment YAML configuration files, constructs model objects for each population, and packages them into self-contained task inputs that Stage B can execute in parallel.
What It Does¶
Input Files Read:
- `basemodel.yaml` - Model structure (compartments, transitions, parameters, populations)
- `modelset.yaml` - Population definitions and scenario variants; contains calibration or sampling settings depending on workflow type (optional)
- `output.yaml` - Output generation settings (optional)
Processing:
- Loads and validates YAML configuration files
- Dispatches to the appropriate builder via `epymodelingsuite.dispatcher.dispatch_builder()` based on which configs are provided:
    - Basemodel only → builds a single EpiModel with simulation settings
    - Basemodel + sampling config → builds one EpiModel per parameter combination defined by the sampling strategy
    - Basemodel + calibration config → builds one ABCSampler per population, each containing the model, priors, observed data, and calibration strategy
- Saves N input files to storage, one self-contained BuilderOutput per file
For example, if your calibration config defines 52 US state/territory populations, Stage A produces 52 input files (one per population). Each file contains everything needed to run the full calibration (including all particles) for that population.
Output Files:
builder-artifacts/
├── input_00000.pkl.gz
├── input_00001.pkl.gz
├── input_00002.pkl.gz
└── input_{N-1}.pkl.gz
Each `.pkl.gz` file is a gzip-compressed pickle containing a BuilderOutput with everything a single Stage B task needs to run: the constructed model, execution settings, and (for calibration) the sampler with priors and observed data.
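The serialization format and file-naming convention can be illustrated with a minimal sketch. The dict here is a hypothetical stand-in for a real BuilderOutput, whose actual contents are described above:

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical payload standing in for a BuilderOutput; the real object
# carries the constructed model, execution settings, and (for calibration)
# the sampler with priors and observed data.
payload = {"population": "US-CA", "task_index": 0}

# Write a gzip-compressed pickle, following the input_{i:05d}.pkl.gz convention
path = os.path.join(tempfile.mkdtemp(), f"input_{0:05d}.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(payload, f)

# A Stage B task would load its assigned file the same way
with gzip.open(path, "rb") as f:
    loaded = pickle.load(f)
```

Because each file is self-contained, a Stage B task needs nothing beyond its one assigned input file.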
Execution Details¶
- Concurrency: Single task (1 container)
- Typical Duration: 5-15 minutes
- Resources: Low CPU, moderate memory (configuration parsing and object creation)
- Entry Point: `run_builder.sh` (wrapper that clones the experiment repository), calling `main_builder.py`
What Determines N?¶
The number of input files (N) depends on your configuration:
- Calibration Mode: N = number of populations. Each task runs the full calibration (all particles) for one population (e.g., 52 US states = 52 tasks)
- Sampling Mode: N = number of parameter combinations determined by the sampling strategy (grid, LHS, etc.)
- Simple Simulation (basemodel only): N = 1
The builder outputs NUM_TASKS=N as a job label, which Stage B uses to determine parallelism.
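The three rules above can be sketched as a small function. This is an illustration of the logic, not the library's API; the function and argument names are hypothetical:

```python
def num_tasks(populations=None, parameter_combinations=None):
    """Hypothetical sketch of how N is determined by the configuration.

    - Calibration mode: one task per population
    - Sampling mode: one task per parameter combination
    - Simple simulation (basemodel only): a single task
    """
    if populations is not None:
        return len(populations)
    if parameter_combinations is not None:
        return len(parameter_combinations)
    return 1
```

For example, a calibration config listing 52 US state/territory populations yields `num_tasks(populations=states) == 52`, matching the 52 input files described above.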
Stage B: Runner¶
Purpose¶
Executes simulations or calibrations in parallel. This is where the actual computational work happens.
What It Does¶
Input: Single .pkl.gz file assigned to this task (input_{TASK_INDEX}.pkl.gz)
Processing:
- Loads task configuration from assigned input file
- Executes workload using `epymodelingsuite.dispatcher.dispatch_runner()` (simulation, calibration, or projection)
- Serializes result to storage
Output Files:
runner-artifacts/
├── result_00000.pkl.gz
├── result_00001.pkl.gz
├── result_00002.pkl.gz
└── result_{N-1}.pkl.gz
Each result file contains simulation trajectories, fitted parameters, or projection outputs.
Execution Details¶
- Concurrency: N parallel tasks (configurable max parallelism, default: 100)
- Typical Duration: Highly variable
- Simple simulations: 5-30 minutes
- Complex calibrations: 1-4 hours
- Large-scale projections: 30 minutes - 2 hours
- Resources: High CPU, high memory (intensive computation)
- Entry Point: `main_runner.py`
Task Assignment¶
Each container receives a unique task index. In cloud mode, Cloud Batch automatically sets BATCH_TASK_INDEX (0-indexed). In local mode, users set TASK_INDEX manually. The script checks both variables (TASK_INDEX takes precedence) and uses the index to load the corresponding input file.
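A minimal sketch of this precedence rule (the helper name `resolve_task_index` is hypothetical; the environment variable names come from the description above):

```python
import os

def resolve_task_index() -> int:
    """Resolve this container's task index.

    TASK_INDEX (set manually in local mode) takes precedence over
    BATCH_TASK_INDEX (set automatically by Cloud Batch in cloud mode).
    """
    raw = os.environ.get("TASK_INDEX") or os.environ.get("BATCH_TASK_INDEX")
    if raw is None:
        raise RuntimeError("Neither TASK_INDEX nor BATCH_TASK_INDEX is set")
    return int(raw)

# The index selects this task's input file, e.g. input_00007.pkl.gz:
# input_name = f"input_{resolve_task_index():05d}.pkl.gz"
```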
Independence & Parallelism¶
Key Property: Tasks are completely independent - no inter-task communication, can run in any order, and failures in one task don't affect others. This enables massive parallelism: cloud mode can scale to thousands of tasks, while local mode is limited by your machine's cores. Google Cloud Batch automatically schedules tasks across available VMs.
Stage C: Output Generator¶
Purpose¶
Aggregates results from all Stage B tasks and generates formatted outputs (tables, plots, formatted CSVs).
What It Does¶
Input: All result files from Stage B (result_00000.pkl.gz through result_{N-1}.pkl.gz)
Processing:
- Loads all result files
- Auto-detects result type (simulation vs calibration vs projection)
- Loads output configuration (auto-detects `output.yaml` or uses specified config)
- Uses `epymodelingsuite.dispatcher.dispatch_output_generator()` to aggregate trajectories, compute quantiles, format for submission, and generate metadata
- Saves to timestamped output directory
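The quantile-aggregation step can be illustrated with a small NumPy sketch over synthetic trajectories. The array shape and quantile levels here are assumptions for illustration, not the pipeline's actual settings:

```python
import numpy as np

# Synthetic stand-in for trajectories collected from N result files:
# N tasks x T time steps for one compartment.
rng = np.random.default_rng(0)
trajectories = rng.poisson(lam=100, size=(52, 30))

# Pointwise quantiles across tasks: one row per time step,
# one column per quantile level.
levels = [0.025, 0.25, 0.5, 0.75, 0.975]
quantiles = np.quantile(trajectories, levels, axis=0).T  # shape (30, 5)
```

In the real pipeline the trajectories come from the unpickled `result_*.pkl.gz` files, and the aggregated table is written as a compressed CSV like those listed below.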
Output Files (varies by result type):
outputs/YYYYMMDD-HHMMSS/
├── quantiles_compartments.csv.gz # Simulation
├── quantiles_transitions.csv.gz # Simulation
├── quantiles_calibration.csv.gz # Calibration
├── quantiles_projection_compartments.csv.gz # Calibration
├── quantiles_projection_transitions.csv.gz # Calibration
├── trajectories_compartments.csv.gz # If requested
├── trajectories_transitions.csv.gz # If requested
├── posteriors.csv.gz # Calibration only
├── model_metadata.csv.gz # If requested
└── output_hub_formatted.csv.gz # If hub format requested
Execution Details¶
- Concurrency: Single task (1 container)
- Typical Duration: 5-30 minutes (depends on N and output complexity)
- Resources: Moderate CPU, high memory (loading all N result files)
- Entry Point: `run_output.sh` (wrapper for optional repository cloning), calling `main_output.py`
Timestamped Outputs¶
Output directories are timestamped in UTC (YYYYMMDD-HHMMSS/) to prevent overwriting. This allows running Stage C multiple times with different output configurations without losing previous results.
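A minimal sketch of generating a directory name in this UTC `YYYYMMDD-HHMMSS` format:

```python
from datetime import datetime, timezone

# UTC timestamp matching the outputs/YYYYMMDD-HHMMSS/ convention
stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
out_dir = f"outputs/{stamp}/"
```

Since each run gets a fresh timestamp, rerunning Stage C never clobbers an earlier output directory (barring two runs starting within the same second).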
Data flow between stages¶
For details on the experiment data and artifacts saved at each stage, see Data and Artifacts.
Stage Orchestration¶
In cloud mode, Cloud Workflows automates the full Stage A → B → C sequence. In local mode, stages run sequentially via the epycloud CLI. See Execution Modes for the complete comparison.
Next Steps¶
- Data and Artifacts: Experiment data and artifact organization
- Running Experiments: How to run the pipeline (cloud and local)
- Docker Images: How stages are containerized
- Storage Abstraction: How artifacts are stored
- Workflows Orchestration: How stages are coordinated
- Monitoring: How to monitor pipeline execution
- Troubleshooting: Common issues and solutions