# Pipeline Artifacts

## Directory structure
Each stage produces artifacts that the next stage consumes. Artifacts are written to GCS in cloud mode or to the local filesystem in local mode, organized by experiment and run (all timestamps are UTC):
```
{bucket}/{dir_prefix}{exp_id}/{run_id}/
├── builder-artifacts/
│   ├── input_00000.pkl.gz
│   ├── input_00001.pkl.gz
│   └── input_{N-1}.pkl.gz
├── runner-artifacts/
│   ├── result_00000.pkl.gz
│   ├── result_00001.pkl.gz
│   └── result_{N-1}.pkl.gz
└── outputs/
    └── YYYYMMDD-HHMMSS[_meta-id]/
        ├── quantiles_*.csv.gz
        ├── trajectories_*.csv.gz
        ├── posteriors.csv.gz            # Calibration only
        ├── model_metadata.csv.gz
        ├── output_hub_formatted.csv.gz  # If hub format requested
        └── *.png / *.pdf                # Plots, if configured
```
- `{bucket}`: GCS URI (e.g., `gs://my-bucket/`) in cloud mode or a local path (e.g., `./local/bucket/`) in local mode
- `{dir_prefix}`: Base directory prefix (e.g., `pipeline/flu/`)
- `{exp_id}`: Experiment ID; can include `/` for grouping (e.g., `202605/my-experiment`)
- `{run_id}`: Auto-generated as `YYYYMMDD-HHMMSS-<exec-id>` in UTC, where `<exec-id>` is the first 8 characters of the Cloud Workflows execution ID
- `[_meta-id]`: Optional suffix from `output.meta.id` in the output config, appended to the output subdirectory timestamp
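As a concrete illustration of how these components compose (all values below are hypothetical):

```python
# Compose one builder-artifact path from its components (hypothetical values).
bucket = "gs://my-bucket/"           # or "./local/bucket/" in local mode
dir_prefix = "pipeline/flu/"
exp_id = "202605/my-experiment"      # experiment ID with "/" grouping
run_id = "20260115-093000-ab12cd34"  # YYYYMMDD-HHMMSS-<exec-id> (UTC)

builder_artifact = (
    f"{bucket}{dir_prefix}{exp_id}/{run_id}/builder-artifacts/input_00000.pkl.gz"
)
print(builder_artifact)
```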
## Serialization
All inter-stage artifacts (.pkl.gz files) use dill (not pickle) for serialization, compressed with gzip. Dill handles lambda functions, closures, and nested classes more robustly than standard pickle, which is essential for complex modeling objects.
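A minimal round trip showing the dill + gzip combination (illustrative only; the pipeline's actual helpers live in its storage layer):

```python
import gzip

import dill  # third-party; `pip install dill`

# A closure -- something standard pickle typically cannot serialize.
def make_scaler(factor):
    return lambda x: x * factor

payload = gzip.compress(dill.dumps(make_scaler(3)))  # serialize, then compress
restored = dill.loads(gzip.decompress(payload))      # decompress, then deserialize
print(restored(4))  # -> 12
```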
## Storage
The storage.py module provides a unified interface (save_bytes, load_bytes, get_path) that works identically in both cloud and local modes. See Storage Abstraction for details.
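A sketch of what the local-mode half of such an interface might look like (the function names come from the docs above; the bodies here are illustrative, not the real implementation):

```python
from pathlib import Path

LOCAL_ROOT = Path("./local/bucket")  # assumption: local mode maps the bucket to a directory

def get_path(*parts: str) -> Path:
    """Join path components under the configured root."""
    return LOCAL_ROOT.joinpath(*parts)

def save_bytes(data: bytes, *parts: str) -> None:
    """Write raw bytes, creating parent directories as needed."""
    path = get_path(*parts)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)

def load_bytes(*parts: str) -> bytes:
    """Read raw bytes back from the same location."""
    return get_path(*parts).read_bytes()
```

A cloud-mode counterpart would implement the same three functions against GCS, which is what lets the stages run identically in both modes.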
## Builder artifacts
Generated by Stage A (Builder). The builder reads experiment YAML configs, constructs model objects, and saves one input file per parallel task.
- **Files:** `input_00000.pkl.gz` through `input_{N-1}.pkl.gz`
- **Size:** Typically a few KB to a few MB per file
Each file contains a self-contained BuilderOutput that bundles a model with its execution instructions.
### BuilderOutput
| Field | Type | Description |
|---|---|---|
| `primary_id` | `int` | Task identifier (maps to the file index) |
| `model` | `EpiModel` or `None` | Constructed epidemic model (for simulation workflows) |
| `calibrator` | `ABCSampler` or `None` | Calibration sampler containing an `EpiModel`, priors, and observed data (for calibration workflows) |
| `simulation` | `SimulationArguments` or `None` | Simulation settings: start/end dates, number of runs, timestep, initial conditions |
| `calibration` | `CalibrationStrategy` or `None` | Calibration strategy name (SMC, rejection, top_fraction) and options |
| `projection` | `ProjectionArguments` or `None` | Projection settings: end date, number of trajectories, generation number |
| `seed` | `int` or `None` | Random seed for reproducibility |
| `delta_t` | `float` or `None` | Timestep |
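The table above corresponds roughly to a container like the following (a hypothetical sketch; the real class lives in the pipeline codebase, and the model/calibrator types come from epydemix, so `Any` stands in for them here):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class BuilderOutput:
    """Illustrative shape of one builder artifact (fields from the table above)."""
    primary_id: int                    # maps to the input_{N}.pkl.gz file index
    model: Optional[Any] = None        # EpiModel, for simulation workflows
    calibrator: Optional[Any] = None   # ABCSampler, for calibration workflows
    simulation: Optional[Any] = None   # SimulationArguments
    calibration: Optional[Any] = None  # CalibrationStrategy
    projection: Optional[Any] = None   # ProjectionArguments
    seed: Optional[int] = None
    delta_t: Optional[float] = None

task = BuilderOutput(primary_id=0, seed=42, delta_t=1.0)
print(task.primary_id)  # -> 0
```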
## Runner artifacts
Generated by Stage B (Runner). Each parallel task loads one builder artifact, executes the simulation or calibration, and saves the result.
- **Files:** `result_00000.pkl.gz` through `result_{N-1}.pkl.gz`
- **Size:** Typically a few MB to hundreds of MB per file
Each file contains a SimulationOutput or CalibrationOutput depending on the workflow type. Both share common tracking fields (primary_id, population, seed, delta_t) and wrap a results object from epydemix.
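Loading a runner artifact back into memory follows directly from the serialization scheme above (the helper name and the commented dispatch are illustrative, not part of the pipeline's API):

```python
import gzip

import dill  # the pipeline serializes artifacts with dill, not pickle

def load_artifact(path: str):
    """Read one .pkl.gz artifact back into a Python object."""
    with open(path, "rb") as f:
        return dill.loads(gzip.decompress(f.read()))

# Hypothetical usage: branch on the workflow type documented below.
# result = load_artifact("runner-artifacts/result_00000.pkl.gz")
# if hasattr(result, "start_date_reference"):  # CalibrationOutput
#     posteriors = result.results.posterior_distributions
# else:                                        # SimulationOutput
#     trajectories = result.results.trajectories
```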
### SimulationOutput
Produced by simulation workflows.
| Field | Type | Description |
|---|---|---|
| `primary_id` | `int` | Task identifier |
| `population` | `str` | Population name (e.g., "Massachusetts") |
| `seed` | `int` or `None` | Random seed used |
| `delta_t` | `float` or `None` | Timestep |
| `results` | `SimulationResults` | Simulation results (see below) |
SimulationResults from epydemix includes:
| Field | Type | Description |
|---|---|---|
| `trajectories` | `list[Trajectory]` | One per simulation run. Each trajectory holds time series of compartment values (e.g., S, I, R) and transition values (e.g., S_to_I) indexed by date |
| `parameters` | `dict[str, Any]` | Parameters used in the simulation |
### CalibrationOutput
Produced by calibration and calibration+projection workflows.
| Field | Type | Description |
|---|---|---|
| `primary_id` | `int` | Task identifier |
| `population` | `str` | Population name |
| `seed` | `int` or `None` | Random seed used |
| `delta_t` | `float` or `None` | Timestep |
| `start_date_reference` | `date` or `None` | Reference date for converting posterior `start_date` offsets to actual dates |
| `results` | `CalibrationResults` | Calibration results (see below) |
CalibrationResults from epydemix includes:
| Field | Type | Description |
|---|---|---|
| `posterior_distributions` | `dict[int, DataFrame]` | Posterior parameter distributions per SMC generation |
| `selected_trajectories` | `dict[int, list]` | Accepted trajectories per generation |
| `projections` | `dict[str, list]` | Projection trajectories indexed by scenario ID (e.g., "baseline") |
| `projection_parameters` | `dict[str, DataFrame]` | Projection parameters by scenario ID |
| `distances` | `dict[int, list]` | Computed distances per generation |
| `weights` | `dict[int, list]` | Computed weights per generation |
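Because `posterior_distributions` is keyed by SMC generation, the final (most refined) posterior is the entry with the highest generation index. A sketch, with plain dicts standing in for the epydemix DataFrames:

```python
# Assumption: posterior_distributions maps generation number -> table of samples.
# Dummy data below stands in for the real per-generation DataFrames.
posterior_distributions = {
    0: {"beta": [0.30, 0.45, 0.60]},  # early generation: wide spread
    1: {"beta": [0.38, 0.42, 0.44]},  # later generation: tighter posterior
}

final_generation = max(posterior_distributions)  # highest generation index
final_posterior = posterior_distributions[final_generation]
print(final_generation, final_posterior["beta"])
```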
## Output files
Generated by Stage C (Output Generator). The output stage loads all runner artifacts, aggregates results, and produces formatted data files and plots.
Output files are saved to timestamped subdirectories to prevent overwriting when re-running with different output configurations. The subdirectory name is YYYYMMDD-HHMMSS in UTC, optionally suffixed with _meta-id if output.meta.id is set in the output config. Which files are generated depends on the result type and the output configuration (output.yaml). Tabular data is output as gzip-compressed CSV (.csv.gz) or Parquet (.parquet).
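The subdirectory name can be reproduced with the standard library (a sketch; the helper name is hypothetical, and the optional suffix corresponds to `output.meta.id`):

```python
from datetime import datetime, timezone
from typing import Optional

def output_subdir(meta_id: Optional[str] = None) -> str:
    """Build the YYYYMMDD-HHMMSS[_meta-id] subdirectory name in UTC."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}_{meta_id}" if meta_id else stamp

print(output_subdir())         # e.g. 20260115-093000
print(output_subdir("smoke"))  # e.g. 20260115-093000_smoke
```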
### Simulation
| File | Contents |
|---|---|
| `quantiles_compartments.csv.gz` | Quantile statistics (e.g., median, 95% CI) for compartment values over time |
| `quantiles_transitions.csv.gz` | Quantile statistics for transition values over time |
| `trajectories_compartments.csv.gz` | Individual simulation trajectories for compartments |
| `trajectories_transitions.csv.gz` | Individual simulation trajectories for transitions |
| `model_metadata.csv.gz` | Parameters, seeds, dates, and initial conditions for each task |
| `*.png` / `*.pdf` | Plots (quantile plots, trajectory plots, etc.), if configured |
### Calibration
| File | Contents |
|---|---|
| `quantiles_calibration.csv.gz` | Quantiles for the calibration fitting window |
| `quantiles_projection_compartments.csv.gz` | Projection quantiles for compartments (beyond the fitting window) |
| `quantiles_projection_transitions.csv.gz` | Projection quantiles for transitions (beyond the fitting window) |
| `trajectories_projection_compartments.csv.gz` | Individual projection trajectories for compartments |
| `trajectories_projection_transitions.csv.gz` | Individual projection trajectories for transitions |
| `posteriors.csv.gz` | Posterior parameter distributions from calibration |
| `model_metadata.csv.gz` | Calibration metadata for each task |
| `output_hub_formatted.csv.gz` | Forecast hub submission format (FluSight, Metrocast, etc.) |
| `*.png` / `*.pdf` | Plots (quantile plots, posterior plots, grid plots, etc.), if configured |