Pipeline Artifacts

Directory structure

Each stage produces artifacts that the next stage consumes. Artifacts are written to GCS in cloud mode or to the local filesystem in local mode. All timestamps use UTC.

Artifacts are organized by experiment and run:

{bucket}/{dir_prefix}{exp_id}/{run_id}/
├── builder-artifacts/
│   ├── input_00000.pkl.gz
│   ├── input_00001.pkl.gz
│   └── input_{N-1}.pkl.gz
├── runner-artifacts/
│   ├── result_00000.pkl.gz
│   ├── result_00001.pkl.gz
│   └── result_{N-1}.pkl.gz
└── outputs/
    └── YYYYMMDD-HHMMSS[_meta-id]/
        ├── quantiles_*.csv.gz
        ├── trajectories_*.csv.gz
        ├── posteriors.csv.gz            # Calibration only
        ├── model_metadata.csv.gz
        ├── output_hub_formatted.csv.gz  # If hub format requested
        └── *.png / *.pdf               # Plots, if configured

{bucket} : GCS URI (e.g., gs://my-bucket/) in cloud mode or a local path (e.g., ./local/bucket/) in local mode
{dir_prefix} : Base directory prefix (e.g., pipeline/flu/)
{exp_id} : Experiment ID, can include / for grouping (e.g., 202605/my-experiment)
{run_id} : Auto-generated as YYYYMMDD-HHMMSS-<exec-id> in UTC, where <exec-id> is the first 8 characters of the Cloud Workflows execution ID
[_meta-id] : Optional suffix from output.meta.id in the output config, appended to the output subdirectory timestamp
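
The layout above can be sketched as plain string assembly. This is a minimal illustration rather than the pipeline's own code: `make_run_id` and `artifact_path` are hypothetical helper names, and the execution ID is made up.

```python
from datetime import datetime, timezone


def make_run_id(execution_id: str, now: datetime) -> str:
    """Build a run ID of the form YYYYMMDD-HHMMSS-<exec-id> in UTC."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    # <exec-id> is the first 8 characters of the execution ID.
    return f"{stamp}-{execution_id[:8]}"


def artifact_path(bucket: str, dir_prefix: str, exp_id: str, run_id: str,
                  stage_dir: str, stem: str, index: int) -> str:
    """Join the layout components into one inter-stage artifact path."""
    run_root = f"{bucket.rstrip('/')}/{dir_prefix}{exp_id}/{run_id}"
    # Five-digit zero padding matches the file names shown in the tree.
    return f"{run_root}/{stage_dir}/{stem}_{index:05d}.pkl.gz"


run_id = make_run_id(
    "a1b2c3d4-e5f6-7890",  # made-up execution ID
    datetime(2026, 5, 1, 12, 30, 0, tzinfo=timezone.utc),
)
path = artifact_path(
    "gs://my-bucket/", "pipeline/flu/", "202605/my-experiment",
    run_id, "builder-artifacts", "input", 0,
)
print(path)
```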

Serialization

All inter-stage artifacts (.pkl.gz files) are serialized with dill (not pickle) and compressed with gzip. Dill can serialize lambda functions and closures, which standard pickle cannot, and it handles nested classes more robustly. This is essential for complex modeling objects.
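
A round trip through this format might look like the sketch below. The helper names `dumps_gz` and `loads_gz` are hypothetical, and the fallback to `pickle` exists only so the sketch runs where dill is not installed; the pipeline itself uses dill.

```python
import gzip

try:
    import dill as serializer  # serializer named in the text above
except ImportError:
    import pickle as serializer  # fallback only so this sketch runs without dill


def dumps_gz(obj) -> bytes:
    """Serialize an object, then gzip-compress the resulting bytes."""
    return gzip.compress(serializer.dumps(obj))


def loads_gz(data: bytes):
    """Reverse of dumps_gz: decompress, then deserialize."""
    return serializer.loads(gzip.decompress(data))


# A plain dict stands in for a real artifact object.
payload = {"primary_id": 0, "seed": 42, "delta_t": 1.0}
blob = dumps_gz(payload)
restored = loads_gz(blob)
```

The bytes returned by `dumps_gz` are what would be written to a `.pkl.gz` file.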

Storage

The storage.py module provides a unified interface (save_bytes, load_bytes, get_path) that works identically in both cloud and local modes. See Storage Abstraction for details.

Builder artifacts

Generated by Stage A (Builder). The builder reads experiment YAML configs, constructs model objects, and saves one input file per parallel task.

Files: input_00000.pkl.gz through input_{N-1}.pkl.gz
Size: Typically a few KB to a few MB per file

Each file contains a self-contained BuilderOutput that bundles a model with its execution instructions.

BuilderOutput

Field	Type	Description
`primary_id`	`int`	Task identifier (maps to the file index)
`model`	EpiModel or `None`	Constructed epidemic model (for simulation workflows)
`calibrator`	ABCSampler or `None`	Calibration sampler containing an EpiModel, priors, and observed data (for calibration workflows)
`simulation`	`SimulationArguments` or `None`	Simulation settings: start/end dates, number of runs, timestep, initial conditions
`calibration`	`CalibrationStrategy` or `None`	Calibration strategy name (SMC, rejection, top_fraction) and options
`projection`	`ProjectionArguments` or `None`	Projection settings: end date, number of trajectories, generation number
`seed`	`int` or `None`	Random seed for reproducibility
`delta_t`	`float` or `None`	Timestep
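
To make the shape of the table concrete, the fields can be mirrored in a dataclass. This is only a sketch: the real `BuilderOutput` may be defined differently, the `None` defaults are an assumption, the epydemix and pipeline types are replaced by `Any`, and `workflow_kind` is a hypothetical helper that is not part of the documented class.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BuilderOutputSketch:
    """Illustrative stand-in mirroring the fields in the table above."""
    primary_id: int                     # maps to the file index
    model: Optional[Any] = None         # EpiModel, simulation workflows
    calibrator: Optional[Any] = None    # ABCSampler, calibration workflows
    simulation: Optional[Any] = None    # SimulationArguments
    calibration: Optional[Any] = None   # CalibrationStrategy
    projection: Optional[Any] = None    # ProjectionArguments
    seed: Optional[int] = None
    delta_t: Optional[float] = None

    def workflow_kind(self) -> str:
        """Hypothetical helper: infer the workflow from the populated fields."""
        if self.calibrator is not None:
            return "calibration"
        if self.model is not None:
            return "simulation"
        return "unknown"


# object() stands in for a constructed EpiModel.
task = BuilderOutputSketch(primary_id=3, model=object(), seed=7)
print(task.workflow_kind())
```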

Runner artifacts

Generated by Stage B (Runner). Each parallel task loads one builder artifact, executes the simulation or calibration, and saves the result.

Files: result_00000.pkl.gz through result_{N-1}.pkl.gz
Size: Typically a few MB to hundreds of MB per file

Each file contains a SimulationOutput or CalibrationOutput depending on the workflow type. Both share common tracking fields (primary_id, population, seed, delta_t) and wrap a results object from epydemix.

SimulationOutput

Produced by simulation workflows.

Field	Type	Description
`primary_id`	`int`	Task identifier
`population`	`str`	Population name (e.g., "Massachusetts")
`seed`	`int` or `None`	Random seed used
`delta_t`	`float` or `None`	Timestep
`results`	SimulationResults	Simulation results (see below)

SimulationResults from epydemix includes:

Field	Type	Description
`trajectories`	`list[Trajectory]`	One per simulation run. Each trajectory holds time series of compartment values (e.g., S, I, R) and transition values (e.g., `S_to_I`) indexed by date
`parameters`	`dict[str, Any]`	Parameters used in the simulation

CalibrationOutput

Produced by calibration and calibration+projection workflows.

Field	Type	Description
`primary_id`	`int`	Task identifier
`population`	`str`	Population name
`seed`	`int` or `None`	Random seed used
`delta_t`	`float` or `None`	Timestep
`start_date_reference`	`date` or `None`	Reference date for converting posterior start_date offsets to actual dates
`results`	CalibrationResults	Calibration results (see below)
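
The role of `start_date_reference` can be illustrated with a small conversion. The sketch assumes that posterior `start_date` offsets are whole days counted from the reference date; the source does not state the unit, and `offset_to_date` is a hypothetical helper.

```python
from datetime import date, timedelta


def offset_to_date(start_date_reference: date, offset_days: int) -> date:
    """Convert an offset (assumed to be in days) into a calendar date."""
    return start_date_reference + timedelta(days=offset_days)


reference = date(2025, 10, 1)   # made-up reference date
sampled_offsets = [0, 7, 14]    # made-up posterior samples
start_dates = [offset_to_date(reference, offset) for offset in sampled_offsets]
```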

CalibrationResults from epydemix includes:

Field	Type	Description
`posterior_distributions`	`dict[int, DataFrame]`	Posterior parameter distributions per SMC generation
`selected_trajectories`	`dict[int, list]`	Accepted trajectories per generation
`projections`	`dict[str, list]`	Projection trajectories indexed by scenario ID (e.g., "baseline")
`projection_parameters`	`dict[str, DataFrame]`	Projection parameters by scenario ID
`distances`	`dict[int, list]`	Computed distances per generation
`weights`	`dict[int, list]`	Computed weights per generation
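
Several of these fields are dictionaries keyed by generation number. A consumer that only needs the last generation could pick it out as sketched below, assuming the highest key is the final generation. `final_generation` is a hypothetical helper and the weights are made-up values.

```python
from typing import Any, Dict, Tuple


def final_generation(per_generation: Dict[int, Any]) -> Tuple[int, Any]:
    """Return (generation, value) for the highest generation key."""
    last = max(per_generation)
    return last, per_generation[last]


# Made-up weights shaped like the `weights` field: dict[int, list].
weights = {
    0: [0.25, 0.25, 0.25, 0.25],
    1: [0.1, 0.2, 0.3, 0.4],
    2: [0.4, 0.3, 0.2, 0.1],
}
generation, final_weights = final_generation(weights)
```

The same helper works for `posterior_distributions`, `selected_trajectories`, and `distances`, since they share the generation-keyed layout.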

Output files

Generated by Stage C (Output Generator). The output stage loads all runner artifacts, aggregates results, and produces formatted data files and plots.

Output files are saved to timestamped subdirectories so that re-running with a different output configuration does not overwrite earlier results. The subdirectory name is YYYYMMDD-HHMMSS in UTC, optionally suffixed with _meta-id if output.meta.id is set in the output config.

Which files are generated depends on the result type and the output configuration (output.yaml). Tabular data is written as gzip-compressed CSV (.csv.gz) or Parquet (.parquet).
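
The subdirectory naming rule can be expressed in a few lines. This is a sketch of the rule as described, not the Output Generator's own code; `output_subdir` is a hypothetical name and `"weekly"` is a made-up `output.meta.id` value.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def output_subdir(now: datetime, meta_id: Optional[str] = None) -> str:
    """Return YYYYMMDD-HHMMSS in UTC, with _<meta_id> appended when set."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}_{meta_id}" if meta_id else stamp


# 14:30 at UTC+2 is 12:30 UTC, so the name uses the UTC wall-clock time.
local_time = datetime(2026, 5, 1, 14, 30, 0, tzinfo=timezone(timedelta(hours=2)))
plain_name = output_subdir(local_time)
tagged_name = output_subdir(local_time, meta_id="weekly")
```

Because the timestamp changes on every run, two runs of the output stage against the same runner artifacts land in different subdirectories.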

Simulation

File	Contents
`quantiles_compartments.csv.gz`	Quantile statistics (e.g., median, 95% CI) for compartment values over time
`quantiles_transitions.csv.gz`	Quantile statistics for transition values over time
`trajectories_compartments.csv.gz`	Individual simulation trajectories for compartments
`trajectories_transitions.csv.gz`	Individual simulation trajectories for transitions
`model_metadata.csv.gz`	Parameters, seeds, dates, initial conditions for each task
`*.png` / `*.pdf`	Plots (quantile plots, trajectory plots, etc.) if configured

Calibration

File	Contents
`quantiles_calibration.csv.gz`	Quantiles for the calibration fitting window
`quantiles_projection_compartments.csv.gz`	Projection quantiles for compartments (beyond fitting window)
`quantiles_projection_transitions.csv.gz`	Projection quantiles for transitions (beyond fitting window)
`trajectories_projection_compartments.csv.gz`	Individual projection trajectories for compartments
`trajectories_projection_transitions.csv.gz`	Individual projection trajectories for transitions
`posteriors.csv.gz`	Posterior parameter distributions from calibration
`model_metadata.csv.gz`	Calibration metadata for each task
`output_hub_formatted.csv.gz`	Forecast hub submission format (FluSight, Metrocast, etc.)
`*.png` / `*.pdf`	Plots (quantile plots, posterior plots, grid plots, etc.) if configured
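
The `.csv.gz` files can be read with the Python standard library alone. The sketch below writes a tiny sample file and reads it back; the column names (`date`, `compartment`, `quantile`, `value`) are invented for illustration and are not the pipeline's documented schema, and `read_csv_gz` is a hypothetical helper.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_csv_gz(path) -> list:
    """Read a gzip-compressed CSV file into a list of row dictionaries."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "quantiles_compartments.csv.gz"
    # Write a one-row sample so the sketch is self-contained.
    with gzip.open(sample, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "compartment", "quantile", "value"])  # made-up columns
        writer.writerow(["2026-05-01", "I", "0.5", "120.0"])
    rows = read_csv_gz(sample)
```

`csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion after loading.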