Pipeline Artifacts

Directory structure

Each stage produces artifacts that the next stage consumes. Artifacts are written to GCS in cloud mode or to the local filesystem in local mode. All timestamps use UTC. Artifacts are organized by experiment and run:

{bucket}/{dir_prefix}{exp_id}/{run_id}/
├── builder-artifacts/
│   ├── input_00000.pkl.gz
│   ├── input_00001.pkl.gz
│   └── input_{N-1}.pkl.gz
├── runner-artifacts/
│   ├── result_00000.pkl.gz
│   ├── result_00001.pkl.gz
│   └── result_{N-1}.pkl.gz
└── outputs/
    └── YYYYMMDD-HHMMSS[_meta-id]/
        ├── quantiles_*.csv.gz
        ├── trajectories_*.csv.gz
        ├── posteriors.csv.gz            # Calibration only
        ├── model_metadata.csv.gz
        ├── output_hub_formatted.csv.gz  # If hub format requested
        └── *.png / *.pdf               # Plots, if configured
  • {bucket} : GCS URI (e.g., gs://my-bucket/) in cloud mode or a local path (e.g., ./local/bucket/) in local mode
  • {dir_prefix} : Base directory prefix (e.g., pipeline/flu/)
  • {exp_id} : Experiment ID, can include / for grouping (e.g., 202605/my-experiment)
  • {run_id} : Auto-generated as YYYYMMDD-HHMMSS-<exec-id> in UTC, where <exec-id> is the first 8 characters of the Cloud Workflows execution ID
  • [_meta-id] : Optional suffix from output.meta.id in the output config, appended to the output subdirectory timestamp
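
As a rough illustration, the path components above could be assembled as follows. The helper names make_run_id and run_prefix are hypothetical and not part of the pipeline's code; only the directory layout and the YYYYMMDD-HHMMSS-<exec-id> format come from the description above.

```python
from datetime import datetime, timezone


def make_run_id(execution_id: str, now: datetime) -> str:
    """Format a run ID as YYYYMMDD-HHMMSS-<exec-id> in UTC (hypothetical helper)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    # Only the first 8 characters of the execution ID are kept.
    return f"{stamp}-{execution_id[:8]}"


def run_prefix(bucket: str, dir_prefix: str, exp_id: str, run_id: str) -> str:
    """Join the parts as {bucket}/{dir_prefix}{exp_id}/{run_id}/ (hypothetical helper)."""
    return f"{bucket.rstrip('/')}/{dir_prefix}{exp_id}/{run_id}/"


# The execution ID below is a made-up example value.
run_id = make_run_id(
    "a1b2c3d4-0000-1111-2222-333344445555",
    datetime(2026, 5, 1, 12, 30, 0, tzinfo=timezone.utc),
)
prefix = run_prefix("gs://my-bucket/", "pipeline/flu/", "202605/my-experiment", run_id)
print(prefix)
# gs://my-bucket/pipeline/flu/202605/my-experiment/20260501-123000-a1b2c3d4/
```

The same helpers work with a local bucket path such as ./local/bucket/, since only the trailing slash of the bucket is normalized.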

Serialization

All inter-stage artifacts (.pkl.gz files) are serialized with dill rather than the standard pickle module, then compressed with gzip. dill handles lambda functions, closures, and nested classes more robustly than standard pickle, which is essential for complex modeling objects.
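
A minimal sketch of the dill-plus-gzip round trip is shown below. The function names dump_artifact and load_artifact are hypothetical, and the fallback to pickle exists only so the sketch runs where dill is not installed; the pipeline itself is described as using dill.

```python
import gzip

try:
    import dill as serializer  # the serializer this page describes
except ImportError:
    # Assumption for this sketch only: fall back to pickle if dill is missing.
    import pickle as serializer


def dump_artifact(obj) -> bytes:
    """Serialize an object, then gzip-compress the resulting bytes."""
    return gzip.compress(serializer.dumps(obj))


def load_artifact(data: bytes):
    """Reverse dump_artifact: decompress, then deserialize."""
    return serializer.loads(gzip.decompress(data))


# Stand-in payload; real artifacts hold model objects, not plain dicts.
payload = {"primary_id": 0, "seed": 42, "delta_t": 1.0}
restored = load_artifact(dump_artifact(payload))
```

With dill installed, the same round trip also accepts objects such as lambdas (for example dump_artifact(lambda x: x + 1)), which standard pickle rejects.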

Storage

The storage.py module provides a unified interface (save_bytes, load_bytes, get_path) that works identically in both cloud and local modes. See Storage Abstraction for details.
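
To make the shape of such an interface concrete, here is a local-mode-only stand-in. The signatures are assumptions, since this page does not show the real ones in storage.py; a cloud-mode implementation would additionally have to handle gs:// paths.

```python
import tempfile
from pathlib import Path


def get_path(base: str, *parts: str) -> str:
    """Join a base location and path parts with '/' (assumed signature)."""
    return "/".join([base.rstrip("/"), *(p.strip("/") for p in parts)])


def save_bytes(path: str, data: bytes) -> None:
    """Write bytes to a local file, creating parent directories (assumed signature)."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)


def load_bytes(path: str) -> bytes:
    """Read a local file back as bytes (assumed signature)."""
    return Path(path).read_bytes()


# Round trip inside a temporary directory standing in for the local bucket.
with tempfile.TemporaryDirectory() as root:
    path = get_path(root, "builder-artifacts", "input_00000.pkl.gz")
    save_bytes(path, b"example")
    round_trip = load_bytes(path)
```

Keeping callers on these three functions is what lets the stages stay unaware of whether they run in cloud or local mode.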

Builder artifacts

Generated by Stage A (Builder). The builder reads experiment YAML configs, constructs model objects, and saves one input file per parallel task.

  • Files: input_00000.pkl.gz through input_{N-1}.pkl.gz
  • Size: Typically a few KB to a few MB per file
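
The file names can be derived from the task index. The sketch below assumes five-digit zero padding, as the listed names suggest; artifact_name is a hypothetical helper, not a function from the pipeline.

```python
def artifact_name(kind: str, index: int) -> str:
    """Return e.g. 'input_00000.pkl.gz' for kind='input', index=0 (hypothetical helper)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    # Assumption: indices are zero-padded to five digits.
    return f"{kind}_{index:05d}.pkl.gz"


names = [artifact_name("input", i) for i in range(3)]
print(names)
# ['input_00000.pkl.gz', 'input_00001.pkl.gz', 'input_00002.pkl.gz']
```

The same helper yields runner file names with kind="result".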

Each file contains a self-contained BuilderOutput that bundles a model with its execution instructions.

BuilderOutput

| Field | Type | Description |
| --- | --- | --- |
| primary_id | int | Task identifier (maps to the file index) |
| model | EpiModel or None | Constructed epidemic model (for simulation workflows) |
| calibrator | ABCSampler or None | Calibration sampler containing an EpiModel, priors, and observed data (for calibration workflows) |
| simulation | SimulationArguments or None | Simulation settings: start/end dates, number of runs, timestep, initial conditions |
| calibration | CalibrationStrategy or None | Calibration strategy name (SMC, rejection, top_fraction) and options |
| projection | ProjectionArguments or None | Projection settings: end date, number of trajectories, generation number |
| seed | int or None | Random seed for reproducibility |
| delta_t | float or None | Timestep |
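
The fields above map naturally onto a record type. The sketch below uses a plain dataclass with Any as a placeholder for the epydemix and pipeline types; whether the real BuilderOutput is a dataclass, and which fields have defaults, are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BuilderOutput:
    """Illustrative stand-in; field names follow the table, types are placeholders."""

    primary_id: int
    model: Optional[Any] = None        # EpiModel (simulation workflows)
    calibrator: Optional[Any] = None   # ABCSampler (calibration workflows)
    simulation: Optional[Any] = None   # SimulationArguments
    calibration: Optional[Any] = None  # CalibrationStrategy
    projection: Optional[Any] = None   # ProjectionArguments
    seed: Optional[int] = None
    delta_t: Optional[float] = None


# A task record with only the tracking fields filled in.
task = BuilderOutput(primary_id=3, seed=42, delta_t=1.0)
```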

Runner artifacts

Generated by Stage B (Runner). Each parallel task loads one builder artifact, executes the simulation or calibration, and saves the result.

  • Files: result_00000.pkl.gz through result_{N-1}.pkl.gz
  • Size: Typically a few MB to hundreds of MB per file

Each file contains a SimulationOutput or CalibrationOutput depending on the workflow type. Both share common tracking fields (primary_id, population, seed, delta_t) and wrap a results object from epydemix.

SimulationOutput

Produced by simulation workflows.

| Field | Type | Description |
| --- | --- | --- |
| primary_id | int | Task identifier |
| population | str | Population name (e.g., "Massachusetts") |
| seed | int or None | Random seed used |
| delta_t | float or None | Timestep |
| results | SimulationResults | Simulation results (see below) |

SimulationResults from epydemix includes:

| Field | Type | Description |
| --- | --- | --- |
| trajectories | list[Trajectory] | One per simulation run. Each trajectory holds time series of compartment values (e.g., S, I, R) and transition values (e.g., S_to_I) indexed by date |
| parameters | dict[str, Any] | Parameters used in the simulation |

CalibrationOutput

Produced by calibration and calibration+projection workflows.

| Field | Type | Description |
| --- | --- | --- |
| primary_id | int | Task identifier |
| population | str | Population name |
| seed | int or None | Random seed used |
| delta_t | float or None | Timestep |
| start_date_reference | date or None | Reference date for converting posterior start_date offsets to actual dates |
| results | CalibrationResults | Calibration results (see below) |

CalibrationResults from epydemix includes:

| Field | Type | Description |
| --- | --- | --- |
| posterior_distributions | dict[int, DataFrame] | Posterior parameter distributions per SMC generation |
| selected_trajectories | dict[int, list] | Accepted trajectories per generation |
| projections | dict[str, list] | Projection trajectories indexed by scenario ID (e.g., "baseline") |
| projection_parameters | dict[str, DataFrame] | Projection parameters by scenario ID |
| distances | dict[int, list] | Computed distances per generation |
| weights | dict[int, list] | Computed weights per generation |
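
Several of these fields are dictionaries keyed by generation number. The sketch below picks out the last generation, assuming that keys increase with each SMC generation so that the largest key is the final one; that ordering is an assumption about epydemix not confirmed on this page, and final_generation is a hypothetical helper.

```python
def final_generation(per_generation: dict):
    """Return (generation, value) for the largest generation key."""
    if not per_generation:
        raise ValueError("no generations recorded")
    last = max(per_generation)
    return last, per_generation[last]


# Stand-in data: plain lists keyed by generation number.
weights = {0: [0.5, 0.5], 1: [0.2, 0.8], 2: [0.1, 0.9]}
generation, final_weights = final_generation(weights)
print(generation, final_weights)
# 2 [0.1, 0.9]
```

The same lookup applies to posterior_distributions, selected_trajectories, and distances, since they share the generation keys.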

Output files

Generated by Stage C (Output Generator). The output stage loads all runner artifacts, aggregates results, and produces formatted data files and plots.

Output files are saved to timestamped subdirectories to prevent overwriting when re-running with different output configurations. The subdirectory name is YYYYMMDD-HHMMSS in UTC, optionally suffixed with _meta-id if output.meta.id is set in the output config. Which files are generated depends on the result type and the output configuration (output.yaml). Tabular data is output as gzip-compressed CSV (.csv.gz) or Parquet (.parquet).
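
The subdirectory naming rule can be sketched as follows. output_subdir is a hypothetical helper; the meta_id argument stands in for the value of output.meta.id.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def output_subdir(now: datetime, meta_id: Optional[str] = None) -> str:
    """Build YYYYMMDD-HHMMSS in UTC, with an optional _<meta-id> suffix."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}_{meta_id}" if meta_id else stamp


# 08:30 at UTC-4 is 12:30 UTC, so the directory name uses 123000.
local_time = datetime(2026, 5, 1, 8, 30, 0, tzinfo=timezone(timedelta(hours=-4)))
plain = output_subdir(local_time)
tagged = output_subdir(local_time, "v2")
print(plain, tagged)
# 20260501-123000 20260501-123000_v2
```

Because the name embeds the time of the output run, regenerating outputs with a different configuration lands in a new subdirectory instead of overwriting the earlier one.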

Simulation

| File | Contents |
| --- | --- |
| quantiles_compartments.csv.gz | Quantile statistics (e.g., median, 95% CI) for compartment values over time |
| quantiles_transitions.csv.gz | Quantile statistics for transition values over time |
| trajectories_compartments.csv.gz | Individual simulation trajectories for compartments |
| trajectories_transitions.csv.gz | Individual simulation trajectories for transitions |
| model_metadata.csv.gz | Parameters, seeds, dates, initial conditions for each task |
| *.png / *.pdf | Plots (quantile plots, trajectory plots, etc.) if configured |

Calibration

| File | Contents |
| --- | --- |
| quantiles_calibration.csv.gz | Quantiles for the calibration fitting window |
| quantiles_projection_compartments.csv.gz | Projection quantiles for compartments (beyond fitting window) |
| quantiles_projection_transitions.csv.gz | Projection quantiles for transitions (beyond fitting window) |
| trajectories_projection_compartments.csv.gz | Individual projection trajectories for compartments |
| trajectories_projection_transitions.csv.gz | Individual projection trajectories for transitions |
| posteriors.csv.gz | Posterior parameter distributions from calibration |
| model_metadata.csv.gz | Calibration metadata for each task |
| output_hub_formatted.csv.gz | Forecast hub submission format (FluSight, Metrocast, etc.) |
| *.png / *.pdf | Plots (quantile plots, posterior plots, grid plots, etc.) if configured |
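
The .csv.gz outputs can be read with the standard library alone. The sketch below writes a tiny stand-in file and reads it back; the column names (date, quantile, value) are made up for illustration and are not the pipeline's actual schema, and read_csv_gz is a hypothetical helper.

```python
import csv
import gzip
import os
import tempfile


def read_csv_gz(path: str) -> list:
    """Read a gzip-compressed CSV into a list of row dicts keyed by header."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Round trip with a small stand-in file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "quantiles_example.csv.gz")
    with gzip.open(sample, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "quantile", "value"])
        writer.writerow(["2026-05-01", "0.5", "12.0"])
    rows = read_csv_gz(sample)
```

All values come back as strings with this approach. If pandas is available, pandas.read_csv infers gzip compression from the .gz extension and parses numeric columns.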