# Pipeline Artifacts

## Directory structure

Each stage produces artifacts that the next stage consumes. Artifacts are written to GCS in cloud mode or to the local filesystem in local mode. All timestamps use UTC. Paths are organized by experiment and run:

```
{bucket}/{dir_prefix}{exp_id}/{run_id}/
├── builder-artifacts/
│   ├── input_00000.pkl.gz
│   ├── input_00001.pkl.gz
│   └── input_{N-1}.pkl.gz
├── runner-artifacts/
│   ├── result_00000.pkl.gz
│   ├── result_00001.pkl.gz
│   └── result_{N-1}.pkl.gz
└── outputs/
    └── YYYYMMDD-HHMMSS[_meta-id]/
        ├── quantiles_*.csv.gz
        ├── trajectories_*.csv.gz
        ├── posteriors.csv.gz            # Calibration only
        ├── model_metadata.csv.gz
        ├── output_hub_formatted.csv.gz  # If hub format requested
        └── *.png / *.pdf               # Plots, if configured
```

- `{bucket}` : GCS URI (e.g., `gs://my-bucket/`) in cloud mode or a local path (e.g., `./local/bucket/`) in local mode
- `{dir_prefix}` : Base directory prefix (e.g., `pipeline/flu/`)
- `{exp_id}` : Experiment ID, can include `/` for grouping (e.g., `202605/my-experiment`)
- `{run_id}` : Auto-generated as `YYYYMMDD-HHMMSS-<exec-id>` in UTC, where `<exec-id>` is the first 8 characters of the Cloud Workflows execution ID
- `[_meta-id]` : Optional suffix from `output.meta.id` in the output config, appended to the output subdirectory timestamp
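
The naming rules above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code: the helper names (`make_run_id`, `artifact_name`, `run_prefix`, `output_subdir`) are hypothetical, and the five-digit zero padding is inferred from the example file names in the tree.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(execution_id: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: YYYYMMDD-HHMMSS-<first 8 chars of execution ID>, in UTC."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}-{execution_id[:8]}"


def artifact_name(kind: str, index: int) -> str:
    """Hypothetical helper: 'input' or 'result' plus a zero-padded task index.

    Five-digit padding is an assumption based on names like input_00000.pkl.gz.
    """
    return f"{kind}_{index:05d}.pkl.gz"


def run_prefix(bucket: str, dir_prefix: str, exp_id: str, run_id: str) -> str:
    """Join {bucket}/{dir_prefix}{exp_id}/{run_id}/ without doubling slashes."""
    parts = [bucket.rstrip("/"), dir_prefix.strip("/"), exp_id.strip("/"), run_id]
    return "/".join(p for p in parts if p) + "/"


def output_subdir(now: datetime, meta_id: Optional[str] = None) -> str:
    """Hypothetical helper: YYYYMMDD-HHMMSS with an optional _<meta-id> suffix."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}_{meta_id}" if meta_id else stamp
```

For example, `run_prefix("gs://my-bucket/", "pipeline/flu/", "202605/my-experiment", run_id)` followed by `"builder-artifacts/" + artifact_name("input", 0)` yields the location of the first builder artifact for that run.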

## Serialization

All inter-stage artifacts (`.pkl.gz` files) are serialized with **dill** rather than the standard `pickle` module, then compressed with gzip. Dill handles lambda functions, closures, and nested classes more robustly than standard pickle, and the pipeline's complex modeling objects depend on that support.
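
A minimal sketch of the round trip, assuming the artifact is simply a dill payload wrapped in gzip as described above. The function names `dump_artifact` and `load_artifact` are hypothetical, and the `pickle` fallback exists only so the snippet runs where dill is not installed; the fallback cannot serialize lambdas or closures.

```python
import gzip

try:
    import dill as serializer  # the serializer named in this section
except ImportError:
    # Illustration-only fallback: standard pickle shares dumps/loads,
    # but it will fail on lambdas and closures that dill can handle.
    import pickle as serializer


def dump_artifact(obj) -> bytes:
    """Serialize an object and gzip-compress the resulting bytes."""
    return gzip.compress(serializer.dumps(obj))


def load_artifact(data: bytes):
    """Decompress gzip bytes and deserialize the object inside."""
    return serializer.loads(gzip.decompress(data))


# Round-trip a plain dictionary standing in for a real pipeline object.
payload = {"primary_id": 3, "seed": 42}
restored = load_artifact(dump_artifact(payload))
```

As with any pickle-family format, loading an artifact executes code embedded in it, so only load `.pkl.gz` files from runs you trust.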

## Storage

The `storage.py` module provides a unified interface (`save_bytes`, `load_bytes`, `get_path`) that works identically in both cloud and local modes. See **[Storage Abstraction](https://mobs-lab.github.io/epymodelingsuite-cloud/architecture/storage-abstraction.md)** for details.


## Builder artifacts

Generated by Stage A (Builder). The builder reads experiment YAML configs, constructs model objects, and saves one input file per parallel task.

- **Files**: `input_00000.pkl.gz` through `input_{N-1}.pkl.gz`
- **Size**: Typically a few KB to a few MB per file

Each file contains a self-contained **BuilderOutput** that bundles a model with its execution instructions.

### BuilderOutput

| Field | Type | Description |
|-------|------|-------------|
| `primary_id` | `int` | Task identifier (maps to the file index) |
| `model` | EpiModel or `None` | Constructed epidemic model (for simulation workflows) |
| `calibrator` | ABCSampler or `None` | Calibration sampler containing an EpiModel, priors, and observed data (for calibration workflows) |
| `simulation` | `SimulationArguments` or `None` | Simulation settings: start/end dates, number of runs, timestep, initial conditions |
| `calibration` | `CalibrationStrategy` or `None` | Calibration strategy name (SMC, rejection, top_fraction) and options |
| `projection` | `ProjectionArguments` or `None` | Projection settings: end date, number of trajectories, generation number |
| `seed` | `int` or `None` | Random seed for reproducibility |
| `delta_t` | `float` or `None` | Timestep |
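
To make the shape of the table concrete, here is a stand-in dataclass with the same field names. It is not the real `BuilderOutput` class, whose base class and validation are not described here. The `workflow_kind` helper is hypothetical, and it assumes that a task carries a `calibrator` for calibration workflows and a `model` for simulation workflows, as the field descriptions suggest.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BuilderOutputSketch:
    """Stand-in mirroring the documented fields; not the pipeline's class."""

    primary_id: int
    model: Optional[Any] = None        # EpiModel in simulation workflows
    calibrator: Optional[Any] = None   # ABCSampler in calibration workflows
    simulation: Optional[Any] = None   # SimulationArguments
    calibration: Optional[Any] = None  # CalibrationStrategy
    projection: Optional[Any] = None   # ProjectionArguments
    seed: Optional[int] = None
    delta_t: Optional[float] = None


def workflow_kind(task: BuilderOutputSketch) -> str:
    """Hypothetical helper: infer the workflow from which object is present."""
    if task.calibrator is not None:
        return "calibration"
    if task.model is not None:
        return "simulation"
    raise ValueError(f"task {task.primary_id} has neither a model nor a calibrator")


# A placeholder object stands in for a constructed EpiModel.
example = BuilderOutputSketch(primary_id=0, model=object(), seed=42)
kind = workflow_kind(example)
```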

## Runner artifacts

Generated by Stage B (Runner). Each parallel task loads one builder artifact, executes the simulation or calibration, and saves the result.

- **Files**: `result_00000.pkl.gz` through `result_{N-1}.pkl.gz`
- **Size**: Typically a few MB to hundreds of MB per file

Each file contains a **SimulationOutput** or **CalibrationOutput** depending on the workflow type. Both share common tracking fields (`primary_id`, `population`, `seed`, `delta_t`) and wrap a results object from epydemix.

### SimulationOutput

Produced by simulation workflows.

| Field | Type | Description |
|-------|------|-------------|
| `primary_id` | `int` | Task identifier |
| `population` | `str` | Population name (e.g., "Massachusetts") |
| `seed` | `int` or `None` | Random seed used |
| `delta_t` | `float` or `None` | Timestep |
| `results` | SimulationResults | Simulation results (see below) |

SimulationResults from epydemix includes:

| Field | Type | Description |
|-------|------|-------------|
| `trajectories` | `list[Trajectory]` | One per simulation run. Each trajectory holds time series of compartment values (e.g., S, I, R) and transition values (e.g., `S_to_I`) indexed by date |
| `parameters` | `dict[str, Any]` | Parameters used in the simulation |


### CalibrationOutput

Produced by calibration and calibration+projection workflows.

| Field | Type | Description |
|-------|------|-------------|
| `primary_id` | `int` | Task identifier |
| `population` | `str` | Population name |
| `seed` | `int` or `None` | Random seed used |
| `delta_t` | `float` or `None` | Timestep |
| `start_date_reference` | `date` or `None` | Reference date for converting posterior start_date offsets to actual dates |
| `results` | CalibrationResults | Calibration results (see below) |

CalibrationResults from epydemix includes:

| Field | Type | Description |
|-------|------|-------------|
| `posterior_distributions` | `dict[int, DataFrame]` | Posterior parameter distributions per SMC generation |
| `selected_trajectories` | `dict[int, list]` | Accepted trajectories per generation |
| `projections` | `dict[str, list]` | Projection trajectories indexed by scenario ID (e.g., "baseline") |
| `projection_parameters` | `dict[str, DataFrame]` | Projection parameters by scenario ID |
| `distances` | `dict[int, list]` | Computed distances per generation |
| `weights` | `dict[int, list]` | Computed weights per generation |
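
Several of these fields are dictionaries keyed by generation. The sketch below picks the entry for the last generation, assuming that the highest integer key corresponds to the final generation. The helper name `final_generation` is hypothetical, and plain lists stand in for the DataFrames held by the real results object.

```python
from typing import Any, Mapping, Tuple


def final_generation(per_generation: Mapping[int, Any]) -> Tuple[int, Any]:
    """Return (generation, value) for the highest generation key.

    Assumes generation keys increase over the course of the calibration.
    """
    if not per_generation:
        raise ValueError("no generations recorded")
    last = max(per_generation)
    return last, per_generation[last]


# Plain lists stand in for posterior DataFrames in this illustration.
posterior_distributions = {
    0: [{"beta": 0.31}, {"beta": 0.42}],
    1: [{"beta": 0.35}, {"beta": 0.37}],
}
generation, posterior = final_generation(posterior_distributions)
```

The same helper applies to `selected_trajectories`, `distances`, and `weights`, since they share the per-generation layout.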


## Output files

Generated by Stage C (Output Generator). The output stage loads all runner artifacts, aggregates results, and produces formatted data files and plots.

Output files are saved to timestamped subdirectories so that re-running with a different output configuration does not overwrite earlier results. The subdirectory name is `YYYYMMDD-HHMMSS` in UTC, optionally followed by `_` and the value of `output.meta.id` if it is set in the output config. Which files are generated depends on the result type and the output configuration (`output.yaml`). Tabular data is written as gzip-compressed CSV (`.csv.gz`) or Parquet (`.parquet`).

### Simulation

| File | Contents |
|------|----------|
| `quantiles_compartments.csv.gz` | Quantile statistics (e.g., median, 95% CI) for compartment values over time |
| `quantiles_transitions.csv.gz` | Quantile statistics for transition values over time |
| `trajectories_compartments.csv.gz` | Individual simulation trajectories for compartments |
| `trajectories_transitions.csv.gz` | Individual simulation trajectories for transitions |
| `model_metadata.csv.gz` | Parameters, seeds, dates, initial conditions for each task |
| `*.png` / `*.pdf` | Plots (quantile plots, trajectory plots, etc.) if configured |

### Calibration

| File | Contents |
|------|----------|
| `quantiles_calibration.csv.gz` | Quantiles for the calibration fitting window |
| `quantiles_projection_compartments.csv.gz` | Projection quantiles for compartments (beyond fitting window) |
| `quantiles_projection_transitions.csv.gz` | Projection quantiles for transitions (beyond fitting window) |
| `trajectories_projection_compartments.csv.gz` | Individual projection trajectories for compartments |
| `trajectories_projection_transitions.csv.gz` | Individual projection trajectories for transitions |
| `posteriors.csv.gz` | Posterior parameter distributions from calibration |
| `model_metadata.csv.gz` | Calibration metadata for each task |
| `output_hub_formatted.csv.gz` | Forecast hub submission format (FluSight, Metrocast, etc.) |
| `*.png` / `*.pdf` | Plots (quantile plots, posterior plots, grid plots, etc.) if configured |
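
The `.csv.gz` outputs can be read with the Python standard library alone. The sketch below writes and reads a small sample file. The column names (`date`, `quantile`, `value`) are made up for the example and are not the pipeline's actual schema, and both helper functions are hypothetical.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_csv_gz(path, rows, fieldnames):
    """Write dictionaries as a gzip-compressed CSV file with a header row."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv_gz(path):
    """Read a gzip-compressed CSV file into a list of dictionaries."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Round-trip a tiny sample; the columns here are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "quantiles_sample.csv.gz"
    write_csv_gz(
        sample_path,
        [{"date": "2026-05-01", "quantile": "0.5", "value": "120"}],
        fieldnames=["date", "quantile", "value"],
    )
    sample_rows = read_csv_gz(sample_path)
```

The `csv` module returns every value as a string, so numeric columns need explicit conversion after reading.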

