Storage Abstraction

The pipeline uses a unified storage interface that works identically in both cloud and local execution modes. This abstraction is implemented in docker/scripts/util/storage.py.

Why Storage Abstraction?

Problem: Pipeline scripts need to save and load artifacts, but the storage backend differs by mode:

Cloud mode: Google Cloud Storage (GCS) with gs:// URIs
Local mode: Local filesystem with ./local/bucket/ paths

Solution: A unified API that automatically uses the correct backend based on the EXECUTION_MODE environment variable.

Benefit: Pipeline scripts (main_builder.py, main_runner.py, main_output.py) contain zero mode-specific code.
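The mode switch described above can be sketched as follows. This is a minimal illustration, not the contents of storage.py: only EXECUTION_MODE comes from this page, while the GCS_BUCKET variable name, the fallback bucket name, and the default of local mode are assumptions made for the example.

```python
import os


def get_config() -> dict:
    """Resolve the storage root from the environment.

    EXECUTION_MODE is documented; GCS_BUCKET and the "local" default
    are hypothetical choices for this sketch.
    """
    mode = os.environ.get("EXECUTION_MODE", "local")
    if mode == "cloud":
        bucket = "gs://" + os.environ.get("GCS_BUCKET", "my-bucket")
    elif mode == "local":
        bucket = "./local/bucket"
    else:
        raise ValueError(f"Unsupported EXECUTION_MODE: {mode!r}")
    return {"mode": mode, "bucket": bucket}


def get_path(*parts: str) -> str:
    """Join path components under the storage root for the current mode."""
    root = get_config()["bucket"]
    # Drop empty parts and stray slashes so components join cleanly.
    cleaned = [p.strip("/") for p in parts if p]
    return "/".join([root, *cleaned])
```

A caller written this way never checks the mode itself: get_path("pipeline/flu/", exp_id, run_id) yields a gs:// URI in cloud mode and a ./local/bucket/ path in local mode.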

Architecture

graph TB
    subgraph "Pipeline Scripts"
        BUILDER[main_builder.py]
        RUNNER[main_runner.py]
        OUTPUT[main_output.py]
    end

    subgraph "Storage Abstraction Layer"
        API[storage.py<br/>Unified API]
    end

    subgraph "Backends"
        GCS[Google Cloud Storage<br/>google-cloud-storage library]
        FS[Local Filesystem<br/>pathlib]
    end

    BUILDER --> API
    RUNNER --> API
    OUTPUT --> API

    API -->|EXECUTION_MODE=cloud| GCS
    API -->|EXECUTION_MODE=local| FS

    GCS --> GCS_BUCKET[(gs://bucket/)]
    FS --> LOCAL_DIR[(./local/bucket/)]

    style API fill:#4285f4,color:#fff
    style GCS fill:#34a853,color:#fff
    style FS fill:#fbbc04,color:#000

Core Functions

| Function | Signature | Description |
| --- | --- | --- |
| `get_config()` | `() -> dict` | Returns storage configuration (mode, bucket, prefix, exp_id, run_id) based on environment variables |
| `get_path(*parts)` | `(*parts: str) -> str` | Constructs the full storage path in the correct format for the current mode |
| `save_bytes(path, data)` | `(path: str, data: bytes) -> None` | Saves binary data to storage (GCS upload or filesystem write) |
| `load_bytes(path)` | `(path: str) -> bytes` | Loads binary data from storage (GCS download or filesystem read) |
| `list_files(prefix)` | `(prefix: str) -> list[str]` | Lists files matching a prefix (GCS blob listing or filesystem glob) |

The path and I/O functions dispatch to the backend selected by EXECUTION_MODE:

| Function | Cloud Backend | Local Backend |
| --- | --- | --- |
| `save_bytes(path, data)` | GCS upload | Filesystem write |
| `load_bytes(path)` | GCS download | Filesystem read |
| `list_files(prefix)` | GCS blob listing | Filesystem glob |
| `get_path(*parts)` | `gs://bucket/prefix/...` | `./local/bucket/prefix/...` |

Path Conventions

Standard Directory Structure

All pipeline runs follow this structure:

{bucket}/
└── {dir_prefix}{exp_id}/
    └── {run_id}/
        ├── builder-artifacts/
        │   ├── input_0000.pkl
        │   ├── input_0001.pkl
        │   └── ...
        ├── runner-artifacts/
        │   ├── result_0000.pkl
        │   ├── result_0001.pkl
        │   └── ...
        └── outputs/
            └── {timestamp}/
                ├── quantiles.csv.gz
                ├── trajectories.csv.gz
                └── ...

Path Components

Bucket: GCS bucket (gs://my-bucket) in cloud mode, local directory (./local/bucket) in local mode
DIR_PREFIX: Base directory prefix (default: pipeline/flu/), configurable per profile
EXP_ID: Experiment identifier from user (e.g., 202550/smc_rmse_202543-202549), can include subdirectories
RUN_ID: Unique run identifier (YYYYMMDD-HHMMSS-{uuid-prefix}), auto-generated in cloud mode
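As an illustration of how these components combine, the sketch below assembles a run directory and generates a RUN_ID in the documented YYYYMMDD-HHMMSS-{uuid-prefix} shape. The function names, the use of UTC, and the 8-character UUID prefix are assumptions made for the example; the pipeline's own generator may differ.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Build a RUN_ID such as 20250102-030405-1a2b3c4d.

    UTC timestamps and an 8-character UUID prefix are assumptions.
    """
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d-%H%M%S}-{uuid.uuid4().hex[:8]}"


def run_prefix(bucket: str, exp_id: str, run_id: str,
               dir_prefix: str = "pipeline/flu/") -> str:
    """Compose {bucket}/{dir_prefix}{exp_id}/{run_id} as in the directory tree."""
    return f"{bucket.rstrip('/')}/{dir_prefix}{exp_id}/{run_id}"


example_prefix = run_prefix(
    "gs://my-bucket",
    "202550/smc_rmse_202543-202549",  # EXP_ID may contain subdirectories
    "20250102-030405-1a2b3c4d",
)
```

Because DIR_PREFIX ends with a slash and is concatenated directly with EXP_ID, an EXP_ID containing a slash simply adds another directory level beneath the prefix.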

File Naming Conventions

Input files: input_{index:04d}.pkl (e.g., input_0000.pkl, input_0042.pkl)
Result files: result_{index:04d}.pkl (e.g., result_0000.pkl, result_0042.pkl)
Output files: quantiles.csv.gz, trajectories.csv.gz, metadata.json, etc. in a timestamped subdirectory
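The zero-padded naming pattern can be produced and parsed with a few lines of Python. The helper names below are hypothetical and shown only to illustrate the {index:04d} convention.

```python
import re


def input_name(index: int) -> str:
    """Format an input artifact name, e.g. input_0042.pkl."""
    return f"input_{index:04d}.pkl"


def result_name(index: int) -> str:
    """Format a result artifact name, e.g. result_0042.pkl."""
    return f"result_{index:04d}.pkl"


# :04d pads to at least four digits, so indices of 10000 and above
# produce longer names; the pattern accepts four or more digits.
_ARTIFACT = re.compile(r"^(input|result)_(\d{4,})\.pkl$")


def parse_index(filename: str) -> int:
    """Recover the integer index from an input or result file name."""
    match = _ARTIFACT.match(filename)
    if match is None:
        raise ValueError(f"Not an artifact file name: {filename!r}")
    return int(match.group(2))
```

Zero padding keeps lexicographic order equal to numeric order for indices below 10000, so a sorted file listing pairs input_0042.pkl with result_0042.pkl by position as well as by index.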

Next Steps

Execution Modes: How cloud and local modes use the storage abstraction
Pipeline Stages: How stages use storage
Docker Images: Where the storage module lives in the Docker image
Cloud Infrastructure: GCS bucket configuration