
Storage Abstraction

The pipeline uses a unified storage interface that works identically in both cloud and local execution modes. This abstraction is implemented in docker/scripts/util/storage.py.

Why Storage Abstraction?

Problem: Pipeline scripts need to save and load artifacts, but the storage backend differs by mode:

  • Cloud mode: Google Cloud Storage (GCS) with gs:// URIs
  • Local mode: Local filesystem with ./local/bucket/ paths

Solution: A unified API that automatically uses the correct backend based on the EXECUTION_MODE environment variable.

Benefit: Pipeline scripts (main_builder.py, main_runner.py, main_output.py) contain zero mode-specific code.
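
The mode switch described above can be sketched as a single lookup on EXECUTION_MODE. This is a minimal illustration, not the code in storage.py: the `select_backend` function, the `BACKENDS` mapping, and the choice to raise on an unset or unknown value are all assumptions made for the example.

```python
import os

# Hypothetical mapping from EXECUTION_MODE values to backend names.
BACKENDS = {"cloud": "gcs", "local": "filesystem"}


def select_backend(environ=None) -> str:
    """Return the backend name for the current execution mode.

    Assumption: an unset or unrecognized EXECUTION_MODE is an error here;
    the real module may fall back to a default instead.
    """
    env = os.environ if environ is None else environ
    mode = env.get("EXECUTION_MODE")
    if mode not in BACKENDS:
        raise ValueError(
            f"EXECUTION_MODE must be one of {sorted(BACKENDS)}, got {mode!r}"
        )
    return BACKENDS[mode]
```

Because the decision lives in one place, callers such as main_builder.py never need to branch on the mode themselves.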

Architecture

graph TB
    subgraph "Pipeline Scripts"
        BUILDER[main_builder.py]
        RUNNER[main_runner.py]
        OUTPUT[main_output.py]
    end

    subgraph "Storage Abstraction Layer"
        API[storage.py<br/>Unified API]
    end

    subgraph "Backends"
        GCS[Google Cloud Storage<br/>google-cloud-storage library]
        FS[Local Filesystem<br/>pathlib]
    end

    BUILDER --> API
    RUNNER --> API
    OUTPUT --> API

    API -->|EXECUTION_MODE=cloud| GCS
    API -->|EXECUTION_MODE=local| FS

    GCS --> GCS_BUCKET[(gs://bucket/)]
    FS --> LOCAL_DIR[(./local/bucket/)]

    style API fill:#4285f4,color:#fff
    style GCS fill:#34a853,color:#fff
    style FS fill:#fbbc04,color:#000

Core Functions

| Function | Signature | Description |
|---|---|---|
| get_config() | () -> dict | Returns storage configuration (mode, bucket, prefix, exp_id, run_id) based on environment variables |
| get_path(*parts) | (*parts: str) -> str | Constructs the full storage path in the correct format for the current mode |
| save_bytes(path, data) | (path: str, data: bytes) -> None | Saves binary data to storage (GCS upload or filesystem write) |
| load_bytes(path) | (path: str) -> bytes | Loads binary data from storage (GCS download or filesystem read) |
| list_files(prefix) | (prefix: str) -> list[str] | Lists files matching a prefix (GCS blob listing or filesystem glob) |

All functions automatically dispatch to the correct backend based on EXECUTION_MODE:

| Function | Cloud Backend | Local Backend |
|---|---|---|
| save_bytes(path, data) | GCS upload | Filesystem write |
| load_bytes(path) | GCS download | Filesystem read |
| list_files(prefix) | GCS blob listing | Filesystem glob |
| get_path(*parts) | gs://bucket/prefix/... | ./local/bucket/prefix/... |
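
To make the local column concrete, the following sketch shows one way the filesystem side of save_bytes, load_bytes, and list_files could be written with pathlib. The `_local` function names are hypothetical, and the behaviors shown (creating parent directories on write, treating a directory prefix as "everything beneath it") are assumptions rather than documented guarantees of storage.py. The GCS side is deliberately left out.

```python
from pathlib import Path


def save_bytes_local(path: str, data: bytes) -> None:
    """Write data to path, creating parent directories as needed (assumed)."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)


def load_bytes_local(path: str) -> bytes:
    """Read the full contents of path as bytes."""
    return Path(path).read_bytes()


def list_files_local(prefix: str) -> list[str]:
    """List files under a prefix, sorted by path.

    Assumption: a prefix naming a directory lists every file beneath it;
    otherwise the last path component is matched as a filename prefix.
    """
    p = Path(prefix)
    if p.is_dir():
        candidates = p.rglob("*")
    else:
        candidates = p.parent.glob(p.name + "*")
    return sorted(str(c) for c in candidates if c.is_file())
```

A round trip then looks like `save_bytes_local(path, b"...")` followed by `load_bytes_local(path)`, and `list_files_local` on the containing directory returns the saved path.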

Path Conventions

Standard Directory Structure

All pipeline runs follow this structure:

{bucket}/
└── {dir_prefix}{exp_id}/
    └── {run_id}/
        ├── builder-artifacts/
        │   ├── input_0000.pkl
        │   ├── input_0001.pkl
        │   └── ...
        ├── runner-artifacts/
        │   ├── result_0000.pkl
        │   ├── result_0001.pkl
        │   └── ...
        └── outputs/
            └── {timestamp}/
                ├── quantiles.csv.gz
                ├── trajectories.csv.gz
                └── ...

Path Components

  • Bucket: GCS bucket (gs://my-bucket) in cloud mode, local directory (./local/bucket) in local mode
  • DIR_PREFIX: Base directory prefix (default: pipeline/flu/), configurable per profile
  • EXP_ID: Experiment identifier from user (e.g., 202550/smc_rmse_202543-202549), can include subdirectories
  • RUN_ID: Unique run identifier (YYYYMMDD-HHMMSS-{uuid-prefix}), auto-generated in cloud mode
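
As an illustration of how these components combine, the sketch below joins a bucket, DIR_PREFIX, EXP_ID, and RUN_ID into a full path and generates a run identifier in the YYYYMMDD-HHMMSS-{uuid-prefix} shape. The names `build_path` and `make_run_id` are hypothetical, and the 8-character UUID prefix and the use of UTC are assumptions; the real get_path reads these values from configuration instead of taking them as arguments.

```python
import uuid
from datetime import datetime, timezone


def make_run_id() -> str:
    """Build a run ID shaped like YYYYMMDD-HHMMSS-{uuid-prefix}.

    Assumptions: UTC timestamps and an 8-character hex prefix.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def build_path(bucket: str, dir_prefix: str, exp_id: str, run_id: str,
               *parts: str) -> str:
    """Join path components with single slashes.

    The same logic serves both modes: only the bucket differs
    (gs://my-bucket in cloud mode, ./local/bucket in local mode).
    """
    base = bucket.rstrip("/")
    # dir_prefix is concatenated with exp_id, matching {dir_prefix}{exp_id}.
    segments = [dir_prefix + exp_id, run_id, *parts]
    cleaned = [s.strip("/") for s in segments if s.strip("/")]
    return "/".join([base, *cleaned])


if __name__ == "__main__":
    run_id = make_run_id()
    for bucket in ("gs://my-bucket", "./local/bucket"):
        print(build_path(bucket, "pipeline/flu/",
                         "202550/smc_rmse_202543-202549", run_id,
                         "builder-artifacts", "input_0000.pkl"))
```

Since EXP_ID may contain subdirectories, the slashes inside it are preserved and only leading and trailing slashes on each segment are trimmed.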

File Naming Conventions

  • Input files: input_{index:04d}.pkl (e.g., input_0000.pkl, input_0042.pkl)
  • Result files: result_{index:04d}.pkl (e.g., result_0000.pkl, result_0042.pkl)
  • Output files: quantiles.csv.gz, trajectories.csv.gz, metadata.json, etc. in a timestamped subdirectory
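
The zero-padded naming scheme above maps directly onto Python format specifiers. The helpers below are a small sketch with hypothetical names (`input_name`, `result_name`, `parse_index`), not functions taken from the pipeline.

```python
import re


def input_name(index: int) -> str:
    """Builder artifact name, e.g. index 42 gives input_0042.pkl."""
    return f"input_{index:04d}.pkl"


def result_name(index: int) -> str:
    """Runner artifact name, e.g. index 42 gives result_0042.pkl."""
    return f"result_{index:04d}.pkl"


# Indices are padded to at least four digits; larger indices simply widen.
_ARTIFACT_RE = re.compile(r"^(input|result)_(\d{4,})\.pkl$")


def parse_index(filename: str) -> int:
    """Recover the integer index from an input or result filename."""
    match = _ARTIFACT_RE.match(filename)
    if match is None:
        raise ValueError(f"not a pipeline artifact name: {filename!r}")
    return int(match.group(2))
```

Parsing the index back out is what lets a runner pair each input file with the result file it should produce, for example `result_name(parse_index("input_0042.pkl"))`.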

Next Steps