Cloud Infrastructure¶
Infrastructure overview¶
The pipeline is built as a serverless architecture on Google Cloud. There are no persistent servers or clusters to manage. Cloud Batch provisions VMs on demand for each job and terminates them after completion, so you only pay for the compute you use. All infrastructure is defined as code using Terraform.
graph TB
subgraph "User Interface"
CLI[epycloud CLI]
end
subgraph "Container Registry"
AR[Artifact Registry<br/>Docker Images]
end
subgraph "Orchestration"
CW[Cloud Workflows<br/>Pipeline Coordinator]
end
subgraph "Compute"
CB[Cloud Batch<br/>Job Execution]
VM[Auto-scaled VMs]
end
subgraph "Storage"
GCS[Cloud Storage<br/>Artifacts & Results]
end
subgraph "Access Control"
IAM[IAM & Service Accounts]
end
subgraph "Monitoring"
LOG[Cloud Logging]
MON[Cloud Monitoring]
BUDGET[Budget Alerts]
end
CLI -->|Submit workflow| CW
CW -->|Create jobs| CB
CB -->|Pull images| AR
CB -->|Provision| VM
VM -->|Read/Write| GCS
IAM -->|Control access| GCS
IAM -->|Control access| CB
VM -->|Send logs| LOG
CB -->|Send metrics| MON
MON -->|Trigger| BUDGET
style CW fill:#4285f4,color:#fff
style CB fill:#34a853,color:#fff
style GCS fill:#fbbc04,color:#000
style AR fill:#ea4335,color:#fff Terraform¶
Infrastructure as Code
All infrastructure is defined in the terraform/ directory. This includes Artifact Registry, Cloud Workflows, service accounts, IAM bindings, networking, and monitoring alerts. The GCS bucket is created manually outside of Terraform (see Prerequisites).
Managed through epycloud
Instead of running terraform directly, you use epycloud terraform commands. Under the hood, epycloud reads your configuration and converts values like project ID, region, and resource allocations into TF_VAR_* environment variables. This way, Terraform always uses the same settings as the rest of epycloud without you needing to pass -var flags manually.
Remote state
State is stored remotely in a GCS backend (configured in main.tf) for safe collaboration.
terraform/
├── main.tf # APIs, storage, registry, workflows, secrets
├── variables.tf # Input variables
├── outputs.tf # Output values
├── network.tf # VPC, subnet, NAT, firewall
└── monitoring.tf # Budget and alert policies
To deploy or update infrastructure, see Cloud Deployment: Deploy Infrastructure.
Core services¶
Cloud Storage (GCS)¶
Persistent storage for all pipeline artifacts and results. Stores builder inputs, runner results, and final outputs under a structured path hierarchy ({bucket}/{dir_prefix}{exp_id}/{run_id}/...). Optional lifecycle policies handle automatic cleanup of old artifacts.
Artifact Registry¶
Docker image repository for pipeline containers. Stores both local and cloud image variants in a regional repository for fast pulls during job execution. Cloud Build pushes images here automatically; Cloud Batch pulls them when starting jobs.
Cloud Build¶
Builds Docker images and pushes them to Artifact Registry. Submitted asynchronously via epycloud build cloud. See Cloud Build configuration below for implementation details.
Cloud Batch¶
Serverless compute for running pipeline stages. Cloud Workflows submits job definitions, Cloud Batch provisions VMs based on resource requirements, VMs pull Docker images from Artifact Registry, containers execute pipeline scripts, and VMs are automatically terminated after completion. No cluster management required.
Batch documentation | Compute Engine | Google CloudCloud Workflows¶
Orchestrates multi-stage pipeline execution. Coordinates the sequential execution of Stage A → B → C, handles job status polling, passes data between stages (e.g., NUM_TASKS), and manages conditional execution (optional Stage C). See Workflows Orchestration for design details.
Cloud Logging¶
Centralized log aggregation from Cloud Workflows executions, Cloud Batch jobs, and container stdout/stderr. Logs are labeled with exp_id, run_id, stage, and task_index for filtering. Default retention is 30 days.
Cloud Monitoring¶
Tracks infrastructure metrics and triggers alerts for workflow execution failures, batch job failures (>10% failure rate), and budget threshold crossings (50%, 80%, 100% of monthly budget).
Cloud Monitoring documentation | Google CloudCloud Build configurations¶
The build is configured in cloudbuild.yaml:
- Uses
E2_HIGHCPU_8machine type for faster builds. See this page for available instance types. - Supports layer caching with
--cache-fromandBUILDKIT_INLINE_CACHE(optional, controlled by_NO_CACHEflag) - Fetches
GITHUB_PATfrom Secret Manager for private repository access during build - Runs container structure tests after build to validate the image
Cloud Batch resources¶
Default resource allocations per stage (from terraform/variables.tf):
| Stage | CPU | Memory | Timeout | Machine Type | Parallelism |
|---|---|---|---|---|---|
| A (Builder) | 2000 milli | 8192 MiB | 3600s (1 hr) | c4d-standard-2 | 1 task |
| B (Runner) | 2000 milli | 4096 MiB | 36000s (10 hr) | auto-selected | up to 100 tasks |
| C (Output) | 4000 milli | 15360 MiB | 7200s (2 hr) | c4d-standard-4 | 1 task |
All values can be overridden from your epycloud configuration under google_cloud.batch. When no machine type is specified (empty string), Cloud Batch auto-selects a VM type based on the CPU and memory requirements. See Google Cloud Batch documentation for details on resource configuration.
Note: CPU is measured in millicores (1000 milli = 1 vCPU).
IAM & Service accounts¶
Two service accounts are created with minimal permissions:
Workflow Service Account (used by Cloud Workflows):
| Role | Purpose |
|---|---|
roles/batch.jobsEditor | Create and manage Batch jobs |
roles/logging.logWriter | Write workflow logs |
Batch Service Account (used by Cloud Batch VMs):
| Role | Purpose |
|---|---|
roles/storage.objectAdmin | Read/write GCS artifacts |
roles/artifactregistry.reader | Pull Docker images |
roles/logging.logWriter | Write job logs |
Required Google Cloud APIs¶
The following APIs must be enabled (Terraform handles this automatically):
batch.googleapis.com- Cloud Batchworkflows.googleapis.com- Cloud Workflowsstorage.googleapis.com- Cloud Storageartifactregistry.googleapis.com- Artifact Registrycloudbuild.googleapis.com- Cloud Buildlogging.googleapis.com- Cloud Loggingmonitoring.googleapis.com- Cloud Monitoring
Next steps¶
- Cloud Deployment: Setup and deployment guide
- Workflows Orchestration: Workflow design details
- Pipeline Stages: How stages run on this infrastructure