Reference Architecture
Hybrid Cloud Landing Zone
Reference architecture for a hybrid cloud landing zone: account structure, identity federation, VPN/DX connectivity, policy guardrails, and cost visibility.
Design summary
A hybrid cloud landing zone is the pre-built foundation — account structure, identity, network connectivity, guardrails, and cost controls — that workloads land into, rather than each team improvising its own cloud footprint. This architecture uses a multi-account/subscription hierarchy with an org-level policy layer, identity federated from the existing enterprise IdP, hub-and-spoke networking bridged to on-prem over redundant VPN or dedicated circuits, and tagging enforced at creation time so cost visibility works from day one. It is the difference between adopting cloud and accumulating it.
Component stack
- Cloud organization with OU/management-group hierarchy
- IaC pipeline (Terraform/OpenTofu) with state backend and plan review
- Enterprise IdP (Entra ID/Okta-class) federated via SAML/OIDC, SCIM provisioning
- Hub VPC/VNet with centralized egress and inspection
- Dedicated circuit (Direct Connect/ExpressRoute-class) + IPsec VPN backup
- BGP routing between cloud hub and on-prem core
- Org-level policy guardrails (SCPs/Azure Policy) + preventive IaC checks
- Centralized log archive account (immutable storage)
- Cost management tooling + enforced tagging schema
Purpose and requirements
A landing zone is what you build before the first production workload arrives in cloud, so that the tenth and the hundredth workload land in the same structure instead of a new improvisation each time. On client accounts I lead, the difference between organizations that built one and those that did not is visible within a year: the former argue about workloads, the latter argue about untangling cloud accounts nobody can turn off.
Cloud does not punish missing structure immediately. It compounds it.
This design assumes an enterprise with a real on-prem estate that is not going away — datacenter workloads, corporate AD, existing WAN — adopting one primary cloud provider. The framing for which workloads move is a separate question (my cloud vs on-prem decision framework covers it); this is the foundation either answer lands on.
Requirements:
- Isolation by construction. Environments and workload domains are separated by account boundaries the provider enforces, not by tags and good intentions.
- One identity. The enterprise IdP is the sole path for human access; no parallel population of cloud-native users.
- Deterministic network. On-prem and cloud exchange routes over redundant paths, with one place where inspection and egress happen.
- Guardrails that hold under bypass. The critical policies bind even if someone gets console access outside the pipeline.
- Costs attributable from day one. Every resource answers “who pays for this?” at creation time.
Topology
CLOUD ORGANIZATION (management/root: no workloads here)
|
+-- OU: Platform
|     +-- Network hub acct          +-- Log archive acct
|     |   (TGW/vWAN hub,            |   (immutable audit
|     |   egress+inspection)        |   + flow logs, all accts)
|     +-- Identity acct             +-- Shared services acct
|
+-- OU: Workloads-Prod              +-- OU: Workloads-NonProd
|     +-- acct: domain A prod       |     +-- acct: domain A dev/test
|     +-- acct: domain B prod       |     +-- acct: domain B dev/test
|
+-- OU: Sandbox (budget-capped, auto-expiring, no on-prem route)
NETWORK VIEW
                      +-------------------+
spoke VPCs (prod) --- |   HUB VPC/VNet    | --- spoke VPCs (non-prod)
no spoke-to-spoke     |  central egress,  |     (no route to prod)
except via hub        |  inspection, DNS  |
                      +---+-----------+---+
                          |           |
            Direct Connect /        IPsec VPN
            ExpressRoute (BGP)      (BGP, backup path)
                          |           |
                      +---+-----------+---+
                      |  On-prem WAN core |
                      |  corp AD, DCs,    |
                      |  legacy estate    |
                      +-------------------+
IdP (Entra/Okta) ==SAML/OIDC+SCIM==> cloud SSO --> roles in all accts
Terraform pipeline ==plan/review/apply==> every account
Component roles
Organization hierarchy. The management root runs nothing — it holds the org, the guardrail policies, and consolidated billing. Platform accounts (network hub, identity, log archive, shared services) are owned by the platform team; workload OUs split prod from non-prod at the OU level so policies differ structurally, not per-account. The sandbox OU is deliberately generous inside hard walls: capped budgets, auto-expiry, and no route to on-prem, because the alternative to a good sandbox is shadow accounts on corporate credit cards.
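The hierarchy described above can be written down as data, which makes its structural rules checkable. The following is a minimal Python sketch, not a provider API: the names `Account`, `OU`, and `violations`, and the account labels, are hypothetical and only loosely follow the topology diagram. It checks two of the rules stated in the text: the management root runs nothing, and sandbox accounts have no route to on-prem.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    name: str
    runs_workloads: bool = False
    onprem_route: bool = True


@dataclass
class OU:
    name: str
    accounts: list[Account] = field(default_factory=list)


def violations(root_accounts: list[Account], ous: list[OU]) -> list[str]:
    """Return human-readable breaches of two structural rules from the text."""
    found = []
    # Rule 1: the management root holds the org and billing, never workloads.
    for acct in root_accounts:
        if acct.runs_workloads:
            found.append(f"root account '{acct.name}' runs workloads")
    # Rule 2: sandbox accounts must not have a route to on-prem.
    for ou in ous:
        if ou.name == "Sandbox":
            for acct in ou.accounts:
                if acct.onprem_route:
                    found.append(f"sandbox account '{acct.name}' has an on-prem route")
    return found


# Hypothetical organization, shaped like the topology diagram.
org = [
    OU("Platform", [Account("network-hub"), Account("log-archive"),
                    Account("identity"), Account("shared-services")]),
    OU("Workloads-Prod", [Account("domain-a-prod", runs_workloads=True)]),
    OU("Workloads-NonProd", [Account("domain-a-dev", runs_workloads=True)]),
    OU("Sandbox", [Account("sandbox-1", runs_workloads=True, onprem_route=False)]),
]
management = [Account("management")]

print(violations(management, org))
```

In practice a check like this would run against the real organization inventory rather than a hand-written list; the point is only that the rules are simple enough to assert mechanically.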
Identity federation. The enterprise IdP federates via SAML/OIDC; SCIM keeps group membership synchronized; cloud role assignment maps from IdP groups. Humans authenticate once, with the enterprise MFA policy, and assume time-limited roles. Workloads use cloud-native workload identity (instance roles, workload identity federation) — never static keys in config files. This is identity-first security applied to cloud: the account boundary contains blast radius, but identity is the control plane attackers actually target.
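The group-to-role mapping described above can be sketched as a lookup table. This is an illustrative Python sketch under stated assumptions: the group names, account names, role names, and session lengths are all hypothetical, and a real deployment would express the mapping in the cloud provider's SSO configuration rather than in application code.

```python
from datetime import timedelta

# Hypothetical mapping: IdP group -> (account, role, maximum session length).
GROUP_ROLE_MAP = {
    "cloud-platform-admins": ("network-hub", "PlatformAdmin", timedelta(hours=1)),
    "domain-a-developers": ("domain-a-dev", "Developer", timedelta(hours=8)),
    "domain-a-operators": ("domain-a-prod", "ReadOnlyOperator", timedelta(hours=4)),
}


def roles_for(idp_groups):
    """Resolve a user's IdP groups to the time-limited roles they may assume.

    Groups with no mapping grant nothing, so access is never implied by
    group membership that the platform team has not explicitly mapped.
    """
    grants = []
    for group in sorted(idp_groups):
        if group in GROUP_ROLE_MAP:
            account, role, max_session = GROUP_ROLE_MAP[group]
            grants.append({
                "account": account,
                "role": role,
                "max_session_seconds": int(max_session.total_seconds()),
            })
    return grants


print(roles_for({"domain-a-developers", "finance-reporting"}))
```

Keeping the table small and explicit is also what makes the periodic reconciliation of IdP groups against cloud roles tractable.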
Network hub. One hub VPC/VNet per region carries the transit gateway or vWAN hub, centralized egress (NAT plus inspection), and hybrid DNS resolution with conditional forwarders in both directions, because a large share of hybrid “connectivity issues” turn out to be DNS. Spokes attach to the hub; spoke-to-spoke goes through it or not at all. Prod and non-prod use separate route domains so a dev VPC can never reach a prod database by default.
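The effect of separate route domains can be illustrated with a small reachability model. This Python sketch is a simplification under assumptions: `Spoke`, `ROUTE_TABLES`, and `path` are hypothetical names, and the contents of each route table (including non-prod being allowed to reach on-prem) are assumed for illustration rather than taken from the design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spoke:
    name: str
    route_domain: str  # "prod" or "nonprod"


# Assumed hub route tables: the destinations each route domain can reach.
# Neither table contains the other environment, so there is no default path.
ROUTE_TABLES = {
    "prod": {"prod", "onprem", "egress"},
    "nonprod": {"nonprod", "onprem", "egress"},
}


def path(src: Spoke, dst: Spoke) -> Optional[list]:
    """Return the hop list from src to dst, or None when no route exists."""
    if dst.route_domain not in ROUTE_TABLES[src.route_domain]:
        return None
    # Spoke-to-spoke traffic always transits the hub, never a direct peering.
    return [src.name, "hub", dst.name]


dev_app = Spoke("domain-a-dev", "nonprod")
prod_db = Spoke("domain-a-prod", "prod")
prod_api = Spoke("domain-b-prod", "prod")

print(path(dev_app, prod_db))
print(path(prod_api, prod_db))
```

The dev-to-prod lookup yields no path because the non-prod table simply has no entry for prod, which is the "by default" property the text describes: reaching prod from dev requires someone to add a route, not merely to forget a deny rule.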
Hybrid connectivity. A dedicated circuit (Direct Connect/ExpressRoute class) for bandwidth and latency, plus an IPsec VPN as the always-on backup, both running BGP with the on-prem core. Prefer advertising summaries over hundreds of specifics, set MED/local-pref so the circuit is preferred while it is up and failover to the VPN is deterministic, and fail the circuit over to VPN deliberately at least twice a year. Untested backup paths follow the same rule as untested backups: assume they do not work until a test proves otherwise.
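Two of the routing ideas above, summarization and preference-based failover, can be shown in a few lines. This Python sketch uses the standard-library `ipaddress` module for the summarization; the prefixes, path names, and local-preference values are assumptions for illustration, and `best_path` models only the local-preference step, whereas real BGP best-path selection has several further tie-breakers.

```python
import ipaddress

# Assumed cloud spoke prefixes: four contiguous /24s.
specifics = [ipaddress.ip_network(p) for p in
             ("10.20.0.0/24", "10.20.1.0/24", "10.20.2.0/24", "10.20.3.0/24")]

# Advertise one summary instead of four specifics: these collapse into a /22.
summaries = list(ipaddress.collapse_addresses(specifics))
print(summaries)

# Assumed path attributes: a higher local preference wins while the path is up.
paths = [
    {"name": "dedicated-circuit", "local_pref": 200, "up": True},
    {"name": "ipsec-vpn", "local_pref": 100, "up": True},
]


def best_path(candidates):
    """Pick the usable path with the highest local preference, else None."""
    usable = [p for p in candidates if p["up"]]
    if not usable:
        return None
    return max(usable, key=lambda p: p["local_pref"])["name"]


print(best_path(paths))      # circuit preferred while both paths are up
paths[0]["up"] = False       # simulate the deliberate failover test
print(best_path(paths))      # traffic moves to the VPN
```

Because the preference values differ, the outcome of a circuit failure is known in advance, which is what makes a scheduled failover test a confirmation rather than an experiment.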
Policy guardrails. Two layers. Preventive org policies (SCPs / Azure Policy deny effects) enforce the floor: approved regions only, no public buckets/blobs, CloudTrail/activity logs cannot be disabled, no IAM users created outside break-glass. Pipeline checks (IaC scanning, plan review) enforce the standard above the floor. The division matters — guardrails are for what must survive a bypassed pipeline or a stolen console session.
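As one concrete shape for the preventive layer, the sketch below builds an AWS-style service control policy as a Python dictionary and prints it as JSON. It is an illustration under assumptions, not a complete guardrail set: the approved regions are placeholders, the list of global services exempted from the region rule is deliberately short and would need review for a real organization, and an Azure estate would express the same intent with Azure Policy deny effects instead.

```python
import json

# Assumption: placeholder regions; substitute the organization's approved list.
APPROVED_REGIONS = ["eu-west-1", "eu-central-1"]

guardrail = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny any regional API call made outside the approved regions.
            "Sid": "DenyUnapprovedRegions",
            "Effect": "Deny",
            # Global services are exempted; this short list is illustrative only.
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": APPROVED_REGIONS}
            },
        },
        {
            # Audit logging cannot be switched off from inside a member account.
            "Sid": "ProtectAuditLogging",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(guardrail, indent=2))
```

Every statement is a deny, which reflects the division in the text: the org-level policy only sets the floor, and anything above the floor is left to pipeline checks and review.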
Delivery pipeline. Terraform/OpenTofu with remote state, plan output attached to the pull request, apply from CI only. Account vending itself is a pipeline: a new workload account arrives with baseline networking, logging, guardrails, and IAM already in place. That automation is the landing zone, in the sense that matters. The operating model behind this is covered separately, under infrastructure as code.
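The account-vending idea can be sketched as a function that never hands out an account without its baseline. This Python sketch is hypothetical throughout: `vend_account`, `BASELINE`, and the `apply_step` callback are illustrative names, and in a real pipeline each step would invoke an IaC module from CI rather than a local function.

```python
# The four baseline components named in the text, applied in a fixed order.
BASELINE = ("networking", "logging", "guardrails", "iam")


def vend_account(name, ou, apply_step):
    """Create an account record and apply every baseline component in order.

    apply_step(account, component) is a hypothetical hook standing in for
    whatever actually provisions the component; if it raises, the account
    is not returned, so callers never receive a half-baselined account.
    """
    account = {"name": name, "ou": ou, "baseline": []}
    for component in BASELINE:
        apply_step(account, component)
        account["baseline"].append(component)
    return account


def log_step(account, component):
    # Stand-in for a real provisioning call.
    print(f"applying {component} to {account['name']}")


new_account = vend_account("domain-c-prod", "Workloads-Prod", log_step)
print(new_account["baseline"])
```

The design choice worth noting is that the baseline is not optional input: a requester chooses the name and the OU, and everything else arrives by construction.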
Log archive and cost tooling. Every account streams audit logs, flow logs, and DNS query logs to one archive account with immutable storage and separate administrators. Cost tooling reads consolidated billing, and a tagging schema (owner, cost-center, environment, application) is enforced by policy — untagged resources are blocked at creation in prod OUs and swept weekly elsewhere.
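The tagging rule above reduces to a small admission check. This Python sketch assumes the four tag keys named in the text and a hypothetical OU name (`Workloads-Prod`, taken from the topology diagram); the function names and the three decision strings are illustrative, and a real deployment would enforce this through the provider's policy engine rather than application code.

```python
# The tagging schema named in the text.
REQUIRED_TAGS = ("owner", "cost-center", "environment", "application")


def missing_tags(tags):
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def admission_decision(tags, ou):
    """Block untagged resources in prod; allow but flag them elsewhere."""
    if not missing_tags(tags):
        return "allow"
    # Prod is blocked at creation; other OUs are caught by the weekly sweep.
    return "deny" if ou == "Workloads-Prod" else "allow-and-flag"


complete = {"owner": "team-a", "cost-center": "cc-1042",
            "environment": "prod", "application": "billing"}
partial = {"owner": "team-a", "environment": "dev"}

print(admission_decision(complete, "Workloads-Prod"))
print(missing_tags(partial))
print(admission_decision(partial, "Workloads-Prod"))
print(admission_decision(partial, "Workloads-NonProd"))
```

Treating a blank value the same as a missing key matters in practice, since an empty `owner` tag satisfies a naive presence check while answering nothing about who pays.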
Security model
| Layer | Control | Non-negotiable |
|---|---|---|
| Org | SCP/policy deny rules | Regions, public storage, logging, root use alarmed |
| Account | Boundary itself + baseline IAM | No cross-account trust without review |
| Identity | IdP federation, MFA, time-limited roles | No long-lived human credentials |
| Network | Hub inspection, separate prod route domain | No spoke-to-spoke or non-prod→prod default routes |
| Data | Encryption with managed keys per domain | Key policies deny cross-environment use |
| Audit | Central immutable archive | Write-once, separate admin population |
Break-glass: two cloud-native identities per critical account, credentials sealed offline, any use pages the security team. Boring by design.
Tradeoffs
| Decision | What you gain | What it costs |
|---|---|---|
| Many accounts vs one large account | Enforced isolation, clean billing, contained blast radius | Account sprawl to automate; cross-account plumbing |
| Centralized egress/inspection in hub | One choke point for policy and visibility | Data processing charges; hub is a shared dependency |
| Dedicated circuit + VPN backup | Predictable latency, tested failover | Circuit cost and lead time (weeks to months) |
| Federated identity only | One credential population, instant deprovisioning | IdP becomes tier-0 for cloud access; needs break-glass |
| Tag enforcement at creation | Cost attribution actually works | Friction for quick experiments (that’s the sandbox’s job) |
| Account vending automation | Consistent baselines, fast onboarding | Real up-front build before “anything visible” ships |
Scaling and variations
Smaller organizations: collapse to five accounts — management, platform (network+logging+shared), prod, non-prod, sandbox — and keep every other principle intact. The hierarchy can grow later; un-merging workloads out of a shared prod account is the expensive direction.
Multi-region: replicate the hub per region, interconnect hubs via the provider backbone, and keep the on-prem circuits landing in at least two locations. This pairs with the site-resilience patterns in my resilient multi-site infrastructure whitepaper.
Second cloud provider: resist symmetry. Build a minimal landing zone (identity federation, logging, guardrails, one hub) for the specific workloads that justify the second provider, and route inter-cloud traffic through on-prem or a transit provider rather than pretending you will operate two equal estates. Most “multi-cloud strategies” are one cloud plus an exception list.
Heavily regulated workloads: add a dedicated OU with stricter policies — customer-managed keys mandatory, private endpoints only, no internet egress — rather than tightening the whole org to the strictest tenant’s needs.
Operations notes
- Review guardrail denials weekly. Each denial is either an attempted mistake (the system working) or a legitimate need the platform does not yet serve (backlog item). Both are signal.
- Reconcile IdP groups to cloud roles quarterly — group sprawl is the cloud equivalent of firewall rule sprawl, and it drifts just as quietly.
- Test the connectivity failover on a schedule: drop the dedicated circuit, confirm BGP converges to VPN, measure what degraded. Publish the results internally.
- Cost review is an engineering meeting, not a finance meeting. Anomaly alerts route to the owning team’s channel; the platform team publishes unit-cost trends (per environment, per domain) monthly.
- Write down the paved road. The landing zone succeeds when teams choose it because it is the fastest path to production — a documented “here is how you get an account, a network, and a pipeline in a day” beats any policy memo at driving adoption.
None of the ideas here are new to cloud. Isolation by construction, one identity, deterministic routing, and immutable audit are the same principles that made well-run on-prem estates work for decades — the landing zone just expresses them in provider primitives instead of VLANs and chassis. Providers will keep renaming the services. The organizations that get value from cloud are the ones that put the structure in before the workloads, because the alternative is not flexibility.
It is archaeology.
Frequently asked questions
- Why multiple cloud accounts instead of one well-organized account?
  - Accounts (or subscriptions) are the only isolation boundary the provider fully enforces: blast radius, IAM scope, API limits, and billing all stop at the account edge. Tags and resource groups are labels; accounts are walls. One account per environment per major workload domain is the coarse-grained segmentation of cloud.
- Do we still need a dedicated circuit if we have a VPN?
  - It depends on bandwidth and latency sensitivity. IPsec over internet is fine for management traffic and modest workloads; sustained data replication, latency-sensitive applications, or anything over a few hundred Mbps justifies a Direct Connect/ExpressRoute-class circuit. Run the VPN regardless, as the always-on backup path, actively tested via BGP failover.
- Where should identity live in a hybrid design?
  - One authoritative IdP (almost always the one the enterprise already runs), federated into the cloud provider via SAML/OIDC with SCIM for provisioning. Humans get no long-lived cloud credentials; they get roles through the IdP. Cloud-native IAM users are reserved for break-glass, stored offline, and alarmed on use.
- What belongs in guardrails versus in code review?
  - Guardrails (SCPs, Azure Policy) enforce the non-negotiables that must hold even when pipelines are bypassed: no public object storage, approved regions only, logging cannot be disabled, root/owner actions alarmed. Code review and IaC scanning catch quality and architecture issues. Guardrails are the floor, not the process.