Architecture that survives ransomware
How to design for ransomware resilience: tiered identity, isolated and immutable backups, recovery-time engineering, and lessons from the ESXiArgs campaign.
Executive summary
Ransomware resilience is the ability of an environment to limit the blast radius of an encryption event and restore service within a known, tested time — without paying. It is an architecture property, not a product. This article covers the four pillars I design against: a tiered identity model that keeps domain compromise from becoming estate compromise, backup isolation built on the 3-2-1-1-0 rule, immutability that an attacker with admin credentials cannot revoke, and recovery-time engineering validated by full restores. The ESXiArgs campaign is used throughout as the worked example of what fails.
Resilience is an architecture property, not a product
Every ransomware briefing I sit through eventually reaches the same slide: a vendor logo grid labeled “defense in depth.” Meanwhile, the incidents that actually destroy companies share a boring shape that no logo grid addresses — one credential set that reached everything, backups that shared the fate of production, and a recovery time nobody had ever measured.
Read the postmortems and the pattern holds with depressing regularity. The products differ; the architecture failures repeat.
Ransomware resilience means designing so that an encryption event is a bounded incident: the attacker’s reach stops at a tier boundary, at least one copy of the data is beyond their ability to alter, and restoration finishes inside a window the business has already agreed to survive. Prevention still matters — patching, email filtering, EDR — but resilience is what remains when prevention fails, and prevention always eventually fails somewhere.
Design for recovery, not perfection. That is the whole thesis; the four pillars below are what it costs to implement.
The ESXiArgs lesson: encrypt the layer below the defenses
The February 2023 ESXiArgs campaign is the cleanest case study I know. Per the joint CISA/FBI advisory AA23-039A, attackers exploited known OpenSLP vulnerabilities in internet-exposed, unpatched VMware ESXi hypervisors (reporting at the time pointed to CVE-2021-21974, which VMware had patched in February 2021) and encrypted virtual machine files, both configuration files and portions of the flat disks, directly on the datastore. Roughly 3,800 servers were compromised worldwide.
Three failures repeated across victims:
- Management planes exposed to the internet. A hypervisor management interface reachable from the internet is not a misconfiguration; it is a standing invitation. Management networks belong in a restricted zone, which is the core argument of my network segmentation strategy.
- Defenses deployed one layer above the attack. In-guest EDR and guest-level backup agents watched nothing while the hypervisor beneath them encrypted their disks. Whatever layer runs your workloads, ask what protects that layer, not just the workloads on it.
- Snapshots mistaken for backups. VM snapshots living on the same datastore as the VMs died with the datastore. A backup that shares storage, credentials, or network reachability with production is a copy, not a backup.
The early ESXiArgs variant only partially encrypted large flat files, which is why CISA could publish a recovery script that reconstructed VMs from the untouched regions. A later variant closed that gap within days. Nobody should plan around attacker sloppiness twice.
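The snapshot failure above comes down to shared fate, and that test is simple enough to sketch in code. The following is a minimal illustration, not a real inventory tool: the `DataCopy` fields and all example names (`datastore-01`, `corp-ad`, and so on) are hypothetical, and a real check would pull these attributes from your storage, identity, and network inventories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """Where a copy of the data lives and what can reach it (hypothetical fields)."""
    name: str
    storage: str           # datastore, array, or bucket holding the copy
    credential_realm: str  # identity domain whose admins can modify the copy
    network_zone: str      # network segment from which the copy is reachable


def shared_fate(production: DataCopy, candidate: DataCopy) -> list:
    """Return the dimensions on which the candidate shares fate with production."""
    shared = []
    if candidate.storage == production.storage:
        shared.append("storage")
    if candidate.credential_realm == production.credential_realm:
        shared.append("credentials")
    if candidate.network_zone == production.network_zone:
        shared.append("network")
    return shared


def is_backup(production: DataCopy, candidate: DataCopy) -> bool:
    """A copy counts as a backup only if it shares nothing with production."""
    return not shared_fate(production, candidate)


# Hypothetical example: a snapshot on the production datastore vs. an isolated vault.
prod = DataCopy("vm-prod", "datastore-01", "corp-ad", "server-vlan")
snapshot = DataCopy("vm-snapshot", "datastore-01", "corp-ad", "server-vlan")
vault = DataCopy("vault-copy", "object-store-b", "backup-local", "backup-vlan")

print(shared_fate(prod, snapshot))  # ['storage', 'credentials', 'network']
print(is_backup(prod, vault))       # True
```

The strict equality comparison is a simplification; in practice "shared network reachability" is a firewall question rather than a matching label, so treat the output as a prompt for review rather than a verdict.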
Pillar one: identity tiering
In most ransomware intrusions, encryption is the last step of an identity compromise. The operator lands on a workstation, harvests credentials, and walks upward until they hold Domain Admin — at which point they push the encryptor by GPO or PsExec to everything at once.
The countermeasure is the tiered model Microsoft formalizes as the enterprise access model:
| Tier | Contains | Credential rule |
|---|---|---|
| Tier 0 | Domain controllers, IdP, PKI, backup infrastructure, hypervisor management | Tier 0 credentials never touch lower tiers |
| Tier 1 | Servers and applications | Admin from dedicated accounts, never from Tier 2 sessions |
| Tier 2 | Workstations, VDI | No standing server or DC admin rights |
The single rule that pays for the whole model: a credential that can log on to Tier 0 must never be typed on a Tier 1 or Tier 2 system. Credentials are exposed wherever they are used; a Domain Admin who RDPs into a compromised member server has just donated the keys. Enforce it with authentication policies and logon-workstation restrictions, not memos. I cover the full identity architecture in identity-first security.
Note the deliberate placement of backup infrastructure and hypervisor management in Tier 0. ESXiArgs is the argument: whoever controls the layer below the workloads controls the workloads.
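Enforcement belongs in authentication policy, but the credential rule is also easy to audit after the fact. Here is a minimal sketch, assuming you can export logon events as account and host pairs and that you maintain a tier label for each; the inventories, account names, and host names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical inventories; in practice these would come from the directory or a CMDB.
ACCOUNT_TIER = {"da-alice": 0, "srv-bob": 1, "carol": 2}
HOST_TIER = {"dc01": 0, "backup01": 0, "esx-mgmt": 0, "app01": 1, "ws-117": 2}


@dataclass(frozen=True)
class Logon:
    account: str
    host: str


def tier_violations(logons):
    """Return logons where a credential was used on a less trusted tier.

    Lower tier numbers are more privileged, so a violation is any logon
    where the account's tier number is smaller than the host's tier number.
    """
    violations = []
    for logon in logons:
        account_tier = ACCOUNT_TIER.get(logon.account)
        host_tier = HOST_TIER.get(logon.host)
        if account_tier is None or host_tier is None:
            continue  # unclassified account or host: out of scope for this sketch
        if account_tier < host_tier:
            violations.append(logon)
    return violations


events = [
    Logon("da-alice", "dc01"),    # Tier 0 credential on Tier 0: fine
    Logon("da-alice", "app01"),   # Tier 0 credential typed on a Tier 1 server
    Logon("srv-bob", "ws-117"),   # Tier 1 credential used from a workstation
    Logon("carol", "ws-117"),     # Tier 2 on Tier 2: fine
]

for violation in tier_violations(events):
    print(f"{violation.account} exposed on {violation.host}")
```

The sketch silently skips unclassified accounts and hosts. A production audit would more likely flag them, since an asset nobody has assigned to a tier is itself a finding.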
Pillar two: backup isolation and 3-2-1-1-0
Ransomware crews search for backup consoles before they trigger encryption — deleting or encrypting backups is what converts an outage into leverage. So the backup design target is: an attacker holding Domain Admin still cannot destroy the last copy.
The 3-2-1-1-0 rule operationalizes that:
- 3 copies of the data,
- 2 different media or platforms,
- 1 copy offsite,
- 1 copy offline, air-gapped, or immutable,
- 0 errors when restore verification runs.
The final two digits are where programs succeed or fail. Practical isolation measures, roughly in order of value:
- Backup servers and repositories in their own identity domain or no domain at all — never joined to the AD they protect.
- Backup network reachability limited to backup traffic; no RDP from the user LAN to a repository, ever.
- Pull-based replication to the isolated copy, so production credentials cannot reach it even in theory.
- MFA on the backup console, with deletion requiring a second person or a time-delayed approval.
The detailed treatment lives in backup and recovery security.
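The 3-2-1-1-0 rule is mechanical enough to check in code. The sketch below is one possible encoding, not a standard tool: the `BackupCopy` fields are hypothetical, production is counted as one of the three copies, and the "0" is interpreted as "at least one copy has been restore-verified, and every verified copy came back with zero errors".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str                     # e.g. "disk", "object-storage", "tape"
    offsite: bool
    isolated: bool                 # offline, air-gapped, or immutable
    restore_errors: Optional[int]  # errors in the last restore test; None if never tested


def check_3_2_1_1_0(copies):
    """Evaluate each digit of the rule and return a pass/fail map."""
    verified = [c for c in copies if c.restore_errors is not None]
    return {
        "3 copies": len(copies) >= 3,
        "2 media": len({c.media for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in copies),
        "1 isolated": any(c.isolated for c in copies),
        "0 errors": bool(verified) and all(c.restore_errors == 0 for c in verified),
    }


# Hypothetical estate: production, a local repository, and an object-locked offsite copy
# whose last restore test reported two errors.
estate = [
    BackupCopy("production", "disk", offsite=False, isolated=False, restore_errors=None),
    BackupCopy("local-repo", "disk", offsite=False, isolated=False, restore_errors=0),
    BackupCopy("locked-bucket", "object-storage", offsite=True, isolated=True, restore_errors=2),
]

for rule, passed in check_3_2_1_1_0(estate).items():
    print(f"{rule}: {'ok' if passed else 'FAIL'}")
```

In this example the first four digits pass and the last one fails, which is the common real-world shape: the copies exist, but nobody has proven that the isolated one restores cleanly.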
Pillar three: immutability the attacker cannot revoke
“Immutable” is now a checkbox on every datasheet, so ask the only question that matters: can an administrator with full credentials shorten or remove the immutability window? If yes, the attacker who becomes that administrator can too.
| Option | Revocable by a compromised admin? | Notes |
|---|---|---|
| Object storage with compliance-mode object lock | No, until retention expires | Governance mode is revocable — verify which mode you bought |
| Hardened Linux repository (immutability flags, no SSH) | Not via the backup application | Strength depends on OS hardening discipline |
| WORM tape, offline | No | Slow restores; unbeatable isolation |
| Snapshot “immutability” on primary storage | Often yes, via the array console | Treat as convenience, not last line |
My baseline: one copy under compliance-mode object lock or on a hardened repository, with a retention window longer than your realistic detection lag. Ransomware operators routinely dwell for weeks before encrypting; seven days of immutability protects you from a fast attacker and not from a patient one.
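The interaction between retention window and dwell time can be made concrete with a small calculation. This is a sketch under a deliberately pessimistic assumption: any restore point whose lock has expired by the time you detect the intrusion is treated as destroyed. The function name and the dates are illustrative only.

```python
from datetime import date, timedelta


def usable_restore_points(backup_dates, compromise_date, detection_date, retention_days):
    """Restore points that predate the compromise and were still locked at detection.

    Assumes the worst case: once a point's immutability window expires,
    the attacker deletes or alters it.
    """
    lock = timedelta(days=retention_days)
    return [
        d for d in backup_dates
        if d < compromise_date and d + lock >= detection_date
    ]


# Hypothetical timeline: daily backups, three weeks of attacker dwell time.
backups = [date(2023, 2, 1) + timedelta(days=i) for i in range(49)]  # 1 Feb to 21 Mar
compromised = date(2023, 3, 1)
detected = date(2023, 3, 22)

print(len(usable_restore_points(backups, compromised, detected, retention_days=7)))   # 0
print(len(usable_restore_points(backups, compromised, detected, retention_days=30)))  # 9
```

With seven days of retention, every point still under lock was taken after the attacker was already inside. With thirty days, nine clean points survive. The hard part in practice is the `compromise_date` input, which is usually only known after forensic work.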
Pillar four: recovery-time engineering
An untested recovery plan is a hypothesis.
Treat it like one: it gets promoted to a capability only after the experiment runs. The questions that need measured answers, not estimates:
- How long does a full restore of the critical tier actually take? Restoring 200 TB over a 10 Gbps link is roughly two days of pure transfer time (about 44 hours at full line rate) before you add verification, sequencing, and the inevitable surprises.
- What restores first? Identity and DNS come before everything, because nothing authenticates without them. Then the dependency chain of the revenue-critical applications, written down while people are calm.
- Where do you restore to if production hardware is evidence or still hostile? A clean-room enclave — even a modest one in a well-built lab environment or reserved cloud capacity — turns a rebuild-the-world problem into a staged migration.
- Can you trust what you restore? Restoring the attacker’s persistence along with the data is a classic reinfection path. Restore points must predate compromise, and golden images should be rebuilt from source.
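The transfer-time figure in the first question is plain arithmetic, and it is worth having as a function so you can plug in your own numbers. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the `efficiency` parameter is a hypothetical fudge factor for protocol overhead and storage bottlenecks, not a measured value.

```python
def transfer_hours(data_terabytes, link_gbps, efficiency=1.0):
    """Hours needed to move the data over the link, using decimal units."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_terabytes * 10**12 * 8
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600


# 200 TB over 10 Gbps at full line rate: the theoretical floor.
print(round(transfer_hours(200, 10), 1))                  # 44.4

# Same restore if only 60% of the link is sustained (an assumed figure).
print(round(transfer_hours(200, 10, efficiency=0.6), 1))  # 74.1
```

The result is a lower bound on the restore time, not an estimate of it. Only a drill tells you how far above the floor your environment actually lands.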
Run a full restore drill at least twice a year and treat the measured time as the real RTO, whatever the document says. The first drill is always humbling; that is the point of doing it before an attacker schedules one for you.
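One way to write down the restore order before the drill is to record dependencies and let a topological sort produce the sequence. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9 and later); the service names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be restored before it.
DEPENDS_ON = {
    "dns": set(),
    "identity": set(),
    "database": {"dns", "identity"},
    "file-share": {"dns", "identity"},
    "order-api": {"database"},
    "storefront": {"order-api"},
}


def restore_waves(depends_on):
    """Group services into waves; everything within a wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(restore_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

For this map, identity and DNS form the first wave, which matches the rule that nothing authenticates without them. A cycle error here is useful information too: it means two services each claim to need the other first, and someone has to decide how to bootstrap them.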
What to write down
- The tier boundaries, and the named accounts allowed to cross each one.
- The location and mechanism of the immutable copy, and who can touch it.
- The measured restore time from the last drill, with the date.
- The restore order for the top ten services, with owners.
- The decision, made in advance and in writing, of what happens in the first hour of a suspected encryption event — which connects this architecture to the incident response process that has to operate it under pressure.
Resilience is the discipline of assuming the encryption event happens and engineering the day after. Everything else is hoping.
Ransomware crews will change names, tooling, and extortion models; the underlying bet they make never changes — that somewhere in your architecture, one compromise reaches everything. The four pillars exist to make that bet a losing one. Attack techniques evolve quickly. The engineering principles that bound their damage barely move at all.
Frequently asked questions
- What does ransomware resilience actually mean?
- It means an encryption event is a bounded operational incident instead of an existential one: the attacker cannot reach every system with one set of credentials, cannot destroy the backups, and the organization can restore critical services within a tested recovery time. Resilience is measured by successful restore drills, not by the number of security products deployed.
- Is the 3-2-1 backup rule still enough against ransomware?
- No. Modern ransomware operators hunt backup infrastructure first, so the extended 3-2-1-1-0 rule applies: three copies, two media types, one offsite, one offline or immutable, and zero errors on restore verification. The last two elements — a copy the attacker cannot alter, and proof it restores — are what actually decide the outcome.
- Why was the ESXiArgs campaign so damaging?
- It targeted internet-exposed VMware ESXi hypervisors running unpatched OpenSLP services, encrypting virtual machine files directly at the hypervisor layer. Guest-level backup agents and in-guest security tooling never saw it. Roughly 3,800 servers were hit worldwide, per the joint CISA/FBI advisory AA23-039A, and organizations whose only copies were VM snapshots on the same datastore lost everything at once.
- Should we plan to pay the ransom as a fallback?
- Plan as if paying is not an option. Payment does not guarantee working decryptors, does not undo data theft, may be legally restricted depending on the sanctions status of the actor, and marks you as a payer. Every hour of engineering spent on isolated, tested restores buys more certainty than a ransom negotiation ever will.