PAVEL GLUKHIKH

Cybersecurity

Incident Response From the Infrastructure Seat

Incident response for infrastructure teams: the preparation artifacts that matter, containment calls under pressure, and balancing evidence with recovery.


Executive summary

Incident response is the structured process of detecting, containing, eradicating, and recovering from a security incident while preserving the evidence needed to understand it. Most IR writing addresses the SOC; this article addresses the infrastructure engineers who actually pull the cables, snapshot the VMs, and rebuild the estate. It covers the preparation artifacts an infrastructure team must have before the bad day, how to make containment decisions — including when disconnecting beats powering off — and how to navigate the standing conflict between evidence preservation and the business screaming for recovery, using NIST SP 800-61 as the reference frame.

The people who actually turn things off

Incident response literature is written for SOC analysts and CISOs. But when an incident goes physical — isolate this VLAN, snapshot these VMs before they reboot, rebuild the domain — the work lands on infrastructure engineers, most of whom have never been in a security incident and are getting contradictory instructions from three directions at once. I have been the person deciding whether to cut a link during an event, and the calls are harder in the moment than any tabletop suggests.

NIST SP 800-61 gives the process a shape: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. Revision 3 recasts those phases as a continuous capability aligned to CSF 2.0. This article is about what that shape means from the infrastructure seat specifically.

Preparation: the artifacts, not the binder

Preparation is the phase infrastructure teams control completely and fund worst.

The IR plan document matters far less than whether a handful of artifacts exist and are current. Every incident postmortem I have read or written converges on the same finding: the response was only ever as good as what had been prepared before anyone knew there would be an incident. The artifacts that decide the first hour:

  • Network diagrams that match reality. During containment you make decisions in minutes about what a segment reaches. A diagram that is two acquisitions out of date produces confident wrong decisions.
  • Asset inventory with owners. “What is 10.14.22.9 and who screams if it goes away” is the most-asked question of any incident’s first day.
  • Out-of-band communications. If the attacker holds the identity provider, they read your email and sit in your bridge calls. A pre-agreed channel on independent infrastructure — even a simple external messaging workspace with pre-enrolled devices — must exist before it is needed.
  • Break-glass access. Local and out-of-band credentials for network gear, hypervisors, and backup consoles that work when the directory is down or hostile — sealed, offline, tested. This is the operational payoff of the break-glass discipline in identity-first security.
  • Pre-written isolation procedures. For each zone in your segmentation model: the exact commands or change steps that cut it off, who may authorize them, and the known side effects. Deciding the blast radius of an isolation action at 2 a.m. is how secondary outages happen.
  • Forensic capacity. A few terabytes of storage the production domain cannot write to, and the knowledge of how to image a disk and export logs with hashes. Nothing exotic — just decided in advance.
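
The asset-inventory question above ("what is 10.14.22.9 and who screams if it goes away") can be answered in seconds if the inventory is queryable offline. The following is a minimal sketch, not a recommended tool: the `Asset` record, the `INVENTORY` rows, and the `lookup` helper are all hypothetical, and a real inventory would come from a CMDB or IPAM export.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    network: str  # CIDR block or single address the record covers
    name: str
    owner: str    # who to call before it goes away


# Hypothetical inventory rows, for illustration only.
INVENTORY = [
    Asset("10.14.0.0/16", "plant-b-servers", "infra-oncall"),
    Asset("10.14.22.0/24", "erp-app-tier", "erp-team"),
    Asset("10.14.22.9/32", "erp-db-01", "dba-oncall"),
]


def lookup(ip, inventory=INVENTORY):
    """Return the most specific inventory record covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [a for a in inventory if addr in ipaddress.ip_network(a.network)]
    # Longest prefix wins: a host record beats its subnet, a subnet beats a site.
    return max(
        matches,
        key=lambda a: ipaddress.ip_network(a.network).prefixlen,
        default=None,
    )
```

An address that matches nothing returns `None`, which is itself a finding worth writing down during an incident: something is on the network that the inventory does not know about.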

If your organization has an IR retainer, know the activation procedure and what the responders will ask for on the first call: logs, diagrams, EDR scope. Having those ready can cut a day off the engagement.

Containment: when to pull the plug

Containment is a series of tradeoffs made with incomplete information, and infrastructure owns the execution. The decisions that recur:

Disconnect versus power off. The default answer: isolate the network, keep the power on. RAM holds what disk does not: running processes, network connections, sometimes the encryption keys that make file recovery possible. Powering off destroys all of it, and with some ransomware families an encryptor interrupted mid-run by power loss can leave partially encrypted files that cannot be restored even with the key. Pull the cable, kill the switch port, use EDR network isolation. Power off is the last resort when isolation cannot happen fast enough, and even then it is a decision to log, not an instinct to follow.

Isolate the host or the segment. One confirmed compromised host with good EDR coverage: isolate the host. Signs of lateral movement, or no confidence in your visibility: isolate the segment and accept the collateral outage. The worst common choice is the middle path — watching a “contained” incident spread because nobody wanted to own the downtime of doing it properly.
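
The host-versus-segment rule of thumb above can be written down as a small decision function so that the call is agreed before the incident rather than argued during it. This is only a sketch of the reasoning in that paragraph: the names `Scope` and `isolation_scope` are hypothetical, and treating anything other than exactly one confirmed host as a segment-level event is an assumption of this sketch, not a rule from the text.

```python
from enum import Enum


class Scope(Enum):
    HOST = "isolate host"
    SEGMENT = "isolate segment"


def isolation_scope(confirmed_hosts: int, lateral_movement: bool, edr_trusted: bool) -> Scope:
    """Encode the host-versus-segment rule of thumb as an explicit decision."""
    # Signs of lateral movement, or no confidence in visibility: take the segment.
    if lateral_movement or not edr_trusted:
        return Scope.SEGMENT
    # One confirmed host with good EDR coverage: isolate just that host.
    if confirmed_hosts == 1:
        return Scope.HOST
    # Assumption of this sketch: any other count is treated as spread.
    return Scope.SEGMENT
```

Note that the function has no "wait and watch" outcome, which mirrors the warning about the middle path.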

When to cut the internet. Severing egress stops exfiltration and command-and-control at the price of stopping the business and alerting the attacker. If encryption has already begun or data staging is visible, cut it. Make sure the required emergency change is pre-approved in policy, because the attacker is not waiting for your CAB.

When identity itself is suspect. If there is credible evidence the directory or IdP is compromised, containment inverts: you are no longer protecting systems from the network, you are protecting them from their own management plane. Freeze GPO changes, monitor or disable privileged accounts, and treat every tool that authenticates against the directory as potentially hostile.

One infrastructure-specific warning: in OT-adjacent environments, the “just isolate it” instinct can be dangerous — cutting a control network segment can have physical consequences. Containment decisions there need the process engineers in the room, which is a discipline of its own on the industrial systems side.

Evidence versus recovery: refusing the false choice

An hour into any serious incident, two pressures collide. The business wants services back; the investigation needs the environment undisturbed. The executive asking “why isn’t it rebuilt yet” and the responder saying “touch nothing” are both doing their jobs.

Treating this as a choice is the mistake. It is a sequencing problem.

The resolution is to decouple: capture quickly, then recover on your own timeline while analysis runs on the copies.

  1. Snapshot before you remediate. Snapshotting affected VMs — including memory where the platform supports it — takes minutes and preserves the state that reimaging destroys.
  2. Respect the order of volatility (RFC 3227): memory and network state first, disk and logs after. Practically: capture RAM or a memory-inclusive snapshot before reboot, always.
  3. Get logs off the battlefield. Export domain controller, VPN, firewall, and EDR logs to storage outside the compromised domain immediately. Attackers delete logs; retention windows expire mid-investigation.
  4. Hash and record. Every image and export gets a hash and a line in the log: who, what, when, from where. Chain of custody sounds legalistic until an insurer or regulator asks.
  5. Then rebuild in parallel. Recovery proceeds on restored or rebuilt systems while analysis runs on the captured copies. This only works if restore is actually fast and trustworthy — which is why backup security and recovery-time engineering are prerequisites for IR, not separate topics.
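
The "hash and record" step above can be sketched with nothing but the standard library. This is a minimal illustration, assuming evidence arrives as ordinary files and that an append-only JSON Lines file is an acceptable custody log; the function names, field names, and log format are hypothetical choices, not a forensic standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_evidence(path: Path, collected_by: str, source: str, log_path: Path) -> dict:
    """Append one custody entry (who, what, when, from where, hash) to the log."""
    entry = {
        "when_utc": datetime.now(timezone.utc).isoformat(),
        "who": collected_by,
        "what": path.name,
        "from_where": source,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_file(path),
    }
    # Append-only: earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

The log file belongs on the same out-of-domain storage as the evidence itself; a custody log the attacker can edit proves little.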

The unrecoverable mistake, made constantly: wiping and reimaging the initial access system in the first hour. You get the box back a day early and lose the answer to “how did they get in” — which means you cannot prove the rebuilt environment isn’t compromised the same way.

Recovery and the part everyone skips

Recovery from a serious intrusion is not “restore and resume.” Restore points must predate compromise; credentials must be rotated estate-wide (twice for krbtgt if AD was touched); persistence must be hunted in whatever you restore. Bring systems back by dependency order — identity and DNS first — into monitored, still-segmented networks, and watch for the attacker’s return. Eradication you cannot verify is postponement.
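
Two of the rules above lend themselves to a short sketch: bringing systems back in dependency order, and choosing a restore point that predates compromise. The dependency map and function names below are hypothetical, and the sketch assumes the earliest compromise time is already known from the investigation, which in practice is often the hard part.

```python
from datetime import datetime
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be up before it.
DEPENDS_ON = {
    "dns": set(),
    "identity": {"dns"},
    "file-server": {"identity", "dns"},
    "erp-db": {"identity", "dns"},
    "erp-app": {"erp-db"},
}


def restore_order(depends_on):
    """Return systems in an order where every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def latest_clean_restore_point(restore_points, earliest_compromise):
    """Pick the newest restore point strictly older than the earliest known compromise."""
    clean = [p for p in restore_points if p < earliest_compromise]
    return max(clean, default=None)
```

`TopologicalSorter` raises `graphlib.CycleError` if the map contains a circular dependency, which is worth discovering before the incident rather than during it. A `None` result from `latest_clean_restore_point` means no restore point predates the compromise, and the system has to be rebuilt rather than restored.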

Then the post-incident review, which infrastructure teams skip because the backlog exploded during the incident. Hold it anyway, within two weeks, blameless, with the timeline on the wall. Every finding becomes either an architecture change, a new preparation artifact, or a monitoring improvement. The teams that get materially better at this are the ones that treat the review as the most important phase — which, per SP 800-61’s feedback loops, it is.

What to write down today

  • The isolation procedure for each network zone, with authorization names.
  • The break-glass credential locations and the last test date.
  • The out-of-band comms channel and enrollment status.
  • The snapshot-before-remediate rule, agreed with leadership in advance.
  • The first ten calls of a suspected compromise, in order.
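
A list like the one above only helps if someone notices when it goes stale. As a rough sketch, the register of artifacts and their last verification dates could be checked on a schedule; the artifact names, the `stale_artifacts` helper, and the 90-day threshold are all assumptions made for illustration.

```python
from datetime import date, timedelta


def stale_artifacts(last_verified, today, max_age_days=90):
    """Return artifact names never verified or last verified more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, when in last_verified.items()
        if when is None or when < cutoff
    )


# Hypothetical register: artifact name -> date it was last tested (None = never).
REGISTER = {
    "isolation procedures": date(2025, 5, 20),
    "break-glass credentials": date(2024, 11, 2),
    "out-of-band comms enrollment": None,
    "snapshot-before-remediate agreement": date(2025, 4, 1),
}
```

Calling `stale_artifacts(REGISTER, date.today())` returns the names that need attention; an empty list is the goal.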

An incident is where architecture gets audited. Every deferred decision — the flat management network, the domain-joined backup server, the dependency nobody mapped — presents its invoice during containment, with interest. The teams that come through intact are not the ones with the thickest IR binder; they are the ones whose infrastructure was designed for recovery in the first place.

You will respond the way you prepared. There is no third option.

Frequently asked questions

Should we power off a machine that is actively encrypting files?
Prefer disconnecting it from the network (pull the cable, disable the switch port, or isolate via EDR), because powering off destroys volatile memory that may hold encryption keys, running malware, and attacker tooling. Powering off is the fallback when isolation isn't possible fast enough. Either way, record what you did and when; the timeline matters later.

What is the incident response lifecycle in NIST SP 800-61?
The classic lifecycle is preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity, with feedback loops between phases. Revision 3, published in 2025, reframes these activities around the NIST Cybersecurity Framework 2.0 functions, emphasizing that incident response is a continuous risk-management capability rather than a linear procedure.

What should infrastructure teams prepare before an incident?
Current network diagrams and asset inventories, an out-of-band communication channel, break-glass credentials that don't depend on production identity, documented isolation procedures for each environment, forensic-capable storage for images and logs, and tested restore runbooks. In the first hour of a real incident, these artifacts are the difference between execution and improvisation.

How do you preserve evidence without delaying recovery?
Decouple the two: capture fast, then rebuild. Snapshot affected VMs, image representative systems, export logs to storage the attacker can't reach, and record volatile data where feasible. Then hand the environment to recovery while analysis proceeds on the copies. The mistake is treating it as either-or; the expensive version of that mistake is wiping the only system that showed initial access.
