LX1.9 The Triage Decision Framework

3-4 hours · Module 1 · Free

Triage Decision Framework: What to Collect First and Why

Operational Objective

The Prioritization Problem: You are the sole analyst on call. At 02:15, the SOC alert fires: suspicious outbound connections from WEBSRV-NGE01 to an unknown IP on port 4444, and auth.log anomalies on BASTION-NGE01 and DBSRV-NGE01. Three systems, one analyst, an attacker who may be active on all three. If you spend 90 minutes on a full collection of WEBSRV-NGE01, you lose volatile evidence from the other two. If you spread yourself across all three simultaneously, you collect shallow evidence from each and deep evidence from none. You need a decision framework that tells you what to collect first — on which system, in what order — based on the attacker's current activity, the incident type, and the evidence that answers the most critical questions fastest.

Deliverable: A triage decision framework built on four factors — attacker activity, incident type, system count, and available access. The triage decision matrix mapping six incident types to evidence priorities. The ability to make correct collection prioritization decisions under time pressure.

⏱ Estimated completion: 30 minutes

The Triage Problem

In theory, you collect everything. In practice, you have constraints: the attacker may be active and you need to contain before they exfiltrate more data, the server is business-critical and must return to service within hours, the incident affects 15 servers and you are one investigator, or the container will restart in 3 minutes and you need to decide what to grab first.

Triage is the discipline of prioritizing evidence collection based on what is most volatile, most relevant to the investigation questions, and most at risk of destruction. An investigator who follows a rigid collection checklist regardless of circumstances collects evidence methodically but slowly. An investigator who triages effectively focuses on the evidence that answers the most critical questions first, then expands collection as time permits.

Decision Factor 1: Is the Attacker Currently Active?

# === RAPID TRIAGE SCRIPT — RUN THIS FIRST ON EACH SUSPECT HOST ===

# Step 1: Is the attacker currently active? (10 seconds)
echo "=== ACTIVE SESSIONS ==="
w                           # Who is logged in right now?
ss -tnp | grep ESTAB       # Active network connections with process names

# Step 2: What processes are running? (10 seconds)
echo "=== SUSPICIOUS PROCESSES ==="
ps auxf | grep -vE '(root|www-data|syslog|daemon).*(/usr/|/lib/)' | head -20
# Look for: processes in /tmp, /dev/shm, or without full paths

# Step 3: Network connections to external IPs (10 seconds)
echo "=== OUTBOUND CONNECTIONS ==="
ss -tnp | awk 'NR > 1 && $5 !~ /^(127\.|10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)/'
# Shows connections to non-RFC1918 addresses (potential C2)

# Step 4: Recently modified files in suspicious locations (15 seconds)
echo "=== RECENT SUSPICIOUS FILES ==="
find /tmp /dev/shm /var/tmp -type f -mtime -7 2>/dev/null
find /var/www -name "*.php" -mtime -7 2>/dev/null

# Step 5: Quick check for common persistence (15 seconds)
echo "=== PERSISTENCE INDICATORS ==="
ls -la /etc/cron.d/ /var/spool/cron/crontabs/ 2>/dev/null
systemctl list-units --type=service --state=running | grep -v systemd

# TOTAL: ~60 seconds per host
# DECISION: based on output, determine if attacker is active,
# what incident type this is, and which host to prioritize

This is the single most important triage factor. If the attacker is currently logged in (visible via who, w, or active SSH sessions in ss -tnp), the priority shifts dramatically toward capturing their current activity.

Attacker active — immediate priority: Capture the running process list with /proc direct reads (their tools, their reverse shells, their processes). Capture the network connection state (their C2 channels, their lateral movement connections). Capture the contents of /dev/shm and /tmp (their staged tools and exfiltrated data). Acquire memory if LiME is available (captures everything in one shot). Then — and only then — move to log files and persistent evidence.

Attacker not active (or status unknown) — standard priority: Follow the collection sequence from LX1.5 in order. Volatile evidence first but without the extreme urgency. You have minutes to hours rather than seconds to minutes.
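The /proc direct reads mentioned above can be sketched as follows. This is a minimal illustration, not the course's own tooling: the function name proc_snapshot, the optional root argument, and the tab-separated output format are all assumptions made for the example. It walks the numeric directories under /proc and reads each process's exe link and cmdline file directly instead of going through ps.

```shell
# proc_snapshot [PROC_ROOT]
# Print "PID<TAB>EXE<TAB>CMDLINE" for every process directory under PROC_ROOT.
# PROC_ROOT defaults to /proc; the argument is an illustrative convenience so
# the function can also be pointed at a copied tree.
proc_snapshot() (
    root=${1:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -d "$dir" ] || continue          # process may have exited already
        pid=${dir##*/}
        # exe is a symlink to the binary. A "(deleted)" suffix means the
        # file was removed from disk after the process started.
        exe=$(readlink "$dir/exe" 2>/dev/null) || exe="?"
        # cmdline is NUL-separated; convert NULs to spaces for display.
        cmd=""
        [ -r "$dir/cmdline" ] && cmd=$(tr '\0' ' ' < "$dir/cmdline")
        printf '%s\t%s\t%s\n' "$pid" "${exe:-?}" "${cmd:-[no cmdline]}"
    done
)
```

A typical use would be to run the function on the suspect host and redirect the output to your forensic workstation. Entries whose binary path sits in /tmp or /dev/shm, or ends in "(deleted)", deserve a closer look. Reading exe links for processes owned by other users generally requires root.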

Decision Factor 2: What Is the Investigation Question?

Different investigation questions require different evidence prioritization. If you know the incident type from the initial alert, you can focus collection on the evidence sources most relevant to that type.

Credential compromise (SSH brute force, stolen credentials): Priority evidence: auth.log / secure (authentication events), wtmp and btmp (login records), .ssh/authorized_keys (persistence), lastlog (last known access per account). The filesystem and network state are secondary — the authentication logs tell the story.

Web application compromise (web shell, SQLi, RCE): Priority evidence: web server access and error logs (the exploitation request), the web root directory (web shell files), the process tree (reverse shell parent/child relationships), /tmp and /dev/shm (staged payloads). Authentication logs are secondary — the attacker may not have authenticated through SSH at all.

Cryptomining: Priority evidence: running process list with /proc direct reads (the miner process and its command line), network connections (mining pool connections), CPU utilization data. Log files and filesystem artifacts are secondary — the miner is running now and the evidence is in the process and network state.

Ransomware: Priority evidence: memory dump (encryption keys may still be in process memory), filesystem state (encrypted files, ransom notes, encryption binary), the process tree (the encryption process may still be running). Immediate containment (network isolation) takes priority over collection — every minute the system stays online, more files are encrypted.

Data exfiltration: Priority evidence: network connections (active exfiltration channels), bash history (exfiltration commands), auditd file access records (what files were read), /tmp and /dev/shm (staged data archives). Time-sensitive: the data may be leaving the network right now.

Figure LX1.9 — Triage decision matrix mapping six incident types to three evidence priority tiers. The bottom rule applies universally: if the attacker is currently active, process and network state override all other priorities.
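The incident-type priorities described above can be kept at hand as a small lookup function. This is a sketch only: the function name priority_evidence and the type keywords (credential, web, miner, ransomware, exfil) are illustrative labels chosen for this example, not names used by any tool.

```shell
# priority_evidence TYPE
# Print the first-priority evidence sources for an incident type, one per line.
# TYPE keywords are illustrative labels for the five types discussed above.
priority_evidence() {
    case "$1" in
        credential)
            printf '%s\n' "auth.log / secure" "wtmp and btmp" \
                ".ssh/authorized_keys" "lastlog" ;;
        web)
            printf '%s\n' "web server access and error logs" "web root" \
                "process tree" "/tmp and /dev/shm" ;;
        miner)
            printf '%s\n' "process list via /proc" "network connections" \
                "CPU utilization" ;;
        ransomware)
            printf '%s\n' "CONTAIN FIRST: isolate from network" "memory dump" \
                "filesystem state" "process tree" ;;
        exfil)
            printf '%s\n' "network connections" "bash history" \
                "auditd file access records" "/tmp and /dev/shm" ;;
        *)
            # Unknown type: volatile evidence first, then reassess.
            printf '%s\n' "process state" "network connections" ;;
    esac
}
```

Calling priority_evidence web prints the four web-compromise sources in order. Any unrecognized keyword falls back to the universal volatile evidence.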

Decision Factor 3: How Many Systems Are Affected?

A single compromised server gets the full collection treatment — all phases, all evidence types. But what about 15 compromised servers? Or 50 containers in a Kubernetes cluster?

Multi-system triage: Run UAC ir_triage (fast profile) on every affected system first. This captures the most critical evidence from all systems in a fraction of the time a full collection takes. After triage data is secured from all systems, return to the highest-priority systems for comprehensive collection (memory, full UAC, disk imaging).

The worst outcome in a multi-system incident is spending 3 hours on a full collection of the first server while evidence is being destroyed on the other 14. The better outcome: 15 minutes of triage on each system (3.75 hours in total if the systems are triaged one after another, less if triage runs in parallel), then full collection on the systems where the triage data revealed the most significant compromise.
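The time arithmetic for multi-system triage is easy to check before committing to a plan. The helper below is a minimal sketch; the name triage_minutes and its parameters are assumptions for illustration. It computes wall-clock minutes for a given host count, per-host triage time, and number of hosts triaged concurrently.

```shell
# triage_minutes HOSTS MINUTES_PER_HOST [PARALLEL]
# Wall-clock minutes needed to triage HOSTS systems when PARALLEL of them
# run at the same time (default 1, i.e. strictly one after another).
triage_minutes() (
    hosts=$1
    per_host=$2
    parallel=${3:-1}
    # Ceiling division: number of batches needed to cover all hosts.
    batches=$(( (hosts + parallel - 1) / parallel ))
    echo $(( batches * per_host ))
)
```

For 15 hosts at 15 minutes each, triage_minutes 15 15 prints 225 (3.75 hours), triage_minutes 15 15 4 prints 60, and triage_minutes 15 15 15 prints 15. The parallel figures assume the triage is scripted and that your workstation and network can absorb the concurrent output.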

Decision Factor 4: What Access Do You Have?

Your available access determines which collection methods are feasible:

SSH access only (most common): Full live response, UAC, remote disk imaging. Memory acquisition requires transferring a pre-compiled LiME module. All evidence streams to your forensic workstation over SSH.

Cloud console only (no SSH): Disk snapshot via API, cloud audit trail collection, security group and IAM configuration export. No live volatile collection — you cannot run commands on the system. If the investigation requires volatile evidence, you must arrange SSH access first.

kubectl only (container clusters): Pod-level collection only. No host access. If container escape is suspected, you need node-level SSH access from the cluster administrator.

Physical access only (air-gapped environments): USB-based collection. LiME from USB. UAC from USB with output to USB. Disk imaging with write blocker. No remote streaming.
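The access constraints above reduce to two questions: which collection methods are feasible, and whether live volatile collection is possible at all. A hedged sketch follows; the function names and the access keywords (ssh, console, kubectl, physical) are illustrative choices, not part of any tool.

```shell
# collection_methods ACCESS
# Print the collection methods feasible for an access type, one per line.
collection_methods() {
    case "$1" in
        ssh)      printf '%s\n' "full live response" "UAC" "remote disk imaging" \
                      "memory via a pre-compiled LiME module" ;;
        console)  printf '%s\n' "disk snapshot via API" "cloud audit trail" \
                      "security group and IAM configuration export" ;;
        kubectl)  printf '%s\n' "pod-level collection" ;;
        physical) printf '%s\n' "LiME from USB" "UAC from USB" \
                      "disk imaging with write blocker" ;;
        *)        echo "unknown access type: $1" >&2; return 1 ;;
    esac
}

# volatile_possible ACCESS
# Succeed (exit 0) if commands can be run on the target, so volatile
# evidence is reachable. For kubectl this is limited to the pod level.
volatile_possible() {
    case "$1" in
        ssh|kubectl|physical) return 0 ;;
        *) return 1 ;;
    esac
}
```

A quick gate such as volatile_possible console || echo "arrange SSH access first" makes the cloud-console gap explicit before you plan the collection.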

The First 5 Minutes: Triage Commands

Worked artifact — Triage decision worksheet:

Complete this worksheet in the first 5 minutes of the incident. It determines your collection strategy.

Case: INC-2026-XXXX
Time of initial alert: [UTC]

Factor 1: Attacker activity
- Attacker currently active: ☐ Yes (evidence: ___) ☐ No ☐ Unknown
- If active → capture process + network state on active systems FIRST

Factor 2: Incident type (best assessment from initial indicators)
- ☐ SSH brute force / credential compromise → auth logs priority
- ☐ Web application compromise → web logs + web root + process tree
- ☐ Cryptomining → process + network state + CPU
- ☐ Ransomware → CONTAIN IMMEDIATELY + memory dump
- ☐ Data exfiltration → network state + bash history + auditd
- ☐ Container escape → container export + host investigation
- ☐ Unknown → volatile data first, assess type from evidence

Factor 3: Scope
- Systems affected: ___ (count)
- If >3 systems → UAC ir_triage on ALL first, then deep collection on priority systems
- Priority system for deep collection: ___ (reason: ___)

Factor 4: Access
- ☐ SSH ☐ Cloud console only ☐ kubectl only ☐ Physical only
- Collection methods available based on access: ___

Triage decision: Collect ___ from ___ first, then ___ from ___
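The worksheet logic can be expressed as a short decision function. This is a sketch under stated assumptions: the function name triage_decision and its argument keywords are invented for the example, and the order of precedence (ransomware containment, then active attacker, then scope) is one reading of the rules in this section rather than a definitive rule set.

```shell
# triage_decision ACTIVE TYPE COUNT
#   ACTIVE: yes | no | unknown
#   TYPE:   incident type keyword, e.g. ransomware, web, unknown
#   COUNT:  number of affected systems
# Prints the first action suggested by the worksheet rules.
triage_decision() (
    active=$1
    type=$2
    count=$3
    if [ "$type" = "ransomware" ]; then
        # Containment overrides collection.
        echo "contain immediately, then memory dump"
    elif [ "$active" = "yes" ]; then
        echo "capture process and network state on active systems first"
    elif [ "$count" -gt 3 ]; then
        echo "UAC ir_triage on all systems, then deep collection on priority systems"
    else
        echo "standard collection sequence, volatile evidence first"
    fi
)
```

For the opening scenario with attacker status still unknown, triage_decision unknown unknown 3 falls through to the standard sequence. Once an active session is confirmed, the second branch applies.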

Decision points: triage scenarios

Single system, attacker active, incident type unknown: Capture volatile evidence immediately (process state, network connections, /proc deep reads). The volatile evidence will reveal the incident type — a reverse shell on port 4444 indicates RCE, a mining pool connection indicates cryptomining, an active SCP to an external IP indicates exfiltration. Let the evidence tell you what happened, then collect the incident-type-specific evidence from the triage matrix.

15 systems, one analyst, attacker status unknown: UAC ir_triage on all 15 systems in parallel (scripted SSH deployment, ~15 minutes total). Review the triage output from all systems: identify which systems show active attacker processes, which show the most recent compromise indicators, and which are dormant. Deep-collect the 2-3 highest-priority systems. The triage data from the remaining 12 is sufficient for initial scoping and may be sufficient for the complete investigation.
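The scripted parallel deployment in the 15-system scenario could look roughly like the helper below. It is a sketch, not a finished deployment script: fan_out is a hypothetical name, and the per-host command you pass to it (for example a wrapper script that copies UAC to the host, runs the ir_triage profile, and retrieves the output) is assumed to exist separately. The helper relies on the -P option, which GNU and BSD xargs provide but POSIX does not require.

```shell
# fan_out HOSTS_FILE JOBS COMMAND [ARG...]
# Run COMMAND once per host listed in HOSTS_FILE (one hostname per line),
# with at most JOBS invocations running at the same time. The hostname is
# appended as the last argument of each invocation.
# COMMAND must be an executable (xargs cannot call shell functions).
fan_out() (
    hosts_file=$1
    jobs=$2
    shift 2
    # -n 1: one hostname per invocation; -P: concurrency limit.
    xargs -n 1 -P "$jobs" "$@" < "$hosts_file"
)
```

A dry run such as fan_out hosts.txt 15 echo triage shows which invocations would be launched before you substitute the real per-host wrapper. Here hosts.txt is an assumed file name.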

Business-critical server, management pressure to restore immediately: Triage collection (UAC ir_triage) takes 8 minutes and does not disrupt the server. Run triage, then allow the system to return to service. Comprehensive collection (memory dump, full UAC, disk imaging) can be performed later if triage reveals insufficient evidence — but the triage captures the most volatile data before it is lost to the system returning to normal operation.

Ransomware detected, encryption in progress: Containment overrides collection. Network isolate the system immediately (pull the cable, disable the network interface, or firewall it). Then: memory dump (encryption keys may be in process memory), process tree (identify the encryption binary), filesystem state (ransom notes, encrypted file count for scope assessment). Every second the system stays connected, more files are encrypted and the attacker may pivot to other systems.

Troubleshooting: common triage decision issues

You cannot determine if the attacker is active. Run the fastest check: ssh target "w; ss -tnp | grep ESTAB". The w output shows logged-in users and their current processes. The ss output shows established TCP connections. If either shows unexpected sessions or connections, treat the attacker as active. If both appear normal, proceed with standard priority but maintain the assumption that the attacker may have disconnected recently — volatile evidence still degrades.
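To make "unexpected connections" quicker to spot, the ss output can be filtered down to established connections whose peer is outside loopback and RFC 1918 private space. The filter below is a minimal sketch: external_peers is a hypothetical name, it assumes the default ss -tn column layout (state first, peer address fifth), and it handles IPv4 only.

```shell
# external_peers
# Read `ss -tn` or `ss -tnp` output on stdin and print the peer address of
# every ESTAB connection whose peer is not loopback or RFC 1918 private.
# IPv4 only: IPv6 peers other than ::1 are printed unfiltered for review.
external_peers() {
    awk '
        $1 == "ESTAB" {
            peer = $5
            if (peer ~ /^127\./) next
            if (peer ~ /^10\./) next
            if (peer ~ /^192\.168\./) next
            if (peer ~ /^172\.(1[6-9]|2[0-9]|3[01])\./) next
            if (peer ~ /^\[::1\]/) next
            print peer
        }'
}
```

It can be fed from a remote host, for example ssh target "ss -tn" | external_peers. An empty result does not prove the attacker is absent, since an attacker may connect from or through internal addresses.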

The initial alert does not clearly indicate the incident type. Collect volatile evidence first (it answers the incident type question). Process state and network connections reveal active attacker activity. Auth logs reveal credential compromise. Web logs reveal web application exploitation. Collect the universal volatile evidence, then let the evidence guide your incident-type-specific collection.

Management is asking for answers before you have finished collecting. Provide what you have. If you ran the volatile collection, you can report: "We have confirmed [active/inactive] attacker presence. [N] suspicious processes identified. [N] suspicious network connections to [external IPs]. Evidence collection is ongoing and a full report will follow." Do not delay collection to write reports — the evidence degrades while you write.

You started deep collection on the wrong system. Stop. Run UAC ir_triage on the system that actually matters. Return to the original system later if needed. Triage on the right system is more valuable than deep collection on the wrong one.

Beyond this investigation: LX16 (IR Readiness) formalizes this triage framework into the IR playbook: documented decision trees for each incident type that guide containment timing, evidence priorities, and escalation triggers.

Myth: "You must collect a full disk image from every compromised system, or the investigation is incomplete."

Reality: Full disk imaging is the gold standard for legal proceedings and deleted file recovery. For operational incident response — determining what happened, containing the threat, and restoring service — triage collection (UAC ir_triage or equivalent) is sufficient for 80% of investigations. The triage captures running processes, network state, authentication logs, persistence mechanisms, and key configuration files. Disk imaging adds unallocated space (deleted files) and comprehensive filesystem metadata (timeline generation). If triage answers the investigation questions, the disk image adds time cost without proportional evidence value. Collect based on what the investigation needs, not based on a theoretical maximum.

Try it yourself

Run a triage scenario. Set a timer for 10 minutes. Scenario: WEBSRV-NGE01 is showing suspicious outbound connections to an unknown IP on port 4444. The attacker may be active. You have SSH access. What do you collect in those 10 minutes? Write down your first 5 actions in order, with the exact commands. After the timer, review: did you capture the most volatile evidence first? Did you identify the attacker's process? Did you capture the C2 connection details? This exercise builds the muscle memory for real-time triage decisions.

Beyond This Investigation

The triage framework applies to every scenario module. LX4–LX13 each begin with an initial alert and a time-constrained collection phase. The triage decisions you make in that initial phase determine the quality of evidence available for the analysis that follows. Investigators who triage effectively have more evidence, better evidence, and faster investigation outcomes.

Check your understanding:

1. The attacker is currently active on a compromised server. You have SSH access and a pre-compiled LiME module. What are your first three collection actions?
2. You are the sole investigator for 8 compromised Linux servers. What collection strategy do you use, and why?
3. A web server compromise has been detected through an alert on outbound connections. You do not know the specific incident type yet. What is your initial evidence collection priority?
4. You have cloud console access but no SSH access to a compromised AWS EC2 instance. What evidence can you still collect, and what is the most critical gap?

Decision point

You are investigating a Linux server and discover evidence of both a cryptominer (resource abuse) and an SSH key theft (lateral movement preparation). The cryptominer is consuming 95% CPU and impacting production. Which do you address first?

Address the lateral movement first. The cryptominer is visible, noisy, and contained to this server: it is causing performance impact but not spreading. The SSH key theft is silent, potentially already exploited, and may have given the attacker access to additional servers. Contain the lateral movement risk: rotate the stolen SSH keys, check the target servers for unauthorized access, and apply network restrictions. Then address the cryptominer: kill the process, remove the binary and persistence mechanisms. Prioritizing the noisy but contained threat over the silent but spreading threat is a common Linux IR prioritization mistake.
