0.1 Why Most SOCs Don't Automate (And Why They Should)

5 hours · Module 0 · Free
What you already know
You run a SOC or work inside one. You have Sentinel collecting data and analytics rules generating incidents. You triage alerts manually: open the incident, check the entities, run enrichment queries, make a classification decision, document it, close or escalate. This section explains why that workflow breaks at scale and what the alternative looks like.

Scenario

Your Sentinel workspace generates 480 incidents per day across Defender for Endpoint, Defender for Office 365, Entra ID Protection, and custom analytics rules. Your team has three analysts, each working an 8-hour shift. Each analyst can triage roughly 10 incidents per hour when they are focused and the incidents are straightforward. That is 240 incidents triaged per day. The other 240 sit in the queue until tomorrow, when another 480 arrive. By the end of Friday, the backlog stands at roughly 1,200 untriaged incidents. Some of them contain active compromises, and if the backlog keeps growing, the oldest will not be reached until their evidence has aged out of the 30-day Advanced Hunting retention window.

The capacity arithmetic

Every SOC reaches the same structural limit. It is not a staffing problem you can hire your way out of, and it is not a tuning problem you can suppress into submission. It is a fundamental constraint in how manual triage scales against alert volume.

Consider what a single AiTM phishing incident demands from an analyst working without automation. The analyst opens the incident in the Defender portal, reads the alert details, and identifies the affected user. They query SigninLogs for the user's recent authentication history, checking for anomalous IPs, impossible travel, and sign-ins from unfamiliar device types. They check AuditLogs for directory changes made after the compromise: OAuth consent grants and MFA modifications. They query OfficeActivity for new inbox rules, bulk email access, or suspicious file downloads. They cross-reference the source IP against threat intelligence. They determine whether the user is a VIP, an admin, or in a sensitive department. They document findings in the incident comments, classify the alert, and either close it or escalate with a handoff summary.
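The first enrichment step above, reviewing the user's recent authentication history, might look like the following sketch. The `targetUser` and `lookback` values are placeholders rather than values from this course, and the query assumes the standard `SigninLogs` schema populated by the Entra ID connector.

```kql
// Sketch only: targetUser and lookback are hypothetical placeholders.
let targetUser = "user@example.com";
let lookback = 7d;
SigninLogs
| where TimeGenerated > ago(lookback)
| where UserPrincipalName =~ targetUser
// Group by source IP and country so unfamiliar origins stand out
| summarize
    SignIns = count(),
    Failures = countif(ResultType != "0"),   // ResultType "0" means success
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated),
    Apps = make_set(AppDisplayName, 10)
    by IPAddress, Country = tostring(LocationDetails.countryOrRegion)
| order by FirstSeen desc
```

An IP address whose FirstSeen falls just before the alert time, and that appears nowhere earlier in the window, is a reasonable place to start. The same query-and-review pattern repeats for each table in the sequence, which is what makes the sequence a candidate for a playbook.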

That sequence takes 15 to 25 minutes for an experienced analyst working a genuine AiTM alert with multiple entity types. A straightforward false positive from a known-safe process takes 3 to 5 minutes. A multi-stage incident with confirmed lateral movement can consume over an hour. Across a blended queue dominated by quick closures, the average lands at roughly 6 minutes per incident, or 10 incidents per analyst-hour. Three analysts, 8 hours each, 10 incidents per hour: 240 incidents per day. When the workspace generates 480, the math fails.

[Figure: SOC Capacity vs. Alert Volume, Daily Deficit]
Analyst capacity: 3 analysts × 8 hrs × 10/hr = 240 incidents triaged per day (50% of daily volume).
Daily alert volume (Sentinel + Defender XDR): 480 incidents generated per day, with peaks of 700+ on phishing waves.
Daily deficit: 240 untriaged per day (some false positives, some active attacks), a 1,200+ backlog by Friday.

Figure 0.1 — The daily capacity deficit in a 3-analyst SOC. Half the incidents go untriaged, and the backlog compounds daily.

None of this reflects analyst quality. Your team is working at full capacity. The constraint is structural: the workflow requires a human to perform the same enrichment sequence for every incident, even when that sequence is identical to the one they ran on the same alert type yesterday. Hiring a fourth analyst adds 80 incidents per day of capacity and reduces the deficit to 160. It does not eliminate it; at this volume, closing the gap by headcount alone would take six analysts. At current market rates for experienced SOC analysts, the salary investment buys a one-third increase in throughput for a workflow that still does not scale.
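The capacity arithmetic can be checked for different team sizes with a short query. This sketch uses only the scenario's figures (8-hour shifts, 10 incidents per analyst-hour, 480 incidents per day) and needs no workspace data; substitute your own numbers.

```kql
// Scenario figures from this section; replace with your own numbers.
let ShiftHours = 8;
let IncidentsPerAnalystHour = 10;
let DailyVolume = 480;
// One row per team size from 3 to 6 analysts
range Analysts from 3 to 6 step 1
| extend DailyCapacity = Analysts * ShiftHours * IncidentsPerAnalystHour
| extend DailyDeficit = DailyVolume - DailyCapacity
| extend BacklogAfter5Days = DailyDeficit * 5
```

With these inputs, the three-analyst row shows a capacity of 240, a deficit of 240, and a five-day backlog of 1,200. The deficit only reaches zero at six analysts, and that is at average volume, before any phishing-wave peak.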

What the deficit costs

Not all untriaged incidents are benign. Mandiant's M-Trends 2026 report, based on over 500,000 hours of frontline incident response, found that global median dwell time rose to 14 days, up from 11 days the prior year. In the median case, attackers operate inside a compromised environment for two weeks before detection, even in organizations with SIEMs and active detection programs. The same report revealed that initial access handoff times between threat actor groups have collapsed from over 8 hours in 2022 to just 22 seconds in 2025. The attacker ecosystem has industrialized. Access brokers deliver malware directly on behalf of ransomware operators through automated handoffs, which means the window between initial compromise and active exploitation is shrinking toward zero.

M-Trends 2026 also documented that organizations detecting intrusions internally achieved a median dwell time of 9 days, while those relying on external notification averaged 25 days. That gap is detection capability. The gap between acting on an alert in days and acting on it in minutes is response capability. A SOC that detects a compromise in 9 days but takes another 48 hours to triage the alert still gives the attacker 11 days of operational freedom. The detection was timely. The response was not.

When an incident sits untriaged for 48 hours, the volatile evidence decays. Sign-in tokens expire. Process execution logs roll past retention windows. The attacker, having established persistence through an inbox rule or an OAuth consent grant, has already progressed to lateral movement or data exfiltration. By the time an analyst reaches the incident, the initial access evidence is thinner, the scope of compromise is larger, and the investigation can take several times as long.

This is the compounding cost. The incidents that eventually receive attention have become harder to investigate because the evidence that existed at alert time no longer exists at investigation time. Entra ID access tokens expire after roughly an hour by default. Process trees are overwritten. Network connections close. The 22-second handoff speed Mandiant documented means that by the time an analyst opens a 48-hour-old AiTM alert, a secondary threat group has had ample time to deploy persistence mechanisms, exfiltrate credentials, and begin lateral movement across the environment.

Automation Tier Assessment

Incident: AiTM phishing alert for j.morrison@northgateeng.com — received 09:14 Monday, triaged 14:22 Wednesday

Evidence gap: SigninLogs show the session token was issued at 09:12 Monday. By Wednesday, the token has been refreshed 47 times. The original sign-in context (IP, device, CA evaluation) is still available, but the 53 hours of attacker activity since the alert have produced 312 OfficeActivity records, 28 AuditLog entries, and 3 new inbox rules. What took 15 minutes to triage at alert time now takes 90 minutes to investigate.

Automation comparison: An enrichment playbook running at alert time would have captured the sign-in context, TI verdict, user risk score, and mailbox rule state in 30 seconds. The analyst would have opened a pre-enriched incident requiring 5 minutes of review instead of 90 minutes of reconstruction.

Tier classification: Enrichment (Tier 1, zero blast radius). Every enrichment step in this sequence is deterministic, produces identical output regardless of which analyst runs it, and cannot affect any production system. Full automation is appropriate.
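To see how quickly the evidence volume grows in a case like the one assessed above, a query along the following lines counts mailbox activity for one user since the alert fired. The user, the alert timestamp, and the list of inbox-rule operation names are illustrative assumptions; check the operation names against what your tenant actually logs.

```kql
// Sketch only: targetUser and alertTime are hypothetical placeholders.
let targetUser = "user@example.com";
let alertTime = datetime(2025-01-06 09:14:00);
OfficeActivity
| where TimeGenerated between (alertTime .. now())
| where UserId =~ targetUser
// Count all activity and inbox-rule changes in 6-hour buckets
| summarize
    Records = count(),
    InboxRuleChanges = countif(Operation in ("New-InboxRule", "Set-InboxRule"))
    by bin(TimeGenerated, 6h)
| order by TimeGenerated asc
```

Each additional bucket is material the analyst has to read and rule in or out. Running the same capture at alert time fixes the starting state before it accumulates.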

The five barriers

If the math is this clear, why do most SOCs stay manual? Five barriers recur across organizations of every size.

Blast radius fear. The most common objection. If a playbook disables the wrong account, it causes a business outage. If it isolates the CEO's laptop during a board presentation, the SOC manager gets a call from the CIO. The fear is rational — automated containment without safeguards is dangerous. But the response should not be to avoid automation entirely. It should be to build confidence thresholds and exclusion controls that prevent the playbook from acting on low-confidence alerts or high-impact entities. This is exactly what SA0.4 (confidence thresholds) and SA0.5 (blast radius assessment) teach. The fear exists because most teams conflate all automation with containment automation, when the vast majority of automatable work is zero-risk enrichment.

No confidence measurement. Most analytics rules produce alerts with a severity label (High, Medium, Low) but no quantified confidence score. The analyst reads the alert and decides whether it is real. Automation needs a number. Without a composite confidence score derived from the detection rule's historical true-positive rate, signal source reliability, and environmental context, there is no safe basis for automated action. SA0.4 introduces the scoring model that solves this: a weighted additive system where each enrichment signal contributes to a composite score, and the score determines which automation tier is safe to execute.
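A weighted additive score of this kind can be sketched in a few lines. The signals, point weights, and tier cut-offs below are invented for illustration and are not the SA0.4 model; only the shape (sum the weights of the signals that fired, then map the total to a tier) follows the description above.

```kql
// Hypothetical signals, weights (points out of 100), and thresholds.
datatable(Signal: string, Weight: long, Present: bool) [
    "Rule has a high historical true-positive rate", 40, true,
    "Source IP matches threat intelligence",         30, true,
    "Sign-in from an unfamiliar device",             20, false,
    "User is flagged as risky",                      10, true
]
// Add up the weights of the signals that fired
| summarize CompositeConfidence = sumif(Weight, Present)
| extend AutomationTier = case(
    CompositeConfidence >= 95, "Containment eligible",
    CompositeConfidence >= 60, "Enrich and queue for analyst decision",
    "Enrich only")
```

With three of the four example signals present, the composite score is 80, which falls in the middle tier under these made-up thresholds. A severity label carries no equivalent number to compare against a threshold.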

Tooling complexity. Logic Apps have a genuine learning curve. Azure permissions for managed identities, connector authentication, dynamic content parsing, and error handling are not intuitive for analysts whose primary skill is KQL and investigation. The gap between "I can triage an alert" and "I can build a Logic App that triages an alert" is real. Most SOCs do not have dedicated automation engineers, and the analysts who could learn the tooling are fully consumed by the manual triage backlog — the exact problem automation would solve. SA1 closes this gap systematically, starting with the simplest possible automation rule and building incrementally.

Organizational inertia. Automation is a long-term investment with cumulative payoff. Building an enrichment playbook takes a week. It saves 30 seconds per incident. At 480 incidents per day, that is 4 hours of analyst time recovered daily, or 1,460 hours per year, roughly 70% of a 2,080-hour full-time year. But the benefit only materializes after the playbook is deployed, tested, and trusted. The initial investment feels disproportionate to the daily return, and the project gets deprioritized behind the next critical vulnerability, compliance audit, or executive request.
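The payback arithmetic in the paragraph above is easy to rerun with your own volume and per-incident saving. In this sketch, `FullTimeHoursPerYear` is an assumption (40 hours × 52 weeks, before leave), and the other inputs are the figures quoted above.

```kql
// FullTimeHoursPerYear is an assumption (40 hours x 52 weeks).
let FullTimeHoursPerYear = 2080.0;
print IncidentsPerDay = 480, SecondsSavedPerIncident = 30
| extend HoursSavedPerDay = IncidentsPerDay * SecondsSavedPerIncident / 3600.0
| extend HoursSavedPerYear = HoursSavedPerDay * 365
| extend AnalystEquivalent = round(HoursSavedPerYear / FullTimeHoursPerYear, 2)
```

Because the saving scales linearly with both inputs, a playbook that removes minutes rather than seconds per incident changes the result by an order of magnitude.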

Perpetual deferral. Automation is never urgent. Every day has a more pressing task: the P1 incident, the vendor assessment, the audit finding, the rule that needs tuning. Automation sits on the roadmap quarter after quarter, perpetually deferred because manual triage, while inefficient, works well enough today. The cost is invisible until the backlog produces a missed breach, and by then the failure is attributed to detection coverage or analyst error rather than the workflow that made the breach undetectable within the response window.

What automation changes

IBM's 2025 Cost of a Data Breach Report quantifies the impact directly. Organizations using AI and automation extensively experienced breach costs of $3.62 million, compared to $5.52 million for organizations without — a savings of $1.9 million per breach. Those same organizations shortened the breach lifecycle by 80 days on average. The global average breach cost dropped to $4.44 million, a 9% decrease driven primarily by faster identification and containment through automated defenses.

The mechanism is not that automation detects breaches faster. Detection depends on analytics rules, their coverage, and their signal fidelity. Automation compresses the time between detection and response — the window where the attacker operates freely. Given Mandiant's finding that initial access handoffs now happen in 22 seconds, every minute of delay between alert generation and response action represents attacker operational time.

KQL
// Calculate your SOC's automation deficit
// Run this against your Sentinel workspace to quantify the gap
SecurityIncident
| where TimeGenerated > ago(30d)
// SecurityIncident logs a new row on every incident update,
// so keep only the latest record for each incident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where CreatedTime > ago(30d)
// Null for incidents that are still open; avg() and percentile() skip nulls
| extend CloseTimeHours = datetime_diff('hour', ClosedTime, CreatedTime)
| summarize
    DailyIncidents = count() / 30.0,
    AvgCloseTimeHours = avg(CloseTimeHours),
    MedianCloseTimeHours = percentile(CloseTimeHours, 50),
    StillOpen = countif(Status != "Closed")
// DailyIncidents: your average daily volume
// AvgCloseTimeHours: mean time from creation to closure
// MedianCloseTimeHours: median (more resistant to outliers)
// StillOpen: incidents from the last 30 days that remain unresolved

Run this query against your own workspace. The StillOpen count is your backlog. The AvgCloseTimeHours divided by your target SLA gives you the capacity ratio. If average close time is 36 hours and your SLA is 4 hours, you are operating at 9x your target, and no amount of analyst effort closes that gap without changing the workflow.
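The capacity ratio can also be computed directly in the query. This is a sketch under one assumption: `TargetSlaHours` is a placeholder for your own SLA, set here to the 4-hour figure from the example above.

```kql
// TargetSlaHours is an assumption; set it to your own SLA.
let TargetSlaHours = 4.0;
SecurityIncident
| where TimeGenerated > ago(30d)
// One row per incident: keep the most recent record for each
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed" and CreatedTime > ago(30d)
| summarize AvgCloseTimeHours = avg(datetime_diff('minute', ClosedTime, CreatedTime) / 60.0)
| extend CapacityRatio = round(AvgCloseTimeHours / TargetSlaHours, 1)
```

Only closed incidents contribute to the average, so a large open backlog makes the ratio look better than it is. Read it alongside the StillOpen count, not instead of it.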

In practical terms, automation changes three things. First, mean time to acknowledge (MTTA) drops from minutes or hours to seconds. An enrichment playbook that runs at incident creation attaches context before any analyst opens the incident. The analyst reads the enrichment output, not the raw alert. Second, evidence is captured at alert time, not investigation time. Volatile data that would have decayed by the time an analyst reaches a 48-hour-old incident is preserved automatically. Third, containment for high-confidence incidents executes in seconds. Session revocation for an AiTM incident with 95% composite confidence and a clean blast radius assessment fires within seconds of the alert, which cuts the attacker's window for establishing persistence from days to moments.

The cumulative effect is structural. Analysts stop being data gatherers and become decision makers. Repetitive work that consumed 70% of their time is handled by playbooks. The hours they recover go to investigation, hunting, and detection engineering — the activities that improve the SOC's long-term capability rather than maintaining its daily throughput.

Anti-Pattern

The tuning trap

SOCs that recognize the capacity problem often reach for tuning first: suppress the noisiest rules, raise severity thresholds, disable low-value analytics. Tuning helps, but it has diminishing returns. Once the obvious false positive sources are suppressed, further tuning removes genuine detection coverage. You trade alert volume for visibility. The better approach is to automate the triage of alerts you want to keep, not to suppress the alerts to fit your manual capacity.
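One way to separate tuning candidates from automation candidates is to look at closed incidents per title alongside how they were classified. This sketch assumes analysts set the Classification field consistently when closing incidents, and it uses the incident Title as a rough proxy for the generating rule; both are assumptions to verify in your workspace.

```kql
// Sketch only: assumes Classification is filled in consistently on closure.
SecurityIncident
| where TimeGenerated > ago(30d)
// One row per incident: keep the most recent record for each
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed"
| summarize
    Incidents = count(),
    FalsePositives = countif(Classification == "FalsePositive"),
    TruePositives = countif(Classification == "TruePositive")
    by Title
| extend FalsePositiveRate = round(100.0 * FalsePositives / Incidents, 1)
| top 20 by Incidents desc
```

High-volume titles that are almost entirely false positives are reasonable tuning targets. High-volume titles with a meaningful share of true positives are the ones to keep and automate.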

Automation Principle

Automate the workflow, not the judgment. Enrichment and evidence collection are deterministic sequences that produce identical output regardless of which analyst runs them. Automate those completely. Classification and containment require confidence measurement and blast radius controls. Automate those conditionally. The boundary between the two is the confidence threshold, and setting it correctly is the core skill this course teaches.

Next
Section 0.2 maps the automation spectrum from fully manual to fully autonomous, defining the five levels that every automation action falls into and the organizational readiness each level requires. You will place your own SOC on the spectrum and identify the level that matches your current capability.