SA0.1 Why Most SOCs Don't Automate (And Why They Should)
Figure SA0.1 — The SOC scalability problem. Manual operations cannot scale with alert volume. Automation shifts analyst time from mechanical repetition to judgment-based investigation and threat hunting.
The math that breaks every SOC
Start with the numbers. Not estimates — the actual operational math that determines whether your SOC can function.
Northgate Engineering’s SOC receives approximately 500 alerts per day across Microsoft Sentinel and Defender XDR. This is not unusual for an 810-person organization with M365 E5 licensing, Defender for Endpoint on 865 workstations, Defender for Office 365 processing 12,000 emails per day, Defender for Identity monitoring 4 domain controllers, and Sentinel ingesting logs from Palo Alto firewalls, Entra ID, and Azure activity. The alert volume is a function of the attack surface — more data connectors, more analytics rules, more alerts. This is working as intended.
Each alert requires triage. The analyst opens the incident, reads the alert description, checks the entities (user, IP, host), runs enrichment queries (sign-in history, risk level, TI lookup), makes a classification decision (true positive, false positive, benign true positive), documents the decision, and either closes the alert or escalates to investigation. For a straightforward false positive — the weekly report from the marketing analytics platform that triggers a “bulk download” alert because it exports 500 rows from SharePoint — this takes 3 minutes. For an AiTM alert that requires checking sign-in logs, MFA claims, audit logs for persistence, mailbox audit for data access, and cross-referencing with the user’s manager about expected travel — this takes 25 minutes.
The average across all alert types is approximately 10 minutes. This is consistent with industry benchmarks. The SANS SOC Survey consistently reports mean triage times between 8 and 15 minutes per alert across mature SOCs.
Three analysts. Eight-hour shifts. Twenty-four available analyst-hours per day. At 10 minutes per alert, a single analyst triages 48 alerts per shift. Three analysts triage 144 alerts per day. The remaining 356 alerts are not triaged. They age in the queue. Some are false positives that nobody needs to care about. Some are the early indicators of an active attack that will escalate into a breach while sitting unread in the incident queue.
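The arithmetic above is worth making explicit. A few lines of Python reproduce it, using the NE figures from the text:

```python
# Triage capacity math from the NE scenario (500 alerts/day, 3 analysts,
# 8-hour shifts, 10 minutes per alert on average).
ALERTS_PER_DAY = 500
ANALYSTS = 3
SHIFT_HOURS = 8
MINUTES_PER_ALERT = 10

alerts_per_analyst = SHIFT_HOURS * 60 // MINUTES_PER_ALERT  # 480 min / 10 = 48
triaged = ANALYSTS * alerts_per_analyst                     # 144 per day
untriaged = ALERTS_PER_DAY - triaged                        # 356 per day

print(f"Triaged {triaged}/day; {untriaged} untriaged "
      f"({untriaged / ALERTS_PER_DAY:.0%})")
```

Run it with your own numbers and the deficit is immediately visible.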
This is not a failure of the analysts. They are working at capacity. They are not browsing the internet or taking long lunches. They are triaging as fast as they can with the tools they have. The tools require them to do the same manual work for every alert, regardless of whether that work is identical to the work they did on the same alert type yesterday, and the day before, and every day for the last six months.
The five reasons SOCs don’t automate
If the math is this clear, why do most SOCs continue operating manually? Five reasons, each legitimate, each solvable.
Reason 1: Fear of automated action. The most common objection is a scenario: the automation fires at 02:00, classifies a legitimate sign-in as an attack, and disables the CEO’s account during an overseas board meeting. The CEO calls the CTO. The CTO calls the security team. The automation is turned off permanently and the word “automation” becomes politically radioactive.
This fear is legitimate. Automated containment with no safeguards can cause exactly this scenario. The solution is not to avoid automation — it is to build automation with the correct safeguards. The three-tier model separates actions by risk. Tier 1 actions (enrichment) have zero blast radius. You cannot cause an incident by automatically looking up an IP’s reputation. Tier 3 actions (containment) require confidence thresholds, VIP checks, and human approval gates. The CEO scenario is preventable with a 3-line Logic App condition that checks a watchlist before disabling any account.
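The gate logic itself is small. The sketch below is illustrative Python, not Logic App syntax; the watchlist contents, the `containment_decision` name, and the 0.9 threshold are hypothetical placeholders, not a real Sentinel API:

```python
# Illustrative sketch of the safeguard described above: a VIP watchlist
# check runs before any containment action, regardless of confidence.
# VIP_WATCHLIST and CONFIDENCE_THRESHOLD are placeholder values.
VIP_WATCHLIST = {"ceo@northgate.example", "cfo@northgate.example"}
CONFIDENCE_THRESHOLD = 0.9

def containment_decision(user: str, confidence: float) -> str:
    """Decide whether a containment playbook may act automatically."""
    if user.lower() in VIP_WATCHLIST:
        return "require-approval"   # watchlist check before any disable
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto-contain"       # high-confidence detection: act now
    return "require-approval"       # medium/low confidence: human decides
```

Note the property this encodes: even a perfect-confidence detection on a watchlisted account routes to a human, which is exactly what prevents the 02:00 CEO scenario.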
Reason 2: Complexity of the tooling. Logic Apps look intimidating. The designer shows a flow of boxes and arrows with JSON payloads, dynamic content expressions, condition branches, and error handling paths. The learning curve from “I can write a KQL query” to “I can build a Logic App that calls the Graph API with managed identity authentication, extracts entities from a Sentinel incident, queries three data sources in parallel, aggregates the results, and posts an adaptive card to Teams with approve/reject buttons” is steep.
This is a skills gap, not a fundamental barrier. Logic Apps are complex because they do complex things. But the complexity is manageable when you learn it in the right order: start with a simple automation rule (5 minutes to deploy, no code), then build a basic playbook (trigger → query → comment), then add branching logic, then add external integrations. This course follows that progression.
Reason 3: The perfectionism trap. Teams plan automation as a comprehensive project. They design the entire automation stack — enrichment, collection, notification, containment, remediation, reporting — before building anything. The project scope becomes so large that it never starts, or it starts and stalls when the first playbook takes three weeks instead of three days.
The answer is incremental deployment. Build one enrichment playbook. Deploy it. Measure the time saved. Build the next. The 90-day automation roadmap in SA12 is structured this way: Month 1 is enrichment only. Month 2 adds collection and notification. Month 3 introduces the first containment playbook. Each month delivers measurable value.
Reason 4: The broken playbook graveyard. Almost every SOC has at least one dead playbook — a Logic App that someone built last year, that ran for two weeks, that failed silently when an API changed or a token expired, and that nobody fixed because nobody remembered how it worked. The failure breeds distrust in all automation.
This is a governance failure, not an automation failure. Every playbook needs a runbook (what it does, how it works, what can go wrong), health monitoring (alerting on failures), and an owner (someone accountable for fixing it). SA11 covers this in detail. The broken playbook graveyard is preventable.
Reason 5: The paradox of being too busy. The analysts who would benefit most from automation are too busy triaging alerts manually to build it. The team lead who would approve the project is too busy managing escalations to evaluate it. The CISO who would fund it is too busy responding to the board about the last incident to prioritise it.
This is the strongest argument FOR automation, not against it. The time invested in building one enrichment playbook (4-8 hours) saves 5 minutes per alert on every subsequent alert of that type. If the alert type fires 20 times per day, the playbook pays for itself in the first week. Within a month, the team has recovered enough capacity to build the next playbook.
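The payback arithmetic, taking the worst case of the 4-8 hour build estimate:

```python
# Payback calculation for one enrichment playbook (figures from the text).
BUILD_HOURS = 8               # worst case of the 4-8 hour build estimate
SAVED_MIN_PER_ALERT = 5
ALERT_FIRES_PER_DAY = 20

saved_hours_per_day = SAVED_MIN_PER_ALERT * ALERT_FIRES_PER_DAY / 60
payback_days = BUILD_HOURS / saved_hours_per_day
print(f"Saves {saved_hours_per_day:.2f} h/day; "
      f"pays back in {payback_days:.1f} days")
```

Under five days to break even, even at the high end of the build estimate.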
The operational case for automation
Strip away the fear, the complexity, the perfectionism, the broken playbooks, and the busyness. The case for automation is three numbers.
Number 1: Mean Time to Acknowledge (MTTA). Currently 45 minutes at NE. This is the time from alert creation to an analyst opening the incident. Most of this time is queue wait — the alert sits until an analyst finishes their current triage and picks up the next one. With enrichment automation, the MTTA drops to 30 seconds — the time it takes the playbook to enrich the alert with IP reputation, user risk, device compliance, and alert history. The analyst opens an incident that is already investigation-ready, not a raw alert that requires 5 manual queries before they can make a decision.
Number 2: Mean Time to Contain (MTTC). Currently 4+ hours at NE. The analyst triages, escalates to the IR team, the IR team assesses, the IR team executes containment. With automation, Tier 1 containment (session revocation for confirmed AiTM) happens in under 60 seconds. Tier 3 containment (endpoint isolation for suspected ransomware) happens in under 5 minutes with an approval gate. The attacker’s dwell time drops from hours to seconds for high-confidence detections.
Number 3: Analyst time recovered. If automation handles enrichment for all 500 daily alerts (saving 5 minutes each = 41 hours), auto-closes 60% of false positives (saving 3 minutes each for 300 alerts = 15 hours), and auto-collects evidence for 50 incidents per day (saving 10 minutes each = 8 hours), the team recovers approximately 64 analyst-hours per day. That is more than the 24 hours they currently have. The surplus time shifts to investigation, threat hunting, and detection engineering — the work that actually improves the security posture.
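Term by term, the same calculation in Python (the text rounds each term down, hence "approximately 64"; the unrounded total is 65):

```python
# The three savings terms from the text, converted to hours per day.
enrichment = 500 * 5 / 60    # all alerts enriched: ~41.7 h
auto_close = 300 * 3 / 60    # 60% of false positives auto-closed: 15 h
collection = 50 * 10 / 60    # evidence auto-collected: ~8.3 h

total = enrichment + auto_close + collection
print(f"Recovered: {total:.1f} analyst-hours/day")
```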
The Automation Business Case — NE Template
Current state: 500 alerts/day, 3 analysts, MTTA 45min, MTTC 4hr, 71% of alerts untriaged.
Automation investment: 120 hours over 90 days (40 hours/month, 1 analyst at 25% capacity).
Month 1 outcome (enrichment): MTTA reduced from 45min to 5min. Every alert pre-enriched with IP reputation, user risk, device compliance, TI match, and alert history. Analyst opens investigation-ready incidents.
Month 2 outcome (collection + notification): Evidence auto-collected at alert time (SigninLogs, AuditLogs, endpoint data). Teams notifications with adaptive cards. On-call escalation automated. MSSP coordination automated.
Month 3 outcome (first containment): AiTM auto-contained (session revoke + MFA reset) on high-confidence detection. MTTC for AiTM: 45 seconds. Manual containment preserved for medium/low confidence.
90-day result: MTTA 30 seconds (was 45 minutes). 300 alerts/day auto-resolved (was 0). 64 analyst-hours/day recovered. Zero untriaged alerts.
Ongoing cost: 4 hours/month for playbook maintenance, rule tuning, and health monitoring.
The myth: Cyber insurance policies require human involvement in incident detection and response. Automating any part of the response process violates the policy terms and could void coverage in the event of a claim.
The reality: No mainstream cyber insurance policy prohibits automation. What policies require is documented, repeatable incident response processes and evidence that alerts are reviewed in a timely manner. Automation improves both: every automated action is logged with timestamps, every enrichment result is attached to the incident as an auditable comment, and every containment action records who approved it (or that it was auto-approved based on documented confidence thresholds). The insurer wants to see that you respond to alerts. They do not want to see a queue of 356 untriaged alerts per day.
If your broker has raised this concern, ask them to identify the specific policy clause. In every case we have seen, the clause requires response processes — not manual response processes. Automation with proper logging, approval gates, and runbook documentation strengthens your insurance position.
The automation that already exists (and why it’s not enough)
Before building new automation, audit what already exists. Most M365 E5 environments have automation running that nobody on the security team configured or monitors.
Defender XDR Auto Investigation and Response (AIR). When enabled, Defender automatically investigates certain alert types — phishing emails, suspicious processes, compromised accounts. It can auto-remediate (quarantine email, isolate endpoint, disable account) based on its investigation findings. Many organizations have AIR enabled but do not know what it does, what it catches, or what it misses. The action center in the Defender portal shows every automated action — check it.
Defender Attack Disruption. Defender XDR can automatically disrupt in-progress attacks by containing compromised user accounts and isolating devices involved in ransomware or business email compromise. This is powerful automation that runs with no Sentinel playbook involved. But it has limitations: it only triggers on high-confidence attack patterns that Defender’s ML models recognise, and it does not cover custom detection scenarios.
Sentinel automation rules. Some organizations have automation rules that were configured during initial deployment — changing severity on template alerts, assigning incidents to queues, or suppressing known false positives. These are lightweight rules that do not require Logic Apps and often work well for simple triage acceleration.
The gap is between what exists (platform-native automation for known patterns) and what the SOC needs (custom automation for the specific alert types, enrichment sources, containment actions, and notification workflows that are unique to the organization). This course fills that gap.
Decision point: You are the SOC team lead at Northgate Engineering. Your CISO asks: “Why should I give up one of my three analysts for 25% of their time to build automation when we’re already behind on the alert queue?” The answer is not “automation is important” — that is obvious and unhelpful. The answer is the specific, quantified business case: 120 hours invested over 90 days recovers 64 analyst-hours per day, eliminates the 71% untriaged backlog, and reduces MTTA from 45 minutes to 30 seconds. The alternative is hiring a fourth analyst at £55,000/year who adds 8 hours of capacity but does not solve the scaling problem — at 600 alerts per day next quarter, you are back in deficit. Automation solves the architecture. Headcount does not.
What this course builds
This is not a course about automation concepts. Every module produces deployable artifacts.
SA0-SA1 (this phase) establishes the framework and builds your first playbook. SA2-SA4 build the enrichment pipeline, evidence auto-collection, and notification system. SA5-SA7 build auto-containment for identity, endpoint, and cross-environment scenarios. SA8-SA13 build Defender XDR automation, KQL-driven triggers, Azure Functions for complex logic, testing and governance, and the complete automation program.
By the end of the course, you have a production-ready automation stack: every alert is enriched in 30 seconds, evidence is collected before the analyst opens the incident, stakeholders are notified through the right channel at the right severity, and containment executes with confidence thresholds that prevent the CEO scenario while ensuring the attacker does not operate for hours while you wait for a human to approve the obvious action.
The learner journey is not theoretical. You build every playbook. You test every automation. You deploy to a Sentinel workspace. You measure the results.
Try it: Calculate your SOC's automation deficit
Before proceeding to the next section, calculate your own SOC's numbers. If you do not work in a SOC, use Northgate Engineering's numbers.
- How many alerts per day does your SOC receive? (Check Sentinel: SecurityIncident | where TimeGenerated > ago(30d) | summarize count() / 30)
- What is your average triage time per alert? (Estimate: simple FP = 3 min, complex TP = 25 min, average = 10 min)
- How many analysts do you have? Multiply by 8 for available hours.
- Calculate: (alerts × average_time) - (analysts × 8 hours) = deficit in hours per day.
- What percentage of alerts are untriaged each day?
If the deficit is positive, you need automation. If the deficit is larger than your available analyst-hours, you need it urgently. Write down these numbers — you will reference them throughout the course.
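The deficit formula from the checklist, as a small function with NE's defaults (substitute your own numbers):

```python
# Deficit calculator for the exercise above. Defaults are NE's figures;
# pass your own values to compute your SOC's deficit.
def automation_deficit(alerts_per_day=500, avg_triage_min=10,
                       analysts=3, shift_hours=8):
    required_hours = alerts_per_day * avg_triage_min / 60
    available_hours = analysts * shift_hours
    return required_hours - available_hours  # positive means you are behind

print(f"NE deficit: {automation_deficit():.1f} h/day")
```

For NE, triage demand is roughly 83 hours per day against 24 available, a deficit of about 59 hours.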
Where this goes deeper. SA12 (Building the Automation Program) covers the full business case methodology: quantifying automation ROI, presenting to leadership, identifying automation candidates by workload analysis, and building the 90-day roadmap. The Practical GRC course covers cyber insurance policy requirements in detail — including what insurers actually look for in incident response documentation and how automation strengthens (not weakens) your coverage position.