10.5 Incident Management and Investigation Workflow
Introduction
Required role: Microsoft Sentinel Contributor for analytics rules; Microsoft Sentinel Responder for incident management.
Analytics rules create incidents. This subsection teaches what happens next — the lifecycle from the moment an incident appears in the queue through triage, assignment, investigation, classification, and closure. Effective incident management is the difference between a SOC that resolves threats quickly and one that drowns in an unmanaged queue.
The incident lifecycle
Stage 1: Creation. An analytics rule fires, an alert is generated, and the alert is grouped into an incident (new or existing, based on alert grouping configuration). The incident appears in the incident queue with status “New” and the severity defined by the rule.
Stage 2: Triage. An analyst reviews the incident queue, sorted by severity and age. For each new incident, triage answers one question: “Does this require investigation?” The analyst examines the alert title, severity, mapped entities, and custom details. If the alert is an obvious false positive (known benign activity matching the rule), the analyst classifies it as False Positive and closes it. If the alert requires investigation, the analyst assigns it (to themselves or another analyst) and changes the status to “Active.”
Stage 3: Investigation. The assigned analyst investigates by following the evidence. The investigation graph visualises entities and their relationships. The analyst queries related data, correlates with other alerts, checks entity timelines, and builds the attack narrative. Investigation bookmarks capture important findings for the incident record.
Stage 4: Classification. After investigation, the analyst classifies the incident. True Positive: a confirmed security threat — with a determination recording the specific threat type (phishing, malware, data exfiltration, etc.). False Positive: benign activity incorrectly flagged. Benign Positive: a real security event that was expected or authorised (e.g., a penetration test). Undetermined: insufficient evidence to classify (rare — minimise this through thorough investigation).
Stage 5: Closure. The analyst adds a closing comment documenting the findings, actions taken, and any remediation performed. Status is set to “Closed.” The incident record is retained for the workspace retention period, available for future reference and trend analysis.
The incident queue
Navigate to Sentinel → Incidents. The queue shows all incidents with filters for status, severity, owner, product, and time range.
Queue management best practices:
Filter by “New” status to see untriaged incidents. Sort by severity (High → Medium → Low) to prioritise. Use the age column to identify incidents that have been New for too long — anything untriaged for more than 4 hours during business hours indicates insufficient SOC capacity or attention.
Incident details pane. Click any incident to open the details pane without leaving the queue. The pane shows: alert count, entity count, timeline (when alerts were generated), entities (clickable cards), and alert details (with custom details from subsection 10.4). Most triage decisions can be made from the details pane without opening the full investigation page.
The investigation graph
The investigation graph is Sentinel’s visual investigation tool. It displays entities from the incident and their relationships to other entities and alerts.
How it works: Open an incident → click “Investigate.” The graph starts with the incident’s entities (the Account, IP, Host mapped in the analytics rule). Each entity can be expanded to show related entities and alerts. Expanding an Account entity shows: other IPs this account signed in from, other devices this account used, other incidents involving this account, and timeline events.
Investigation workflow using the graph:
Step 1: Start with the primary entity (usually the Account or IP that triggered the alert). Step 2: Expand to see relationships — what else has this entity done recently? Step 3: Follow suspicious paths — if the account signed in from a known-malicious IP, expand that IP to see which other accounts it accessed. Step 4: Identify the scope — how many entities are affected? Is this an isolated incident or part of a larger campaign? Step 5: Add investigation bookmarks for key findings.
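The pivot in Step 3 — expanding a suspicious IP to find the other accounts it touched — can also be run directly as a query. A sketch, assuming the SigninLogs table is ingested; the IP value is a placeholder from the investigation:

```kql
// Sketch: pivot from a suspicious IP (found in the graph) to every account
// that authenticated from it, with success/failure breakdown.
let suspiciousIp = "203.0.113.45";   // hypothetical IP from the investigation
SigninLogs
| where TimeGenerated > ago(14d)
| where IPAddress == suspiciousIp
| summarize SignInCount = count(),
            FirstSeen = min(TimeGenerated),
            LastSeen = max(TimeGenerated)
    by UserPrincipalName, ResultType
| order by FirstSeen asc
```

Any account appearing here that is not already part of the incident is a candidate for expansion in the graph.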
Investigation bookmarks
Bookmarks capture specific KQL query results as evidence within an incident. When an analyst runs a query during investigation and finds a critical result (the exact sign-in event showing the compromise, the inbox rule creation, the data exfiltration), they save it as a bookmark.
Creating a bookmark: Run a KQL query in the Logs blade. Select the relevant rows. Click “Add bookmark.” Add a note explaining why this evidence is significant. Link the bookmark to the active incident.
Bookmarks persist. Unlike query results (which disappear when you close the tab), bookmarks are saved in the workspace and attached to the incident. When the incident is reviewed later — by a supervisor, by a different analyst, or during a post-incident review — the bookmarks show exactly what evidence was found and when.
Bookmarks in the investigation graph. Bookmarked entities appear in the investigation graph alongside alert entities. This enriches the graph with analyst-discovered evidence beyond what the analytics rule initially identified.
Investigation standard operating procedure
Standardise the investigation workflow so every analyst follows the same steps — ensuring consistency, completeness, and evidence quality.
Step 1: Read the alert (30 seconds). Review the alert title, severity, mapped entities, and custom details in the incident details pane. Determine the initial hypothesis: what does this alert mean? What attack stage does it represent?
Step 2: Context check (2 minutes). Before deep investigation, check context: Is this entity involved in other open incidents? (Check Related incidents in the incident pane.) Has this analytics rule generated false positives recently? (Check rule history.) Is this entity in the VIP watchlist or departing employee watchlist?
Step 3: Entity timeline investigation (5-10 minutes). For each mapped entity, review the timeline:
For Account entities: What did this user do before and after the alert? Check SigninLogs, AuditLogs, CloudAppEvents. Look for: unusual sign-in locations, new inbox rules, new app consents, password changes.
For IP entities: What else came from this IP? Is it associated with other compromised accounts? Check ThreatIntelligenceIndicator for known-malicious status. Check geolocation — does the location match the user’s expected location?
For Host entities: What processes ran? What network connections were made? Check SecurityEvent or DeviceProcessEvents for the time window around the alert.
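The known-malicious check for IP entities mentioned above can be sketched as a direct lookup against ingested threat intelligence, assuming a TI connector feeds the ThreatIntelligenceIndicator table; the IP is a placeholder:

```kql
// Sketch: check an IP against active threat intelligence indicators.
let suspiciousIp = "203.0.113.45";   // placeholder IP from the investigation
ThreatIntelligenceIndicator
| where TimeGenerated > ago(30d)
| where Active == true
| where NetworkIP == suspiciousIp or NetworkSourceIP == suspiciousIp
| project TimeGenerated, Description, ThreatType, ConfidenceScore, ExpirationDateTime
```

No rows back does not prove the IP is benign — only that it is not in your ingested feeds.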
Step 4: Scope assessment (5 minutes). Determine the blast radius. Is this an isolated incident affecting one user, or a campaign affecting multiple users? Run cross-entity queries:
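A cross-entity scope query might look like the following sketch: find every account seen from the indicator IP, then check whether any of them created inbox rules. Assumes SigninLogs and the Office 365 connector (OfficeActivity table); the IP is a placeholder:

```kql
// Sketch: blast-radius check. Accounts touched by the indicator IP,
// cross-referenced against inbox rule creation in the same window.
let suspiciousIp = "203.0.113.45";   // placeholder indicator
let affected = SigninLogs
    | where TimeGenerated > ago(24h)
    | where IPAddress == suspiciousIp
    | distinct UserPrincipalName;
OfficeActivity
| where TimeGenerated > ago(24h)
| where Operation == "New-InboxRule"
| where UserId in (affected)
| project TimeGenerated, UserId, Operation, Parameters
```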
Step 5: Evidence collection (ongoing). Throughout the investigation, bookmark every significant finding. Each bookmark should have a note explaining its relevance: “This is the initial compromise sign-in — unusual location (Lagos, Nigeria), risk level High, followed by inbox rule creation 3 minutes later.”
Step 6: Containment decision (immediate if confirmed). If the investigation confirms compromise: trigger the containment playbook (or perform manual containment — revoke sessions, reset password, isolate device). Document the containment actions in incident comments.
Step 7: Classification and closure. Classify the incident (TP/FP/BP). Add a closing comment summarising: what happened, which entities were affected, what actions were taken, and any recommendations for prevention.
Multi-incident correlation
Sometimes a single attack generates multiple incidents across different analytics rules. An AiTM phishing attack may trigger: a suspicious sign-in rule, an inbox rule creation rule, and a mail forwarding rule — three separate incidents for one attack.
Manual correlation: Review the incident queue for incidents involving the same entities within a close time window. If two incidents share the same Account entity and occurred within 1 hour, they are likely related. Merge or link them.
Automated correlation: Use automation rules to tag incidents with shared entities. If two incidents both involve j.morrison@northgateeng.com within 2 hours, the automation rule adds a tag “related-j.morrison” to both. The analyst sees the tag and investigates both together.
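The shared-entity check that such an automation rule relies on can be approximated in KQL by parsing alert entities. A sketch — the Entities JSON schema varies by alert provider, so treat the field names as assumptions to verify against your data:

```kql
// Sketch: find recent alerts that share an Account entity — candidates
// for correlation into a single investigation.
SecurityAlert
| where TimeGenerated > ago(2h)
| mv-expand Entity = todynamic(Entities)
| where tostring(Entity.Type) == "account"
| extend Account = strcat(tostring(Entity.Name), "@", tostring(Entity.UPNSuffix))
| summarize AlertIds = make_set(SystemAlertId) by Account
| where array_length(AlertIds) > 1
```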
Incident comments best practices
Every incident should have comments documenting the investigation. Comments serve three purposes: they communicate findings to other analysts who may review the incident, they provide evidence for post-incident reporting, and they create an audit trail for compliance.
Opening comment (at triage): “Triaged at [time]. Initial assessment: [hypothesis]. Assigned to [analyst]. Priority: [reason for urgency or lack thereof].”
Investigation comments (during investigation): “Checked user sign-in history — confirmed sign-in from unusual IP [IP] at [time]. Previous sign-ins were all from UK office IPs.” Each significant finding gets a comment — not just a bookmark.
Containment comment (if actions taken): “Containment actions: revoked all sessions for [user], reset password, disabled account pending investigation. Playbook [name] executed automatically.”
Closing comment: “Classification: True Positive — AiTM credential phishing. Summary: [2-3 sentence summary]. Actions taken: [list]. Recommendations: [any prevention suggestions].”
Evidence chain documentation
For incidents that may require legal or compliance review, maintain a documented evidence chain.
What to document: Every KQL query run during investigation (save the query text in a bookmark or comment). Every entity timeline reviewed. Every containment action taken (automated or manual), with the exact timestamp. Every external communication (with the affected user, with management, with legal). Every file or screenshot captured as evidence.
Where to document: Incident comments for investigation findings and actions. Investigation bookmarks for specific query results. If your organisation requires formal evidence packages (for legal proceedings or regulatory reporting), export the incident timeline and bookmarks to a structured document — the IR report template from Module 16 (when built) provides this format.
Timestamp everything. The difference between “we contained the threat quickly” and proving it to an auditor is timestamps. Comment format: “[2026-03-22 14:35 UTC] Revoked refresh tokens for j.morrison@northgateeng.com via Graph API. Confirmed session termination by verifying no new sign-ins after revocation.”
Cross-portal investigation: Sentinel + Defender XDR
With the unified security operations platform (Module 7.9), investigation flows between Sentinel and the Defender portal.
Start in the incident queue (either portal). Review the incident details, entities, and alerts.
Deep-dive in the appropriate portal. For identity investigation: Sentinel entity pages or Entra ID. For endpoint investigation: Defender for Endpoint device page (timeline, process tree, network connections). For email investigation: Defender for Office 365 email entity page (email headers, attachment detonation, URL reputation). For cloud app investigation: Defender for Cloud Apps activity log.
Pattern: cross-portal investigation for AiTM phishing.
Step 1 (Sentinel): Open incident. Review sign-in alert. Note the suspicious IP and user. Step 2 (Defender portal): Open the user’s email timeline. Search for phishing emails received from external senders in the 24 hours before the suspicious sign-in. Step 3 (Defender portal): If phishing email found, check UrlClickEvents — did the user click the phishing link? Step 4 (Sentinel): Check CloudAppEvents for inbox rule creation or mail forwarding by the compromised account. Step 5 (Defender portal): Check DeviceProcessEvents on the user’s device — was a browser redirected to a credential harvesting page? Step 6 (Sentinel): Run the scope query — how many other users clicked the same phishing link?
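The scope query in Step 6 can be sketched against the UrlClickEvents table (available through the Defender XDR connector); the URL is a placeholder from the email investigation:

```kql
// Sketch: campaign scope — every user who clicked the same phishing URL.
let phishUrl = "https://login.contoso-phish.example";   // placeholder URL
UrlClickEvents
| where Timestamp > ago(7d)
| where Url startswith phishUrl
| summarize Clicks = count(), FirstClick = min(Timestamp) by AccountUpn
| order by FirstClick asc
```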
This cross-portal workflow leverages both Sentinel (sign-in logs, cloud app events, scope assessment) and Defender (email analysis, endpoint timeline) — each portal providing data the other does not.
Incident handover between shifts
When a SOC operates across multiple shifts, incident handover is critical.
Handover requirements: Each active incident must have a current comment documenting: where the investigation stands, what has been done, what needs to be done next, and any pending containment decisions.
Handover query: At shift end, run this to identify incidents requiring handover:
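A sketch of such a handover query, using the SecurityIncident table (which logs one row per incident update, so the latest state is taken with arg_max):

```kql
// Sketch: incidents still open at shift end, latest state per incident.
SecurityIncident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status in ("New", "Active")
| project IncidentNumber, Title, Severity, Status,
          Owner = tostring(Owner.assignedTo),
          AgeHours = datetime_diff("hour", now(), CreatedTime)
| order by AgeHours desc
```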
For each incident in the list: add a handover comment (“Investigation status: confirmed suspicious sign-in from IP X. Awaiting user verification call. Next step: if user confirms they did not sign in, execute containment playbook. If user confirms legitimate travel, classify as Benign Positive.”). Reassign to the incoming analyst.
Incident metrics and SOC performance
Track these metrics to measure SOC effectiveness.
Mean Time to Triage (MTTT). Average time between incident creation and first analyst action (status change from New to Active). Target: under 30 minutes for High severity, under 2 hours for Medium. Measure with:
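A sketch of an MTTT query, approximating first analyst action as the incident's first modification after creation:

```kql
// Sketch: mean time to triage per severity, last 30 days.
SecurityIncident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where CreatedTime > ago(30d) and isnotempty(FirstModifiedTime)
| extend TriageMinutes = datetime_diff("minute", FirstModifiedTime, CreatedTime)
| summarize MTTT_Minutes = avg(TriageMinutes) by Severity
```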
Mean Time to Resolve (MTTR). Average time between incident creation and closure. Target varies by severity: under 4 hours for High, under 24 hours for Medium, under 72 hours for Low.
True Positive Rate. Percentage of incidents classified as True Positive vs total incidents. Target: above 60% (below 60% indicates excessive false positives — tune analytics rules).
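A sketch of a true positive rate query over closed incidents:

```kql
// Sketch: true positive rate for incidents closed in the last 30 days.
SecurityIncident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed" and ClosedTime > ago(30d)
| summarize Total = count(),
            TruePositives = countif(Classification == "TruePositive")
| extend TruePositiveRatePct = round(100.0 * TruePositives / Total, 1)
```

The same query can be extended with `by Title` to identify which specific rules are dragging the rate down.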
Incident severity adjustment during investigation
The analytics rule assigns initial severity, but investigation may reveal the incident is more or less serious than initially assessed.
Escalate severity when: The investigation reveals a wider blast radius than the alert indicated (one compromised account → five compromised accounts). The compromised entity is a VIP or has privileged access. Evidence of data exfiltration is found alongside the initial access alert. The attack technique matches a known active campaign targeting your industry.
De-escalate severity when: Investigation reveals the alert is a benign positive (pen test, authorised activity). The scope is more limited than initially assessed (one failed login, not a campaign). The entity is low-value (test account, inactive service principal).
Document severity changes in incident comments with rationale: “Escalated from Medium to High: investigation revealed j.morrison is a Global Administrator and inbox rule forwards to external address — potential BEC with privileged access.”
Post-incident review process
For High-severity true positive incidents, conduct a post-incident review (PIR) within 5 business days of closure.
PIR agenda:
Timeline reconstruction: what happened, when, and in what sequence? Build a chronological timeline from the investigation evidence.
Detection assessment: how quickly did the analytics rule detect the threat? Was the detection the first indicator, or did the threat exist for a period before detection? If there was a delay, can the detection be improved?
Response assessment: how quickly was the incident triaged, investigated, and contained? Were there delays? What caused them? Can the process be improved?
Prevention assessment: could this attack have been prevented? Are there hardening actions (conditional access policy, inbox rule blocking, network segmentation) that would prevent recurrence?
Detection improvement: does this incident reveal a detection gap? Should a new analytics rule be created? Should an existing rule be tuned to detect this pattern earlier?
PIR output: A brief document (1-2 pages) with: incident summary, timeline, lessons learned, and action items with owners and due dates. Action items feed into the detection engineering lifecycle (subsection 10.11) and the security hardening backlog.
Managing the incident queue at scale
As the rule library grows, incident volume increases. Managing 20+ incidents per day requires deliberate queue management.
Queue triage cadence: High-severity incidents: triage within 30 minutes of creation (automation rules should assign and notify immediately). Medium-severity incidents: triage at the start of each 4-hour block (shift start, midday, shift handover). Low-severity incidents: batch triage once per day.
Queue hygiene: Run a daily check for stale incidents — incidents that have been New for more than 24 hours or Active for more than 72 hours without updates. These indicate either analyst overload or incidents that fell through the cracks.
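A sketch of the stale-incident check described above:

```kql
// Sketch: stale incidents — New for over 24h, or Active with no update
// for over 72h.
SecurityIncident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where (Status == "New" and CreatedTime < ago(24h))
     or (Status == "Active" and LastModifiedTime < ago(72h))
| project IncidentNumber, Title, Severity, Status,
          Owner = tostring(Owner.assignedTo), CreatedTime, LastModifiedTime
```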
Bulk operations: Sentinel supports bulk incident operations: select multiple incidents in the queue → assign, change severity, change status, or add tags in bulk. Use for: closing a batch of false positives from a rule that fired during maintenance, reassigning incidents when an analyst goes off-shift, and bulk-tagging incidents for a specific campaign.
Try it yourself
If you have incidents in your Sentinel queue, select one and walk through the complete lifecycle: open the details pane, review entities and custom details, assign the incident to yourself (status → Active), open the investigation graph, expand at least two entities, run a supporting KQL query, create a bookmark from the query results, classify the incident, and close it with a comment. This end-to-end exercise builds the muscle memory for production incident handling.
What you should observe
The investigation graph displays the incident's entities with expansion options. Entity timelines show related activity. Bookmarks appear in the incident's Evidence tab. After closure, the incident metrics (MTTT, MTTR) can be queried using the SecurityIncident table. The complete workflow — from queue to closure — should take 15-30 minutes for a practice incident.
Compliance mapping
NIST CSF: DE.AE-1 (Baseline of operations established), PR.DS-1 (Data-at-rest is protected). ISO 27001: A.8.15 (Logging), A.8.16 (Monitoring activities). SOC 2: CC7.2 (Monitor system components). Every configuration in this subsection contributes to the logging and monitoring controls that auditors verify.
Check your understanding
1. Which incident classifications are available when closing an incident in Sentinel?