11.1 Threat Hunting Concepts and Methodology
Introduction
Required role: Microsoft Sentinel Reader (minimum for hunting queries). Sentinel Contributor for bookmark and hunt management.
An analytics rule catches an attacker who uses a brute-force attack because you wrote a rule for brute-force. But what about the attacker who phishes a user, steals their session token, and signs in from the same country at the same time of day as the legitimate user — generating zero alerts because the sign-in does not match any rule’s detection logic? That attacker is invisible to automated detection. Finding them requires a human analyst who asks: “Is there anything in this data that looks wrong, even though nothing triggered an alert?”
That is threat hunting.
Detection vs hunting: the fundamental difference
Detection is automated, rule-based, and reactive. An analytics rule fires when data matches a predefined pattern. The analyst responds to the alert. Detection works well for known threats with well-defined signatures or behaviour patterns.
Hunting is manual, hypothesis-driven, and proactive. The analyst searches the data without an alert trigger, guided by hypotheses, threat intelligence, or curiosity. Hunting works well for unknown threats, novel techniques, and attackers who deliberately evade detection rules.
Detection answers: “Has this specific pattern appeared in the data?” Hunting asks: “What is happening in the data that I do not expect?”
Neither replaces the other. Detection provides continuous automated coverage — catching the 80% of threats that match known patterns. Hunting provides depth coverage — finding the 20% of threats that evade rules.
The three hunting approaches
Approach 1: Hypothesis-driven hunting. The hunter starts with a hypothesis: “I believe an attacker may have used token replay to access our environment in the last 30 days.” The hypothesis is based on: threat intelligence (reports of token replay campaigns targeting the industry), environmental knowledge (conditional access policies do not enforce token binding), or incident patterns (recent AiTM phishing attempts suggest stolen tokens may exist). The hunter writes KQL queries to test the hypothesis, analyses the results, and either confirms or refutes the hypothesis.
This is the most structured and effective hunting approach. Subsection 11.4 covers it in depth.
Approach 2: Indicator-driven hunting (IOC-based). The hunter receives specific indicators of compromise — IP addresses, domains, file hashes, email addresses — from threat intelligence feeds, ISACs, or government advisories. The hunter searches Sentinel data for any activity involving these indicators. This is technically simple (match indicators against log data) but operationally important: it answers “have we been targeted by this specific threat campaign?”
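An indicator sweep can be as simple as matching advisory IOCs against sign-in data. A minimal sketch (the IP addresses below are documentation-range placeholders, not real indicators):

```kql
// Sweep sign-in logs for IP addresses taken from a threat advisory.
// The addresses are placeholders; substitute the IOCs from the advisory.
let AdvisoryIPs = dynamic(["203.0.113.10", "198.51.100.77"]);
SigninLogs
| where TimeGenerated > ago(30d)
| where IPAddress in (AdvisoryIPs)
| project TimeGenerated, UserPrincipalName, IPAddress, AppDisplayName, ResultType
```

The same pattern extends to other tables: swap `SigninLogs` for `EmailEvents` (sender domains) or `DeviceNetworkEvents` (remote IPs) to sweep the other data sources.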
Approach 3: Analytics-driven hunting (data-driven). The hunter starts without a specific hypothesis or indicator. Instead, they analyse the data for statistical anomalies: outliers, rare events, unusual patterns, and deviations from baselines. Machine learning techniques (clustering, time series decomposition) can assist, but the core technique is exploratory data analysis with KQL.
This approach finds threats that the hunter did not anticipate — but it requires strong KQL skills and deep understanding of what “normal” looks like in the environment.
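As a sketch of this exploratory style, assuming Defender for Endpoint data is flowing into the workspace, a rare-process query surfaces binaries seen on only a handful of devices:

```kql
// Exploratory sketch: executables observed on fewer than three devices in 30 days.
// Rare is not the same as malicious; results need human interpretation.
DeviceProcessEvents
| where TimeGenerated > ago(30d)
| summarize DeviceCount = dcount(DeviceId), ExamplePath = any(FolderPath) by FileName
| where DeviceCount < 3
| order by DeviceCount asc
```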
Comparing the three approaches
| Approach | Starting Point | Skill Required | Time Investment | Best For |
|---|---|---|---|---|
| Hypothesis-driven | Theory from TI or gaps | Medium-High | 2-4 hours | Structured gap closure |
| Indicator-driven | Specific IOCs | Low-Medium | 30 min - 2 hours | Threat advisory response |
| Analytics-driven | Data exploration | High | 4-8 hours | Unknown threat discovery |
In practice, most hunting sessions blend approaches. A hypothesis-driven hunt starts with a theory, discovers suspicious indicators, and the hunter pivots to indicator-driven queries to expand the scope. An analytics-driven exploration of rare processes reveals an unfamiliar binary, and the hunter pivots to hypothesis-driven investigation of that binary’s behaviour.
The hunter’s mindset
Effective hunting requires a specific analytical mindset — different from incident response.
Curiosity over urgency. Incident response is urgent: contain the threat, minimise damage, close the incident. Hunting is exploratory: “What is this? Why is it here? What does it mean?” The hunter follows threads that may lead nowhere — and that is acceptable.
Comfort with ambiguity. An analytics rule produces a binary result: alert or no alert. A hunting query produces data that requires interpretation. The same query result might be malicious in one context and benign in another. The hunter must tolerate ambiguity and make probabilistic judgments.
Systematic thinking. Random data exploration is not hunting — it is browsing. The hunter follows a structured methodology (the hunting cycle) even when the investigation takes unexpected turns. Every pivot should be documented. Every dead end should be recorded to prevent re-exploration.
Adversarial perspective. The hunter thinks like the attacker: “If I compromised this account, what would I do next? Which resources would I access? How would I maintain access? How would I avoid detection?” This perspective generates hypotheses that match real attacker behaviour — not theoretical attacks from textbooks.
Hunting vs incident response: understanding the boundary
Hunting and incident response overlap but serve different purposes.
Hunting finds the threat. The hunter discovers suspicious activity through proactive querying — before any alert has been generated. The output is a finding: “I believe this account is compromised because [evidence].”
Incident response handles the threat. Once the finding is confirmed (or strongly suspected), it transitions to incident response: containment, investigation, eradication, and recovery. The hunter promotes the finding to an incident, and the incident management process (Module 10.5) takes over.
The handover point: When a hunting finding requires containment actions (password reset, device isolation, IP block), it has crossed from hunting into incident response. Promote to an incident. Do not attempt containment without the formal incident structure — the documentation, automation, and accountability of the incident process are essential.
Post-incident hunting: After an incident is contained and closed, hunting resumes: “The attacker compromised Account A via AiTM phishing. Did they use the same infrastructure against other accounts? Did they establish persistence we did not find during the incident?” This post-incident hunting extends the investigation scope beyond the initial alert.
The dwell time problem: why hunting matters
Dwell time is the period between an attacker’s initial access and the organisation’s detection of the compromise. Industry breach reports have repeatedly found average dwell times on the order of 150-200 days for externally notified breaches and 50-70 days for internally detected breaches.
What happens during dwell time: The attacker moves laterally (compromising additional accounts and devices), escalates privileges (gaining admin access), establishes persistence (creating backdoor accounts, OAuth apps, scheduled tasks), and collects data (reading email, downloading files, exfiltrating databases). Every day of dwell time increases the attacker’s foothold and the eventual damage.
Detection reduces dwell time for known patterns. An analytics rule that detects brute-force-then-success catches the attack within minutes of the credential compromise — dwell time near zero for that technique.
Hunting reduces dwell time for unknown patterns. An attacker who bypasses all analytics rules (novel technique, careful evasion) has unlimited dwell time — until a hunter finds them. A monthly hunting programme that cycles through MITRE techniques (subsection 11.9) catches the attacker within 30 days. A weekly programme catches them within 7 days. The hunting cadence directly determines the maximum undetected dwell time for threats that bypass rules.
The ROI equation: If the average cost of a breach increases by X per day of dwell time, and hunting reduces dwell time from 60 days to 7 days, the savings are 53 × X. For a mid-sized organisation, reducing dwell time by 53 days may prevent: 53 additional days of email exfiltration (thousands of sensitive emails), 53 days of lateral movement (dozens of additional compromised accounts), and the difference between a contained incident and a reportable data breach.
The threat hunting maturity model (conceptual)
Level 0: No hunting. Detection rules only. Unknown threats remain undetected indefinitely (or until the attacker causes visible damage).
Level 1: Reactive hunting. Hunting occurs only in response to incidents — “the incident is contained, let’s see if there is more.” No proactive hunting.
Level 2: Indicator hunting. Periodic IOC searches when threat advisories are published. Simple but valuable — answers “have we been targeted by this campaign?”
Level 3: Hypothesis hunting. Structured, proactive hunting driven by hypotheses from threat intelligence and coverage gaps. The target level for this module.
Level 4: Intelligence-driven hunting. Hunting guided by custom threat models, predictive analysis, and deep understanding of adversary behaviour. Notebooks and ML techniques supplement KQL. Level 4 requires: mature Level 3 operations, dedicated hunting time, and advanced analytical skills.
Most organisations following this course should target Level 3 within 6 months of completing Modules 7-10.
Data source requirements for effective hunting
Hunting is only as good as the data available. Before starting a hunting programme, assess your data coverage against the hunting requirements.
Minimum viable data for identity hunting: SigninLogs + AADNonInteractiveUserSignInLogs + AuditLogs. These three tables enable: sign-in anomaly hunting (impossible travel, new country, new device, MFA fatigue), privilege escalation hunting (role assignments, group changes), and account manipulation hunting (new accounts, OAuth consents). If you have only these three tables, you can hunt effectively for identity-based threats — the most common attack vector in M365 environments.
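As an example of what those three tables enable, a new-country sign-in hunt needs only SigninLogs. A sketch that compares the last 7 days against each user's prior 30-day baseline:

```kql
// Sketch: successful sign-ins from a location not seen for that user in the
// baseline window. Location granularity depends on your SigninLogs schema.
let Baseline = SigninLogs
    | where TimeGenerated between (ago(37d) .. ago(7d))
    | where ResultType == "0"
    | summarize KnownLocations = make_set(Location) by UserPrincipalName;
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| join kind=inner Baseline on UserPrincipalName
| where not(set_has_element(KnownLocations, Location))
| project TimeGenerated, UserPrincipalName, Location, IPAddress, AppDisplayName
```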
Adding email hunting: EmailEvents + EmailUrlInfo + UrlClickEvents + CloudAppEvents. These tables enable: phishing campaign hunting (sender analysis, URL reputation, click tracking), BEC pattern hunting (thread hijacking, inbox rules, forwarding), and data exfiltration hunting (email-based file transfer). Module 8 covers the connectors that populate these tables.
Adding endpoint hunting: DeviceProcessEvents + DeviceNetworkEvents + DeviceFileEvents + DeviceEvents. These tables enable: malware hunting (rare processes, encoded commands, suspicious network connections), lateral movement hunting (remote service connections, credential access), and insider threat hunting (USB file copy, personal cloud uploads). Requires Defender for Endpoint (Module 2).
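A common endpoint hunt over these tables is encoded PowerShell. A sketch (the `contains` match is deliberately broad, so expect benign admin tooling in the results):

```kql
// Sketch: PowerShell started with an encoded command, a common malware pattern.
DeviceProcessEvents
| where TimeGenerated > ago(7d)
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine contains "-enc"
| project TimeGenerated, DeviceName, AccountName, InitiatingProcessFileName, ProcessCommandLine
```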
Adding network hunting: CommonSecurityLog (CEF data from firewalls/proxies) + Syslog. These tables enable: C2 hunting (beaconing patterns, DNS anomalies), network lateral movement hunting, and data exfiltration hunting (unusual outbound connections).
The hunting data coverage matrix:
| Hunt Type | Required Tables | Module |
|---|---|---|
| Identity anomaly | SigninLogs, AADNonInteractiveUserSignInLogs | M6, M7 |
| Privilege escalation | AuditLogs | M7, M9 |
| Phishing/BEC | EmailEvents, CloudAppEvents | M8 |
| Endpoint malware | DeviceProcessEvents, DeviceNetworkEvents | M2 |
| Insider threat | CloudAppEvents, DeviceFileEvents | M3, M15 |
| Network C2 | CommonSecurityLog, DnsEvents | M8 |
Before hunting for a technique, verify the required data exists: `TableName | where TimeGenerated > ago(1d) | count`. If the count is zero, the data connector is not configured or is not delivering data. Fix the data gap (Module 8) before hunting — you cannot find what you cannot see.
The hunting cycle
Hunting follows a structured cycle — not random exploration.
Step 1: Formulate the hypothesis. Based on threat intelligence, MITRE ATT&CK coverage gaps, recent incidents, or environmental changes. A good hypothesis is specific, testable, and time-bounded: “An attacker may have used OAuth consent phishing to gain persistent access to user mailboxes in the last 60 days” — not “something bad may have happened.”
Step 2: Develop the query plan. Identify which Sentinel tables contain evidence relevant to the hypothesis. Write the KQL queries that test the hypothesis. Determine the time range to search.
Step 3: Execute and analyse. Run the queries. Examine the results. Distinguish between: true positives (confirmed threats), suspicious findings (require further investigation), benign anomalies (unusual but legitimate), and normal activity (hypothesis not confirmed).
Step 4: Document findings. Record what was searched, what was found, and what actions were taken. Use hunting bookmarks (subsection 11.5) to preserve evidence. If a threat is found, promote the finding to an incident for formal investigation and response.
Step 5: Improve detection. If the hunt found a real threat, create an analytics rule that would detect this pattern automatically in the future. If the hunt found nothing but the hypothesis was valid, the data may be insufficient — consider whether additional data connectors (Module 8) would improve visibility.
Step 6: Report and close. Document the hunt: hypothesis, methodology, findings, and detection improvements made. Close the hunt in the hunt management system.
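As an illustration of steps 2 and 3, the OAuth consent hypothesis from step 1 maps to consent operations in AuditLogs. A sketch (field extraction is simplified; adjust to your AuditLogs schema):

```kql
// Sketch testing the example hypothesis: application consents granted
// in the last 60 days, with the consenting user and target application.
AuditLogs
| where TimeGenerated > ago(60d)
| where OperationName == "Consent to application"
| extend Actor = tostring(InitiatedBy.user.userPrincipalName),
         App = tostring(TargetResources[0].displayName)
| project TimeGenerated, Actor, App, Result
```

Interpreting the results is step 3: a consent to a well-known line-of-business app is normal activity; a consent to an unfamiliar app requesting mail-read permissions is a suspicious finding.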
When to hunt
Triggered by threat intelligence. A new advisory describes a campaign targeting your industry. Hunt for indicators and TTPs from the advisory.
Triggered by detection gaps. The MITRE ATT&CK coverage analysis (Module 10.11) reveals techniques with no analytics rules. Hunt for those techniques to determine whether they are present in your environment — and whether rules can be built.
Triggered by incidents. A true positive incident reveals an attacker in your environment. After containment, hunt for: additional compromised accounts, lateral movement paths, and persistence mechanisms the incident investigation may have missed.
Scheduled cadence. Regular hunting sessions (weekly or fortnightly) that cycle through priority hypotheses. This ensures hunting happens consistently, not just when triggered by events.
After environmental changes. A new application is deployed, a new office opens, or a new vendor is granted access. Hunt for: unauthorised access patterns that the new exposure may have created.
Hunting in hybrid environments
Most organisations do not operate a pure-cloud M365 environment. They have on-premises Active Directory synchronised to Entra ID, on-premises file servers alongside SharePoint Online, and VPN connections alongside direct cloud access. Hunting in a hybrid environment requires querying both cloud and on-premises data sources — and correlating across them.
Identity hunting across hybrid. An attacker who compromises an on-premises AD account gains access to M365 via Azure AD Connect synchronisation. The sign-in to M365 shows as a legitimate cloud sign-in — because it IS a legitimate sign-in, using a legitimately synchronised credential. To detect the on-premises compromise: hunt for password changes in on-premises AD (SecurityEvent, EventID 4724) followed by cloud sign-ins from unusual infrastructure.
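One way to express that hunt, assuming on-premises security events reach the workspace via the Windows Security Events connector (the account-name matching below is simplified for illustration):

```kql
// Sketch of the hybrid chain: on-prem password reset (EventID 4724) followed
// within 4 hours by a successful cloud sign-in for the same account name.
let Resets = SecurityEvent
    | where TimeGenerated > ago(7d)
    | where EventID == 4724
    | project ResetTime = TimeGenerated, Account = tolower(TargetUserName);
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| extend Account = tolower(tostring(split(UserPrincipalName, "@")[0]))
| join kind=inner Resets on Account
| where TimeGenerated between (ResetTime .. (ResetTime + 4h))
| project ResetTime, SignInTime = TimeGenerated, UserPrincipalName, IPAddress, Location
```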
A cloud sign-in from a new IP address within four hours of an on-premises password change is the signature of the hybrid attack chain: the attacker resets the password on-prem, the change syncs to Entra ID, and the attacker signs in to M365 with the new password.
Lateral movement from cloud to endpoint. An attacker with a stolen M365 token may use Intune or Defender for Endpoint to push scripts to managed devices — moving from cloud access to endpoint control without traditional lateral movement techniques.
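A sketch of that hunt, assuming the Intune connector populates IntuneAuditLogs (the expected-operator list is a hypothetical placeholder):

```kql
// Sketch: Intune script or configuration deployments by accounts outside the
// expected operator list (placeholder UPNs; replace with your admin accounts).
let ExpectedOperators = dynamic(["intune-admin@contoso.com"]);
IntuneAuditLogs
| where TimeGenerated > ago(30d)
| where OperationName has_any ("DeviceManagementScript", "DeviceConfiguration")
| where Identity !in (ExpectedOperators)
| project TimeGenerated, Identity, OperationName
```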
A non-IT account deploying Intune scripts or policies is a strong indicator of cloud-to-endpoint lateral movement. Most hunting guides do not cover this technique because it is cloud-native: the attacker never touches the network.
Why hunting requires skill
Anyone can respond to alerts. Hunting requires: deep understanding of the data (which tables, which fields, what "normal" looks like), KQL proficiency (Module 6), knowledge of the threat landscape (which techniques attackers use), and investigative intuition (recognising patterns that automated rules cannot express). This module builds those skills through structured methodology and practical exercises.
Try it yourself
Write three hunting hypotheses for your environment. For each, identify: the threat intelligence or environmental knowledge that motivates the hypothesis, the Sentinel tables you would query, and the time range you would search. Example: "Hypothesis: an attacker used a compromised service principal to access Azure resources in the last 30 days. Tables: AADServicePrincipalSignInLogs, AzureActivity. Time range: 30 days." You do not need to write the KQL yet — hypothesis formulation is a skill in itself.
What you should observe
Good hypotheses are specific (name the technique), testable (identify the data source), and time-bounded (define the search period). Vague hypotheses like "something suspicious happened" cannot be tested. Compare your hypotheses against the MITRE ATT&CK matrix — each hypothesis should map to at least one technique.
Compliance mapping
NIST CSF: DE.AE-1 (Baseline of operations established), PR.DS-1 (Data-at-rest is protected). ISO 27001: A.8.15 (Logging), A.8.16 (Monitoring activities). SOC 2: CC7.2 (Monitor system components). The hunting practices in this subsection contribute to the logging and monitoring controls that auditors verify.
Check your understanding
1. What is the fundamental difference between detection and hunting?
2. What are the three hunting approaches?
3. After a hunt finds a real threat, what should you do to improve detection?