11.4 Hypothesis-Driven Hunting
Introduction
Required role: Microsoft Sentinel Reader (minimum for hunting queries). Sentinel Contributor for bookmark and hunt management.
Hypothesis-driven hunting is the most structured hunting approach, and in practice usually the most effective. Instead of browsing data at random, you start with a specific theory about attacker activity, design queries to test that theory, and systematically confirm or refute it. The hypothesis transforms hunting from exploration into investigation.
Formulating effective hypotheses
A good hunting hypothesis has four components.
What: The specific threat or technique you believe may be present. “Token replay after AiTM phishing” — not “something suspicious with authentication.”
Why: The intelligence or observation that motivates the hypothesis. “Microsoft Threat Intelligence reported a 300% increase in AiTM campaigns targeting the UK financial sector in Q1 2026. Our organisation is in the financial sector and uses M365 E5 without token binding.”
Where: The data sources and time period to search. “SigninLogs and CloudAppEvents, last 60 days.”
How: The observable evidence that would confirm the hypothesis. “Sign-in from a non-corporate IP with a session token that was originally issued to a different IP. Followed by inbox rule creation or mail forwarding within 2 hours.”
Example hypotheses:
Hypothesis 1: “An attacker may have compromised service principal credentials and used them to access Azure resources in the last 30 days. Motivation: we recently discovered an exposed service principal key in a public GitHub repository (it was rotated, but the exposure window was 48 hours). Evidence: AADServicePrincipalSignInLogs showing sign-ins from IPs outside our known infrastructure.”
Hypothesis 2: “A departing employee may have exfiltrated sensitive documents via personal cloud storage in the last 14 days. Motivation: HR notified us of a high-risk resignation. Evidence: CloudAppEvents showing bulk file downloads by the employee, followed by uploads to consumer cloud services (Dropbox, Google Drive, personal OneDrive) visible in DeviceNetworkEvents.”
Hypothesis 3: “An attacker may be using a compromised account for reconnaissance — enumerating directory groups and admin accounts without taking visible action. Motivation: UEBA flagged a user with anomalous directory query activity. Evidence: IdentityQueryEvents showing LDAP queries for sensitive groups (Domain Admins, Enterprise Admins) from a user who has never queried these groups before.”
Testing hypotheses with KQL
Each hypothesis maps to one or more KQL queries. Design queries that return evidence for or against the hypothesis.
Hypothesis 1 test:
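A sketch of a query that tests this hypothesis. AADServicePrincipalSignInLogs and its columns are standard Sentinel schema; the list of known infrastructure IPs is a placeholder you must replace with your own ranges.

```kql
// Sketch: service principal sign-ins from IPs outside known infrastructure.
// Replace knownInfraIPs with your organisation's egress/infrastructure addresses.
let knownInfraIPs = dynamic(["203.0.113.10", "198.51.100.25"]);
AADServicePrincipalSignInLogs
| where TimeGenerated > ago(30d)
| where IPAddress !in (knownInfraIPs)
| summarize SignInCount = count(),
            IPs = make_set(IPAddress),
            FirstSeen = min(TimeGenerated),
            LastSeen = max(TimeGenerated)
    by ServicePrincipalName, ServicePrincipalId
| order by SignInCount desc
```

Maintaining the known-IP list as a watchlist rather than an inline literal makes the query reusable across hunts.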
Interpreting results: If the query returns service principals signing in from unknown IPs — especially IPs in unexpected geographies — the hypothesis is potentially confirmed. Next step: check what resources the service principal accessed (AzureActivity), whether the service principal key was recently rotated (AuditLogs), and whether the IP appears in threat intelligence databases.
If the query returns zero results: the hypothesis is not confirmed for the searched period. Document the negative finding and close the hunt. Negative findings are valuable — they confirm that this specific threat vector was not exploited during the search period.
Pivot techniques: expanding from initial findings
When a primary query returns suspicious results, pivot queries expand the investigation to determine scope and impact.
Pivot by entity. The primary query found a suspicious IP. Pivot: what else has this IP done?
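The entity pivot could be sketched as a union across identity, cloud app, and endpoint tables. The suspect IP is a placeholder; substitute the value from your primary query.

```kql
// Sketch: unified timeline of one suspicious IP across data domains.
// suspectIP is a placeholder taken from the primary query's results.
let suspectIP = "203.0.113.47";
union isfuzzy=true
    (SigninLogs
        | where IPAddress == suspectIP
        | project TimeGenerated, Source = "SigninLogs",
                  Actor = UserPrincipalName,
                  Action = strcat("Sign-in, result ", ResultType)),
    (CloudAppEvents
        | where IPAddress == suspectIP
        | project TimeGenerated, Source = "CloudAppEvents",
                  Actor = AccountDisplayName, Action = ActionType),
    (DeviceNetworkEvents
        | where RemoteIP == suspectIP
        | project TimeGenerated, Source = "DeviceNetworkEvents",
                  Actor = DeviceName,
                  Action = strcat("Connection to port ", tostring(RemotePort)))
| order by TimeGenerated asc
```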
This produces a unified timeline of all activity from the suspicious IP — across identity, network, and endpoint data. A single suspicious sign-in becomes a complete picture of the attacker’s interaction with your environment.
Pivot by time window. The primary query found a suspicious event at 14:32. Pivot: what else happened within 30 minutes of that event?
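A sketch of the time-window pivot. The timestamp is a placeholder for the suspicious event; union whichever tables are relevant to your finding.

```kql
// Sketch: everything logged within 30 minutes either side of the pivot event.
// pivotTime is a placeholder; the Type column (table name) anchors the output.
let pivotTime = datetime(2026-03-22T14:32:00Z);
let window = 30m;
union isfuzzy=true SigninLogs, AuditLogs, CloudAppEvents
| where TimeGenerated between ((pivotTime - window) .. (pivotTime + window))
| project-reorder TimeGenerated, Type
| order by TimeGenerated asc
```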
Pivot by technique. The primary query found an inbox rule creation. Pivot: has this technique been used on other accounts?
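The technique pivot could be sketched against CloudAppEvents. The ActionType values shown are common Exchange Online operation names, but verify them against the action names that actually appear in your tenant's data.

```kql
// Sketch: inbox rule creation/modification across all accounts, last 30 days.
// Action names vary by workload; confirm against your own CloudAppEvents.
CloudAppEvents
| where TimeGenerated > ago(30d)
| where ActionType in ("New-InboxRule", "Set-InboxRule", "UpdateInboxRules")
| summarize RuleEvents = count(),
            FirstSeen = min(TimeGenerated),
            IPs = make_set(IPAddress)
    by AccountObjectId, AccountDisplayName
| order by FirstSeen asc
```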
Generating hypotheses from different sources
From threat intelligence reports. Read the TTP section. For each technique: “Has this technique been used against us?” Example: “Microsoft reports Storm-0539 using OAuth app consent for persistence. Hypothesis: an attacker may have used OAuth consent phishing to gain persistent access to mailboxes in the last 60 days.”
From post-incident reviews. Each PIR (Module 10.5) generates questions: “The attacker used AiTM phishing. Did they target other users we have not identified? Hypothesis: the same phishing infrastructure may have been used against additional users — search for the phishing domain in EmailUrlInfo.”
From UEBA anomalies. A user with a high investigation priority score. “Hypothesis: User X may be compromised — their anomalous behaviour (new location, first-time application access, above-baseline data download) matches a credential compromise pattern.”
From environmental changes. A new SaaS application was deployed. “Hypothesis: the new application may have introduced OAuth permissions that create a data access path that bypasses existing controls — search for consent grants to the new application and verify the permissions are appropriate.”
From peer organisations. An ISAC advisory reports a breach at a similar organisation. “Hypothesis: the same threat actor may be targeting us — search for the reported infrastructure (IPs, domains, TTPs) in our data.”
Handling false positives in hunting
Unlike analytics rules (where false positives create operational noise), hunting false positives are part of the process — expected and informative.
Document benign findings. When a hunting query returns results that turn out to be legitimate: document what the activity was, why it looked suspicious, and why it was determined to be benign. This documentation: prevents the same activity from being re-investigated in future hunts, may inform analytics rule exclusions (if you build a rule for this pattern, include the exclusion), and builds institutional knowledge about normal-but-unusual behaviour in your environment.
Refine the query. If a hunting query consistently returns the same benign patterns, add exclusions to reduce noise for future executions. Unlike analytics rules (where every false positive wastes analyst time), hunting query refinement is iterative — the query evolves across multiple hunt sessions.
Confidence scoring for hunting findings
Not every finding is equally convincing. Assign a confidence score to each finding to prioritise follow-up.
High confidence (80-100%). Multiple corroborating data points. The finding matches a known attack technique exactly. External threat intelligence confirms the indicator. Example: “IP 203.0.113.47 appears in Microsoft’s Storm-1167 report AND signed into 3 accounts in our tenant AND those accounts created inbox rules within 2 hours.” Three independent signals corroborate the finding — this is almost certainly a compromise.
Medium confidence (40-79%). The finding matches a suspicious pattern but lacks corroboration. May be malicious or benign — requires additional investigation. Example: “A user signed in from a new country for the first time. The IP is not in any TI database. No post-authentication suspicious activity detected.” The sign-in is anomalous but could be legitimate travel.
Low confidence (1-39%). The finding is technically anomalous but has a plausible benign explanation. Example: “A rare process (toolname.exe) was executed once on one device. No network connections. No persistence. The process name does not match known malware.” The process is unfamiliar but showed no malicious behaviour.
Recommended actions by confidence: High → promote to incident immediately, trigger containment. Medium → bookmark, investigate further within 24 hours, contact the user for verification. Low → bookmark with benign-unless-proven-otherwise note, revisit if additional signals emerge.
Hypothesis refinement: from broad to specific
Effective hunting often requires refining the hypothesis through multiple iterations.
Iteration 1 (broad): “Has anyone in our environment been compromised via AiTM phishing in the last 30 days?” Query: search for sign-ins with risk indicators. Result: 200 risky sign-ins. Too many to investigate individually.
Iteration 2 (narrower): “Among users with risky sign-ins, did any subsequently create inbox rules or mail forwarding?” Query: join risky sign-ins with CloudAppEvents inbox rule creation. Result: 5 users. Manageable.
Iteration 3 (specific): “Among the 5 users who had risky sign-ins AND created inbox rules, did the inbox rules forward to external addresses?” Query: filter for external forwarding destinations. Result: 1 user. Investigate this user.
Each iteration narrows the scope: 200 → 5 → 1. The hunter does not investigate 200 users — they refine the hypothesis until the finding is specific and actionable. This iterative refinement is the core analytical skill of threat hunting.
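Iteration 3 could be sketched as the following KQL. SigninLogs and CloudAppEvents columns are standard schema; the ActionType names and the internal domain are placeholders to adapt, and the join assumes SigninLogs UserId matches CloudAppEvents AccountObjectId (both hold the Entra object ID).

```kql
// Sketch: risky sign-ins joined to inbox rule events, filtered to rules
// that appear to forward externally. "@contoso.com" is a placeholder for
// your internal domain(s); action names vary by workload.
let riskyUsers =
    SigninLogs
    | where TimeGenerated > ago(30d)
    | where RiskLevelDuringSignIn in ("medium", "high")
    | distinct UserId;
CloudAppEvents
| where TimeGenerated > ago(30d)
| where ActionType in ("New-InboxRule", "Set-InboxRule")
| where AccountObjectId in (riskyUsers)
| extend Params = tostring(RawEventData.Parameters)
| where Params has "ForwardTo" or Params has "RedirectTo"
| where Params !has "@contoso.com"   // placeholder: exclude internal destinations
| project TimeGenerated, AccountDisplayName, ActionType, IPAddress, Params
```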
Time-bounded investigation
Hunting can be open-ended — a rabbit hole that consumes hours without producing actionable results. Time-box your hunts to maintain productivity.
Time allocation per hunt: 2 hours for a hypothesis hunt (subsection 11.11). 1 hour for an IOC hunt. 30 minutes for a UEBA review.
At the time limit: Stop and assess. If you have found something promising: bookmark it, document the current state, and schedule a follow-up session. If you have not found anything: document the negative finding, close the hunt, and move to the next hypothesis.
The 80/20 rule in hunting: The first 20% of your hunting time produces 80% of your findings. The initial queries — the obvious checks — surface the most detectable threats. Spending the remaining 80% of time on increasingly esoteric queries has diminishing returns. Know when to stop and move to the next hypothesis.
Advanced hypothesis examples with production KQL
The three examples above cover common patterns. Here are additional hypotheses that target M365-specific attack techniques.
Hypothesis 4: “An attacker may have used MFA fatigue (T1621) to bypass MFA in the last 30 days.”
Motivation: MFA fatigue attacks — where the attacker repeatedly triggers push notifications until the user approves one to stop the notifications — are increasingly common against Authenticator push-based MFA.
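A sketch of the MFA fatigue pattern in SigninLogs. ResultType 500121 commonly marks a failure during the strong-authentication step; verify against your tenant's data, and treat the thresholds (5 denials, 1-hour window, 2-hour follow-up) as starting points to tune.

```kql
// Sketch: 5+ strong-auth failures for a user within one hour, followed by
// a successful sign-in within 2 hours. Thresholds are tuning placeholders.
let denials =
    SigninLogs
    | where TimeGenerated > ago(30d)
    | where ResultType == "500121"
    | summarize Denials = count(),
                LastDenial = max(TimeGenerated),
                DenialIPs = make_set(IPAddress)
        by UserPrincipalName, Hour = bin(TimeGenerated, 1h)
    | where Denials >= 5;
denials
| join kind=inner (
    SigninLogs
    | where TimeGenerated > ago(30d)
    | where ResultType == "0"
    | project UserPrincipalName, SuccessTime = TimeGenerated, SuccessIP = IPAddress
) on UserPrincipalName
| where SuccessTime between (LastDenial .. (LastDenial + 2h))
| project UserPrincipalName, Denials, LastDenial, DenialIPs, SuccessTime, SuccessIP
```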
If this returns results: a user had 5+ MFA denials within an hour, followed by a successful MFA within 2 hours. This is the MFA fatigue pattern. Investigate: was the successful MFA from the same IP as the denials (the attacker finally got the user to approve) or a different IP (the user legitimately approved their own request, unrelated to the attacker)?
Hypothesis 5: “An attacker may have registered a device in Entra ID to satisfy device-based conditional access policies (T1098.005) in the last 60 days.”
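A sketch against AuditLogs. The corporate CIDR ranges are placeholders, and device registration operation names can vary, so confirm the OperationName values present in your own audit data before relying on the filter.

```kql
// Sketch: Entra ID device registrations initiated from outside corporate
// IP ranges. corpCidrs and the operation names are placeholders to verify.
let corpCidrs = dynamic(["203.0.113.0/24", "198.51.100.0/24"]);
AuditLogs
| where TimeGenerated > ago(60d)
| where OperationName in ("Register device", "Add device")
| extend Actor = tostring(InitiatedBy.user.userPrincipalName),
         ActorIP = tostring(InitiatedBy.user.ipAddress)
| where isnotempty(ActorIP)
| where not(ipv4_is_in_any_range(ActorIP, corpCidrs))
| project TimeGenerated, OperationName, Actor, ActorIP, TargetResources
```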
Device registrations from non-corporate IPs may indicate: an attacker registering a device to satisfy conditional access policies that require managed devices, an employee registering a personal device (BYOD), or an employee working from home for the first time. Cross-reference with the HR departing employees list and with known attacker IPs from previous incidents.
Hypothesis 6: “An attacker may have created cloud-only accounts (T1136.003) outside the normal HR provisioning process in the last 90 days.”
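A sketch against AuditLogs. "Add user" is the standard Entra audit operation for account creation; the provisioning connector name excluded here is a placeholder for whatever identity (app or service account) performs your normal HR-driven provisioning.

```kql
// Sketch: user accounts created by anyone other than the normal provisioning
// identity. "HR-Provisioning-Connector" is a placeholder to replace.
AuditLogs
| where TimeGenerated > ago(90d)
| where OperationName == "Add user"
| extend ActorUser = tostring(InitiatedBy.user.userPrincipalName),
         ActorApp = tostring(InitiatedBy.app.displayName)
| where ActorApp !~ "HR-Provisioning-Connector"   // placeholder: your sync/provisioning identity
| extend NewUser = tostring(TargetResources[0].userPrincipalName)
| project TimeGenerated, NewUser, ActorUser, ActorApp
```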
Any account created outside the standard provisioning process warrants investigation: is it a legitimate admin action (test account, vendor access), or an attacker creating a backdoor account? Cross-reference: does the new account have any role assignments? Has it been used to sign in?
The hypothesis testing workflow
Step 1: Run the primary query. The query that directly tests the hypothesis.
Step 2: Assess initial results. If zero results → document negative finding, close hunt. If results exist → proceed to Step 3.
Step 3: Investigate positive results. For each result: is it a true positive (threat confirmed), a suspicious finding (requires more investigation), or a false positive (legitimate activity that matches the hunt pattern)?
Step 4: Pivot and expand. For suspicious findings, run additional queries to gather context. If the initial query found a service principal signing in from an unusual IP, pivot: what did that service principal do after signing in? (AzureActivity | where TimeGenerated > ago(30d) | where Caller == "<service-principal-id>")
Step 5: Determine the outcome. Threat confirmed → promote to incident, trigger containment, create analytics rule for future detection. Suspicious but inconclusive → bookmark the evidence, schedule follow-up, consider whether additional data sources would help. Benign → document the benign pattern as an exclusion for future hunts.
Documenting hypothesis hunts
Every hunt should produce a documented record — regardless of whether a threat was found.
Hunt record template:
Hunt ID: HUNT-2026-0322-001
Hypothesis: [statement]
Motivation: [threat intelligence, UEBA finding, incident follow-up, or coverage gap]
Date: 2026-03-22
Hunter: [analyst name]
Time spent: 2 hours
Data sources queried: [tables]
Time range searched: [start] to [end]
Queries executed: [count, with links to saved queries]
Results: [Threat confirmed / Suspicious finding / Benign / No findings]
Findings summary: [2-3 sentences]
Actions taken: [incident created, analytics rule built, bookmarks saved, none]
Detection improvement: [new rule created? existing rule tuned?]
Maintaining the hunt log. Store hunt records in a shared location (SharePoint document, wiki page, or a dedicated Sentinel workbook that queries hunt metadata stored in a custom table). The hunt log provides: accountability (who hunted for what), coverage tracking (which techniques have been hunted recently), and institutional memory (what was found and what was excluded).
Try it yourself
Select one of the three example hypotheses from this subsection. Write the KQL query to test it. Run the query against your workspace. Assess the results: threat confirmed, suspicious, benign, or no findings? Document the hunt using the template above. If you find suspicious results, create a bookmark. This is one complete hypothesis-driven hunt — the core skill of threat hunting.
What you should observe
In a lab, most hypotheses will return "No findings" — this is expected and valuable (it confirms the lab is not compromised). The exercise builds the muscle memory for the hypothesis testing workflow: formulate → query → assess → document. In production, approximately 1 in 5 hunts finds something worth investigating further — the other 4 confirm the environment is clean for that specific hypothesis.
Knowledge check
NIST CSF: DE.AE-1 (Baseline of operations established), PR.DS-1 (Data-at-rest is protected). ISO 27001: A.8.15 (Logging), A.8.16 (Monitoring activities). SOC 2: CC7.2 (Monitor system components). The hunting and documentation practices in this subsection contribute to the logging and monitoring controls that auditors verify.
Check your understanding
1. Your hunt query returns zero results. Is the hunt a failure?