11.3 Writing Effective Hunting Queries

14-18 hours · Module 11


Introduction

Required role: Microsoft Sentinel Reader (minimum for hunting queries). Sentinel Contributor for bookmark and hunt management.

Hunting queries differ from analytics rule queries in several important ways. Analytics rules are optimised for scheduled execution — they must be fast, precise, and low in false positives. Hunting queries are optimised for discovery — they can be broader, more complex, and computationally expensive because they run on-demand. A hunting query that returns 200 results requiring manual review is perfectly acceptable. An analytics rule that generates 200 alerts per day is operationally destructive.

This subsection teaches the KQL patterns that power effective hunting. Each pattern targets a different category of threat.


Pattern 1: Rare event discovery

Find events that occur infrequently — things that happen once or twice in 30 days are more interesting to a hunter than things that happen thousands of times.

// Find rare processes executed in the last 30 days
DeviceProcessEvents
| where TimeGenerated > ago(30d)
| summarize
    ExecutionCount = count(),
    UniqueDevices = dcount(DeviceName),
    Devices = make_set(DeviceName, 5),
    Users = make_set(AccountName, 5),
    SampleCmdLine = take_any(ProcessCommandLine)
    by FileName
| where ExecutionCount < 5 and UniqueDevices < 3
| order by ExecutionCount asc

Rare processes are high-value hunting targets. A legitimate tool used by one administrator appears here — but so does an attacker’s custom tool, a renamed system binary, and a malware dropper. The hunter reviews the list and investigates any unfamiliar or suspicious entries.
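Once a rare file name stands out, a quick pivot pulls its full context. The sketch below assumes the standard DeviceProcessEvents schema (FolderPath, SHA256); "suspicious.exe" is a placeholder for the name under review, not a real indicator.

```kql
// Pivot on one rare file name: path and hash help expose renamed system binaries
DeviceProcessEvents
| where TimeGenerated > ago(30d)
| where FileName =~ "suspicious.exe"  // placeholder: substitute the name under review
| project TimeGenerated, DeviceName, AccountName, FolderPath, SHA256,
    ProcessCommandLine, InitiatingProcessFileName
| order by TimeGenerated desc
```

A FolderPath outside the expected system directory for a system-binary name, or a SHA256 that threat intelligence does not recognise, turns a curiosity into a lead.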

// Find rare PowerShell commands in the last 14 days
DeviceProcessEvents
| where TimeGenerated > ago(14d)
| where FileName == "powershell.exe" or FileName == "pwsh.exe"
| where isnotempty(ProcessCommandLine)
| extend CmdHash = hash_sha256(ProcessCommandLine)
| summarize Count = count(), Users = make_set(AccountName, 5),
    Devices = make_set(DeviceName, 5)
    by CmdHash, ProcessCommandLine
| where Count < 3
| order by Count asc
| take 50

Pattern 2: Statistical outlier detection

Find entities whose behaviour deviates significantly from the population.

// Find users with abnormally high sign-in counts (potential token abuse)
let UserBaseline = SigninLogs
| where TimeGenerated > ago(30d)
| where ResultType == "0"
| summarize DailySignins = count() by UserPrincipalName, bin(TimeGenerated, 1d)
| summarize AvgDaily = avg(DailySignins), StdDev = stdev(DailySignins)
    by UserPrincipalName;
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| summarize TodaySignins = count() by UserPrincipalName
| join kind=inner UserBaseline on UserPrincipalName
| where TodaySignins > AvgDaily + (3 * StdDev) and TodaySignins > 50
| extend DeviationMultiple = round((TodaySignins - AvgDaily) / StdDev, 1)
| project UserPrincipalName, TodaySignins, AvgDaily = round(AvgDaily, 0),
    StdDev = round(StdDev, 1), DeviationMultiple
| order by DeviationMultiple desc

Users more than 3 standard deviations above their average are statistical outliers. This catches: token replay attacks (automated sign-ins at high frequency), compromised accounts being used for bulk operations, and legitimate but unusual activity (a user running a bulk migration).
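One refinement worth considering: the baseline is only as trustworthy as the number of days behind it. A sketch that drops thin baselines (same SigninLogs schema as above; the 7-day minimum is an assumption to tune):

```kql
// Harden the baseline: require at least 7 active days per user,
// otherwise StdDev is computed from too few samples to be meaningful
let UserBaseline = SigninLogs
| where TimeGenerated between (ago(30d) .. ago(1d))
| where ResultType == "0"
| summarize DailySignins = count() by UserPrincipalName, Day = bin(TimeGenerated, 1d)
| summarize AvgDaily = avg(DailySignins), StdDev = stdev(DailySignins),
    BaselineDays = count()
    by UserPrincipalName
| where BaselineDays >= 7;  // drop accounts too new to have a stable baseline
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| summarize TodaySignins = count() by UserPrincipalName
| join kind=inner UserBaseline on UserPrincipalName
| where TodaySignins > AvgDaily + (3 * StdDev) and TodaySignins > 50
```

Note that this variant also ends the baseline window a day before the comparison window, so today's anomaly cannot contaminate its own baseline.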


Pattern 3: First-time occurrence

Find things happening for the first time — a user’s first sign-in from a new country, a device’s first connection to a new external IP, an application’s first consent grant.

// First-time sign-in from a new country per user
let HistoricalCountries = SigninLogs
| where TimeGenerated between(ago(90d) .. ago(1d))
| where ResultType == "0"
| extend Country = tostring(LocationDetails.countryOrRegion)
| summarize KnownCountries = make_set(Country) by UserPrincipalName;
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| extend Country = tostring(LocationDetails.countryOrRegion)
| join kind=inner HistoricalCountries on UserPrincipalName
| where not(set_has_element(KnownCountries, Country))
| project TimeGenerated, UserPrincipalName, Country, IPAddress,
    AppDisplayName, DeviceDetail
| order by TimeGenerated desc

First-time country sign-ins are classic indicators of credential compromise — an attacker signing in from their own location, which differs from the victim’s historical locations.

// First-time application consent per user
AuditLogs
| where TimeGenerated > ago(7d)
| where OperationName == "Consent to application"
| extend AppName = tostring(TargetResources[0].displayName)
| extend User = tostring(InitiatedBy.user.userPrincipalName)
| join kind=leftanti (
    AuditLogs
    | where TimeGenerated between(ago(90d) .. ago(7d))
    | where OperationName == "Consent to application"
    | extend AppName = tostring(TargetResources[0].displayName)
    | extend User = tostring(InitiatedBy.user.userPrincipalName)
    | distinct User, AppName
) on User, AppName
| project TimeGenerated, User, AppName, IPAddress = tostring(InitiatedBy.user.ipAddress)

A related hunt: a service principal's first sign-in from a new IP. Service principals should have predictable sign-in patterns — new IPs are anomalous.

// First-time service principal sign-in from a new IP
let HistoricalSPIPs = AADServicePrincipalSignInLogs
| where TimeGenerated between(ago(90d) .. ago(1d))
| summarize KnownIPs = make_set(IPAddress, 50)
    by ServicePrincipalId, ServicePrincipalName;
AADServicePrincipalSignInLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| join kind=inner HistoricalSPIPs on ServicePrincipalId
| where not(set_has_element(KnownIPs, IPAddress))
| project TimeGenerated, ServicePrincipalName, IPAddress,
    ResourceDisplayName, ResultType
| order by TimeGenerated desc

Service principal sign-ins from new IPs are higher-fidelity signals than user sign-ins from new IPs — because service principals should only authenticate from known infrastructure. Any new IP is suspicious by default.


Interpreting hunting query results

Raw query results require interpretation. These guidelines help distinguish threats from noise.

High-confidence threat indicators: Multiple corroborating signals (sign-in from unusual IP + inbox rule creation + email forwarding). Activity from IPs in threat intelligence databases. Service accounts performing interactive operations. Processes with encoded command lines. Activity at unusual times from non-shift-working users.

Medium-confidence suspicious indicators: Single anomalous event without corroboration. First-time country access (could be travel). Rare process execution on a single device (could be an IT tool). Elevated UEBA score without specific anomaly type.

Low-confidence/likely benign indicators: Activity from known VPN providers. Sign-ins during business hours from the user’s home country. Processes executed by IT administrators on IT infrastructure. Automated operations from known service accounts.

The analyst’s decision tree: For each hunting result: (1) Is it in TI databases? → Yes: high confidence, promote to incident. (2) Does it have corroborating signals? → Yes: medium-high confidence, investigate further. (3) Is it a first-time occurrence for this entity? → Yes: medium confidence, verify with the user. (4) Is there a plausible benign explanation? → Yes: low confidence, bookmark as benign.


Pattern 4: Temporal anomaly detection

Find activity occurring at unusual times — outside business hours, on weekends, or during holidays.

// Find sign-ins during unusual hours (midnight to 5am; note TimeGenerated is UTC,
// so adjust the window for your organisation's local business hours)
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| extend HourOfDay = hourofday(TimeGenerated)
| where HourOfDay >= 0 and HourOfDay < 5
| where UserPrincipalName !in (
    _GetWatchlist('NightShiftUsers') | project SearchKey)
| summarize NightSignins = count(),
    UniqueApps = make_set(AppDisplayName, 5),
    UniqueIPs = make_set(IPAddress, 5)
    by UserPrincipalName
| where NightSignins > 3
| order by NightSignins desc

Exclude known night-shift workers and on-call staff via a watchlist. Remaining night-time sign-ins are suspicious — either an attacker operating in a different time zone, or a legitimate user working late (which still warrants verification).
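The pattern extends naturally from hours to days. A weekend variant, sketched with KQL's dayofweek() (which returns a timespan: 0d is Sunday, 6d is Saturday); the threshold of 5 is an assumption to tune:

```kql
// Find users signing in on weekends, when most staff should be offline
SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == "0"
| extend Day = dayofweek(TimeGenerated)
| where Day == 0d or Day == 6d  // Sunday or Saturday
| summarize WeekendSignins = count(),
    UniqueIPs = make_set(IPAddress, 5)
    by UserPrincipalName
| where WeekendSignins > 5
| order by WeekendSignins desc
```

As with night hours, exclude on-call and weekend staff via a watchlist before investigating.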


Pattern 5: Cross-table correlation hunting

Join data across multiple tables to find patterns invisible in any single table.

// Hunt: user signed in from external IP AND created inbox rule within 2 hours
let SuspiciousSignins = SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| where not(ipv4_is_private(IPAddress))
| where IPAddress !in (
    _GetWatchlist('CorporateExternalIPs') | project SearchKey)
| project SigninTime = TimeGenerated, UserPrincipalName, SigninIP = IPAddress;
CloudAppEvents
| where TimeGenerated > ago(7d)
| where ActionType == "New-InboxRule"
| extend RuleCreator = tostring(parse_json(RawEventData).UserId)
| extend RuleIP = tostring(parse_json(RawEventData).ClientIP)
| project RuleTime = TimeGenerated, RuleCreator, RuleIP
| join kind=inner SuspiciousSignins
    on $left.RuleCreator == $right.UserPrincipalName
| where RuleTime between(SigninTime .. (SigninTime + 2h))
| project UserPrincipalName, SigninTime, SigninIP, RuleTime, RuleIP
| order by SigninTime desc

This is the AiTM phishing hunt — finding the attack chain that Module 14 investigates in full. The cross-table join reveals a sequence that no single table can show: sign-in from unusual IP followed by inbox rule creation. This is more powerful than any single-table analytics rule.


Pattern 6: Stacking (frequency analysis)

Count occurrences and examine the distribution. The most common values are normal. The least common values are interesting.

// Stack user agents: unusual user agents may indicate scripted access
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| extend UA = tostring(DeviceDetail.browser)
| summarize UserCount = dcount(UserPrincipalName), SigninCount = count()
    by UA
| where UserCount < 3 and SigninCount > 5
| order by UserCount asc

User agents used by fewer than 3 users but appearing more than 5 times suggest scripted or automated sign-ins — potentially an attacker using a custom tool or a token replay script that does not mimic a standard browser user agent.
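The same stacking logic applies to other attributes. A sketch stacking by autonomous system number (SigninLogs exposes this as AutonomousSystemNumber in most tenants; verify the field in yours):

```kql
// Stack sign-ins by ASN: rarely-seen networks may be attacker infrastructure
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| summarize UserCount = dcount(UserPrincipalName), SigninCount = count(),
    SampleIPs = make_set(IPAddress, 5)
    by AutonomousSystemNumber
| where UserCount < 3 and SigninCount > 5
| order by SigninCount desc
```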


Pattern 7: Process tree analysis

Reconstruct parent-child process relationships to find suspicious execution chains.

// Hunt: suspicious parent-child process chains
DeviceProcessEvents
| where TimeGenerated > ago(7d)
| where InitiatingProcessFileName in ("outlook.exe", "winword.exe",
    "excel.exe", "powerpnt.exe", "msedge.exe", "chrome.exe", "firefox.exe")
| where FileName in ("powershell.exe", "cmd.exe", "wscript.exe", "cscript.exe",
    "mshta.exe", "certutil.exe", "bitsadmin.exe", "regsvr32.exe",
    "rundll32.exe", "msiexec.exe")
| project TimeGenerated, DeviceName, AccountName,
    ParentProcess = InitiatingProcessFileName,
    ChildProcess = FileName,
    CommandLine = ProcessCommandLine,
    GrandparentProcess = InitiatingProcessParentFileName
| order by TimeGenerated desc

Office applications and browsers spawning command interpreters is the classic initial access pattern: a user opens a phishing document → the macro executes PowerShell → the attacker has code execution. The grandparent process (explorer.exe → outlook.exe → powershell.exe) reveals the full chain.
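A tighter variant of the same hunt narrows to encoded PowerShell, which legitimate activity almost never launches from a document. A sketch; the switch list in has_any is illustrative, not exhaustive:

```kql
// Office apps spawning encoded PowerShell: high-confidence phishing payload signal
DeviceProcessEvents
| where TimeGenerated > ago(7d)
| where InitiatingProcessFileName in~ ("outlook.exe", "winword.exe",
    "excel.exe", "powerpnt.exe")
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine has_any ("-enc", "-encodedcommand", "frombase64string")
| project TimeGenerated, DeviceName, AccountName, ProcessCommandLine,
    InitiatingProcessFileName, InitiatingProcessParentFileName
| order by TimeGenerated desc
```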

// Hunt: LOLBin (Living off the Land Binary) execution chains
DeviceProcessEvents
| where TimeGenerated > ago(14d)
| where FileName in ("certutil.exe", "mshta.exe", "regsvr32.exe",
    "rundll32.exe", "bitsadmin.exe", "msiexec.exe", "installutil.exe",
    "cmstp.exe", "wmic.exe", "forfiles.exe")
| where ProcessCommandLine has_any ("http", "ftp", "\\\\", "decode",
    "/i:", "/s", "scrobj.dll", "javascript:", "vbscript:")
| project TimeGenerated, DeviceName, AccountName,
    FileName, ProcessCommandLine, InitiatingProcessFileName
| order by TimeGenerated desc

LOLBins are legitimate Windows binaries abused by attackers for download, execution, and defence evasion. The query filters for command-line arguments that indicate abuse (URLs, network paths, decoding, script execution).


Pattern 8: DNS hunting

DNS queries reveal communication patterns that other logs miss — particularly C2 communication using domain generation algorithms (DGA) or DNS tunnelling.

// Hunt: high-entropy domain queries (potential DGA)
DnsEvents
| where TimeGenerated > ago(7d)
| where isnotempty(Name)
| extend DomainLength = strlen(Name)
| extend SubdomainPart = tostring(split(Name, ".")[0])
| extend SubdomainLength = strlen(SubdomainPart)
| where SubdomainLength > 20  // Long random-looking subdomains
| summarize QueryCount = count(), UniqueClients = dcount(ClientIP)
    by Name
| where QueryCount > 5
| order by QueryCount desc

DGA domains have long, random-looking subdomains (e.g., “a7f2b9e4c1d3.evil.com”). Legitimate domains rarely have subdomains exceeding 20 characters. This query surfaces DGA patterns that may indicate malware calling home.
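Length alone is a crude proxy for entropy. A slightly better heuristic, sketched below, measures digit density with countof() in regex mode; the 0.3 threshold and 12-character minimum are assumptions to tune against your own traffic:

```kql
// Digit-heavy subdomain labels look machine-generated (rough entropy proxy)
DnsEvents
| where TimeGenerated > ago(7d)
| where isnotempty(Name)
| extend SubdomainPart = tostring(split(Name, ".")[0])
| extend SubLen = strlen(SubdomainPart)
| where SubLen > 12
| extend DigitCount = countof(SubdomainPart, @"[0-9]", "regex")
| where DigitCount * 1.0 / SubLen > 0.3  // tune: fraction of digits in the label
| summarize QueryCount = count(), UniqueClients = dcount(ClientIP) by Name
| order by QueryCount desc
```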

// Hunt: DNS queries to newly registered domains
DnsEvents
| where TimeGenerated > ago(7d)
| extend TLD = tostring(split(Name, ".")[-1])
| extend Domain = strcat(tostring(split(Name, ".")[-2]), ".", TLD)
| summarize QueryCount = count(), UniqueClients = dcount(ClientIP)
    by Domain
| where QueryCount > 10
// Cross-reference with TI or WHOIS data for registration date
// Domains registered in the last 30 days are suspicious

Hunting query optimisation for large datasets

Hunting queries run on-demand, but slow queries waste analyst time. Optimise for large datasets.

Filter before summarise. Apply where clauses as early as possible in the query — this reduces the data volume that summarize processes.

Use let for materialised lookups. If your query joins with a watchlist or small table, materialise it first: let KnownIPs = materialize(_GetWatchlist('CorporateIPs') | project SearchKey);. This evaluates the watchlist once and caches it.
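As a concrete sketch (the watchlist name CorporateIPs is illustrative), materialising once and reusing the cached result looks like this:

```kql
// Evaluate the watchlist once and cache it; later references reuse the cache
let KnownIPs = materialize(_GetWatchlist('CorporateIPs') | project SearchKey);
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| where IPAddress !in (KnownIPs)
| summarize SigninsFromUnknownIPs = count() by UserPrincipalName
| order by SigninsFromUnknownIPs desc
```

materialize pays off most when the cached expression is referenced more than once in the same query; for a single reference the gain is minimal.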

Limit time ranges to what you need. A 90-day hunt sounds thorough but may be unnecessarily slow. Start with 7 days. If the hypothesis is not confirmed in 7 days, expand to 30 days. Only search 90 days if the hypothesis specifically requires it (e.g., “the campaign was active 3 months ago”).

Use take for exploration. When developing a new hunting query, add | take 100 to preview results quickly before running the full query. Remove take for the final execution.

Avoid unnecessary unions. If your hypothesis only applies to one data source, query that table directly — do not use ASIM unifying parsers that query all sources unnecessarily.


Pattern 9: Rare parent-child process combinations (endpoint stacking)

Pattern 7 listed individual suspicious process chains; this variant stacks them instead, counting each parent-child combination and surfacing the rare ones. Attackers often use legitimate tools (powershell.exe, cmd.exe, wscript.exe) in unusual parent-child combinations.

// Find suspicious process trees: unusual parent-child relationships
DeviceProcessEvents
| where TimeGenerated > ago(7d)
| where FileName in~ ("powershell.exe", "cmd.exe", "wscript.exe", "cscript.exe",
    "mshta.exe", "regsvr32.exe", "rundll32.exe")
| extend ParentProcess = InitiatingProcessFileName
| summarize
    ExecutionCount = count(),
    UniqueDevices = dcount(DeviceName),
    SampleCommand = take_any(ProcessCommandLine)
    by ParentProcess, FileName
| where ExecutionCount < 5  // Rare parent-child combinations
| where ParentProcess !in~ ("explorer.exe", "services.exe", "svchost.exe",
    "winlogon.exe")  // Exclude known-normal parents
| order by ExecutionCount asc

A rare parent-child process combination — especially something like outlook.exe spawning powershell.exe — is a strong indicator of macro-based malware or phishing-delivered payload execution.

Pattern 10: Beacon detection (network hunting)

Identify regular-interval network connections that may indicate command-and-control beaconing. C2 implants typically call home at predictable intervals.

// Find beaconing patterns: regular-interval connections to external IPs
DeviceNetworkEvents
| where TimeGenerated > ago(7d)
| where ActionType == "ConnectionSuccess"
| where not(ipv4_is_private(RemoteIP))
| summarize Connections = count(), Times = make_list(TimeGenerated, 100)
    by DeviceName, RemoteIP
| where Connections > 20  // Minimum connections for pattern detection
| extend Intervals = array_sort_asc(Times)
| mv-apply Intervals to typeof(datetime) on (
    extend NextTime = next(Intervals)
    | where isnotnull(NextTime)
    | extend IntervalSeconds = datetime_diff('second', NextTime, Intervals)
    | summarize
        AvgInterval = avg(IntervalSeconds),
        StdDevInterval = stdev(IntervalSeconds)
)
| where StdDevInterval < (AvgInterval * 0.1)  // Low variance = regular beaconing
| where AvgInterval between (30 .. 3600)  // 30 seconds to 1 hour interval
| project DeviceName, RemoteIP, Connections,
    AvgIntervalMinutes = round(AvgInterval / 60.0, 1),
    Regularity = round(1 - (StdDevInterval / AvgInterval), 2)
| order by Regularity desc

A remote IP receiving connections at near-exact intervals (regularity > 0.9) from a single device is the beaconing signature. Legitimate services also beacon (Windows Update, telemetry) — filter these known services by IP or domain before investigating.
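To cut that noise before interval analysis, a pre-filter can drop known-benign destinations first. A sketch; the watchlist name KnownTelemetryDestinations is hypothetical, and the has_any allow-list is deliberately crude:

```kql
// Drop known-benign beacon destinations before running the beacon hunt
DeviceNetworkEvents
| where TimeGenerated > ago(7d)
| where ActionType == "ConnectionSuccess"
| where not(ipv4_is_private(RemoteIP))
| where RemoteIP !in (_GetWatchlist('KnownTelemetryDestinations') | project SearchKey)
| where not(RemoteUrl has_any ("windowsupdate", "microsoft", "office365"))
```

Pipe the filtered result into the summarize stage of the beacon query above; fewer candidate IP pairs means the interval statistics run faster and the output is cleaner.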

Hunting queries are hypotheses expressed in KQL

Every query pattern in this subsection tests a specific hypothesis about attacker behaviour. Rare events = "has an unusual tool been used?" Statistical outliers = "is anyone behaving abnormally?" First-time occurrences = "has something new appeared?" Temporal anomalies = "is anyone active when they should not be?" Cross-table correlation = "do multiple events form an attack chain?" Stacking = "what is the distribution of this attribute, and are there outliers?" Process trees = "did a trusted application spawn something it should not?" Beaconing = "is any host calling home on a fixed schedule?" Master these patterns and you can hunt for almost any threat.

Try it yourself

Run Pattern 3 (first-time country) and Pattern 4 (temporal anomaly) against your workspace. Review the results: are there users signing in from new countries? Are there sign-ins during unusual hours? For each result, determine: is this suspicious, benign, or inconclusive? If suspicious, create a bookmark. This is one complete hunting iteration — hypothesis (new country = potential compromise), query, analysis, and evidence collection.

What you should observe

In a lab, results may be limited. In a production environment, first-time country queries typically return 5-20 results per week — most are legitimate travel, but 1-2 may warrant investigation. Night-time sign-in queries surface both legitimate late workers and potential compromises. The skill is distinguishing between them — which requires correlating with additional context (was the user on a business trip? Is the IP a known VPN?).


Knowledge check

Compliance mapping

NIST CSF: DE.AE-1 (Baseline of operations established), PR.DS-1 (Data-at-rest is protected). ISO 27001: A.8.15 (Logging), A.8.16 (Monitoring activities). SOC 2: CC7.2 (Monitor system components). Every configuration in this subsection contributes to the logging and monitoring controls that auditors verify.


Check your understanding

1. Which hunting pattern would you use to find an attacker using a custom tool that your analytics rules do not detect?

Rare event discovery — search for rarely-seen processes or files
Temporal anomaly — check for unusual hours
Indicator-driven — search for known IOCs
Statistical outlier — find above-baseline users

Answer: Rare event discovery (Pattern 1). Search for processes or files that appear only once or twice across the entire environment. Attacker tools are inherently rare — they do not exist on legitimate endpoints. A process that has executed only twice in 30 days across 500 devices is a strong hunting target.