TH2.6 autocluster() for Behavioral Grouping

4-5 hours · Module 2 · Free

Operational Objective

When a hunt query returns 200 results, manually categorizing them is time-consuming and inconsistent. autocluster() automatically groups results by their shared characteristics — revealing the dominant patterns and, critically, the results that do not fit any pattern. In hunting, the outliers that do not cluster are often the most interesting findings.

Deliverable: The ability to use autocluster() on hunt result sets to automatically identify patterns and surface non-conforming results for investigation.

⏱ Estimated completion: 20 minutes

Let the data tell you its patterns

You ran a hunt query and got 200 sign-ins from new IPs. Scanning all 200 manually takes hours. Most share common characteristics — VPN rotations, mobile network changes, corporate proxy IPs. The 5 that do not share those characteristics are the ones worth investigating.

autocluster() identifies the shared characteristics automatically. It returns segments, which are groups of results that share the same values across several columns, along with the count and percentage of results each segment covers. Segments can overlap, so the percentages need not sum to 100. The results not covered by any segment are the non-conforming outliers.

Basic autocluster on hunt results

// Cluster sign-ins from new IPs to find the outliers
let newIPSignIns = SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == 0
| join kind=anti (
    SigninLogs
    | where TimeGenerated between (ago(37d) .. ago(7d))
    | where ResultType == 0
    | distinct UserPrincipalName, IPAddress
) on UserPrincipalName, IPAddress
// Sign-ins from IPs not seen in the prior 30 days
| project UserPrincipalName, IPAddress,
    Country = tostring(LocationDetails.countryOrRegion),
    City = tostring(LocationDetails.city),
    Browser = tostring(DeviceDetail.browser),
    OS = tostring(DeviceDetail.operatingSystem),
    App = AppDisplayName,
    RiskLevel = RiskLevelDuringSignIn;
newIPSignIns
| evaluate autocluster()
// Output: segments that describe the dominant patterns
// Example segments:
//   Segment 1 (45%): Country="GB", Browser="Chrome", OS="Windows 10"
//     → likely corporate VPN users rotating IPs
//   Segment 2 (30%): Country="GB", Browser="Edge", App="Microsoft Outlook"
//     → likely mobile users on new cell networks
//   Segment 3 (15%): Country="US", OS="Windows 10", RiskLevel="none"
//     → likely VPN split-tunnel users
//   Unclustered (10%): various countries, various browsers
//     → these are the hunting leads — they do not fit any pattern
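
To see the shape of the output without touching production data, the sketch below runs autocluster() over a small inline table. The rows are invented for illustration and only stand in for the projected sign-in columns. The result carries SegmentId, Count, and Percent columns ahead of the input columns, and an empty cell means the segment places no constraint on that column.

```kusto
// Hypothetical sample rows standing in for the projected sign-in columns.
let sampleSignIns = datatable(Country:string, Browser:string, OS:string, App:string)
[
    "GB", "Chrome",  "Windows 10", "Microsoft Teams",
    "GB", "Chrome",  "Windows 10", "Microsoft Teams",
    "GB", "Chrome",  "Windows 10", "SharePoint Online",
    "GB", "Chrome",  "Windows 10", "SharePoint Online",
    "GB", "Edge",    "iOS",        "Microsoft Outlook",
    "GB", "Edge",    "iOS",        "Microsoft Outlook",
    "GB", "Edge",    "Android",    "Microsoft Outlook",
    "US", "Firefox", "Windows 10", "Azure Portal",
    "BR", "Safari",  "MacOs",      "Azure Portal",
    "VN", "Chrome",  "Linux",      "Microsoft Teams"
];
sampleSignIns
| evaluate autocluster()
// Largest segments first; each row is a pattern, not an individual sign-in
| order by Percent desc
```

Each output row describes a pattern rather than a sign-in, which is why the segment definitions still have to be turned into filters before individual results can be reviewed.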

Extracting the outliers

autocluster() does not directly return the unclustered results. You need to identify them by excluding the clustered patterns:

// Assumes the newIPSignIns let statement from the previous query is in scope
// Step 1: Get the cluster definitions
let clusters = newIPSignIns
| evaluate autocluster();
// Step 2: Identify results that match the largest cluster
// (adapt the where clause based on autocluster output)
let cluster1 = newIPSignIns
| where Country == "GB" and Browser == "Chrome"
    and OS == "Windows 10";
// Step 3: The outliers are everything NOT in the known clusters
// (one exclusion per segment you have judged legitimate)
newIPSignIns
| where not(Country == "GB" and Browser == "Chrome"
    and OS == "Windows 10")
| where not(Country == "GB" and Browser == "Edge"
    and App == "Microsoft Outlook")
| where not(Country == "US" and OS == "Windows 10"
    and RiskLevel == "none")
// Remaining results = non-conforming sign-ins
// These warrant individual investigation using TH1.4 enrichment

This is a manual process because autocluster’s output is descriptive, not a filter. The analyst reads the cluster definitions, constructs exclusion filters for the legitimate patterns, and investigates what remains. The automation is in the pattern identification, not the filtering.
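
One way to keep the manual step auditable is to label each row with the pattern it matches instead of dropping it, then tally the labels. The sketch below is a minimal illustration: the sample rows and the label names ("corporate-vpn", "mobile", "unclustered") are hypothetical, and the conditions would come from your own autocluster output.

```kusto
// Invented rows standing in for newIPSignIns; swap in the real let statement.
let sampleSignIns = datatable(UserPrincipalName:string, Country:string, Browser:string, OS:string, App:string)
[
    "a@contoso.com", "GB", "Chrome",  "Windows 10", "Microsoft Teams",
    "b@contoso.com", "GB", "Chrome",  "Windows 10", "SharePoint Online",
    "c@contoso.com", "GB", "Edge",    "iOS",        "Microsoft Outlook",
    "d@contoso.com", "GB", "Edge",    "Android",    "Microsoft Outlook",
    "e@contoso.com", "VN", "Firefox", "Linux",      "Azure Portal"
];
sampleSignIns
// case() returns the label of the first condition that matches,
// so a row that fits two overlapping segments is counted once
| extend Pattern = case(
    Country == "GB" and Browser == "Chrome" and OS == "Windows 10", "corporate-vpn",
    Country == "GB" and Browser == "Edge" and App == "Microsoft Outlook", "mobile",
    "unclustered")
| summarize Count = count() by Pattern
| order by Count desc
```

Comparing these counts with the Count column that autocluster() reported is a quick check that the filters were transcribed correctly. Replacing the summarize line with `| where Pattern == "unclustered"` lists the outliers themselves.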

Hunting application: clustering audit log operations

// What patterns exist in directory modification events?
AuditLogs
| where TimeGenerated > ago(30d)
| where Result == "success"
| extend Actor = tostring(InitiatedBy.user.userPrincipalName)
| extend ActorApp = tostring(InitiatedBy.app.displayName)
| project OperationName, Actor, ActorApp,
    Category, TargetType = tostring(TargetResources[0].type)
| evaluate autocluster()
// Dominant clusters will be routine operations:
//   "Update user" by "Azure AD Connect" = sync operations
//   "Add member to group" by IT admins = normal admin work
// Outliers: operations by unexpected actors or unusual categories
//   These are the hunting leads

Figure TH2.6 — autocluster workflow. Identify the dominant patterns automatically. Investigate what does not fit.

Try it yourself

Exercise: Cluster your new-IP sign-ins

Run the first query (new-IP sign-ins) against your environment, then apply autocluster(). Examine the segments: do they match your knowledge of legitimate patterns (VPN, mobile, travel)?

Construct exclusion filters for the known-legitimate clusters. How many results remain? Those are your hunting leads — investigate each using the five enrichment dimensions from TH1.4.

⚠ Compliance Myth: "autocluster provides definitive categorization — results in a cluster are safe"

The myth: If autocluster places a sign-in in the “VPN rotation” cluster, it is legitimate. Cluster membership equals safety.

The reality: autocluster groups rows by shared column values, not by legitimacy. An attacker who routes through a corporate VPN egress IP clusters with legitimate VPN users, because the algorithm sees the same country, the same IP address, and the same browser. Cluster membership reduces suspicion but does not eliminate it. The outliers are the highest-priority leads, but clustered results may still contain compromises that happen to share characteristics with legitimate activity. Campaign modules address this by applying enrichment dimensions (TH1.4) to both outliers and a sample of clustered results.
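
Reviewing a sample of clustered results alongside the outliers could be sketched as follows. This is an assumption about how such a review set might be built, not the campaign modules' actual method: the rows are invented, the single cluster condition is a placeholder, and the sample size of 2 is arbitrary.

```kusto
// Invented rows standing in for newIPSignIns.
let sampleSignIns = datatable(UserPrincipalName:string, Country:string, Browser:string, OS:string)
[
    "a@contoso.com", "GB", "Chrome",  "Windows 10",
    "b@contoso.com", "GB", "Chrome",  "Windows 10",
    "c@contoso.com", "GB", "Chrome",  "Windows 10",
    "d@contoso.com", "GB", "Chrome",  "Windows 10",
    "e@contoso.com", "VN", "Firefox", "Linux"
];
let labeled = sampleSignIns
| extend Pattern = iff(Country == "GB" and Browser == "Chrome" and OS == "Windows 10",
    "corporate-vpn", "unclustered");
// Review set = every outlier plus a random sample of the clustered rows
union
    (labeled | where Pattern == "unclustered"),
    (labeled | where Pattern != "unclustered" | sample 2)
```

Because `sample` picks rows at random, each run returns a different slice of the clustered results, which spreads review effort across the cluster over repeated hunts.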

Extend this operator

autocluster() accepts a `SizeWeight` parameter (0.0–1.0) that controls the tradeoff between segment purity and segment size. Lower values produce more segments with higher purity (tighter clusters). Higher values produce fewer segments with broader coverage. For hunting, lower SizeWeight (0.3–0.5) is preferred because it produces more specific clusters — making it easier to construct exclusion filters for legitimate patterns. The default (0.5) is a reasonable starting point.
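
SizeWeight is passed as the first positional argument to the plugin. The sketch below uses invented rows and a value of 0.3 purely for illustration; rerunning it with a higher value such as 0.7 and comparing the number of segments is a quick way to see the tradeoff in your own data.

```kusto
// Invented rows standing in for newIPSignIns.
let sampleSignIns = datatable(Country:string, Browser:string, OS:string, App:string)
[
    "GB", "Chrome",  "Windows 10", "Microsoft Teams",
    "GB", "Chrome",  "Windows 10", "SharePoint Online",
    "GB", "Chrome",  "Windows 10", "SharePoint Online",
    "GB", "Edge",    "iOS",        "Microsoft Outlook",
    "GB", "Edge",    "Android",    "Microsoft Outlook",
    "US", "Firefox", "Windows 10", "Azure Portal",
    "VN", "Chrome",  "Linux",      "Microsoft Teams"
];
sampleSignIns
// 0.3 favors more specific segments over broad coverage (default is 0.5)
| evaluate autocluster(0.3)
| order by Percent desc
```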

References Used in This Subsection

Microsoft. “KQL autocluster Plugin.” Microsoft Learn. https://learn.microsoft.com/en-us/kusto/query/autocluster-plugin
