TH2.6 autocluster() for Behavioral Grouping
Let the data tell you its patterns
You ran a hunt query and got 200 sign-ins from new IPs. Scanning all 200 manually takes hours. Most share common characteristics — VPN rotations, mobile network changes, corporate proxy IPs. The 5 that do not share those characteristics are the ones worth investigating.
autocluster() identifies the shared characteristics automatically. It is a plugin, so you invoke it with evaluate. It returns segments (groups of results that share column values) together with the count and percentage of results each segment covers. Segments can overlap and usually do not cover every row. The results not covered by any segment are the non-conforming outliers.
Basic autocluster on hunt results
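A minimal sketch of this step, assuming the Microsoft Sentinel SigninLogs schema. The 30-day baseline, the 1-day hunt window, and the choice of projected columns are illustrative assumptions, not prescribed values.

```kusto
// Assumption: Microsoft Sentinel SigninLogs schema; time windows are illustrative.
let baseline = 30d;
let window = 1d;
// User/IP pairs already seen during the baseline period
let KnownIPs = SigninLogs
    | where TimeGenerated between (ago(baseline) .. ago(window))
    | where ResultType == "0"
    | distinct UserPrincipalName, IPAddress;
SigninLogs
| where TimeGenerated > ago(window)
| where ResultType == "0"
// Keep only successful sign-ins from an IP this user has not used before
| join kind=leftanti KnownIPs on UserPrincipalName, IPAddress
// Cluster on descriptive columns only; unique values such as the IP itself add noise
| project Location, AppDisplayName, ClientAppUsed,
          OS = tostring(DeviceDetail.operatingSystem),
          Browser = tostring(DeviceDetail.browser)
| evaluate autocluster()
```

Each output row is one segment, with SegmentId, Count, and Percent columns followed by the projected columns. With the default settings, a column left empty in a segment row means that segment places no restriction on it.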
Extracting the outliers
autocluster() does not directly return the unclustered results. You need to identify them by excluding the clustered patterns:
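One way to express the exclusion is a `where not(...)` line per legitimate segment. This is a sketch under the same SigninLogs assumption; every segment value below is hypothetical and must be replaced with the values autocluster returned for your data.

```kusto
// Assumption: Microsoft Sentinel SigninLogs schema.
// All segment values in the filters below are hypothetical placeholders.
let KnownIPs = SigninLogs
    | where TimeGenerated between (ago(30d) .. ago(1d))
    | where ResultType == "0"
    | distinct UserPrincipalName, IPAddress;
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| join kind=leftanti KnownIPs on UserPrincipalName, IPAddress
| extend OS = tostring(DeviceDetail.operatingSystem),
         Browser = tostring(DeviceDetail.browser)
// Hypothetical segment 1, judged legitimate after review
| where not(Location == "GB" and OS == "Windows10" and AppDisplayName == "Office 365 Exchange Online")
// Hypothetical segment 2, judged legitimate after review
| where not(Location == "GB" and OS == "iOS" and AppDisplayName == "Microsoft Teams")
// What remains are the rows that fit none of the excluded segments
| project TimeGenerated, UserPrincipalName, IPAddress, Location, AppDisplayName, OS, Browser
```

Exact equality mirrors how autocluster defines a segment. Only exclude a segment after you have confirmed that it corresponds to a pattern you recognize as legitimate.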
This is a manual process because autocluster’s output is descriptive, not a filter. The analyst reads the cluster definitions, constructs exclusion filters for the legitimate patterns, and investigates what remains. The automation is in the pattern identification, not the filtering.
Hunting application: clustering audit log operations
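A sketch of the same approach applied to audit data, assuming the Microsoft Entra AuditLogs table in Microsoft Sentinel. The 7-day window, the derived Actor and TargetType columns, and the column selection are illustrative assumptions.

```kusto
// Assumption: Microsoft Sentinel AuditLogs schema; window and columns are illustrative.
AuditLogs
| where TimeGenerated > ago(7d)
// The initiator is either a user or an application; take whichever is populated
| extend ActorUser = tostring(InitiatedBy.user.userPrincipalName),
         ActorApp = tostring(InitiatedBy.app.displayName)
| extend Actor = iff(isnotempty(ActorUser), ActorUser, ActorApp)
// Type of the first target resource (for example a user, group, or application)
| extend TargetType = tostring(TargetResources[0].type)
| project OperationName, Category, Result, LoggedByService, Actor, TargetType
| evaluate autocluster()
| order by Percent desc
```

The largest segments typically describe routine automated activity. Operations that fall outside every segment are the ones to review first.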
Figure TH2.6 — autocluster workflow. Identify the dominant patterns automatically. Investigate what does not fit.
Try it yourself
Exercise: Cluster your new-IP sign-ins
Run the first query (new-IP sign-ins) against your environment, then apply autocluster(). Examine the segments: do they match your knowledge of legitimate patterns (VPN, mobile, travel)?
Construct exclusion filters for the known-legitimate clusters. How many results remain? Those are your hunting leads — investigate each using the five enrichment dimensions from TH1.4.
The myth: If autocluster places a sign-in in the “VPN rotation” cluster, it is legitimate. Cluster membership equals safety.
The reality: autocluster groups by statistical similarity, not by legitimacy. An attacker who routes through a corporate VPN egress IP clusters with legitimate VPN users — because the clustering algorithm sees the same country, the same IP range, and the same browser. Cluster membership reduces suspicion but does not eliminate it. The outliers are the highest-priority leads, but clustered results may still contain compromises that happen to share characteristics with legitimate activity. Campaign modules address this by applying enrichment dimensions (TH1.4) to both outliers and a sample of clustered results.
Extend this operator
autocluster() accepts a `SizeWeight` parameter (a value between 0 and 1, default 0.5) that controls the tradeoff between how specific each segment is and how much of the data it covers. Lower values produce more segments, each sharing more column values but covering a smaller percentage of results. Higher values produce fewer, broader segments that each cover a larger percentage. For hunting, a SizeWeight at or slightly below the default (0.3–0.5) is preferred because more specific segments translate more directly into exclusion filters for legitimate patterns. Start with the default and lower it if the segments are too broad to act on.
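SizeWeight is the first positional argument to the plugin. The sketch below assumes the Microsoft Sentinel SigninLogs schema, and the value 0.3 is only an example. Run it once as written and once with `autocluster()` to compare the segments.

```kusto
// Assumption: Microsoft Sentinel SigninLogs schema; 0.3 is an illustrative value.
SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| project Location, AppDisplayName, ClientAppUsed,
          OS = tostring(DeviceDetail.operatingSystem),
          Browser = tostring(DeviceDetail.browser)
// First positional argument is SizeWeight; omit it to use the default of 0.5
| evaluate autocluster(0.3)
| order by Percent desc
```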
References Used in This Subsection
- Microsoft. “KQL autocluster Plugin.” Microsoft Learn. https://learn.microsoft.com/en-us/kusto/query/autocluster-plugin