TH1.10 Behavioral Baselining Methodology

3-4 hours · Module 1 · Free
Operational Objective
Many hunt campaigns depend on comparing current behavior against a baseline of "normal." If the baseline is wrong — too short, too noisy, or contaminated by the attack you are looking for — the hunt produces false negatives (missing real anomalies) or false positives (flagging legitimate changes as threats). This subsection teaches how to construct baselines that are reliable enough to hunt against.
Deliverable: The ability to construct per-user and per-entity behavioral baselines in KQL, select appropriate baseline windows, handle edge cases (new users, role changes, seasonal variation), and avoid the contamination trap.
⏱ Estimated completion: 25 minutes

What "normal" means in your environment

A baseline is a quantitative description of what normal looks like for a specific entity over a specific time window. "Normal" for the CEO's authentication pattern is different from "normal" for a service account. "Normal" for SharePoint access in December (end-of-year reporting) is different from "normal" in March.

Baselines are per-entity, not global. A global baseline ("the average user signs in from 2.3 unique IPs per week") obscures the individual patterns that make anomaly detection work. The SOC analyst who uses VPN from three countries while traveling has a different normal than the accountant who signs in from the same office every day. A global threshold catches the accountant's first new IP but misses the traveler's fifth — or flags the traveler constantly while ignoring the accountant's one anomaly.

Baseline construction in KQL

// Per-user authentication baseline: IP, location, device, app
let baselineStart = ago(37d);  // 37 days back = 30-day baseline + 7-day gap
let baselineEnd = ago(7d);     // Ends 7 days ago (gap prevents contamination)
let baseline = SigninLogs
| where TimeGenerated between (baselineStart .. baselineEnd)
| where ResultType == 0
| summarize
    BaselineIPs = make_set(IPAddress, 30),
    BaselineCountries = make_set(
        tostring(LocationDetails.countryOrRegion), 10),
    BaselineDevices = make_set(
        tostring(DeviceDetail.displayName), 15),
    BaselineApps = make_set(AppDisplayName, 20),
    AvgDailySignIns = count() / 30.0  // Average sign-ins per day
    by UserPrincipalName;
// This baseline captures each user's normal:
//   which IPs they sign in from
//   which countries those IPs resolve to
//   which devices they use
//   which applications they access
//   how many sign-ins per day is typical
// Any deviation from this baseline in the detection window is a candidate anomaly

// Identify new users who will not have full baselines
let baselineWindow = 30d;
let lookback = 90d;  // Must exceed baselineWindow, or every FirstSeen falls inside it
SigninLogs
| where TimeGenerated > ago(lookback)
| where ResultType == 0
| summarize FirstSeen = min(TimeGenerated) by UserPrincipalName
| where FirstSeen > ago(baselineWindow)
// These users have less than 30 days of sign-in history
// Baseline comparison will produce false positives for them
// Either exclude or flag results as low-confidence

The standard pattern: aggregate historical data per entity over a defined window to create a reference, then compare recent data against that reference.

The gap window: preventing contamination

Notice the 7-day gap between the baseline window end and the present. This is not arbitrary. If the attacker has been present for 5 days, and your baseline extends to the present, the attacker's activity is in the baseline. The baseline now considers the attacker's IP as "normal" — because it has been seen during the baseline period. The hunt misses the compromise.

The gap window must be at least as long as the detection window. If you are hunting in the last 7 days, the baseline should end 7 days ago. If the attacker entered during the baseline period (before the gap), their activity is in the baseline and will not surface as a simple set-membership anomaly — but it typically continues into the detection window as a consistent pattern, which volume and pattern analysis can still detect.

For campaigns where longer dwell time is expected (APT, insider threat), extend the gap. A 90-day baseline ending 30 days ago, with a 30-day detection window, provides protection against attackers with up to 30 days of dwell time.
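Putting the two windows together, the comparison step might look like the sketch below: sign-ins in the 7-day detection window are joined against the 30-day baseline, and any IP or country absent from the user's baseline sets is surfaced. The baseline query is repeated so the sketch is self-contained; the inner join deliberately drops users with no baseline (the new-user edge case).

```kql
// Detection window vs. baseline: surface sign-ins from unseen IPs/countries
let baseline = SigninLogs
| where TimeGenerated between (ago(37d) .. ago(7d))
| where ResultType == 0
| summarize
    BaselineIPs = make_set(IPAddress, 30),
    BaselineCountries = make_set(tostring(LocationDetails.countryOrRegion), 10)
    by UserPrincipalName;
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == 0
| extend Country = tostring(LocationDetails.countryOrRegion)
| join kind=inner baseline on UserPrincipalName
| where not(set_has_element(BaselineIPs, IPAddress))
    or not(set_has_element(BaselineCountries, Country))
| summarize
    NewIPs = make_set_if(IPAddress, not(set_has_element(BaselineIPs, IPAddress))),
    NewCountries = make_set_if(Country, not(set_has_element(BaselineCountries, Country))),
    AnomalousSignIns = count()
    by UserPrincipalName
| order by AnomalousSignIns desc
```

Every row in the output is a candidate anomaly, not a finding — disposition still requires the enrichment and edge-case checks covered below.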

Edge cases that break baselines

New users. A user who joined the organization 2 weeks ago has no 30-day baseline; every sign-in is "new" by definition. Either exclude users with less history than the baseline window, or apply a shorter baseline and flag their results as low-confidence.

Role changes. A user who transferred from the London office to the New York office last week will sign in from a new country. A user promoted to a new role will access new applications and resources. The baseline reflects the old role. The detection window reflects the new one. Every access in the new role is "anomalous" against the old baseline.

Mitigation: enrich baseline anomalies with HR/directory data. Check AuditLogs for recent role or group membership changes. If the user's role changed during the gap window, the baseline comparison is less reliable — flag but do not escalate without additional indicators.
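One way to implement this enrichment is to pull recent role and group changes from AuditLogs. The OperationName values below are common Microsoft Entra audit operation names, but they vary by tenant and configuration — verify them against your own logs before relying on the filter:

```kql
// Users with recent role/group changes: baseline comparisons are low-confidence
AuditLogs
| where TimeGenerated > ago(14d)  // Gap window + detection window
| where OperationName in (
    "Add member to role",
    "Add member to group",
    "Update user")
| extend TargetUser = tostring(TargetResources[0].userPrincipalName)
| where isnotempty(TargetUser)
| summarize
    Changes = make_set(OperationName, 10),
    LastChange = max(TimeGenerated)
    by TargetUser
```

Join this result against baseline anomalies on the user principal name: a match means "flag but do not escalate without additional indicators."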

Seasonal variation. Download volumes in a finance department spike during quarter-end reporting. Travel-related sign-in anomalies increase during conference season. If your baseline window captures a low-activity period and the detection window captures a high-activity period (or vice versa), the comparison produces systematic bias.

Mitigation: for campaigns sensitive to seasonal variation (TH8 data exfiltration, TH13 insider threat), use a same-period-last-year baseline if data retention allows, or use a 90-day baseline that spans at least one business cycle.

[Figure TH1.10 diagram: a timeline reading left to right — BASELINE WINDOW (30 days, "What does normal look like?"), GAP (7 days, "Prevents contamination"), DETECTION (7 days, "What is different now?"), ending at now(). Without the gap, an attacker present during the baseline period is treated as "normal"; the gap ensures the baseline reflects pre-attack behavior.]

Figure TH1.10 — Baseline construction with gap window. The gap prevents attacker activity from contaminating the baseline, ensuring anomalies in the detection window are measured against genuine pre-attack behavior.

Try it yourself

Exercise: Build and test a per-user IP baseline

Run the baseline construction query from this subsection against your environment. Then run the new-user identification query to find users who will not have full baselines.

Examine 3 users from the baseline results. For each, check: does the baseline IP set match your expectation of their normal behavior? Does the average daily sign-in count seem reasonable for their role? If the baseline does not match your environmental knowledge of the user, the baseline window or the aggregation logic needs adjustment.

This validation step is critical — building a baseline you have not validated is building on assumptions. Validate before hunting against it.

Building the baseline before the hunt

The baseline query runs BEFORE the hunt query. If you are hunting for anomalous SharePoint access, first establish what normal SharePoint access looks like: which users access which libraries, at what volume, during which hours, from which IPs. This baseline becomes the denominator against which the hunt query's results are evaluated. Without the baseline, every finding is ambiguous — is 47 file downloads in one hour anomalous? Without knowing that the user's P95 is 12 downloads per hour, you cannot answer that question. With the baseline, the answer is definitive: 47 downloads is 3.9x the user's P95, which exceeds the 2x anomaly threshold defined in the hunt methodology.
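The P95 comparison described above can be sketched in KQL as follows. The OfficeActivity table and the FileDownloaded operation come from the Microsoft 365 connector; note that P95 here is computed over hours with at least one download, which is a simplification:

```kql
// Per-user P95 hourly download baseline, then flag detection-window hours > 2x P95
let p95Baseline = OfficeActivity
| where TimeGenerated between (ago(37d) .. ago(7d))
| where Operation == "FileDownloaded"
| summarize HourlyDownloads = count() by UserId, Hour = bin(TimeGenerated, 1h)
| summarize P95Hourly = percentile(HourlyDownloads, 95) by UserId;
OfficeActivity
| where TimeGenerated > ago(7d)
| where Operation == "FileDownloaded"
| summarize HourlyDownloads = count() by UserId, Hour = bin(TimeGenerated, 1h)
| join kind=inner p95Baseline on UserId
| where HourlyDownloads > 2 * P95Hourly  // The 2x anomaly threshold
| project UserId, Hour, HourlyDownloads, P95Hourly,
    Ratio = round(todouble(HourlyDownloads) / P95Hourly, 1)
| order by Ratio desc
```

In the worked example from the paragraph above, a user with a P95 of 12 downloads per hour who downloads 47 files in one hour would surface with a Ratio of 3.9.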

The queries developed during this exercise become reusable templates in your personal hunting library. Parameterise the hardcoded values (user names, IP addresses, time windows) and add a header comment explaining the hypothesis each query tests. A mature hunting program maintains 50-100 parameterised query templates that any team member can execute — reducing the per-hunt preparation time from hours to minutes and ensuring consistent methodology across analysts.
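As a sketch of that template style, the baseline query might be parameterised like this. The header comment fields and all parameter values below are placeholders to be filled in per hunt:

```kql
// Hypothesis: <one sentence — what attacker behavior would this surface?>
// ATT&CK: <technique ID>    Last validated: <date>
// Parameters — edit per hunt (values below are placeholders):
let baselineDays = 30d;
let gapDays = 7d;
let targetApp = "Office 365 SharePoint Online";
SigninLogs
| where TimeGenerated between (ago(baselineDays + gapDays) .. ago(gapDays))
| where ResultType == 0
| where AppDisplayName =~ targetApp
| summarize
    BaselineIPs = make_set(IPAddress, 30),
    AvgDailySignIns = count() / (baselineDays / 1d)  // Per-day average over the window
    by UserPrincipalName
```

Because the window arithmetic is derived from the parameters, changing baselineDays or gapDays adjusts both the lookback and the per-day average consistently — one of the main ways parameterised templates prevent copy-paste errors across analysts.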

The baseline itself is an artifact worth preserving. Store the baseline query, its results, and the date it was computed alongside each hunt. When the hunt is repeated in 30 days, the baseline may have shifted — seasonal patterns, new employees, infrastructure changes all affect what 'normal' looks like. Comparing the current baseline against the previous baseline reveals environmental drift before it causes false positives in production detection rules.

⚠ Compliance Myth: "A 7-day baseline is sufficient for behavioral detection"

The myth: Short baselines are sufficient because they capture recent behavior most accurately.

The reality: A 7-day baseline captures one work week. It does not capture monthly activities (first-of-month reporting), biweekly patterns (payroll processing), seasonal variation, or infrequent but legitimate activities (quarterly board meeting access, annual audit preparation). A 30-day baseline captures a full business cycle. A 90-day baseline captures seasonal patterns. Shorter baselines produce more false positives because they treat infrequent-but-legitimate activity as anomalous. The appropriate baseline length depends on the technique: authentication anomalies work well with 30 days. Data exfiltration may need 90 days to capture business cycle variation.

Extend this methodology

TH2 (Advanced KQL for Hunting) introduces `make-series` and `series_decompose_anomalies()` — KQL functions that build statistical baselines automatically and flag deviations. The manual baseline methodology in this subsection is the conceptual foundation; the `make-series` approach automates it for scheduled or repeated hunts. Learn the manual approach first (it builds the intuition for what "normal" means in your data), then apply the automated approach for scale.
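As a preview of that automated approach, a minimal sketch over daily sign-in counts might look like this. The 1.5 argument is the function's default anomaly threshold; the flags in the most recent days are the hunt candidates:

```kql
// Automated baseline: per-user daily sign-in series, decomposed for anomalies
SigninLogs
| where TimeGenerated > ago(37d)
| where ResultType == 0
| make-series DailySignIns = count()
    on TimeGenerated from ago(37d) to now() step 1d
    by UserPrincipalName
| extend (Anomalies, Score, Baseline) =
    series_decompose_anomalies(DailySignIns, 1.5)
| mv-expand TimeGenerated to typeof(datetime),
    DailySignIns to typeof(long),
    Anomalies to typeof(long)
| where Anomalies != 0  // Day deviates from the computed statistical baseline
```

Note that this series runs to now() with no gap window — the function's internal decomposition handles trend and seasonality, but the contamination caveats from this subsection still apply when interpreting its output.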


NE environmental considerations

NE's detection environment includes specific factors that influence how this baselining methodology operates:

Device diversity: 768 P2 corporate workstations with full Defender for Endpoint telemetry, 58 P1 manufacturing workstations with basic cloud-delivered protection, and 3 RHEL rendering servers with Syslog-only coverage. Rules targeting DeviceProcessEvents operate with full fidelity on P2 devices but may have reduced visibility on P1 devices. Manufacturing workstations in Sheffield and Sunderland represent a detection gap for endpoint-level detections.


Network topology: 11 offices connected via Palo Alto SD-WAN with full-mesh connectivity. The SD-WAN firewall logs feed CommonSecurityLog in Sentinel. Cross-site lateral movement generates firewall allow events that correlate with DeviceLogonEvents — enabling multi-source detection that single-table rules cannot achieve.

User population: 810 users with distinct behavioral profiles — office workers (predictable hours, consistent applications), field engineers (variable hours, travel patterns), IT administrators (elevated privilege, broad access patterns), and manufacturing operators (fixed shifts, limited application access). Each user population has different detection baselines.

Decision point

You have time for one hunt this quarter. Do you hunt for the threat in the latest advisory or for the gap in your ATT&CK coverage matrix?

Hunt the coverage gap. Advisories describe threats that are CURRENT but may not target NE. Coverage gaps describe techniques that COULD target NE and would succeed undetected. The coverage gap hunt produces a detection rule (closing the gap permanently). The advisory-driven hunt produces a point-in-time assessment (confirming the specific threat is not present today). Both are valuable — but the coverage gap hunt has a longer-lasting impact because it produces a permanent detection improvement.

A hunt query returns 200 results. You have 4 hours remaining in the hunt window. You can investigate 20 results thoroughly or review all 200 superficially. Which approach produces better hunt outcomes?
  • Review all 200 — you might miss a critical finding in the 180 you skip.
  • Investigate 20 thoroughly (correct). A superficial review of 200 results produces 200 "looked at it, seemed okay" assessments that provide no investigative value and no documentation for future reference. A thorough investigation of 20 results produces confirmed findings (true positives requiring remediation), confirmed benign patterns (documented baselines for future comparison), and inconclusive results (flagged for monitoring). Prioritise the 20 by highest anomaly score, highest-value assets involved, and highest-risk users involved. Document why the remaining 180 were not investigated and recommend a follow-up hunt with refined query criteria to reduce the result set.
  • Investigate 20 — but only if they are from the most recent 24 hours.
  • Neither — refine the query first to reduce the result set below 50.

You understand the detection gap and the hunt cycle.

TH0 showed you what detection rules fundamentally cannot catch. TH1 gave you the hypothesis-driven methodology that closes that gap. Now you run the hunts.

  • 10 complete hunt campaigns — from hypothesis through KQL execution through finding disposition, each campaign based on a real TTP
  • 70 production hunt queries — every one mapped to MITRE ATT&CK and tested against realistic telemetry
  • Advanced KQL for hunting — UEBA composite risk scoring, retroactive IOC sweeps, and hunt management metrics
  • Hypothesis-Driven Hunt Toolkit lab pack — 30 days of realistic M365 and endpoint telemetry with multiple attack patterns seeded in
  • TH16 — Scaling hunts across a team — the operating model for a production hunt program