DE0.5 Measuring Detection

8-10 hours · Module 0 · Free

What you already know

You know the five data source families and what each contains. This section introduces the four metrics that measure whether a detection program is working — coverage percentage, mean time to detect, false positive rate, and rule health score. These are the numbers you'll track throughout the course and present to leadership in the capstone board report.

Why measurement matters

A detection program without measurement is an assertion. "We have good detection" is an opinion. "We cover 48% of relevant ATT&CK techniques, our median time to detect is 8 minutes, our false positive rate is 22%, and 87% of our rules are operationally healthy" is evidence. The difference between the two is the difference between a program that gets funded and one that gets cut in the next budget cycle.

Measurement also drives improvement. Without metrics, you don't know whether the rule you deployed last week improved coverage or duplicated an existing detection. You don't know whether the tuning you did this month reduced false positives or introduced a detection gap. You don't know whether the rule that hasn't fired in 90 days is perfectly tuned for a rare technique or silently broken because its data source disconnected. Measurement tells you what's working, what's failing, and what to prioritize next.

The absence of measurement is itself the most common indicator that a detection program is not functioning as an engineering discipline. An organization that can't tell you its coverage percentage, its false positive rate, or its rule health score doesn't have a detection program — it has a collection of rules. The rules may be good or bad, current or stale, functional or broken.

Without measurement, nobody knows. The first act of detection engineering is measuring the current state. Everything else follows from the numbers.

The four metrics below are the foundation of the board report you'll produce in the capstone (DE11). They're also the operational dashboard you'll use throughout the course to track your own progress. Every metric has a KQL query that calculates it — you'll run those queries against your own workspace in the content modules.

Estimated time: 35 minutes.

Four metrics that measure detection program health

Metric        Question                 Calculation                               Typical
────────────  ───────────────────────  ────────────────────────────────────────  ───────
Coverage %    What can we detect?      Covered techniques ÷ relevant techniques  10%
MTTD          How fast do we detect?   Attack execution → alert creation time    12 min
FP rate       Can analysts trust it?   False positive alerts ÷ total alerts      43%
Rule health   Is it sustainable?       Healthy rules ÷ total active rules        35%

No single metric tells the story. Coverage without quality is a false promise. Speed without accuracy is noise.

Figure DE0.5. The four metrics of detection program health. Each answers a different question. The baseline shows the typical pattern: low coverage, moderate speed, high false positive rate, poor rule health. The metrics improve together through the detection engineering lifecycle.

Scenario

Your detection program has 23 active rules with a median MTTD of 12 minutes. That sounds fast — until you realize MTTD only measures the rules that fire. For the 130 techniques with no rule, MTTD is infinite. Is 12-minute detection across 10% of the threat landscape better or worse than 60-minute detection across 50%?

Metric 1 — ATT&CK coverage percentage

Coverage percentage is the proportion of relevant ATT&CK techniques that have at least one active detection rule. Section 3 introduced the concept. Here's the precise definition.

Numerator:

Count the distinct ATT&CK techniques with at least one active, healthy analytics rule in your Sentinel workspace. "Healthy" means the rule has fired at least once in the past 90 days or targets a technique rare enough that zero fires is expected (ransomware deployment, for instance, should not fire often).

A dormant rule that hasn't fired because its data source disconnected doesn't count.

Denominator:

Count the ATT&CK techniques relevant to your organization. Relevance is determined by your threat landscape (which threat groups target your industry), your technology stack (which techniques your platform enables), and your crown jewels (which attack paths reach your critical data). The threat modeling module (DE2) teaches the formal prioritization methodology.

For now, a manufacturing company running M365 E5 with hybrid infrastructure has approximately 145 relevant techniques. A cloud-only SaaS company might have 90. A financial services firm with extensive on-premises infrastructure might have 170.

The formula:

(distinct covered techniques / relevant techniques) × 100 = coverage percentage.

Worked example: 15 techniques covered / 145 relevant = 10.3%. That means 89.7% of the attack techniques an adversary could use against the organization generate zero alerts.
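The same calculation can be sketched in KQL. This is a minimal illustration under stated assumptions, not the course's coverage query: both tables are hypothetical inline data (the rule names, the technique selection, and the `Healthy` flag are placeholders), and a real workspace would build them from rule metadata and the relevance list produced in DE2.

```kql
// Hypothetical relevance list (in practice, the output of threat modeling).
let RelevantTechniques = datatable(TechniqueId: string)
[
    "T1110.003", "T1078", "T1114.003", "T1059.001", "T1486", "T1566.002"
];
// Hypothetical rule-to-technique mapping with a per-rule health flag.
let RuleMappings = datatable(RuleName: string, TechniqueId: string, Healthy: bool)
[
    "Password spray",         "T1110.003", true,
    "Legacy brute force",     "T1110.003", false,
    "Impossible travel",      "T1078",     true,
    "Suspicious inbox rule",  "T1114.003", true,
    "Suspicious PowerShell",  "T1059.001", false,
    "Bluetooth exfiltration", "T1011.001", true
];
let RelevantCount = toscalar(RelevantTechniques | count);
RuleMappings
| where Healthy                                      // dormant or broken rules do not count
| join kind=inner RelevantTechniques on TechniqueId  // drop techniques outside the relevant set
| distinct TechniqueId                               // two rules on one technique count once
| count
| project Covered = Count, Relevant = RelevantCount,
          CoveragePct = round(100.0 * Count / RelevantCount, 1)
```

With the sample rows, three of the six relevant techniques count as covered. The unhealthy PowerShell rule and the out-of-scope Bluetooth rule are both excluded, which is the point of requiring "healthy" in the numerator and "relevant" in the denominator.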

Coverage is the strategic metric. It answers the board question "what proportion of attacks can we detect?" The answer is a percentage, not a rule count, and it changes the conversation from "we have 23 rules" (meaningless) to "we cover 10% of the techniques that matter" (actionable).

Metric 2 — Mean time to detect (MTTD)

MTTD measures how long it takes from when an attack technique executes to when a detection rule fires an alert. Lower is better. The ideal is near-real-time for high-impact techniques (ransomware pre-encryption, credential dumping, data exfiltration) and within one or two scheduling cycles for everything else.

MTTD is a function of three things: the rule's scheduled frequency (how often the KQL runs), the lookback window (how far back the query searches), and the query execution time plus alert processing overhead. A rule that runs every 5 minutes with a 5-minute lookback can take approximately 6-7 minutes to alert in the worst case, when the activity lands just after a run (5-minute schedule + 1-2 minutes processing).

By the same arithmetic, a Near-Real-Time (NRT) rule that runs every minute has a worst-case delay of approximately 2-3 minutes, and a rule that runs every hour has one of approximately 62 minutes.
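The delay arithmetic above is simple enough to tabulate. The sketch below assumes a flat 1-2 minutes of processing overhead, as in the text. The rule types are only labels, and the figures ignore ingestion latency, which a real workspace adds on top.

```kql
// Hypothetical rule types and their run intervals in minutes.
datatable(RuleType: string, RunEveryMinutes: int)
[
    "NRT",                1,
    "5-minute scheduled", 5,
    "Hourly scheduled",   60
]
// Delay = wait for the next run + processing overhead (assumed 1-2 minutes).
| extend DelayLowMinutes  = RunEveryMinutes + 1,
         DelayHighMinutes = RunEveryMinutes + 2
```

The schedule interval dominates the result for anything slower than NRT, so moving a rule from hourly to 5-minute scheduling matters far more than shaving processing time.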

The practical implication: high-impact techniques should use NRT rules or 5-minute scheduled rules. A ransomware pre-encryption detection rule that runs hourly gives the attacker up to 60 minutes of undetected activity after the technique executes, potentially enough time to encrypt every accessible file share. The same rule running as NRT allows 2-3 minutes of undetected activity, within which the encryption might affect one server instead of every server.

MTTD is only meaningful when paired with coverage. The median MTTD is 12 minutes — which sounds fast until you remember that it only measures the rules that fire. The 89.7% of techniques with no rule have infinite MTTD. A 5-minute MTTD across 10% coverage means 90% of attacks are never detected at all, regardless of how fast the 10% fires.

At a typical mid-size organization

The per-rule MTTD breakdown reveals operational issues the aggregate number hides. The password spray rule runs as NRT — 7-minute MTTD, appropriate for a credential attack. The impossible travel rule runs every 5 minutes — 15-minute MTTD, acceptable. But the suspicious inbox rule runs hourly — 45-minute MTTD for a persistence technique that should be caught in minutes. An attacker who creates an inbox rule has 45 minutes to use it before the alert fires. That's enough time to read three months of email and send the BEC wire transfer request.
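A per-rule breakdown like this can be computed from any log that pairs technique execution time with alert creation time, for example attack simulation results. The sketch below uses hypothetical inline data. The table shape, column names, and timestamps are assumptions made for illustration, not a Sentinel schema.

```kql
// Hypothetical test log: when the technique ran and when the alert was created.
datatable(RuleName: string, ExecutedAt: datetime, AlertedAt: datetime)
[
    "Password spray",        datetime(2024-05-01 09:00:00), datetime(2024-05-01 09:07:00),
    "Password spray",        datetime(2024-05-03 14:00:00), datetime(2024-05-03 14:06:00),
    "Password spray",        datetime(2024-05-06 11:30:00), datetime(2024-05-06 11:38:00),
    "Impossible travel",     datetime(2024-05-01 10:00:00), datetime(2024-05-01 10:15:00),
    "Impossible travel",     datetime(2024-05-04 16:00:00), datetime(2024-05-04 16:14:00),
    "Impossible travel",     datetime(2024-05-07 08:00:00), datetime(2024-05-07 08:17:00),
    "Suspicious inbox rule", datetime(2024-05-02 13:00:00), datetime(2024-05-02 13:45:00),
    "Suspicious inbox rule", datetime(2024-05-05 09:30:00), datetime(2024-05-05 10:12:00),
    "Suspicious inbox rule", datetime(2024-05-08 15:00:00), datetime(2024-05-08 15:50:00)
]
// Minutes from execution to alert for each sample.
| extend DelayMinutes = datetime_diff("minute", AlertedAt, ExecutedAt)
// Median and worst observed delay per rule.
| summarize Samples       = count(),
            MedianMinutes = percentile(DelayMinutes, 50),
            WorstMinutes  = max(DelayMinutes)
            by RuleName
| order by MedianMinutes desc
```

Sorting by the median puts the slowest rules first, which is where a schedule change is most likely to pay off.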

Metric 3 — False positive rate

False positive rate measures what percentage of alerts from your detection rules are false positives — alerts where the rule fired correctly on the KQL logic but the underlying activity was not an attack. A high FP rate means analysts waste time investigating legitimate activity. A very low FP rate might mean your thresholds are too conservative and you're missing real attacks.

Three categories require different responses:

True positive (TP):

The rule fired on actual attacker activity. This is the goal — correct detection of a real threat.

False positive (FP):

The rule fired on legitimate activity that matched the detection logic. The impossible travel rule firing on a field engineer's legitimate flight is a false positive. The fix is tuning — adjusting the rule's logic to exclude the legitimate pattern while preserving detection of the malicious one.

Benign positive (BP):

The rule fired on activity that is technically suspicious but authorized. An admin running PowerShell scripts at 3 AM during a maintenance window matches a "suspicious PowerShell execution" rule. The activity is real, matches the pattern, and is authorized. The fix is a watchlist or exception: an exclusion scoped to specific accounts performing specific actions from specific sources during specific windows.

The distinction between FP and BP matters because the fixes are different. A false positive reveals that the detection logic is too broad — the rule matches something it shouldn't.

A benign positive reveals that the environment has authorized activity that resembles attacks — the rule is correct but the environment has edge cases. Treating all noise as "false positives" and raising thresholds to eliminate it also eliminates real detections. The monthly tuning cadence in DE9 classifies each noise source by type and applies the appropriate fix.

Interpretation ranges: below 20% is excellent — analysts trust the queue. 20-40% is acceptable but needs attention. 40-60% means analysts spend more time dismissing noise than investigating. Above 60%, the detection program has lost operational credibility.
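The rate and its interpretation band can be sketched as follows. The inline data is hypothetical: the rule names, the classification labels (chosen to mirror the TP/FP/BP categories above), and the counts are all assumptions, picked so the overall rate matches the 43% used in this section's scenario.

```kql
// Hypothetical closed-alert classifications for one month.
datatable(RuleName: string, Classification: string, Alerts: int)
[
    "Impossible travel",     "FalsePositive",  40,
    "Impossible travel",     "TruePositive",    3,
    "Suspicious PowerShell", "BenignPositive", 25,
    "Suspicious PowerShell", "TruePositive",    2,
    "MFA failure",           "FalsePositive",  46,
    "MFA failure",           "TruePositive",   12,
    "Password spray",        "TruePositive",   72
]
// Benign positives are tracked separately because their fix is different.
| summarize Total           = sum(Alerts),
            FalsePositives  = sumif(Alerts, Classification == "FalsePositive"),
            BenignPositives = sumif(Alerts, Classification == "BenignPositive")
| extend FpRatePct = round(100.0 * FalsePositives / Total, 1)
// Map the rate onto the interpretation ranges from the text.
| extend Band = case(
    FpRatePct < 20,  "Excellent",
    FpRatePct < 40,  "Acceptable, needs attention",
    FpRatePct <= 60, "More noise than investigation",
    "Lost operational credibility")
```

Here 86 of 200 alerts are false positives, so the rate is 43% and falls in the 40-60% band. Adding `by RuleName` to the `summarize` would show which rules contribute most of the noise.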

At a typical mid-size organization

FP rate is 43%. The primary sources are the impossible travel rule (fires on VPN users and field engineers — 120 employees who regularly trigger geographic anomalies), the suspicious PowerShell rule (fires on the admin team's scheduled automation — 8 admins running scripts nightly), and the MFA failure rule (fires on users mistyping OTP codes — hundreds of legitimate failures per week). The L1 analyst has learned to auto-close alerts from these three rules without investigation.

The problem: the impossible travel rule occasionally catches real credential compromise from in-country proxies, and auto-closing it means Tom misses the real attack hiding in the noise. The tuning methodology in DE9 fixes this by redesigning the rules rather than disabling them. The impossible travel rule becomes a device fingerprint divergence rule that doesn't fire on VPN users. The PowerShell rule gets a watchlist exclusion for specific admin accounts running specific scripts from specific IPs. The MFA failure rule gets a threshold that distinguishes "user mistyped" from "attacker spraying."

Metric 4 — Rule health score

Rule health measures the operational condition of your detection rules. Not all active rules are functioning.

Some are dormant — they haven't fired in months because their data source disconnected, their threshold is impossibly high, or the technique they detect no longer occurs. Some are noisy — they fire so frequently that analysts ignore them. Both categories appear in the active rule count but contribute nothing to detection.

Four categories:

Healthy:

Fires periodically, produces a manageable volume of alerts, and at least some alerts lead to investigation. These rules are doing their job.

Dormant:

Hasn't fired in 60+ days. Two common causes: the data source stopped ingesting (a connector disabled or broken), or the threshold is so conservative that no activity, legitimate or malicious, reaches it. Dormant rules give false confidence. They appear in the rule count and detect nothing.

Noisy:

Fires more than 5 times per day consistently. At this volume, analysts filter the rule from their view or auto-close its alerts. A noisy rule is operationally equivalent to no rule: the alerts exist but nobody investigates them. Noisy rules need immediate tuning or redesign.

Low-volume:

Fires rarely but recently. Could be a well-tuned rule for a rare technique (good) or a rule with an overly restrictive threshold that catches edge cases (needs review).

Rule health percentage: healthy rules / total active rules. This tells you what proportion of your detection library is actually contributing to security.

At a typical mid-size organization

Of 23 rules, 8 are healthy, 6 are dormant (two because the SecurityEvent data connector disconnected during an agent update, four because thresholds are so high they never trigger), 5 are noisy (100+ alerts per month that analysts auto-close), and 4 are low-volume. Rule health: 35%. The detection program is not 23 rules. It is 8 functional rules and 15 rules in various states of failure.

A rule health query that classifies rules into the four categories uses the SecurityAlert table to check when each rule last fired and how often:

KQL
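// Rule health classification over the last 90 days.
// Rules with no alerts in the window do not appear in SecurityAlert at all.
SecurityAlert
| where TimeGenerated > ago(90d)
| summarize
    LastFired = max(TimeGenerated),
    AlertCount = count(),
    DailyAvg = count() / 90.0          // average alerts per day across the window
    by AlertName = DisplayName
| extend HealthStatus = case(
    LastFired < ago(60d), "Dormant",   // checked first so an old burst isn't labeled Noisy
    DailyAvg > 5, "Noisy",
    AlertCount < 3, "Low-volume",
    "Healthy"
)
| summarize count() by HealthStatus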

This is the operational concept — the full query in DE10 cross-references the rule configuration table to catch rules that exist but have never fired at all (the silent failures that don't appear in SecurityAlert).

The example organization's rule health breakdown:

Expected Output
HealthStatus  count_
────────────  ──────
Healthy       8
Noisy         5
Dormant       6
Low-volume    4

Eight healthy rules out of 23. The 5 noisy rules generate more than 5 alerts per day each — Tom and Priya auto-close them. The 6 dormant rules haven't fired in 60+ days because thresholds are too high or data sources disconnected. The 4 low-volume rules fire occasionally but haven't been classified. Rule health: 35%. The active rule count says 23. The operational reality is 8.
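Turning the classification counts into the rule health percentage is one more step. The sketch below feeds the counts from the breakdown above into an inline table; the `Rules` column name is an assumption made for the example.

```kql
// Category counts taken from the rule health breakdown.
datatable(HealthStatus: string, Rules: int)
[
    "Healthy",    8,
    "Noisy",      5,
    "Dormant",    6,
    "Low-volume", 4
]
// Healthy rules divided by all active rules.
| summarize HealthyRules = sumif(Rules, HealthStatus == "Healthy"),
            ActiveRules  = sum(Rules)
| extend RuleHealthPct = round(100.0 * HealthyRules / ActiveRules, 1)
```

Eight of 23 works out to 34.8%, which the text rounds to 35%.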

How the four metrics relate

No single metric tells the complete story. Coverage without quality is a false promise: 50% coverage at a 70% FP rate means most of the alerts that coverage produces are noise. Speed without accuracy is alert fatigue: a 2-minute MTTD on rules that fire 200 times a day trains analysts to ignore the queue. Health without coverage is a well-maintained program that misses most attacks.

The four metrics together form a balanced scorecard. Coverage answers "what can we detect?" MTTD answers "how fast?" FP rate answers "can analysts trust it?" Rule health answers "is it sustainable?" A mature detection program improves all four simultaneously. The capstone board report in DE11 presents the four metrics with month-over-month trend lines and the before-and-after comparison.

A new detection rule can affect all four metrics at once: it increases coverage (new technique monitored), may temporarily increase FP rate (the rule hasn't been tuned yet), improves health (one more functioning rule), and establishes an MTTD for a technique that previously had infinite detection time.

Tuning the rule in the following month reduces the FP rate while preserving coverage. The ongoing relationship between the metrics is why the monthly tuning cadence exists — each metric influences the others, and the balance requires continuous attention.
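A small before-and-after comparison makes the interaction concrete. The sketch below starts from this section's scenario numbers and assumes, hypothetically, that one new rule is deployed, that it is healthy, and that it covers a relevant technique no other rule covered. FP rate and MTTD are left out because their change depends on how the new rule behaves once it is live.

```kql
// Scenario baseline versus the state after one new, healthy rule (assumed).
datatable(State: string, CoveredTechniques: int, RelevantTechniques: int,
          HealthyRules: int, ActiveRules: int)
[
    "Before new rule", 15, 145, 8, 23,
    "After new rule",  16, 145, 9, 24
]
| extend CoveragePct   = round(100.0 * CoveredTechniques / RelevantTechniques, 1),
         RuleHealthPct = round(100.0 * HealthyRules / ActiveRules, 1)
```

Under those assumptions coverage moves from 10.3% to 11.0% and rule health from 34.8% to 37.5%. A single rule barely shifts coverage, which is why the metric is tracked as a month-over-month trend.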

The metrics the industry uses as benchmarks

When you present your four numbers to leadership, context helps. Industry benchmarks provide the comparison points that make your numbers meaningful.

Coverage

Most organizations that have calculated their coverage land between 5% and 15%. Organizations with established detection engineering programs typically reach 40-65% of relevant techniques. Above 65% requires specialized data sources and advanced analytics that go beyond scheduled KQL rules.

The gap between 10% and 50% is the work of a focused detection engineering program operating for 6-12 months.

MTTD

The industry distinction is between self-detected incidents and externally reported incidents. Self-detected incidents (the detection rules caught it) have median dwell times measured in days. Externally reported incidents (a third party notified you, or the attacker announced themselves) have median dwell times measured in weeks or months.

Moving techniques from "infinite MTTD" (no rule) to "minutes MTTD" (NRT or 5-minute scheduled rule) is the most impactful improvement a detection program delivers.

FP rate

Untuned detection libraries typically run 40-60% false positive rates. After six months of monthly tuning cadence, well-maintained programs achieve 15-25%. Below 15% may indicate thresholds that are too conservative — verify by checking whether rules are firing at all on techniques that should be common in your environment.

Rule health

Organizations without detection engineering typically show 30-50% rule health — meaning half or more of their active rules are dormant, noisy, or low-volume. Programs with monthly maintenance run above 80%.
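One way to put a baseline next to these benchmarks is a simple gap table. The sketch below uses the scenario baseline from this section and the ranges quoted above for established programs. The column names are assumptions, and MTTD is omitted because its benchmark is qualitative.

```kql
// Baseline versus the benchmark range for established programs.
datatable(Metric: string, Baseline: real, EstablishedLow: real,
          EstablishedHigh: real, HigherIsBetter: bool)
[
    "Coverage %",    10.3, 40.0,  65.0, true,
    "FP rate %",     43.0, 15.0,  25.0, false,
    "Rule health %", 35.0, 80.0, 100.0, true
]
// Distance from the baseline to the nearest edge of the benchmark range.
| extend GapToBenchmark = round(
    iff(HigherIsBetter, EstablishedLow - Baseline, Baseline - EstablishedHigh), 1)
| project Metric, Baseline, EstablishedLow, EstablishedHigh, GapToBenchmark
```

The gap is measured in percentage points, so the three rows are comparable at a glance when deciding where to start.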

Recording your baseline

In the content modules, you'll run KQL queries that calculate each metric for your own Sentinel workspace. Record the four numbers alongside your coverage percentage from Section 3. Together they form your detection baseline — the "before" in the before-and-after comparison the capstone produces.

If any metric is unmeasurable (no data, no classification, no rule health visibility), record that. "Unmeasurable" is a baseline too — it means the detection program has no visibility into its own performance. Establishing that visibility is itself a deliverable of this course. By DE9, every rule you've deployed produces data for all four metrics.

Export your baseline to a timestamped file so you have the "before" snapshot for the capstone board report:

PowerShell
# Save your detection baseline — you'll compare against this in DE11
$baseline = [PSCustomObject]@{
    Date              = (Get-Date -Format "yyyy-MM-dd")
    CoveragePct       = "10.3"    # From your coverage query
    MTTD_Median_Min   = "12"      # From MTTD query
    FP_Rate_Pct       = "43"      # From FP rate query
    RuleHealth_Pct    = "35"      # From rule health query
    ActiveRules       = "23"
    HealthyRules      = "8"
    RelevantTechniques = "145"
}
$baseline | Export-Csv ".\detection-baseline-$(Get-Date -f yyyy-MM-dd).csv" `
    -NoTypeInformation
Write-Host "Baseline saved. These numbers are the 'before' in your board report."

Detection Engineering Principle

No single metric tells the story. Coverage without quality is a false promise: 50% coverage at a 70% false positive rate means most of the alerts that coverage produces are noise. Speed without accuracy is alert fatigue. Health without coverage is a well-maintained program that misses most attacks. The four metrics together form a balanced scorecard.

Next

Section 6 walks through CHAIN-HARVEST, a five-phase AiTM credential phishing attack, against 23 rules. Every phase generates telemetry. No rule fires. You'll see exactly where detection should have fired and what a detection engineer would build instead.
