SA0.4 The Confidence Threshold Problem

5 hours · Module 0 · Free
CONFIDENCE THRESHOLD → AUTOMATION ACTION MAPPING

95-100% confidence → Auto-contain (session revoke + MFA reset) + auto-collect + auto-notify. Example: AiTM (MFA claim).
80-95% confidence → Auto-contain with approval gate + auto-collect + auto-notify. Example: ransomware pre-encryption.
60-80% confidence → Auto-enrich + auto-collect + auto-notify. No containment; the analyst decides. Example: consent phishing.
Below 60% confidence → Auto-enrich only. No collection, no notification, no containment; queue for analyst. Example: impossible travel.

The threshold is not a guess. It is measured: deploy the detection, track TP/FP for 30+ days, then calculate.
Example: an AiTM detection fires 35 times in 30 days with 34 TP and 1 FP. Confidence: 97% → safe for auto-containment.

Figure SA0.4 — Confidence threshold mapping. Higher detection confidence enables more aggressive automation. Thresholds are measured empirically, not estimated.

Operational Objective
The single most important number in security automation is the detection confidence level. It determines whether an alert triggers auto-containment (stop the attacker in 30 seconds), auto-enrichment (present context for a human decision in 5 minutes), or nothing (queue the alert for eventual manual review). Set the threshold too low and automation disrupts legitimate users. Set it too high and the automation never fires, providing no value. This sub-module teaches how to measure detection confidence empirically and map it to the appropriate automation tier.
Deliverable: The Confidence Threshold Measurement Methodology — a repeatable process for calculating detection accuracy and mapping it to automation tiers for any alert type in your environment.
⏱ Estimated completion: 30 minutes

Why confidence matters more than the playbook

A perfectly constructed playbook — elegant Logic App design, comprehensive error handling, graceful rollback — deployed against a detection with 50% false positive rate will disable legitimate user accounts half the time it fires. The playbook is not the problem. The detection feeding the playbook is the problem.

Security automation has two components: the trigger (what fires the automation) and the action (what the automation does). Most automation training focuses exclusively on the action — how to build the Logic App, how to call the Graph API, how to format the Teams adaptive card. This course starts with the trigger because the trigger determines whether the action is appropriate.

The trigger is an analytics rule. The analytics rule runs a KQL query on a schedule. When the query returns results, the rule creates an alert, the alert becomes an incident, and the incident triggers the automation. Every step in that chain is deterministic except one: the KQL query’s ability to distinguish a true attack from normal activity. That ability — the detection’s true positive rate — is the confidence level.

Measuring detection confidence

Detection confidence is not a property of the detection rule. It is a measured outcome that depends on the rule AND the environment. A detection for “sign-in from Tor exit node” might have 99% confidence in an environment where no employee uses Tor and 30% confidence in a privacy-focused organization where the legal team uses Tor Browser for research.

The measurement process is straightforward:

Step 1: Deploy the detection rule without automation. Configure the analytics rule in Sentinel. Let it fire alerts. Do not attach any playbook or automation rule. This is the measurement phase.

Step 2: Track every alert for 30 days. For each alert, the analyst classifies it as True Positive (TP), False Positive (FP), or Benign True Positive (BTP — the activity is real but not malicious, like a legitimate user accessing a sensitive file they are authorized to access).

Step 3: Calculate the confidence level. Confidence = TP / (TP + FP + BTP) × 100. Benign true positives count against confidence because auto-containment would disrupt a legitimate user for a legitimate action.

Step 4: Map to automation tier. Apply the threshold bands:

95-100%: Auto-contain. The false positive rate is low enough that the rare disruption to a legitimate user is acceptable — and the rollback playbook restores their access quickly.

80-95%: Auto-contain with approval gate. The confidence is high but not high enough for fully autonomous action. The playbook presents the enrichment data and a containment recommendation to a human via Teams adaptive card. The human approves with one click. This reduces containment time from 45 minutes (full manual triage) to 2 minutes (read card, approve).

60-80%: Auto-enrich and notify. The detection catches real attacks but also flags too many legitimate activities for auto-containment. The playbook enriches the alert and presents it to the analyst with all context pre-loaded. The analyst makes the judgment call.

Below 60%: Auto-enrich only. The detection is too noisy for any automated action beyond basic context addition. Consider tuning the detection rule to increase confidence before attaching automation.
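The four measurement steps above condense into a few lines of code. This is a sketch, not a production implementation: the function names are illustrative, and the band boundaries follow the mapping in this sub-module.

```python
def confidence(tp: int, fp: int, btp: int) -> float:
    """Step 3: Confidence = TP / (TP + FP + BTP) * 100."""
    return round(tp / (tp + fp + btp) * 100, 1)

def automation_tier(conf: float) -> str:
    """Step 4: map a measured confidence level to a threshold band."""
    if conf >= 95:
        return "auto-contain"
    if conf >= 80:
        return "auto-contain with approval gate"
    if conf >= 60:
        return "auto-enrich and notify"
    return "auto-enrich only"

# Figure SA0.4's example: 35 fires, 34 TP, 1 FP.
c = confidence(34, 1, 0)
print(c, automation_tier(c))  # 97.1 auto-contain
```

The same helper reproduces the other bands: a 70%-confidence detection maps to "auto-enrich and notify", and anything under 60% to enrichment only.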

The cost of wrong thresholds

Setting thresholds incorrectly has predictable consequences in both directions.

Threshold too low — automation triggers on false positives. An endpoint detection for “suspicious PowerShell execution” has 55% confidence — nearly half of its fires are legitimate IT automation scripts. You attach auto-isolation. In the first week, the playbook isolates 12 workstations. Seven are legitimate: the IT team’s SCCM deployment script, a developer’s build pipeline, and five machines running a legitimate monitoring agent. Each isolation generates a helpdesk ticket, a 15-minute re-onboarding process, and a complaint to the CISO. The automation is disabled by Friday.

Threshold too high — automation never fires. You set the containment threshold at 99.5% because you are afraid of the scenario above. Your AiTM detection has 97% confidence — genuinely high, with only 1 false positive in 35 fires over 30 days. But 97% is below your 99.5% threshold, so the AiTM detection gets enrichment only, no auto-containment. Every AiTM incident still requires manual containment. The attacker operates for 45 minutes while the analyst picks up the alert, triages manually, and executes containment. The automation exists but provides no containment value.

The correct approach: tiered thresholds per alert type. Not all alert types need the same threshold. AiTM with MFA-claim analysis is a high-fidelity detection — 95% is appropriate for auto-containment. “Suspicious process creation” is a broad detection — 95% may never be achievable, and enrichment-only automation is appropriate permanently. The threshold is set per detection, not globally.

⚠ Compliance Myth: "We need 100% detection accuracy before we can automate any containment action"

The myth: Auto-containment requires zero false positives. Until a detection rule achieves 100% accuracy, containment must remain manual.

The reality: 100% accuracy is unachievable for most detection types. Even the highest-fidelity detections (MFA-claim-in-token for AiTM) will occasionally fire on edge cases — a misconfigured application using an old token format, a developer testing authentication flows in a staging environment. Waiting for 100% means waiting forever.

The correct standard is not zero false positives — it is acceptable blast radius. If the detection has 97% confidence, 3 in 100 fires will affect a legitimate user. The question is: how quickly can you restore their access? If the rollback playbook re-enables the account and notifies the user within 5 minutes, the impact of a false positive is a 5-minute disruption. Compare this to the alternative: a real AiTM goes uncontained for 45 minutes while the analyst triages manually. The attacker uses that 45 minutes to register MFA persistence, create inbox rules, add mailbox delegates, and access sensitive data. The cost of the 3% false positive rate is dramatically lower than the cost of the 100% manual response delay.

Measuring confidence in practice — NE’s AiTM detection

Walk through the measurement process using Northgate Engineering’s AiTM detection as the example.

The analytics rule queries SigninLogs for sign-ins where the MFA claim was satisfied by a claim in the token (not by an interactive MFA challenge) from a new IP address that does not match the user’s historical sign-in pattern. This is the hallmark of AiTM — the attacker’s proxy forwards the legitimate MFA challenge to the real user, captures the authenticated session, and replays it from a different IP.

The rule runs every 5 minutes with a 1-hour lookback. In the first 30 days:

Week 1: 8 fires. 7 TP (actual AiTM phishing campaign targeting NE). 1 FP (a user connecting through a new corporate VPN exit node that NE’s Prisma Access configuration routed differently).

Week 2: 5 fires. 5 TP (continuation of the phishing campaign, different users).

Week 3: 12 fires. 11 TP (second wave of phishing). 1 BTP (a user’s session token was refreshed by Azure AD Connect’s seamless SSO mechanism, which presents the MFA claim differently).

Week 4: 10 fires. 10 TP (third wave).

30-day totals: 35 fires. 33 TP. 1 FP. 1 BTP. Confidence = 33/35 = 94.3%.
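The 30-day totals can be checked with a few lines. The per-week tuples come straight from the walkthrough above; this is a verification sketch, not part of the methodology itself:

```python
# Per-week (fires, TP, FP, BTP) from NE's 30-day measurement window.
weeks = [
    (8, 7, 1, 0),    # week 1: AiTM campaign + VPN exit node FP
    (5, 5, 0, 0),    # week 2: campaign continues
    (12, 11, 0, 1),  # week 3: second wave + seamless SSO BTP
    (10, 10, 0, 0),  # week 4: third wave
]

fires = sum(w[0] for w in weeks)
tp = sum(w[1] for w in weeks)
fp = sum(w[2] for w in weeks)
btp = sum(w[3] for w in weeks)
conf = round(tp / fires * 100, 1)

print(fires, tp, fp, btp, conf)  # 35 33 1 1 94.3
```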

At 94.3%, this detection sits in the 80-95% band — auto-contain with approval gate. The SOC lead reviews the two non-TP fires: the VPN exit node case is addressable by adding NE’s VPN exit node IPs to a watchlist exclusion in the KQL query. The seamless SSO case is addressable by filtering for AuthenticationProcessingDetails that indicate SSO refresh versus interactive authentication.

After applying both tuning adjustments, the rule runs for another 15 days. Results: 12 fires, 12 TP. Confidence after tuning: 100% over the measurement window.

The detection moves from the 80-95% band to the 95-100% band. Auto-containment is enabled: session revocation + MFA reset, no approval gate required. The VIP watchlist check remains as a safeguard (VIP accounts route to approval instead of auto-execution).

This entire process — initial deployment, 30-day measurement, tuning, remeasurement, threshold promotion — is the standard methodology for every detection that will power containment automation. It takes 45 days of elapsed time but minimal analyst effort (tracking TP/FP takes 30 seconds per alert).

Decision point: Your “suspicious inbox rule creation” detection has 85% confidence over 30 days — 17 TP, 3 FP. The 3 FPs were users creating rules that forward email to personal addresses (legitimate but against policy). You want to attach auto-remediation: automatically remove the inbox rule and notify the user. The question is whether “remove the inbox rule” is Tier 3 containment (requires 95%+ confidence) or something less impactful. Removing a legitimate user’s inbox rule is disruptive (their email workflow breaks) but not as impactful as disabling their account or isolating their device. A reasonable approach: auto-remove inbox rules that match the malicious pattern exactly (forward to external + mark as read + financial keywords) and present other rules for analyst review. Pattern-matched removal is higher confidence than the overall detection confidence because the pattern filter eliminates the 3 FPs (personal forwarding does not include “mark as read” + financial keywords).
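The pattern-matched removal idea can be sketched as a filter over inbox-rule properties. The field names and keyword list below are hypothetical illustrations, not the actual Exchange or Graph API schema; the point is that requiring all three indicators excludes the benign personal-forwarding FPs:

```python
FINANCIAL_KEYWORDS = {"invoice", "payment", "wire"}  # hypothetical keyword list

def matches_malicious_pattern(rule: dict) -> bool:
    """Auto-remove only when all three indicators are present:
    external forwarding + mark-as-read + financial keyword filter."""
    forwards_external = rule.get("forward_to_external", False)
    marks_read = rule.get("mark_as_read", False)
    keywords = {k.lower() for k in rule.get("subject_keywords", [])}
    return forwards_external and marks_read and bool(keywords & FINANCIAL_KEYWORDS)

# Policy-violating but benign personal forwarding: NOT auto-removed.
benign = {"forward_to_external": True, "mark_as_read": False, "subject_keywords": []}
# Full malicious pattern: auto-removed, analyst notified.
malicious = {"forward_to_external": True, "mark_as_read": True,
             "subject_keywords": ["Invoice"]}

print(matches_malicious_pattern(benign), matches_malicious_pattern(malicious))  # False True
```

Rules that forward externally but fail the full pattern fall through to analyst review, which is what lifts the effective confidence of the automated action above the detection's overall 85%.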

Try it: Calculate confidence for your top 3 detections

If you have access to Sentinel:

SecurityIncident
| where TimeGenerated > ago(30d)
| where Title has "AiTM" or Title has "impossible travel" or Title has "inbox rule"
| summarize
    Total = count(),
    TP = countif(Classification == "TruePositive"),
    FP = countif(Classification == "FalsePositive"),
    BTP = countif(Classification == "BenignPositive")
  by Title
| extend Confidence = round(todouble(TP) / todouble(Total) * 100, 1)
| project Title, Total, TP, FP, BTP, Confidence
| order by Total desc

If you do not have Sentinel, use NE’s numbers from the worked example and calculate: what would the confidence be if 2 of the 33 TPs were reclassified as BTPs? How does that change the automation tier recommendation?

Your "consent phishing" detection fires 40 times in 30 days. Classification: 28 TP, 8 FP, 4 BTP. What is the confidence level, and what automation tier is appropriate?
"70% confidence (28/40). Auto-contain." — Incorrect. 70% is not high enough for auto-containment. At 70%, 12 out of 40 fires disrupt legitimate users, roughly one disruption every 2.5 days. Unacceptable for containment.

"70% confidence (28/40). Auto-enrich + auto-notify, no containment." — Correct. At 70%, this detection is in the 60-80% band. The playbook should enrich the alert with publisher verification status, permission scope, and consenting-user details, then present the enrichment to the analyst. The analyst decides whether to revoke the consent.

"93% confidence (28/30, excluding BTP)." — Incorrect. BTPs cannot be excluded from the calculation. A benign true positive still disrupts a legitimate user if auto-containment fires: revoking a legitimate OAuth consent breaks the user's workflow. BTPs count against confidence for containment decisions.

"100% confidence for the 28 TPs. Automation should target only the TPs." — Incorrect. You cannot selectively fire automation on TPs; the automation fires on every alert from the detection rule. The confidence represents the probability that the NEXT fire is a TP. At 70%, there is a 30% chance the next fire is not a true attack.

Where this goes deeper. SA9 (KQL-Driven Automation) teaches advanced KQL tuning techniques that increase detection confidence: correlation rules that combine multiple signals for higher confidence, watchlist-based exception handling, and alert suppression for known false positive patterns. SA5-SA7 implement the confidence thresholds in production containment playbooks with real Logic App conditions that check confidence before executing containment actions.

You're reading the free modules of this course

The full course continues with advanced topics, production detection rules, worked investigation scenarios, and deployable artifacts. Premium subscribers get access to all courses.
