DE1.11 The Rule Specification Template

2-3 hours · Module 1 · Free
Operational Objective
The Specification Discipline: Every production detection rule in this course starts with a specification document — not a KQL query. The specification defines the hypothesis, ATT&CK mapping, data sources, entity mapping, severity rationale, false positive analysis, response procedure, and tuning plan BEFORE the first line of KQL is written. This discipline prevents the common failure mode of deploying a technically correct query that is operationally broken — wrong severity, missing entities, no FP analysis, no response procedure, no tuning plan. This subsection provides the complete 12-section template and teaches how to fill it out.
Deliverable: The complete rule specification template — the document you will fill out for every production rule in DE3 through DE8.
⏱ Estimated completion: 25 minutes

Why specification before code

The most common detection engineering mistake is writing the KQL first and everything else second. The engineer finds a query that returns interesting results in Advanced Hunting, deploys it as an analytics rule, and discovers weeks later that: the entity mapping was wrong (alerts do not correlate), the severity was arbitrary (Medium was chosen because it “felt right”), no false positive analysis was done (the rule fires 30 times per day on legitimate activity), and no response procedure exists (the SOC analyst who receives the alert does not know what to investigate or what containment to recommend).

The rule specification template prevents this by requiring every design decision before deployment. The specification is the engineering document. The KQL implements the engineering document.

RULE SPECIFICATION TEMPLATE — 12 SECTIONS

 1. Rule Name + ID
 2. Detection Hypothesis
 3. MITRE ATT&CK Mapping
 4. Data Sources + Tables
 5. KQL Query (annotated)
 6. Entity Mapping
 7. Frequency + Lookback
 8. Severity + Rationale
 9. Trigger + Grouping
10. False Positive Analysis
11. Response Procedure
12. Tuning Plan

Orange = DESIGN (before KQL) · Blue = IMPLEMENTATION · Green = CONFIGURATION · Yellow = OPERATIONS

Every rule in DE3–DE8 ships with a completed specification. No exceptions. The spec is the deliverable. The KQL implements the spec.

Figure DE1.11 — The 12-section rule specification template. Four phases: design (sections 1-3), implementation (4-6), configuration (7-9), and operations (10-12). The specification is completed before the KQL is written.

The 12 sections

Section 1 — Rule Name + ID. Naming convention: DE[module]-[technique-short-name]. Example: DE4-AiTM-Token-Anomaly. The name must describe what the rule detects in 3-5 words. Avoid generic names (“Suspicious Activity”) — they provide no triage context.

Section 2 — Detection Hypothesis. A testable statement: “I hypothesize that [technique] produces [observable] in [data source] that can be distinguished from normal activity by [distinguishing characteristic].” Example: “I hypothesize that AiTM credential theft produces a non-interactive token refresh from a different IP than the preceding interactive sign-in, within a 30-minute window, where the device details do not match the user’s baseline.” If you cannot write the hypothesis, you are not ready to write the KQL.

Section 3 — MITRE ATT&CK Mapping. Technique ID at sub-technique level. Tactic. NE attack chain reference (which CHAIN does this rule detect?). Example: T1557 Adversary-in-the-Middle, Credential Access, CHAIN-HARVEST Phase 2.

Section 4 — Data Sources + Tables. Which Sentinel tables does the query need? Are they connected and ingesting? Daily volume estimate. Example: SigninLogs (2.1 GB/day, connected) + AADNonInteractiveUserSignInLogs (4.8 GB/day, connected).
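The volume and connectivity figures Section 4 asks for can be measured rather than guessed. A sketch of a weekly volume check, assuming standard Log Analytics tables (`_BilledSize` is the built-in per-row size column, in bytes):

```kql
// Section 4 sanity check (sketch): confirm the tables are ingesting
// and estimate daily volume from the last 7 days
union withsource=TableName SigninLogs, AADNonInteractiveUserSignInLogs
| where TimeGenerated > ago(7d)
| summarize
    EventsPerDay = count() / 7,
    EstimatedGBPerDay = round(sum(_BilledSize) / 7.0 / 1024 / 1024 / 1024, 2)
    by TableName
```

A table that appears here with zero rows is not ingesting, and the rule's data-source assumption fails before any KQL is written.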

Section 5 — KQL Query. The complete query, annotated line by line. Every where filter, every join, every summarize explained. Not just what it does — why it does it that way instead of an alternative.

Section 6 — Entity Mapping. Account → [field], Host → [field], IP → [field], URL → [field]. Every entity that enables correlation with other rules.

Section 7 — Frequency + Lookback. Frequency selected per DE1.3 (severity-frequency alignment). Lookback configured per DE1.2 (must exceed frequency). Rationale for the chosen values.

Section 8 — Severity + Rationale. Confidence × Impact matrix applied per DE1.5. Explicit rationale: “High severity because detection confidence is high (session anomaly is distinctive) and impact is significant (full account access).”

Section 9 — Trigger + Grouping. Trigger threshold. Event grouping within the alert. Alert-to-incident grouping strategy per DE1.9. Cross-rule correlation enabled?

Section 10 — False Positive Analysis. Expected FP patterns: service accounts, automation, legitimate travel, known applications. Mitigation: watchlist exclusions, additional filters, threshold adjustments. Estimated FP rate before deployment (from historical data testing).

Section 11 — Response Procedure. Step-by-step: what should the SOC analyst do when this alert fires? What to investigate first. What containment to recommend. What escalation path. Automation tier per DE1.10.

Section 12 — Tuning Plan. First review: 14 days after deployment. Monthly cadence: check FP rate, adjust thresholds, update exclusions. Retirement criteria: if the technique is mitigated by a preventive control, the rule may be retired.
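For teams that keep specifications in version control (DE10 treats the specification as a detection-as-code artifact), the 12 sections map naturally onto a YAML skeleton. The field names below are illustrative, not a mandated schema:

```yaml
# Rule specification skeleton — field names are illustrative, not a fixed schema
rule:
  id: DE4-001                       # Section 1
  name: DE4-AiTM-Token-Anomaly
hypothesis: >                       # Section 2
  [technique] produces [observable] in [data source],
  distinguishable from normal activity by [characteristic].
mitre:                              # Section 3
  technique: T1557
  tactic: Credential Access
  chain: CHAIN-HARVEST Phase 2
data_sources:                       # Section 4
  - table: SigninLogs
    daily_volume_gb: 2.1
query_file: DE4-001.kql             # Section 5 — annotated KQL lives beside the spec
entities:                           # Section 6
  Account: UserPrincipalName
  IP: RefreshIP
schedule:                           # Section 7
  frequency: 5m
  lookback: 35m
severity:                           # Section 8
  level: High
  rationale: high confidence x significant impact
trigger:                            # Section 9
  threshold: results > 0
  incident_grouping: by Account entity
false_positives:                    # Section 10
  expected: [vpn-split-tunnel, mobile-background-refresh]
  estimated_rate: "<3% post-tuning"
response:                           # Section 11
  automation_tier: 2
  steps: [verify-ip-baseline, revoke-sessions, check-inbox-rules]
tuning:                             # Section 12
  first_review: 14d
  cadence: monthly
```

Keeping the spec next to the query file makes peer review a diff review: a change to the KQL without a matching change to the spec is visible in the same pull request.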

Worked example: completed specification for NE

The following is a condensed specification for the CHAIN-HARVEST Phase 2 detection (AiTM token theft). This is the format every rule in DE3-DE8 will follow.

1. Rule name: DE4-AiTM-Token-Theft-Session-Anomaly | ID: DE4-001

2. Hypothesis: “AiTM credential theft produces a non-interactive token refresh from a different IP and device than the preceding interactive sign-in, within a 30-minute window, where the non-interactive session originates from an IP not in the user’s 30-day baseline.”

3. MITRE: T1557 Adversary-in-the-Middle | Tactic: Credential Access | Chain: CHAIN-HARVEST Phase 2

4. Data sources: SigninLogs (2.1 GB/day, connected) + AADNonInteractiveUserSignInLogs (4.8 GB/day, connected)

5. KQL: Cross-table join on UserPrincipalName with a 30-minute time window. Filter: interactive sign-in (ResultType == 0) followed by a non-interactive refresh from a different IP where device details mismatch. Exclude: known VPN exit IPs (NE has 3 Prisma Access egress IPs that legitimately differ from the user's direct IP). Annotated query: 28 lines, 6 comments explaining the rationale for each filter.
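A condensed sketch of the correlation described above, reduced to the IP mismatch for brevity (the full rule also compares device details; the `VpnEgressIPs` watchlist name is illustrative):

```kql
// DE4-001 correlation sketch: interactive sign-in followed by a
// non-interactive token refresh from a different IP within 30 minutes
let window = 30m;
SigninLogs
| where TimeGenerated > ago(35m)          // lookback = window + 5-min buffer
| where ResultType == 0                   // successful interactive sign-in
| project UserPrincipalName, InteractiveTime = TimeGenerated, InteractiveIP = IPAddress
| join kind=inner (
    AADNonInteractiveUserSignInLogs
    | where TimeGenerated > ago(35m)
    | project UserPrincipalName, RefreshTime = TimeGenerated, RefreshIP = IPAddress
) on UserPrincipalName
| where RefreshTime between (InteractiveTime .. (InteractiveTime + window))
| where RefreshIP != InteractiveIP
// Exclude known VPN egress IPs so split-tunnel refreshes do not fire (Section 10)
| where RefreshIP !in ((_GetWatchlist('VpnEgressIPs') | project SearchKey))
```

Note how the spec's sections surface directly in the query: the lookback from Section 7, the VPN exclusion from Section 10, and the projected columns feeding the entity mapping in Section 6.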

6. Entity mapping: Account → UserPrincipalName | IP → RefreshIP (attacker’s IP) | Host → InteractiveDevice (victim’s device)

7. Frequency: 5 minutes | Lookback: 35 minutes (30-min correlation window + 5-min buffer) | Rationale: High severity — AiTM leads to full account compromise within minutes. 5-min frequency provides ~5.5-min average detection latency.
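The ~5.5-minute figure is average scheduling delay (half the 5-minute frequency, 2.5 min) plus typical ingestion lag (~3 min). The lag half of that estimate can be measured rather than assumed, using the built-in `ingestion_time()` function:

```kql
// Latency check (sketch): measure ingestion lag for the source table
// to validate the ~5.5-min average detection latency estimate
SigninLogs
| where TimeGenerated > ago(1d)
| extend IngestionLag = ingestion_time() - TimeGenerated
| summarize AvgLag = avg(IngestionLag), P95Lag = percentile(IngestionLag, 95)
```

If P95 lag exceeds the 5-minute buffer in the lookback, events can arrive after the window that should have caught them, so the buffer in Section 7 should be widened.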

8. Severity: High | Confidence: High (IP + device mismatch on token refresh is a distinctive AiTM signature) | Impact: Significant (full mailbox and app access) | Rationale: Not Critical because legitimate VPN split-tunnel scenarios can produce similar IP mismatches at low frequency — estimated 5-10% FP rate before tuning. After tuning with VPN egress exclusion, estimated <3% FP.

9. Trigger: Results > 0 (threshold in KQL) | Event grouping: All events in single alert | Incident grouping: By Account entity (one incident per compromised user) | Cross-rule correlation: Enabled (links to password spray, inbox rule, and BEC detection rules via shared Account and IP entities)

10. FP analysis: Expected FPs: (1) VPN split-tunnel — user’s interactive sign-in from office IP, token refresh from Prisma Access egress IP. Mitigation: exclude NE’s 3 Prisma egress IPs from the “different IP” filter. (2) Mobile app background refresh — user signs in on laptop, phone refreshes from cellular IP. Mitigation: exclude mobile UserAgent patterns. (3) Corporate proxy — token refresh routed through a different egress. Mitigation: add corporate egress IPs to exclusion. Estimated FP rate: <3% post-tuning (tested against 30 days of historical data: 2 FPs from VPN split-tunnel, both eliminated by egress IP exclusion).

11. Response procedure: Step 1: Verify the RefreshIP is not in the user’s known IP history (check SigninLogs for 30-day IP baseline). Step 2: If IP is unknown — revoke all sessions for the user (Entra ID → Users → Revoke sessions). Step 3: Check OfficeActivity for inbox rule creation within 60 minutes of the token refresh. Step 4: Check EmailEvents for outbound email from the user within 60 minutes. Step 5: If inbox rule or suspicious email found — escalate to Incident Commander and initiate CHAIN-HARVEST containment playbook. Automation tier: Tier 2 (enrich with user department, recent risk history, device compliance; assign to on-call analyst). Not Tier 1 — the 3% FP rate means automated session revocation would lock out approximately 1 legitimate user per month.
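Step 1 of the response procedure is itself a query the analyst can run directly. A sketch of the 30-day IP baseline check (the UPN is a placeholder for the alerted account):

```kql
// Response step 1 (sketch): build the user's 30-day sign-in IP baseline
SigninLogs
| where TimeGenerated > ago(30d)
| where UserPrincipalName =~ "alerted.user@example.com"   // substitute the alerted account
| summarize SignInCount = count(), FirstSeen = min(TimeGenerated), LastSeen = max(TimeGenerated) by IPAddress
| order by SignInCount desc
// A RefreshIP absent from this list supports the compromise hypothesis → proceed to step 2
```

Embedding queries like this in Section 11 (or linking them from the alert description) removes the triage guesswork that undocumented rules impose on analysts.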

12. Tuning plan: First review: 14 days. Check: FP rate, alert volume, VPN egress exclusion effectiveness. Monthly: review for new VPN egress IPs (Prisma Access IP changes during maintenance windows). Quarterly: assess whether phishing-resistant MFA deployment has reduced AiTM attack surface (if NE deploys FIDO2 keys, AiTM becomes less viable and the rule may be tuned to Informational or retired).

⚠ Compliance Myth: "Documentation slows down detection engineering — just deploy the rule"

The myth: Writing a specification for every rule is bureaucratic overhead that reduces velocity.

The reality: The specification takes 15-20 minutes per rule. Troubleshooting an undocumented rule that generates FPs, has wrong entity mapping, or lacks a response procedure takes 1-2 hours per incident. Over 30 days, an undocumented rule generating 5 FPs per day produces 150 noise alerts; at roughly an hour of triage each, that is 150 hours of analyst time investigating alerts they cannot act on. The 15-minute specification prevents that cost. Additionally, the specification enables peer review (catching design errors before deployment), onboarding (new team members understand what the rule detects without reading the KQL), and compliance evidence (ISO 27001 A.8.16 requires documented monitoring procedures — the specification IS the documentation).

Try it yourself

Exercise: Write a specification for an existing rule

Choose one of your existing analytics rules — preferably one that generates false positives or one that the SOC analysts do not fully understand. Fill out the 12-section template for that rule. Every section you cannot fill in marks a place where the rule is under-engineered. The FP analysis you cannot write because you have never tracked the FP rate? That is why the rule generates noise. The response procedure you cannot write because no one documented the triage steps? That is why analysts close the alert without investigating.

Check your understanding

An engineer deploys a rule that detects "suspicious PowerShell execution" without writing a specification. After 2 weeks, the rule generates 40 alerts per day (most are FPs from IT automation scripts), has no ATT&CK mapping, no entity mapping, and the SOC analysts do not know whether to investigate or ignore the alerts. Which specification sections would have prevented these problems?

Answer: Section 10 (False Positive Analysis): estimating the FP rate before deployment would have identified the IT automation pattern and added a service account exclusion. Section 3 (ATT&CK Mapping): mapping to T1059.001 (PowerShell) would have included the rule in coverage reporting. Section 6 (Entity Mapping): mapping Account and Host entities would have enabled incident correlation. Section 11 (Response Procedure): documenting the triage steps would have given analysts clear guidance. All four problems are preventable by completing the specification before deployment — 15 minutes of design preventing 2 weeks of operational pain.

Troubleshooting: Specification adoption

“Our team resists documentation.” Start with sections 2, 5, and 10 only: hypothesis, KQL, and FP analysis. These three sections prevent the most common failures (building the wrong thing, deploying without testing, generating unmanageable noise). Add the remaining sections as the team matures. Partial specification is better than no specification.

“The specification feels redundant with the KQL.” The specification captures design decisions that the KQL does not express. Why is the severity High and not Medium? What FP patterns were considered during design? What should the analyst investigate first? The KQL answers “what does this query do?” The specification answers “why does this rule exist, and how should it be operated?”


References used in this subsection

  • Course cross-references: DE1.2-DE1.10 (each section references a specific module subsection), DE3-DE8 (every production rule includes a completed specification), DE10 (specification as documentation standard for detection-as-code)

Validating the specification against the deployed rule

```kql
// SPECIFICATION AUDIT: Does the deployed rule match the documented specification?
// Compare the rule's actual behavior with its YAML specification
_GetWatchlist('RuleSpecifications')
| project RuleName, DocumentedSeverity, DocumentedFrequency, DocumentedThreshold
| join kind=leftouter (
    SecurityAlert
    | where TimeGenerated > ago(30d)
    | summarize ActualAlerts = count(), ActualSeverity = take_any(AlertSeverity) by AlertName
) on $left.RuleName == $right.AlertName       // SecurityAlert's AlertName is the rule display name
| extend SeverityDrift = DocumentedSeverity != ActualSeverity
// Rows with SeverityDrift == true, or with no matching alerts at all, indicate specification drift
```

You're reading the free modules of Detection Engineering
