10.11 Detection Engineering Lifecycle

16-20 hours · Module 10


Introduction

Required roles: Microsoft Sentinel Contributor for analytics rules; Microsoft Sentinel Responder for incident management.

A static rule library degrades over time. Attackers change techniques. New data sources are connected. Applications are deployed and retired. Analysts discover false positive patterns. Without a continuous improvement cycle, your detection coverage develops blind spots that attackers exploit.

Detection engineering is the discipline of systematically building, testing, deploying, tuning, and retiring analytics rules to maintain and improve detection coverage against the threats that matter to your environment.


The detection engineering cycle

Phase 1: Threat modelling. Identify the threats most likely to target your environment. For an M365-heavy organisation: credential phishing (AiTM, token theft), BEC (mailbox compromise, invoice fraud), insider threat (data exfiltration by departing employees), and ransomware (endpoint compromise → lateral movement → encryption). Map each threat to MITRE ATT&CK techniques.

Phase 2: Coverage analysis. Compare your current analytics rule library against the MITRE ATT&CK matrix. Which techniques have detection rules? Which do not?

// Extract MITRE ATT&CK coverage from active analytics rules
// (Query the analytics rule API or parse from SentinelHealth)
// Conceptual approach:
// For each active rule, extract the Tactics and Techniques fields
// Map to ATT&CK matrix
// Identify techniques with zero coverage
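The conceptual approach sketched in the comments above can be made concrete with a short script over exported rule definitions. This is a minimal illustration only — the `enabled` and `techniques` field names are assumptions for the sketch, not the exact analytics rule API schema:

```python
# Sketch: compute ATT&CK coverage gaps from exported rule definitions.
# Field names ("enabled", "techniques") are illustrative, not the exact API schema.

def coverage_gaps(rules, relevant_techniques):
    """Return the technique IDs you care about that no active rule covers."""
    covered = set()
    for rule in rules:
        if rule.get("enabled"):
            covered.update(rule.get("techniques", []))
    return sorted(set(relevant_techniques) - covered)

rules = [
    {"name": "Brute force success",    "enabled": True,  "techniques": ["T1110"]},
    {"name": "Inbox rule forwarding",  "enabled": False, "techniques": ["T1114"]},
]
print(coverage_gaps(rules, ["T1110", "T1114", "T1557"]))  # ['T1114', 'T1557']
```

Note that a disabled rule counts as a gap: a rule that exists but does not run provides no coverage.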

Microsoft provides the MITRE ATT&CK blade in Sentinel (Sentinel → MITRE ATT&CK) that visualises your coverage automatically — showing which techniques have active analytics rules and which are uncovered.

Phase 3: Gap-driven rule development. For each uncovered technique identified in Phase 2, decide: is this technique relevant to my environment? If yes, develop a detection rule. If no (e.g., your environment has no macOS devices, so macOS-specific techniques are irrelevant), document the rationale and move on.

Rule development process:

Step 1: Research the technique. What does it look like in the data? Which tables contain evidence of this technique? What is the normal vs malicious pattern?

Step 2: Write the KQL query. Start broad (find all instances of the technique), then narrow (add filters to exclude known-benign patterns).

Step 3: Test against historical data. Run the query against 30 days of data. How many results? Are they true positives, false positives, or ambiguous?

Step 4: Tune the threshold. Adjust until the false positive rate is manageable (target: fewer than 5 false positives per day from a single rule).

Step 5: Configure entity mapping, custom details, alert grouping, and MITRE mapping.

Step 6: Deploy in simulation mode (disabled, manual query execution) for 1 week. Verify results.

Step 7: Enable the rule. Monitor for the first 48 hours.
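Steps 3 and 4 can be made less ad hoc with a small helper: given the per-entity result counts from the historical test, sweep candidate thresholds and pick the lowest one that keeps alert volume under the target. A minimal sketch — the counts below are invented sample data from a hypothetical 3-day historical test, not real telemetry:

```python
# Sketch: pick the lowest per-entity threshold that keeps alert volume under a
# target (e.g. fewer than 5 alerts/day). Sample data is invented for illustration.

def pick_threshold(daily_entity_counts, max_alerts_per_day, candidates):
    """Return the lowest candidate threshold meeting the volume target, or None."""
    days = len(daily_entity_counts)
    for t in sorted(candidates):
        # One alert per entity whose count reaches the threshold that day
        alerts = sum(1 for day in daily_entity_counts for c in day if c >= t)
        if alerts / days < max_alerts_per_day:
            return t
    return None

history = [
    [3, 8, 21, 40],   # day 1: per-entity failed sign-in counts
    [5, 19, 22],      # day 2
    [2, 4, 35, 60],   # day 3
]
print(pick_threshold(history, max_alerts_per_day=2, candidates=[10, 20, 30, 50]))  # 20
```

The lowest threshold that meets the volume target preserves the most detection sensitivity — raising it further than necessary trades away true positives for no operational benefit.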

Phase 4: Operational tuning. Once deployed, rules require ongoing tuning.

Monitor the false positive rate. If a rule generates more than 30% false positives consistently, it needs tuning.

// Rule false positive rate: identify rules that need tuning
SecurityIncident
| where TimeGenerated > ago(30d)
| where Status == "Closed"
| extend RuleName = Title  // incident title generally matches the triggering analytics rule name
| summarize
    Total = count(),
    FalsePositives = countif(Classification == "FalsePositive"),
    TruePositives = countif(Classification == "TruePositive")
    by RuleName
| extend FPRate = round(FalsePositives * 100.0 / Total, 1)
| where Total > 5  // Only rules with sufficient sample size
| order by FPRate desc

Rules with FPRate > 40% need immediate attention. Options: add exclusions to the KQL (exclude known-benign patterns), raise the threshold, narrow the entity scope, or add a suppression automation rule while you investigate.

Phase 5: Retirement. Rules that are no longer relevant should be disabled or deleted. A rule detecting a vulnerability that has been patched across the environment. A rule querying a table from a connector that was decommissioned. A rule superseded by a newer, more accurate rule.


Rule quality scoring

Rate every active rule on a quality scale to prioritise tuning efforts.

Quality dimensions:

Detection fidelity — true positive rate over the last 30 days. Above 70% = Good. 40-70% = Needs tuning. Below 40% = Critical.

Coverage breadth — how many MITRE ATT&CK techniques does the rule cover? Rules covering multiple techniques provide more value per rule.

Investigation acceleration — does the rule include entity mappings, custom details, and dynamic titles? A well-enriched rule enables 30-second triage. A bare rule requires 10+ minutes of manual query work.

Operational cost — how often does the rule fire? A rule generating 50 incidents per day (even if all true positives) overwhelms the queue. Consider raising the threshold or aggregating.
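For triage outside the portal, the four dimensions can be folded into a single label. The bands and demotion rules below are illustrative assumptions (fidelity bands aligned with the KQL scoring query in this subsection), not a Microsoft-defined formula:

```python
# Sketch: combine quality dimensions into one label. Bands and demotions are
# assumed policy for illustration, not an official scoring scheme.

def quality_label(tp_rate, incidents_per_day, has_entity_mapping):
    """Fidelity band from TP rate, demoted one step for unenriched or noisy rules."""
    bands = ["Critical", "Needs Tuning", "Good", "Excellent"]
    if tp_rate > 80:
        level = 3
    elif tp_rate > 60:
        level = 2
    elif tp_rate > 40:
        level = 1
    else:
        level = 0
    if not has_entity_mapping:          # investigation acceleration dimension
        level = max(level - 1, 0)
    if incidents_per_day > 50:          # operational cost: queue-flooding rule
        level = max(level - 1, 0)
    return bands[level]

print(quality_label(tp_rate=85, incidents_per_day=3, has_entity_mapping=True))    # Excellent
print(quality_label(tp_rate=85, incidents_per_day=120, has_entity_mapping=True))  # Good
```

The demotion design captures the point made above: even a perfectly accurate rule loses value if it floods the queue or forces manual enrichment on every incident.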

// Rule quality assessment: comprehensive
SecurityIncident
| where TimeGenerated > ago(30d)
| where Status == "Closed"
| extend RuleName = Title  // incident title generally matches the triggering analytics rule name
| summarize
    Total = count(),
    TruePositives = countif(Classification == "TruePositive"),
    FalsePositives = countif(Classification == "FalsePositive"),
    AvgMTTR_hours = avg(datetime_diff('hour', ClosedTime, CreatedTime))
    by RuleName
| extend TPRate = round(TruePositives * 100.0 / Total, 1)
| extend QualityScore = case(
    TPRate > 80 and Total < 100, "Excellent",
    TPRate > 60 and Total < 200, "Good",
    TPRate > 40, "Needs Tuning",
    "Critical — Review Immediately")
| where Total > 5
| order by TPRate asc

Rules with “Critical” quality should be tuned within 1 week. Rules with “Needs Tuning” should be scheduled for the next monthly detection review.


The tuning playbook: systematic false positive reduction

When a rule needs tuning, follow this systematic approach.

Step 1: Sample the false positives. Pull 10 recent false positive incidents from the rule. For each, document: what entity triggered it, what activity occurred, and why the analyst classified it as FP.

Step 2: Identify the pattern. Do the false positives share a common characteristic? Typical patterns: specific user account (service account, automated process), specific IP range (internal infrastructure, known partner), specific application (backup tool, monitoring agent), specific time window (maintenance window, batch job).

Step 3: Write the exclusion. Add a where not(...) clause to the KQL that excludes the identified pattern. Be precise — exclude only the specific benign pattern, not broad categories.

// Before tuning: fires on service account logons
SigninLogs
| where ResultType != "0"
| summarize FailCount = count() by IPAddress
| where FailCount > 20

// After tuning: excludes known service account IPs
SigninLogs
| where ResultType != "0"
| where IPAddress !in ("10.0.1.50", "10.0.1.51")  // Monitoring servers
| where UserPrincipalName !startswith "svc-"  // Service accounts
| summarize FailCount = count() by IPAddress
| where FailCount > 20

Step 4: Test the exclusion. Run the modified query against 30 days of historical data. Verify: the false positive pattern no longer triggers. True positives are still detected (the exclusion did not remove real threats).

Step 5: Deploy and monitor. Update the rule. Monitor for 1 week. If the FP rate drops to acceptable levels, the tuning is complete. If new FP patterns emerge, repeat from Step 1.

Step 6: Document the exclusion. In the rule’s description or in the Git repository, document every exclusion: what was excluded, why, when, and by whom. This prevents future analysts from removing necessary exclusions during rule reviews.


Threat-informed detection development

Build detection rules based on specific threat intelligence — not just generic MITRE technique coverage.

Industry threat reports. Read threat intelligence reports from Microsoft, CrowdStrike, Mandiant, and industry ISACs. Identify the TTPs (tactics, techniques, and procedures) used by threat actors targeting your industry. Build rules that detect those specific procedures — not just the generic technique.

Example: AiTM phishing detection. The generic MITRE technique is T1557 (Adversary-in-the-Middle). The specific procedure used by Storm-1167 (a known AiTM group) involves: sign-in from a non-corporate IP → token replay from a different IP within 30 minutes → inbox rule creation within 1 hour → mail forwarding to external address. Build a rule that detects this specific sequence:

// Threat-informed detection: AiTM phishing sequence (Storm-1167 pattern)
let SuspiciousSignins = SigninLogs
| where TimeGenerated > ago(2h)
| where RiskLevelDuringSignIn in ("medium", "high")
| where IPAddress !in (
    _GetWatchlist('CorporateIPs') | project SearchKey)
| project SigninTime = TimeGenerated, UserPrincipalName, SigninIP = IPAddress;
let InboxRules = CloudAppEvents
| where TimeGenerated > ago(2h)
| where ActionType == "New-InboxRule"
| extend RuleCreator = tostring(parse_json(RawEventData).UserId)
| extend RuleIP = tostring(parse_json(RawEventData).ClientIP)
| project RuleTime = TimeGenerated, RuleCreator, RuleIP;
SuspiciousSignins
| join kind=inner InboxRules
    on $left.UserPrincipalName == $right.RuleCreator
| where RuleTime > SigninTime
| where datetime_diff('minute', RuleTime, SigninTime) < 120
| project UserPrincipalName, SigninTime, SigninIP, RuleTime, RuleIP

This rule does not just detect “suspicious sign-in” or “inbox rule creation” — it detects the specific AiTM attack chain where one follows the other. This drastically reduces false positives compared to either detection alone.


Detection-as-code: analytics rules in Git

Store analytics rules as ARM/Bicep templates or YAML definitions in a Git repository alongside your DCRs (Module 7.12 governance framework). This enables:

Version control. Every rule change is tracked. If a tuning change introduces a false negative, revert via Git.

Pull request review. Rule changes are reviewed by a second analyst before deployment. Catches errors in KQL logic, missing entity mappings, and inappropriate thresholds.

CI/CD deployment. Deploy rules automatically from Git to Sentinel via Azure DevOps or GitHub Actions. The pipeline validates KQL syntax, checks for entity mapping, verifies MITRE mapping, and deploys to the workspace.

Multi-workspace consistency. If you manage multiple Sentinel workspaces (different business units, different environments), deploy the same rule library from Git to all workspaces — ensuring consistent detection coverage.


MITRE ATT&CK coverage dashboard

Build a workbook tile that shows your detection coverage against the ATT&CK matrix.

The Sentinel → MITRE ATT&CK blade provides a built-in coverage visualisation. For a custom workbook, query the active analytics rules and extract their MITRE mappings:

// Detection coverage by MITRE ATT&CK tactic
SecurityIncident
| where TimeGenerated > ago(90d)
| extend Tactics = parse_json(tostring(AdditionalData)).tactics
| mv-expand Tactics
| summarize
    RuleFirings = count(),
    TruePositives = countif(Classification == "TruePositive")
    by tostring(Tactics)
| order by RuleFirings desc

This shows which ATT&CK tactics your rules are actually detecting threats for — not just which tactics are mapped, but which are producing true positive incidents. A tactic with many rule mappings but zero true positives may indicate rules that are too narrow, or that the specific attack technique is not targeting your environment.


The monthly detection review

Schedule a monthly detection engineering review.

Agenda:

Review new threat intelligence: has the threat landscape changed? Are new attack techniques targeting your industry?

Review coverage gaps: check the MITRE ATT&CK blade. Has the gap list changed since last month? Were planned rules deployed?

Review rule performance: run the false positive rate query. Identify rules needing tuning. Review rules with zero firings in 90 days — are they still relevant, or is the KQL not matching current data patterns?

Review new data sources: were new connectors deployed (Module 8)? Do they require new analytics rules?

Plan next month’s rule development: select 2-3 high-priority gap techniques for rule development. Assign to the detection engineering team (or the solo SOC analyst — who schedules detection engineering alongside investigation work).


Detection-as-code CI/CD pipeline

For organisations managing 50+ analytics rules, manual deployment through the portal is error-prone and unscalable. Implement a CI/CD pipeline.

Repository structure:

// Git repository layout (not KQL; shown for reference)
// sentinel-detections/
// ├── analytics-rules/
// │   ├── P1-InitialAccess-BruteForceSuccess.yaml
// │   ├── P1-Persistence-InboxRuleForwarding.yaml
// │   ├── P2-LateralMovement-RDPUnusualSource.yaml
// │   └── ...
// ├── automation-rules/
// │   ├── Routing-Identity-AssignTeam.json
// │   └── ...
// ├── hunting-queries/
// │   └── ...
// ├── workbooks/
// │   └── ...
// └── pipelines/
//     ├── validate.yaml
//     └── deploy.yaml

Validation pipeline (runs on pull request): Parse each rule YAML/JSON file. Validate KQL syntax by running the query against the workspace in dry-run mode. Check for required fields: entity mappings, MITRE technique, severity, schedule. Check naming convention compliance. Report validation results as PR comments.
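The required-field and naming-convention checks can be sketched as a small validator. The field names (`entityMappings`, `relevantTechniques`) follow the community Sentinel detection YAML schema, and the `P1-Tactic-Name` filename pattern is the assumed convention from the layout above — treat both as assumptions to adapt to your own repository:

```python
import re

# Assumed schema fields and naming convention; adapt to your repository's rules.
REQUIRED_FIELDS = ["name", "severity", "query", "entityMappings", "relevantTechniques"]
NAME_PATTERN = re.compile(r"^P[1-3]-[A-Za-z]+-[A-Za-z0-9]+$")

def validate_rule(rule):
    """Return a list of validation errors for one rule definition (empty = pass)."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if not rule.get(f)]
    filename = rule.get("fileName", "")
    stem = filename.rsplit(".", 1)[0]
    if not NAME_PATTERN.match(stem):
        errors.append(f"filename does not match naming convention: {filename}")
    return errors

rule = {
    "fileName": "P1-InitialAccess-BruteForceSuccess.yaml",
    "name": "Brute force success",
    "severity": "High",
    "query": "SigninLogs | where ResultType == '0'",
    "entityMappings": [{"entityType": "Account"}],
    "relevantTechniques": ["T1110"],
}
print(validate_rule(rule))  # [] — passes
```

In the real pipeline each error list would be posted back as a PR comment, and KQL syntax would additionally be validated against the workspace.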

Deployment pipeline (runs on merge to main): Deploy validated rules to the Sentinel workspace using the Sentinel REST API or Azure Resource Manager. Update existing rules (preserving enable/disable status). Create new rules. Report deployment results.

Rollback: If a deployed rule causes unexpected behaviour (flood of false positives, query timeout), revert the Git commit and re-run the deployment pipeline. The previous rule version is restored automatically.


Rule versioning and change tracking

Every rule change should be tracked with a version number and change description.

In-portal tracking: Add a version comment to the rule description: “v3.2 — Added exclusion for build server (2026-03-20). v3.1 — Lowered threshold from 30 to 20 (2026-03-05). v3.0 — Initial deployment (2026-02-15).” This gives any analyst reviewing the rule a quick history of changes.

In-Git tracking: Each commit message describes what changed and why: “P1-BruteForce: exclude build server IPs from source filter. FP rate was 45% due to CI/CD pipeline authentication failures from 10.0.1.50 and 10.0.1.51. Expected FP reduction: 30%.”


Detection metrics dashboard

Build a dedicated workbook for detection engineering metrics — separate from the SOC operational dashboard.

Tiles:

Rule library summary: total active rules, rules by severity, rules by MITRE tactic.

Coverage heatmap: MITRE ATT&CK technique coverage (from the coverage query earlier in this subsection).

Rule quality distribution: percentage of rules rated Excellent/Good/Needs Tuning/Critical (from the quality scoring query in this subsection).

False positive trend: weekly FP rate across all rules. Healthy trend: stable or declining.

Zero-firing rules: rules that have not generated an alert in 90 days. Either the threat is not targeting your environment (acceptable) or the KQL is not matching current data patterns (investigate).

New rules deployed this month: count and list of rules created in the current month. Tracks detection engineering velocity.

Rules tuned this month: count and list of rules modified. Tracks maintenance effort.

This workbook is the detection engineer’s equivalent of the SOC operational dashboard — it measures the health and effectiveness of the detection library itself, not individual incidents.

Detection engineering is not optional — it is the continuous improvement that separates effective SOC teams from reactive ones.

A SOC that deploys rules once and never tunes them will, within 6 months, have a rule library full of false positives, coverage gaps for new threat techniques, and blind spots where data sources changed. The monthly detection review is a 2-hour investment that maintains the value of everything you built in Modules 7-9.


Detection engineering for solo SOC operators

Not every organisation has a dedicated detection engineering team. Many SOC operators are solo — responsible for both incident investigation and detection improvement. Adapting the detection engineering lifecycle for a solo operator requires ruthless prioritisation.

Time allocation: Dedicate 4 hours per month to detection engineering (one afternoon, or 1 hour per week). Split: 1 hour on rule tuning (address the highest-FP rule), 1 hour on coverage analysis (identify the most critical gap), 1 hour on rule development (build one new rule), 1 hour on documentation and review.

Prioritisation framework: You cannot build a rule for every MITRE technique. Focus on: techniques used in recent incidents (you have direct evidence they target your environment), techniques highlighted in Microsoft Threat Intelligence reports for your industry, and techniques with the highest impact if undetected (data exfiltration, privilege escalation, persistence).

Leverage Content Hub. Do not write custom rules when a Content Hub template exists. Install the template, customise it (threshold, exclusions, entity mapping), and move on. Reserve custom rule development for detection gaps that no template covers.

Automate the review. Schedule the false positive rate query (from earlier in this subsection) as a monthly email via Logic Apps. The email arrives on the first Monday of each month with the top 5 rules needing tuning — no manual query execution required.


Tracking the threat landscape

Detection engineering responds to the threat landscape. Stay informed to build relevant rules.

Microsoft Threat Intelligence Blog (microsoft.com/security/blog): monthly reports on active threat groups, new techniques, and campaign details. Microsoft names threat groups (Storm-XXXX) and provides specific IOCs and TTPs that map directly to Sentinel detection opportunities.

MITRE ATT&CK updates: The ATT&CK framework is updated quarterly with new techniques and sub-techniques. Review updates for techniques relevant to your environment — a new sub-technique for cloud persistence may require a new analytics rule.

Industry ISACs: Sector-specific threat intelligence sharing organisations (FS-ISAC for financial services, H-ISAC for healthcare, etc.). Join your sector’s ISAC for threat intelligence that is specifically relevant to your industry and threat profile.

Incident-driven detection. The most valuable detection rules come from your own incidents. After every true positive incident, ask: “What analytics rule detected this? Could we have detected it earlier at a different stage? Are there related techniques the attacker used that we have no rule for?” Module 14 (AiTM investigation) and Modules 14-15 (BEC, token replay) demonstrate this principle — each investigation scenario includes detection engineering takeaways that feed directly back into the rule library.

Try it yourself

Navigate to Sentinel → MITRE ATT&CK. Review the coverage visualisation. Identify 3 techniques that are relevant to your environment but have no analytics rule coverage. For one of those techniques, research the technique on attack.mitre.org, identify which Sentinel table would contain evidence, write a draft KQL query, and test it against historical data. This is one iteration of the detection engineering cycle — repeated monthly, it systematically closes your coverage gaps.

What you should observe

The MITRE ATT&CK blade shows techniques colour-coded by coverage. Green = active analytics rule. Red/grey = no coverage. In a new deployment, many techniques are uncovered. After deploying Content Hub solutions and custom rules through Modules 7-9, coverage should be improving. The exercise of researching one technique and writing a rule is the core skill of detection engineering.


Knowledge check

Compliance mapping

NIST CSF: DE.AE-1 (Baseline of operations established), PR.DS-1 (Data-at-rest is protected). ISO 27001: A.8.15 (Logging), A.8.16 (Monitoring activities). SOC 2: CC7.2 (Monitor system components). Every configuration in this subsection contributes to the logging and monitoring controls that auditors verify.


Check your understanding

1. An analytics rule has a 45% false positive rate over the last month. What should you do?

Answer: Analyse the false positive incidents to identify the common benign pattern. Modify the KQL query to exclude that pattern (add a where clause filtering the known-benign condition). Test the modified query against historical data to verify it still detects true positives. Deploy the updated rule. If the FP rate is still above 30%, raise the threshold or narrow the entity scope. A tactical automation rule can suppress the noise during the tuning period. (Not: disabling the rule permanently, ignoring the false positives, or lowering the severity to Informational.)