
TH1.8 Hypothesis Prioritization and Backlog Management

3-4 hours · Module 1 · Free
Operational Objective
Your backlog will grow faster than you can hunt. That is by design — it means you are generating hypotheses from multiple sources. But it means you must prioritize ruthlessly. Hunting the wrong hypothesis first wastes hours that could have found a compromise. This subsection teaches the scoring framework that ensures you always hunt the highest-value hypothesis next.
Deliverable: A prioritization framework for your hunt backlog with a scoring model you can apply to every new hypothesis, and the discipline to manage the backlog as a living document.
⏱ Estimated completion: 25 minutes

You cannot hunt everything

The ATT&CK coverage analysis in TH3 will produce 50–80 hypotheses from coverage gaps alone. Threat intelligence adds more. Prior incidents add more. Environmental changes add more. A monthly hunting cadence executes 12 campaigns per year. You need to select the 12 that matter most from a backlog of 100+.

The selection must be systematic, not instinct-driven. "This one feels important" is not a prioritization framework. A scoring model is.

The scoring model

// Identify your highest-gap ATT&CK tactics — these map to Score 3 detection gap severity
SecurityAlert
| where TimeGenerated > ago(90d)
| where ProviderName == "ASI Scheduled Alerts"
| extend Tactics = parse_json(tostring(
    parse_json(ExtendedProperties)["Tactics"]))
| mv-expand Tactic = Tactics
| summarize RuleCount = dcount(AlertName) by tostring(Tactic)
// Tactics with the lowest RuleCount (or missing entirely) = highest gap severity
// Cross-reference with the MITRE ATT&CK tactic list to find
//   the tactics that DO NOT appear in these results at all
// Those zero-coverage tactics contain your Score-3 hypotheses

Three dimensions, each scored 1–3. Multiply for a composite score of 1–27.

Dimension 1: Threat relevance (1–3). How likely is this technique to be used against your environment?

Score 3: The technique is actively used against your industry. Specific threat intelligence reports name threat actors targeting your sector with this technique. Example: AiTM session hijacking against financial services (actively targeted by multiple BEC and initial access groups).

Score 2: The technique is commonly used against M365 environments generally. No sector-specific targeting intelligence, but the technique appears in Mandiant, CrowdStrike, or Microsoft threat reports as broadly prevalent. Example: OAuth consent phishing (common across all M365 environments).

Score 1: The technique is documented in ATT&CK but not widely observed against M365 environments or your sector. Example: DNS tunnelling for C2 (real but less common in cloud-first environments without on-prem DNS infrastructure).

Dimension 2: Data availability (1–3). Can you test the hypothesis with data you currently have?

Score 3: All required data sources are ingested with sufficient retention. The hypothesis is fully testable today. Example: identity compromise hunt using SigninLogs and AADNonInteractiveUserSignInLogs — both ingested with 90-day retention.

Score 2: Most required data is available, but one enrichment source is missing or has limited retention. The hypothesis is testable but with reduced visibility. Example: OAuth consent hunt using AuditLogs (available) but AADServicePrincipalSignInLogs (not ingested) — consent events visible, post-consent app behavior partially blind.

Score 1: Critical data sources are not ingested. The hypothesis is not testable until data ingestion is enabled. Example: Graph API abuse hunt requiring MicrosoftGraphActivityLogs — not enabled. Action: enable ingestion before hunting.

Dimension 3: Detection gap severity (1–3). If this technique is used against you and no detection exists, how bad is the outcome?

Score 3: The technique enables immediate high-impact outcomes — data exfiltration, financial fraud, ransomware deployment, identity infrastructure compromise. Example: privilege escalation to Global Admin (T1098) — enables full tenant compromise.

Score 2: The technique enables significant intermediate outcomes — persistent access, reconnaissance, lateral movement. The attacker needs additional steps to achieve high impact. Example: inbox rule manipulation (T1564.008) — hides security notifications, enabling longer dwell time.

Score 1: The technique enables limited outcomes — information gathering, initial foothold without immediate escalation path. Example: directory enumeration via Graph API (T1087) — provides the attacker with environmental knowledge but does not directly enable data access.

Composite score and priority bands

Multiply the three scores: Threat Relevance × Data Availability × Detection Gap Severity.

Score 18–27: Hunt immediately. This is a high-relevance, testable, high-impact gap. It should be the next hypothesis you execute.

Score 8–17: Hunt this quarter. Relevant and testable with moderate-to-high impact. Schedule within the next 3 months.

Score 3–7: Hunt when higher priorities are addressed. Either low relevance, low data availability, or low impact. Keep in the backlog but do not prioritize over higher-scoring hypotheses.

Score 1–2: Defer or retire. Either the data is not available (fix the data gap first) or the technique is not relevant to your environment. Review periodically — relevance changes as the threat landscape evolves.
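The banding logic above can be written down directly. A minimal Python sketch (the function names are illustrative, not part of any course toolkit):

```python
def composite_score(threat_relevance: int, data_availability: int,
                    gap_severity: int) -> int:
    """Multiply the three 1-3 dimension scores into a 1-27 composite."""
    for s in (threat_relevance, data_availability, gap_severity):
        if s not in (1, 2, 3):
            raise ValueError("each dimension must be scored 1, 2, or 3")
    return threat_relevance * data_availability * gap_severity

def priority_band(score: int) -> str:
    """Map a composite score onto the four priority bands."""
    if score >= 18:
        return "Hunt immediately"
    if score >= 8:
        return "Hunt this quarter"
    if score >= 3:
        return "Hunt when capacity allows"
    return "Defer or retire"

# AiTM session hijacking against a financial-services tenant: 3 x 3 x 3
print(priority_band(composite_score(3, 3, 3)))   # Hunt immediately
# Low-relevance technique with missing data: 1 x 1 x 2
print(priority_band(composite_score(1, 1, 2)))   # Defer or retire
```

Note that multiplication (rather than addition) is what makes a single 1 drag the composite down sharply: a hypothesis you cannot test scores at most 9 regardless of relevance and impact.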

[Figure TH1.8 — Hypothesis prioritization scoring model]
THREAT RELEVANCE: 3 = sector-specific targeting · 2 = broadly prevalent · 1 = documented but uncommon
× DATA AVAILABILITY: 3 = all data ingested · 2 = partial visibility · 1 = critical data missing
× GAP SEVERITY: 3 = immediate high impact · 2 = significant intermediate · 1 = limited outcomes
Priority bands: 18–27 hunt now · 8–17 this quarter · 3–7 when capacity allows · 1–2 defer

Figure TH1.8 — Three-dimensional scoring model. Multiply the three scores for a composite of 1–27. Four priority bands determine when each hypothesis is hunted.

Managing the backlog

The backlog is a living document. It grows as new hypotheses are generated and shrinks as hypotheses are hunted, converted to detection rules, or retired.

Add: Every new hypothesis enters with a source tag, a preliminary score, and a status of "Not started." Score it within 48 hours of entry — before the context fades.

Promote: When threat intelligence makes a previously low-relevance hypothesis urgent (new report of active exploitation targeting your sector), re-score and promote.

Retire: Hypotheses with Score 1 data availability (critical data missing) that have been in the backlog for 6+ months without the data gap being addressed should be retired. They consume cognitive space without being executable. If the data source is eventually enabled, re-enter the hypothesis fresh.

Complete: A hunted hypothesis moves from "Not started" to "Completed." The outcome (confirmed, refuted, inconclusive) and the detection rule ID (if produced) are recorded. Completed hypotheses remain in the backlog as a reference — they document what has been hunted and when.
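The add/promote/retire/complete lifecycle above can be sketched as a small data model. This is a Python illustration under assumed field names; the course does not prescribe a tool, and a spreadsheet with the same columns works equally well.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Hypothesis:
    title: str
    source: str                    # source tag, e.g. "ATT&CK gap", "CTI report"
    threat_relevance: int          # 1-3
    data_availability: int         # 1-3
    gap_severity: int              # 1-3
    status: str = "Not started"
    entered: date = field(default_factory=date.today)
    outcome: Optional[str] = None  # confirmed / refuted / inconclusive
    rule_id: Optional[str] = None  # detection rule produced, if any

    @property
    def score(self) -> int:
        return self.threat_relevance * self.data_availability * self.gap_severity

def rescore(h: Hypothesis, threat_relevance: int) -> None:
    """Promote (or demote) when new intelligence changes relevance."""
    h.threat_relevance = threat_relevance

def retire_stale(backlog: list, today: date, max_age_days: int = 180) -> None:
    """Retire hypotheses blocked on missing data for roughly 6+ months."""
    for h in backlog:
        if (h.status == "Not started" and h.data_availability == 1
                and (today - h.entered).days >= max_age_days):
            h.status = "Retired"

def complete(h: Hypothesis, outcome: str, rule_id: Optional[str] = None) -> None:
    """Record the hunt outcome; completed entries stay in the backlog."""
    h.status, h.outcome, h.rule_id = "Completed", outcome, rule_id
```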

Try it yourself

Exercise: Score your three hypotheses

Take the three hypotheses from TH1.1 exercise. Score each on the three dimensions (1–3 each). Calculate the composite score. Rank them.

The highest-scoring hypothesis should be your first campaign. If two hypotheses tie, prefer the one with higher Data Availability (score 3) — it is immediately executable with maximum visibility.
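The tie-break rule can be encoded as a sort key: descending composite score first, then descending data availability. A sketch with two hypothetical tied hypotheses (the names and scores are illustrative):

```python
def rank_key(h):
    # Sort descending by composite score; break ties on data availability,
    # so the immediately executable hypothesis is hunted first.
    composite = h["relevance"] * h["data"] * h["severity"]
    return (-composite, -h["data"])

backlog = [
    {"name": "AiTM session hijack", "relevance": 3, "data": 2, "severity": 3},
    {"name": "OAuth consent abuse", "relevance": 2, "data": 3, "severity": 3},
]
# Both score 18; the data-availability tiebreak puts OAuth consent abuse first.
for h in sorted(backlog, key=rank_key):
    print(h["name"])
```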

The coverage gap method

Prioritize hypotheses by mapping them against existing detection rule coverage. Export the list of analytics rules from Sentinel, map each to its MITRE ATT&CK technique, and identify techniques with zero or minimal detection coverage. These uncovered techniques become the highest-priority hunt hypotheses — they represent threats the SOC is completely blind to. A hunt against an uncovered technique has the highest probability of producing a unique finding because no automated detection exists to catch the activity first. This method transforms the ATT&CK coverage heat map from a reporting artifact into a hunt prioritization tool.
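Under the assumption that you can export your analytics rules with their technique mappings, the zero-coverage set is a simple set difference. The rule names and technique lists below are illustrative, not a real export:

```python
from collections import Counter

# Techniques relevant to your environment (e.g. from your threat profile).
relevant_techniques = {"T1098", "T1087", "T1564.008", "T1557", "T1528"}

# Hypothetical analytics-rule export: rule name -> mapped technique IDs.
rule_export = [
    {"rule": "Admin role added", "techniques": ["T1098"]},
    {"rule": "Suspicious inbox rule", "techniques": ["T1564.008"]},
]

# Count rules per technique, then subtract covered techniques from the
# relevant set: whatever remains has zero detection coverage.
coverage = Counter(t for r in rule_export for t in r["techniques"])
uncovered = sorted(relevant_techniques - coverage.keys())
print(uncovered)  # zero-coverage techniques -> top hunt candidates
```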

The queries developed during this exercise become reusable templates in your personal hunting library. Parameterize the hardcoded values (user names, IP addresses, time windows) and add a header comment explaining the hypothesis each query tests. A mature hunting program maintains 50–100 parameterized query templates that any team member can execute — reducing per-hunt preparation time from hours to minutes and ensuring consistent methodology across analysts.

⚠ Prioritization Myth: "We should hunt whatever the latest threat report describes"

The myth: The most recent threat intelligence should always drive the next hunt. If a new report comes out today, tomorrow's hunt should test it.

The reality: Recency is not priority. A threat report about a technique that scores 3-3-3 (27) should absolutely drive the next hunt. A threat report about a technique that scores 1-1-2 (2) — low relevance to your sector, data not available, moderate impact — should enter the backlog at low priority regardless of how recent it is. The scoring model prevents recency bias from overriding relevance, availability, and impact. New intelligence updates scores — it does not automatically promote hypotheses to the front of the queue.

Extend this model

Some organizations add a fourth dimension: estimated effort. A high-scoring hypothesis that requires 2 hours of hunting produces faster ROI than one that requires 8 hours. Dividing the composite score by estimated effort produces a "value per hour" metric that helps when choosing between similarly-scored hypotheses. This optimization is useful for mature programs that are managing large backlogs — for initial programs, the three-dimension model is sufficient.
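The value-per-hour variant is a single division. A sketch, assuming effort is estimated in hours (the figures below are illustrative):

```python
def value_per_hour(composite: int, effort_hours: float) -> float:
    """Optional fourth dimension: composite score divided by estimated effort."""
    return composite / effort_hours

# A score-18 hypothesis needing 2 hours beats a score-24 hypothesis
# needing 8 hours on ROI, even though its composite score is lower.
print(value_per_hour(18, 2.0))  # 9.0
print(value_per_hour(24, 8.0))  # 3.0
```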


References Used in This Subsection

  • Course cross-references: TH1.1 (hypothesis generation), TH3 (ATT&CK coverage analysis as primary backlog source)

NE environmental considerations

NE's detection environment includes specific factors that influence how hypotheses score on data availability and detection gap severity:

Device diversity: 768 P2 corporate workstations with full Defender for Endpoint telemetry, 58 P1 manufacturing workstations with basic cloud-delivered protection, and 3 RHEL rendering servers with Syslog-only coverage. Rules targeting DeviceProcessEvents operate with full fidelity on P2 devices but may have reduced visibility on P1 devices. Manufacturing workstations in Sheffield and Sunderland represent a detection gap for endpoint-level detections.


Network topology: 11 offices connected via Palo Alto SD-WAN with full-mesh connectivity. The SD-WAN firewall logs feed CommonSecurityLog in Sentinel. Cross-site lateral movement generates firewall allow events that correlate with DeviceLogonEvents — enabling multi-source detection that single-table rules cannot achieve.

User population: 810 users with distinct behavioral profiles — office workers (predictable hours, consistent applications), field engineers (variable hours, travel patterns), IT administrators (elevated privilege, broad access patterns), and manufacturing operators (fixed shifts, limited application access). Each user population has different detection baselines.

Decision point

Your ATT&CK coverage analysis shows 45% coverage. The CISO asks: "What is our target?" Do you say 100%?

No. 100% ATT&CK coverage is neither achievable nor meaningful — some techniques are inherently difficult to detect, some are irrelevant to NE's environment, and the cost of detecting the last 10% is disproportionate to the risk reduction. The target is based on NE's threat profile: 80% coverage of techniques observed in attacks against defense supply chain organizations (sourced from MDDR and CiSP intelligence). This threat-informed target focuses resources on the techniques NE is most likely to face, not on theoretical completeness.

A hunt query returns 200 results. You have 4 hours remaining in the hunt window. You can investigate 20 results thoroughly or review all 200 superficially. Which approach produces better hunt outcomes?

  • Review all 200 — you might miss a critical finding in the 180 you skip.
  • Investigate 20 thoroughly.
  • Investigate 20 — but only if they are from the most recent 24 hours.
  • Neither — refine the query first to reduce the result set below 50.

Investigate 20 thoroughly. A superficial review of 200 results produces 200 "looked at it, seemed okay" assessments that provide no investigative value and no documentation for future reference. A thorough investigation of 20 results produces: confirmed findings (true positives requiring remediation), confirmed benign patterns (documented baselines for future comparison), and inconclusive results (flagged for monitoring). Prioritize the 20 by: highest anomaly score, highest-value assets involved, and highest-risk users involved. Document why the remaining 180 were not investigated and recommend a follow-up hunt with refined query criteria to reduce the result set.
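The triage ranking in the answer above (anomaly score, asset value, user risk) can be sketched as a sort over the result set. The field names are assumptions about your hunt query's output schema, and the sample data is synthetic:

```python
def triage_key(r):
    # Rank descending: anomaly score first, then asset value, then user risk.
    return (-r["anomaly_score"], -r["asset_value"], -r["user_risk"])

def select_for_investigation(results, budget=20):
    """Split results into a deep-investigation set and a documented remainder."""
    ranked = sorted(results, key=triage_key)
    return ranked[:budget], ranked[budget:]

# Synthetic 200-result set standing in for a real hunt query output.
results = [{"id": i, "anomaly_score": i % 7, "asset_value": i % 3,
            "user_risk": i % 5} for i in range(200)]
investigate, skipped = select_for_investigation(results)
print(len(investigate), len(skipped))  # 20 180
```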

You understand the detection gap and the hunt cycle.

TH0 showed you what detection rules fundamentally cannot catch. TH1 gave you the hypothesis-driven methodology that closes that gap. Now you run the hunts.

  • 10 complete hunt campaigns — from hypothesis through KQL execution through finding disposition, each campaign based on a real TTP
  • 70 production hunt queries — every one mapped to MITRE ATT&CK and tested against realistic telemetry
  • Advanced KQL for hunting — UEBA composite risk scoring, retroactive IOC sweeps, and hunt management metrics
  • Hypothesis-Driven Hunt Toolkit lab pack — 30 days of realistic M365 and endpoint telemetry with multiple attack patterns seeded in
  • TH16 — Scaling hunts across a team — the operating model for a production hunt program