TH1.8 Hypothesis Prioritization and Backlog Management

3-4 hours · Module 1 · Free
Operational Objective
Your backlog will grow faster than you can hunt. That is by design: it means you are generating hypotheses from multiple sources. But it also means you must prioritize ruthlessly. Hunting the wrong hypothesis first wastes hours that could have found a compromise. This subsection teaches the scoring framework that ensures you always hunt the highest-value hypothesis next.
Deliverable: A prioritization framework for your hunt backlog with a scoring model you can apply to every new hypothesis, and the discipline to manage the backlog as a living document.
⏱ Estimated completion: 25 minutes

You cannot hunt everything

The ATT&CK coverage analysis in TH3 will produce 50–80 hypotheses from coverage gaps alone. Threat intelligence adds more. Prior incidents add more. Environmental changes add more. A monthly hunting cadence executes 12 campaigns per year. You need to select the 12 that matter most from a backlog of 100+.

The selection must be systematic, not instinct-driven. “This one feels important” is not a prioritization framework. A scoring model is.

The scoring model

Three dimensions, each scored 1–3. Multiply for a composite score of 1–27.

Dimension 1: Threat relevance (1–3). How likely is this technique to be used against your environment?

Score 3: The technique is actively used against your industry. Specific threat intelligence reports name threat actors targeting your sector with this technique. Example: AiTM session hijacking against financial services (actively targeted by multiple BEC and initial access groups).

Score 2: The technique is commonly used against M365 environments generally. No sector-specific targeting intelligence, but the technique appears in Mandiant, CrowdStrike, or Microsoft threat reports as broadly prevalent. Example: OAuth consent phishing (common across all M365 environments).

Score 1: The technique is documented in ATT&CK but not widely observed against M365 environments or your sector. Example: DNS tunneling for C2 (real but less common in cloud-first environments without on-prem DNS infrastructure).

Dimension 2: Data availability (1–3). Can you test the hypothesis with data you currently have?

Score 3: All required data sources are ingested with sufficient retention. The hypothesis is fully testable today. Example: authentication anomaly hunt using SigninLogs and AADNonInteractiveUserSignInLogs — both ingested with 90-day retention.

Score 2: Most required data is available, but one enrichment source is missing or has limited retention. The hypothesis is testable but with reduced visibility. Example: OAuth consent hunt using AuditLogs (available) but AADServicePrincipalSignInLogs (not ingested) — consent events visible, post-consent app behavior partially blind.

Score 1: Critical data sources are not ingested. The hypothesis is not testable until data ingestion is enabled. Example: Graph API abuse hunt requiring MicrosoftGraphActivityLogs — not enabled. Action: enable ingestion before hunting.

Dimension 3: Detection gap severity (1–3). If this technique is used against you and no detection exists, how bad is the outcome?

Score 3: The technique enables immediate high-impact outcomes — data exfiltration, financial fraud, ransomware deployment, identity infrastructure compromise. Example: privilege escalation to Global Admin (T1098) — enables full tenant compromise.

Score 2: The technique enables significant intermediate outcomes — persistent access, reconnaissance, lateral movement. The attacker needs additional steps to achieve high impact. Example: inbox rule manipulation (T1564.008) — hides security notifications, enabling longer dwell time.

Score 1: The technique enables limited outcomes — information gathering, initial foothold without immediate escalation path. Example: directory enumeration via Graph API (T1087) — provides the attacker with environmental knowledge but does not directly enable data access.

Composite score and priority bands

Multiply the three scores: Threat Relevance × Data Availability × Detection Gap Severity.

Score 18–27: Hunt immediately. This is a high-relevance, testable, high-impact gap. It should be the next hypothesis you execute.

Score 8–17: Hunt this quarter. Relevant and testable with moderate-to-high impact. Schedule within the next 3 months.

Score 3–7: Hunt when higher priorities are addressed. Either low relevance, low data availability, or low impact. Keep in the backlog but do not prioritize over higher-scoring hypotheses.

Score 1–2: Defer or retire. Either the data is not available (fix the data gap first) or the technique is not relevant to your environment. Review periodically — relevance changes as the threat landscape evolves.
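The composite calculation and band assignment above can be sketched in a few lines of Python. The function and band labels are paraphrased from the thresholds in this section, not a prescribed implementation:

```python
def composite(threat_relevance: int, data_availability: int, gap_severity: int) -> int:
    """Multiply the three 1-3 dimension scores into a 1-27 composite."""
    for score in (threat_relevance, data_availability, gap_severity):
        if score not in (1, 2, 3):
            raise ValueError("each dimension must be scored 1, 2, or 3")
    return threat_relevance * data_availability * gap_severity

def priority_band(score: int) -> str:
    """Map a composite score to one of the four priority bands."""
    if score >= 18:
        return "Hunt immediately"
    if score >= 8:
        return "Hunt this quarter"
    if score >= 3:
        return "Hunt when capacity allows"
    return "Defer or retire"

# Example: a hypothesis scored 3 (sector-targeted), 3 (all data ingested),
# 3 (immediate high impact) lands in the top band.
print(composite(3, 3, 3), priority_band(composite(3, 3, 3)))  # 27 Hunt immediately
```

Because each dimension is 1, 2, or 3, the only reachable composites are 1, 2, 3, 4, 6, 8, 9, 12, 18, and 27, which is why the band boundaries at 18, 8, and 3 never split a reachable score.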

// Identify your highest-gap ATT&CK tactics; these map to Score 3 detection gap severity
SecurityAlert
| where TimeGenerated > ago(90d)
| where ProviderName == "ASI Scheduled Alerts"
| extend Tactics = parse_json(tostring(
    parse_json(ExtendedProperties)["Tactics"]))
| mv-expand Tactic = Tactics
| summarize RuleCount = dcount(AlertName) by tostring(Tactic)
// Tactics with the lowest RuleCount (or missing entirely) = highest gap severity
// Cross-reference with the MITRE ATT&CK tactic list to find
//   the tactics that DO NOT appear in these results at all
// Those zero-coverage tactics contain your Score-3 hypotheses
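The cross-reference in the final comments can be done outside KQL. A sketch in Python, assuming the ATT&CK Enterprise tactic names and a hypothetical `covered` set taken from the query output:

```python
# MITRE ATT&CK Enterprise tactics; compare against the tactics the query returned
ALL_TACTICS = {
    "Reconnaissance", "ResourceDevelopment", "InitialAccess", "Execution",
    "Persistence", "PrivilegeEscalation", "DefenseEvasion", "CredentialAccess",
    "Discovery", "LateralMovement", "Collection", "CommandAndControl",
    "Exfiltration", "Impact",
}

# Illustrative result of the KQL query: tactics with at least one analytics rule
covered = {"InitialAccess", "Persistence", "CredentialAccess"}

# Tactics with no rule at all are your Score-3 gap-severity candidates
zero_coverage = sorted(ALL_TACTICS - covered)
print(zero_coverage)
```

Set subtraction makes the zero-coverage tactics explicit instead of leaving them implied by absence from the query results.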
[Figure: Hypothesis Prioritization Scoring Model. Threat Relevance (3 = sector-specific targeting, 2 = broadly prevalent, 1 = documented but uncommon) × Data Availability (3 = all data ingested, 2 = partial visibility, 1 = critical data missing) × Gap Severity (3 = immediate high impact, 2 = significant intermediate, 1 = limited outcomes). Bands: 18–27 hunt now; 8–17 this quarter; 3–7 when capacity; 1–2 defer.]

Figure TH1.8 — Three-dimensional scoring model. Multiply the three scores for a composite of 1–27. Four priority bands determine when each hypothesis is hunted.

Managing the backlog

The backlog is a living document. It grows as new hypotheses are generated and shrinks as hypotheses are hunted, converted to detection rules, or retired.

Add: Every new hypothesis enters with a source tag, a preliminary score, and a status of “Not started.” Score it within 48 hours of entry — before the context fades.

Promote: When threat intelligence makes a previously low-relevance hypothesis urgent (new report of active exploitation targeting your sector), re-score and promote.

Retire: Hypotheses with Score 1 data availability (critical data missing) that have been in the backlog for 6+ months without the data gap being addressed should be retired. They consume cognitive space without being executable. If the data source is eventually enabled, re-enter the hypothesis fresh.

Complete: A hunted hypothesis moves from “Not started” to “Completed.” The outcome (confirmed, refuted, inconclusive) and the detection rule ID (if produced) are recorded. Completed hypotheses remain in the backlog as a reference — they document what has been hunted and when.
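The add/promote/retire/complete lifecycle implies a minimal record per backlog entry. A sketch in Python; the field names and the `retire_candidates` helper are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Hypothesis:
    """One backlog entry with source tag, dimension scores, and status."""
    title: str
    source: str               # e.g. "ATT&CK gap", "threat intel", "prior incident"
    threat_relevance: int     # 1-3
    data_availability: int    # 1-3
    gap_severity: int         # 1-3
    status: str = "Not started"   # moves to "Completed" once hunted
    entered: date = field(default_factory=date.today)
    outcome: str = ""             # "confirmed" / "refuted" / "inconclusive"
    rule_id: str = ""             # detection rule produced, if any

    @property
    def score(self) -> int:
        return self.threat_relevance * self.data_availability * self.gap_severity

def retire_candidates(backlog: list, today: date) -> list:
    """Entries with Score-1 data availability, never started, 6+ months old."""
    return [h for h in backlog
            if h.data_availability == 1
            and h.status == "Not started"
            and (today - h.entered).days >= 180]
```

Keeping completed entries in the same structure (with `outcome` and `rule_id` filled in) preserves the record of what has been hunted and when.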

Try it yourself

Exercise: Score your three hypotheses

Take the three hypotheses from the TH1.1 exercise. Score each on the three dimensions (1–3 each). Calculate the composite score. Rank them.

The highest-scoring hypothesis should be your first campaign. If two hypotheses tie, prefer the one with the higher Data Availability score — it is immediately executable with maximum visibility.

⚠ Compliance Myth: "We should hunt whatever the latest threat report describes"

The myth: The most recent threat intelligence should always drive the next hunt. If a new report comes out today, tomorrow’s hunt should test it.

The reality: Recency is not priority. A threat report about a technique that scores 3-3-3 (27) should absolutely drive the next hunt. A threat report about a technique that scores 1-1-2 (2) — low relevance to your sector, data not available, moderate impact — should enter the backlog at low priority regardless of how recent it is. The scoring model prevents recency bias from overriding relevance, availability, and impact. New intelligence updates scores — it does not automatically promote hypotheses to the front of the queue.

Extend this model

Some organizations add a fourth dimension: estimated effort. A high-scoring hypothesis that requires 2 hours of hunting produces faster ROI than one that requires 8 hours. Dividing the composite score by estimated effort produces a "value per hour" metric that helps when choosing between similarly scored hypotheses. This optimization is useful for mature programs managing large backlogs — for initial programs, the three-dimension model is sufficient.
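The fourth-dimension extension is a one-line calculation; a sketch, with example numbers chosen to show the effort tie-break:

```python
def value_per_hour(composite: int, effort_hours: float) -> float:
    """Optional fourth dimension: composite score divided by estimated effort."""
    if effort_hours <= 0:
        raise ValueError("estimated effort must be positive")
    return composite / effort_hours

# Two hypotheses with the same composite score of 18:
print(value_per_hour(18, 2.0))  # 9.0  -> the 2-hour hunt wins on value per hour
print(value_per_hour(18, 8.0))  # 2.25
```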


References Used in This Subsection

  • Course cross-references: TH1.1 (hypothesis generation), TH3 (ATT&CK coverage analysis as primary backlog source)
