1.5 Operational Metrics and KPIs

8-10 hours · Module 1 · Free

Introduction

Metrics answer the question every CISO asks: “Is our SOC getting better?” Without metrics, the answer is a feeling. With metrics, the answer is a trend line with evidence.

SOC metrics serve two purposes. Internally, they identify what needs improvement — which detection rules are too noisy, which incident types take too long to contain, where capacity is consumed. Externally, they demonstrate the SOC’s value to the business — how many threats were detected, how quickly they were contained, and what the trend looks like over time.

This subsection defines the core metrics, explains how to measure each from Sentinel data, and establishes targets.

The six core SOC metrics

1. Mean Time to Detect (MTTD)

How long between an adversary’s first action and the SOC’s first alert.

How to measure: From the technical timeline (Module S8, subsection 8.3), calculate: first detection alert timestamp minus first adversary action timestamp. Average across all incidents per month.

Target: Under 15 minutes for high-severity detections. Under 60 minutes for medium-severity.

What it tells you: MTTD reflects detection rule quality and coverage. A high MTTD means either the detection rules are not catching the initial attack stages or the log ingestion pipeline has excessive latency.
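The MTTD calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a Sentinel export: the incident records, field names, and timestamps are hypothetical stand-ins for values taken from each incident's technical timeline.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice both timestamps come from the
# technical timeline of each incident.
incidents = [
    {"id": "INC-1", "first_adversary_action": datetime(2024, 5, 1, 9, 0),
     "first_alert": datetime(2024, 5, 1, 9, 12)},
    {"id": "INC-2", "first_adversary_action": datetime(2024, 5, 3, 14, 30),
     "first_alert": datetime(2024, 5, 3, 15, 10)},
    {"id": "INC-3", "first_adversary_action": datetime(2024, 5, 9, 22, 5),
     "first_alert": datetime(2024, 5, 9, 22, 13)},
]

def detect_minutes(incident):
    """Minutes from first adversary action to first detection alert."""
    delta = incident["first_alert"] - incident["first_adversary_action"]
    return delta.total_seconds() / 60

# Average the per-incident detection times for the month.
mttd = mean(detect_minutes(i) for i in incidents)
print(f"MTTD: {mttd:.1f} min")
```

A single slow detection (40 minutes in the second record) pulls the average well above the other two incidents, which is one reason to review the per-incident values alongside the monthly figure.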

2. Mean Time to Respond (MTTR)

How long between the first detection alert and the completion of containment.

How to measure: From the incident record (Module S8), calculate: containment completion timestamp minus detection alert timestamp. Average across all incidents per month.

Target: Under 30 minutes for Critical, under 2 hours for High, under 8 hours for Medium.

What it tells you: MTTR reflects response process efficiency. A high MTTR means the triage process is too slow, the escalation path has bottlenecks, or containment decisions are delayed by unclear authority.
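Because the MTTR targets differ by severity, it helps to compute the average per severity rather than one blended number. A minimal sketch, assuming response times have already been extracted as minutes; the sample values are hypothetical, and only the targets come from the text above.

```python
from collections import defaultdict
from statistics import mean

# Targets from the text, in minutes.
TARGET_MINUTES = {"Critical": 30, "High": 120, "Medium": 480}

# Hypothetical data: (severity, minutes from first alert to containment complete).
response_times = [
    ("Critical", 25), ("Critical", 41),
    ("High", 90), ("High", 110),
    ("Medium", 300),
]

# Group response times by severity.
by_severity = defaultdict(list)
for severity, minutes in response_times:
    by_severity[severity].append(minutes)

# Mean time to respond per severity, and whether it is under its target.
mttr = {sev: mean(vals) for sev, vals in by_severity.items()}
within_target = {sev: mttr[sev] < TARGET_MINUTES[sev] for sev in mttr}

for sev in mttr:
    print(f"{sev}: MTTR {mttr[sev]} min, target <{TARGET_MINUTES[sev]} min, "
          f"met={within_target[sev]}")
```

In this sample the Critical average misses its target even though one of the two Critical incidents was contained in time.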

3. Dwell Time

How long the adversary had access from initial compromise to complete containment.

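How to measure: For each incident, dwell time is the time to detect plus the time to respond, so mean dwell time = MTTD + MTTR. Also calculable directly from the technical timeline: last containment action minus first adversary access. If you report medians, use the direct calculation, because medians do not add the way means do.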

Target: Under 60 minutes for business email compromise (BEC) and credential compromise. Under 4 hours for complex incidents.

What it tells you: Dwell time is the adversary’s window of opportunity. Every reduction in dwell time directly reduces potential damage — fewer emails read, fewer accounts compromised, less data exfiltrated.
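Dwell time is the sum of detection and response time for each incident. The sketch below uses hypothetical per-incident minutes to show that this sum carries over to monthly means, while the median dwell time has to be computed from the per-incident values.

```python
from statistics import mean, median

# Hypothetical per-incident values in minutes, one entry per incident.
detect = [10, 20, 60]    # first adversary action -> first alert
respond = [60, 30, 30]   # first alert -> containment complete

# Dwell time per incident is detection time plus response time.
dwell = [d + r for d, r in zip(detect, respond)]

mean_dwell = mean(dwell)
sum_of_means = mean(detect) + mean(respond)

median_dwell = median(dwell)
sum_of_medians = median(detect) + median(respond)

print(f"mean dwell {mean_dwell} vs MTTD + MTTR {sum_of_means}")
print(f"median dwell {median_dwell} vs sum of medians {sum_of_medians}")
```

Here the means agree (70 minutes either way), but adding the two medians gives 50 minutes against a true median dwell time of 70.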

4. Alert Volume and Signal-to-Noise Ratio

How many alerts are generated and what proportion are actionable.

How to measure:

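// SecurityIncident logs a new row on every incident update,
// so keep only the latest row per incident before counting.
SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed"
| summarize
    Total = count(),
    TP = countif(Classification == "TruePositive"),
    FP = countif(Classification == "FalsePositive"),
    BP = countif(Classification == "BenignPositive")
// SNR: true and benign positives as a percentage of closed incidents.
| extend SNR = round(100.0 * (TP + BP) / Total, 1)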

Target: SNR above 30%. Below 30% means more than 70% of closed incidents were not actionable, so most triage effort is going to noise.

What it tells you: Alert volume shows workload. SNR shows whether that workload is productive. A high-volume, low-SNR environment creates alert fatigue, a major cause of missed detections.

5. SLA Compliance

What percentage of incidents meet their severity-appropriate triage and containment SLAs.

How to measure: From incident records, compare actual triage and containment times against the SLAs defined in subsection 7.1.

Target: Above 90% for triage SLAs. Above 85% for containment SLAs.

What it tells you: SLA compliance reflects staffing adequacy and process efficiency. Persistent SLA misses indicate either the SLAs are unrealistic (wrong targets) or the team is under-resourced (right targets, insufficient capacity).
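SLA compliance is a simple ratio once each incident's actual times sit next to the SLA for its severity. A minimal sketch follows; the SLA thresholds and incident records are hypothetical placeholders, not the values defined elsewhere in the course.

```python
# Hypothetical SLA thresholds in minutes, keyed by severity and phase.
SLA = {
    "Critical": {"triage": 15, "contain": 30},
    "High": {"triage": 30, "contain": 120},
}

# Hypothetical incidents with actual triage and containment times in minutes.
incidents = [
    {"severity": "Critical", "triage": 10, "contain": 28},
    {"severity": "Critical", "triage": 20, "contain": 25},
    {"severity": "High", "triage": 25, "contain": 150},
    {"severity": "High", "triage": 12, "contain": 95},
    {"severity": "High", "triage": 40, "contain": 100},
]

def compliance_pct(phase):
    """Percentage of incidents whose time for this phase met the SLA."""
    met = sum(
        1 for i in incidents
        if i[phase] <= SLA[i["severity"]][phase]
    )
    return 100 * met / len(incidents)

triage_pct = compliance_pct("triage")
contain_pct = compliance_pct("contain")
print(f"Triage SLA: {triage_pct:.0f}%  Containment SLA: {contain_pct:.0f}%")
```

Listing the incidents that missed, not just the percentage, makes it easier to tell an unrealistic target from a capacity problem.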

6. Detection Coverage

What percentage of in-scope MITRE ATT&CK techniques have active, tested detection rules.

How to measure: From the coverage map (Module S2, subsection 2.3): techniques with active detections divided by total techniques in scope.

Target: Context-dependent. Focus on coverage of high-priority techniques rather than a blanket percentage.

What it tells you: Coverage shows the breadth of your detection program. Gaps in coverage represent techniques an adversary can use without detection.
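The coverage ratio and the list of gaps fall out of two sets. A minimal sketch: the technique IDs are real ATT&CK identifiers, but which ones are in scope and which have detections is a hypothetical example.

```python
# Hypothetical scope: techniques the SOC has decided it must detect.
in_scope = {"T1566", "T1078", "T1110", "T1114", "T1059"}

# Hypothetical inventory: techniques with an active, tested detection rule.
# T1021 has a rule but is outside the scope, so it does not count.
detected = {"T1566", "T1078", "T1114", "T1021"}

covered = in_scope & detected
coverage_pct = 100 * len(covered) / len(in_scope)
gaps = sorted(in_scope - detected)

print(f"Coverage: {len(covered)}/{len(in_scope)} ({coverage_pct:.0f}%)")
print(f"Gaps: {gaps}")
```

The gap list is usually more useful than the percentage, because it names the techniques to prioritize next.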

The monthly SOC dashboard

Consolidate these metrics into a monthly dashboard for SOC leadership and the CISO:

Metric	This month	Last month	Trend	Target
MTTD (median)	22 min	35 min	↓ Improving	<15 min
MTTR (median)	45 min	52 min	↓ Improving	<30 min (Crit)
Dwell time (median)	67 min	87 min	↓ Improving	<60 min
Alert volume	412	380	↑	—
SNR	34%	28%	↑ Improving	>30%
SLA compliance (triage)	91%	88%	↑	>90%
SLA compliance (contain)	87%	82%	↑	>85%
Detection coverage	32/42 (76%)	28/42 (67%)	↑	Prioritized
PIR (post-incident review) actions completed	8/10 (80%)	5/12 (42%)	↑	>80%

One page. Trend arrows. Red/amber/green against targets. The CISO can read this in 2 minutes and know whether the SOC is improving.
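The red/amber/green status can be derived from each metric's value and target. The sketch below is one possible rule, not a standard: the 20% amber margin and the `rag` helper are assumptions for illustration, and the sample values are taken from the "This month" column of the table above.

```python
def rag(value, target, higher_is_better, amber_margin=0.20):
    """Return 'green', 'amber', or 'red' for a metric value against its target.

    amber_margin is an assumed tolerance: a miss within 20% of the target
    is amber, anything worse is red.
    """
    if higher_is_better:
        if value > target:
            return "green"
        return "amber" if value >= target * (1 - amber_margin) else "red"
    if value < target:
        return "green"
    return "amber" if value <= target * (1 + amber_margin) else "red"

# (metric, this month, target, higher_is_better)
rows = [
    ("MTTD (min)", 22, 15, False),
    ("Dwell time (min)", 67, 60, False),
    ("SNR (%)", 34, 30, True),
    ("SLA triage (%)", 91, 90, True),
]

statuses = {name: rag(value, target, hib) for name, value, target, hib in rows}
for name, status in statuses.items():
    print(f"{name}: {status}")
```

A metric can be improving month over month and still be red against its target, as MTTD is here, which is why the dashboard shows both the trend and the status.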

Metrics drive behavior — choose carefully

If you measure alert closure speed, analysts will close alerts faster — including by closing them without adequate investigation. If you measure incidents per analyst, analysts will avoid opening incidents. Metrics must be paired: measure closure speed AND reopen rate (incidents that were closed prematurely and had to be reopened). Measure incidents per analyst AND SLA compliance. The paired metric prevents gaming the primary metric.
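Pairing closure speed with reopen rate can be sketched as follows. The analyst names, closure times, and reopen flags are hypothetical; the point is that the two numbers are reported side by side for the same set of incidents.

```python
from statistics import median

# Hypothetical closed incidents: (analyst, minutes to close, was_reopened).
closed = [
    ("analyst_a", 12, True), ("analyst_a", 9, True),
    ("analyst_a", 15, False), ("analyst_a", 10, False),
    ("analyst_b", 35, False), ("analyst_b", 42, False),
    ("analyst_b", 28, True), ("analyst_b", 50, False),
]

def paired_metrics(analyst):
    """Median closure time paired with the reopen rate for one analyst."""
    rows = [r for r in closed if r[0] == analyst]
    return {
        "median_close_min": median(r[1] for r in rows),
        "reopen_rate_pct": 100 * sum(r[2] for r in rows) / len(rows),
    }

a = paired_metrics("analyst_a")
b = paired_metrics("analyst_b")
print("analyst_a", a)
print("analyst_b", b)
```

On closure speed alone the first analyst looks far better; the 50% reopen rate shows that half of those fast closures did not hold.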

Try it yourself

Run the SNR query above against your Sentinel workspace. What is your current signal-to-noise ratio? If it is below 30%, identify the top 3 detection rules by false positive count — these are your highest-priority tuning targets.
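Once closed incidents are exported with their originating rule and classification, ranking rules by false positive count takes a few lines. A minimal sketch: the rule names and counts are hypothetical, and the export format (rule name, classification) is an assumption.

```python
from collections import Counter

# Hypothetical export of closed incidents: (detection rule, classification).
closed = (
    [("Impossible travel", "FalsePositive")] * 4
    + [("Impossible travel", "TruePositive")] * 1
    + [("Mass file download", "FalsePositive")] * 3
    + [("New inbox rule", "FalsePositive")] * 2
    + [("New inbox rule", "TruePositive")] * 3
    + [("Admin role assigned", "FalsePositive")] * 1
)

# Count false positives per rule and take the three noisiest rules.
fp_by_rule = Counter(rule for rule, cls in closed if cls == "FalsePositive")
top3 = fp_by_rule.most_common(3)

for rule, count in top3:
    print(f"{rule}: {count} false positives")
```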

Check your understanding

1. Your monthly dashboard shows MTTD improved from 35 minutes to 22 minutes, but MTTR worsened from 45 minutes to 65 minutes. What is the most likely explanation?

A. Detection improved (faster alerts from new or tuned rules) but response slowed. Possible causes: (1) Faster detection is catching more incidents, increasing the queue and creating response bottlenecks: the team detects more but cannot investigate fast enough. (2) The incidents being detected faster are more complex (earlier-stage detections require more investigation to determine scope). (3) Staffing or escalation issues slowed response (approval delays, analyst absence). The improvement in MTTD is genuine progress, but it has exposed a capacity or process constraint in the response phase that now needs attention.

B. The metrics are measured incorrectly; MTTD and MTTR should move together

C. The detection rules are generating more false positives, slowing investigation

Answer: A. Faster detection with slower response is a capacity signal. The detection improvement is real and valuable, but it has exposed a bottleneck downstream. The next improvement investment should target response efficiency: playbook optimization, escalation streamlining, or additional analyst capacity.
