5.9 Workspace Health and Monitoring

90 minutes · Module 5

By the end of this subsection, you will have KQL queries for monitoring ingestion health, detecting data gaps, and tracking workspace operational metrics.

A workspace that silently stops ingesting data is worse than no workspace at all — you believe you have coverage when you do not. These monitoring queries catch problems before they become blind spots.

Connector health check

// Check the last event time per data type
union withsource=TableName *
| where TimeGenerated > ago(24h)
| summarize LastEvent = max(TimeGenerated), EventCount = count() by TableName
| extend HoursSinceLastEvent = round(datetime_diff('minute', now(), LastEvent) / 60.0, 1)
| where HoursSinceLastEvent > 1
| sort by HoursSinceLastEvent desc
Expected Output
| TableName | LastEvent | EventCount | HoursSinceLastEvent |
|---|---|---|---|
| IdentityLogonEvents | 2026-03-21 08:14 | 12 | 6.3 |
| AuditLogs | 2026-03-21 13:45 | 234 | 1.1 |
What to look for: Any table with HoursSinceLastEvent > 4 during business hours indicates a potential connector failure. IdentityLogonEvents showing 6.3 hours of silence means Defender for Identity may have lost connectivity to a domain controller. AuditLogs at 1.1 hours is borderline — check again in 30 minutes.
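
A query like the one above only sees tables that produced at least one event inside the 24-hour window, so a connector that has been silent for longer simply drops out of the results. One way to catch that case is to compare against a list of tables you expect to see. The following is a minimal sketch, assuming a hypothetical `ExpectedTables` list that you maintain yourself; the table names are taken from this module and should be adjusted to match your own connectors.

```kql
// Hypothetical list of tables this workspace is expected to receive (assumption: edit to fit your connectors)
let ExpectedTables = datatable(TableName: string)
[
    "SigninLogs",
    "AuditLogs",
    "IdentityLogonEvents",
    "CommonSecurityLog"
];
// Tables that actually produced events in the last 24 hours
let Seen =
    union withsource=TableName *
    | where TimeGenerated > ago(24h)
    | summarize LastEvent = max(TimeGenerated) by TableName;
// Left outer join keeps every expected table, even when nothing was seen
ExpectedTables
| join kind=leftouter Seen on TableName
| extend Status = iff(isnull(LastEvent), "No events in 24h", "Receiving data")
| project TableName, LastEvent, Status
| sort by Status asc
```

Rows with an empty LastEvent are the fully silent tables; they sort to the top because "No events in 24h" orders before "Receiving data".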

Ingestion anomaly detection

// Compare today's ingestion to the 7-day average per table
let baseline =
    Usage
    | where TimeGenerated between (ago(8d) .. ago(1d))
    | where IsBillable == true
    // Usage holds multiple records per day, so total the 7-day window and divide by 7
    | summarize AvgDailyMB = round(sum(Quantity) / 7.0, 0) by DataType;
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize TodayMB = round(sum(Quantity), 0) by DataType
| join kind=inner baseline on DataType
| extend PercentChange = round((TodayMB - AvgDailyMB) * 100.0 / AvgDailyMB, 0)
| where abs(PercentChange) > 50
| project DataType, TodayMB, AvgDailyMB, PercentChange
| sort by abs(PercentChange) desc
Expected Output
| DataType | TodayMB | AvgDailyMB | PercentChange |
|---|---|---|---|
| CommonSecurityLog | 8,450 | 2,100 | +302% |
| SigninLogs | 180 | 1,200 | -85% |
What to look for: Two types of anomaly matter. Spikes: CommonSecurityLog at +302% means firewall log volume roughly quadrupled (8,450 MB against a 2,100 MB daily average), which could be a DDoS, a config change, or a new verbose rule. Investigate the cause and consider a DCR filter. Drops: SigninLogs at -85% is more dangerous because it means the Entra ID connector may have failed. You are missing sign-in data, so your token replay detections are blind. Fix immediately.
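
An inner join only returns data types that appear on both sides, so a table that stopped ingesting completely in the last day would not show up in a comparison like the one above. The sketch below looks for that worst case by starting from the baseline and keeping rows with no match today. It assumes the Usage table holds several records per day, so the daily average is computed as the 7-day total divided by 7.

```kql
// 7-day baseline per data type (assumption: daily volume = total over the window / 7)
let baseline =
    Usage
    | where TimeGenerated between (ago(8d) .. ago(1d))
    | where IsBillable == true
    | summarize AvgDailyMB = round(sum(Quantity) / 7.0, 0) by DataType;
// Volume ingested in the last 24 hours
let today =
    Usage
    | where TimeGenerated > ago(1d)
    | where IsBillable == true
    | summarize TodayMB = round(sum(Quantity), 0) by DataType;
// Keep baseline rows that have no matching row today
baseline
| where AvgDailyMB > 0
| join kind=leftouter today on DataType
| where isnull(TodayMB)
| project DataType, AvgDailyMB, TodayMB = 0.0
| sort by AvgDailyMB desc
```

Any row returned here is a data type that was ingesting during the previous week and produced nothing in the last day, which is worth treating with the same urgency as a large drop.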

Analytics rule health

// Check which analytics rules had failed runs in the last 24 hours
SentinelHealth
| where TimeGenerated > ago(24h)
| where SentinelResourceType == "Analytics Rule"
| summarize
    Succeeded = countif(Status == "Success"),
    Failed = countif(Status == "Failure"),
    LastRun = max(TimeGenerated)
    by SentinelResourceName
| where Failed > 0
| sort by Failed desc
Expected Output
| SentinelResourceName | Succeeded | Failed | LastRun |
|---|---|---|---|
| Token replay from novel IP | 92 | 4 | 2026-03-21 14:15 |
What to look for: Any rule with Failed > 0 needs investigation. Common failure causes: the KQL query references a table that was moved to Basic tier (join not supported), the query exceeded the execution time limit (optimize with time filters), or the table was renamed in a schema update. Four failures out of 96 runs may be transient (a service hiccup); 96 out of 96 means the rule is broken.
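
To narrow down which of these causes applies, it can help to group the failed runs by their error text. This is a rough sketch that assumes the SentinelHealth table in your workspace exposes a `Description` column and uses "Analytics Rule" as the resource type string; running `SentinelHealth | distinct SentinelResourceType` first is a quick way to confirm the exact value.

```kql
// Group failed analytics rule runs by rule and error text
// (assumptions: the Description column and the "Analytics Rule" value exist in your workspace)
SentinelHealth
| where TimeGenerated > ago(24h)
| where SentinelResourceType == "Analytics Rule"
| where Status == "Failure"
| summarize Failures = count(), LastFailure = max(TimeGenerated)
    by SentinelResourceName, Description
| sort by Failures desc
```

A single description repeated on every run points to a broken query or a missing table, while a handful of scattered failures is more consistent with a transient service issue.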

Build a health monitoring workbook

Combine these three queries into a Sentinel workbook that runs automatically. Module 26 covers workbook construction in detail. For now, run these queries manually once per week during your workspace health check.

Check your understanding

1. SigninLogs ingestion dropped 85% compared to the 7-day average. What is the operational impact?

All analytics rules that depend on SigninLogs are partially blind — they may miss sign-in anomalies, brute force attacks, and token replay because the data is not arriving. The ingestion gap also means any investigation query against SigninLogs for the affected period will return incomplete results. Fix the Entra ID diagnostic settings immediately.
No impact — users are still signing in
Only workbook dashboards are affected