7.11 Workspace Health and Operational Monitoring

16-20 hours · Module 7


SC-200 Exam Objective

Domain 1 — Manage a SOC Environment: Workspace health monitoring ensures the SIEM platform is operational. If data stops flowing or rules stop executing, the SOC is blind — and may not know it until an incident goes undetected.

Introduction

A Sentinel workspace that appears healthy in the portal but has a failed data connector is worse than no workspace at all. The SOC team believes they have visibility when they do not. An attacker targeting the data source whose connector failed operates undetected.

Workspace health monitoring is the operational discipline that prevents this scenario. It involves monitoring data ingestion volume (is data flowing as expected?), connector health (are all connectors operational?), analytics rule execution (are rules running and producing results?), automation rule execution (are automation actions completing?), and query performance (are queries running within acceptable time limits?).

This subsection teaches you to build the operational monitoring that keeps the workspace healthy and the SOC informed about platform status.


The SentinelHealth table

The SentinelHealth table (enabled in subsection 7.3) logs health events for data connectors, analytics rules, and automation rules. Each record includes the resource type, resource name, status, and any error details.

// Data connector health - last 24 hours
SentinelHealth
| where TimeGenerated > ago(24h)
| where SentinelResourceType == "Data connector"
| summarize LastCheck = arg_max(TimeGenerated, Status, Description)
    by SentinelResourceName
| project SentinelResourceName, Status, LastCheck, Description
| order by Status asc
Expected Output — Connector Health Status

Connector                 Status      Last Check     Description
Microsoft Entra ID        ✓ Success   5 min ago      Data flowing normally
Microsoft Defender XDR    ✓ Success   8 min ago      Incidents and alerts synced
Syslog via AMA            ✗ Failure   6 hours ago    Agent communication lost
Azure Activity            ✓ Success   12 min ago     Data flowing normally
Action required: The Syslog connector has not reported success in 6 hours. This means firewall log data is not entering the workspace — analytics rules that depend on Syslog data are not detecting threats. Investigate the Azure Monitor Agent (AMA) on the log forwarder, check network connectivity, and verify the data collection rule configuration.
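When AMA connectivity is the suspect, the agent's Heartbeat table is the fastest first check. A minimal sketch, assuming the log forwarder's hostname is "syslog-fw01" (a placeholder — substitute your forwarder's actual computer name):

```kusto
// Last heartbeat from the log forwarder running the Azure Monitor Agent.
// "syslog-fw01" is a placeholder hostname - replace with your forwarder's name.
Heartbeat
| where Computer == "syslog-fw01"
| summarize LastHeartbeat = max(TimeGenerated)
| extend AgentSilent = LastHeartbeat < ago(15m)
```

If the query returns no rows at all, the agent has never reported to this workspace — check the agent installation and the data collection rule association before troubleshooting the network path.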
// Analytics rule execution health - detect failures
SentinelHealth
| where TimeGenerated > ago(24h)
| where SentinelResourceType == "Analytic rule"
| where Status != "Success"
| project TimeGenerated, SentinelResourceName, Status,
    Description, ExtendedProperties
| order by TimeGenerated desc

Analytics rule failures are silent — the rule does not fire, no alert is generated, no incident is created. The only way to know a rule is failing is to monitor the SentinelHealth table. Common failure reasons: KQL syntax error after a schema change (a column was renamed or removed), query timeout (the rule scans too much data), and workspace throttling (too many concurrent queries).


Monitoring data ingestion volume

Data ingestion volume should be consistent day-to-day (with expected variation for business hours vs off-hours and weekdays vs weekends). A sudden drop indicates a connector failure or configuration change. A sudden spike indicates a new data source, a misconfigured connector, or a security incident generating abnormal log volume.

// Daily ingestion trend - detect volume anomalies
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)
| extend AvgGB = toscalar(
    Usage | where TimeGenerated > ago(30d) | where IsBillable == true
    | summarize sum(Quantity) / 1024 / 30)
| extend Deviation = (DailyGB - AvgGB) / AvgGB * 100
| where abs(Deviation) > 25  // Flag days with >25% deviation
| project Day = format_datetime(TimeGenerated, 'yyyy-MM-dd'),
    DailyGB = round(DailyGB, 1),
    AvgGB = round(AvgGB, 1),
    Deviation = strcat(round(Deviation, 0), "%")
| order by Day desc

Set up an analytics rule for ingestion drops. A scheduled rule that fires when any data type’s daily volume drops below 50% of its 7-day average detects connector failures before they become prolonged blind spots. This is one of the most important operational detection rules — it detects the failure of your detection capability.

// Ingestion drop detection - for use in an analytics rule
let baseline = Usage
    | where TimeGenerated between(ago(8d) .. ago(1d))
    | where IsBillable == true
    | summarize AvgDailyGB = sum(Quantity) / 1024 / 7 by DataType;
let today = Usage
    | where TimeGenerated > ago(1d)
    | where IsBillable == true
    | summarize TodayGB = sum(Quantity) / 1024 by DataType;
baseline
| join kind=inner today on DataType
| where TodayGB < AvgDailyGB * 0.5
| project DataType, TodayGB = round(TodayGB, 2),
    AvgDailyGB = round(AvgDailyGB, 2),
    DropPct = round((1 - TodayGB / AvgDailyGB) * 100, 0)

Monitoring analytics rule execution

Beyond SentinelHealth, you can query analytics rule execution results directly to confirm that rules are running, completing, and producing the expected alerts.

// Rules that have not fired in 30 days - potential stale rules
let RuleAlerts = SecurityAlert
    | where TimeGenerated > ago(30d)
    | summarize LastFired = max(TimeGenerated) by AlertName;
// Compare with recently successful rules to find silent ones
SentinelHealth
| where TimeGenerated > ago(30d)
| where SentinelResourceType == "Analytic rule"
| where Status == "Success"
| distinct SentinelResourceName
| join kind=leftanti RuleAlerts on $left.SentinelResourceName == $right.AlertName
| project RuleName = SentinelResourceName, Status = "No alerts in 30 days"

Rules that have not fired in 30 days may be: correctly configured but detecting a rare threat (expected — the rule is waiting for the threat), misconfigured with a KQL error that returns no results (the query runs but matches nothing), or targeting a data source that is no longer active (the connector was removed or the data source was decommissioned). Review each silent rule to determine which category it falls into.
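One quick way to separate the "inactive data source" category from the others is to check whether any tables in the workspace have stopped receiving data. A sketch (the 7-day threshold is illustrative):

```kusto
// Tables with no new records in the last 7 days. A silent rule that
// targets one of these likely falls into the inactive-source category.
union withsource=TableName *
| where TimeGenerated > ago(30d)
| summarize LastRecord = max(TimeGenerated) by TableName
| where LastRecord < ago(7d)
| order by LastRecord asc
```

Cross-reference the result with each silent rule's query to see whether its source table appears in the list.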


Building an operational health dashboard

Combine the health monitoring queries into a Sentinel workbook that provides at-a-glance workspace health. The workbook should include:

Connector status panel. A table showing each connector’s last health status (green/red). Red connectors require immediate investigation. This should be the first thing the SOC team checks at the start of each shift.

Ingestion volume trend. A time chart showing daily ingestion volume for the past 30 days, broken down by data type. Anomalies (sudden drops or spikes) are visually obvious on a time chart.

Analytics rule execution. A table showing rule execution results for the last 24 hours: rules that ran successfully, rules that failed, and rules that timed out.

Automation rule execution. A table showing automation actions for the last 24 hours: incident assignments, severity changes, playbook triggers, and any failed automation actions.

Cost tracker. Current month ingestion volume and estimated cost vs budget. Tracks whether ingestion is trending within expected parameters or heading toward an overage.
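The cost tracker panel can be driven by a query like the following sketch. The per-GB price here is an assumed placeholder, not an actual rate; substitute your region's Pay-As-You-Go or commitment-tier pricing.

```kusto
// Month-to-date billable ingestion and estimated cost.
// PricePerGB is an assumed placeholder - use your actual pricing tier.
let PricePerGB = 4.30;
Usage
| where TimeGenerated >= startofmonth(now())
| where IsBillable == true
| summarize MonthToDateGB = round(sum(Quantity) / 1024, 1)
| extend EstimatedCost = round(MonthToDateGB * PricePerGB, 2)
```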


Alerting on workspace health issues

The most critical health events should generate alerts, not just dashboard entries. Create analytics rules that fire when:

A data connector has not reported success in 4 hours — indicates a connector failure that may go unnoticed without proactive alerting.

Daily ingestion for any billable data type drops below 50% of the 7-day average — indicates a data source failure or misconfiguration.

An analytics rule fails to execute — indicates a rule that is not detecting threats due to a KQL error, timeout, or workspace issue.
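The first of these conditions can be sketched as a scheduled rule query (run hourly; the 4-hour threshold matches the guidance above but should be tuned to each connector's expected reporting cadence):

```kusto
// Connectors with no Success health event in the last 4 hours.
SentinelHealth
| where TimeGenerated > ago(1d)
| where SentinelResourceType == "Data connector"
| summarize LastSuccess = maxif(TimeGenerated, Status == "Success")
    by SentinelResourceName
| where isnull(LastSuccess) or LastSuccess < ago(4h)
```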

Assign these health alerts to the SOC lead or Sentinel administrator, not the general analyst queue. Health alerts require administrative action (fix connector, fix rule), not security investigation.


The shift-start health check: a 5-minute operational ritual

The most effective workspace health practice is a structured 5-minute check at the start of every SOC shift. This catches health issues that developed overnight or during the previous shift before they affect the incoming shift’s investigation capability.

Step 1: Ingestion verification (1 minute). Run the ingestion trend query. Verify that all expected data types received data in the last 4 hours. If any data type shows zero recent ingestion, escalate immediately — a blind spot exists.

Step 2: Connector status (1 minute). Check the Data connectors page or run the connector health query. Look for “Disconnected” or “Error” status on any connector. Priority: Microsoft first-party connectors (Defender XDR, Entra ID) are most critical because they feed the core investigation tables.

Step 3: Rule execution health (1 minute). Run the analytics rule health query. Any rules that failed execution in the last 12 hours need investigation — the failure may have caused missed detections. Check whether any rules were inadvertently disabled.

Step 4: Incident queue review (1 minute). Count unassigned incidents, especially high-severity ones. Any high-severity incident unassigned for more than 1 hour represents an SLA risk. Check for incidents assigned to analysts who are off-shift.

Step 5: Automation status (1 minute). Verify that automation rules and playbooks executed successfully overnight. A failed playbook means an automated response did not occur — the incident that triggered it needs manual review.
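The incident-queue portion of step 4 can be scripted rather than counted by hand. A sketch, assuming the standard SecurityIncident schema in which Owner is a dynamic column:

```kusto
// High-severity incidents still unassigned after 1 hour - SLA risk.
// Take the latest record per incident, since SecurityIncident logs every update.
SecurityIncident
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "New" and Severity == "High"
| where isempty(tostring(Owner.assignedTo))
| where CreatedTime < ago(1h)
| project IncidentNumber, Title, Severity, CreatedTime
```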

This 5-minute ritual catches most operational issues before they affect investigations. Document it as a standard operating procedure and track completion. If the shift-start check is skipped, the team operates without confirmation that the platform is healthy — and any investigation conducted during that shift may be based on incomplete data.


Data latency monitoring

Data ingestion is not instant — there is a delay (latency) between when an event occurs and when it appears in the workspace and becomes queryable. Latency matters for two reasons: an analytics rule that evaluates only the last 5 minutes can miss events that arrive with 10 minutes of latency, and an investigation query over “the last hour” may not yet include events from the most recent 5-15 minutes.

Measuring ingestion latency:

// Measure ingestion latency per data type
union withsource=TableName *
| where TimeGenerated > ago(1h)
| extend IngestionDelay = ingestion_time() - TimeGenerated
| summarize
    AvgDelay = avg(IngestionDelay),
    P95Delay = percentile(IngestionDelay, 95),
    MaxDelay = max(IngestionDelay)
    by TableName
| where AvgDelay > 5m  // Only show tables with >5 min avg delay
| order by AvgDelay desc

Normal latency ranges: Microsoft first-party data (SigninLogs, SecurityAlert, DeviceProcessEvents): 2-5 minutes typical. Syslog/CEF from on-premises devices: 5-15 minutes depending on agent configuration. Custom log data via DCR: 5-20 minutes depending on collection interval.

Latency spikes indicate either a source-side issue (the device is not sending data promptly), a connector issue (the connector is processing a backlog), or a workspace issue (Log Analytics is under heavy load). Persistent latency above 30 minutes for any security data type warrants investigation — your analytics rules are evaluating stale data, and your “real-time” detection has a 30-minute blind spot.

The most dangerous workspace failure is silent

A failed data connector does not generate an incident. A broken analytics rule does not create an alert about itself. A misconfigured log tier does not warn you when data stops being queryable. Without proactive health monitoring, these failures persist indefinitely — creating blind spots that only become apparent when a threat goes undetected and the post-incident review asks "why didn't Sentinel alert on this?" The answer: because the data was not there, or the rule was not running. Build the health monitoring now, before you need it.

Try it yourself

Run the connector health query and the ingestion trend query against your workspace. If SentinelHealth is empty, verify that health monitoring is enabled (Sentinel → Settings → Health monitoring). Check the Usage table for ingestion patterns over the past 7 days. Identify any data types with unusual volume changes. This is the daily operational check that a Sentinel administrator should perform — build the habit now in the lab environment.

What you should observe

In a lab, SentinelHealth should show success entries for enabled connectors. Usage should show consistent low-volume ingestion (1-3 GB/day). In production, you will see patterns: weekday ingestion higher than weekend, specific data types with predictable daily volumes, and occasional spikes from security events. The exercise builds familiarity with the baseline so you can spot anomalies.


Knowledge check

Check your understanding

1. Your Syslog connector has been failing silently for 3 days. What is the security impact?

No impact — other connectors compensate
Minor impact — Syslog data is low priority
The data is queued and will flow when the connector recovers

Answer: None of the above. Three days of firewall and network device data is missing from the workspace. Any analytics rules that query Syslog data (CommonSecurityLog, Syslog tables) have not evaluated data for 3 days. Threats that generate signals only in network logs — port scans, C2 communication, data exfiltration via unusual protocols — have gone undetected. The SOC has been operating with reduced visibility without knowing it. This is why proactive health monitoring with alerting is essential.

2. An analytics rule has been running successfully for 30 days but has never generated an alert. How do you determine if this is expected or a problem?

Delete the rule — it is not working
No action needed — 30 days is normal
Change the rule to fire on every query result

Answer: None of the above. Run the rule's KQL query manually in the Logs blade, expanding the time range to 30 days. If the query returns zero results, either: (a) the threat the rule detects has not occurred (expected for rare threats — the rule is working correctly but no threat matches), or (b) the query has a logic error that prevents it from ever matching (check the KQL for errors, verify the tables contain the expected data, verify column names match the current schema). If the query returns results, the rule's schedule or threshold may be misconfigured — the detection logic matches but the rule execution settings prevent alert generation.