1.9 Monitoring Automation Health

5 hours · Module 1 · Free
What you already know
Section 1.8 covered testing playbooks before production deployment: the four-step validation hierarchy from desk check through production canary. Testing validates that a playbook works correctly at deployment time. Monitoring validates that it continues working correctly after deployment. This section teaches the monitoring model that detects silent playbook failures, tracks execution health over time, and produces the operational metrics that justify continued automation investment.

Scenario

Northgate Engineering's VirusTotal enrichment playbook stops returning results. The credential used by its API connection expires, and the VirusTotal API call returns 401 on every execution. The error handling scope catches the failure and posts a generic "enrichment unavailable" comment on each incident. Analysts see the comment and assume the enrichment ran but found nothing suspicious. For three weeks, no one notices that every High-severity incident is missing threat intelligence enrichment. The playbook never crashed. It never stopped running. It failed gracefully, and the graceful failure made the problem invisible.

The two-table monitoring model

Sentinel provides two complementary tables for automation monitoring, and you need both. The SentinelHealth table tracks whether automation rules fired and whether playbooks were triggered. The AzureDiagnostics table tracks what happened inside each playbook after it was triggered. Neither table alone gives you the full picture.

SentinelHealth records two event types: "Automation rule run" (an automation rule's conditions were met and it executed its actions) and "Playbook was triggered" (a playbook was launched, either by an automation rule or manually). The critical limitation is that SentinelHealth only records whether the playbook was launched successfully. It does not record what happened inside the playbook, whether the playbook completed, or what the final result was. A SentinelHealth record showing "Success" for a playbook trigger means the Logic App accepted the trigger. It does not mean the enrichment query ran, the Graph API call succeeded, or the incident comment was posted.

AzureDiagnostics records the workflow-level and action-level execution data from the Logic App itself. It captures the status of every run and every action (Succeeded, Failed, Skipped), along with timing, error codes, and error messages. The full input and output payloads stay in the Logic App run history rather than in this table. This is the table that tells you whether the playbook actually did its job. But AzureDiagnostics requires you to enable diagnostic settings on each Logic App individually. The data does not appear automatically.

The SentinelHealth table is not billable. Ingesting health data incurs no charges. AzureDiagnostics follows standard Log Analytics ingestion pricing, but the volume for Logic App diagnostics is typically small: a few kilobytes per run, even for complex playbooks. The cost of monitoring is negligible compared to the cost of undetected failures.

Enabling diagnostic settings

Logic Apps do not send execution data to Log Analytics by default. For each playbook you want to monitor, navigate to the Logic App resource in the Azure portal, then Monitoring → Diagnostic settings → Add diagnostic setting. Name it descriptively: "SendToSentinelWorkspace." Under Categories, select "WorkflowRuntime" (this captures run-level and action-level events). Under Destination, select "Send to Log Analytics workspace" and choose your Sentinel workspace.

Sentinel → Logic App → Diagnostic settings

Open the Logic App resource → Monitoring → Diagnostic settings → Add diagnostic setting. Name: "SendToSentinelWorkspace." Categories: WorkflowRuntime. Destination: Send to Log Analytics workspace → select the Sentinel workspace. Save. Data appears in AzureDiagnostics within 5-10 minutes.

The most common monitoring blind spot: you deploy five playbooks and enable diagnostics on four of them. The fifth playbook fails silently for months because its execution data never reaches the AzureDiagnostics table. Add diagnostic settings to your deployment checklist for every new playbook. If diagnostics are not enabled, the playbook is invisible to every monitoring query in this section.
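One way to find a playbook with missing diagnostics is to compare the two tables directly. The query below is a minimal sketch, assuming that SentinelHealth records playbook triggers with SentinelResourceType == "Playbook" and that SentinelResourceName matches the Logic App name stored in resource_workflowName_s. Verify both assumptions in your own workspace before relying on the result. The seven-day window is an arbitrary choice.

```kql
// Playbooks that Sentinel triggered recently but that have no
// run records in AzureDiagnostics (likely missing diagnostic settings)
let lookback = 7d;
let triggered =
    _SentinelHealth()
    | where TimeGenerated > ago(lookback)
    | where SentinelResourceType == "Playbook"
    | summarize Triggers = count(), LastTriggered = max(TimeGenerated)
        by PlaybookName = tolower(SentinelResourceName);
let logged =
    AzureDiagnostics
    | where TimeGenerated > ago(lookback)
    | where ResourceProvider == "MICROSOFT.LOGIC" and Category == "WorkflowRuntime"
    | summarize LoggedRuns = dcount(resource_runId_s)
        by PlaybookName = tolower(resource_workflowName_s);
// leftanti keeps only the triggered playbooks with no match on the logged side
triggered
| join kind=leftanti logged on PlaybookName
| order by Triggers desc
```

Any row returned is a playbook that ran without leaving execution data behind. If no Logic App in the workspace has ever sent diagnostics, the resource_workflowName_s column may not exist yet and the query will return an error instead of rows.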

Querying SentinelHealth with the built-in function

Microsoft recommends querying the SentinelHealth table through the _SentinelHealth() function rather than querying the table directly. The function provides backward compatibility: if Microsoft changes the table schema in a future update, queries built on the function continue working while queries built directly on the table may break. Use _SentinelHealth() as the starting point for all automation health queries.

KQL
// Automation rule failures in the last 24 hours
_SentinelHealth()
| where SentinelResourceType == "Automation rule"
| where Status != "Success"
| where TimeGenerated > ago(24h)
| project
    TimeGenerated,
    RuleName = SentinelResourceName,
    Status,
    Description
| order by TimeGenerated desc

The Status field returns three values for automation rules: "Success" (all actions completed), "Partial success" (at least one action completed, but others failed), and "Failure" (no actions ran). Partial success is common when an automation rule triggers two playbooks and one of them fails. The rule itself did not fail; one of its playbook calls did. The Description field tells you which playbook failed and why.
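To see which rules produce partial successes or failures most often, the same function can be summarized per rule. This is a sketch built only on the columns and Status values described above; the seven-day window is an arbitrary choice.

```kql
// Per-rule status breakdown over the last 7 days
_SentinelHealth()
| where TimeGenerated > ago(7d)
| where SentinelResourceType == "Automation rule"
| summarize
    Runs = count(),
    PartialSuccess = countif(Status == "Partial success"),
    Failure = countif(Status == "Failure")
    by RuleName = SentinelResourceName
// Keep only rules that had at least one non-success outcome
| where PartialSuccess > 0 or Failure > 0
| order by Failure desc, PartialSuccess desc
```

A rule with many partial successes and no failures usually points at one unhealthy playbook among several, which the Description field on the individual records can confirm.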

Correlating triggers with execution outcomes

The most valuable monitoring query joins SentinelHealth with AzureDiagnostics to answer the question that neither table answers alone: which playbooks were triggered and what happened when they ran? The SentinelHealth table's ExtendedProperties field contains a TriggeredPlaybooks array. Each entry in that array includes a RunId. The AzureDiagnostics table's resource_runId_s field contains the same RunId. Join on that field and you connect the automation rule trigger to the Logic App execution outcome.

KQL
// Full automation picture: trigger → playbook outcome
_SentinelHealth()
| where SentinelResourceType == "Automation rule"
| where TimeGenerated > ago(7d)
| mv-expand TriggeredPlaybooks = ExtendedProperties.TriggeredPlaybooks
| extend runId = tostring(TriggeredPlaybooks.RunId)
| join kind=leftouter (
    AzureDiagnostics
    // Bound the right side to the same window so the join scans less data
    | where TimeGenerated > ago(7d)
    // One record per finished run; action-level events are excluded
    | where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
    | project
        resource_runId_s,
        PlaybookName = resource_workflowName_s,
        PlaybookStatus = status_s,
        ErrorCode = error_code_s,
        Duration = datetime_diff('second', endTime_t, startTime_t)
) on $left.runId == $right.resource_runId_s
| project
    TimeGenerated,
    AutomationRule = SentinelResourceName,
    TriggerStatus = Status,
    PlaybookName,
    PlaybookStatus,
    ErrorCode,
    Duration
| order by TimeGenerated desc

Use kind=leftouter on the join. If diagnostic settings are not enabled for a playbook, the AzureDiagnostics side returns null. The leftouter join preserves the SentinelHealth record so you can see that the playbook was triggered even when you cannot see what happened afterward. Null values in PlaybookStatus usually mean diagnostics are missing for that playbook, which is itself a monitoring finding. For very recent triggers, a null can also mean the run has not completed or its logs have not yet arrived.

Building the failure detection rule

The most critical monitoring component is an analytics rule that alerts when any playbook fails. This rule runs hourly and fires when it detects one or more failed playbook executions. The alert goes to the automation owner, not to the SOC queue. Playbook failures are operational issues, not security incidents. Routing them to the SOC queue creates noise that degrades analyst attention to actual threats.

The failure detection rule queries AzureDiagnostics rather than SentinelHealth because you need to know whether the playbook completed its work, not just whether it was triggered. A playbook that triggers successfully but fails internally shows "Success" in SentinelHealth and "Failed" in AzureDiagnostics. The SentinelHealth record is misleading without the AzureDiagnostics correlation.

Configure the analytics rule: frequency every 1 hour, lookback 1 hour, trigger when result count is greater than 0, severity Medium. Assign the resulting incident to the automation owner via an automation rule. Do not trigger a playbook from this alert. If the monitoring alert triggers a playbook and that playbook also fails, you create a feedback loop of failing playbooks generating failure alerts that trigger more failing playbooks. Use a direct automation rule action instead: email notification or incident assignment.
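A minimal sketch of the query behind such a rule follows. It assumes diagnostic settings are enabled on every playbook and uses only the AzureDiagnostics columns that appear elsewhere in this section. It has no time filter because the analytics rule's lookback setting defines the window.

```kql
// Failed playbook runs, one row per playbook
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where status_s == "Failed"
| summarize
    FailedRuns = count(),
    FirstFailure = min(TimeGenerated),
    LastFailure = max(TimeGenerated),
    ErrorCodes = make_set(error_code_s, 10)
    by PlaybookName = resource_workflowName_s
```

Summarizing by playbook yields one result row per failing playbook rather than one per failed run, which keeps a burst of failures from the same playbook readable in a single alert.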

Anti-pattern

Triggering a playbook from a playbook failure alert. The monitoring analytics rule detects that SA-Enrich-AccountContext has failed. The rule triggers SA-Notify-AutomationFailure to send a Teams message. SA-Notify-AutomationFailure also fails because the Teams connector token has expired. The failure generates another monitoring alert, which triggers SA-Notify-AutomationFailure again. The loop continues until the Logic App hits its concurrent run limit. Use a direct automation rule email action for failure notifications. The email action does not depend on a Logic App, so it cannot participate in the failure loop.

The three-layer monitoring model

The monitoring model operates at three time horizons, and each layer catches a different class of problem.

Layer 1 — Real-time failure alerting. The hourly analytics rule described above. Catches outright failures: expired tokens, permission revocations, API endpoint changes, throttling. The alert fires within one hour of any failure. The automation owner investigates the run history (Section 1.8), identifies the root cause, and restores the playbook. Target: zero unresolved playbook failures lasting more than four hours.

Layer 2 — Weekly health review. A set of saved KQL queries or a Sentinel workbook that tracks success rates, execution times, and run volumes per playbook over the past seven days. The weekly review catches degradation that does not trigger the failure alert: a playbook whose success rate drops from 99% to 92% (some runs fail intermittently but not consistently enough to generate a sustained alert), execution times that trend upward (indicating API degradation or growing query volumes), or sudden drops in run volume (indicating the automation rule trigger may have stopped matching incidents). Review these metrics in the weekly SOC operations meeting.

Layer 3 — Monthly value reporting. The operational metrics that justify automation investment. Incidents enriched per month (compare to total incident volume for enrichment coverage). Containment actions executed (break down by type). Estimated analyst hours saved (multiply execution count by estimated time saving per execution: enrichment saves roughly 5 minutes, notification routing saves roughly 3 minutes, containment saves roughly 15 minutes per incident). These numbers go into the monthly SOC report. Automation that cannot demonstrate value loses its maintenance priority.
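The hours-saved arithmetic can be sketched as a query. This example assumes playbooks follow a naming convention in which the prefix identifies the category (SA-Enrich-, SA-Notify-, SA-Contain-), inferred from the playbook names used in this section. The per-execution minutes are the rough estimates given above. Adjust both to match your environment.

```kql
// Estimated analyst hours saved over the last 30 days, per playbook
AzureDiagnostics
| where TimeGenerated > ago(30d)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
// Count only runs that finished successfully
| where status_s == "Succeeded"
// Assumed naming convention: the prefix maps to a time-saving estimate in minutes
| extend MinutesSaved = case(
    resource_workflowName_s startswith "SA-Enrich-", 5.0,
    resource_workflowName_s startswith "SA-Notify-", 3.0,
    resource_workflowName_s startswith "SA-Contain-", 15.0,
    0.0)
| summarize
    Executions = count(),
    HoursSaved = round(sum(MinutesSaved) / 60.0, 1)
    by PlaybookName = resource_workflowName_s
| order by HoursSaved desc
```

Playbooks that match none of the prefixes contribute zero hours, which makes unclassified automation visible in the report rather than silently inflating the total.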

Posture Assessment

AUTOMATION MONITORING READINESS — Northgate Engineering

Assessed by: Marcus Webb, Security Architect

Date: 2026-05-19

✓ PASS SentinelHealth feature enabled on workspace

✓ PASS Diagnostic settings configured on 4/5 production playbooks

✗ FAIL Diagnostic settings missing on SA-Contain-AccountDisable (deployed last week, diagnostics not added to deployment checklist)

✓ PASS Failure detection analytics rule deployed, frequency 1h

✗ FAIL Failure alert triggers SA-Notify-Teams playbook instead of direct email action (feedback loop risk)

⚠ PARTIAL Weekly health queries saved but not added to workbook (manual execution only)

✓ PASS Monthly reporting template includes automation metrics section

Recommendation: Enable diagnostics on SA-Contain-AccountDisable immediately. Replace Teams playbook notification with direct automation rule email action. Consolidate weekly queries into a Sentinel workbook for team access.

Interpreting partial failures and throttling

Not every playbook failure requires immediate intervention. The monitoring model needs decision criteria for when to act and when to wait.

A single failure followed by successful runs indicates a transient issue: a momentary API timeout, a brief network interruption, or a throttling response that had cleared by the next execution. The hourly analytics rule fires, but the investigation shows the problem self-corrected. Document the transient failure and move on. If the same transient failure recurs more than three times in a week, treat it as a persistent issue even though each individual occurrence self-corrects.

A sustained failure rate above 5% over 24 hours indicates a persistent issue. Check the error codes in AzureDiagnostics. HTTP 401 means an expired or revoked credential: re-authorize the API connection or, for a managed identity, verify that its role assignments and permissions are still in place. HTTP 429 means throttling: reduce trigger frequency, implement caching, or request an API tier increase. Also check whether the retry policy from Section 1.7 is configured correctly: a playbook without retry logic fails on the first throttle response, while a playbook with exponential backoff retries transparently and often succeeds on the second attempt. HTTP 5xx means the external service is degraded: check the service's status page and wait for recovery.
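The 5% threshold can be checked with a query along these lines. It is a sketch, assuming run-level records carry an error code in error_code_s. That value is often a generic code such as ActionFailed, so identifying the underlying HTTP status may still require opening the failed action in the run history.

```kql
// Playbooks whose failure rate exceeded 5% in the last 24 hours
AzureDiagnostics
| where TimeGenerated > ago(24h)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    Runs = count(),
    Failed = countif(status_s == "Failed"),
    // Collect error codes from the failed runs only
    ErrorCodes = make_set_if(error_code_s, status_s == "Failed", 10)
    by PlaybookName = resource_workflowName_s
| extend FailureRatePct = round(100.0 * Failed / Runs, 1)
| where FailureRatePct > 5
| order by FailureRatePct desc
```

For playbooks that run only a few times a day, a single failure can exceed 5%, so read the percentage together with the Runs column before treating the result as a persistent issue.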

A sudden drop in execution volume with no corresponding drop in incident volume means the automation rule trigger stopped matching incidents. The playbook is healthy but idle. Check whether the analytics rule that generates the incidents changed its severity, title pattern, or entity mapping. A rule update that changes the incident title can break an automation rule that filters on title pattern. This failure mode is invisible to the AzureDiagnostics failure detection rule because the playbook never runs and therefore never fails. You catch it only through the weekly volume trend review.
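A week-over-week comparison is one way to surface this pattern. The sketch below flags playbooks whose run volume fell by more than half. The 50% threshold is an illustrative assumption, not a recommended value, and the result still needs to be compared against incident volume for the same period.

```kql
// Run volume this week vs. the prior week, per playbook
AzureDiagnostics
| where TimeGenerated > ago(14d)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    ThisWeek = countif(TimeGenerated > ago(7d)),
    PriorWeek = countif(TimeGenerated <= ago(7d))
    by PlaybookName = resource_workflowName_s
// Avoid dividing by zero for playbooks that first ran this week
| extend ChangePct = iff(PriorWeek == 0, real(null),
    round(100.0 * (ThisWeek - PriorWeek) / PriorWeek, 1))
| where ChangePct < -50
| order by ChangePct asc
```

A playbook that stopped running entirely this week appears with a change of -100. A playbook with no runs at all in the past 14 days does not appear, so the list of expected playbooks still has to come from your deployment inventory.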

[Figure 1.9a diagram: Three-layer monitoring model, monitoring coverage by time horizon]
Layer 1, Real-Time: hourly analytics rule; AzureDiagnostics: status_s == Failed; catches outright failures; target: resolve within 4 hours; failures detected within 1 hour; source: AzureDiagnostics.
Layer 2, Weekly: workbook or saved queries; success rate, timing, volume trends; catches gradual degradation; review in SOC team meeting; trends visible within 7 days; source: AzureDiagnostics + SentinelHealth.
Layer 3, Monthly: operational metrics report; enrichment coverage, hours saved; catches declining value; report to CISO; ROI validated every 30 days; source: incident metadata + time estimates.

Figure 1.9a: Three-layer monitoring model. Each layer operates at a different time horizon and catches a different class of automation health issue. Real-time alerting uses AzureDiagnostics. Weekly health review correlates both tables. Monthly reporting derives from incident metadata.

Monitoring that only checks run status

Most teams create a single alert for playbook failures and consider monitoring complete. The alert fires when a run fails — but it misses the playbook that succeeds with empty results because the Graph API permission expired, the playbook that runs in 45 seconds instead of 3 because a downstream API is degrading, and the playbook that processes 1 of 12 entities because the For Each loop silently drops errors. Run status tells you the playbook executed. It does not tell you the playbook produced useful output. The three-layer model catches what run status alone cannot.
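A weekly health query that goes beyond run status might look like the following sketch. It reports success rate and execution time per playbook using the same AzureDiagnostics columns as the correlation query earlier in this section. It cannot detect empty results or dropped loop iterations, which require checks on the playbook's own output.

```kql
// Weekly health summary: success rate and execution time per playbook
AzureDiagnostics
| where TimeGenerated > ago(7d)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| extend DurationSec = datetime_diff('second', endTime_t, startTime_t)
| summarize
    Runs = count(),
    Succeeded = countif(status_s == "Succeeded"),
    AvgDurationSec = round(avg(DurationSec), 1),
    P95DurationSec = percentile(DurationSec, 95)
    by PlaybookName = resource_workflowName_s
| extend SuccessRatePct = round(100.0 * Succeeded / Runs, 1)
// Least healthy playbooks first
| order by SuccessRatePct asc, P95DurationSec desc
```

Comparing P95DurationSec from week to week is what exposes a playbook that still succeeds but has become much slower than it used to be.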

Automation Principle

The worst automation failure is the one nobody notices. A playbook that crashes loudly gets fixed in hours. A playbook that fails gracefully (posting generic comments, swallowing errors, returning empty results) can degrade silently for weeks. The monitoring model is designed to catch both types: the analytics rule catches crashes, the weekly health review catches degradation, and the monthly value report catches playbooks that run successfully but no longer produce useful output. If you build a playbook but skip the monitoring, you are building a liability that will eventually erode the SOC's trust in automation.

Next
Section 1.10 covers cost management for Sentinel automation: how Logic App pricing works across Consumption and Standard plans, the API call costs that accumulate when playbooks query external services at scale, and the cost controls that prevent a misconfigured automation rule from generating a surprise Azure bill.