SA1.9 Monitoring Automation Health
Figure SA1.9 — Three monitoring layers: failure detection (immediate), health metrics (daily), operational metrics (monthly). KQL queries power all three.
Prerequisite: enable diagnostic logging
Logic Apps do not send diagnostic data to Log Analytics by default. You must enable it for each Logic App.
Navigate to the Logic App → Monitoring → Diagnostic settings → Add diagnostic setting. Name: “SendToLogAnalytics.” Check “WorkflowRuntime” (captures run-level success/failure) and “allMetrics.” Destination: Send to Log Analytics workspace → select your Sentinel workspace. Save.
After enabling, diagnostic data appears in the AzureDiagnostics table within 5-10 minutes. The key fields are: resource_workflowName_s (playbook name), status_s (Succeeded/Failed/Cancelled), startTime_t and endTime_t (timing), error_code_s and error_message_s (failure details).
Enable this on every Logic App as part of the deployment checklist. If you have 10 playbooks and forget to enable diagnostics on 2 of them, those 2 are invisible to monitoring — exactly the blind spot that causes silent failures.
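To audit for that blind spot, you can list which playbooks are actually reporting diagnostics. A minimal sketch, assuming the standard WorkflowRuntime schema and field names described above:

```kusto
// Sketch: which playbooks are reporting diagnostic data at all.
// Any deployed playbook missing from this list is a monitoring blind spot.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where TimeGenerated > ago(7d)
| summarize LastSeen = max(TimeGenerated), Runs = count()
    by Playbook = resource_workflowName_s
```

Compare the output against your deployment inventory; the comparison itself has to be manual (or scripted), because a playbook with diagnostics disabled produces no rows to flag.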
Layer 1: Failure detection (alert within 1 hour)
Create a Sentinel analytics rule that fires when any playbook fails:
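A minimal sketch of such a failure-detection query, using the AzureDiagnostics fields listed in the prerequisite section (the 1-hour window matches the rule's lookback; adjust if yours differs):

```kusto
// Sketch: all Logic App runs that failed in the last hour.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where status_s == "Failed"
| where TimeGenerated > ago(1h)
| project TimeGenerated,
          Playbook = resource_workflowName_s,
          ErrorCode = error_code_s,
          ErrorMessage = error_message_s
```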
Configure the analytics rule: frequency = every 1 hour, lookback = 1 hour, trigger when results > 0, severity = Medium.
Critical: do NOT trigger a playbook from this alert. If your monitoring detects a playbook failure and triggers another playbook to notify the team, and THAT playbook also fails, you have an infinite loop of failing playbooks detecting failing playbooks. Instead, keep the notification path playbook-free: use an automation rule that raises the incident severity and assigns it to the automation owner, or deliver the email through an Azure Monitor alert with an action group (action groups send email without involving a Logic App).
Layer 2: Health metrics (daily dashboard)
Build a Sentinel workbook or saved queries for daily review. The key metrics:
Success rate per playbook:
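One way to compute this from the same table (a sketch; the 30-day window and operation name assume the standard WorkflowRuntime events):

```kusto
// Sketch: per-playbook success rate over the last 30 days.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where TimeGenerated > ago(30d)
| summarize Runs = count(),
            Succeeded = countif(status_s == "Succeeded")
    by Playbook = resource_workflowName_s
| extend SuccessRatePct = round(100.0 * Succeeded / Runs, 2)
| order by SuccessRatePct asc
```

Sorting ascending puts the worst performers at the top of the daily review.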
Target: 99%+ success rate for all playbooks. Below 95% indicates a persistent issue requiring investigation.
Mean execution time per playbook (trend):
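A sketch of the trend query, deriving duration from the startTime_t and endTime_t fields noted earlier:

```kusto
// Sketch: daily mean run duration per playbook, rendered as a trend.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where isnotempty(startTime_t) and isnotempty(endTime_t)
| extend DurationSec = datetime_diff("second", endTime_t, startTime_t)
| summarize AvgDurationSec = avg(DurationSec)
    by Playbook = resource_workflowName_s, bin(TimeGenerated, 1d)
| render timechart
```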
This chart shows whether playbooks are getting slower over time. A gradual increase in execution time may indicate: growing data volumes (KQL queries take longer), API degradation (external TI service is slower), or permission token issues (authentication retries increasing latency).
Daily execution volume:
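A sketch of the volume query; filtering to run-completed events avoids double-counting the per-action records that share the WorkflowRuntime category:

```kusto
// Sketch: completed runs per playbook per day.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize Executions = count()
    by Playbook = resource_workflowName_s, bin(TimeGenerated, 1d)
| render timechart
```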
A sudden drop in execution volume means either incident volume decreased (check Sentinel) or the automation rule trigger is broken (the rule stopped firing the playbook). A sudden spike means alert volume increased or a misconfigured rule is triggering the playbook on incidents it should not match.
Layer 3: Operational metrics (monthly review)
Monthly, compile the metrics that prove automation value and identify improvement opportunities:
Incidents auto-enriched: count of incidents with the “enriched” tag. Compare to total incident count for enrichment coverage percentage.
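A sketch of the coverage calculation, assuming the tag lands in the SecurityIncident table's Labels column (deduplicating on the latest record per incident, since incidents log one row per update):

```kusto
// Sketch: enrichment coverage over the last 30 days.
// Assumes the "enriched" tag appears in SecurityIncident.Labels.
SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, Labels) by IncidentNumber
| summarize Total = count(),
            Enriched = countif(tostring(Labels) has "enriched")
| extend CoveragePct = round(100.0 * Enriched / Total, 1)
```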
Containment actions executed: count of incidents with the “auto-contained” tag. Break down by containment type (session revoke, endpoint isolate, account disable). Track false positive rate (incidents where auto-containment was followed by a rollback).
Estimated analyst hours saved: for each playbook, multiply execution count by estimated time saving per execution. Enrichment: 5 minutes saved × execution count. FP auto-close: 3 minutes saved × execution count. Containment: 15 minutes saved × execution count.
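The arithmetic can be expressed directly in KQL. A sketch using the per-execution savings above; the execution counts here are hypothetical placeholders, to be replaced with real counts from the volume query:

```kusto
// Sketch: estimated analyst hours saved. Execution counts are hypothetical.
datatable(Playbook: string, Executions: long, MinutesSavedPerRun: long)
[
    "Enrichment",    600, 5,
    "FP auto-close", 200, 3,
    "Containment",    40, 15
]
| extend MinutesSaved = Executions * MinutesSavedPerRun
| summarize TotalMinutes = sum(MinutesSaved)
| extend TotalHoursSaved = round(TotalMinutes / 60.0, 1)
```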
Automation health summary: average success rate, number of failure incidents, mean downtime (time between failure and fix), playbooks requiring tuning or retirement.
These metrics form the monthly automation review document in SA11 and the business case for continued automation investment in SA12.
The myth: Enrichment playbooks are Tier 1 — zero blast radius. If they fail, the worst case is missing enrichment data. Monitoring is only needed for Tier 3 containment playbooks.
The reality: A silently failing enrichment playbook creates a false sense of automation. Analysts believe every incident is enriched. When enrichment stops, they do not notice — they assume the lack of enrichment data means there is nothing interesting to enrich. The analyst opens an AiTM incident with no enrichment comment and thinks “no unusual sign-in activity found” when in reality the enrichment playbook crashed and no sign-in data was checked. Silent enrichment failure can delay triage decisions by making incidents appear less severe than they are. Monitor ALL playbooks, regardless of tier.
Decision point: Your monitoring analytics rule fires — the enrichment playbook has failed 15 times in the last hour (it normally runs 20-25 times per hour). The error message is “Request was throttled.” The failure rate is 60%. Should you disable the playbook? No. A 60% failure rate means 40% of incidents are still being enriched — which is better than 0% if you disable the playbook entirely. Instead: check which action is being throttled (probably an external API), reduce the playbook’s trigger frequency if possible, implement caching for repeated lookups, or upgrade the API tier. Disable only if the failure is causing actual harm (cascading errors, incorrect enrichment data), not if it is causing degraded service. Partial automation is better than no automation.
Try it: Deploy automation health monitoring
- Enable diagnostic settings on your enrichment playbook (SA1.4)
- Wait 10 minutes for data to appear in AzureDiagnostics
- Run the failure detection query in the Sentinel logs blade — confirm it returns results (or no results if the playbook has not failed)
- Create a Sentinel analytics rule from the failure detection query (frequency: 1 hour)
- Configure the rule’s automated response: automation rule that assigns the failure incident to you (the automation owner)
- Test: temporarily break the playbook (remove a required permission). Trigger the playbook. Confirm the monitoring rule fires within 1 hour. Restore the permission.
- Run the success rate and execution time queries — save them as favorites for daily review
Where this goes deeper. SA9 builds the complete KQL-driven automation health monitoring system — health dashboard workbook, trend analysis queries, anomaly detection for automation performance, and the monthly review template. SA12 uses the monitoring data to build the business case: analyst hours saved, MTTA improvement, containment speed.