SA1.9 Monitoring Automation Health
Figure SA1.9 — Three monitoring layers: failure detection (immediate), health metrics (daily), operational metrics (monthly). KQL queries power all three.
Prerequisite: enable diagnostic logging
Logic Apps do not send diagnostic data to Log Analytics by default. You must enable it for each Logic App.
Navigate to the Logic App → Monitoring → Diagnostic settings → Add diagnostic setting. Name: “SendToLogAnalytics.” Check “WorkflowRuntime” (captures run-level success/failure) and “allMetrics.” Destination: Send to Log Analytics workspace → select your Sentinel workspace. Save.
After enabling, diagnostic data appears in the AzureDiagnostics table within 5-10 minutes. The key fields are: resource_workflowName_s (playbook name), status_s (Succeeded/Failed/Cancelled), startTime_t and endTime_t (timing), error_code_s and error_message_s (failure details).
Enable this on every Logic App as part of the deployment checklist. If you have 10 playbooks and forget to enable diagnostics on 2 of them, those 2 are invisible to monitoring — exactly the blind spot that causes silent failures.
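To audit for that blind spot, you can list which playbooks are actually reporting diagnostics. A minimal sketch, assuming the standard WorkflowRuntime schema and field names described above:

```kusto
// Sketch: which playbooks are reporting diagnostic data at all.
// Any deployed playbook missing from this list is a monitoring blind spot.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where TimeGenerated > ago(7d)
| summarize LastSeen = max(TimeGenerated), Runs = count()
    by Playbook = resource_workflowName_s
```

Compare the output against your deployment inventory; the comparison itself has to be manual (or scripted), because a playbook with diagnostics disabled produces no rows to flag.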
Layer 1: Failure detection (alert within 1 hour)
Create a Sentinel analytics rule that fires when any playbook fails:
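A minimal sketch of such a failure-detection query, using the AzureDiagnostics fields listed in the prerequisite section (the 1-hour window matches the rule's lookback; adjust if yours differs):

```kusto
// Sketch: all Logic App runs that failed in the last hour.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where status_s == "Failed"
| where TimeGenerated > ago(1h)
| project TimeGenerated,
          Playbook = resource_workflowName_s,
          ErrorCode = error_code_s,
          ErrorMessage = error_message_s
```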
Configure the analytics rule: frequency = every 1 hour, lookback = 1 hour, trigger when results > 0, severity = Medium.
Critical: do NOT trigger a playbook from this alert. If your monitoring detects a playbook failure and triggers another playbook to notify the team, and THAT playbook also fails, you have an infinite loop of failing playbooks detecting failing playbooks. Instead, keep the notification path playbook-free: use an automation rule that raises the incident severity and assigns it to the automation owner, or deliver the email through an Azure Monitor alert with an action group (action groups send email without involving a Logic App).
Layer 2: Health metrics (daily dashboard)
Build a Sentinel workbook or saved queries for daily review. The key metrics:
Success rate per playbook:
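One way to compute this from the same table (a sketch; the 30-day window and operation name assume the standard WorkflowRuntime events):

```kusto
// Sketch: per-playbook success rate over the last 30 days.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where TimeGenerated > ago(30d)
| summarize Runs = count(),
            Succeeded = countif(status_s == "Succeeded")
    by Playbook = resource_workflowName_s
| extend SuccessRatePct = round(100.0 * Succeeded / Runs, 2)
| order by SuccessRatePct asc
```

Sorting ascending puts the worst performers at the top of the daily review.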
Target: 99%+ success rate for all playbooks. Below 95% indicates a persistent issue requiring investigation.
Mean execution time per playbook (trend):
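A sketch of the trend query, deriving duration from the startTime_t and endTime_t fields noted earlier:

```kusto
// Sketch: daily mean run duration per playbook, rendered as a trend.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where isnotempty(startTime_t) and isnotempty(endTime_t)
| extend DurationSec = datetime_diff("second", endTime_t, startTime_t)
| summarize AvgDurationSec = avg(DurationSec)
    by Playbook = resource_workflowName_s, bin(TimeGenerated, 1d)
| render timechart
```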
This chart shows whether playbooks are getting slower over time. A gradual increase in execution time may indicate: growing data volumes (KQL queries take longer), API degradation (external TI service is slower), or permission token issues (authentication retries increasing latency).
Daily execution volume:
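A sketch of the volume query; filtering to run-completed events avoids double-counting the per-action records that share the WorkflowRuntime category:

```kusto
// Sketch: completed runs per playbook per day.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize Executions = count()
    by Playbook = resource_workflowName_s, bin(TimeGenerated, 1d)
| render timechart
```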
A sudden drop in execution volume means either incident volume decreased (check Sentinel) or the automation rule trigger is broken (the rule stopped firing the playbook). A sudden spike means alert volume increased or a misconfigured rule is triggering the playbook on incidents it should not match.
Layer 3: Operational metrics (monthly review)
Monthly, compile the metrics that prove automation value and identify improvement opportunities:
Incidents auto-enriched: count of incidents with the “enriched” tag. Compare to total incident count for enrichment coverage percentage.
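A sketch of the coverage calculation, assuming the tag lands in the SecurityIncident table's Labels column (deduplicating on the latest record per incident, since incidents log one row per update):

```kusto
// Sketch: enrichment coverage over the last 30 days.
// Assumes the "enriched" tag appears in SecurityIncident.Labels.
SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, Labels) by IncidentNumber
| summarize Total = count(),
            Enriched = countif(tostring(Labels) has "enriched")
| extend CoveragePct = round(100.0 * Enriched / Total, 1)
```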
Containment actions executed: count of incidents with the “auto-contained” tag. Break down by containment type (session revoke, endpoint isolate, account disable). Track false positive rate (incidents where auto-containment was followed by a rollback).
Estimated analyst hours saved: for each playbook, multiply execution count by estimated time saving per execution. Enrichment: 5 minutes saved × execution count. FP auto-close: 3 minutes saved × execution count. Containment: 15 minutes saved × execution count.
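The arithmetic can be expressed directly in KQL. A sketch using the per-execution savings above; the execution counts here are hypothetical placeholders, to be replaced with real counts from the volume query:

```kusto
// Sketch: estimated analyst hours saved. Execution counts are hypothetical.
datatable(Playbook: string, Executions: long, MinutesSavedPerRun: long)
[
    "Enrichment",    600, 5,
    "FP auto-close", 200, 3,
    "Containment",    40, 15
]
| extend MinutesSaved = Executions * MinutesSavedPerRun
| summarize TotalMinutes = sum(MinutesSaved)
| extend TotalHoursSaved = round(TotalMinutes / 60.0, 1)
```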
Automation health summary: average success rate, number of failure incidents, mean downtime (time between failure and fix), playbooks requiring tuning or retirement.
These metrics form the monthly automation review document in SA11 and the business case for continued automation investment in SA12.
The myth: Enrichment playbooks are Tier 1 — zero blast radius. If they fail, the worst case is missing enrichment data. Monitoring is only needed for Tier 3 containment playbooks.
The reality: A silently failing enrichment playbook creates a false sense of automation. Analysts believe every incident is enriched. When enrichment stops, they do not notice — they assume the lack of enrichment data means there is nothing interesting to enrich. The analyst opens an AiTM incident with no enrichment comment and thinks “no unusual sign-in activity found” when in reality the enrichment playbook crashed and no sign-in data was checked. Silent enrichment failure can delay triage decisions by making incidents appear less severe than they are. Monitor ALL playbooks, regardless of tier.
Decision point: Your monitoring analytics rule fires — the enrichment playbook has failed 15 times in the last hour (it normally runs 20-25 times per hour). The error message is “Request was throttled.” The failure rate is 60%. Should you disable the playbook? No. A 60% failure rate means 40% of incidents are still being enriched — which is better than 0% if you disable the playbook entirely. Instead: check which action is being throttled (probably an external API), reduce the playbook’s trigger frequency if possible, implement caching for repeated lookups, or upgrade the API tier. Disable only if the failure is causing actual harm (cascading errors, incorrect enrichment data), not if it is causing degraded service. Partial automation is better than no automation.
Try it: Deploy automation health monitoring
- Enable diagnostic settings on your enrichment playbook (SA1.4)
- Wait 10 minutes for data to appear in AzureDiagnostics
- Run the failure detection query in the Sentinel logs blade — confirm it returns results (or no results if the playbook has not failed)
- Create a Sentinel analytics rule from the failure detection query (frequency: 1 hour)
- Configure the rule’s automated response: automation rule that assigns the failure incident to you (the automation owner)
- Test: temporarily break the playbook (remove a required permission). Trigger the playbook. Confirm the monitoring rule fires within 1 hour. Restore the permission.
- Run the success rate and execution time queries — save them as favorites for daily review
Where this goes deeper. SA9 builds the complete KQL-driven automation health monitoring system — health dashboard workbook, trend analysis queries, anomaly detection for automation performance, and the monthly review template. SA12 uses the monitoring data to build the business case: analyst hours saved, MTTA improvement, containment speed.