SA0.9 The Automation Governance Framework
Figure SA0.9 — The four governance pillars and the failure modes they prevent. Automation without governance degrades silently.
Automation is code — treat it like code
Logic Apps are defined as JSON ARM templates. Automation rules have exportable configurations. KQL queries in analytics rules are text. All of these are code artifacts that should follow the same lifecycle as any production code: version controlled, tested before deployment, monitored in production, and documented for the people who maintain them.
The teams that treat automation as “configuration” instead of “code” hit the governance failure mode within 6 months. The playbook that someone built, deployed, and forgot runs until it breaks, and when it breaks, nobody knows how to fix it: nobody documented how it works, nobody monitored it to detect the failure, and nobody version-controlled it, so the last working version is not recoverable.
Pillar 1: Version control
Every playbook should exist as an ARM template or Bicep file in a Git repository. The repository structure follows the same pattern as infrastructure-as-code:
The repository contains one directory per playbook. Each directory contains the ARM template (the deployable definition), a README (what the playbook does, what it triggers on, what permissions it requires), and a change log (what changed and when).
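A layout along these lines works well (directory and file names here are illustrative, not a required convention):

```
playbooks/
├── enrich-ip-reputation/
│   ├── azuredeploy.json    # exported ARM template (the deployable definition)
│   ├── README.md           # what it does, what triggers it, required permissions
│   └── CHANGELOG.md        # what changed and when
└── isolate-compromised-host/
    ├── azuredeploy.json
    ├── README.md
    └── CHANGELOG.md
```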
When an analyst modifies a playbook, they export the updated ARM template, commit it to the repository, and create a pull request. A second analyst reviews the change — checking for logic errors, permission scope creep, and blast radius implications. After approval, the change is merged and deployed.
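The export-and-commit step can be sketched as a shell sequence. Repository path, playbook name, branch name, and commit message are illustrative assumptions; the Azure CLI export command shown in the comment requires the `logic` extension and is not executed here.

```shell
# Sketch: commit an updated playbook export on a branch for peer review.
set -eu
repo=$(mktemp -d)            # stand-in for the real governance repository
cd "$repo"
git init -q
git config user.email "analyst@example.com"
git config user.name "SOC Analyst"
mkdir -p playbooks/enrich-ip-reputation

# In practice, export the current definition first, e.g. (not run here):
#   az logic workflow show -g rg-sentinel -n enrich-ip-reputation \
#     > playbooks/enrich-ip-reputation/azuredeploy.json
echo '{"definition": {"triggers": {}}}' \
  > playbooks/enrich-ip-reputation/azuredeploy.json

git checkout -qb update/enrich-ip-reputation
git add playbooks/enrich-ip-reputation
git commit -qm "enrich-ip-reputation: widen TI lookup to IPv6"
git log --oneline -1
# A pull request is then opened from this branch for the second-analyst review.
```

The branch-per-change pattern is what makes the second-analyst review possible: the reviewer sees only the diff, not the whole Logic App JSON.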
This may seem like overhead for a Logic App. It is not. The pull request review catches errors that the builder misses. The version history allows rollback when an update introduces a regression. The repository serves as the documentation of what exists and what has changed.
For teams that find full Git workflow excessive, the minimum standard is: export the ARM template after every modification and store it in a shared location (SharePoint, Teams files, or a simple Git repo). Even without PR reviews, having a recoverable copy of each playbook version prevents the “rebuild from scratch” scenario when a change breaks something.
Pillar 2: Testing
Never deploy automation directly to production. The workflow is: build in the Logic App designer → test against sample data → deploy to a staging workspace with test incidents → validate behavior → promote to production.
The staging workspace is a second Sentinel workspace with sample data and test incidents. It does not need to be expensive — a workspace with minimal data ingestion and a few manually created test incidents is sufficient. SA11 covers staging workspace setup in detail.
For Tier 3 automation (containment), testing is critical. Create dedicated test accounts and test devices that can be safely disabled, isolated, and restored. Never test containment against production users or systems. The test account receives the containment action, the playbook verifies the action worked, and the rollback playbook restores the account — all in the staging environment.
Pillar 3: Monitoring
Every playbook must be monitored for success and failure. Logic Apps provide run history — every execution is logged with status (Succeeded, Failed, Cancelled), duration, and error details. But nobody checks run history manually. Monitoring must be automated.
The minimum monitoring setup is a KQL query in Sentinel that checks Logic App diagnostic logs for failed runs:
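A minimal version of that query, assuming Logic App diagnostic settings route `WorkflowRuntime` logs to the workspace (the column names follow the `AzureDiagnostics` schema and may vary with your configuration):

```kusto
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where status_s == "Failed"
| project TimeGenerated,
          PlaybookName = resource_workflowName_s,
          RunId = resource_runId_s
```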
This query becomes a Sentinel analytics rule that fires when any playbook fails. The alert notifies the SOC team. The team investigates the failure before it becomes a silent gap in coverage.
Beyond failure detection, track execution metrics: mean execution time (is the playbook getting slower?), success rate over 30 days (is reliability declining?), and action count (how many incidents are processed?). SA9 builds the full automation health monitoring dashboard.
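The execution metrics can come from the same diagnostic table. A sketch over a 30-day window, again assuming the `AzureDiagnostics` schema:

```kusto
AzureDiagnostics
| where TimeGenerated > ago(30d)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize Runs = count(),
            SuccessRate = round(100.0 * countif(status_s == "Succeeded") / count(), 1),
            AvgDurationMs = avg(DurationMs)
  by PlaybookName = resource_workflowName_s
```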
Pillar 4: Documentation
Every playbook has a runbook. The runbook answers six questions:
- What does this playbook do? One-paragraph description of the complete workflow.
- What triggers it? The specific automation rule or analytics rule that fires the playbook.
- What actions does it take? Every action in the workflow, including external API calls and containment actions.
- What can go wrong? Known failure modes: API rate limits, timeout on large data queries, entity extraction failures, permission expiration.
- How do I override or disable it? How to temporarily disable the playbook or the automation rule without deleting either. How to manually approve or reject a pending approval gate.
- Who owns it? The person responsible for fixing it when it breaks. If the owner leaves the organisation, the runbook enables their replacement to understand the playbook without reverse-engineering the Logic App JSON.
The runbook is not optional. It is the difference between “the automation broke and we fixed it in 15 minutes” and “the automation broke and we spent 3 hours figuring out what it does before we could fix it.”
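A runbook skeleton that maps one heading to each of the six questions (the playbook, rule names, and failure modes shown are illustrative):

```markdown
# Runbook: enrich-ip-reputation
## What it does
One-paragraph description of the complete workflow.
## What triggers it
Automation rule "Enrich incidents with IP entities" (incident-created trigger).
## What actions it takes
1. Extract IP entities. 2. Query the TI API. 3. Post an enrichment comment.
## What can go wrong
TI API rate limit (HTTP 429); entity extraction miss on custom alert types.
## How to override or disable it
Disable the automation rule (not the Logic App) to pause without deleting.
## Who owns it
Name, team, contact. Updated whenever ownership changes.
```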
The myth: Every change to an automation rule or playbook must go through the standard change management process — change request, impact assessment, CAB review, scheduled implementation window. This process takes 1-2 weeks per change.
The reality: Automation changes should follow a streamlined change process, not the full enterprise CAB cycle. Tier 1 changes (enrichment playbook modifications) require peer review and testing — deploy the same day. Tier 2 changes (notification routing, collection scope) require SOC lead approval — deploy within 1-2 days. Tier 3 changes (containment logic, confidence threshold adjustments) require IR lead review and staging workspace testing — deploy within 1 week. Document the change in the Git commit message and the runbook. The CAB reviews a monthly summary of automation changes, not each individual change.
Automation Governance Checklist — Deploy Gate
Before any playbook goes live in production, verify:
- ARM template exported and committed to Git repository
- README created: purpose, trigger, permissions, actions
- Tested in staging workspace with test incidents
- Tier 3 playbooks: tested with test accounts/devices (not production)
- Monitoring configured: failed-run analytics rule active
- Runbook written: 6 questions answered
- Owner assigned: name and contact in runbook
- Rollback documented: how to undo every action the playbook takes
- SOC team briefed: analysts know the playbook exists and what it does
- Permissions reviewed: managed identity has minimum required scope
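The artifact-related items on this checklist can be enforced mechanically. A minimal sketch of a pre-deploy gate script, assuming the repository layout from Pillar 1 (file names are illustrative):

```shell
# Sketch: fail the deploy unless required governance artifacts exist.
gate() {
  dir="$1"
  for f in azuredeploy.json README.md RUNBOOK.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "GATE FAILED: missing $dir/$f"
      return 1
    fi
  done
  echo "GATE PASSED: $dir"
}

# Demo: a playbook directory with all three artifacts passes the gate.
mkdir -p playbooks/demo
touch playbooks/demo/azuredeploy.json playbooks/demo/README.md playbooks/demo/RUNBOOK.md
gate playbooks/demo
```

Run as a CI step before `az deployment group create`, the gate turns the checklist from a habit into a hard stop.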
Decision point: An analyst builds a Tier 1 enrichment playbook and wants to deploy it immediately — it is read-only, zero blast radius, and the team needs it. The governance process says “test in staging first.” The analyst argues that testing an enrichment playbook in staging is unnecessary because it cannot cause harm. The correct answer: test it anyway. Not because it might cause harm, but because testing validates that it works. An enrichment playbook that fails silently (timeout on the TI API, entity extraction error on certain alert types, incorrect JSON parsing) does not cause damage but it does not provide value either. Testing catches these failures before the team starts relying on enrichment data that is not being generated. The testing is quick — create one test incident in staging, run the playbook, verify the enrichment comment appears. Ten minutes. Then deploy with confidence.
Try it: Audit your automation governance
For every active playbook in your Sentinel workspace:
- Is the ARM template saved somewhere recoverable? (Git repo, SharePoint, anywhere?)
- When was it last tested? (Not “when was it last run” — when was it deliberately tested?)
- Is there monitoring? Would you know within 24 hours if it stopped working?
- Is there a runbook? Could a new team member understand what it does without asking the builder?
- Who owns it? Is the owner still on the team?
If any answer is “no” — that is your governance gap. Fix it before building new playbooks.
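One way to answer the monitoring and "when did it last run" questions in a single pass is a staleness query (a sketch, assuming the `AzureDiagnostics` schema for Logic App runtime logs):

```kusto
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize LastRun = max(TimeGenerated),
            LastSuccess = maxif(TimeGenerated, status_s == "Succeeded")
  by PlaybookName = resource_workflowName_s
| order by LastSuccess asc
```

Playbooks whose last success is days behind their last run are failing silently; playbooks with no recent runs at all may have a broken trigger.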
Where this goes deeper. SA11 is dedicated to automation testing and governance — staging workspace setup, testing frameworks (unit, integration, scenario, load), version control structure, change management processes, and continuous improvement methodology. SA12 covers the organisational side — building the automation team, metrics dashboards, and the monthly review process.
You're reading the free modules of this course
The full course continues with advanced topics, production detection rules, worked investigation scenarios, and deployable artifacts. Premium subscribers get access to all courses.