SA0.9 The Automation Governance Framework

5 hours · Module 0 · Free
AUTOMATION GOVERNANCE — THE FOUR PILLARS

VERSION CONTROL: ARM/Bicep templates in Git · PR reviews for changes · Change log per playbook · Rollback to previous version · Automation IS code
TESTING: Staging workspace · Test incidents + test accounts · Dry-run before production · Load test under volume · Never test in production
MONITORING: Logic App success/fail rates · Execution latency tracking · Alert on playbook failures · Monthly health review · Unmonitored = invisible failure
DOCUMENTATION: Runbook per playbook · Owner + escalation contact · What it does / triggers / risks · How to override or disable · If the builder leaves, it must survive

THE AUTOMATION GOVERNANCE FAILURE MODE
Week 1: Analyst builds a playbook. It works. Everyone celebrates.
Month 3: API changes. Playbook fails silently. Nobody notices (no monitoring).
Month 6: Original analyst leaves. Nobody knows how the playbook works (no runbook).
Month 9: Incident occurs. Team discovers broken automation. Manual response. Breach.

Figure SA0.9 — The four governance pillars and the failure mode they prevent. Automation without governance degrades silently.

Operational Objective
Automation is code. Code without version control, testing, monitoring, and documentation degrades silently until it fails during the incident where you need it most. NE's dead VirusTotal playbook is the predictable outcome of automation without governance. This sub-module establishes the four governance pillars — version control, testing, monitoring, and documentation — that keep automation operational over months and years, not just the first week after deployment.
Deliverable: The Automation Governance Framework — four pillars with specific, implementable practices for each. Applied to every playbook you build in this course.
⏱ Estimated completion: 25 minutes

Automation is code — treat it like code

Logic Apps are defined as JSON ARM templates. Automation rules have exportable configurations. KQL queries in analytics rules are text. All of these are code artifacts that should follow the same lifecycle as any production code: version controlled, tested before deployment, monitored in production, and documented for the people who maintain them.

The teams that treat automation as “configuration” instead of “code” hit the governance failure mode within 6 months. The playbook that someone built, deployed, and forgot runs until it breaks — and when it breaks, nobody knows how to fix it, because nobody documented how it works, nobody monitored it to detect the failure, and nobody version-controlled it to make the last working version recoverable.

Pillar 1: Version control

Every playbook should exist as an ARM template or Bicep file in a Git repository. The repository structure follows the same pattern as infrastructure-as-code:

The repository contains one directory per playbook. Each directory contains the ARM template (the deployable definition), a README (what the playbook does, what it triggers on, what permissions it requires), and a change log (what changed and when).

When an analyst modifies a playbook, they export the updated ARM template, commit it to the repository, and create a pull request. A second analyst reviews the change — checking for logic errors, permission scope creep, and blast radius implications. After approval, the change is merged and deployed.
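The export-and-commit step can be sketched in a few shell commands. This is a self-contained illustration, not a prescribed layout: the playbook name, directory structure, and stub template are hypothetical, and in practice `azuredeploy.json` would be the ARM template exported from the Logic App.

```shell
set -e
# Self-contained sketch: initialise a playbook repo, add an exported ARM
# template, and record the change in a changelog. All names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "soc@example.com"
git config user.name "SOC Analyst"
mkdir -p playbooks/Enrich-VirusTotal-IP
# Stub standing in for the real exported ARM template:
echo '{"resources": []}' > playbooks/Enrich-VirusTotal-IP/azuredeploy.json
echo "v1: initial export" > playbooks/Enrich-VirusTotal-IP/CHANGELOG.md
git add -A
git commit -q -m "Enrich-VirusTotal-IP: initial ARM export"
git log --oneline            # one commit per reviewed change
```

In the team workflow described above, the commit goes to a branch and a pull request rather than directly to main; only after a second analyst approves is it merged and deployed.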

This may seem like overhead for a Logic App. It is not. The pull request review catches errors that the builder misses. The version history allows rollback when an update introduces a regression. The repository serves as the documentation of what exists and what has changed.

For teams that find full Git workflow excessive, the minimum standard is: export the ARM template after every modification and store it in a shared location (SharePoint, Teams files, or a simple Git repo). Even without PR reviews, having a recoverable copy of each playbook version prevents the “rebuild from scratch” scenario when a change breaks something.

Pillar 2: Testing

Never deploy automation directly to production. The workflow is: build in the Logic App designer → test against sample data → deploy to a staging workspace with test incidents → validate behavior → promote to production.

The staging workspace is a second Sentinel workspace with sample data and test incidents. It does not need to be expensive — a workspace with minimal data ingestion and a few manually created test incidents is sufficient. SA11 covers staging workspace setup in detail.
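One way to keep staging and production in sync is to deploy the same exported template with per-environment parameter files. The parameter name, file names, and resource group below are hypothetical, and the actual az CLI deployment step is shown only as a comment:

```shell
# Sketch (hypothetical names): one template, two parameter files.
cat > staging.parameters.json <<'EOF'
{ "parameters": { "workspaceName": { "value": "sentinel-staging" } } }
EOF
cat > prod.parameters.json <<'EOF'
{ "parameters": { "workspaceName": { "value": "sentinel-prod" } } }
EOF
# Promote to production only after the staging run is validated, e.g.:
#   az deployment group create -g rg-sentinel-staging \
#     --template-file azuredeploy.json --parameters @staging.parameters.json
grep -h '"value"' staging.parameters.json prod.parameters.json
```

The point of the split is that the template under version control never changes between environments; only the parameter file does, so a playbook validated in staging is byte-identical to the one promoted to production.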

For Tier 3 automation (containment), testing is critical. Create dedicated test accounts and test devices that can be safely disabled, isolated, and restored. Never test containment against production users or systems. The test account receives the containment action, the playbook verifies the action worked, and the rollback playbook restores the account — all in the staging environment.

Pillar 3: Monitoring

Every playbook must be monitored for success and failure. Logic Apps provide run history — every execution is logged with status (Succeeded, Failed, Cancelled), duration, and error details. But nobody checks run history manually. Monitoring must be automated.

The minimum monitoring setup is a KQL query in Sentinel that checks Logic App diagnostic logs for failed runs:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.LOGIC"
| where status_s == "Failed"
| where TimeGenerated > ago(24h)
| project TimeGenerated, resource_workflowName_s, error_message_s
| order by TimeGenerated desc

This query becomes a Sentinel analytics rule that fires when any playbook fails. The alert notifies the SOC team. The team investigates the failure before it becomes a silent gap in coverage.

Beyond failure detection, track execution metrics: mean execution time (is the playbook getting slower?), success rate over 30 days (is reliability declining?), and action count (how many incidents are processed?). SA9 builds the full automation health monitoring dashboard.
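As a rough illustration of those metrics, a run-history export can be summarised offline. The CSV layout and sample values below are invented for the sketch:

```shell
# Sketch: compute success rate and mean duration from a run-history export.
# CSV columns: status,duration_seconds (sample values are made up).
cat > runs.csv <<'EOF'
Succeeded,12
Succeeded,14
Failed,3
Succeeded,11
EOF
awk -F, '
  { total++; dur += $2; if ($1 == "Succeeded") ok++ }
  END { printf "runs=%d success_rate=%.0f%% mean_duration=%.1fs\n",
               total, 100 * ok / total, dur / total }
' runs.csv
# → runs=4 success_rate=75% mean_duration=10.0s
```

In practice the same aggregation is better done directly in KQL against the diagnostic logs, but the arithmetic is the same: count runs, count successes, average duration, and watch the trend over 30 days.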

Pillar 4: Documentation

Every playbook has a runbook. The runbook answers six questions:

  1. What does this playbook do? One-paragraph description of the complete workflow.
  2. What triggers it? The specific automation rule or analytics rule that fires the playbook.
  3. What actions does it take? Every action in the workflow, including external API calls and containment actions.
  4. What can go wrong? Known failure modes: API rate limits, timeout on large data queries, entity extraction failures, permission expiration.
  5. How do I override or disable it? How to temporarily disable the playbook or the automation rule without deleting either. How to manually approve or reject a pending approval gate.
  6. Who owns it? The person responsible for fixing it when it breaks. If the owner leaves the organisation, the runbook enables their replacement to understand the playbook without reverse-engineering the Logic App JSON.

The runbook is not optional. It is the difference between “the automation broke and we fixed it in 15 minutes” and “the automation broke and we spent 3 hours figuring out what it does before we could fix it.”
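One way to make the runbook hard to skip is to scaffold it with the six questions pre-filled as headings when the playbook directory is created. The playbook name and file layout here are hypothetical:

```shell
# Sketch: scaffold the six-question runbook for a new playbook.
# Playbook name and repo layout are hypothetical.
pb="Revoke-Sessions-AiTM"
mkdir -p "playbooks/$pb"
cat > "playbooks/$pb/RUNBOOK.md" <<EOF
# Runbook: $pb
## 1. What does this playbook do?
## 2. What triggers it?
## 3. What actions does it take?
## 4. What can go wrong?
## 5. How do I override or disable it?
## 6. Who owns it?
EOF
grep -c '^## ' "playbooks/$pb/RUNBOOK.md"    # prints 6: one heading per question
```

An empty heading in the committed runbook is then visible in the pull request, which gives the reviewer a concrete hook to block deployment until all six questions are answered.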

⚠ Compliance Myth: "We need formal change management approval for every automation rule change, which takes 2 weeks per change"

The myth: Every change to an automation rule or playbook must go through the standard change management process — change request, impact assessment, CAB review, scheduled implementation window. This process takes 1-2 weeks per change.

The reality: Automation changes should follow a streamlined change process, not the full enterprise CAB cycle. Tier 1 changes (enrichment playbook modifications) require peer review and testing — deploy the same day. Tier 2 changes (notification routing, collection scope) require SOC lead approval — deploy within 1-2 days. Tier 3 changes (containment logic, confidence threshold adjustments) require IR lead review and staging workspace testing — deploy within 1 week. Document the change in the Git commit message and the runbook. The CAB reviews a monthly summary of automation changes, not each individual change.

Automation Governance Checklist — Deploy Gate

Before any playbook goes live in production, verify:

  • ARM template exported and committed to Git repository
  • README created: purpose, trigger, permissions, actions
  • Tested in staging workspace with test incidents
  • Tier 3 playbooks: tested with test accounts/devices (not production)
  • Monitoring configured: failed-run analytics rule active
  • Runbook written: 6 questions answered
  • Owner assigned: name and contact in runbook
  • Rollback documented: how to undo every action the playbook takes
  • SOC team briefed: analysts know the playbook exists and what it does
  • Permissions reviewed: managed identity has minimum required scope

Decision point: An analyst builds a Tier 1 enrichment playbook and wants to deploy it immediately — it is read-only, zero blast radius, and the team needs it. The governance process says “test in staging first.” The analyst argues that testing an enrichment playbook in staging is unnecessary because it cannot cause harm. The correct answer: test it anyway.

Not because it might cause harm, but because testing validates that it works. An enrichment playbook that fails silently (timeout on the TI API, entity extraction error on certain alert types, incorrect JSON parsing) does not cause damage, but it does not provide value either. Testing catches these failures before the team starts relying on enrichment data that is not being generated. The testing is quick — create one test incident in staging, run the playbook, verify the enrichment comment appears. Ten minutes. Then deploy with confidence.

Try it: Audit your automation governance

For every active playbook in your Sentinel workspace:

  1. Is the ARM template saved somewhere recoverable? (Git repo, SharePoint, anywhere?)
  2. When was it last tested? (Not “when was it last run” — when was it deliberately tested?)
  3. Is there monitoring? Would you know within 24 hours if it stopped working?
  4. Is there a runbook? Could a new team member understand what it does without asking the builder?
  5. Who owns it? Is the owner still on the team?

If any answer is “no” — that is your governance gap. Fix it before building new playbooks.
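The first audit question can be sketched as a small script over a playbook repository. The directory layout and file names are assumptions, and sample directories are created here so the loop runs anywhere:

```shell
# Sketch: flag playbook directories missing their ARM export or runbook.
# Layout and file names are hypothetical; sample dirs stand in for a real repo.
mkdir -p playbooks/Enrich-VirusTotal-IP playbooks/Isolate-Device
touch playbooks/Enrich-VirusTotal-IP/azuredeploy.json \
      playbooks/Enrich-VirusTotal-IP/RUNBOOK.md \
      playbooks/Isolate-Device/azuredeploy.json   # Isolate-Device: no runbook
for d in playbooks/*/; do
  for f in azuredeploy.json RUNBOOK.md; do
    [ -f "$d$f" ] || echo "GAP: $d missing $f"
  done
done
# → GAP: playbooks/Isolate-Device/ missing RUNBOOK.md
```

Run against a real repository, every GAP line is a playbook that would hit the failure mode from the figure; fixing those gaps comes before building anything new.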

A playbook that auto-revokes sessions for confirmed AiTM incidents has been running successfully for 4 months. The analyst who built it left the company 2 weeks ago. An update to the Microsoft Graph API changes the session revocation endpoint. The playbook starts failing. What governance failure caused this to become a problem?

  • Version control failure — the ARM template was not in Git? No. Version control helps with rollback, but the playbook is failing because the API changed, not because of a code regression. Rolling back to the previous version would not fix the API endpoint change.
  • Testing failure — the playbook was not tested after the API change? No. The API change is external — Microsoft changed it without coordinating with NE. Testing would not have prevented this. However, monitoring should have detected the failure immediately.
  • Documentation and ownership failure — the analyst left without a runbook or an assigned successor? Correct. The playbook fails, nobody knows how it works (no runbook), nobody is responsible for fixing it (no owner), and the team may not even notice (monitoring may or may not be in place). The runbook would describe the API calls, and the ownership handoff would ensure someone is responsible for maintaining the playbook.
  • Monitoring failure — the team did not detect the playbook failure? Partially. Monitoring is part of the problem, but even with monitoring, the team cannot fix what they do not understand. The root cause is the missing runbook and missing owner. Monitoring tells you WHEN it broke. Documentation tells you HOW to fix it. Ownership tells you WHO fixes it.

Where this goes deeper. SA11 is dedicated to automation testing and governance — staging workspace setup, testing frameworks (unit, integration, scenario, load), version control structure, change management processes, and continuous improvement methodology. SA12 covers the organisational side — building the automation team, metrics dashboards, and the monthly review process.
