SA1.7 Error Handling and Retry Logic
Figure SA1.7 — Scope-based error handling with retry policies per action type. The error handler runs ONLY when the enrichment scope fails — successful runs skip it entirely.
The silent failure problem
A Logic App without error handling stops at the first action that ends in a failed state. If the VirusTotal API returns a 429 (rate limit) on the third action of a 10-action playbook and that action fails, actions 4-10 never execute. The incident receives no enrichment comment, no Teams notification, and no tag update. The analyst opens the incident hours later and sees a raw alert, exactly as if no automation existed. The playbook failed silently because nobody checked the Logic App run history, and the analyst had no way to know that enrichment was supposed to happen and did not.
Silent failure is the most damaging automation failure mode because it erodes trust without anyone noticing. The analyst assumes enrichment is running. It is not. When someone eventually discovers the playbook has been failing for three weeks, the response is “see, automation doesn’t work” — and the entire program loses credibility.
Scope-based error handling
The Scope action is a Logic App control that groups multiple actions and provides unified error handling for the group. Think of it as a try-catch block.
Architecture: Create a Scope called “Enrichment Queries” containing all enrichment actions (KQL query, Graph API call, VirusTotal lookup, etc.). Create a second Scope called “Error Handler.” Configure the Error Handler’s “Run After” setting to trigger when “Enrichment Queries” has failed or has timed out. The Error Handler does NOT run when Enrichment Queries succeeds — it only activates on failure.
Inside the Error Handler, add actions that: identify which specific query failed (check the result of each action within the Enrichment Queries scope), compose an error summary, add an incident comment documenting the partial failure, and tag the incident “enrichment-partial.”
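The Error Handler steps above can be sketched in Python. In the Logic App itself, the per-action outcomes would typically come from the result() workflow expression applied to the scope; here they are modeled as a plain list of dictionaries. The field names (name, status, error), the action names, and the sample data are illustrative assumptions, not the exact shape Logic Apps returns.

```python
from typing import Dict, List, Optional


def summarize_failures(action_results: List[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Build the incident comment and tag from per-action outcomes in a scope."""
    # Keep only the actions that ended in an error state.
    failed = [r for r in action_results if r["status"] in ("Failed", "TimedOut")]
    if not failed:
        return {"comment": None, "tag": None}

    lines = [
        f"- {r['name']}: {r['status']} ({r.get('error', 'no error detail')})"
        for r in failed
    ]
    comment = "Enrichment partially failed. Affected actions:\n" + "\n".join(lines)
    return {"comment": comment, "tag": "enrichment-partial"}


# Hypothetical outcomes for three actions inside the "Enrichment Queries" scope.
sample_results = [
    {"name": "KQL_signin_query", "status": "Succeeded"},
    {"name": "Graph_user_risk", "status": "Succeeded"},
    {"name": "VirusTotal_lookup", "status": "Failed", "error": "429 rate limit"},
]

summary = summarize_failures(sample_results)
print(summary["comment"])
print("Tag:", summary["tag"])
```

The same filter-then-compose shape maps onto a Filter array action followed by a Compose action in the designer.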
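The key insight: the Error Handler catches the failure and documents it instead of letting the entire playbook crash. After the Error Handler completes, the playbook continues to the next Scope, so the Teams notification and tag update still execute. For this to work on successful runs as well, set that next Scope’s “Run After” on the Error Handler to both “is successful” and “is skipped”; with the default (“is successful” only), a skipped Error Handler would cause the next Scope to be skipped too. The analyst receives a Teams message with the incident summary (even though enrichment failed) and can see the “enrichment-partial” tag in the incident queue.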
Configuring “Run After.” Every Logic App action has a “Run After” configuration that determines under what conditions the action executes. The options are: “is successful” (default), “has failed,” “is skipped,” and “has timed out.” The Error Handler Scope should be configured to run after “Enrichment Queries” with two conditions checked: “has failed” and “has timed out.” The checked conditions are alternatives, not a combined requirement: the Error Handler runs if the scope ends in either state, so it catches both error types.
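A small Python model can make the “Run After” behavior concrete. The runAfter fragments below mirror how the setting appears in a workflow definition's code view (predecessor name mapped to a list of allowed statuses); the scope names, including the hypothetical Notify_And_Tag scope, are assumptions for illustration. The should_run helper is a simplified model of the evaluation, not the service's implementation.

```python
import json
from typing import Dict, List

RunAfter = Dict[str, List[str]]

# Error handler: runs when the enrichment scope failed OR timed out.
error_handler_run_after: RunAfter = {"Enrichment_Queries": ["Failed", "TimedOut"]}

# Hypothetical follow-on scope: runs whether the handler ran or was skipped.
notify_run_after: RunAfter = {"Error_Handler": ["Succeeded", "Skipped"]}


def should_run(run_after: RunAfter, statuses: Dict[str, str]) -> bool:
    """Statuses listed for one predecessor are alternatives; every predecessor must match."""
    return all(statuses.get(pred) in allowed for pred, allowed in run_after.items())


def simulate(enrichment_status: str) -> Dict[str, str]:
    """Walk the three scopes in order, assuming any scope that runs also succeeds."""
    statuses = {"Enrichment_Queries": enrichment_status}
    statuses["Error_Handler"] = (
        "Succeeded" if should_run(error_handler_run_after, statuses) else "Skipped"
    )
    statuses["Notify_And_Tag"] = (
        "Succeeded" if should_run(notify_run_after, statuses) else "Skipped"
    )
    return statuses


print(json.dumps({"Error_Handler": {"type": "Scope", "runAfter": error_handler_run_after}}, indent=2))
for outcome in ("Succeeded", "Failed", "TimedOut"):
    print(outcome, "->", simulate(outcome))
```

In this model the notification scope runs in all three cases, while the Error Handler runs only for the Failed and TimedOut outcomes.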
Retry policies
Retry policies apply to individual actions, not Scopes. Each HTTP action, connector action, or API call can be configured with a retry policy that determines how the action handles transient failures.
Fixed interval retry. The action retries N times with a fixed delay between attempts. Use for APIs that are generally stable but occasionally return errors due to transient load: Microsoft Graph (429 throttle during high-volume periods), Log Analytics queries (503 during workspace maintenance), Sentinel API (occasional timeouts).
Configuration: Retry policy = Fixed, Count = 2, Interval = PT10S (10 seconds). The action tries once, waits 10 seconds, tries again, waits 10 seconds, tries a third time. If all three attempts fail, the action status is “Failed” and the Scope’s error handler activates.
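The fixed-interval timeline above can be checked with a few lines of Python. The dictionary shows how this policy is commonly written in an action's inputs in code view; the helper is a sketch that ignores how long each attempt itself takes, so the offsets are a lower bound.

```python
import json
from typing import List

# Fixed retry policy from the text: 2 retries, 10 seconds apart.
fixed_policy = {"type": "fixed", "count": 2, "interval": "PT10S"}


def fixed_attempt_offsets(retry_count: int, interval_seconds: int) -> List[int]:
    """Earliest start time (seconds) of each attempt: the first call plus each retry."""
    return [attempt * interval_seconds for attempt in range(retry_count + 1)]


offsets = fixed_attempt_offsets(retry_count=2, interval_seconds=10)
print(json.dumps({"retryPolicy": fixed_policy}))
print("Attempts:", len(offsets), "starting at seconds", offsets)
```

With Count = 2 the helper yields three attempts at 0, 10, and 20 seconds, which matches the sequence described above.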
Exponential backoff retry. The action retries with increasing delays, roughly 15 seconds, then 30 seconds, then 60 seconds. Use for external APIs with rate limits: VirusTotal (4 requests/minute on free tier), AbuseIPDB (1,000 requests/day), Shodan (1 request/second).
Configuration: Retry policy = Exponential, Count = 3, Minimum interval = PT15S, Maximum interval = PT60S. The increasing delay gives the API time to recover from rate limiting. As with the fixed policy, Count is the number of retries, so the action makes up to four attempts in total (the original call plus 3 retries). If the last retry still fails, the action status is “Failed” and the error handler activates.
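The backoff sequence can be sketched as a simple doubling schedule clamped to the configured maximum. This is a deterministic illustration under that assumption; the delays a real run waits may differ, for example if the service adds randomization.

```python
from typing import List


def backoff_delays(retry_count: int, minimum_seconds: int, maximum_seconds: int) -> List[int]:
    """Delay before each retry: start at the minimum, double each time, cap at the maximum."""
    delays = []
    delay = minimum_seconds
    for _ in range(retry_count):
        delays.append(min(delay, maximum_seconds))
        delay *= 2
    return delays


delays = backoff_delays(retry_count=3, minimum_seconds=15, maximum_seconds=60)
print("Delays before each retry:", delays)
print("Total attempts:", len(delays) + 1)
print("Waiting time added by retries:", sum(delays), "seconds")
```

Under this model, three retries add 105 seconds of waiting on top of the time the four attempts themselves take.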
No retry. The action fails immediately on the first error with no retry. Set the retry policy to None explicitly, because an action left on the default policy still retries. Use for containment actions: disabling a user account, isolating an endpoint, revoking sessions. Unlike a read-only enrichment query, a containment action changes state, and an error response does not tell you whether that change happened: a retry on “disable account” after a transient error could run against an account that the first attempt already disabled (the first attempt may have succeeded despite returning an error). Containment failures need human investigation, not automated retry.
The dead letter pattern
When an enrichment source fails after all retries, the failure must be recorded and the playbook must continue. This is the dead letter pattern: log the failure, skip the failed enrichment, and proceed with remaining sources.
Implementation: inside the “Enrichment Queries” Scope, wrap each individual enrichment action in its own try-catch. If the VirusTotal lookup fails, the playbook captures the error, adds it to a “failures” array variable, and continues to the next enrichment source (Graph user risk, device compliance, etc.). After all enrichment actions complete, the playbook checks the failures array. If it has entries, the enrichment comment includes a section: “The following enrichment sources failed: VirusTotal (429 rate limit). Manual lookup recommended.”
This pattern ensures the analyst receives every enrichment result that succeeded, with clear documentation of what failed and why. Partial enrichment (4 of 5 sources) is far more valuable than no enrichment (playbook crashed on source 2).
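The dead letter pattern described above can be sketched in Python as follows. The three lookup functions and their return values are hypothetical stand-ins for the enrichment actions; in the playbook, each would be an action wrapped in its own scope with a failure branch that appends to the failures array variable.

```python
from typing import Callable, Dict, List, Tuple


def run_enrichment(sources: Dict[str, Callable[[], str]]) -> Tuple[Dict[str, str], List[str]]:
    """Run every source; record failures instead of stopping at the first one."""
    results: Dict[str, str] = {}
    failures: List[str] = []
    for name, lookup in sources.items():
        try:
            results[name] = lookup()
        except Exception as exc:  # dead letter: log the failure and move on
            failures.append(f"{name} ({exc})")
    return results, failures


def build_comment(results: Dict[str, str], failures: List[str]) -> str:
    """Compose the incident comment from successful results plus a failure section."""
    lines = [f"{name}: {value}" for name, value in results.items()]
    if failures:
        lines.append(
            "The following enrichment sources failed: "
            + ", ".join(failures)
            + ". Manual lookup recommended."
        )
    return "\n".join(lines)


# Hypothetical enrichment sources with made-up results.
def virustotal_lookup() -> str:
    raise RuntimeError("429 rate limit")


def graph_user_risk() -> str:
    return "user risk level: medium"


def device_compliance() -> str:
    return "device compliant: yes"


sources = {
    "VirusTotal": virustotal_lookup,
    "Graph user risk": graph_user_risk,
    "Device compliance": device_compliance,
}
results, failures = run_enrichment(sources)
comment = build_comment(results, failures)
print(comment)
```

The VirusTotal failure ends up in the failures list while both remaining sources still contribute to the comment, which is the partial-enrichment outcome the pattern is meant to preserve.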
Timeout management
The default action timeout in Logic Apps is 2 minutes. For most API calls, 2 minutes is sufficient. For KQL queries against large datasets (30 days of SigninLogs for a user with thousands of sign-ins), the query may need up to 5 minutes. For the overall playbook, you want completion within 60 seconds for enrichment playbooks (the analyst should not wait more than a minute for enrichment to appear).
Set per-action timeouts based on the action type: HTTP calls to Graph = 30 seconds (these are fast). KQL queries = 120 seconds (can be slow for large results). HTTP calls to external TI = 45 seconds (network latency varies). If an action times out, the retry policy handles it — the timeout counts as a failure, and the next retry attempt starts.
For the overall playbook, consider adding a “Terminate” action after 5 minutes of total execution. If enrichment has not completed in 5 minutes, something is fundamentally wrong (workspace outage, API endpoint down). The playbook should terminate with a status of “Failed,” triggering the monitoring analytics rule from SA1.9 to alert the SOC team.
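Per-action timeouts and retry policies interact, and the worst case can be estimated with simple arithmetic. The sketch below assumes every attempt runs to its full timeout and that retry delays follow the fixed and exponential examples from the retry section (10, 10 and 15, 30, 60 seconds); pairing each action type with a particular policy is an assumption for illustration.

```python
from typing import Dict, List


def worst_case_seconds(timeout_seconds: int, retry_delays_seconds: List[int]) -> int:
    """Upper bound for one action: every attempt times out, plus all waits between attempts."""
    attempts = len(retry_delays_seconds) + 1
    return attempts * timeout_seconds + sum(retry_delays_seconds)


budgets: Dict[str, int] = {
    "Graph HTTP call (30 s timeout, fixed, 2 retries)": worst_case_seconds(30, [10, 10]),
    "KQL query (120 s timeout, fixed, 2 retries)": worst_case_seconds(120, [10, 10]),
    "External TI call (45 s timeout, exponential, 3 retries)": worst_case_seconds(45, [15, 30, 60]),
}

PLAYBOOK_CAP_SECONDS = 5 * 60  # the 5-minute Terminate threshold from the text

for action, seconds in budgets.items():
    verdict = "exceeds" if seconds > PLAYBOOK_CAP_SECONDS else "fits within"
    print(f"{action}: up to {seconds} s, {verdict} the {PLAYBOOK_CAP_SECONDS} s cap")
```

Under these assumptions a single KQL action that times out on every attempt could take 380 seconds, longer than the 5-minute cap on its own, which suggests choosing timeouts and retry counts together rather than independently.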
The myth: If the playbook fails, the analyst does the work manually. The automation is a nice-to-have, not a dependency. Error handling is over-engineering.
The reality: The analyst does not know the playbook failed unless error handling tells them. The incident sits in the queue looking like every other incident — the analyst has no visual indicator that enrichment was attempted and failed versus enrichment was not attempted at all. Without error handling, the analyst may assume the lack of enrichment means the incident is low-priority (no enrichment comment = nothing interesting found), when in reality the enrichment playbook crashed before producing any results. Error handling is not over-engineering — it is the mechanism that maintains visibility into automation health.
Decision point: Your containment playbook’s session revocation action returns an HTTP 500 (internal server error) from Microsoft Graph. Should the playbook retry the revocation? The answer is nuanced. A 500 error could mean the request was received but processing failed (the sessions were not revoked), or it could mean the request was received, processing succeeded, but the response generation failed (the sessions WERE revoked but Graph could not confirm it). If you retry and the first attempt actually succeeded, the second attempt revokes sessions that are already revoked — harmless. If the first attempt failed, the retry catches it. For session revocation specifically, a single retry after a 500 error is safe because the action is idempotent (revoking already-revoked sessions has no additional effect). But for account disable, a retry after an ambiguous error is riskier — add a verification step (query account status) before retrying the disable action.
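The verify-before-retry idea for account disable can be sketched like this. FakeDirectory is a hypothetical stand-in for the identity API, built so that the first disable call takes effect but still raises an error, which is the ambiguous case described above; none of these names come from a real SDK.

```python
class FakeDirectory:
    """Hypothetical identity service whose first disable call succeeds but reports HTTP 500."""

    def __init__(self) -> None:
        self.disabled = False
        self.disable_calls = 0

    def disable_account(self) -> None:
        self.disable_calls += 1
        self.disabled = True  # the change is applied...
        if self.disable_calls == 1:
            raise RuntimeError("HTTP 500")  # ...but the response is lost

    def is_disabled(self) -> bool:
        return self.disabled


def disable_with_verification(directory: FakeDirectory, max_attempts: int = 2) -> str:
    """Query the account state after an error before deciding whether to retry."""
    for _ in range(max_attempts):
        try:
            directory.disable_account()
            return "disabled"
        except RuntimeError:
            if directory.is_disabled():
                return "disabled (verified after ambiguous error)"
            # Verified as still enabled: the loop makes another attempt.
    return "failed: escalate to analyst"


directory = FakeDirectory()
outcome = disable_with_verification(directory)
print(outcome, "| disable calls made:", directory.disable_calls)
```

Here the status check confirms the account is already disabled, so the sketch stops after one call instead of retrying blindly; if both attempts fail and the account is still enabled, the result is an escalation rather than further retries.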
Try it: Add error handling to your enrichment playbook
Open the playbook from SA1.4 in the Logic App designer:
- Select the KQL query action and the incident comment action
- Click “Add a parallel branch” → add a new action → search “Control” → select “Scope”
- Move the KQL query and comment actions INTO the Scope (drag and drop or cut/paste)
- Name the Scope “Enrichment Queries”
- Add a second Scope called “Error Handler”
- Configure Error Handler’s “Run After” to fire when “Enrichment Queries” has failed or timed out
- Inside Error Handler, add an “Add comment to incident” action: “Enrichment failed. Error: [insert error details]. Manual enrichment required.”
- Test by temporarily removing Log Analytics Reader permission from the managed identity — the KQL query should fail, the error handler should fire, and the incident should receive the error comment
- Restore the permission after testing
Where this goes deeper. SA2 builds the enrichment pipeline with parallel execution and individual error handling per enrichment source — the dead letter pattern applied to production playbooks. SA5-SA7 implement error handling for containment playbooks where the stakes are higher: a failed containment attempt needs immediate analyst attention, not just an incident comment.