SA1.7 Error Handling and Retry Logic

5 hours · Module 1 · Free

Figure SA1.7 — Scope-based error handling with retry policies per action type. The error handler runs ONLY when the enrichment scope fails — successful runs skip it entirely.

Operational Objective

Logic Apps fail. APIs time out. Rate limits hit. Permissions expire. The managed identity token refresh fails during a brief Azure AD outage. The question is not whether your playbook will encounter errors; it is whether those errors are handled gracefully (partial enrichment with clear reporting) or catastrophically (entire playbook stops, no enrichment applied, analyst unaware). This subsection teaches the error handling architecture that prevents silent failures: scope-based error handling, retry policies per action type, and the dead letter pattern for unrecoverable failures.

Deliverable: Error handling patterns applicable to every playbook in this course. Scope configuration, retry policies, and the "partial enrichment" pattern that ensures the analyst always receives the maximum available context.

⏱ Estimated completion: 25 minutes

The silent failure problem

A Logic App without error handling stops at the first action that fails once that action's retry policy is exhausted. If the VirusTotal API keeps returning a 429 (rate limit) on the third action of a 10-action playbook, actions 4-10 never execute. The incident receives no enrichment comment, no Teams notification, and no tag update. The analyst opens the incident hours later and sees a raw alert, exactly as if no automation existed. The playbook failed silently because nobody checked the Logic App run history, and the analyst had no way to know that enrichment was supposed to happen and did not.

Silent failure is the most damaging automation failure mode because it erodes trust without anyone noticing. The analyst assumes enrichment is running. It is not. When someone eventually discovers the playbook has been failing for three weeks, the response is “see, automation doesn’t work” — and the entire program loses credibility.

Scope-based error handling

The Scope action is a Logic App control that groups multiple actions and provides unified error handling for the group. Think of it as a try-catch block.

Architecture: Create a Scope called “Enrichment Queries” containing all enrichment actions (KQL query, Graph API call, VirusTotal lookup, etc.). Create a second Scope called “Error Handler.” Configure the Error Handler’s “Run After” setting to trigger when “Enrichment Queries” has failed or has timed out. The Error Handler does NOT run when Enrichment Queries succeeds — it only activates on failure.

Inside the Error Handler, add actions that: identify which specific query failed (check the result of each action within the Enrichment Queries scope), compose an error summary, add an incident comment documenting the partial failure, and tag the incident “enrichment-partial.”
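One way to picture the "identify which query failed" step: in the workflow expression language, `result('Enrichment_Queries')` returns one record per action inside the scope, and the handler filters for the ones that did not succeed. The Python sketch below models that filter on sample data. The record fields shown (`name`, `status`, `code`) are a simplified assumption, and the action names are hypothetical, not output captured from a real run.

```python
# Model of the error handler's first step: find which actions inside the
# "Enrichment Queries" scope did not succeed and build a short summary.
# The record shape is a simplified assumption, not real run output.

SAMPLE_SCOPE_RESULT = [
    {"name": "Run_KQL_query", "status": "Succeeded", "code": "OK"},
    {"name": "VirusTotal_lookup", "status": "Failed", "code": "TooManyRequests"},
    {"name": "Get_user_risk", "status": "Skipped", "code": "ActionSkipped"},
]


def failed_actions(scope_result):
    """Return the records whose status is Failed or TimedOut."""
    return [r for r in scope_result if r["status"] in ("Failed", "TimedOut")]


def error_summary(scope_result):
    """Compose the text for the incident comment."""
    failures = failed_actions(scope_result)
    if not failures:
        return "All enrichment actions succeeded."
    parts = [f"{r['name']} ({r['code']})" for r in failures]
    return "Enrichment partially failed: " + ", ".join(parts)


if __name__ == "__main__":
    print(error_summary(SAMPLE_SCOPE_RESULT))
```

Skipped actions are deliberately left out of the summary: they did not fail themselves, they were never reached because an earlier action failed.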

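The key insight: the Error Handler catches the failure and documents it instead of letting the entire playbook crash. After the Error Handler completes, the playbook continues to the next Scope, so the Teams notification and tag update still execute. For this to work on both paths, configure that next Scope to run after the Error Handler “is successful” or “is skipped”: on a clean run the Error Handler is skipped, and an action left on the default “is successful” condition would be skipped along with it. The analyst receives a Teams message with the incident summary (even though enrichment failed) and can see the “enrichment-partial” tag in the incident queue.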

Configuring “Run After.” Every Logic App action has a “Run After” configuration that determines under what conditions the action executes. The options are: “is successful” (default), “has failed,” “is skipped,” and “has timed out.” For the Error Handler Scope, select both “has failed” and “has timed out” for “Enrichment Queries.” The selected conditions are alternatives: the Error Handler runs when either one occurs, so it catches both error types.
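The run-after rule can be sketched as a small Python model. The dictionaries mirror the shape of the `runAfter` property in a workflow definition (predecessor name mapped to a list of accepted statuses). The action names are hypothetical, and the evaluation function is a simplified assumption about the semantics, not the service's implementation.

```python
# Simplified model of "Run After": an action runs only when every
# predecessor finished with one of the statuses listed for it.

# Error handler: fires when the enrichment scope failed OR timed out.
ERROR_HANDLER_RUN_AFTER = {"Enrichment_Queries": ["Failed", "TimedOut"]}

# Follow-on notification step (hypothetical): must accept "Skipped" too,
# because the error handler is skipped on a clean run.
NOTIFY_RUN_AFTER = {"Error_Handler": ["Succeeded", "Skipped"]}


def should_run(run_after, statuses):
    """Return True when each predecessor's status is in its accepted list."""
    return all(
        statuses.get(predecessor) in accepted
        for predecessor, accepted in run_after.items()
    )


if __name__ == "__main__":
    for outcome in ("Succeeded", "Failed", "TimedOut"):
        fires = should_run(ERROR_HANDLER_RUN_AFTER, {"Enrichment_Queries": outcome})
        print(f"Enrichment {outcome}: error handler runs = {fires}")
```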

Retry policies

Retry policies apply to individual actions, not Scopes. Each HTTP action, connector action, or API call can be configured with a retry policy that determines how the action handles transient failures.

Fixed interval retry. The action retries N times with a fixed delay between attempts. Use for APIs that are generally stable but occasionally return errors due to transient load: Microsoft Graph (429 throttle during high-volume periods), Log Analytics queries (503 during workspace maintenance), Sentinel API (occasional timeouts).

Configuration: Retry policy = Fixed, Count = 2, Interval = PT10S (10 seconds). The action tries once, waits 10 seconds, tries again, waits 10 seconds, tries a third time. If all three attempts fail, the action status is “Failed” and the Scope’s error handler activates.
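To make the attempt count concrete, here is a minimal Python sketch of a fixed-interval policy with Count = 2 and a 10-second interval. It models the behavior described above and is not how the Logic Apps runtime is implemented. The `action` callable and the injectable `sleep` parameter are assumptions added so the example can run without waiting.

```python
import time


def run_with_fixed_retry(action, count=2, interval_s=10, sleep=time.sleep):
    """Call action(); on error, retry up to `count` times with a fixed delay.

    Returns (result, attempts). Re-raises the last error if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return action(), attempts
        except Exception:
            if attempts > count:  # initial attempt plus `count` retries used up
                raise
            sleep(interval_s)


if __name__ == "__main__":
    waits = []  # record delays instead of actually sleeping
    state = {"calls": 0}

    def flaky_graph_call():
        # Hypothetical action: fails twice with a transient error, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("503 Service Unavailable")
        return "ok"

    result, attempts = run_with_fixed_retry(flaky_graph_call, sleep=waits.append)
    print(result, attempts, waits)
```

With Count = 2 the action is attempted at most three times, which matches the try, wait, try, wait, try sequence in the configuration above.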

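Exponential backoff retry. The action retries with increasing delays, for example roughly 15 seconds, then 30 seconds, then 60 seconds. The service varies the exact wait within the configured minimum and maximum, so treat these figures as approximate. Use for external APIs with rate limits: VirusTotal (4 requests/minute on free tier), AbuseIPDB (1,000 requests/day), Shodan (1 request/second).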

Configuration: Retry policy = Exponential, Count = 3, Minimum interval = PT15S, Maximum interval = PT60S. The increasing delay gives the API time to recover from rate limiting. Count = 3 means three retries after the initial attempt, so four attempts in total. If the action still fails after the last retry, it moves to the error handler.
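The delay schedule can be worked out with a short Python sketch. This is a deterministic simplification that doubles the minimum interval and caps it at the maximum. The real service randomizes each wait, so the function name and the exact doubling rule are assumptions for illustration only.

```python
def backoff_delays(count=3, minimum_s=15, maximum_s=60):
    """Delay before each retry: doubles from the minimum, capped at the maximum."""
    return [min(minimum_s * 2 ** i, maximum_s) for i in range(count)]


if __name__ == "__main__":
    delays = backoff_delays()
    print("delays between attempts:", delays)
    print("total time spent waiting:", sum(delays), "seconds")
```

Under these assumptions the three retries wait 15, 30, and 60 seconds, for 105 seconds of waiting in total, which is well beyond a one-minute rate-limit window such as VirusTotal's free tier.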

No retry. The action fails immediately on the first error with no retry. Use as the default for containment actions: disabling a user account, isolating an endpoint, revoking sessions. Unlike a read-only enrichment query, a containment action changes state, and a transient error leaves the outcome ambiguous: the first attempt may have succeeded despite returning an error. A blind retry on “disable account” then acts on a state the playbook has not verified. Containment failures need human investigation or an explicit verification step, not blind automated retry.

The dead letter pattern

When an enrichment source fails after all retries, the failure must be recorded and the playbook must continue. This is the dead letter pattern: log the failure, skip the failed enrichment, and proceed with remaining sources.

Implementation: inside the “Enrichment Queries” Scope, wrap each individual enrichment action in its own sub-Scope with a failure branch configured through “Run After” (the Logic App equivalent of a try-catch). If the VirusTotal lookup fails, the playbook captures the error, adds it to a “failures” array variable, and continues to the next enrichment source (Graph user risk, device compliance, etc.). After all enrichment actions complete, the playbook checks the failures array. If it has entries, the enrichment comment includes a section: “The following enrichment sources failed: VirusTotal (429 rate limit). Manual lookup recommended.”

This pattern ensures the analyst receives every enrichment result that succeeded, with clear documentation of what failed and why. Partial enrichment (4 of 5 sources) is far more valuable than no enrichment (playbook crashed on source 2).

Timeout management

The default action timeout in Logic Apps is 2 minutes. For most API calls, 2 minutes is sufficient. For KQL queries against large datasets (30 days of SigninLogs for a user with thousands of sign-ins), the query may need up to 5 minutes. For the overall playbook, you want completion within 60 seconds for enrichment playbooks (the analyst should not wait more than a minute for enrichment to appear).

Set per-action timeouts based on the action type: HTTP calls to Graph = 30 seconds (these are fast). KQL queries = 120 seconds (can be slow for large results). HTTP calls to external TI = 45 seconds (network latency varies). If an action times out, the retry policy handles it — the timeout counts as a failure, and the next retry attempt starts.
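Timeouts and retry policies multiply: if every attempt runs to its timeout, one action's worst case is (retries + 1) × timeout plus the retry delays. The Python sketch below does that arithmetic with the timeouts listed above. The pairing of each action with a retry policy and the deterministic 15/30/60 backoff delays are assumptions made for the calculation.

```python
def worst_case_seconds(timeout_s, retry_delays_s):
    """Upper bound for one action: every attempt times out, every delay is waited."""
    attempts = len(retry_delays_s) + 1
    return attempts * timeout_s + sum(retry_delays_s)


BUDGET_S = 60  # target completion time for an enrichment playbook

ACTIONS = {
    # Fixed policy, 2 retries, 10 seconds apart (assumed pairing).
    "Graph call": worst_case_seconds(30, [10, 10]),
    "KQL query": worst_case_seconds(120, [10, 10]),
    # Exponential policy, 3 retries (assumed deterministic delays).
    "External TI lookup": worst_case_seconds(45, [15, 30, 60]),
}


def over_budget(actions, budget_s):
    """Names of actions whose worst case alone exceeds the budget."""
    return [name for name, seconds in actions.items() if seconds > budget_s]


if __name__ == "__main__":
    for name, seconds in ACTIONS.items():
        print(f"{name}: worst case {seconds} s")
    print("exceeds budget:", over_budget(ACTIONS, BUDGET_S))
```

Under these assumptions every action's worst case is longer than the 60-second target on its own. The target therefore describes the normal path, and the failure path needs its own upper bound.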

For the overall playbook, consider adding a “Terminate” action after 5 minutes of total execution. If enrichment has not completed in 5 minutes, something is fundamentally wrong (workspace outage, API endpoint down). The playbook should terminate with a status of “Failed,” triggering the monitoring analytics rule from SA1.9 to alert the SOC team.

⚠ Compliance Myth: "If the automation fails, we should fall back to the manual process — no error handling needed"

The myth: If the playbook fails, the analyst does the work manually. The automation is a nice-to-have, not a dependency. Error handling is over-engineering.

The reality: The analyst does not know the playbook failed unless error handling tells them. The incident sits in the queue looking like every other incident — the analyst has no visual indicator that enrichment was attempted and failed versus enrichment was not attempted at all. Without error handling, the analyst may assume the lack of enrichment means the incident is low-priority (no enrichment comment = nothing interesting found), when in reality the enrichment playbook crashed before producing any results. Error handling is not over-engineering — it is the mechanism that maintains visibility into automation health.

Decision point: Your containment playbook’s session revocation action returns an HTTP 500 (internal server error) from Microsoft Graph. Should the playbook retry the revocation? The answer is nuanced. A 500 error could mean the request was received but processing failed (the sessions were not revoked), or it could mean the request was received, processing succeeded, but the response generation failed (the sessions WERE revoked but Graph could not confirm it). If you retry and the first attempt actually succeeded, the second attempt revokes sessions that are already revoked — harmless. If the first attempt failed, the retry catches it. For session revocation specifically, a single retry after a 500 error is safe because the action is idempotent (revoking already-revoked sessions has no additional effect). But for account disable, a retry after an ambiguous error is riskier — add a verification step (query account status) before retrying the disable action.
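The verify-before-retry idea for account disable can be sketched in Python. Both callables are hypothetical stand-ins for the real calls (one that disables the account, one that queries its current status). The returned strings are illustrative text for an incident comment, not values from any API.

```python
def disable_with_verification(disable, is_disabled):
    """Try to disable once; after an error, check account state before retrying.

    `disable` and `is_disabled` are hypothetical callables supplied by the caller.
    Returns a short status string suitable for an incident comment.
    """
    try:
        disable()
        return "disabled"
    except Exception:
        # Ambiguous error: the first attempt may have worked anyway.
        if is_disabled():
            return "disabled (first attempt succeeded despite error)"
    try:
        disable()  # single retry, only after confirming it is still needed
        return "disabled on retry"
    except Exception:
        return "failed: escalate to analyst"


if __name__ == "__main__":
    def always_errors():
        raise RuntimeError("500 Internal Server Error")

    # The call errored, but the status check shows the account is disabled.
    print(disable_with_verification(always_errors, lambda: True))
    # The call errored and the account is still enabled: retry, then escalate.
    print(disable_with_verification(always_errors, lambda: False))
```

The retry is limited to one attempt and happens only after the status check, so a persistent failure reaches a human instead of looping.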

Try it: Add error handling to your enrichment playbook

Open the playbook from SA1.4 in the Logic App designer:

1. Select the KQL query action and the incident comment action
2. Click “Add a parallel branch” → add a new action → search “Control” → select “Scope”
3. Move the KQL query and comment actions INTO the Scope (drag and drop or cut/paste)
4. Name the Scope “Enrichment Queries”
5. Add a second Scope called “Error Handler”
6. Configure Error Handler’s “Run After” to fire when “Enrichment Queries” has failed or timed out
7. Inside Error Handler, add an “Add comment to incident” action: “Enrichment failed. Error: [insert error details]. Manual enrichment required.”
8. Test by temporarily removing Log Analytics Reader permission from the managed identity: the KQL query should fail, the error handler should fire, and the incident should receive the error comment
9. Restore the permission after testing

Your enrichment playbook has 5 enrichment queries running inside a Scope. Query 3 (VirusTotal) fails with a 429 rate limit after 3 retries. What happens to queries 4 and 5?

A. They execute normally: Scope actions are independent.
Incorrect. By default, actions inside a Scope run sequentially, and a failed action stops subsequent actions in the sequence. Queries 4 and 5 do NOT execute unless they are in parallel branches.

B. They do not execute if configured sequentially.
Correct. The Scope fails on query 3, and queries 4-5 are skipped. To prevent this, configure queries 3, 4, and 5 as parallel branches within the Scope, or wrap each query in its own sub-Scope with individual error handling. Parallel execution ensures one failure does not block the others.

C. The entire Logic App stops and no output is produced.
Incorrect. The Logic App continues to the Error Handler Scope (configured to run after failure), which produces the error comment and tags the incident. The playbook does not crash; it handles the failure gracefully.

D. The Scope retries all 5 queries from the beginning.
Incorrect. Scopes do not retry. Individual actions retry based on their retry policy, and the Scope's outcome is determined by the actions within it. If any action fails after its own retries, the Scope fails.

Where this goes deeper. SA2 builds the enrichment pipeline with parallel execution and individual error handling per enrichment source — the dead letter pattern applied to production playbooks. SA5-SA7 implement error handling for containment playbooks where the stakes are higher: a failed containment attempt needs immediate analyst attention, not just an incident comment.
