1.7 Error Handling and Retry Logic

5 hours · Module 1 · Free

What you already know

Section 1.4 built the enrichment playbook with KQL queries, Graph API calls, and incident comments. Section 1.5 configured managed identity permissions. Section 1.6 covered entity extraction. All of these assumed every action succeeds. This section addresses what happens when they do not: API timeouts, rate limits, expired tokens, and service outages. The error handling patterns here apply to every playbook in this course.

Scenario

The VirusTotal API returns HTTP 429 (rate limit exceeded) on the third action of a ten-action enrichment playbook. Without error handling, actions four through ten never execute. The incident receives no enrichment comment, no severity update, no Teams notification. The analyst opens the incident at 07:00 and sees a raw alert with no context. Three hours of enrichment coverage were lost because a single API rate limit stopped the entire workflow. Nobody knew until the analyst wondered why enrichment had stopped appearing on overnight incidents.

The silent failure problem

A Logic App without error handling stops on the first unhandled failure. When an action's HTTP call returns an error status (a 429 or 5xx that outlasts its retries, or a 4xx that is never retried) or times out, that action enters the Failed or TimedOut state. Every subsequent action in the workflow is Skipped because the default Run After condition requires the previous action to have Succeeded. The playbook run status shows Failed in the run history, but nobody monitors run history at 02:00. The incident sits in the queue with no enrichment, no tag, no comment. The analyst who picks it up has no way to distinguish between "enrichment ran and found nothing interesting" and "enrichment never ran at all."

This is the most damaging automation failure mode because it degrades trust invisibly. The SOC operates on the assumption that enrichment is running. When the assumption is wrong and nobody discovers it for days or weeks, the response is predictable: leadership questions whether automation works, and the program loses the credibility it took months to build. Error handling is not defensive programming for edge cases. It is the mechanism that keeps automation visible and accountable.

Figure 1.7a: The try-catch-finally pattern using Scope actions. The Catch scope runs only when the Try scope fails or times out. The Finally scope runs regardless of outcome, ensuring the analyst always receives a Teams notification.

The Scope action and Run After conditions

The Scope action groups multiple Logic App actions into a single container with a combined status. After all actions inside a Scope finish running, the Scope itself receives one of four statuses: Succeeded (all actions completed), Failed (at least one action failed), Cancelled (a cancellation was requested), or TimedOut (the scope exceeded its time limit). This combined status is what makes error handling possible. Instead of checking each action individually, you check the Scope status and branch accordingly.

The Run After configuration controls when an action or scope executes based on the outcome of the preceding action or scope. Every action has a Run After setting with four conditions: is successful, has failed, is skipped, and has timed out. By default, actions run only after the previous action succeeds. To build error handling, you change the Run After conditions on the Catch scope to fire when the Try scope has failed or has timed out. The Catch scope does not run when the Try scope succeeds. It activates only on failure.

The third scope in the pattern is the Finally scope. Configure its Run After to fire after the Catch scope on any status: Succeeded, Failed, Skipped, or TimedOut. If the Try scope succeeded, the Catch scope was Skipped, and the Finally scope runs after the Skipped status. If the Try scope failed, the Catch scope ran (Succeeded or Failed itself), and the Finally scope runs after whatever status the Catch scope produced. Either way, the Finally scope executes. This is where you place actions that must happen regardless of enrichment outcome: Teams notification, incident tag update, execution logging.

Building the try-catch pattern

Create a Scope action in the Logic App designer and name it "Enrichment Queries." This is the Try scope. Move the KQL query, Graph API call, and VirusTotal HTTP action inside this scope. These are the actions that may fail. Create a second Scope named "Error Handler" immediately after the first. This is the Catch scope. Click the three dots on the Error Handler scope, select Configure Run After, uncheck "is successful," and check "has failed" and "has timed out." The Error Handler now activates only when the Enrichment Queries scope fails or times out.
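In code view, the Run After settings appear as a runAfter property on each scope. The fragment below is a minimal sketch of the three scopes with their inner actions left empty. Enrichment_Queries and Error_Handler are the designer names with spaces replaced by underscores. Finally_Notify is a hypothetical name chosen for this illustration, since the section does not name the Finally scope.

```json
{
  "Enrichment_Queries": {
    "type": "Scope",
    "actions": {},
    "runAfter": {}
  },
  "Error_Handler": {
    "type": "Scope",
    "actions": {},
    "runAfter": {
      "Enrichment_Queries": ["Failed", "TimedOut"]
    }
  },
  "Finally_Notify": {
    "type": "Scope",
    "actions": {},
    "runAfter": {
      "Error_Handler": ["Succeeded", "Failed", "Skipped", "TimedOut"]
    }
  }
}
```

Listing all four statuses on Finally_Notify is what lets it run whether Error_Handler executed or was skipped.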

Inside the Error Handler scope, you need to identify which specific action failed and extract the error details. The result() function returns the results of all actions within a named scope, including their status and error messages. Expressions reference an action by its name with spaces replaced by underscores, so the "Enrichment Queries" scope becomes Enrichment_Queries. Add a Filter Array action that filters result('Enrichment_Queries') where item()?['status'] equals "Failed". This gives you an array of only the failed actions. Add a Compose action that formats the error details into a readable message. Then add an "Add comment to incident" action that posts the error summary to the incident timeline.
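As a rough sketch, the Filter Array and Compose steps might look like this in code view. The action names Filter_Failed_Actions and Compose_Error_Summary are hypothetical, and the error property path assumes the result shape shown in the sample output in this section. Some connectors report failure details under outputs instead, so check the run history of a failed run before relying on this path.

```json
{
  "Filter_Failed_Actions": {
    "type": "Query",
    "inputs": {
      "from": "@result('Enrichment_Queries')",
      "where": "@equals(item()?['status'], 'Failed')"
    },
    "runAfter": {}
  },
  "Compose_Error_Summary": {
    "type": "Compose",
    "inputs": "Failed action: @{first(body('Filter_Failed_Actions'))?['name']} | Error: @{first(body('Filter_Failed_Actions'))?['error']?['message']}",
    "runAfter": {
      "Filter_Failed_Actions": ["Succeeded"]
    }
  }
}
```

This sketch reports only the first failed action. If several actions can fail in one run, a loop or a Select action over the filtered array would be needed to list them all.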

CLI Output

// Logic App Run History — Error Handler scope output
// Incident: NE-2026-04281 (Brute Force — Multiple Failed Sign-ins)
// Playbook: SA-Enrich-AccountRisk
// Run ID: 08585924410694835632...
//
// result('Enrichment_Queries') — filtered for Failed actions:
[
  {
    "name": "Query_VirusTotal_IP",
    "status": "Failed",
    "code": "ActionResponseSkipped",
    "error": {
      "code": "429",
      "message": "Rate limit exceeded. Retry after 60 seconds."
    },
    "startTime": "2026-04-28T02:14:33.812Z",
    "endTime": "2026-04-28T02:15:04.219Z"
  }
]
//
// Error Handler composed message (posted as incident comment):
//
// ⚠ ENRICHMENT PARTIAL — SA-Enrich-AccountRisk
// Failed action: Query_VirusTotal_IP
// Error: 429 — Rate limit exceeded. Retry after 60 seconds.
// Succeeded actions: Query_SigninLogs, Query_UserRisk
// Timestamp: 2026-04-28T02:15:04Z
//
// Partial enrichment applied. VirusTotal IP reputation unavailable.
// Manual lookup required for IP 185.220.101.42.

The incident comment tells the analyst three things: which action failed and why (VirusTotal rate limit), which actions succeeded (SigninLogs and UserRisk enrichment is available), and what manual step remains (look up the IP in VirusTotal directly). Compare this to the alternative: no error handling, no comment, and the analyst has no idea enrichment was attempted.

Retry policies

Before an action reaches the Failed state and triggers the Catch scope, Logic Apps can retry the action automatically. Every HTTP action and connector action supports a configurable retry policy that determines how transient failures are handled.

Logic Apps provides four retry policy types. The Default policy is an exponential interval that sends up to 4 retries at exponentially increasing intervals. These intervals scale by 7.5 seconds and are capped between 5 and 45 seconds. This is what every action uses unless you explicitly configure something different. The None policy disables retries entirely. The action fails immediately on the first error. The Fixed Interval policy retries a specified number of times with a constant delay between attempts. The Exponential Interval policy retries with exponentially increasing delays, selected randomly from a growing range.

Retries are triggered by specific HTTP status codes: 408 (Request Timeout), 429 (Too Many Requests), and 5xx server errors (500, 502, 503, 504). A 401 (Unauthorized) or 403 (Forbidden) does not trigger a retry because these are authentication and authorization errors that will not resolve on their own. A 404 (Not Found) does not trigger a retry because the resource does not exist. Understanding which status codes trigger retries matters because it determines whether your playbook waits and retries (transient failure) or fails immediately and enters the Catch scope (permanent failure).

JSON

{
  "Query_VirusTotal_IP": {
    "type": "Http",
    "inputs": {
      "method": "GET",
      "uri": "https://www.virustotal.com/api/v3/ip_addresses/@{...}",
      "headers": {
        "x-apikey": "@{body('Get_VT_API_Key')?['value']}"
      },
      "retryPolicy": {
        "type": "exponential",
        "count": 3,
        "interval": "PT15S",
        "minimumInterval": "PT10S",
        "maximumInterval": "PT2M"
      }
    }
  },
  "Query_SigninLogs": {
    "type": "ApiConnection",
    "inputs": {
      "host": { "connection": { "name": "@parameters('$connections')['azuremonitorlogs']..." } },
      "method": "post",
      "body": "SigninLogs | where UserPrincipalName == '@{...}' | ...",
      "retryPolicy": {
        "type": "exponential",
        "count": 4,
        "interval": "PT10S",
        "minimumInterval": "PT5S",
        "maximumInterval": "PT1M"
      }
    }
  },
  "Revoke_User_Sessions": {
    "type": "Http",
    "inputs": {
      "method": "POST",
      "uri": "https://graph.microsoft.com/v1.0/users/@{...}/revokeSignInSessions",
      "retryPolicy": {
        "type": "none"
      }
    }
  }
}

Three different retry configurations for three different action types. The VirusTotal lookup uses exponential backoff with a 15-second initial interval because the free tier rate-limits at 4 requests per minute. Longer initial delays reduce the chance of hitting the limit again on retry. The KQL query uses an explicit exponential policy with the same retry count as the default (4) and a 10-second interval because Log Analytics transient errors (503 during workspace operations) typically recover within seconds. The session revocation action uses the None policy because retrying a write operation that may have partially completed creates ambiguity about the state of the target. If the first attempt succeeded but the response came back as a 500, a retry would revoke sessions that are already revoked. For session revocation that is harmless because the operation is idempotent, so None here is a conservative choice rather than a requirement. For other containment actions, such as account disable, a retry could produce unexpected state.

Retry policy design for security playbooks

The retry count and interval should reflect what the action calls and how time-sensitive the playbook's outcome is. Enrichment playbooks can tolerate retries over a few minutes. Containment playbooks cannot wait minutes for a retry when an attacker is actively exfiltrating data.

For Microsoft API calls (Graph API, Log Analytics), use exponential with count 4, interval PT10S, maximum PT1M. Microsoft services recover from transient issues quickly. Four retries over approximately 90 seconds covers brief throttling periods and workspace maintenance windows. For third-party TI APIs (VirusTotal, AbuseIPDB, Shodan), use exponential with count 3, interval PT15S, maximum PT2M. Third-party services have stricter rate limits, and aggressive retries make throttling worse. Fewer retries with longer intervals give the rate limit window time to reset. For containment actions (session revocation, device isolation), either use None (if the action is not safely retryable) or Fixed with count 2, interval PT5S (if the action is idempotent). Containment is time-critical. A fixed 5-second interval is predictable, and two retries provide coverage for transient network errors without the variable delays of exponential backoff.
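The earlier JSON shows the exponential and None policies but not the Fixed option. The fragment below is a minimal sketch of the Fixed alternative applied to the session revocation action, assuming you have decided to treat that call as idempotent. The URI is copied from the earlier example with its elided expression left as-is.

```json
{
  "Revoke_User_Sessions": {
    "type": "Http",
    "inputs": {
      "method": "POST",
      "uri": "https://graph.microsoft.com/v1.0/users/@{...}/revokeSignInSessions",
      "retryPolicy": {
        "type": "fixed",
        "count": 2,
        "interval": "PT5S"
      }
    }
  }
}
```

With this policy the retries add at most about 10 seconds of waiting to the containment step, not counting the time each request itself takes.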

Set per-action timeouts appropriate to the action type. HTTP calls to Graph API should time out at 30 seconds. KQL queries against large tables may need 120 seconds. HTTP calls to external TI APIs should time out at 45 seconds. If an action times out, the retry policy activates. The timeout counts as a failure, and the next retry attempt starts with a fresh timeout window.

Using the Default retry policy on every action

The Default policy (4 retries, exponential, 5-45 second range) is reasonable for most connector actions, but it is a poor fit for two categories. External TI APIs with strict rate limits receive 4 retry attempts when 3 would have been sufficient, and each additional retry during a rate limit window risks prolonging the throttling. Containment actions receive 4 retries with variable delays when they need either zero retries (non-idempotent write operations) or exactly 2 retries with fixed timing (idempotent operations where speed matters). Configure retry policies explicitly on every HTTP action in your playbooks. The 30 seconds you spend configuring the policy saves hours of debugging when the playbook behaves unpredictably during a real incident.

Graceful degradation

The try-catch pattern prevents the playbook from stopping. But the analyst still needs the enrichment data that did arrive. If the VirusTotal lookup failed but the KQL query and Graph API risk check succeeded, the incident comment should contain the sign-in analysis and risk level. Partial enrichment is substantially better than no enrichment.

Build the enrichment comment in stages. Initialize a string variable "EnrichmentOutput" at the start of the playbook. After each successful enrichment action, append its results to the variable. The KQL query succeeds: append the sign-in summary. The Graph API risk check succeeds: append the risk level. The VirusTotal lookup fails: the variable still contains the sign-in summary and risk level. In the Finally scope, post the contents of "EnrichmentOutput" as the incident comment. If all actions succeeded, the comment contains full enrichment. If one action failed, the comment contains partial enrichment. If all actions failed, the variable is empty, and the Catch scope's diagnostic comment is the only output.
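The staged approach can be sketched as follows. The variable is initialized at the top level of the workflow, because Logic Apps does not allow variables to be initialized inside a scope, and each append runs only after its source action succeeds. Append_Signin_Summary is a hypothetical action name, the appended text is a placeholder format, and the Query_SigninLogs action itself is omitted, so this fragment is illustrative rather than deployable as written.

```json
{
  "Initialize_EnrichmentOutput": {
    "type": "InitializeVariable",
    "inputs": {
      "variables": [
        { "name": "EnrichmentOutput", "type": "string", "value": "" }
      ]
    },
    "runAfter": {}
  },
  "Enrichment_Queries": {
    "type": "Scope",
    "actions": {
      "Append_Signin_Summary": {
        "type": "AppendToStringVariable",
        "inputs": {
          "name": "EnrichmentOutput",
          "value": "Sign-in summary: @{body('Query_SigninLogs')}\n"
        },
        "runAfter": {
          "Query_SigninLogs": ["Succeeded"]
        }
      }
    },
    "runAfter": {
      "Initialize_EnrichmentOutput": ["Succeeded"]
    }
  }
}
```

Because the variable lives outside the scope, whatever was appended before a failure is still available when the Finally scope posts the comment.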

Tag the incident to signal the enrichment status. Add an "enrichment-complete" tag when the Try scope succeeds. Add an "enrichment-partial" tag when the Catch scope fires. These tags are queryable in Sentinel and feed the monitoring analytics rule in Section 1.9. When the SOC lead reviews the incident queue, the tags provide immediate visibility: which incidents have full enrichment, which have partial, and which had failures requiring attention.

The dead letter pattern extends graceful degradation to unrecoverable failures. If all enrichment actions fail after exhausting their retry policies, the Catch scope writes a detailed diagnostic comment and tags the incident "enrichment-failed." The monitoring rule in Section 1.9 detects these tags and alerts the automation owner. The failed incident is the dead letter: it was not processed successfully, but it was documented, tagged, and escalated. Nobody discovers the failure by accident three weeks later.

Automation Principle

A playbook that fails silently is worse than no playbook at all. No playbook means the analyst knows they are working manually. A silently failing playbook means the analyst assumes enrichment ran and found nothing, when enrichment never ran. Every playbook in this course uses the try-catch-finally pattern: the Try scope contains enrichment actions, the Catch scope documents failures to the incident, and the Finally scope delivers whatever data arrived. The analyst always knows what happened.

Section 1.8 covers testing automation safely: how to validate playbook logic without triggering containment actions on real infrastructure, the development/production flag pattern that gates destructive actions, test incident creation, and the Logic App run history as a debugging tool.
