8.9 Connector Troubleshooting and Validation

14-18 hours · Module 8


Introduction

A connector that reports “Connected” in the portal but delivers no data is worse than a connector that reports “Disconnected” — at least the disconnected one tells you there is a problem. Silent failures are the most dangerous connector issue because they create blind spots that only surface during an investigation when the expected data is missing. This subsection teaches systematic troubleshooting for every connector type.


Universal troubleshooting checklist

Before diagnosing connector-specific issues, verify the fundamentals.

Step 1: Is the connector status “Connected”? Sentinel → Data connectors → select the connector → check status. If “Not connected” or “Error,” the connector configuration itself has failed. Re-run the setup steps.

Step 2: Is data arriving in the expected table? Run: TableName | where TimeGenerated > ago(4h) | count. If the count is zero, data is not flowing. If the count is non-zero but lower than expected, data is flowing but may be partially interrupted.

Step 3: Is the data fresh? Run: TableName | summarize LastEvent = max(TimeGenerated). If LastEvent is more than 30 minutes old for a table that normally receives data every few minutes, there is a recent interruption.

Step 4: Has the daily cap been reached? If the workspace has a daily cap configured (not recommended in production — Module 7.3), check whether ingestion has stopped for all tables. Usage → check daily ingestion against the cap.

Step 5: Are there health events? Query: SentinelHealth | where TimeGenerated > ago(24h) | where SentinelResourceType == "Data connector" | where Status != "Success". Health events with failure details point to the specific issue.


Troubleshooting by connector type

Service-to-service connectors (Entra ID, Azure Activity, M365):

Common issues: Licence requirement not met (Entra ID P1/P2 required for sign-in logs), conflicting diagnostic settings (Azure portal → Entra ID → Diagnostic settings may be sending data to a different workspace), permissions (the Sentinel workspace needs appropriate read permissions on the data source), and tenant misconfiguration (the connector is configured for a different tenant than expected).

Diagnostic query:

// Check for data gaps in the last 48 hours
SigninLogs
| where TimeGenerated > ago(48h)
| summarize EventCount = count() by bin(TimeGenerated, 1h)
| extend ExpectedRange = iff(EventCount < 10, "⚠️ LOW", "✓ OK")
| order by TimeGenerated desc

Hourly event counts should be relatively consistent. A sudden drop to zero or near-zero indicates an interruption. The timestamp of the drop pinpoints when the issue started.
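The same binning logic can be sketched in plain Python, which is handy if you export query results and want to post-process them in a script. This is a minimal sketch with synthetic timestamps; the threshold of 10 events per hour mirrors the KQL above and should be tuned to your environment.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_gap_report(timestamps, low_threshold=10):
    # Bin event timestamps into hourly buckets and flag quiet hours,
    # mirroring the KQL bin()/iff() pattern above.
    buckets = Counter(ts.replace(minute=0, second=0, microsecond=0)
                      for ts in timestamps)
    return {hour: ("LOW" if count < low_threshold else "OK")
            for hour, count in sorted(buckets.items())}

# Synthetic sample: 12 events in one hour, then only 2 in the next
base = datetime(2024, 1, 1, 3, 0)
events = [base + timedelta(minutes=5 * i) for i in range(12)]      # 03:00-03:55
events += [base + timedelta(hours=1, minutes=m) for m in (5, 10)]  # 04:05, 04:10
report = hourly_gap_report(events)
```

A sudden transition from "OK" to "LOW" in consecutive buckets gives you the same pinpointed start time as the KQL version.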

Defender XDR connector:

Common issues: Bi-directional sync not enabled (incidents appear in Defender but not Sentinel), Advanced Hunting tables not selected (the connector is connected but specific tables are not ingesting), and duplicate incidents (both sync and analytics rules creating incidents for the same alerts — see subsection 8.3).

Diagnostic query:

// Verify Advanced Hunting table ingestion
union withsource=TableName
    DeviceProcessEvents,
    DeviceNetworkEvents,
    EmailEvents,
    CloudAppEvents
| where TimeGenerated > ago(4h)
| summarize LastEvent = max(TimeGenerated), Events = count() by TableName
| extend Status = iff(datetime_diff('minute', now(), LastEvent) > 60, "⚠️ STALE", "✓ OK")

AMA-based connectors (Windows Security Events, Syslog, CEF):

Common issues: AMA not installed or not running (check agent status on the VM), DCR not associated with the VM (the VM is not in the DCR’s resource list), firewall blocking AMA outbound connections (AMA needs HTTPS access to Azure endpoints), wrong facility/level configuration in DCR (collecting the wrong Syslog facilities), and Syslog daemon misconfiguration (rsyslog not forwarding to the correct port).

AMA network requirements: AMA must be able to reach these Azure endpoints over HTTPS (port 443): *.ods.opinsights.azure.com (data upload), *.oms.opinsights.azure.com (agent management), *.monitoring.azure.com (data collection endpoints), login.microsoftonline.com (authentication). If a firewall or proxy blocks any of these, AMA cannot send data. Check connectivity from the host: curl -v https://<workspace-id>.ods.opinsights.azure.com — a connection refused or timeout indicates a network block.
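If Python is available on the host, the curl check above can be scripted across all required endpoints at once. This is a TCP-level sketch only (it confirms the socket path, not TLS or authentication), and the workspace ID below is a placeholder you must substitute.

```python
import socket

def https_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    # TCP connect check: False suggests a firewall/proxy block or DNS failure.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder workspace ID - substitute your own
workspace_id = "00000000-0000-0000-0000-000000000000"
for host in [
    f"{workspace_id}.ods.opinsights.azure.com",
    f"{workspace_id}.oms.opinsights.azure.com",
    "login.microsoftonline.com",
]:
    print(f"{host}: {'reachable' if https_reachable(host) else 'BLOCKED'}")
```

Any "BLOCKED" result points to the network path, not the agent, as the thing to fix first.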

AMA service status check:

On Windows: open Services → find “Azure Monitor Agent” → check status (Running/Stopped). If stopped, check the Windows Event Log (Application log) for AMA error events.

On Linux: systemctl status azuremonitoragent. If the service is not running: systemctl start azuremonitoragent then check journalctl -u azuremonitoragent -n 50 for error messages.

DCR association verification: Navigate to Azure portal → Monitor → Data Collection Rules → select the DCR → Resources tab. Verify the target VM appears in the resource list. If the VM is missing, add it. If you use Azure Policy for DCR association, run a compliance scan to check whether the policy was applied to the VM.

Diagnostic steps:

// Check AMA heartbeat: is the agent running?
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer == "YOUR-SERVER-NAME"
| summarize LastHeartbeat = max(TimeGenerated)
| extend Status = iff(datetime_diff('minute', now(), LastHeartbeat) > 10, "⚠️ AGENT DOWN", "✓ OK")

Common CEF troubleshooting patterns:

Issue: CommonSecurityLog has events but all structured fields (DeviceVendor, SourceIP, DestinationIP) are empty. Cause: The device is sending plain Syslog, not CEF format. The message lacks the CEF:0| header. Fix: Reconfigure the device to output CEF format.

Issue: CommonSecurityLog has events with DeviceVendor populated but SourceIP is empty. Cause: The device is sending valid CEF but not including the src= extension key. Some devices only include source IP for certain event types. Fix: Check the device’s CEF implementation guide. Configure the device to include all standard extension keys. If the device does not support the key, use a DCR transformation to extract the IP from the Message field.

Issue: Events appear in the Syslog table instead of CommonSecurityLog. Cause: The CEF message is being sent on a Syslog facility that is not configured for CEF parsing in the DCR. By default, CEF parsing applies to messages received on LOG_LOCAL0 through LOG_LOCAL7. Fix: Verify the device sends CEF messages on a LOCAL facility, and the DCR includes that facility in the CEF collection configuration.

Issue: High latency (events arrive in Sentinel 30+ minutes after the device generates them). Cause: The log forwarder VM is overloaded (CPU/memory), the Syslog daemon has a large queue, or the AMA upload buffer is full. Fix: Check forwarder VM resource utilisation (top, free -m). If CPU > 80% or memory > 90%, the forwarder needs more resources. Consider deploying a second forwarder behind a load balancer.
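The first two CEF issues above come down to whether a raw message captured on the forwarder actually carries a well-formed CEF header and the expected extension keys. A minimal parser sketch can triage a sample message offline; note this deliberately ignores CEF escaping rules and extension values containing spaces, so treat it as a sanity check, not a validator.

```python
def parse_cef(raw: str):
    # Returns header fields and extension keys, or None if the message
    # is not CEF (the "plain Syslog" failure mode described above).
    marker = raw.find("CEF:")
    if marker == -1:
        return None  # no CEF header: AMA leaves structured fields empty
    parts = raw[marker + len("CEF:"):].split("|", 7)
    if len(parts) < 8:
        return None  # malformed header
    version, vendor, product, dev_version, sig_id, name, severity, ext = parts
    extensions = {}
    for token in ext.split():
        if "=" in token:
            key, _, value = token.partition("=")
            extensions[key] = value
    return {"DeviceVendor": vendor, "DeviceProduct": product,
            "Name": name, "Severity": severity, "Extensions": extensions}

# Hypothetical Palo Alto-style sample for illustration
sample = ("<134>Feb 10 12:00:00 fw01 CEF:0|Palo Alto Networks|PAN-OS|10.2|"
          "traffic|TRAFFIC|3|src=10.0.0.5 dst=203.0.113.9 spt=49152")
parsed = parse_cef(sample)
```

If `parse_cef` returns None for your sample, the device is not emitting CEF; if it parses but `src` is missing from the extensions, you have the missing-extension-key issue.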


Systematic troubleshooting flowchart

For any connector with missing data, follow this sequence:

Step 1: Is the connector configured? Check Sentinel → Data connectors → status. If “Not connected,” the connector is not set up. Follow the configuration steps in the relevant subsection.

Step 2: Is the agent running (AMA-based connectors)? Check Heartbeat table. If no heartbeat, the agent is down. Restart it. If the agent is not installed, deploy it.

Step 3: Is the DCR correct and associated? Check the DCR in Azure Monitor. Is the target VM in the Resources list? Is the data source configured correctly? Is the destination workspace correct?

Step 4: Is the network path clear? Can AMA reach Azure endpoints? Can CEF devices reach the forwarder? Use curl and telnet to test connectivity.

Step 5: Is data arriving but in the wrong table? CEF data in the Syslog table instead of CommonSecurityLog indicates a parsing issue. Custom API data in the wrong custom table indicates a DCR routing issue.

Step 6: Is the daily cap reached? If all tables stop receiving data at the same time, the workspace daily cap has triggered. Check Usage and estimated costs.

If the Heartbeat query returns no results, AMA is not running on the target machine. Check the agent service status on the VM: systemctl status azuremonitoragent (Linux) or check the “Azure Monitor Agent” service in Windows Services.

Custom logs (Logs Ingestion API):

Common issues: Authentication failure (service principal expired or missing role assignment), DCR schema mismatch (the JSON payload does not match the DCR schema), DCE unreachable (network configuration blocks HTTPS to the DCE endpoint), and rate limiting (the API has ingestion rate limits — check for HTTP 429 responses).

Diagnostic: Check the DCR metrics in Azure Monitor → Data Collection Rules → select the DCR → Metrics → look for “Logs Ingestion Requests” (success and failure counts). Failed requests indicate authentication or schema issues.
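For the rate-limiting case specifically, the client should back off and retry on HTTP 429. The official azure-monitor-ingestion SDK handles retries for you, but if you post to the API directly, the pattern looks like this sketch. The `send` callable here is a stand-in for whatever HTTP client you use; the stub sender and payload are illustrative only.

```python
import time

def upload_with_backoff(send, payload, max_retries=4, base_delay=1.0,
                        sleep=time.sleep):
    # Retry a Logs Ingestion API call on HTTP 429 with exponential backoff.
    # `send` is any callable taking the payload and returning a status code.
    for attempt in range(max_retries + 1):
        status = send(payload)
        if status == 429 and attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s...
            continue
        return status
    return status

# Stub sender: rate-limited twice, then the batch is accepted
responses = iter([429, 429, 204])
status = upload_with_backoff(lambda p: next(responses),
                             [{"RawData": "example"}],
                             sleep=lambda s: None)
```

Persistent 429s after backoff mean you are exceeding the service limits and should batch larger payloads or spread ingestion over time.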


Building the connector health dashboard

Combine the diagnostic queries into a single health monitoring workbook (or use the shift-start health check from Module 7.11).

// Comprehensive connector health: all tables in one query
let ExpectedTables = datatable(TableName:string, ExpectedFrequency:string) [
    "SigninLogs", "Minutes",
    "AuditLogs", "Minutes",
    "SecurityAlert", "Hours",
    "AzureActivity", "Minutes",
    "SecurityEvent", "Minutes",
    "CommonSecurityLog", "Minutes",
    "Syslog", "Minutes"
];
union withsource=TableName *
| where TimeGenerated > ago(4h)
| summarize LastEvent = max(TimeGenerated), EventCount = count() by TableName
| join kind=inner ExpectedTables on TableName
| extend MinutesSinceLastEvent = datetime_diff('minute', now(), LastEvent)
| extend Status = case(
    MinutesSinceLastEvent < 30, "✓ Healthy",
    MinutesSinceLastEvent < 120, "⚠️ Delayed",
    "❌ Down")
| project TableName, Status, MinutesSinceLastEvent, EventCount, ExpectedFrequency
| order by MinutesSinceLastEvent desc

This single query shows the health of every connector in one view. Any table showing “⚠️ Delayed” or “❌ Down” requires immediate investigation using the connector-specific diagnostic steps above.
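If you export the workbook results and post-process them in a script, the KQL case() thresholds translate directly. A trivial mirror, assuming the same 30/120-minute boundaries:

```python
def connector_status(minutes_since_last_event: int) -> str:
    # Same thresholds as the KQL case() expression above.
    if minutes_since_last_event < 30:
        return "Healthy"
    if minutes_since_last_event < 120:
        return "Delayed"
    return "Down"
```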


Validation after initial connector deployment

When you first deploy a connector, validate it thoroughly before trusting it for production detection.

Validation 1: Data completeness. Compare the event count in Sentinel with the event count at the source. If the source generated 10,000 events in the last hour but Sentinel received 8,000, there is a 20% loss — investigate the forwarder, network, or DCR transformation.

Validation 2: Field correctness. Verify that key fields are populated and contain expected values. Run TableName | where TimeGenerated > ago(1h) | summarize EmptyPct = 100.0 * countif(isempty(SourceIP)) / count() to check the percentage of events with an empty SourceIP (multiplying by 100.0 forces floating-point division; countif() and count() both return integers, so dividing them directly would truncate the percentage to zero). A high percentage indicates a parsing issue.

Validation 3: Latency measurement. Measure the delay between event generation and availability in Sentinel: TableName | where TimeGenerated > ago(1h) | extend Latency = ingestion_time() - TimeGenerated | summarize AvgLatency = avg(Latency), P95Latency = percentile(Latency, 95). Latency over 15 minutes for security data warrants investigation.
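To make sense of the P95 figure, it helps to see the calculation on a small sample. This nearest-rank sketch uses hypothetical latency values in seconds; KQL's percentile() uses an approximate algorithm, but for a validation-sized sample the exact version below is equivalent in spirit.

```python
def percentile(values, p):
    # Nearest-rank percentile: the value at ceil(n * p / 100) in sorted order.
    ordered = sorted(values)
    rank = max(0, -(-len(ordered) * p // 100) - 1)  # ceil via negated floor div
    return ordered[int(rank)]

# Hypothetical latency samples, in seconds; one outlier event
latencies = [12, 15, 18, 20, 25, 30, 45, 60, 90, 600]
avg = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
```

Note how one slow event dominates both the average and the P95: that is exactly why you track P95 rather than trusting the average alone.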

Validation 4: Analytics rule compatibility. After the connector is delivering data, enable the analytics rules that query the new table. Run each rule’s KQL manually to verify it returns expected results against the real data. A rule designed for CEF data from Palo Alto may not work correctly with CEF data from Fortinet — vendor-specific field differences can cause rule failures.


Real-world failure scenarios and resolution

These are the connector failures that occur in production environments. Each is drawn from common operational experience.

Scenario 1: “SigninLogs stopped flowing at 03:00 but nobody noticed until 09:00.” Root cause: Microsoft performed maintenance on the Entra ID diagnostic pipeline. Data delivery paused for 2 hours, then resumed with a backfill. The events from 03:00-05:00 arrived at 05:30 with their original TimeGenerated timestamps. Impact: no data was lost, but analytics rules that ran between 03:00 and 05:30 evaluated an empty or partial dataset — potentially missing detections. Resolution: Microsoft service health incidents are published at status.azure.com. Subscribe to Service Health alerts in the Azure portal to receive notifications when Microsoft services experience issues that affect data delivery. Prevention: the shift-start health check (Module 7.11) would have caught this at 07:00 if the night shift had one.

Scenario 2: “CEF data is flowing but all events show DeviceVendor=‘Unknown’ since yesterday.” Root cause: a firewall firmware update changed the CEF output format. The new firmware uses a slightly different CEF header that AMA does not parse correctly. Resolution: compare a sample raw Syslog message (from the forwarder’s local log) with the CEF specification. If the format changed, adjust the device’s Syslog output configuration, or create a workspace transformation that corrects the parsing. Prevention: after any device firmware update, verify CEF data quality within 24 hours.

Scenario 3: “SecurityEvent volume doubled overnight without any configuration change.” Root cause: a Windows Group Policy change enabled “All Events” collection on 50 servers that were previously set to “Common.” The GPO change was made by the Windows team without notifying the SOC. Resolution: identify the GPO change (SecurityEvent | where TimeGenerated > ago(2d) | where EventID == 4719 | project TimeGenerated, Computer, Account, Activity). Revert the GPO or update the DCR to “Common.” Prevention: include the SOC team in change advisory board reviews for GPO changes that affect audit policy.

Scenario 4: “Custom log ingestion started failing with HTTP 403 at midnight.” Root cause: the service principal certificate used for API authentication expired. The certificate had a 1-year validity and was not renewed. Resolution: renew the certificate, update the secret in the ingestion script, and verify data flow resumes. Prevention: set a calendar reminder 30 days before certificate expiration. Better: use a managed identity (which has no expiration) instead of a service principal with certificate.


Automated alerting for connector failures

Do not rely on the shift-start health check alone — some failures need immediate notification.

Analytics rule: connector data gap detection. Create a scheduled analytics rule that runs every 30 minutes and checks for tables with no data in the last 2 hours:

// Connector failure detection rule: fires when any critical table goes silent
let CriticalTables = datatable(TableName:string) [
    "SigninLogs", "SecurityAlert", "AuditLogs",
    "DeviceProcessEvents", "EmailEvents"
];
let RecentData = union withsource=TableName *
| where TimeGenerated > ago(2h)
| summarize LastEvent = max(TimeGenerated) by TableName;
CriticalTables
| join kind=leftanti RecentData on TableName
| extend AlertDetail = strcat("Critical table ", TableName,
    " has received no data in the last 2 hours")

Configure this rule to generate a high-severity incident. Assign it to the Sentinel administrator (not the general SOC queue). The alert fires when any critical table has zero events for 2 hours — catching connector failures, daily cap triggers, and service outages.
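The leftanti join at the heart of the rule is just a set difference: critical tables minus tables that produced data in the window. In plain Python, with illustrative table names:

```python
def silent_tables(critical, tables_with_recent_data):
    # Equivalent of the KQL leftanti join: critical tables with no recent rows.
    return sorted(set(critical) - set(tables_with_recent_data))

critical = ["SigninLogs", "SecurityAlert", "AuditLogs"]
recent = {"SigninLogs", "AuditLogs", "Syslog"}
missing = silent_tables(critical, recent)
```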

Automation rule: notify on connector failure. Attach an automation rule that triggers a Logic Apps playbook when the connector failure incident is created. The playbook sends a Teams notification to the SOC channel and an email to the Sentinel administrator. This ensures the failure is noticed immediately — not at the next shift-start check.

Try it yourself

Run the comprehensive connector health query against your workspace. Identify any tables showing "Delayed" or "Down" status. For each unhealthy table, use the connector-specific diagnostic steps to identify the root cause. Even if all connectors are healthy, run the latency measurement query to establish your baseline — knowing normal latency helps you detect abnormal latency during future investigations.

What you should observe

In a healthy lab, all connected tables should show "Healthy" status with LastEvent within the last 30 minutes. Tables for unconnected sources will not appear in the results (they have no data to report). If a table shows "Delayed," check the corresponding connector configuration and agent status.


Knowledge check

Check your understanding

1. SecurityEvent data stopped arriving 3 hours ago. The connector status shows "Connected" in the portal. AMA Heartbeat shows the agent is running. What do you check next?

Check the Data Collection Rule association. The DCR may have been modified or disassociated from the VM. Navigate to the DCR in Azure Monitor and verify the VM is still listed in the Resources. Also check if the DCR's collection configuration was changed (someone may have accidentally set the collection level to "None" or removed the Security event source). If the DCR is correct, check the workspace daily cap — if it was reached, all ingestion stops.
Restart the Sentinel workspace
Reinstall AMA
Wait — data will resume automatically

2. After deploying a CEF connector, CommonSecurityLog shows events but the DeviceVendor column is empty for all events. What is the cause?

The device is sending plain Syslog, not CEF-formatted messages. AMA detects CEF by the header format (CEF:0|Vendor|Product|...). If the header is missing or malformed, AMA writes the event to CommonSecurityLog but cannot parse the structured fields — resulting in empty DeviceVendor, DeviceProduct, and other CEF-specific columns. Check the device's log output format configuration and ensure CEF is selected (not plain Syslog or LEEF or another format).
The DCR is misconfigured
AMA needs to be updated
The CommonSecurityLog table is on Basic tier