6.6 Connector Troubleshooting

75 minutes · Module 6

By the end of this subsection, you will be able to diagnose and fix the five most common connector failures using systematic troubleshooting and KQL diagnostic queries.

Connectors fail silently. The workspace does not alert you when data stops flowing — you discover the failure when an investigation returns empty results or an analytics rule stops firing. This subsection teaches you to find and fix failures before they create blind spots.

The universal diagnostic approach

Every connector failure falls into one of three zones. Determine which zone first, then drill down.

Zone 1: Source device — Is the device generating and sending logs?
Zone 2: Transport layer — Is the data reaching the forwarder/ingestion endpoint?
Zone 3: Workspace — Is the data being accepted, transformed, and stored?

Start at Zone 1, work forward. This prevents the common mistake of debugging the workspace when the firewall stopped sending logs.
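
The ordering above can be sketched as a tiny triage loop. This is only an illustration: the check functions are hypothetical stand-ins for the manual steps (tcpdump, rsyslog status, AMA health) described in the rest of this subsection.

```python
# Hypothetical triage helper: run the zone checks in order and report
# the first zone that fails. Each check stands in for a manual step.
def triage(checks):
    """checks: ordered list of (zone_name, check_fn) pairs."""
    for zone, check in checks:
        if not check():
            return zone  # first failing zone is where to drill down
    return "healthy"

# Example: Zone 1 passes, Zone 2 fails -> debug the transport layer first.
result = triage([
    ("Zone 1: source device", lambda: True),
    ("Zone 2: transport layer", lambda: False),
    ("Zone 3: workspace", lambda: True),
])
print(result)  # Zone 2: transport layer
```

The point of the ordering is cheap checks first: a failed Zone 1 check makes every downstream check irrelevant.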

Problem 1: Connector shows “Connected” but no data

Symptom: Green status in the connector page. Target table returns zero events.

Zone 1 check — is the device sending?

For Syslog/CEF sources, SSH to the forwarder:

# Capture ten packets on the Syslog port:
sudo tcpdump -i any port 514 -c 10
# If packets appear: data is reaching the forwarder. Problem is Zone 2 or 3.
# If no packets: the source device is not sending. Check device config.

Zone 2 check — is rsyslog processing?

# Check that rsyslog is running and processing CEF:
systemctl status rsyslog
tail -f /var/log/syslog | grep CEF
# If rsyslog is running and CEF messages appear: problem is the AMA (Zone 3).
# If rsyslog is stopped: sudo systemctl restart rsyslog

Zone 3 check — is the AMA connected?

Check Azure Portal → Monitor → Data Collection Rules → verify the forwarder VM is listed. Check AMA agent health: sudo /opt/microsoft/azuremonitoragent/bin/mdsd --version on the forwarder. If AMA is not running, restart: sudo systemctl restart azuremonitoragent.

For Microsoft first-party connectors (no forwarder), check:

  • Entra ID: Are diagnostic settings still configured? (Settings can be deleted during admin changes)
  • M365 Defender: Is the connector still enabled in the Sentinel portal? (Portal updates can reset connector state)
  • Azure Activity: Is the subscription still selected?

Problem 2: Data flowing but with high latency

Symptom: Data arrives 10-30 minutes after the event occurred.

CommonSecurityLog
| where TimeGenerated > ago(1h)
| extend IngestionDelay = ingestion_time() - TimeGenerated
| summarize
    AvgDelay = round(avg(IngestionDelay) / 1m, 0),
    P95Delay = round(percentile(IngestionDelay, 95) / 1m, 0),
    MaxDelay = round(max(IngestionDelay) / 1m, 0)
Expected Output

AvgDelay (min)   P95Delay (min)   MaxDelay (min)
18               32               47
What to look for: normal latency is an average under 5 minutes and a P95 under 10 minutes, so the 18-minute average above is a problem.
Common causes: forwarder CPU saturation (check with top), a disk I/O bottleneck (rsyslog buffering), an outdated AMA version, or network congestion between the forwarder and Azure.
Fix: upgrade the AMA, increase the VM's resources, or move the forwarder to the same Azure region as the workspace.
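
If you export per-event delays from a query like the one above, the same statistics can be reproduced offline. A minimal sketch, using a nearest-rank 95th percentile (which may differ slightly from KQL's percentile()):

```python
# Compute average / P95 / max ingestion delay from a list of per-event
# delays in minutes (e.g. exported from the latency query above).
def delay_stats(delays_min):
    s = sorted(delays_min)
    # Nearest-rank 95th percentile (approximates KQL's percentile(x, 95)).
    p95 = s[min(len(s) - 1, int(0.95 * len(s)))]
    return {"avg": round(sum(s) / len(s)), "p95": p95, "max": s[-1]}

sample = [2, 3, 5, 8, 12, 18, 25, 32, 40, 47]
print(delay_stats(sample))  # {'avg': 19, 'p95': 47, 'max': 47}
```

Watching how the P95 and max diverge from the average over time tells you whether latency is systemic (everything slow) or bursty (occasional backlog flushes).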

Problem 3: Duplicate events

Symptom: The same event appears multiple times.

CommonSecurityLog
| where TimeGenerated > ago(1h)
| summarize Count = count()
    by DeviceEventClassID, SourceIP, DestinationIP,
    bin(TimeGenerated, 1s)
| where Count > 1
| summarize DuplicateEvents = sum(Count - 1), AffectedBuckets = count()
Expected Output

DuplicateEvents   AffectedBuckets
1,247             623
What to look for: 1,247 duplicate events per hour extrapolates to roughly 900,000 wasted events per month; at a typical ~1 KB CEF event, that is close to 1 GB of ingestion you pay for but gain nothing from.
Common causes: two forwarders receiving the same stream (misconfigured load balancer), a source device sending both to the forwarder and directly to the workspace, or a legacy MMA agent running alongside the AMA.
Fix: verify that each data source has exactly one ingestion path.
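
The grouping in the query can be mirrored offline. A sketch assuming events exported as dicts; the key names (class_id, src, dst, ts_sec) are illustrative, not the actual column names:

```python
from collections import Counter

# Count duplicate events by grouping on a dedup key, mirroring the
# KQL summarize above (class ID + source + destination + 1s bucket).
def count_duplicates(events):
    key = lambda e: (e["class_id"], e["src"], e["dst"], e["ts_sec"])
    buckets = Counter(key(e) for e in events)
    dup_events = sum(n - 1 for n in buckets.values() if n > 1)
    affected = sum(1 for n in buckets.values() if n > 1)
    return dup_events, affected

events = [
    {"class_id": "100", "src": "10.0.0.1", "dst": "10.0.0.2", "ts_sec": 0},
    {"class_id": "100", "src": "10.0.0.1", "dst": "10.0.0.2", "ts_sec": 0},  # duplicate
    {"class_id": "200", "src": "10.0.0.3", "dst": "10.0.0.4", "ts_sec": 1},
]
print(count_duplicates(events))  # (1, 1)
```

Note the n - 1 in the sum: one copy of each event is legitimate, so only the extra copies count as waste.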

Problem 4: Missing fields (null columns)

Symptom: Data arrives but SourceIP, DeviceAction, or other critical fields are null.

CommonSecurityLog
| where TimeGenerated > ago(1h)
| where isempty(SourceIP) or isempty(DeviceAction)
| take 5
| project TimeGenerated, DeviceVendor, DeviceProduct,
    SourceIP, DeviceAction, Message
Expected Output — Diagnosis

DeviceVendor   DeviceProduct   SourceIP   DeviceAction   Message (truncated)
Palo Alto      PAN-OS          (empty)    (empty)        Mar 21 14:32 fw01 1,2026/03/21 14:32,TRAFFIC,allow...
What to look for: the structured fields (SourceIP, DeviceAction) are empty, but the Message column contains the data as raw text. The device is sending plain Syslog, not CEF, and the CEF parser cannot extract structured fields from raw Syslog.
Fix: reconfigure the device to output CEF format. If CEF is not available, write a DCR parse transformation that extracts the fields from the raw Message.
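
To illustrate what such parsing has to do, here is a deliberately simplified CEF parser in Python. It handles only the seven pipe-delimited header fields and single-word key=value extensions; real CEF allows escaped characters and values containing spaces, and a production DCR transformation would use KQL parsing instead:

```python
# Minimal sketch of CEF parsing: split the 7 pipe-delimited header
# fields, then pull key=value pairs out of the extension section.
# Simplified: does not handle escaped pipes/equals or values with spaces.
def parse_cef(line):
    body = line.split("CEF:", 1)[1]
    parts = body.split("|", 7)  # 7 header fields, then extensions
    fields = {
        "Version": parts[0], "DeviceVendor": parts[1],
        "DeviceProduct": parts[2], "DeviceVersion": parts[3],
        "SignatureID": parts[4], "Name": parts[5], "Severity": parts[6],
    }
    for token in parts[7].split():  # extension key=value pairs
        if "=" in token:
            k, v = token.split("=", 1)
            fields[k] = v
    return fields

msg = "CEF:0|Palo Alto Networks|PAN-OS|10.2|TRAFFIC|allow|3|src=10.0.0.5 dst=8.8.8.8 act=allow"
parsed = parse_cef(msg)
print(parsed["src"], parsed["act"])  # 10.0.0.5 allow
```

If the device sends raw Syslog instead, there is no CEF: marker and no pipe structure to split on, which is exactly why the structured columns come back null.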

Problem 5: Connector was working, then stopped

Systematic check (in order):

  1. Source device: Configuration change? Reboot that lost Syslog config? Firmware update that changed log format?
  2. Network: Firewall rule change blocking port 514 (to forwarder) or 443 (forwarder to Azure)?
  3. Forwarder: VM running? Disk full? rsyslog running? AMA running?
  4. DCR: Was the DCR modified? A syntax error in the transformation silently drops all data. Check the DCR’s “last modified” timestamp.
  5. Workspace: Accepting data? Check the ingestion health query from Module 5.9.
DCR syntax errors are invisible

A broken DCR transformation does not generate an error message. It silently drops all data that passes through it. If a connector was working and suddenly stopped after someone modified the DCR, revert the transformation to the previous version and test.

Try it yourself

Your Palo Alto firewall data stopped flowing 4 hours ago. The connector page shows "Connected" (green). Walk through the diagnostic steps in order and write down what command or query you would run at each zone.

Zone 1 (source): SSH to forwarder, run sudo tcpdump -i any port 514 -c 10.

→ If packets arrive: Zone 1 is healthy. Move to Zone 2.

→ If no packets: Check the Palo Alto Syslog server configuration. Did someone remove or change the Syslog destination? Did the firewall reboot and lose the config?

Zone 2 (forwarder): systemctl status rsyslog — is it running? df -h — is the disk full? tail -f /var/log/syslog | grep CEF — are CEF messages being processed?

→ If rsyslog is processing CEF: Move to Zone 3.

→ If disk is 95%+: AMA cannot send. Clear rsyslog buffer after fixing AMA.

Zone 3 (AMA/workspace): Check AMA status: systemctl status azuremonitoragent. Check DCR: was it modified in the last 4 hours? Run the workspace ingestion health query to verify other tables are still flowing.

→ If other tables flow but CommonSecurityLog does not: DCR or AMA issue specific to CEF.

→ If all tables stopped: Workspace-level problem (throttling, permissions, subscription issue).

Check your understanding

1. SigninLogs stopped flowing but all other tables are healthy. Where is the most likely failure?

The Entra ID diagnostic settings were deleted or modified. SigninLogs uses a different ingestion path (diagnostic settings) than the other tables (Defender connector, Office 365 connector). Check Entra ID → Monitoring → Diagnostic settings and verify the configuration is intact.
The workspace is down
The forwarder VM crashed

2. A colleague modified a DCR transformation yesterday. Today, CommonSecurityLog has zero events. No error messages anywhere. What happened?

The DCR transformation has a syntax error that silently drops all incoming data. DCR failures do not generate error messages or alerts — they just stop writing data. Revert the transformation to the previous version and verify data flows.
The firewall stopped sending
The table was moved to Archive tier