NF1.10 Sensor Performance Tuning and Troubleshooting
Figure 1.10.1 — Four sensor failure categories and the signature-diagnosis-fix pattern for each.
Category 1 — Packet loss under load
Packet loss is the failure mode that matters most for evidentiary integrity. Every packet the sensor missed is a gap in your capture. If the loss happens during an attack window, you've missed part of the attack. The signature is visible, the diagnostic is mechanical, and the fix is one of four well-trodden paths depending on where the bottleneck sits.
Signature. Zeek's capture_loss.log begins emitting entries. Or ethtool -S shows rx_dropped or rx_no_buffer counters climbing. Or Suricata's stats.log shows the capture.kernel_drops counter incrementing. Any of these three indicates the sensor is receiving traffic it cannot process fast enough. Sub-0.1% loss during quiet periods is usually fine; sustained loss above 0.1% during peak is the threshold for investigation.
Diagnostic path. Four questions, in order.
First, is the CPU saturated? Run top or htop and check whether Zeek or Suricata is pinning a core at 100%. If one worker is pinning and others are idle, the workload isn't being distributed across cores — AF_PACKET clustering or PF_RING is misconfigured or absent. If all cores are at 100%, you're CPU-bound at the current traffic level and need either more cores, faster cores, or fewer features (rule count, script load).
Second, is the kernel dropping packets before they reach the sensor application? Check ethtool -S <iface> for rx_dropped, rx_no_buffer, or rx_missed_errors — these are kernel-level drops. Increase the interface ring buffer with ethtool -G (typically to 4096 or 8192 descriptors). Verify RX queues are bound to dedicated CPUs with set_irq_affinity. If the NIC has hardware offloads enabled (LRO, GRO, checksum offload), they coalesce segments and alter the packets the sensor sees — disable them on the capture interface with ethtool -K <iface>.
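A minimal tuning pass for the capture interface might look like the following sketch — eth1 is a placeholder name, and the exact offload keywords and maximum ring size depend on your NIC driver:

```
# Placeholder interface eth1 — substitute your capture interface.
ethtool -g eth1                                    # current vs. maximum ring size
ethtool -G eth1 rx 4096                            # raise the RX ring toward the hardware maximum
ethtool -K eth1 gro off lro off rx off tx off      # disable offloads on the capture port
ethtool -S eth1 | grep -iE 'drop|miss|no_buffer'   # re-check kernel drop counters afterwards
```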
Third, is the sensor application's internal buffer full? Zeek's capture_loss.log measures gaps in the traffic Zeek analyzed; if it reports loss while the kernel counters are clean, Zeek can't process packets as fast as the kernel delivers them. Add more workers if you're on AF_PACKET clustering. Check for expensive scripts running synchronously in Zeek's main event loop — one way to see where the time goes is to replay a representative capture offline with Zeek's stock profiling script (zeek -r capture.pcap policy/misc/profiling) and review the resulting prof.log.
Fourth, is storage I/O the bottleneck? Less common but possible if you're writing PCAP to spinning disk. Check iostat -x 1 during peak — if %util on the storage device pegs at 100% with high await, the disk can't keep up. Move PCAP to NVMe, or write metadata only during peak and queue PCAP writing for off-peak flush.
Remediation paths. The four fixes in rough order of effort: enable AF_PACKET clustering with 2-4 workers (cheapest, usually sufficient for traffic under 2-3 Gbps); enable PF_RING if the NIC supports it (moderate effort, significant performance gain); tune ring buffer and disable NIC offloads (quick, always worth doing); and finally, upgrade hardware (expensive, last resort). Most capture loss I see in practice comes from step 1 not being done — the sensor was deployed with a single Zeek worker and nobody enabled clustering.
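For reference, a minimal zeekctl node.cfg sketch for AF_PACKET clustering with four workers — the interface name, worker count, and CPU pinning are placeholders, and depending on your Zeek version the load-balancing method is either the bundled af_packet or the external plugin's custom:

```
# /opt/zeek/etc/node.cfg — sketch only; adjust interface, lb_procs, and pin_cpus to your host.
[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=af_packet::eth1
lb_method=af_packet      # older installs using the external zeek-af_packet plugin use lb_method=custom
lb_procs=4
pin_cpus=2,3,4,5
```

Apply with zeekctl deploy and confirm all four workers show as running in zeekctl status.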
Category 2 — Zeek log anomalies
The second-most-common failure mode is Zeek running but not producing the logs you expected. Conn.log is emitting but smtp.log is empty. Or dns.log is full of entries for a single internal host. Or weird.log has grown to 2GB of parse errors. These aren't capture problems — they're analysis problems inside Zeek.
Signature. Missing connections in conn.log relative to what you can see in tcpdump -n -i <iface>. Empty protocol logs (dns.log, http.log, ssl.log) when you know the protocol is on the wire. Large weird.log or reporter.log files. Connections showing history=S (SYN only) for traffic you can verify is completing normally.
Diagnostic path. Start with Zeek's own health.
Run zeekctl status (or broctl status on older versions). All workers should be running. If any are crashed or stopped, check zeekctl diag for the crash reason and the corresponding worker's crash/ directory for the core dump or reason log.
Check the log rotation state. Zeek rotates logs on the interval set by Log::default_rotation_interval (commonly redefined in local.zeek), and zeekctl cron handles the archiving housekeeping. If rotation stalls, logs stop being written or archived. ls -la /opt/zeek/logs/current/ should show all expected logs with recent mtimes. If a log is present but not updating, the process writing it has stalled.
Check loaded scripts. zeekctl check validates the configuration. A script that references an undefined identifier or has a syntax error will prevent Zeek from loading — but a script that references an identifier that's only defined in a specific traffic context (e.g., ssl events when no TLS is seen) will load but produce no output for that protocol. Review local.zeek and site/local.zeek for recent changes.
Check disk space. df -h /opt/zeek/logs — if the logs partition is 100% full, Zeek may be writing to a partial log or silently dropping writes. Clear old rotated logs, expand the partition, or adjust retention policy.
Check the weird.log. Every line is a protocol anomaly Zeek flagged. A handful is normal; thousands in an hour means Zeek is seeing traffic it can't parse. Common causes: asymmetric routing (Zeek sees one direction of a flow but not the other), tunnel encapsulation Zeek doesn't decode by default, or malformed traffic from the attacker itself.
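To see which anomalies dominate, a quick summary with zeek-cut is usually enough (the path below is the default zeekctl layout):

```
# Top weird names in the current weird.log
zeek-cut name < /opt/zeek/logs/current/weird.log | sort | uniq -c | sort -rn | head -20
```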
Remediation. Restart Zeek with zeekctl deploy (which re-checks the configuration, reinstalls scripts, and restarts workers). Fix any script errors surfaced by zeekctl check. Clear stale state in /opt/zeek/spool/ if a previous crash left it inconsistent. For asymmetric routing, fix the mirror or tap so both directions of each flow reach the sensor — Zeek's protocol analysis degrades badly on one-sided traffic, and no script-level setting fully compensates. For tunnel traffic, confirm the relevant tunnel analyzers are active: recent Zeek versions decapsulate GRE, VXLAN, and Geneve out of the box, while older versions may need the analyzer enabled explicitly or a package installed via zkg.
Category 3 — Suricata firing wrong
Suricata either fires on traffic it shouldn't (false positives, alert storms) or doesn't fire on traffic it should (false negatives, missed detection). Both are rule-tuning problems, but they're detected differently and fixed differently. The discipline is the same: reproduce against a known PCAP, then fix the rule, then retest against the same PCAP.
Signature for false positives. Alert count jumps 100x from baseline in a short window. Specific rule SIDs dominating the alert feed. Analysts ignoring alerts from the sensor because signal-to-noise is too low.
Signature for false negatives. You know a technique was used in traffic (from PCAP you captured during a red-team exercise or a known-bad PCAP you replayed) but Suricata produced no alert. Or your organization's detection maturity baseline (MITRE ATT&CK coverage from DE0) shows gaps that ought to be covered by the loaded ruleset.
Diagnostic path for false positives. Identify the firing SID — eve.json includes alert.signature_id for every alert. Group by SID and count. Pull the rule definition from the ruleset (/etc/suricata/rules/*.rules or wherever your ruleset is deployed). Read the rule's match conditions.
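For example, grouping the day's alerts by SID with jq — the eve.json path is the package default; adjust to your deployment:

```
jq -r 'select(.event_type=="alert") | .alert.signature_id' /var/log/suricata/eve.json \
  | sort | uniq -c | sort -rn | head -10
```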
Then ask: is the rule actually matching malicious behavior but your environment produces legitimate traffic that looks the same (the classic "legitimate use case that the rule can't distinguish")? Or is the rule poorly written — matching on a common string that has no actual forensic value? Or is the rule firing on traffic that was never meant to be in scope (e.g., a rule for Microsoft SQL Server matching on a test-lab host that you forgot to exclude)?
The fix for each: for legitimate-use-case matches, add threshold or suppress logic (suricata.yaml's threshold-file setting points at a threshold.config where you can suppress specific source/destination combinations or rate-limit alerts per SID). For badly written rules, disable the rule by adding its SID to suricata-update's disable.conf so the next rule update drops it, and either replace it with a better one or accept the detection gap. For out-of-scope traffic, add a BPF filter on the capture interface (the bpf-filter option under the interface's capture section in suricata.yaml) or suppress at the rule level.
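A couple of illustrative threshold.config entries — the SID, subnet, and time window below are placeholders, not values from this module:

```
# Suppress SID 2100498 for a subnet that legitimately triggers it
suppress gen_id 1, sig_id 2100498, track by_src, ip 10.1.5.0/24

# Rate-limit SID 2100498 to one alert per source per five minutes
threshold gen_id 1, sig_id 2100498, type limit, track by_src, count 1, seconds 300
```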
Diagnostic path for false negatives. Reproduce against a known-bad PCAP. If you don't have one, generate one — for most common techniques (C2 beacons, DNS tunnelling, common web exploits), public PCAP corpora like malware-traffic-analysis.net or Kaggle have reference captures. Replay through the sensor with suricata -r and check whether the expected alert fires.
If the alert doesn't fire against the known-bad PCAP, the rule isn't loaded or isn't matching. Check suricata -T for rule-loading errors. Check the rule's matching conditions against the actual PCAP content — the HTTP header format, the TLS fingerprint, the payload bytes — with Wireshark or tshark. Common causes: rule requires a protocol-specific signature that Suricata's parser didn't identify for the session (the alert http rule doesn't fire because Suricata classified the session as generic TCP rather than HTTP), or the rule's content: matches are offset-dependent and your traffic has a different framing.
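The replay workflow end to end, assuming default config paths and a scratch output directory (known-bad.pcap stands in for your reference capture):

```
suricata -T -c /etc/suricata/suricata.yaml            # test that config and rules load cleanly
mkdir -p /tmp/replay
suricata -r known-bad.pcap -c /etc/suricata/suricata.yaml -l /tmp/replay
jq -r 'select(.event_type=="alert") | .alert.signature' /tmp/replay/eve.json | sort | uniq -c
```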
Category 4 — Slow decay
The fourth category is the hardest to catch because it's invisible on any single day. The sensor is running, logs are flowing, alerts are firing at a normal rate. And yet six months later you realize the logs you have from month four aren't what you think they are. Sensor integrity erodes slowly, and the discipline that catches it is trend-based rather than snapshot-based.
Signatures you catch with monthly review. Log volume shifted 30% in one direction without a corresponding traffic change. The sensor's NTP status drifted (verify with ntpq -p or chronyc sources). A TLS certificate used by the sensor's management interface or its export to the SIEM expired without anyone being alerted. The ruleset hasn't been updated in six weeks because a suricata-update cron job silently started failing. Storage is 85% full and the retention policy hasn't been adjusted since deployment.
Signatures you catch with trend-based alerting. Set up alerts that compare this week's metrics to last month's baseline. Weekly log volume deviating by more than 20%. Percentile values (p50, p99) of connection duration shifting significantly. Unique destinations per day trending up or down outside normal business-cycle variation. The alerts don't name specific failures — they name drift, and drift is almost always worth investigating even when the cause turns out to be benign.
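One way to implement the volume check — a rough sketch; the log path, file-name pattern, and the 20% threshold are assumptions to adapt to your archive layout and alerting pipeline:

```
#!/usr/bin/env bash
# Compare this week's conn.log line count to the prior four-week average.
logdir=/opt/zeek/logs
this_week=$(find "$logdir" -name 'conn.*.log.gz' -mtime -7 -exec zcat {} + | grep -vc '^#')
prior=$(find "$logdir" -name 'conn.*.log.gz' -mtime +7 -mtime -35 -exec zcat {} + | grep -vc '^#')
baseline=$(( prior / 4 )); [ "$baseline" -eq 0 ] && baseline=1
dev=$(( 100 * (this_week - baseline) / baseline ))
[ "${dev#-}" -gt 20 ] && echo "conn.log volume is ${dev}% off the four-week baseline"
```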
Remediation — monthly sensor review cadence. Put it on the calendar. Thirty minutes per sensor per month. Check: NTP status and clock drift. Certificate dates for anything the sensor presents to SIEM, dashboards, or export targets. Storage trend (is the fill rate matching the retention policy or are you heading for a surprise?). Ruleset update status — when did suricata-update last succeed, when did zkg last pull updates for the Zeek scripts? Log-volume trend against last month's baseline. Alert-volume distribution — have any SIDs started dominating that weren't before?
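A starting point for scripting parts of the review — a sketch only; the paths are common defaults (suricata-update's managed rule file, the zeekctl log tree) and should be adjusted to your install:

```
chronyc sources -v 2>/dev/null || ntpq -p                  # clock sync state and drift
df -h /opt/zeek/logs /var/log/suricata                     # storage headroom vs. retention plan
stat -c '%y %n' /var/lib/suricata/rules/suricata.rules     # when the loaded ruleset file last changed
zkg list                                                   # installed Zeek packages, check for stale versions
du -sh /opt/zeek/logs/20*/ | tail -30                      # per-day log volume trend at a glance
```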
"Monitor the monitor" principle. Your SIEM or monitoring platform should treat the sensor itself as a monitored asset. Heartbeat metrics (Zeek is up, Suricata is up, NTP is synchronised, disk is below 85%). Log-volume freshness (a 15-minute window with zero new log lines is an alert). Ruleset freshness (hash of the loaded ruleset, alerting if the hash hasn't changed in 14 days despite upstream updates being available). If the monitoring of the sensor breaks and no-one notices for a month, the sensor's silence doesn't mean nothing happened — it means you can't distinguish nothing from something.
A quick triage pass from the terminal: run tail -f /opt/zeek/logs/current/capture_loss.log in one terminal. In another, check Suricata's capture stats with jq 'select(.event_type=="stats") | .stats.capture' /var/log/suricata/eve.json | tail -5. Compare against kernel drops with ethtool -S <iface> | grep -i drop. Check per-thread CPU with top -H -p $(pgrep -d, zeek) — if one thread is at 100% and the others are idle, clustering isn't working. Check the ring buffer with ethtool -g <iface> — if the current size is much smaller than the maximum, increase it. Check NIC offloads with ethtool -k <iface> | grep ': on' — offloads should be off on the capture interface.

Now a scenario. Your sensor is dropping 0.5% of packets during morning peak. You've tuned the ring buffer and disabled offloads. Clustering is configured with four AF_PACKET workers. Drops persist. The CISO asks whether you should add more hardware to the sensor or accept the 0.5% loss as within operational tolerance.
The knee-jerk response is "add hardware" — more CPU cores, faster NIC, faster disk. It's always technically possible to throw more hardware at packet loss. But the right answer depends on what the 0.5% represents.
If 0.5% is uniform across all traffic types — that is, it's random drops that affect every protocol and every destination roughly equally — then the operational impact on investigations is small. You might miss a handful of packets from a connection, but conn.log will still record the flow, dns.log will still record the queries, ssl.log will still capture the TLS fingerprint. Investigations work on metadata; random 0.5% drops rarely corrupt investigative conclusions.
If 0.5% is concentrated in specific flows — that is, one particular protocol or one particular host's traffic is disproportionately affected — then the operational impact is larger and depends on which protocol. Dropping 0.5% of DNS queries randomly is probably fine. Dropping 0.5% of HTTP/HTTPS connection establishments means you miss 0.5% of new sessions; still usually fine. Dropping 0.5% of the TLS ClientHello is the problematic case — you lose JA3 fingerprinting for those sessions. If the drops cluster on sessions you care about for investigation, the hardware upgrade is justified.
The operational lesson: packet-loss tolerance is a function of which packets are being lost and what questions the investigation needs to answer. "Is 0.5% loss acceptable" isn't a numerical question — it's a question about which evidence goes missing. Quantify the loss distribution before recommending the hardware spend.
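One way to quantify it is conn.log's missed_bytes field, which records the content gaps Zeek detected per connection. A sketch — the archive path and date glob are placeholders for your log layout:

```
# Which services carry the bytes Zeek knows it missed?
zcat /opt/zeek/logs/2026-08-*/conn.*.log.gz \
  | zeek-cut service missed_bytes \
  | awk '$2 > 0 {miss[$1] += $2; n[$1]++}
         END {for (s in miss) printf "%-12s %12d missed bytes across %d conns\n", s, miss[s], n[s]}' \
  | sort -k2 -rn
```

If the missed bytes concentrate in ssl or http rather than spreading evenly, that is the evidence-loss argument for the hardware spend.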
The myth: "the dashboard is green, so the sensor is capturing what we think it is." The dashboard shows you the sensor's self-reported health — the process is up, the CPU is under 80%, alerts are flowing. None of those metrics tells you whether the sensor is actually capturing the traffic you think it is.
A sensor can be fully green on its self-reported metrics while the SPAN port upstream is misconfigured and mirroring only one direction of each flow, while VLAN tags are being stripped at the mirror point so packets no longer match the sensor's expected topology, while an MTU mismatch between the mirror port and the sensor truncates or fragments frames in ways Zeek's reassembly silently drops, or while a switch upgrade six months ago quietly removed the sensor's SPAN configuration and no one noticed.
The discipline that catches this: periodic end-to-end validation. Pick a known endpoint, generate traffic from it (ping, a curl to a specific destination, a DNS lookup for a specific hostname), and verify the sensor actually logged it. NF1.7 taught this as a deploy-time validation; the hygiene is to repeat it quarterly. If the sensor's dashboard says it's capturing but your end-to-end test shows it isn't, the dashboard is lying. Trust end-to-end verification over self-reported health.
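A minimal end-to-end check, assuming you control an endpoint behind the monitored segment and the sensor uses the default zeekctl log path; the marker hostname and URL are placeholders:

```
# On a known endpoint behind the monitored segment:
marker="sensor-check-$(date +%s)"; echo "$marker"
dig "${marker}.example.com" > /dev/null          # NXDOMAIN is fine — the query still crosses the wire
curl -s -o /dev/null "http://example.com/${marker}"

# On the sensor a minute or two later (paste the marker value printed above):
grep -h "sensor-check-" /opt/zeek/logs/current/dns.log /opt/zeek/logs/current/http.log
```

If the marker shows up in dns.log and http.log, the capture path is intact end to end; if it doesn't, start at the mirror configuration, not the sensor.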
The myth has a second form: "if Zeek's capture_loss.log is empty, we're not losing packets." capture_loss.log only measures what Zeek knows to measure — specifically, the gap between expected TCP sequence numbers and received ones for connections Zeek is already tracking. It doesn't measure connections that never reached the sensor because the SPAN port didn't mirror them. A silent mirror failure leaves capture_loss.log empty and shows zero drops on ethtool, while 100% of the affected traffic is invisible. End-to-end validation is the only reliable check.
Try it: identify the failure category from a symptom description
Setup. Five scenarios below. For each, write one sentence naming the failure category (1, 2, 3, or 4) and one sentence on what you'd check first.
Task. (1) "Zeek's conn.log shows half as many connections per day this week compared to two weeks ago; traffic volume hasn't changed." (2) "Suricata alert count jumped from 200 per day to 8,000 per day last Tuesday; single rule SID dominates the new alerts." (3) "Sensor deployed six months ago; last week found that two days of Zeek logs from month three are missing entirely." (4) "During the 09:00-09:30 window every weekday, Zeek's capture_loss.log shows 1-2% loss; rest of the day is clean." (5) "Sensor reports healthy, but the analyst testing it noticed that HTTPS traffic from a specific internal subnet shows up in conn.log but never in ssl.log."
Expected result. (1) Category 2 (Zeek log anomaly) — check zeekctl status, log rotation, recent script changes, disk space. (2) Category 3 (Suricata firing wrong, false positive) — identify the SID, read the rule, check whether legitimate environmental traffic triggers it. (3) Category 4 (slow decay) — check storage, NTP, ruleset update status, and add monthly review if not already in place. (4) Category 1 (packet loss under load) — concentrated in a time window; likely morning backup job or meeting-start traffic spike; check ring buffer, clustering, NIC offloads. (5) Category 2 — traffic is being captured (it's in conn.log) but Zeek's SSL analyzer isn't parsing it; likely an asymmetric routing issue or a custom script interfering with protocol identification.
Debugging branch. If you categorized (5) as Category 1, you missed the distinction: packets reached Zeek (they're in conn.log) so the capture layer is fine; the analysis layer isn't producing ssl.log. If you categorized (2) as Category 4, you're reading single-event spikes as decay patterns — decay is slow drift, not step changes. Recognizing the signature-to-category match is the whole point of the diagnostic workflow.
You should be able to do the following without referring back to this sub-module. If you can't, the sections to re-read are noted.
You've built the sensor and mapped the evidence landscape.
NF0 established why network evidence matters when every other source is compromised. NF1 built your Zeek + Suricata sensor with the 10 investigation query patterns. From here, every module teaches protocol-specific investigation against real attack scenarios.
- DNS deep dive (NF3) — tunnelling detection, DGA analysis, passive DNS infrastructure mapping, and the INC-NE-2026-0227 AiTM phishing DNS trail
- Protocol analysis (NF4–NF7) — HTTP/HTTPS, SMB lateral movement, SSH tunnelling, and email protocol investigation with Zeek metadata and PCAP
- Detection and hunting (NF8–NF11) — Suricata rule writing, C2 beacon detection with JA3, NetFlow analytics, and proactive network threat hunting
- NSM architecture (NF13) — production sensor deployment at 1–10 Gbps with Arkime, Security Onion, and enterprise storage planning
- INC-NE-2026-0830 capstone (NF14) — multi-stage investigation using only network evidence: phishing → domain-fronted C2 → lateral movement → DNS tunnel exfiltration