NF1.9 Sensor Maintenance and Monitoring
You maintain endpoint tools — update antivirus signatures, rotate logs, monitor agent health. Network sensors need the same attention. This sub-module covers the ongoing maintenance tasks that keep your sensor producing reliable evidence: log rotation, rule updates, disk management, NTP verification, and health monitoring.
A sensor that works today can fail silently tomorrow. Disk fills, NTP drifts, rules go stale, processes crash at 03:00. The gap between "sensor is running" and "sensor is producing reliable evidence" is maintenance. NF0.3 covered the failure modes — deploy-and-forget, alert-only, no baseline, retention mismatch. This sub-module gives you the operational procedures to prevent each one.
Deliverable: A maintenance checklist covering daily, weekly, and monthly tasks, plus a monitoring script that detects sensor health issues before they become evidence gaps.
Figure NF1.9 — The three maintenance cadences. Daily tasks are automated via cron. Weekly checks are a 5-minute manual review. Monthly validation runs the full script from NF1.7.
Daily Tasks (Automated)
These run via cron without analyst intervention:
Suricata rule update. The cron job from NF1.5 runs suricata-update daily at 04:00 and reloads Suricata. Verify the cron job is present:
cat /etc/cron.d/suricata-update

Log rotation. Zeek rotates logs hourly by default (when running via zeekctl for live capture). For PCAP replay in the lab, rotation isn't relevant — each analysis produces a new set of logs. For production, verify logrotate is configured to compress and age-out old Zeek logs based on your retention target.
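If you use logrotate for this, a minimal policy could look like the following sketch. The path glob and the 90-day retention are assumptions to adapt to your deployment; Zeek's own rotation still manages the current/ directory:

# /etc/logrotate.d/zeek-sensor (sketch only; adjust path and retention target)
/opt/sensor/zeek-logs/*/*.log {
    daily
    rotate 90
    compress
    missingok
    notifempty
}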
Disk usage monitoring. Add a simple disk check to cron that alerts when disk usage exceeds 80%:
echo '0 */6 * * * root df -h /opt/sensor/ | awk "NR>1 && int(\$5)>80 {print \"SENSOR DISK WARNING: \" \$5}" | logger' | sudo tee /etc/cron.d/sensor-disk-check

Weekly Tasks (5-Minute Manual Check)
Run these weekly — they take 5 minutes and catch drift before it becomes a gap:
# NTP synchronisation
chronyc sources | grep '^.\*'   # a synchronised source line starts with ^*
# Zeek process (if running live)
pgrep -a zeek
# Suricata process (if running live)
pgrep -a suricata
# Disk usage
df -h /opt/sensor/
# Oldest Zeek log (retention check)
ls -lt /opt/sensor/zeek-logs/ | tail -3
# Packet loss (live capture only)
tail -5 /opt/sensor/zeek-logs/current/capture_loss.log 2>/dev/null

If NTP shows no synchronised source (*), investigate immediately — every timestamp is unreliable until NTP is fixed. If disk usage exceeds 70%, review retention settings or add storage.
Monthly Tasks (Full Validation)
Run the validation script from NF1.7 monthly. Additionally:
OS updates. Apply security updates to the sensor VM. Schedule a maintenance window — updates may require a reboot, which interrupts live capture.
sudo apt update && sudo apt upgrade -y

Tool version check. Verify Zeek and Suricata are on supported versions. Check for security advisories:
zeek --version
suricata -V

Re-validate after updates. After any OS update, Zeek upgrade, or Suricata upgrade, run the full validation script to confirm the sensor still produces correct output.
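A convenient pattern is to append each post-upgrade validation run to the sensor's health log, so it doubles as a proof-of-operation artefact. The script path here is an assumption; use wherever you saved the NF1.7 script:

sudo /opt/sensor/bin/validate-sensor.sh | tee -a /opt/sensor/health-log.txt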
Troubleshooting Common Issues
Zeek not producing logs. First check the process: pgrep zeek. If Zeek isn't running, look at why it exited. sudo zeekctl status gives a service-level view; tail -50 /opt/sensor/zeek-logs/current/reporter.log shows the last errors Zeek emitted. Common root causes:
- Interface name changed after an OS update — the new kernel enumerated interfaces differently and node.cfg still refers to the old name. Fix by updating node.cfg with the current interface (check with ip link) and running sudo zeekctl deploy (see the sketch after this list).
- Disk full — Zeek halts writes when the filesystem is full and the process eventually exits. Fix disk first, then restart.
- Permissions issue after a package update — the zeek user lost access to its log directory. Verify with ls -la /opt/sensor/zeek-logs/ and restore ownership with sudo chown -R zeek:zeek /opt/sensor/zeek-logs/.
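For the interface-rename case, the recovery sequence looks roughly like this (the interface name ens19 and the node.cfg path are assumptions; check yours with ip link and your Zeek install prefix):

ip link                                    # identify the current capture interface
sudo sed -i 's/^interface=.*/interface=ens19/' /opt/zeek/etc/node.cfg
sudo zeekctl deploy                        # push the config and restart workers
sudo zeekctl status                        # confirm the worker came back up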
Suricata showing zero alerts for weeks. Rules may be stale even though suricata-update appears to run. Three specific checks (condensed into commands after this list):
- Verify the cron job actually executed: grep suricata-update /var/log/syslog should show recent entries.
- Confirm the downloaded rules are current: ls -la /var/lib/suricata/rules/suricata.rules — the timestamp should be within the last 24 hours.
- Verify Suricata reloaded after the rules updated: the cron job uses suricatasc -c reload-rules or equivalent; if the reload failed, Suricata is still running old rules.
A common failure: the ruleset source index has gone stale and suricata-update silently skips the ruleset. Fix with suricata-update update-sources && suricata-update.
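The three checks condense to a few commands (paths are the Suricata defaults on Debian/Ubuntu):

grep suricata-update /var/log/syslog | tail -3            # did the cron job run?
stat -c '%y %n' /var/lib/suricata/rules/suricata.rules    # rules file age
sudo suricatasc -c reload-rules                           # force a live reload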
Community ID mismatch after upgrade. Zeek and Suricata both generate Community ID values for connections; the value lies in both tools producing identical IDs for the same flow, enabling correlation between Zeek logs and Suricata alerts. A version upgrade in either tool can change the Community ID algorithm version or seed. Re-validate with the NF1.7 script immediately after any upgrade. If IDs differ, check both tools' Community ID configuration (CommunityID::seed in Zeek's local.zeek, community-id-seed in suricata.yaml) — they must match exactly.
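The two settings look like this; seed 0 is the default in both tools and is shown only to make the pairing explicit:

# Zeek side: local.zeek
@load policy/protocols/conn/community-id-logging
redef CommunityID::seed = 0;

# Suricata side: suricata.yaml, under outputs -> eve-log
community-id: true
community-id-seed: 0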
Disk full. Zeek logs are the primary consumer of disk on most sensors. Three response options, ordered from least to most operationally disruptive:
- Reduce retention. Adjust the log-retention policy to delete older logs; a 90-day retention is longer than most investigations need if your MTTD is under 30 days.
- Compress older logs. find /opt/sensor/zeek-logs/ -name '*.log' -mtime +7 -exec gzip {} \; compresses logs older than 7 days; saves 70-80% space at the cost of slower queries.
- Add storage. Expand the VM disk (needs a maintenance window), or add a separate evidence-storage volume and symlink Zeek's log directory onto it (see the sketch after this list).
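A sketch of the third option, assuming the new virtual disk appears as /dev/sdb (check lsblk) and capture is stopped during the move:

sudo mkfs.ext4 /dev/sdb                    # format the new evidence volume
sudo mkdir -p /mnt/evidence
sudo mount /dev/sdb /mnt/evidence          # add an /etc/fstab entry to persist
sudo zeekctl stop                          # stop capture before moving logs
sudo rsync -a /opt/sensor/zeek-logs/ /mnt/evidence/zeek-logs/
sudo mv /opt/sensor/zeek-logs /opt/sensor/zeek-logs.old
sudo ln -s /mnt/evidence/zeek-logs /opt/sensor/zeek-logs
sudo zeekctl deploy                        # resume capture on the new volume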
The Sensor Health Monitoring Script
The weekly 5-minute manual check is fine for starting out. For ongoing operation, automate the same checks and alert when anything fails.
Save the following as /opt/sensor/bin/sensor-health-check.sh and make it executable. Run from cron hourly — it's cheap and catches drift fast.
#!/usr/bin/env bash
# Sensor health check — run hourly via cron, alert on any failure
set -u

HEALTH_LOG="/opt/sensor/health-log.txt"
ALERT_LOG="/var/log/sensor-alerts.log"
SENSOR_ROOT="/opt/sensor"
MAX_DISK_PCT=80
MAX_LOG_AGE_MIN=10   # newest Zeek log should be younger than this

timestamp() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
alert() { echo "$(timestamp) SENSOR_ALERT: $*" | tee -a "$ALERT_LOG" | logger -t sensor-health; }
log() { echo "$(timestamp) $*" >> "$HEALTH_LOG"; }

# NTP synchronisation (a synchronised source line starts with "^*")
if ! chronyc sources 2>/dev/null | grep -q '^.\*'; then
    alert "NTP not synchronised — all timestamps unreliable"
fi

# Disk usage
DISK_PCT=$(df -P "$SENSOR_ROOT" | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$DISK_PCT" -gt "$MAX_DISK_PCT" ]; then
    alert "Disk at ${DISK_PCT}% (threshold ${MAX_DISK_PCT}%)"
fi

# Zeek log freshness (live capture only — skip if lab PCAP mode)
if pgrep -x zeek > /dev/null; then
    NEWEST_LOG_AGE=$(find "$SENSOR_ROOT/zeek-logs/current/" -name "*.log" -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
    if [ -n "$NEWEST_LOG_AGE" ]; then
        AGE_SEC=$(($(date +%s) - ${NEWEST_LOG_AGE%.*}))
        if [ "$AGE_SEC" -gt $((MAX_LOG_AGE_MIN * 60)) ]; then
            alert "Zeek not writing logs — newest log is ${AGE_SEC}s old"
        fi
    fi
else
    alert "Zeek process not running"
fi

# Suricata process
if ! pgrep -x suricata > /dev/null; then
    alert "Suricata process not running"
fi

# Packet loss (percent_lost is field 6 in capture_loss.log; skip "#" header/footer lines)
if [ -f "$SENSOR_ROOT/zeek-logs/current/capture_loss.log" ]; then
    LOSS_PCT=$(grep -v '^#' "$SENSOR_ROOT/zeek-logs/current/capture_loss.log" | tail -1 | awk -F'\t' '{print $6}')
    if [ -n "$LOSS_PCT" ] && awk "BEGIN{exit !($LOSS_PCT > 1.0)}"; then
        alert "Packet loss above 1% — ${LOSS_PCT}%"
    fi
fi

log "Health check complete. Disk=${DISK_PCT}%"

Install as cron:
sudo install -m 0755 sensor-health-check.sh /opt/sensor/bin/
echo '0 * * * * root /opt/sensor/bin/sensor-health-check.sh' | sudo tee /etc/cron.d/sensor-health

The logger call writes alerts to syslog, which your existing log-forwarding (if any) will pick up. For immediate notification, extend alert() to push into whatever alerting channel you use — email via mail, Slack via a webhook curl, PagerDuty via their API. A Slack example follows.
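This sketch extends alert() to post to a Slack incoming webhook; the webhook URL is a placeholder you would replace with your own:

# Sketch: replace the webhook URL with your own incoming-webhook endpoint
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"
alert() {
    local msg="$(timestamp) SENSOR_ALERT: $*"
    echo "$msg" | tee -a "$ALERT_LOG" | logger -t sensor-health
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "{\"text\":\"$msg\"}" "$SLACK_WEBHOOK" >/dev/null 2>&1 || true
}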
Evidence Integrity During Sensor Outages
When the sensor is down during an incident — hardware failure, disk full, update gone wrong — the evidence for that window is gone. The investigation doesn't stop; it proceeds on other evidence sources plus explicit documentation of the network-evidence gap.
Document the outage in the investigation. Every IR report section covered by network evidence must state whether the sensor was operational during the incident window. If there was an outage, document: the exact start and end time of the outage, what caused it, whether any evidence was captured during a degraded-but-running state (packet loss, reduced interfaces). Without this documentation, a defense attorney or opposing expert can argue the evidence was cherry-picked — "what else happened that you didn't capture?"
Proof-of-operation artefacts. For high-stakes cases, maintain a daily archive of sensor health — the /opt/sensor/health-log.txt from the script above plus capture_loss.log excerpts. This file, timestamped and preserved, is evidence that the sensor was operational. Format as a brief appendix to the IR report: "Network sensor operational during incident window 06:08-06:15 UTC. Packet loss 0.02% (capture_loss.log). Health check passes at 06:00 and 07:00 UTC (health-log.txt)."
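A minimal sketch of that daily archive as a cron job, assuming /opt/sensor/archive already exists (create it first with mkdir):

echo '5 0 * * * root tar -czf /opt/sensor/archive/health-$(date +\%F).tar.gz /opt/sensor/health-log.txt /opt/sensor/zeek-logs/current/capture_loss.log' | sudo tee /etc/cron.d/sensor-health-archive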
Recovery from an outage during an incident. If the sensor comes back up mid-incident, capture immediately resumes but the gap persists. Pull what you have — even partial evidence is usually better than nothing — and explicitly annotate the timeline with the outage boundaries. The attacker's activity during the gap is outside your network visibility; rely on endpoint telemetry for that window and be explicit about where the evidence sources split.
Your sensor's disk is at 85% capacity. You need to free space. Options: delete the oldest 30 days of Zeek logs (reducing retention from 90 to 60 days), compress existing logs with gzip (reducing size by ~70% but making queries slower), or add a 20 GB virtual disk to the VM.
The right answer depends on your MTTD. If your organization detects incidents within 30 days, reducing to 60 days of retention still covers the investigation window. If MTTD is longer, you need the 90 days.
Compression is a middle ground — keep 90 days but accept slower queries on compressed files. For the lab, adding storage is easiest (expand the VM disk in the hypervisor).
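If you take the retention route instead, the age-out is a one-liner. This sketch assumes Zeek's date-named archive directories live directly under zeek-logs/; verify the path before running it:

# Remove rotated log directories older than 60 days
sudo find /opt/sensor/zeek-logs/ -maxdepth 1 -type d -name '20*' -mtime +60 -exec rm -r {} +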
The operational lesson: plan storage capacity before it becomes a problem. The health-check script's disk alert fires at 80%, giving you time to act before 100%.
The sensor produces your investigation evidence. If the sensor is down, your evidence has gaps. If the rules are stale, your detections are blind. If NTP is drifted, your timestamps are wrong. These aren't IT inconveniences — they're evidence integrity failures that directly affect your investigation capability.
In many organizations, the sensor infrastructure is managed by IT operations, but the health verification is security's responsibility. The weekly 5-minute check and monthly validation are security tasks. You're not managing the hardware — you're verifying that the evidence source is reliable.
Try it: Run the complete weekly health check
Setup. Your validated sensor VM.
Task. Run all six weekly check commands from the "Weekly Tasks" section. Save the output to a file: date >> /opt/sensor/health-log.txt && (commands) >> /opt/sensor/health-log.txt.
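One way to capture that run, abbreviated here (include all six checks from the Weekly Tasks section):

{
    date -u
    chronyc sources | grep '^.\*'
    df -h /opt/sensor/
    # ...remaining weekly checks...
} >> /opt/sensor/health-log.txt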
Expected result. All checks pass: NTP synchronised, disk under 70%, no packet loss. The health log file serves as an audit trail for sensor availability.
Debugging branch. If any check fails, refer to the Troubleshooting section and fix the issue before proceeding to NF2.
You've built the sensor and mapped the evidence landscape.
NF0 established why network evidence matters when every other source is compromised. NF1 built your Zeek + Suricata sensor with the 10 investigation query patterns. From here, every module teaches protocol-specific investigation against real attack scenarios.
- DNS deep dive (NF3) — tunnelling detection, DGA analysis, passive DNS infrastructure mapping, and the INC-NE-2026-0227 AiTM phishing DNS trail
- Protocol analysis (NF4–NF7) — HTTP/HTTPS, SMB lateral movement, SSH tunnelling, and email protocol investigation with Zeek metadata and PCAP
- Detection and hunting (NF8–NF11) — Suricata rule writing, C2 beacon detection with JA3, NetFlow analytics, and proactive network threat hunting
- NSM architecture (NF13) — production sensor deployment at 1–10 Gbps with Arkime, Security Onion, and enterprise storage planning
- INC-NE-2026-0830 capstone (NF14) — multi-stage investigation using only network evidence: phishing → domain-fronted C2 → lateral movement → DNS tunnel exfiltration