NF1.9 Sensor Maintenance and Monitoring
You maintain endpoint tools — update antivirus signatures, rotate logs, monitor agent health. Network sensors need the same attention. This sub-module covers the ongoing maintenance tasks that keep your sensor producing reliable evidence: log rotation, rule updates, disk management, NTP verification, and health monitoring.
A sensor that works today can fail silently tomorrow. Disk fills, NTP drifts, rules go stale, processes crash at 03:00. The gap between "sensor is running" and "sensor is producing reliable evidence" is maintenance. NF0.3 covered the failure modes — deploy-and-forget, alert-only, no baseline, retention mismatch. This sub-module gives you the operational procedures to prevent each one.
Deliverable: A maintenance checklist covering daily, weekly, and monthly tasks, plus a monitoring script that detects sensor health issues before they become evidence gaps.
Figure NF1.9 — The three maintenance cadences. Daily tasks are automated via cron. Weekly checks are a 5-minute manual review. Monthly validation runs the full script from NF1.7.
Daily Tasks (Automated)
These run via cron without analyst intervention:
Suricata rule update. The cron job from NF1.5 runs suricata-update daily at 04:00 and reloads Suricata. Verify the cron job is present:
cat /etc/cron.d/suricata-update

Log rotation. Zeek rotates logs hourly by default (when running via zeekctl for live capture). For PCAP replay in the lab, rotation isn't relevant — each analysis produces a new set of logs. For production, verify logrotate is configured to compress and age-out old Zeek logs based on your retention target.
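If you use logrotate for this, a minimal policy could look like the following sketch. The path glob and the 90-day retention are assumptions to adapt to your deployment; Zeek's own rotation still manages the current/ directory:

# /etc/logrotate.d/zeek-sensor (sketch only; adjust path and retention target)
/opt/sensor/zeek-logs/*/*.log {
    daily
    rotate 90
    compress
    missingok
    notifempty
}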
Disk usage monitoring. Add a simple disk check to cron that alerts when disk usage exceeds 80%:
echo '0 */6 * * * root df -h /opt/sensor/ | awk "NR>1 && int(\$5)>80 {print \"SENSOR DISK WARNING: \" \$5}" | logger' | sudo tee /etc/cron.d/sensor-disk-check

Weekly Tasks (5-Minute Manual Check)
Run these weekly — they take 5 minutes and catch drift before it becomes a gap:
# NTP synchronisation
chronyc sources | grep '^.\*'   # a synchronised source line starts with ^*
# Zeek process (if running live)
pgrep -a zeek
# Suricata process (if running live)
pgrep -a suricata
# Disk usage
df -h /opt/sensor/
# Oldest Zeek log (retention check)
ls -lt /opt/sensor/zeek-logs/ | tail -3
# Packet loss (live capture only)
tail -5 /opt/sensor/zeek-logs/current/capture_loss.log 2>/dev/null

If NTP shows no synchronised source (*), investigate immediately — every timestamp is unreliable until NTP is fixed. If disk usage exceeds 70%, review retention settings or add storage.
Monthly Tasks (Full Validation)
Run the validation script from NF1.7 monthly. Additionally:
OS updates. Apply security updates to the sensor VM. Schedule a maintenance window — updates may require a reboot, which interrupts live capture.
sudo apt update && sudo apt upgrade -y

Tool version check. Verify Zeek and Suricata are on supported versions. Check for security advisories:
zeek --version
suricata -V

Re-validate after updates. After any OS update, Zeek upgrade, or Suricata upgrade, run the full validation script to confirm the sensor still produces correct output.
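A convenient pattern is to append each post-upgrade validation run to the sensor's health log, so it doubles as a proof-of-operation artefact. The script path here is an assumption; use wherever you saved the NF1.7 script:

sudo /opt/sensor/bin/validate-sensor.sh | tee -a /opt/sensor/health-log.txt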
Troubleshooting Common Issues
Zeek not producing logs. First check the process: pgrep zeek. If Zeek isn't running, look at why it exited. sudo zeekctl status gives a service-level view; tail -50 /opt/sensor/zeek-logs/current/reporter.log shows the last errors Zeek emitted. Common root causes:
- Interface name changed after an OS update — the new kernel enumerated interfaces differently and node.cfg still refers to the old name. Fix by updating node.cfg with the current interface (check with ip link) and running sudo zeekctl deploy (see the sketch after this list).
- Disk full — Zeek halts writes when the filesystem is full and the process eventually exits. Fix disk first, then restart.
- Permissions issue after a package update — the zeek user lost access to its log directory. Verify with ls -la /opt/sensor/zeek-logs/ and restore ownership with sudo chown -R zeek:zeek /opt/sensor/zeek-logs/.
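For the interface-rename case, the recovery sequence looks roughly like this (the interface name ens19 and the node.cfg path are assumptions; check yours with ip link and your Zeek install prefix):

ip link                                    # identify the current capture interface
sudo sed -i 's/^interface=.*/interface=ens19/' /opt/zeek/etc/node.cfg
sudo zeekctl deploy                        # push the config and restart workers
sudo zeekctl status                        # confirm the worker came back up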
Suricata showing zero alerts for weeks. Rules may be stale even though suricata-update appears to run. Three specific checks (condensed into commands after this list):
- Verify the cron job actually executed: grep suricata-update /var/log/syslog should show recent entries.
- Confirm the downloaded rules are current: ls -la /var/lib/suricata/rules/suricata.rules — the timestamp should be within the last 24 hours.
- Verify Suricata reloaded after the rules updated: the cron job uses suricatasc -c reload-rules or equivalent; if the reload failed, Suricata is still running old rules.
A common failure: the ruleset source index has gone stale and suricata-update silently skips the ruleset. Fix with suricata-update update-sources && suricata-update.
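The three checks condense to a few commands (paths are the Suricata defaults on Debian/Ubuntu):

grep suricata-update /var/log/syslog | tail -3            # did the cron job run?
stat -c '%y %n' /var/lib/suricata/rules/suricata.rules    # rules file age
sudo suricatasc -c reload-rules                           # force a live reload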
Community ID mismatch after upgrade. Zeek and Suricata both generate Community ID values for connections; the value lies in both tools producing identical IDs for the same flow, enabling correlation between Zeek logs and Suricata alerts. A version upgrade in either tool can change the Community ID algorithm version or seed. Re-validate with the NF1.7 script immediately after any upgrade. If IDs differ, check both tools' Community ID configuration (CommunityID::seed in Zeek's local.zeek, community-id-seed in suricata.yaml) — they must match exactly.
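The two settings look like this; seed 0 is the default in both tools and is shown only to make the pairing explicit:

# Zeek side: local.zeek
@load policy/protocols/conn/community-id-logging
redef CommunityID::seed = 0;

# Suricata side: suricata.yaml, under outputs -> eve-log
community-id: true
community-id-seed: 0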
Disk full. Zeek logs are the primary consumer of disk on most sensors. Three response options, ordered from least to most operationally disruptive:
- Reduce retention. Adjust the log-retention policy to delete older logs; a 90-day retention is longer than most investigations need if your MTTD is under 30 days.
- Compress older logs. find /opt/sensor/zeek-logs/ -name '*.log' -mtime +7 -exec gzip {} \; compresses logs older than 7 days; saves 70-80% space at the cost of slower queries.
- Add storage. Expand the VM disk (needs a maintenance window), or add a separate evidence-storage volume and symlink Zeek's log directory onto it (see the sketch after this list).
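A sketch of the third option, assuming the new virtual disk appears as /dev/sdb (check lsblk) and capture is stopped during the move:

sudo mkfs.ext4 /dev/sdb                    # format the new evidence volume
sudo mkdir -p /mnt/evidence
sudo mount /dev/sdb /mnt/evidence          # add an /etc/fstab entry to persist
sudo zeekctl stop                          # stop capture before moving logs
sudo rsync -a /opt/sensor/zeek-logs/ /mnt/evidence/zeek-logs/
sudo mv /opt/sensor/zeek-logs /opt/sensor/zeek-logs.old
sudo ln -s /mnt/evidence/zeek-logs /opt/sensor/zeek-logs
sudo zeekctl deploy                        # resume capture on the new volume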
The Sensor Health Monitoring Script
The weekly 5-minute manual check is fine for starting out. For ongoing operation, automate the same checks and alert when anything fails.
Save the following as /opt/sensor/bin/sensor-health-check.sh and make it executable. Run from cron hourly — it's cheap and catches drift fast.
#!/usr/bin/env bash
# Sensor health check — run hourly via cron, alert on any failure
set -u

HEALTH_LOG="/opt/sensor/health-log.txt"
ALERT_LOG="/var/log/sensor-alerts.log"
SENSOR_ROOT="/opt/sensor"
MAX_DISK_PCT=80
MAX_LOG_AGE_MIN=10   # newest Zeek log should be younger than this

timestamp() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
alert() { echo "$(timestamp) SENSOR_ALERT: $*" | tee -a "$ALERT_LOG" | logger -t sensor-health; }
log() { echo "$(timestamp) $*" >> "$HEALTH_LOG"; }

# NTP synchronisation (a synchronised source line starts with "^*")
if ! chronyc sources 2>/dev/null | grep -q '^.\*'; then
    alert "NTP not synchronised — all timestamps unreliable"
fi

# Disk usage
DISK_PCT=$(df -P "$SENSOR_ROOT" | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$DISK_PCT" -gt "$MAX_DISK_PCT" ]; then
    alert "Disk at ${DISK_PCT}% (threshold ${MAX_DISK_PCT}%)"
fi

# Zeek log freshness (live capture only — skip if lab PCAP mode)
if pgrep -x zeek > /dev/null; then
    NEWEST_LOG_AGE=$(find "$SENSOR_ROOT/zeek-logs/current/" -name "*.log" -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
    if [ -n "$NEWEST_LOG_AGE" ]; then
        AGE_SEC=$(($(date +%s) - ${NEWEST_LOG_AGE%.*}))
        if [ "$AGE_SEC" -gt $((MAX_LOG_AGE_MIN * 60)) ]; then
            alert "Zeek not writing logs — newest log is ${AGE_SEC}s old"
        fi
    fi
else
    alert "Zeek process not running"
fi

# Suricata process
if ! pgrep -x suricata > /dev/null; then
    alert "Suricata process not running"
fi

# Packet loss (percent_lost is field 6 in capture_loss.log; skip "#" header/footer lines)
if [ -f "$SENSOR_ROOT/zeek-logs/current/capture_loss.log" ]; then
    LOSS_PCT=$(grep -v '^#' "$SENSOR_ROOT/zeek-logs/current/capture_loss.log" | tail -1 | awk -F'\t' '{print $6}')
    if [ -n "$LOSS_PCT" ] && awk "BEGIN{exit !($LOSS_PCT > 1.0)}"; then
        alert "Packet loss above 1% — ${LOSS_PCT}%"
    fi
fi

log "Health check complete. Disk=${DISK_PCT}%"

Install as cron:
sudo install -m 0755 sensor-health-check.sh /opt/sensor/bin/
echo '0 * * * * root /opt/sensor/bin/sensor-health-check.sh' | sudo tee /etc/cron.d/sensor-health

The logger call writes alerts to syslog, which your existing log-forwarding (if any) will pick up. For immediate notification, extend alert() to push into whatever alerting channel you use — email via mail, Slack via a webhook curl, PagerDuty via their API. A Slack example follows.
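This sketch extends alert() to post to a Slack incoming webhook; the webhook URL is a placeholder you would replace with your own:

# Sketch: replace the webhook URL with your own incoming-webhook endpoint
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"
alert() {
    local msg="$(timestamp) SENSOR_ALERT: $*"
    echo "$msg" | tee -a "$ALERT_LOG" | logger -t sensor-health
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "{\"text\":\"$msg\"}" "$SLACK_WEBHOOK" >/dev/null 2>&1 || true
}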
Evidence Integrity During Sensor Outages
When the sensor is down during an incident — hardware failure, disk full, update gone wrong — the evidence for that window is gone. The investigation doesn't stop; it proceeds on other evidence sources plus explicit documentation of the network-evidence gap.
Document the outage in the investigation. Every IR report section covered by network evidence must state whether the sensor was operational during the incident window. If there was an outage, document: the exact start and end time of the outage, what caused it, whether any evidence was captured during a degraded-but-running state (packet loss, reduced interfaces). Without this documentation, a defense attorney or opposing expert can argue the evidence was cherry-picked — "what else happened that you didn't capture?"
Proof-of-operation artefacts. For high-stakes cases, maintain a daily archive of sensor health — the /opt/sensor/health-log.txt from the script above plus capture_loss.log excerpts. This file, timestamped and preserved, is evidence that the sensor was operational. Format as a brief appendix to the IR report: "Network sensor operational during incident window 06:08-06:15 UTC. Packet loss 0.02% (capture_loss.log). Health check passes at 06:00 and 07:00 UTC (health-log.txt)."
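A minimal sketch of that daily archive as a cron job, assuming /opt/sensor/archive already exists (create it first with mkdir):

echo '5 0 * * * root tar -czf /opt/sensor/archive/health-$(date +\%F).tar.gz /opt/sensor/health-log.txt /opt/sensor/zeek-logs/current/capture_loss.log' | sudo tee /etc/cron.d/sensor-health-archive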
Recovery from an outage during an incident. If the sensor comes back up mid-incident, capture immediately resumes but the gap persists. Pull what you have — even partial evidence is usually better than nothing — and explicitly annotate the timeline with the outage boundaries. The attacker's activity during the gap is outside your network visibility; rely on endpoint telemetry for that window and be explicit about where the evidence sources split.
Your sensor's disk is at 85% capacity. You need to free space. Options: delete the oldest 30 days of Zeek logs (reducing retention from 90 to 60 days), compress existing logs with gzip (reducing size by ~70% but making queries slower), or add a 20 GB virtual disk to the VM.
The right answer depends on your MTTD. If your organization detects incidents within 30 days, reducing to 60 days of retention still covers the investigation window. If MTTD is longer, you need the 90 days.
Compression is a middle ground — keep 90 days but accept slower queries on compressed files. For the lab, adding storage is easiest (expand the VM disk in the hypervisor).
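If you take the retention route instead, the age-out is a one-liner. This sketch assumes Zeek's date-named archive directories live directly under zeek-logs/; verify the path before running it:

# Remove rotated log directories older than 60 days
sudo find /opt/sensor/zeek-logs/ -maxdepth 1 -type d -name '20*' -mtime +60 -exec rm -r {} +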
The operational lesson: plan storage capacity before it becomes a problem. The health-check script's disk alert fires at 80%, giving you time to act before 100%.
The sensor produces your investigation evidence. If the sensor is down, your evidence has gaps. If the rules are stale, your detections are blind. If NTP is drifted, your timestamps are wrong. These aren't IT inconveniences — they're evidence integrity failures that directly affect your investigation capability.
In many organizations, the sensor infrastructure is managed by IT operations, but the health verification is security's responsibility. The weekly 5-minute check and monthly validation are security tasks. You're not managing the hardware — you're verifying that the evidence source is reliable.
Try it: Run the complete weekly health check
Setup. Your validated sensor VM.
Task. Run all six weekly check commands from the "Weekly Tasks" section. Save the output to a file: date >> /opt/sensor/health-log.txt && (commands) >> /opt/sensor/health-log.txt.
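One way to capture that run, abbreviated here (include all six checks from the Weekly Tasks section):

{
    date -u
    chronyc sources | grep '^.\*'
    df -h /opt/sensor/
    # ...remaining weekly checks...
} >> /opt/sensor/health-log.txt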
Expected result. All checks pass: NTP synchronised, disk under 70%, no packet loss. The health log file serves as an audit trail for sensor availability.
Debugging branch. If any check fails, refer to the Troubleshooting section and fix the issue before proceeding to NF2.
You've built the sensor and mapped the evidence landscape.
NF0 established why network evidence matters when every other source is compromised. NF1 built your Zeek + Suricata sensor with the 10 investigation query patterns. From here, every module teaches protocol-specific investigation against real attack scenarios.
- DNS deep dive (NF3) — tunnelling detection, DGA analysis, passive DNS infrastructure mapping, and the INC-NE-2026-0227 AiTM phishing DNS trail
- Protocol analysis (NF4–NF7) — HTTP/HTTPS, SMB lateral movement, SSH tunnelling, and email protocol investigation with Zeek metadata and PCAP
- Detection and hunting (NF8–NF11) — Suricata rule writing, C2 beacon detection with JA3, NetFlow analytics, and proactive network threat hunting
- NSM architecture (NF13) — production sensor deployment at 1–10 Gbps with Arkime, Security Onion, and enterprise storage planning
- INC-NE-2026-0830 capstone (NF14) — multi-stage investigation using only network evidence: phishing → domain-fronted C2 → lateral movement → DNS tunnel exfiltration