8.7 Data Collection Rules: Filter, Transform, Route
Introduction
Data Collection Rules (DCRs) are the control plane for all AMA-based data ingestion. They define three things: what data to collect (the source configuration), how to transform it (KQL-based ingestion-time transformation), and where to send it (the destination workspace or custom table). DCRs are the single most powerful cost optimisation tool in Sentinel — they reduce ingestion volume before data is stored and billed.
DCR architecture
A DCR has three components:
Data sources — what to collect. For Windows Security Events: which Event IDs or collection level. For Syslog/CEF: which facilities and severity levels. For custom logs: the API endpoint configuration.
Transformations — how to modify data during ingestion. KQL queries that filter rows, remove columns, parse fields, and enrich data before it reaches the workspace. Transformations execute at ingestion time — filtered rows are never stored and never billed.
Destinations — where to send the data. Typically your Sentinel Log Analytics workspace. A single DCR can send data to multiple destinations (multi-homing), and can route different data types to different tables.
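The three components map directly onto the DCR resource schema. A simplified ARM-style sketch of that structure (the stream name is the real built-in `Microsoft-SecurityEvent` stream; the XPath, names, and workspace path are placeholders):

```json
{
  "properties": {
    "dataSources": {
      "windowsEventLogs": [
        {
          "name": "securityEvents",
          "streams": ["Microsoft-SecurityEvent"],
          "xPathQueries": ["Security!*[System[(EventID=4624 or EventID=4625)]]"]
        }
      ]
    },
    "destinations": {
      "logAnalytics": [
        {
          "name": "sentinelWorkspace",
          "workspaceResourceId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>"
        }
      ]
    },
    "dataFlows": [
      {
        "streams": ["Microsoft-SecurityEvent"],
        "destinations": ["sentinelWorkspace"],
        "transformKql": "source | where EventID != 4634"
      }
    ]
  }
}
```

The `dataFlows` entry is the glue: it binds a source stream to a destination and optionally applies a `transformKql` query on the way through.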
Ingestion-time transformations
Transformations are KQL queries that execute during ingestion. They are the most effective cost reduction mechanism because filtered data is never ingested — you do not pay for it.
Transformation 1: Row filtering. Remove events that have no investigation value. Example: filter out routine Windows logon events from service accounts that generate thousands of events per day.
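A minimal row-filtering sketch. The transformation query always reads from the virtual `source` table; the `svc-` naming convention for custom service accounts is an assumption, so adapt the predicate to your environment:

```kql
// Drop routine successful logons (4624) from well-known and svc-* service accounts
source
| where not(
    EventID == 4624
    and (Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE")
         or Account contains "svc-")
)
```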
Transformation 2: Column removal. Drop columns that your analytics rules and investigation queries never reference. Each removed column reduces per-event size.
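A column-removal sketch. The column names below are illustrative; confirm against your own analytics rules and saved queries before dropping anything:

```kql
// Remove columns that no rule or investigation query references
source
| project-away Task, Keywords, OpcodeValue
```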
Transformation 3: Field parsing. Extract structured fields from unstructured data during ingestion, so KQL queries at query time do not need to re-parse.
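A parsing sketch, assuming the raw message contains `key=value` pairs such as `user=` and `src=` (the layout is hypothetical):

```kql
// Extract structured fields once at ingestion instead of in every query
source
| parse SyslogMessage with * "user=" UserName " src=" SourceIP " " *
```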
Transformation 4: Aggregation. For very high-frequency data (performance counters, heartbeats), aggregate to reduce row count. Note that the KQL subset supported in ingestion-time transformations does not include summarize, so aggregation of this kind is typically implemented with Log Analytics summary rules rather than inside the DCR itself.
Cost impact of DCR transformations
DCR transformations are the most granular cost control available. Unlike log tier changes (which affect entire tables) or collection level changes (which use predefined sets), DCR transformations let you filter individual events and columns.
Example cost impact:
A workspace ingests 20 GB/day of SecurityEvent data at the “Common” collection level. Analysis reveals that 40% of events are Event ID 4624 (successful logon) from service accounts — routine, high-volume, and rarely investigated. A DCR transformation that filters these events reduces SecurityEvent ingestion from 20 GB/day to 12 GB/day — an 8 GB/day reduction.
At pay-as-you-go rates, 8 GB/day savings = approximately $1,200/month. The DCR transformation cost: zero (no additional charge for transformations). The implementation time: 30 minutes to write and test the KQL, 5 minutes to update the DCR.
DCR creation walkthrough: step by step
Creating a DCR in the Azure portal:
- Navigate to Azure portal → Monitor → Data Collection Rules → Create.
- Basics: name (e.g., dcr-securityevents-common-uksouth), resource group, region (must match the workspace region), platform type (Windows or Linux).
- Resources: select the VMs/servers this DCR applies to. You can add individual VMs or use tags to select groups.
- Collect: add a data source. For Windows Security Events: select “Windows Event Logs” → “Security” → choose the collection level or enter custom XPath queries.
- Destination: select your Sentinel Log Analytics workspace.
- Review + create.
Adding a transformation to an existing DCR:
Transformations are added via the Azure portal (DCR → edit → add transformation) or via the REST API / ARM template. The transformation is a KQL query that begins with source (representing the incoming data stream).
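A sketch of what such a transformation might look like as the `transformKql` value in a data flow (the filter choices are illustrative, not a recommendation):

```kql
// transformKql: the query always starts from the virtual table 'source'
source
| where EventID != 4634        // drop logoff events, rarely investigated
| project-away Task, Keywords  // drop columns no rule references
```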
Testing a transformation before deployment: You cannot directly test a DCR transformation against live data before deploying it. Instead: write the equivalent KQL query against the target table in the workspace (replacing source with the table name), run it against historical data, and verify the results match your expectations.
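For example, a dry run of a service-account filter against stored data might look like this (the filter predicate is a placeholder for your actual transformation logic):

```kql
// Estimate the reduction the transformation would have produced last month
SecurityEvent
| where TimeGenerated > ago(30d)
| summarize
    Total = count(),
    WouldRemain = countif(not(EventID == 4624 and Account contains "svc-"))
| extend ReductionPercent = round(100.0 * (Total - WouldRemain) / Total, 1)
```

If `ReductionPercent` is far from what you expected, refine the predicate before touching the DCR.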
Workspace transformation vs DCR transformation
There are two places where KQL transformations can run: in the DCR (at ingestion) and in the workspace (as a workspace transformation on the table).
DCR transformation — runs on the AMA before data reaches the workspace. Used for agent-based data sources (Windows Security Events, Syslog, CEF). Requires the data to flow through AMA.
Workspace transformation — runs in the workspace when data arrives from any source, including service-to-service connectors. Used for Entra ID, Azure Activity, and other connectors that do not use AMA.
When to use workspace transformations: For filtering data from service-to-service connectors. Example: filter AADNonInteractiveUserSignInLogs to exclude routine token refreshes from a specific application:
Navigate to Log Analytics workspace → Tables → AADNonInteractiveUserSignInLogs → Create a transformation.
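A sketch of the transformation query itself. The AppId is a placeholder for the application generating the routine token refreshes; note that ResultType in this table is a string:

```kql
// Keep everything except successful, non-interactive sign-ins from one known app
source
| where not(AppId == "00000000-0000-0000-0000-000000000000" and ResultType == "0")
```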
This filters at the workspace level — equivalent to a DCR transformation but applicable to connectors that bypass AMA.
Multi-destination routing
A single DCR can route data to multiple destinations. Use cases:
Primary + secondary workspace: Send security events to the Sentinel workspace (for detection and investigation) and a copy to a long-term storage workspace (for compliance retention at lower cost).
Table routing: Send different event types to different tables within the same workspace. High-priority security events go to an Analytics-tier table. Low-priority operational events go to a Basic-tier table. This achieves the tier optimisation from Module 7.4 at the ingestion level.
Cross-workspace for MSSPs: An MSSP’s log forwarder can send the same CEF data to both the customer’s workspace and the MSSP’s aggregation workspace — enabling both customer-specific and cross-customer detection.
Real-world DCR transformation recipes
These are production-tested transformations that address the most common cost and quality challenges.
Recipe 1: Filter Windows service account noise. Service accounts (SYSTEM, LOCAL SERVICE, NETWORK SERVICE, and custom svc-* accounts) generate the majority of logon events on busy servers. These events are routine and rarely investigated.
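A sketch of this recipe. The logon-type list and the `svc-` prefix are assumptions to adapt; the structure (filter only routine logon types, keep everything else) is the point:

```kql
// Drop network (3), batch (4), and service (5) logons from service accounts.
// Interactive (2) and RemoteInteractive (10) logons pass through untouched.
source
| where not(
    EventID == 4624
    and LogonType in (3, 4, 5)
    and (Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE")
         or Account contains "svc-")
)
```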
Typical impact: 30-50% reduction in SecurityEvent volume on servers with many service accounts. Security impact: none — service account network and service logons are routine. Interactive (type 2) and remote (type 10) logons from service accounts are retained because those are anomalous.
Recipe 2: Reduce Entra ID non-interactive sign-in volume. Token broker and authentication broker events constitute 60-80% of non-interactive sign-ins in most tenants.
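A sketch of this recipe as a workspace transformation (this table arrives via a service-to-service connector, not AMA). The app names in the exclusion list are illustrative:

```kql
// Keep all failures, and all sign-ins from apps outside the broker exclusion list
source
| where ResultType != "0"
    or AppDisplayName !in ("Microsoft Authentication Broker", "Windows Sign In")
```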
Typical impact: 60-80% reduction. Retains: all failures (potential token issues), all sign-ins from non-standard apps, and all sign-ins from apps not in the exclusion list.
Recipe 3: Drop verbose columns from Syslog. Some Syslog messages include columns that are never queried.
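A sketch of this recipe. The column choices below are illustrative; audit your own queries for references to each column before dropping it:

```kql
// Drop rarely-queried Syslog columns to shrink per-record size
source
| project-away ProcessID, HostIP
```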
Typical impact: 10-15% reduction in per-record size. Multiplied across millions of records, this saves measurable storage.
Recipe 4: Enrich and filter CEF in one transformation. Parse the Message field to extract fields not automatically mapped by CEF, and simultaneously filter low-value events.
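A sketch of this recipe against the CEF/CommonSecurityLog stream. The `policy=`/`rule=` layout inside Message is an assumption about the vendor's format; substitute your device's actual key names:

```kql
// Drop accepts, keep deny/drop/block, and parse vendor fields out of Message
source
| where DeviceAction !~ "accept"
| parse Message with * "policy=" PolicyName " rule=" RuleId " " *
```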
This transformation filters accept events (keeping only deny/drop/block) AND enriches remaining events with parsed custom fields — combining cost reduction with data quality improvement in a single DCR.
DCR as ARM template: infrastructure as code
For repeatable, version-controlled DCR management, define DCRs as ARM or Bicep templates. This enables: Git-based version control, pull request review for changes, CI/CD deployment across multiple workspaces, and rollback via Git revert.
Simplified Bicep structure for a SecurityEvent DCR:
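A minimal sketch under stated assumptions: the names, the placeholder XPath, and the sample `transformKql` are all illustrative, and only the SecurityEvent data source is shown.

```bicep
param workspaceResourceId string
param location string = 'uksouth'

resource dcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
  name: 'dcr-securityevents-common-uksouth'
  location: location
  properties: {
    dataSources: {
      windowsEventLogs: [
        {
          name: 'securityEvents'
          streams: ['Microsoft-SecurityEvent']
          xPathQueries: ['Security!*'] // replace with the Common-level XPath set
        }
      ]
    }
    destinations: {
      logAnalytics: [
        {
          name: 'sentinel'
          workspaceResourceId: workspaceResourceId
        }
      ]
    }
    dataFlows: [
      {
        streams: ['Microsoft-SecurityEvent']
        destinations: ['sentinel']
        transformKql: 'source | where EventID != 4634'
      }
    ]
  }
}
```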
Store this template in the same Git repository as your analytics rules (Module 7.12 governance framework). Deploy through CI/CD pipeline. Every DCR change is tracked, reviewed, and reversible.
DCR lifecycle management
Creation. Design the DCR based on the data source requirements and cost constraints. Start with a conservative configuration (collect more, filter less). Document the rationale.
Tuning. After 2-4 weeks of data collection, analyse the ingested data to identify high-volume, low-value patterns. Add targeted transformations. Monitor the impact.
Maintenance. When analytics rules change (new rules added, old rules deprecated), review whether the DCR still collects the events the rules need. A DCR that filters Event ID 4648 is fine until someone creates a rule that detects explicit credential logons — then the filter must be removed.
Retirement. When a data source is decommissioned, disable the DCR (do not delete it immediately). After the retention period for the last ingested data expires, delete the DCR. Keep the ARM template in Git for reference.
DCR management best practices
Version control. Store DCR definitions as ARM/Bicep templates in Git alongside your analytics rules (Module 7.12 governance framework). Changes to DCR transformations should go through the same review process as analytics rule changes — a transformation that filters the wrong events creates a detection blind spot.
Testing before deployment. Before applying a DCR transformation, run the transformation query against historical data to verify it produces the expected results. “If I had applied this filter last month, would any investigation have been affected?” Only deploy after confirming no investigation-relevant data is lost.
Monitoring transformation impact. After deploying a DCR transformation, monitor the Usage table for the expected volume reduction. If the reduction is larger or smaller than expected, the transformation may be filtering more or fewer events than intended.
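A monitoring query of this shape works well (Usage reports quantity in MB, hence the division):

```kql
// Daily billable SecurityEvent volume; look for a step change on the deployment date
Usage
| where TimeGenerated > ago(30d)
| where DataType == "SecurityEvent" and IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1000 by bin(TimeGenerated, 1d)
| render timechart
```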
A sudden drop on the day the DCR was deployed confirms the transformation is working. A larger-than-expected drop warrants investigation — you may be filtering events you intended to keep.
DCR transformations filter data before ingestion. Filtered events are never stored in the workspace and cannot be recovered. If an investigation requires an event that was filtered by a DCR, the event is gone. Test transformations against historical data before deployment. Start conservative (filter obviously low-value events) and expand the filter over time as you confirm which events are never needed.
Try it yourself
If you have Windows Security Events flowing into your workspace, run the Usage query above to establish your current SecurityEvent daily volume. Then examine the events to identify high-volume, low-value patterns: SecurityEvent | where TimeGenerated > ago(1d) | summarize Count = count() by EventID, Account | order by Count desc | take 20. Identify events that could be filtered without affecting investigation capability. Note: do not modify the DCR in production without testing — this exercise is analysis only.
What you should observe
The top events by volume are typically: 4624 (logon) from service accounts, 4672 (special privileges assigned) for routine administrative operations, and 8002 (NTLM authentication). These high-volume events are candidates for DCR filtering — but verify that no analytics rules query them first.
Knowledge check
Check your understanding
1. You deploy a DCR transformation that filters 40% of SecurityEvent data. Two weeks later, an investigation requires an event that was filtered. Can you recover it?