8.7 Data Collection Rules: Filter, Transform, Route

14-18 hours · Module 8


Introduction

Data Collection Rules (DCRs) are the control plane for all AMA-based data ingestion. They define three things: what data to collect (the source configuration), how to transform it (KQL-based ingestion-time transformation), and where to send it (the destination workspace or custom table). DCRs are the single most powerful cost optimisation tool in Sentinel — they reduce ingestion volume before data is stored and billed.


DCR architecture

A DCR has three components:

Data sources — what to collect. For Windows Security Events: which Event IDs or collection level. For Syslog/CEF: which facilities and severity levels. For custom logs: the API endpoint configuration.

Transformations — how to modify data during ingestion. KQL queries that filter rows, remove columns, parse fields, and enrich data before it reaches the workspace. Transformations execute at ingestion time — filtered rows are never stored and never billed.

Destinations — where to send the data. Typically your Sentinel Log Analytics workspace. A single DCR can send data to multiple destinations (multi-homing), and can route different data types to different tables.

// Conceptual DCR transformation pipeline:
// SOURCE -> [Filter rows] -> [Remove columns] -> [Parse fields] -> DESTINATION
//
// Example: Filter Windows Security Events to only logon failures
// source
// | where EventID == 4625
// | project-away Channel, Task, Opcode  // Remove unnecessary columns

Ingestion-time transformations

Transformations are KQL queries that execute during ingestion. They are the most effective cost reduction mechanism because filtered data is never ingested — you do not pay for it.

Transformation 1: Row filtering. Remove events that have no investigation value. Example: filter out routine Windows logon events from service accounts that generate thousands of events per day.

// DCR transformation: exclude service account logons
source
| where not(Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE"))

Transformation 2: Column removal. Drop columns that your analytics rules and investigation queries never reference. Each removed column reduces per-event size.

// DCR transformation: remove verbose columns from SecurityEvent
source
| project-away SourceComputerId, MG, ManagementGroupName,
    AuthenticationPackageName, LmPackageName, TransmittedServices

Transformation 3: Field parsing. Extract structured fields from unstructured data during ingestion, so KQL queries at query time do not need to re-parse.

// DCR transformation: parse custom application log into structured fields
source
| extend ParsedMessage = parse_json(RawData)
| extend UserAction = tostring(ParsedMessage.action)
| extend UserId = tostring(ParsedMessage.user_id)
| extend ResourceAccessed = tostring(ParsedMessage.resource)
| project-away RawData  // Remove the raw message after parsing

Transformation 4: Aggregation. For very high-frequency data (performance counters, heartbeats), aggregate to reduce row count.

// DCR transformation: aggregate performance data to 1-minute intervals
source
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 1m), Computer, ObjectName, CounterName

Cost impact of DCR transformations

DCR transformations are the most granular cost control available. Unlike log tier changes (which affect entire tables) or collection level changes (which use predefined sets), DCR transformations let you filter individual events and columns.

Example cost impact:

A workspace ingests 20 GB/day of SecurityEvent data at the “Common” collection level. Analysis reveals that 40% of events are Event ID 4624 (successful logon) from service accounts — routine, high-volume, and rarely investigated. A DCR transformation that filters these events reduces SecurityEvent ingestion from 20 GB/day to 12 GB/day — an 8 GB/day reduction.

At pay-as-you-go rates, 8 GB/day savings = approximately $1,200/month. The DCR transformation cost: zero (no additional charge for transformations). The implementation time: 30 minutes to write and test the KQL, 5 minutes to update the DCR.
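Before committing to a filter like this, you can size the opportunity in your own workspace. The following is a sketch, not a production query: the service account list is illustrative, and the 7-day window is an assumption you should adjust.

```kusto
// Estimate what share of SecurityEvent volume the proposed filter would remove
// (service account list is illustrative -- adjust to your environment)
SecurityEvent
| where TimeGenerated > ago(7d)
| summarize TotalEvents = count(),
            FilterableEvents = countif(EventID == 4624
                and Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE"))
| extend FilterablePct = round(100.0 * FilterableEvents / TotalEvents, 1)
```

If FilterablePct lands near the 40% figure described above, the cost case for the transformation is straightforward to present.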


DCR creation walkthrough: step by step

Creating a DCR in the Azure portal:

  1. Navigate to Azure portal → Monitor → Data Collection Rules → Create.
  2. Basics: name (e.g., dcr-securityevents-common-uksouth), resource group, region (must match the workspace region), platform type (Windows or Linux).
  3. Resources: select the VMs/servers this DCR applies to. You can add individual VMs or use tags to select groups.
  4. Collect: add a data source. For Windows Security Events: select “Windows Event Logs” → “Security” → choose the collection level or enter custom XPath queries.
  5. Destination: select your Sentinel Log Analytics workspace.
  6. Review + create.

Adding a transformation to an existing DCR:

Transformations are added via the Azure portal (DCR → edit → add transformation) or via the REST API / ARM template. The transformation is a KQL query that begins with source (representing the incoming data stream).

// Example: transformation that filters service account logons
// and removes unnecessary columns
source
| where not(Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE"))
| where not(Account startswith "svc-")
| project-away SourceComputerId, MG, ManagementGroupName,
    AuthenticationPackageName, LmPackageName

Testing a transformation before deployment: You cannot directly test a DCR transformation against live data before deploying it. Instead: write the equivalent KQL query against the target table in the workspace (replacing source with the table name), run it against historical data, and verify the results match your expectations.

// Test: what would the transformation filter out?
SecurityEvent
| where TimeGenerated > ago(7d)
| where Account has_any ("SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE")
    or Account startswith "svc-"
| summarize FilteredEvents = count(), FilteredGB = count() * 0.001 / 1024  // assumes ~1 KB per event
// This tells you how much data the transformation would remove

Workspace transformation vs DCR transformation

There are two places where KQL transformations can run: in the DCR (at ingestion) and in the workspace (as a workspace transformation on the table).

DCR transformation — runs on the AMA before data reaches the workspace. Used for agent-based data sources (Windows Security Events, Syslog, CEF). Requires the data to flow through AMA.

Workspace transformation — runs in the workspace when data arrives from any source, including service-to-service connectors. Used for Entra ID, Azure Activity, and other connectors that do not use AMA.

When to use workspace transformations: For filtering data from service-to-service connectors. Example: filter AADNonInteractiveUserSignInLogs to exclude routine token refreshes from a specific application:

Navigate to Log Analytics workspace → Tables → AADNonInteractiveUserSignInLogs → Create a transformation.

// Workspace transformation: filter routine non-interactive sign-ins
source
| where not(AppDisplayName == "Azure AD Token Broker" and ResultType == "0")
| where not(AppDisplayName == "Microsoft Authentication Broker" and ResultType == "0")

This filters at the workspace level — equivalent to a DCR transformation but applicable to connectors that bypass AMA.


Multi-destination routing

A single DCR can route data to multiple destinations. Use cases:

Primary + secondary workspace: Send security events to the Sentinel workspace (for detection and investigation) and a copy to a long-term storage workspace (for compliance retention at lower cost).

Table routing: Send different event types to different tables within the same workspace. High-priority security events go to an Analytics-tier table. Low-priority operational events go to a Basic-tier table. This achieves the tier optimisation from Module 7.4 at the ingestion level.

Cross-workspace for MSSPs: An MSSP’s log forwarder can send the same CEF data to both the customer’s workspace and the MSSP’s aggregation workspace — enabling both customer-specific and cross-customer detection.
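Expressed in the commented-Bicep convention this section uses for templates, a multi-destination dataFlow might look like the fragment below. This is a conceptual sketch: the workspace names and resource IDs are placeholders, not real values.

```bicep
// Conceptual dataFlows fragment: one stream, two workspace destinations
// (names and resource IDs are placeholders)
// dataFlows: [{
//   streams: ['Microsoft-SecurityEvent']
//   destinations: ['sentinel-workspace', 'compliance-workspace']
// }]
// destinations: {
//   logAnalytics: [
//     { name: 'sentinel-workspace',   workspaceResourceId: '<sentinel workspace resource id>' }
//     { name: 'compliance-workspace', workspaceResourceId: '<retention workspace resource id>' }
//   ]
// }
```

The same stream fans out to both destinations; each destination can sit in a different pricing tier, which is what makes the primary-plus-compliance pattern economical.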


Real-world DCR transformation recipes

These are production-tested transformations that address the most common cost and quality challenges.

Recipe 1: Filter Windows service account noise. Service accounts (SYSTEM, LOCAL SERVICE, NETWORK SERVICE, and custom svc-* accounts) generate the majority of logon events on busy servers. These events are routine and rarely investigated.

// DCR transformation for SecurityEvent
source
| where not(
    EventID == 4624
    and (Account has "SYSTEM" or Account has "LOCAL SERVICE"
         or Account has "NETWORK SERVICE" or Account startswith "svc-")
    and LogonType in (3, 5)
)

Typical impact: 30-50% reduction in SecurityEvent volume on servers with many service accounts. Security impact: none — service account network and service logons are routine. Interactive (type 2) and remote (type 10) logons from service accounts are retained because those are anomalous.

Recipe 2: Reduce Entra ID non-interactive sign-in volume. Token broker and authentication broker events constitute 60-80% of non-interactive sign-ins in most tenants.

// Workspace transformation for AADNonInteractiveUserSignInLogs
source
| where not(
    ResultType == "0"
    and AppDisplayName in (
        "Azure AD Token Broker",
        "Microsoft Authentication Broker",
        "Windows Sign In",
        "Microsoft Account Controls V2")
)

Typical impact: 60-80% reduction. Retains: all failures (potential token issues), all sign-ins from non-standard apps, and all sign-ins from apps not in the exclusion list.

Recipe 3: Drop verbose columns from Syslog. Some Syslog messages include columns that are never queried.

// DCR transformation for Syslog
source
| project-away _ResourceId, _SubscriptionId, MG, ManagementGroupName,
    SourceSystem, TenantId, Type

Typical impact: 10-15% reduction in per-record size. Multiplied across millions of records, this saves measurable storage.

Recipe 4: Enrich and filter CEF in one transformation. Parse the Message field to extract fields not automatically mapped by CEF, and simultaneously filter low-value events.

// DCR transformation for CommonSecurityLog
source
| where DeviceAction !in ("accept", "allow", "permit", "pass")
| extend FirewallRule = extract(@"rule=(\S+)", 1, AdditionalExtensions)
| extend SessionDuration = extract(@"duration=(\d+)", 1, AdditionalExtensions)

This transformation filters accept events (keeping only deny/drop/block) AND enriches remaining events with parsed custom fields — combining cost reduction with data quality improvement in a single DCR.


DCR as ARM template: infrastructure as code

For repeatable, version-controlled DCR management, define DCRs as ARM or Bicep templates. This enables: Git-based version control, pull request review for changes, CI/CD deployment across multiple workspaces, and rollback via Git revert.

Simplified Bicep structure for a SecurityEvent DCR:

// Bicep DCR: conceptual structure (not KQL)
// resource dcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
//   name: 'dcr-securityevents-common'
//   location: 'uksouth'
//   properties: {
//     dataSources: {
//       windowsEventLogs: [{
//         name: 'SecurityEvents'
//         streams: ['Microsoft-SecurityEvent']
//         xPathQueries: [
//           'Security!*[System[(EventID=4624 or EventID=4625 ...)]]'
//         ]
//       }]
//     }
//     dataFlows: [{
//       streams: ['Microsoft-SecurityEvent']
//       destinations: ['sentinel-workspace']
//       transformKql: 'source | where not(Account has "SYSTEM" ...)'
//     }]
//     destinations: {
//       logAnalytics: [{
//         name: 'sentinel-workspace'
//         workspaceResourceId: '/subscriptions/.../workspaces/...'
//       }]
//     }
//   }
// }

Store this template in the same Git repository as your analytics rules (Module 7.12 governance framework). Deploy through CI/CD pipeline. Every DCR change is tracked, reviewed, and reversible.


DCR lifecycle management

Creation. Design the DCR based on the data source requirements and cost constraints. Start with a conservative configuration (collect more, filter less). Document the rationale.

Tuning. After 2-4 weeks of data collection, analyse the ingested data to identify high-volume, low-value patterns. Add targeted transformations. Monitor the impact.
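The tuning step above starts with volume analysis. A minimal sketch for surfacing candidates, assuming SecurityEvent is the table being tuned (the 14-day window and top-15 cut-off are arbitrary choices):

```kusto
// Tuning aid: rank recent SecurityEvent volume by EventID and account type
// to surface high-volume candidates for DCR filtering
SecurityEvent
| where TimeGenerated > ago(14d)
| summarize Events = count() by EventID, AccountType
| top 15 by Events
```

Cross-reference each high-volume EventID against your analytics rules before filtering it, as the Maintenance step below describes.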

Maintenance. When analytics rules change (new rules added, old rules deprecated), review whether the DCR still collects the events the rules need. A DCR that filters Event ID 4648 is fine until someone creates a rule that detects explicit credential logons — then the filter must be removed.

Retirement. When a data source is decommissioned, disable the DCR (do not delete it immediately). After the retention period for the last ingested data expires, delete the DCR. Keep the ARM template in Git for reference.


DCR management best practices

Version control. Store DCR definitions as ARM/Bicep templates in Git alongside your analytics rules (Module 7.12 governance framework). Changes to DCR transformations should go through the same review process as analytics rule changes — a transformation that filters the wrong events creates a detection blind spot.

Testing before deployment. Before applying a DCR transformation, run the transformation query against historical data to verify it produces the expected results. “If I had applied this filter last month, would any investigation have been affected?” Only deploy after confirming no investigation-relevant data is lost.

Monitoring transformation impact. After deploying a DCR transformation, monitor the Usage table for the expected volume reduction. If the reduction is larger or smaller than expected, the transformation may be filtering more or fewer events than intended.

// Monitor daily ingestion volume before and after DCR change
Usage
| where TimeGenerated > ago(14d)
| where DataType == "SecurityEvent"
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)
| order by TimeGenerated asc

A sudden drop on the day the DCR was deployed confirms the transformation is working. A larger-than-expected drop warrants investigation — you may be filtering events you intended to keep.
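To quantify that drop rather than eyeball the daily series, you can compare the average daily volume either side of the deployment date. The date below is hypothetical; substitute the day your DCR change went live.

```kusto
// Compare average daily billable volume before and after the DCR change
// (deployment date is hypothetical -- substitute your own)
let DeployDate = datetime(2025-01-15);
Usage
| where TimeGenerated > ago(28d)
| where DataType == "SecurityEvent" and IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by Day = bin(TimeGenerated, 1d)
| summarize AvgBeforeGB = round(avgif(DailyGB, Day < DeployDate), 2),
            AvgAfterGB  = round(avgif(DailyGB, Day >= DeployDate), 2)
```

The difference between AvgBeforeGB and AvgAfterGB is the measured saving to report against the estimate you made before deployment.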

Every filtered event is permanently lost

DCR transformations filter data before ingestion. Filtered events are never stored in the workspace and cannot be recovered. If an investigation requires an event that was filtered by a DCR, the event is gone. Test transformations against historical data before deployment. Start conservative (filter obviously low-value events) and expand the filter over time as you confirm which events are never needed.

Try it yourself

If you have Windows Security Events flowing into your workspace, run the Usage query above to establish your current SecurityEvent daily volume. Then examine the events to identify high-volume, low-value patterns:

SecurityEvent
| where TimeGenerated > ago(1d)
| summarize Count = count() by EventID, Account
| order by Count desc
| take 20

Identify events that could be filtered without affecting investigation capability. Note: do not modify the DCR in production without testing — this exercise is analysis only.

What you should observe

The top events by volume are typically: 4624 (logon) from service accounts, 4672 (special privileges assigned) for routine administrative operations, and 8002 (NTLM authentication). These high-volume events are candidates for DCR filtering — but verify that no analytics rules query them first.


Knowledge check

Check your understanding

1. You deploy a DCR transformation that filters 40% of SecurityEvent data. Two weeks later, an investigation requires an event that was filtered. Can you recover it?

No. DCR transformations filter data before ingestion. Filtered events are never stored in the workspace and cannot be recovered from Sentinel. If the event was also captured by another system (the source device's local event log, a backup, or a secondary SIEM), it may be recoverable from that source. This is why DCR transformations must be tested against historical data before deployment — every filter decision is irreversible.
Yes — restore from the Archive tier
Yes — run a search job against filtered data
Yes — Microsoft retains filtered data for 30 days