2.4 Parsing Semi-Structured Data

90 minutes · Module 2 · Free


Security logs are messy. Many fields contain JSON objects, pipe-delimited strings, or nested arrays stuffed into a single column. Conditional access results are a JSON array inside a sign-in log field. Device details are a JSON object. URL parameters are query strings. To investigate effectively, you need to pull structured data out of these semi-structured fields.

This is where most analysts get stuck

Writing where and summarize queries is straightforward. Parsing a JSON array nested inside a dynamic column is where queries start failing and analysts start guessing. Master this subsection and you will be able to investigate fields that most of your colleagues skip entirely.

parse_json() — Working with JSON fields

Many M365 security tables store complex data as JSON in single columns. Two of the most common you will encounter are LocationDetails and DeviceDetail in SigninLogs:

SigninLogs
| where TimeGenerated > ago(1d)
| where ResultType == "0"
| extend Country = tostring(LocationDetails.countryOrRegion)
| extend City = tostring(LocationDetails.city)
| extend Browser = tostring(DeviceDetail.browser)
| extend OS = tostring(DeviceDetail.operatingSystem)
| extend IsCompliant = tostring(DeviceDetail.isCompliant)
| project TimeGenerated, UserPrincipalName, Country, City, Browser, OS, IsCompliant

When a column is already typed as dynamic (like LocationDetails in SigninLogs), you can access properties directly with dot notation. When a column is stored as a string, wrap it in parse_json() first:

| extend ParsedDetails = parse_json(SomeStringColumn)
| extend Value = tostring(ParsedDetails.propertyName)
Always wrap with tostring()

Dynamic field access returns a dynamic type. Most operators expect strings. Wrapping with tostring() ensures your comparisons and filters work correctly. Without it, where Country == "GB" may silently fail because you are comparing a dynamic value to a string.
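As a quick, self-contained illustration (using print with a literal JSON string rather than a real table, so the values here are made up), the safe pattern looks like this:

print LocationDetails = parse_json('{"countryOrRegion": "GB", "city": "London"}')
| extend Country = tostring(LocationDetails.countryOrRegion)
| where Country == "GB"

Dropping the tostring() and filtering on LocationDetails.countryOrRegion directly compares a dynamic value to a string, which is exactly the pattern that can silently return no rows.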

Conditional access policy details — a real parsing challenge

The ConditionalAccessPolicies field in SigninLogs is a JSON array containing every conditional access policy that was evaluated during the sign-in. Each element has the policy name, the result (success, failure, notApplied), and the enforcement mode (enforced, reportOnly). This is critical data — but it is buried in a nested array.

SigninLogs
| where TimeGenerated > ago(1d)
| where ConditionalAccessStatus == "failure"
| mv-expand CAPolicy = parse_json(ConditionalAccessPolicies)
| extend PolicyName = tostring(CAPolicy.displayName)
| extend PolicyResult = tostring(CAPolicy.result)
| where PolicyResult == "failure"
| project TimeGenerated, UserPrincipalName, PolicyName, PolicyResult, IPAddress

This query finds sign-ins blocked by conditional access, expands the policy array so each policy gets its own row, extracts the policy name and result, and filters to only the policies that actually blocked the sign-in. The output tells you exactly which policy blocked which user, and from where.

mv-expand — Expanding arrays into rows

mv-expand takes a column containing an array and creates one row per element. Without it, you cannot filter or aggregate on individual array values.

SecurityAlert
| where TimeGenerated > ago(7d)
| mv-expand Entity = parse_json(Entities)
| extend EntityType = tostring(Entity.Type)
| extend EntityValue = tostring(Entity.Name)
| where EntityType == "account"
| summarize AlertCount = dcount(AlertName) by EntityValue
| sort by AlertCount desc

This expands the Entities array in security alerts, filters to account entities, and counts how many different alert types each account appears in. A user appearing in 5 different alert types is a stronger compromise indicator than a user in just 1.

parse_url() — URL decomposition

URLs contain structured information that regex makes painful to extract. parse_url() handles it cleanly:

DeviceNetworkEvents
| where TimeGenerated > ago(1d)
| extend ParsedUrl = parse_url(RemoteUrl)
| extend Domain = tostring(ParsedUrl.Host)
| extend Path = tostring(ParsedUrl.Path)
| extend QueryParams = tostring(ParsedUrl["Query Parameters"])
| summarize ConnectionCount = count() by Domain
| sort by ConnectionCount desc
| take 20

extract() — Regex extraction (use sparingly)

When parse_json() and parse_url() do not fit, extract() uses regex to pull data from strings:

| extend Domain = extract(@"@(.+)$", 1, UserPrincipalName)

This extracts the domain from an email address: the regex @(.+)$ captures everything after the @ symbol, and the 1 tells extract() to return the first capture group.

Regex is the last resort, not the first tool

Regex is powerful but slow, error-prone, and hard to maintain. Use parse_json() for JSON, parse_url() for URLs, split() for delimited strings, and extract() only when nothing else works. Your future self (and your teammates) will thank you for readable queries.
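For completeness, here is what split() looks like on a pipe-delimited string. This is a self-contained sketch with a made-up value, since the exact delimited fields vary by table:

print Raw = "PowerShell|HighSeverity|10.0.0.5"
| extend Parts = split(Raw, "|")
| extend Process = tostring(Parts[0]), Severity = tostring(Parts[1]), SourceIP = tostring(Parts[2])
| project Process, Severity, SourceIP

split() returns a dynamic array, so each element still needs tostring() before you compare or summarize on it.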

Try it yourself

Write a query that extracts the browser and operating system from every successful sign-in in the last 7 days, then counts sign-ins per browser/OS combination. Which combination is most common in your environment?
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| extend Browser = tostring(DeviceDetail.browser)
| extend OS = tostring(DeviceDetail.operatingSystem)
| summarize SigninCount = count() by Browser, OS
| sort by SigninCount desc

In most environments, Chrome on Windows dominates. Unusual combinations (like curl on Linux, or Python-urllib on Windows) can indicate automated tooling or attacker activity — especially if they appear with failed sign-ins from unfamiliar IPs.

Check your understanding

1. A sign-in log field contains a JSON array of conditional access policy results. Which operator do you use to create one row per policy?

parse_json()
extend
mv-expand — it expands each array element into its own row

2. Why should you always wrap dynamic field access with tostring()?

Dynamic values cannot be compared to strings without explicit type conversion — string comparisons may silently fail
It makes the query run faster
It is required by KQL syntax

3. When should you use extract() with regex instead of parse_json() or parse_url()?

Always — regex is more powerful
For any string manipulation
Only when dedicated parsing functions do not cover your case — regex is slower, harder to read, and more error-prone