2.4 Parsing Semi-Structured Data
Parsing Semi-Structured Data
Security logs are messy. Many fields contain JSON objects, pipe-delimited strings, or nested arrays stuffed into a single column. Conditional access results are a JSON array inside a sign-in log field. Device details are a JSON object. URL parameters are query strings. To investigate effectively, you need to pull structured data out of these unstructured fields.
Writing where and summarize queries is straightforward. Parsing a JSON array nested inside a dynamic column is where queries start failing and analysts start guessing. Master this subsection and you will be able to investigate fields that most of your colleagues skip entirely.
parse_json() — Working with JSON fields
Many M365 security tables store complex data as JSON strings in single columns. The most common ones you will encounter:
| |
When a column is already typed as dynamic (like LocationDetails in SigninLogs), you can access properties directly with dot notation. When a column is stored as a string, wrap it in parse_json() first:
| |
Dynamic field access returns a dynamic type. Most operators expect strings. Wrapping with tostring() ensures your comparisons and filters work correctly. Without it, where Country == "GB" may silently fail because you are comparing a dynamic value to a string.
Conditional access policy details — a real parsing challenge
The ConditionalAccessPolicies field in SigninLogs is a JSON array containing every conditional access policy that evaluated during the sign-in. Each element has the policy name, result (success, failure, notApplied), and enforcement (enforced, reportOnly). This is critical data — but it is buried in a nested array.
| |
This query: finds sign-ins blocked by conditional access, expands the policy array so each policy gets its own row, extracts the policy name and result, and filters to only the policies that actually blocked the sign-in. This tells you exactly which policy blocked which user and from where.
mv-expand — Expanding arrays into rows
mv-expand takes a column containing an array and creates one row per element. Without it, you cannot filter or aggregate on individual array values.
| |
This expands the Entities array in security alerts, filters to account entities, and counts how many different alert types each account appears in. A user appearing in 5 different alert types is a stronger compromise indicator than a user in just 1.
parse_url() — URL decomposition
URLs contain structured information that regex makes painful to extract. parse_url() handles it cleanly:
| |
extract() — Regex extraction (use sparingly)
When parse_json() and parse_url() do not fit, extract() uses regex to pull data from strings:
| |
This extracts the domain from an email address. The @(.+)$ regex captures everything after the @ symbol.
Regex is powerful but slow, error-prone, and hard to maintain. Use parse_json() for JSON, parse_url() for URLs, split() for delimited strings, and extract() only when nothing else works. Your future self (and your teammates) will thank you for readable queries.
Try it yourself
| |
In most environments, Chrome on Windows dominates. Unusual combinations (like curl on Linux, or Python-urllib on Windows) can indicate automated tooling or attacker activity — especially if they appear with failed sign-ins from unfamiliar IPs.
Check your understanding
1. A sign-in log field contains a JSON array of conditional access policy results. Which operator do you use to create one row per policy?
2. Why should you always wrap dynamic field access with tostring()?
3. When should you use extract() with regex instead of parse_json() or parse_url()?