6.4 Work with String Data in KQL
Work with String Data in KQL
Domain 4 — Manage Security Threats: "Create custom hunting queries by using KQL." String manipulation questions appear in scenarios where you must extract data from unstructured log fields or parse nested JSON.
Introduction
Security log data is not always neatly structured. Email headers arrive as raw text strings. Syslog messages contain IP addresses buried in free-text sentences. JSON fields nest useful data three levels deep. User principal names contain both the username and the domain in a single field. To investigate effectively, you need to extract specific values from these messy formats and turn them into queryable columns.
This subsection teaches you the string manipulation tools in KQL — the operators that split, extract, parse, and transform text data into structured, filterable columns. These skills are used constantly in Modules 8 (connecting log sources that produce unstructured data), 9 (building analytics rules that extract IOCs from alerts), and 11-13 (investigation queries that parse evidence from complex log entries).
Here is what you will learn:
split()— breaking a string into parts at a delimiter (extracting domains from emails, paths from URLs)extract()— pulling values using regular expressions (extracting IPs from free text, patterns from URLs)parse— pattern-based extraction that creates named columns from text (parsing structured messages without regex)substring()— extracting a portion of a string by position- String functions —
tolower(),toupper(),strlen(),strcat(),replace_string(),trim() - Working with JSON/dynamic columns — navigating nested structures with dot notation and
tostring()
split() — breaking strings at a delimiter
What it is
split() takes a string and a delimiter character, and returns an array of parts. You access individual parts using array indexing ([0] for the first part, [1] for the second, etc.).
Why it matters
User principal names are formatted as username@domain. URLs are formatted as protocol://domain/path. Email headers contain key-value pairs separated by specific characters. split() extracts the piece you need.
| |
| UserPrincipalName | Username | Domain |
|---|---|---|
| j.morrison@northgateeng.com | j.morrison | northgateeng.com |
| s.patel@northgateeng.com | s.patel | northgateeng.com |
| where Domain != "northgateeng.com" finds external or federated accounts. | summarize count() by Domain reveals which domains are signing in. During an investigation, finding a sign-in from an unexpected domain is an immediate red flag.split() output in tostring()split() returns a dynamic array. Indexing into it ([0]) returns a dynamic value, not a string. Many subsequent operators (like has, ==) expect strings. Wrapping in tostring() converts the dynamic value to a string, preventing type mismatch errors.
extract() — regex-based extraction
What it is
extract() uses a regular expression to find and return a matching pattern from a string. It is more powerful than split() because it can match complex patterns, not just delimiter positions.
Why it matters
Syslog messages, alert descriptions, and free-text log fields contain embedded data that you cannot extract with split(). An IP address buried in a sentence like “Connection from 198.51.100.44 denied by firewall rule FW-001” requires pattern matching to pull out.
| |
The regex pattern \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} matches any IPv4 address. The 1 in extract() means “return the first capture group” (the content inside parentheses).
You need exactly three regex patterns for 90% of security string extraction: the IPv4 pattern above, an email pattern ([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}), and a URL pattern (https?://[^\s"']+). Keep these as templates. For everything else, use parse (below) which does not require regex.
parse — pattern-based extraction without regex
What it is
parse extracts values from a string using a text pattern with named placeholders. It is simpler than extract() because you define the pattern using the actual text structure, not regular expressions.
Why it matters
Many log messages follow a consistent structure: “User john.smith from IP 198.51.100.44 failed authentication at 14:32.” If the structure is consistent, parse extracts the values without regex.
| |
The parse pattern uses * for “skip anything” and quoted strings for literal text. The unquoted names (ParsedUser, ParsedIP) become new columns containing the extracted values.
parse first, fall back to extract()parse is easier to read, easier to write, and easier to debug than regex. Use it whenever the text has a consistent structure. Fall back to extract() only when the pattern is too variable for parse (e.g., an IP address that could appear anywhere in an arbitrary message).
Working with JSON and dynamic columns
What it is
Many Sentinel tables contain columns of type dynamic — JSON objects or arrays. LocationDetails in SigninLogs is a common example: it contains nested data like {"city": "London", "state": "England", "countryOrRegion": "GB"}. To use these values in filters or projections, you must navigate the JSON structure.
How it works
Use dot notation to navigate nested JSON, and tostring() to convert the result to a usable string:
| |
For deeper nesting, chain dot notation: tostring(AuthenticationDetails[0].authenticationMethod) accesses the first element of an array and extracts the authenticationMethod property.
tostring(), dynamic fields cause type errorsDynamic columns look like strings in the output, but KQL treats them as a different type. where LocationDetails.city == "London" may fail or produce unexpected results. Always wrap in tostring(): where tostring(LocationDetails.city) == "London". This is the most common source of "it looks right but doesn't work" bugs in KQL.
Try it yourself
| |
A user who normally signs in from “GB” showing a sign-in from “RU” is an immediate investigation trigger. This baseline query is the foundation for the impossible travel detection pattern covered in subsection 6.5.
Check your understanding
1. You need to extract the domain from an email address field. Which approach is most appropriate?
split(). Pattern in variable text = extract(). Structured message format = parse. Choose the simplest tool that works.2. Your query filtering on LocationDetails.city == "London" returns 0 results, but you can see "London" in the raw data. Why?
tostring() when comparing or filtering dynamic column values.