6.4 Work with String Data in KQL

8-12 hours · Module 6 · Free

Work with String Data in KQL

SC-200 Exam Objective

Domain 4 — Manage Security Threats: "Create custom hunting queries by using KQL." String manipulation questions appear in scenarios where you must extract data from unstructured log fields or parse nested JSON.

Introduction

Security log data is not always neatly structured. Email headers arrive as raw text strings. Syslog messages contain IP addresses buried in free-text sentences. JSON fields nest useful data three levels deep. User principal names contain both the username and the domain in a single field. To investigate effectively, you need to extract specific values from these messy formats and turn them into queryable columns.

This subsection teaches you the string manipulation tools in KQL — the operators that split, extract, parse, and transform text data into structured, filterable columns. These skills are used constantly in Modules 8 (connecting log sources that produce unstructured data), 9 (building analytics rules that extract IOCs from alerts), and 11-13 (investigation queries that parse evidence from complex log entries).

Here is what you will learn:

  • split() — breaking a string into parts at a delimiter (extracting domains from emails, paths from URLs)
  • extract() — pulling values using regular expressions (extracting IPs from free text, patterns from URLs)
  • parse — pattern-based extraction that creates named columns from text (parsing structured messages without regex)
  • substring() — extracting a portion of a string by position
  • String functions — tolower(), toupper(), strlen(), strcat(), replace_string(), trim()
  • Working with JSON/dynamic columns — navigating nested structures with dot notation and tostring()
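
The string functions in the list above all follow the same pattern: they take a string (plus arguments) and return a transformed string. A quick standalone sketch using print, so it runs without any table — the literal value is hypothetical:

// Illustrating the common string functions on a literal value
print Original = "  Admin@Northgateeng.COM  "
| extend Trimmed = trim(" ", Original)                          // strip surrounding spaces
| extend Lowered = tolower(Trimmed)                             // normalize case for comparisons
| extend Length = strlen(Trimmed)                               // character count
| extend FirstFive = substring(Trimmed, 0, 5)                   // 5 characters starting at position 0
| extend Labeled = strcat("upn: ", Lowered)                     // concatenate strings
| extend Masked = replace_string(Lowered, "admin", "****")      // replace a substring

Normalizing case with tolower() before comparing is a common habit in hunting queries, since usernames and domains are not guaranteed to arrive in consistent case.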

split() — breaking strings at a delimiter

What it is

split() takes a string and a delimiter character, and returns an array of parts. You access individual parts using array indexing ([0] for the first part, [1] for the second, etc.).

Why it matters

User principal names are formatted as username@domain. URLs are formatted as protocol://domain/path. Email headers contain key-value pairs separated by specific characters. split() extracts the piece you need.

SigninLogs
| where TimeGenerated > ago(7d)
| extend Username = tostring(split(UserPrincipalName, "@")[0])
| extend Domain = tostring(split(UserPrincipalName, "@")[1])
| project TimeGenerated, UserPrincipalName, Username, Domain
| take 5
Expected Output

UserPrincipalName             Username     Domain
j.morrison@northgateeng.com   j.morrison   northgateeng.com
s.patel@northgateeng.com      s.patel      northgateeng.com

What to look for: The UPN is split into its two components. Now you can filter or group by domain: | where Domain != "northgateeng.com" finds external or federated accounts. | summarize count() by Domain reveals which domains are signing in. During an investigation, finding a sign-in from an unexpected domain is an immediate red flag.

Always wrap split() output in tostring()

split() returns a dynamic array. Indexing into it ([0]) returns a dynamic value, not a string. Many subsequent operators (like has, ==) expect strings. Wrapping in tostring() converts the dynamic value to a string, preventing type mismatch errors.
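
A minimal illustration of the difference, using print with a literal UPN so it runs without table data:

// Indexing into split() yields a dynamic value; tostring() makes it a string
print UPN = "j.morrison@northgateeng.com"
| extend RawPart = split(UPN, "@")[0]              // dynamic — comparisons may behave unexpectedly
| extend Username = tostring(split(UPN, "@")[0])   // string — safe for ==, has, startswith
| where Username == "j.morrison"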


extract() — regex-based extraction

What it is

extract() uses a regular expression to find and return a matching pattern from a string. It is more powerful than split() because it can match complex patterns, not just delimiter positions.

Why it matters

Syslog messages, alert descriptions, and free-text log fields contain embedded data that you cannot extract with split(). An IP address buried in a sentence like “Connection from 198.51.100.44 denied by firewall rule FW-001” requires pattern matching to pull out.

// Extract an IPv4 address from any text field
Syslog
| where TimeGenerated > ago(24h)
| extend ExtractedIP = extract(@"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", 1, SyslogMessage)
| where isnotempty(ExtractedIP)
| project TimeGenerated, ExtractedIP, SyslogMessage
| take 10

The regex pattern \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} matches any IPv4 address. The 1 in extract() means “return the first capture group” (the content inside parentheses).

You do not need to be a regex expert

You need exactly three regex patterns for 90% of security string extraction: the IPv4 pattern above, an email pattern ([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}), and a URL pattern (https?://[^\s"']+). Keep these as templates. For everything else, use parse (below) which does not require regex.
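
Kept together, the three templates look like this — a sketch against the Syslog table used earlier; any string column works in place of SyslogMessage:

// Three reusable extraction templates: IPv4, email, URL
Syslog
| where TimeGenerated > ago(24h)
| extend ExtractedIP    = extract(@"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", 1, SyslogMessage)
| extend ExtractedEmail = extract(@"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})", 1, SyslogMessage)
| extend ExtractedURL   = extract(@"(https?://[^\s""']+)", 1, SyslogMessage)
| where isnotempty(ExtractedIP) or isnotempty(ExtractedEmail) or isnotempty(ExtractedURL)
| project TimeGenerated, ExtractedIP, ExtractedEmail, ExtractedURL

Each extract() returns an empty string when the pattern does not match, which is why the isnotempty() filter follows the extraction rather than preceding it.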


parse — pattern-based extraction without regex

What it is

parse extracts values from a string using a text pattern with named placeholders. It is simpler than extract() because you define the pattern using the actual text structure, not regular expressions.

Why it matters

Many log messages follow a consistent structure: “User john.smith from IP 198.51.100.44 failed authentication at 14:32.” If the structure is consistent, parse extracts the values without regex.

// Parse structured Syslog messages
Syslog
| where TimeGenerated > ago(24h)
| where SyslogMessage has "failed authentication"
| parse SyslogMessage with * "User " ParsedUser " from IP " ParsedIP " failed" *
| where isnotempty(ParsedUser)
| project TimeGenerated, ParsedUser, ParsedIP, SyslogMessage

The parse pattern uses * for “skip anything” and quoted strings for literal text. The unquoted names (ParsedUser, ParsedIP) become new columns containing the extracted values.

Use parse first, fall back to extract()

parse is easier to read, easier to write, and easier to debug than regex. Use it whenever the text has a consistent structure. Fall back to extract() only when the pattern is too variable for parse (e.g., an IP address that could appear anywhere in an arbitrary message).


Working with JSON and dynamic columns

What it is

Many Sentinel tables contain columns of type dynamic — JSON objects or arrays. LocationDetails in SigninLogs is a common example: it contains nested data like {"city": "London", "state": "England", "countryOrRegion": "GB"}. To use these values in filters or projections, you must navigate the JSON structure.

How it works

Use dot notation to navigate nested JSON, and tostring() to convert the result to a usable string:

SigninLogs
| where TimeGenerated > ago(7d)
| extend City = tostring(LocationDetails.city)
| extend State = tostring(LocationDetails.state)
| extend Country = tostring(LocationDetails.countryOrRegion)
| extend OS = tostring(DeviceDetail.operatingSystem)
| extend Browser = tostring(DeviceDetail.browser)
| project TimeGenerated, UserPrincipalName, City, Country, OS, Browser

For deeper nesting, chain dot notation: tostring(AuthenticationDetails[0].authenticationMethod) accesses the first element of an array and extracts the authenticationMethod property.
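
A standalone sketch of that array access, using a dynamic literal shaped like AuthenticationDetails — the values are hypothetical:

// Hypothetical dynamic array mirroring the AuthenticationDetails structure
print Details = dynamic([{"authenticationMethod": "Password"}, {"authenticationMethod": "Mobile app notification"}])
| extend FirstMethod = tostring(Details[0].authenticationMethod)
| extend SecondMethod = tostring(Details[1].authenticationMethod)

The same rule applies at every depth: the navigation expression returns a dynamic value, and tostring() is what makes it usable in filters and joins.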

Without tostring(), dynamic fields cause type errors

Dynamic columns look like strings in the output, but KQL treats them as a different type. where LocationDetails.city == "London" may fail or produce unexpected results. Always wrap in tostring(): where tostring(LocationDetails.city) == "London". This is the most common source of "it looks right but doesn't work" bugs in KQL.

Try it yourself

Write a query against SigninLogs that extracts the username (before the @) from UserPrincipalName, the country from LocationDetails, and the operating system from DeviceDetail. Filter to the last 7 days and show successful sign-ins only. Group by username and country, counting sign-ins per combination. This reveals which users sign in from which countries — a baseline for impossible travel detection.

SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType == "0"
| extend Username = tostring(split(UserPrincipalName, "@")[0])
| extend Country = tostring(LocationDetails.countryOrRegion)
| extend OS = tostring(DeviceDetail.operatingSystem)
| summarize SigninCount = count(), OperatingSystems = make_set(OS) by Username, Country
| order by Username asc, SigninCount desc

A user who normally signs in from “GB” showing a sign-in from “RU” is an immediate investigation trigger. This baseline query is the foundation for the impossible travel detection pattern covered in subsection 6.5.

Check your understanding

1. You need to extract the domain from an email address field. Which approach is most appropriate?

  • tostring(split(EmailAddress, "@")[1])
  • extract() with an email regex pattern
  • parse with a text pattern
  • substring() with a fixed position

Answer: tostring(split(EmailAddress, "@")[1]) — the @ symbol is a consistent delimiter in all email addresses. split() is the simplest and most readable approach. extract() with regex would also work but adds unnecessary complexity for this case.

2. Your query filtering on LocationDetails.city == "London" returns 0 results, but you can see "London" in the raw data. Why?

  • LocationDetails is a dynamic (JSON) column
  • London is spelled differently in the data
  • LocationDetails does not contain a city field
  • You need to use has instead of ==

Answer: LocationDetails is a dynamic (JSON) column. Comparing a dynamic value to a string without type conversion fails silently. Fix: tostring(LocationDetails.city) == "London". The tostring() converts the dynamic value to a string, making the comparison work.