1.5 Data Handling, Privacy, and Operational Security

2-3 hours · Module 1 · Free


Every time you paste a log entry, upload a CSV, or type an investigation note into an AI tool, data leaves your security perimeter and enters a third-party system. The data handling controls of that system — what it stores, who can access it, whether it trains on your input, and how long it retains your data — determine whether your AI usage is a productivity gain or a data breach.

This subsection establishes the data handling framework for AI-assisted security operations. The output is a data classification matrix that defines what data can be processed by which AI tools on which plans.


What happens to your data — by platform and plan tier

Each AI platform handles data differently depending on the plan tier. The differences are significant and often misunderstood.

| Platform | Plan | Input used for training? | Vendor staff access? | Data retention | Zero retention option? |
|---|---|---|---|---|---|
| Claude | Free | Yes (default) | Yes (safety review) | Retained | No |
| Claude | Pro | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| Claude | Team | No (default) | Limited (safety review only) | 90 days | No |
| Claude | Enterprise | No | No | Configurable | Yes |
| ChatGPT | Free | Yes (default) | Yes (safety review) | Retained | No |
| ChatGPT | Plus | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| ChatGPT | Team | No (default) | Limited | 90 days | No |
| ChatGPT | Enterprise | No | No | Configurable | Yes |
| Copilot for Security | Enterprise | No | Limited | Configurable | Tenant-controlled |
| Gemini | Free | Yes (default) | Yes | Retained (up to 3 years) | No |
| Gemini | Workspace | No (default) | Limited | Configurable | Yes |

The critical distinction: On Free and Pro/Plus plans, assume the vendor can see your input and may use it for model training. On Team plans, the data handling is more restrictive but vendor staff may still access data for safety reviews. On Enterprise plans, data handling is the most restrictive with contractual guarantees.

What this means for security operations: If you are processing sign-in logs, alert details, investigation evidence, or incident data, the plan tier determines whether that data is exposed to the vendor. On Free and Pro plans: sanitize all data before uploading. On Team plans: sanitization is still recommended as a defense-in-depth measure. On Enterprise plans with zero retention: the risk is lowest, but organizational policy should still govern what data types are processed.


The data classification matrix for AI tools

Not all security data carries the same sensitivity. The matrix below classifies common security data types and maps them to the minimum AI platform tier required for processing.

| Data Type | Sensitivity | Minimum Tier | Sanitization Required? | Examples |
|---|---|---|---|---|
| Public threat intelligence | Low | Any | No | Published CVE details, MITRE ATT&CK techniques, vendor documentation |
| Generic security queries (no org data) | Low | Any | No | “Write a KQL query for brute force detection” with no tenant-specific details |
| Anonymized log data | Medium | Team or above | Yes — replace all identifiers | Sign-in logs with usernames, IPs, and domains replaced with fictional values |
| Raw production log data | High | Enterprise (ZDR preferred) | No — processing raw data requires the platform to handle it securely | Sign-in logs, email events, device events with real identifiers |
| Incident investigation evidence | High | Enterprise (ZDR preferred) | Context-dependent — sanitize if possible | Alert details, investigation timelines, containment decisions |
| PII (employee or customer) | Very High | Enterprise with ZDR only | Strongly recommended even on ZDR | Names, addresses, national IDs, financial details appearing in logs |
| Legally privileged material | Restricted | Do not process in external AI | N/A | Attorney-client communications, legal hold documents, regulatory correspondence |
| Classified material | Restricted | Do not process in external AI | N/A | Government classified data at any level |

Rule of thumb: If the data would cause organizational damage if exposed to a third party, it requires Enterprise tier or sanitization before processing. If the data would cause legal consequences if exposed, do not process it in an external AI tool regardless of tier.
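The matrix above can be encoded as a simple lookup so that tier decisions are mechanical rather than ad hoc. A minimal sketch — the tier ordering, labels, and data-type keys are illustrative assumptions, not an official schema:

```python
# Sketch of the classification matrix as a programmatic tier check.
# Tier ordering and data-type labels are illustrative assumptions.

TIER_ORDER = {"Any": 0, "Team": 1, "Enterprise": 2, "Enterprise-ZDR": 3}

# Minimum tier per data type, following the matrix above.
# None means "do not process in external AI at any tier".
MIN_TIER = {
    "public_threat_intel": "Any",
    "generic_query": "Any",
    "anonymized_logs": "Team",
    "raw_production_logs": "Enterprise",
    "investigation_evidence": "Enterprise",
    "pii": "Enterprise-ZDR",
    "privileged_material": None,
    "classified_material": None,
}

def allowed(data_type: str, platform_tier: str) -> bool:
    """Return True if this data type may be processed on the given tier."""
    required = MIN_TIER[data_type]
    if required is None:
        return False  # never process in external AI, regardless of tier
    return TIER_ORDER[platform_tier] >= TIER_ORDER[required]

print(allowed("anonymized_logs", "Team"))               # sanitized logs on Team
print(allowed("raw_production_logs", "Team"))           # raw logs on Team: blocked
print(allowed("privileged_material", "Enterprise-ZDR")) # never allowed
```

Encoding the matrix this way also makes it easy to embed the same policy in intake tooling or a pre-upload check, rather than relying on analysts to remember the table.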


Sanitization methodology

When processing operational data on non-Enterprise plans, sanitize before uploading. The goal is to preserve analytical value while removing identifying information.

Replace, do not redact. Blurred or redacted data breaks analytical context. Replacement with consistent fictional values preserves relationships between data points.

| Data element | Sanitization method | Example |
|---|---|---|
| Usernames / UPNs | Replace with fictional names from a consistent list | j.morrison@northgateeng.com, s.chen@northgateeng.com |
| IP addresses | Replace with RFC 5737 documentation ranges | 192.0.2.x (internal), 198.51.100.x (external), 203.0.113.x (attacker) |
| Domain names | Replace with fictional domain | northgateeng.com |
| Device names | Replace with generic pattern | DESKTOP-NGE001, LAPTOP-NGE042 |
| Tenant identifiers | Replace or remove | GUIDs, tenant domain names |
| Timestamps | Shift by a consistent offset or keep as-is | Timestamps are usually not identifying on their own |

Consistency is critical. The same real user must map to the same fictional user across all data you upload. If j.morrison@northgateeng.com appears in both sign-in logs and email events, the AI can correlate across data sources — just as it would with real data.

What to preserve: The analytical structure that makes the data useful. Time sequencing (event order), relationship patterns (same user across multiple events), behavioral patterns (unusual times, unusual locations), and field values that indicate anomalies (error codes, risk levels, authentication methods).
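The replacement approach can be sketched as a small consistency-preserving sanitizer. This is a simplified illustration, not a production tool: the fictional name pool, the regex patterns, and the one-hour timestamp offset are all assumptions, and real logs contain more identifier types than this handles.

```python
import re
from datetime import datetime, timedelta

# Simplified sanitizer sketch: replace identifiers with consistent fictional
# values so relationships between events survive. Name pool, regexes, and
# the timestamp offset are illustrative assumptions.

FICTIONAL_USERS = ["j.morrison@northgateeng.com", "s.chen@northgateeng.com",
                   "t.okafor@northgateeng.com"]
user_map: dict[str, str] = {}   # real UPN -> fictional UPN (consistent mapping)
ip_map: dict[str, str] = {}     # real IP  -> RFC 5737 documentation address

def sanitize_user(real_upn: str) -> str:
    # The same real user always maps to the same fictional user.
    if real_upn not in user_map:
        user_map[real_upn] = FICTIONAL_USERS[len(user_map) % len(FICTIONAL_USERS)]
    return user_map[real_upn]

def sanitize_ip(real_ip: str) -> str:
    # Allocate sequential hosts from the RFC 5737 TEST-NET-1 range.
    if real_ip not in ip_map:
        ip_map[real_ip] = f"192.0.2.{len(ip_map) + 1}"
    return ip_map[real_ip]

def sanitize_line(line: str, offset: timedelta = timedelta(hours=1)) -> str:
    # Replace IPv4 addresses and email-style UPNs, then shift ISO timestamps
    # by a consistent offset so event ordering is preserved.
    line = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
                  lambda m: sanitize_ip(m.group()), line)
    line = re.sub(r"\b[\w.]+@[\w.-]+\b",
                  lambda m: sanitize_user(m.group()), line)
    def shift(m):
        return (datetime.fromisoformat(m.group()) + offset).isoformat()
    return re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", shift, line)

print(sanitize_line("2024-05-01T09:15:00 alice@contoso.com sign-in from 10.1.2.3"))
```

Because the maps persist across calls, the same real user or IP produces the same fictional value in every uploaded file — which is exactly the consistency property described above.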


The shadow AI problem

Shadow AI is the AI equivalent of shadow IT: employees using AI tools without organizational knowledge, approval, or governance. For security teams, shadow AI is both a threat to defend against and a behavior to govern within the team itself.

The organizational risk: An employee pastes customer PII into a free-tier ChatGPT account to “quickly analyze” a customer complaint. That data is now stored on OpenAI’s infrastructure, potentially used for model training, and accessible to OpenAI staff for safety reviews. Under GDPR, this may constitute unauthorized processing of personal data — a reportable incident.

The security team risk: An analyst pastes production sign-in logs into a personal Claude account to speed up an investigation. The investigation data — real usernames, real IPs, real timestamps — is now on a third-party platform without organizational awareness. If the investigation relates to an insider threat case, the evidence chain may be compromised.

Detection approaches:

Network-level: Monitor DNS queries and web traffic for known AI service domains (claude.ai, chat.openai.com, gemini.google.com, copilot.microsoft.com). Use your web proxy or CASB to identify users accessing these services. If your organization has not approved these services, any access is shadow AI.

// Shadow AI detection from web proxy or firewall logs
// Adapt table name and field names to your environment
WebProxyLogs
| where TimeGenerated > ago(7d)
| where DestinationDomain in (
    "claude.ai",
    "api.anthropic.com",
    "chat.openai.com",
    "api.openai.com",
    "gemini.google.com",
    "copilot.microsoft.com",
    "copilot.cloud.microsoft"
)
| summarize
    SessionCount = count(),               // Total requests to AI services
    UniqueUsers = dcount(UserPrincipalName), // How many people
    DataSentBytes = sum(BytesSent)        // Volume of data uploaded
    by DestinationDomain
| sort by SessionCount desc

Endpoint-level: Use Defender for Endpoint device timeline or process events to detect AI application usage (desktop apps, browser extensions).

Governance response: Detection alone is not sufficient. The governance response includes: establishing which AI tools are approved (Module 7), communicating the approved tools and acceptable use policy to all staff, providing approved AI access to employees who need it (reducing the motivation for shadow AI), and monitoring for unauthorized usage with the queries above deployed as analytics rules.

Try it yourself

Build your data classification matrix. List the 10 most common data types your team processes in security operations (sign-in logs, email events, device telemetry, incident notes, etc.). For each, assign a sensitivity level and the minimum AI platform tier required. Then identify: which of these data types is your team currently processing in AI tools? Is the platform tier appropriate for the sensitivity level?

If any data type is being processed on a tier below what your matrix requires, that is a governance gap that needs immediate attention.

Most teams discover 2-3 governance gaps during this exercise. Common findings: analysts using personal Pro accounts to process data that should require Team or Enterprise tier. Alert summaries being generated on Free accounts because analysts find it faster than the approved tool. Incident notes containing real usernames being processed on plans where the vendor trains on input. Each gap is an action item for your AI governance framework (Module 7).
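The gap check in this exercise can be made mechanical once the matrix exists. A sketch, where both the matrix entries and the usage inventory are hypothetical examples you would replace with your own:

```python
# Sketch of the governance-gap audit from the exercise above.
# Matrix entries and the usage inventory are hypothetical examples.

TIER_ORDER = {"Free": 0, "Pro": 1, "Team": 2, "Enterprise": 3}

# Minimum tier your matrix requires per data type (illustrative).
matrix = {
    "alert_summaries": "Team",
    "incident_notes": "Enterprise",
    "signin_logs_raw": "Enterprise",
    "generic_queries": "Free",
}

# What the team is actually doing today (illustrative inventory).
current_usage = [
    ("alert_summaries", "Free"),        # gap: below required tier
    ("incident_notes", "Pro"),          # gap: below required tier
    ("generic_queries", "Free"),        # compliant
    ("signin_logs_raw", "Enterprise"),  # compliant
]

def find_gaps(matrix, usage):
    """Return (data_type, current_tier, required_tier) for each violation."""
    return [(dt, tier, matrix[dt])
            for dt, tier in usage
            if TIER_ORDER[tier] < TIER_ORDER[matrix[dt]]]

for dt, have, need in find_gaps(matrix, current_usage):
    print(f"GAP: {dt} processed on {have}, requires {need}")
```

Each tuple the audit returns is an action item for the governance framework: move the workload to an approved tier, or stop processing that data type in AI tools.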

Check your understanding

1. An analyst on your team uploads a CSV containing 500 rows of raw SigninLogs data — including real UPNs, IP addresses, and timestamps — to a Claude Pro account for investigation analysis. What governance concerns does this raise?

Answer: Multiple concerns. First: on Claude Pro, input data may be used for model training (unless the user has opted out) and is accessible to Anthropic staff for safety reviews. Raw sign-in logs contain PII (usernames, IP addresses) that may be subject to GDPR or other privacy regulations. Second: the data classification matrix likely places raw production logs at "High" sensitivity, requiring Enterprise tier or sanitization before processing. Third: if the organization has not approved Claude Pro for processing operational data, this constitutes shadow AI usage. Immediate actions: verify the analyst's opt-out status, assess whether the data triggers a reportable privacy incident, and establish clear policy for AI tool usage in investigations.

Incorrect options: "No concerns — sign-in logs are not sensitive" and "Ask the analyst to delete the conversation."

2. You need to analyze sign-in logs for a suspected compromise, but your organization only has Claude Team accounts (not Enterprise). The logs contain real UPNs and IP addresses. What is the correct approach?

Answer: Sanitize the data before uploading. Replace real UPNs with fictional names from your standard list (j.morrison@northgateeng.com, s.chen@northgateeng.com), replace IPs with RFC 5737 documentation ranges, and replace the tenant domain. Use consistent replacements so the AI can correlate across data points. Claude Team does not train on your data by default, but sanitization provides defense in depth — and ensures compliance regardless of vendor policy changes. The analytical value is preserved because the relationships between events (same user, same IP, same time window) remain intact.

Incorrect options: "Upload the raw data — Team accounts do not train on input" and "Do not use AI for this investigation."
