1.5 Data Handling, Privacy, and Operational Security
Every time you paste a log entry, upload a CSV, or type an investigation note into an AI tool, data leaves your security perimeter and enters a third-party system. The data handling controls of that system — what it stores, who can access it, whether it trains on your input, and how long it retains your data — determine whether your AI usage is a productivity gain or a data breach.
This subsection establishes the data handling framework for AI-assisted security operations. The output is a data classification matrix that defines what data can be processed by which AI tools on which plans.
What happens to your data — by platform and plan tier
Each AI platform handles data differently depending on the plan tier. The differences are significant and often misunderstood.
| Platform | Plan | Input used for training? | Vendor staff access? | Data retention | Zero retention option? |
|---|---|---|---|---|---|
| Claude | Free | Yes (default) | Yes (safety review) | Retained | No |
| Claude | Pro | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| Claude | Team | No (default) | Limited (safety review only) | 90 days | No |
| Claude | Enterprise | No | No | Configurable | Yes |
| ChatGPT | Free | Yes (default) | Yes (safety review) | Retained | No |
| ChatGPT | Plus | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| ChatGPT | Team | No (default) | Limited | 90 days | No |
| ChatGPT | Enterprise | No | No | Configurable | Yes |
| Copilot for Security | Enterprise | No | Limited | Configurable | Tenant-controlled |
| Gemini | Free | Yes (default) | Yes | Retained (up to 3 years) | No |
| Gemini | Workspace | No (default) | Limited | Configurable | Yes |
The critical distinction: On Free and Pro/Plus plans, assume the vendor can see your input and may use it for model training. On Team plans, the data handling is more restrictive but vendor staff may still access data for safety reviews. On Enterprise plans, data handling is the most restrictive with contractual guarantees.
What this means for security operations: If you are processing sign-in logs, alert details, investigation evidence, or incident data, the plan tier determines whether that data is exposed to the vendor. On Free and Pro plans: sanitize all data before uploading. On Team plans: sanitization is still recommended as a defense-in-depth measure. On Enterprise plans with zero retention: the risk is lowest, but organizational policy should still govern what data types are processed.
The data classification matrix for AI tools
Not all security data carries the same sensitivity. The matrix below classifies common security data types and maps them to the minimum AI platform tier required for processing.
| Data Type | Sensitivity | Minimum Tier | Sanitization Required? | Examples |
|---|---|---|---|---|
| Public threat intelligence | Low | Any | No | Published CVE details, MITRE ATT&CK techniques, vendor documentation |
| Generic security queries (no org data) | Low | Any | No | “Write a KQL query for brute force detection” with no tenant-specific details |
| Anonymized log data | Medium | Team or above | Yes — replace all identifiers | Sign-in logs with usernames, IPs, and domains replaced with fictional values |
| Raw production log data | High | Enterprise (ZDR preferred) | No — the Enterprise tier's contractual controls take the place of sanitization | Sign-in logs, email events, device events with real identifiers |
| Incident investigation evidence | High | Enterprise (ZDR preferred) | Context-dependent — sanitize if possible | Alert details, investigation timelines, containment decisions |
| PII (employee or customer) | Very High | Enterprise with ZDR only | Strongly recommended even on ZDR | Names, addresses, national IDs, financial details appearing in logs |
| Legally privileged material | Restricted | Do not process in external AI | N/A | Attorney-client communications, legal hold documents, regulatory correspondence |
| Classified material | Restricted | Do not process in external AI | N/A | Government classified data at any level |
Rule of thumb: If the data would cause organizational damage if exposed to a third party, it requires Enterprise tier or sanitization before processing. If the data would cause legal consequences if exposed, do not process it in an external AI tool regardless of tier.
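The classification matrix and rule of thumb above can be encoded as a pre-flight check that runs before any data leaves the organization. This is a minimal sketch: the data-type keys, tier names, and rankings below are illustrative assumptions chosen to mirror the matrix, not part of any vendor tool or official policy.

```python
# Pre-flight check: map a data type to the minimum AI plan tier the
# classification matrix allows. All names below are illustrative.

MIN_TIER = {
    "public_threat_intel": "any",
    "generic_query": "any",
    "anonymized_logs": "team",
    "raw_production_logs": "enterprise",
    "incident_evidence": "enterprise",
    "pii": "enterprise_zdr",
    "privileged_material": "never",   # attorney-client, legal hold
    "classified": "never",            # no external AI at any tier
}

# Tier ordering: higher rank = more restrictive plan.
TIER_RANK = {"any": 0, "team": 1, "enterprise": 2, "enterprise_zdr": 3}

def allowed(data_type: str, plan_tier: str) -> bool:
    """Return True if this data type may be processed on the given plan tier."""
    required = MIN_TIER[data_type]
    if required == "never":
        return False  # restricted material: never process in external AI
    return TIER_RANK[plan_tier] >= TIER_RANK[required]

print(allowed("anonymized_logs", "team"))            # True
print(allowed("raw_production_logs", "team"))        # False
print(allowed("classified", "enterprise_zdr"))       # False
```

A check like this is most useful when embedded in the approved workflow (for example, a helper script analysts run before uploading), so the matrix is enforced rather than merely documented.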
Sanitization methodology
When processing operational data on non-Enterprise plans, sanitize before uploading. The goal is to preserve analytical value while removing identifying information.
Replace, do not redact. Blurred or redacted data breaks analytical context. Replacement with consistent fictional values preserves relationships between data points.
| Data element | Sanitization method | Example |
|---|---|---|
| Usernames / UPNs | Replace with fictional names from a consistent list | j.morrison@northgateeng.com, s.chen@northgateeng.com |
| IP addresses | Replace with RFC 5737 documentation ranges | 192.0.2.x (internal), 198.51.100.x (external), 203.0.113.x (attacker) |
| Domain names | Replace with fictional domain | northgateeng.com |
| Device names | Replace with generic pattern | DESKTOP-NGE001, LAPTOP-NGE042 |
| Tenant identifiers | Replace or remove | GUIDs, tenant domain names |
| Timestamps | Shift by a consistent offset or keep as-is | Timestamps are usually not identifying on their own |
Consistency is critical. The same real user must map to the same fictional user across all data you upload. If j.morrison@northgateeng.com appears in both sign-in logs and email events, the AI can correlate across data sources — just as it would with real data.
What to preserve: The analytical structure that makes the data useful. Time sequencing (event order), relationship patterns (same user across multiple events), behavioral patterns (unusual times, unusual locations), and field values that indicate anomalies (error codes, risk levels, authentication methods).
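The consistency requirement above is exactly what a small replacement table gives you: assign each real identifier a fictional value on first sight and reuse it everywhere after. The sketch below is a hypothetical illustration, assuming simple regex matching and the fictional domain and RFC 5737 ranges from the table; a production sanitizer would need to handle more identifier types (device names, GUIDs, tenant domains).

```python
# Consistent-replacement sanitizer sketch: the same real identifier always
# maps to the same fictional value, so the AI can still correlate a user
# across sign-in logs and email events. Illustrative, not production code.
import re

class Sanitizer:
    def __init__(self):
        self.users: dict = {}  # real UPN -> fictional UPN
        self.ips: dict = {}    # real IP  -> RFC 5737 documentation IP

    def _map_user(self, upn: str) -> str:
        # Fictional users assigned in order of first appearance.
        if upn not in self.users:
            self.users[upn] = f"user{len(self.users) + 1:03d}@northgateeng.com"
        return self.users[upn]

    def _map_ip(self, ip: str) -> str:
        # RFC 5737 documentation range 192.0.2.0/24, last octet sequential.
        if ip not in self.ips:
            self.ips[ip] = f"192.0.2.{len(self.ips) + 1}"
        return self.ips[ip]

    def sanitize(self, text: str) -> str:
        text = re.sub(r"[\w.+-]+@[\w.-]+\.\w+",
                      lambda m: self._map_user(m.group()), text)
        text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
                      lambda m: self._map_ip(m.group()), text)
        return text

s = Sanitizer()
line1 = s.sanitize("j.smith@contoso.com signed in from 10.1.2.3")
line2 = s.sanitize("j.smith@contoso.com sent mail from 10.1.2.3")
print(line1)  # user001@northgateeng.com signed in from 192.0.2.1
print(line2)  # same mappings reused across both log lines
```

Because the `Sanitizer` instance holds its mapping tables for the whole session, every log line passed through it preserves cross-source relationships while exposing no real identifiers.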
The shadow AI problem
Shadow AI is the AI equivalent of shadow IT: employees using AI tools without organizational knowledge, approval, or governance. For security teams, shadow AI is both a threat to defend against and a behavior to govern within the team itself.
The organizational risk: An employee pastes customer PII into a free-tier ChatGPT account to “quickly analyze” a customer complaint. That data is now stored on OpenAI’s infrastructure, potentially used for model training, and accessible to OpenAI staff for safety reviews. Under GDPR, this may constitute unauthorized processing of personal data — a reportable incident.
The security team risk: An analyst pastes production sign-in logs into a personal Claude account to speed up an investigation. The investigation data — real usernames, real IPs, real timestamps — is now on a third-party platform without organizational awareness. If the investigation relates to an insider threat case, the evidence chain may be compromised.
Detection approaches:
Network-level: Monitor DNS queries and web traffic for known AI service domains (claude.ai, chat.openai.com, gemini.google.com, copilot.microsoft.com). Use your web proxy or CASB to identify users accessing these services. If your organization has not approved these services, any access is shadow AI.
Endpoint-level: Use Defender for Endpoint device timeline or process events to detect AI application usage (desktop apps, browser extensions).
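The network-level approach can be prototyped offline before you commit to a proxy or CASB rule. The sketch below scans exported DNS query records for the AI service domains listed above; the log record shape and the approved-user set are assumptions made for this example.

```python
# Illustrative shadow-AI triage: flag DNS queries for known AI service
# domains made by users without sanctioned AI access. The log format and
# APPROVED_USERS set are assumptions for this sketch.

AI_DOMAINS = {
    "claude.ai",
    "chat.openai.com",
    "gemini.google.com",
    "copilot.microsoft.com",
}
APPROVED_USERS = {"analyst1"}  # users entitled to an approved AI tool

def shadow_ai_hits(dns_log: list) -> list:
    hits = []
    for row in dns_log:
        # Match the exact domain or any subdomain of it.
        domain = row["query"].lower().rstrip(".")
        is_ai = any(domain == d or domain.endswith("." + d) for d in AI_DOMAINS)
        if is_ai and row["user"] not in APPROVED_USERS:
            hits.append(row)
    return hits

log = [
    {"user": "analyst1", "query": "claude.ai"},          # approved: ignored
    {"user": "intern7", "query": "chat.openai.com"},     # flagged
    {"user": "intern7", "query": "example.com"},         # not an AI domain
]
print(shadow_ai_hits(log))  # only intern7's chat.openai.com query is flagged
```

The same match logic translates directly into a proxy block rule or a SIEM analytics rule once the domain list and entitlement set are agreed.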
Governance response: Detection alone is not sufficient. The governance response includes: establishing which AI tools are approved (Module 7), communicating the approved tools and acceptable use policy to all staff, providing approved AI access to employees who need it (reducing the motivation for shadow AI), and monitoring for unauthorized usage by deploying the detection approaches above as analytics rules.
Try it yourself
Build your data classification matrix. List the 10 most common data types your team processes in security operations (sign-in logs, email events, device telemetry, incident notes, etc.). For each, assign a sensitivity level and the minimum AI platform tier required. Then identify: which of these data types is your team currently processing in AI tools? Is the platform tier appropriate for the sensitivity level?
If any data type is being processed on a tier below what your matrix requires, that is a governance gap that needs immediate attention.
Most teams discover 2-3 governance gaps during this exercise. Common findings: analysts using personal Pro accounts to process data that should require Team or Enterprise tier. Alert summaries being generated on Free accounts because analysts find it faster than the approved tool. Incident notes containing real usernames being processed on plans where the vendor trains on input. Each gap is an action item for your AI governance framework (Module 7).
Check your understanding
1. An analyst on your team uploads a CSV containing 500 rows of raw SigninLogs data — including real UPNs, IP addresses, and timestamps — to a Claude Pro account for investigation analysis. What governance concerns does this raise?
2. You need to analyze sign-in logs for a suspected compromise, but your organization only has Claude Team accounts (not Enterprise). The logs contain real UPNs and IP addresses. What is the correct approach?
You're reading the free modules of this course
The full course continues with advanced topics, production detection rules, worked investigation scenarios, and deployable artifacts. Premium subscribers get access to all courses.