1.5 Data Handling, Privacy, and Operational Security

2-3 hours · Module 1 · Free


Every time you paste a log entry, upload a CSV, or type an investigation note into an AI tool, data leaves your security perimeter and enters a third-party system. The data handling controls of that system — what it stores, who can access it, whether it trains on your input, and how long it retains your data — determine whether your AI usage is a productivity gain or a data breach.

This subsection establishes the data handling framework for AI-assisted security operations. The output is a data classification matrix that defines what data can be processed by which AI tools on which plans.


What happens to your data — by platform and plan tier

Each AI platform handles data differently depending on the plan tier. The differences are significant and often misunderstood.

| Platform | Plan | Input used for training? | Vendor staff access? | Data retention | Zero retention option? |
|---|---|---|---|---|---|
| Claude | Free | Yes (default) | Yes (safety review) | Retained | No |
| Claude | Pro | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| Claude | Team | No (default) | Limited (safety review only) | 90 days | No |
| Claude | Enterprise | No | No | Configurable | Yes |
| ChatGPT | Free | Yes (default) | Yes (safety review) | Retained | No |
| ChatGPT | Plus | Yes (default, opt-out available) | Yes (safety review) | Retained | No |
| ChatGPT | Team | No (default) | Limited | 90 days | No |
| ChatGPT | Enterprise | No | No | Configurable | Yes |
| Copilot for Security | Enterprise | No | Limited | Configurable | Tenant-controlled |
| Gemini | Free | Yes (default) | Yes | Retained (up to 3 years) | No |
| Gemini | Workspace | No (default) | Limited | Configurable | Yes |

The critical distinction: On Free and Pro/Plus plans, assume the vendor can see your input and may use it for model training. On Team plans, the data handling is more restrictive but vendor staff may still access data for safety reviews. On Enterprise plans, data handling is the most restrictive with contractual guarantees.

What this means for security operations: If you are processing sign-in logs, alert details, investigation evidence, or incident data, the plan tier determines whether that data is exposed to the vendor. On Free and Pro plans: sanitize all data before uploading. On Team plans: sanitization is still recommended as a defense-in-depth measure. On Enterprise plans with zero retention: the risk is lowest, but organizational policy should still govern what data types are processed.


The data classification matrix for AI tools

Not all security data carries the same sensitivity. The matrix below classifies common security data types and maps them to the minimum AI platform tier required for processing.

| Data Type | Sensitivity | Minimum Tier | Sanitization Required? | Examples |
|---|---|---|---|---|
| Public threat intelligence | Low | Any | No | Published CVE details, MITRE ATT&CK techniques, vendor documentation |
| Generic security queries (no org data) | Low | Any | No | “Write a KQL query for brute force detection” with no tenant-specific details |
| Anonymized log data | Medium | Team or above | Yes — replace all identifiers | Sign-in logs with usernames, IPs, and domains replaced with fictional values |
| Raw production log data | High | Enterprise (ZDR preferred) | No — processing raw data requires the platform to handle it securely | Sign-in logs, email events, device events with real identifiers |
| Incident investigation evidence | High | Enterprise (ZDR preferred) | Context-dependent — sanitize if possible | Alert details, investigation timelines, containment decisions |
| PII (employee or customer) | Very High | Enterprise with ZDR only | Strongly recommended even on ZDR | Names, addresses, national IDs, financial details appearing in logs |
| Legally privileged material | Restricted | Do not process in external AI | N/A | Attorney-client communications, legal hold documents, regulatory correspondence |
| Classified material | Restricted | Do not process in external AI | N/A | Government classified data at any level |

Rule of thumb: If the data would cause organizational damage if exposed to a third party, it requires Enterprise tier or sanitization before processing. If the data would cause legal consequences if exposed, do not process it in an external AI tool regardless of tier.
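The matrix above can be encoded as a simple lookup so that tier decisions are mechanical rather than ad hoc. A minimal sketch — the tier ordering, labels, and data-type keys are illustrative assumptions, not an official schema:

```python
# Sketch of the classification matrix as a programmatic tier check.
# Tier ordering and data-type labels are illustrative assumptions.

TIER_ORDER = {"Any": 0, "Team": 1, "Enterprise": 2, "Enterprise-ZDR": 3}

# Minimum tier per data type, following the matrix above.
# None means "do not process in external AI at any tier".
MIN_TIER = {
    "public_threat_intel": "Any",
    "generic_query": "Any",
    "anonymized_logs": "Team",
    "raw_production_logs": "Enterprise",
    "investigation_evidence": "Enterprise",
    "pii": "Enterprise-ZDR",
    "privileged_material": None,
    "classified_material": None,
}

def allowed(data_type: str, platform_tier: str) -> bool:
    """Return True if this data type may be processed on the given tier."""
    required = MIN_TIER[data_type]
    if required is None:
        return False  # never process in external AI, regardless of tier
    return TIER_ORDER[platform_tier] >= TIER_ORDER[required]

print(allowed("anonymized_logs", "Team"))               # sanitized logs on Team
print(allowed("raw_production_logs", "Team"))           # raw logs on Team: blocked
print(allowed("privileged_material", "Enterprise-ZDR")) # never allowed
```

Encoding the matrix this way also makes it easy to embed the same policy in intake tooling or a pre-upload check, rather than relying on analysts to remember the table.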


Sanitization methodology

When processing operational data on non-Enterprise plans, sanitize before uploading. The goal is to preserve analytical value while removing identifying information.

Replace, do not redact. Blurred or redacted data breaks analytical context. Replacement with consistent fictional values preserves relationships between data points.

| Data element | Sanitization method | Example |
|---|---|---|
| Usernames / UPNs | Replace with fictional names from a consistent list | j.morrison@northgateeng.com, s.chen@northgateeng.com |
| IP addresses | Replace with RFC 5737 documentation ranges | 192.0.2.x (internal), 198.51.100.x (external), 203.0.113.x (attacker) |
| Domain names | Replace with fictional domain | northgateeng.com |
| Device names | Replace with generic pattern | DESKTOP-NGE001, LAPTOP-NGE042 |
| Tenant identifiers | Replace or remove | GUIDs, tenant domain names |
| Timestamps | Shift by a consistent offset or keep as-is | Timestamps are usually not identifying on their own |

Consistency is critical. The same real user must map to the same fictional user across all data you upload. If j.morrison@northgateeng.com appears in both sign-in logs and email events, the AI can correlate across data sources — just as it would with real data.

What to preserve: The analytical structure that makes the data useful. Time sequencing (event order), relationship patterns (same user across multiple events), behavioral patterns (unusual times, unusual locations), and field values that indicate anomalies (error codes, risk levels, authentication methods).
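The replacement approach can be sketched as a small consistency-preserving sanitizer. This is a simplified illustration, not a production tool: the fictional name pool, the regex patterns, and the one-hour timestamp offset are all assumptions, and real logs contain more identifier types than this handles.

```python
import re
from datetime import datetime, timedelta

# Simplified sanitizer sketch: replace identifiers with consistent fictional
# values so relationships between events survive. Name pool, regexes, and
# the timestamp offset are illustrative assumptions.

FICTIONAL_USERS = ["j.morrison@northgateeng.com", "s.chen@northgateeng.com",
                   "t.okafor@northgateeng.com"]
user_map: dict[str, str] = {}   # real UPN -> fictional UPN (consistent mapping)
ip_map: dict[str, str] = {}     # real IP  -> RFC 5737 documentation address

def sanitize_user(real_upn: str) -> str:
    # The same real user always maps to the same fictional user.
    if real_upn not in user_map:
        user_map[real_upn] = FICTIONAL_USERS[len(user_map) % len(FICTIONAL_USERS)]
    return user_map[real_upn]

def sanitize_ip(real_ip: str) -> str:
    # Allocate sequential hosts from the RFC 5737 TEST-NET-1 range.
    if real_ip not in ip_map:
        ip_map[real_ip] = f"192.0.2.{len(ip_map) + 1}"
    return ip_map[real_ip]

def sanitize_line(line: str, offset: timedelta = timedelta(hours=1)) -> str:
    # Replace IPv4 addresses and email-style UPNs, then shift ISO timestamps
    # by a consistent offset so event ordering is preserved.
    line = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
                  lambda m: sanitize_ip(m.group()), line)
    line = re.sub(r"\b[\w.]+@[\w.-]+\b",
                  lambda m: sanitize_user(m.group()), line)
    def shift(m):
        return (datetime.fromisoformat(m.group()) + offset).isoformat()
    return re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", shift, line)

print(sanitize_line("2024-05-01T09:15:00 alice@contoso.com sign-in from 10.1.2.3"))
```

Because the maps persist across calls, the same real user or IP produces the same fictional value in every uploaded file — which is exactly the consistency property described above.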


The shadow AI problem

Shadow AI is the AI equivalent of shadow IT: employees using AI tools without organizational knowledge, approval, or governance. For security teams, shadow AI is both a threat to defend against and a behavior to govern within the team itself.

The organizational risk: An employee pastes customer PII into a free-tier ChatGPT account to “quickly analyze” a customer complaint. That data is now stored on OpenAI’s infrastructure, potentially used for model training, and accessible to OpenAI staff for safety reviews. Under GDPR, this may constitute unauthorized processing of personal data — a reportable incident.

The security team risk: An analyst pastes production sign-in logs into a personal Claude account to speed up an investigation. The investigation data — real usernames, real IPs, real timestamps — is now on a third-party platform without organizational awareness. If the investigation relates to an insider threat case, the evidence chain may be compromised.

Detection approaches:

Network-level: Monitor DNS queries and web traffic for known AI service domains (claude.ai, chat.openai.com, gemini.google.com, copilot.microsoft.com). Use your web proxy or CASB to identify users accessing these services. If your organization has not approved these services, any access is shadow AI.

// Shadow AI detection from web proxy or firewall logs
// Adapt table name and field names to your environment
WebProxyLogs
| where TimeGenerated > ago(7d)
| where DestinationDomain in (
    "claude.ai",
    "api.anthropic.com",
    "chat.openai.com",
    "api.openai.com",
    "gemini.google.com",
    "copilot.microsoft.com",
    "copilot.cloud.microsoft"
)
| summarize
    SessionCount = count(),               // Total requests to AI services
    UniqueUsers = dcount(UserPrincipalName), // How many people
    DataSentBytes = sum(BytesSent)        // Volume of data uploaded
    by DestinationDomain
| sort by SessionCount desc

Endpoint-level: Use Defender for Endpoint device timeline or process events to detect AI application usage (desktop apps, browser extensions).

Governance response: Detection alone is not sufficient. The governance response includes: establishing which AI tools are approved (Module 7), communicating the approved tools and acceptable use policy to all staff, providing approved AI access to employees who need it (reducing the motivation for shadow AI), and monitoring for unauthorized usage with the queries above deployed as analytics rules.

Try it yourself

Build your data classification matrix. List the 10 most common data types your team processes in security operations (sign-in logs, email events, device telemetry, incident notes, etc.). For each, assign a sensitivity level and the minimum AI platform tier required. Then identify: which of these data types is your team currently processing in AI tools? Is the platform tier appropriate for the sensitivity level?

If any data type is being processed on a tier below what your matrix requires, that is a governance gap that needs immediate attention.

Most teams discover 2-3 governance gaps during this exercise. Common findings: analysts using personal Pro accounts to process data that should require Team or Enterprise tier. Alert summaries being generated on Free accounts because analysts find it faster than the approved tool. Incident notes containing real usernames being processed on plans where the vendor trains on input. Each gap is an action item for your AI governance framework (Module 7).
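The gap check in this exercise can be made mechanical once the matrix exists. A sketch, where both the matrix entries and the usage inventory are hypothetical examples you would replace with your own:

```python
# Sketch of the governance-gap audit from the exercise above.
# Matrix entries and the usage inventory are hypothetical examples.

TIER_ORDER = {"Free": 0, "Pro": 1, "Team": 2, "Enterprise": 3}

# Minimum tier your matrix requires per data type (illustrative).
matrix = {
    "alert_summaries": "Team",
    "incident_notes": "Enterprise",
    "signin_logs_raw": "Enterprise",
    "generic_queries": "Free",
}

# What the team is actually doing today (illustrative inventory).
current_usage = [
    ("alert_summaries", "Free"),        # gap: below required tier
    ("incident_notes", "Pro"),          # gap: below required tier
    ("generic_queries", "Free"),        # compliant
    ("signin_logs_raw", "Enterprise"),  # compliant
]

def find_gaps(matrix, usage):
    """Return (data_type, current_tier, required_tier) for each violation."""
    return [(dt, tier, matrix[dt])
            for dt, tier in usage
            if TIER_ORDER[tier] < TIER_ORDER[matrix[dt]]]

for dt, have, need in find_gaps(matrix, current_usage):
    print(f"GAP: {dt} processed on {have}, requires {need}")
```

Each tuple the audit returns is an action item for the governance framework: move the workload to an approved tier, or stop processing that data type in AI tools.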

Check your understanding

1. An analyst on your team uploads a CSV containing 500 rows of raw SigninLogs data — including real UPNs, IP addresses, and timestamps — to a Claude Pro account for investigation analysis. What governance concerns does this raise?

Answer: Multiple concerns. First: on Claude Pro, input data may be used for model training (unless the user has opted out) and is accessible to Anthropic staff for safety reviews. Raw sign-in logs contain PII (usernames, IP addresses) that may be subject to GDPR or other privacy regulations. Second: the data classification matrix likely places raw production logs at "High" sensitivity, requiring Enterprise tier or sanitization before processing. Third: if the organization has not approved Claude Pro for processing operational data, this constitutes shadow AI usage. Immediate actions: verify the analyst's opt-out status, assess whether the data triggers a reportable privacy incident, and establish clear policy for AI tool usage in investigations.

Incorrect options: "No concerns — sign-in logs are not sensitive" and "Ask the analyst to delete the conversation."

2. You need to analyze sign-in logs for a suspected compromise, but your organization only has Claude Team accounts (not Enterprise). The logs contain real UPNs and IP addresses. What is the correct approach?

Answer: Sanitize the data before uploading. Replace real UPNs with fictional names from your standard list (j.morrison@northgateeng.com, s.chen@northgateeng.com), replace IPs with RFC 5737 documentation ranges, and replace the tenant domain. Use consistent replacements so the AI can correlate across data points. Claude Team does not train on your data by default, but sanitization provides defense in depth — and ensures compliance regardless of vendor policy changes. The analytical value is preserved because the relationships between events (same user, same IP, same time window) remain intact.

Incorrect options: "Upload the raw data — Team accounts do not train on input" and "Do not use AI for this investigation."
