How Do AI Systems Process Personal Data?
AI systems process personal data in two distinct phases — a training phase, where the model learns statistical patterns from datasets that may include personal information, and an inference phase, where the deployed model generates outputs based on new inputs that may also contain personal data. Both phases constitute data processing under GDPR Article 4(2), both require a valid legal basis, and both trigger the full obligations of applicable privacy law.
The consequence: an AI system is not a tool that sits outside the data protection framework until it does something obviously risky. Every AI system that touches personal data at any point in its operation is a data processing activity — from the moment training data is assembled through every query a user submits in production.
Why AI Systems Are Different From Traditional Software
Traditional software processes personal data in predictable, defined ways: a CRM stores a name and email; a payment processor transmits card data; an analytics tool counts page views. The data-in, data-out relationship is auditable and bounded.
AI systems process personal data differently on three dimensions that create novel compliance challenges:
1. Data is absorbed into model weights, not stored as retrievable records. When an AI model trains on a dataset containing personal information — names, behavioral patterns, health indicators, purchase history — that information is encoded as statistical patterns across the model's parameters. It is not stored in a database row that can be located and deleted. It is diffused through millions or billions of numerical weights. This creates the erasure problem: when a data subject exercises a right to deletion, there is no record to remove — only a model to retrain.
2. AI models can memorize and reproduce training data. Research has repeatedly demonstrated that large language models can reproduce verbatim text from their training datasets, including personal information. A study of GPT-2 presented in 2021 found that the model could be prompted to reproduce names, phone numbers, email addresses, and other personal data from its training corpus. The EDPB's Opinion 28/2024 explicitly addresses this: for an AI model to be considered anonymous, the likelihood of direct extraction of personal data and the likelihood of obtaining personal data through queries must both be assessed as insignificant. The EDPB sets a high threshold, and most currently deployed models do not meet it.
3. Inference creates new personal data. An AI system making a hiring recommendation, a credit decision, or a medical diagnosis is not just processing the input data — it is generating a new data point about an individual that has legal significance. That output — a score, a recommendation, a classification — is itself personal data, and the individual has rights in relation to it under GDPR Article 22.
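The first point can be illustrated with a deliberately tiny model. The sketch below assumes a one-parameter least-squares fit, which is far simpler than any production system; the function name `fit_slope` and the records are hypothetical. The fitted weight is a single number shaped by every record, so there is no row inside it to delete: honoring an erasure request means refitting without that person's data.

```python
def fit_slope(records):
    """Closed-form least-squares slope for y = w * x (no intercept)."""
    numerator = sum(x * y for x, y in records)
    denominator = sum(x * x for x, _ in records)
    return numerator / denominator


# Hypothetical training rows: (feature, label), one row per person.
records = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 12.0)]

w_all = fit_slope(records)

# Erasure request from the last person: the weight holds no separate copy of
# their row, so the only way to remove their influence is to refit.
w_without_last = fit_slope(records[:-1])

print(f"weight with all records:      {w_all:.3f}")
print(f"weight after refit (erasure): {w_without_last:.3f}")
```

In a model with billions of parameters the same dependence exists, but a refit is a full retraining run rather than a one-line calculation.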
As Tiffany Li, Associate Professor of Law at the University of San Francisco School of Law, stated in MIT Technology Review: "Even if someone finds out their data was used in a training dataset and exercises their right to deletion, technically the law is unclear about what that means. If the organization only deletes data from the training datasets — but does not delete or retrain the already trained model — then the harm will nonetheless be done."
Phase 1: Data Collection and Training
What Happens Technically
The training phase begins with data assembly. A training dataset is built from sources that may include: web-scraped text and images, licensed third-party datasets, internal enterprise records, user-generated content, sensor data, transaction logs, and API feeds. The model is then exposed to this data iteratively, adjusting its internal parameters to minimize prediction error across the training examples.
For most commercially deployed AI systems — recommendation engines, language models, classification systems — this training corpus contains personal data. Web-scraped text includes names, email addresses, and biographical details. Internal records contain customer histories and employee data. User-generated content includes self-disclosed personal information. The presence of personal data in training data is the rule, not the exception.
CNIL, France's data protection authority, distinguishes two phases clearly: the learning phase (creating and training the AI system, resulting in a model that encodes the system's learnings from training data) and the production phase (actual deployment and use of the developed model). These phases have different objectives and, from a data protection perspective, should be treated as distinct processing activities — each requiring its own documented legal basis and purpose.
The Legal Basis Question for Training Data
GDPR Article 6 requires a valid legal basis for every processing activity. For AI training on personal data, four bases are relevant:
Consent — the data subject has explicitly agreed to their data being used to train an AI model. In practice, consent is rarely viable as the sole basis for training at scale: collecting specific, informed consent from every individual whose data appears in a training corpus is operationally impossible for large datasets, and consent must be as easy to withdraw as to give.
Legitimate interest — the controller's interest in developing the AI system is genuine, necessary, and does not override the data subjects' rights. This is the most commonly invoked basis for AI training. The EDPB's Opinion 28/2024 confirmed that legitimate interest is available as a legal basis for AI model development — but requires a genuine three-part assessment: the interest must be legitimate, the processing must be necessary for that interest, and the controller's interests must not be overridden by the data subjects' rights and expectations.
Critically, the EDPB found that if the training data was unlawfully processed — collected without legal basis — that unlawfulness taints the trained model itself. A model built on unlawfully processed data cannot be legitimized retroactively by finding a valid basis for the deployment phase. The unlawfulness follows the data.
Contract — processing is necessary to perform a contract with the data subject. This basis applies narrowly: a system trained specifically to deliver services to a customer whose data it trains on may qualify. It does not apply to general-purpose model development.
Legal obligation — processing is required to comply with a legal requirement. This applies in narrow regulatory contexts (fraud detection obligations, AML screening) rather than general AI development.
The Digital Omnibus proposal of November 2025 included an explicit GDPR amendment recognizing legitimate interest as a valid basis for AI training, with safeguards. This addresses a point that had previously been a source of regulatory uncertainty. The proposal remains subject to European Parliament and Council approval (expected mid-2026); if adopted, it would reduce legal risk for organizations relying on legitimate interest for training data processing.
Special Category Data in AI Training
GDPR Article 9 prohibits processing of special category data — health information, racial or ethnic origin, political opinions, religious beliefs, sexual orientation, biometric data, genetic data — unless a specific Article 9(2) exception applies.
AI training datasets frequently contain special category data implicitly: a photograph of a person contains biometric data; a text corpus scraped from health forums contains medical information; behavioral patterns can reveal political affiliation or sexual orientation. Organizations cannot simply declare that their training data is "general purpose" and assume Article 9 does not apply — the EDPB has been explicit that incidental inclusion of special category data still triggers Article 9 obligations.
The Digital Omnibus proposal would add one specific exception: providers and deployers of AI systems may process special category personal data to detect and correct bias, subject to defined safeguards. This is a targeted exemption, not a general license.
Phase 2: Inference — Personal Data in Production
What Happens Technically
When a deployed AI system receives a user input — a query, a document, a data record — it processes that input through its trained parameters to generate an output. This is inference. The input is preprocessed into a numerical representation the model can operate on; the model applies its learned weights to that representation; and an output — a prediction, a classification, a generated response, a recommendation — is produced.
At each step of this process, personal data may be present:
- In the input: a user submits their name and symptoms to a health AI; an HR system feeds a candidate's CV to a screening model; a customer submits a support query containing their account details
- In intermediate representations: the input is encoded into embeddings — numerical vectors that represent semantic meaning. Research published in 2020 demonstrated that embedding models leak significant personal information: attacks on popular sentence embedding models recovered 50–70% of input words (F1 scores of 0.5–0.7), including sensitive attributes such as authorship and potentially health-related information
- In the output: the model's prediction may itself be personal data — a credit score, a risk classification, a recommended action
Key term: Embedding — a numerical vector representation of input data (text, image, or other content) that captures semantic meaning in a format AI models can process. Where an embedding can be linked to an identified or identifiable individual — directly or because it encodes information that enables inference about a person — it constitutes personal data under GDPR Article 4(1) and must be governed accordingly.
Inference Logging and Data Minimization
Production AI systems typically log inference inputs and outputs for monitoring, debugging, and quality improvement. Each log entry may contain personal data. Governance requirements for inference logs:
- Purpose limitation: logs collected for technical monitoring cannot be repurposed for model retraining without a separately assessed legal basis
- Retention minimization: inference logs should be treated as temporary; the CNIL guidelines and DigitalOcean's 2025 AI privacy analysis both note that prompts should not be retained longer than required for immediate operational purposes
- Access controls: inference logs containing personal data must be subject to the same access controls as any personal data store — not left as unsecured flat files in an engineering team's logging infrastructure
- Third-party processing: when inference runs on a third-party API (an external LLM, a cloud-hosted model), the input data is transmitted to a third party. That third party becomes a data processor under GDPR, requiring a Data Processing Agreement before any personal data is transmitted
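As a rough sketch of the purpose-limitation and retention points above, the snippet below tags each log entry with the purpose it was collected for and purges entries that have outlived a per-purpose retention period. The `LogEntry` type, the `RETENTION` values, and the function names are illustrative assumptions, not taken from CNIL guidance or any particular logging product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LogEntry:
    created_at: datetime
    purpose: str   # purpose the entry was collected for
    payload: str   # prompt or output text; may contain personal data


# Hypothetical retention periods; real values come from your retention schedule.
RETENTION = {"monitoring": timedelta(days=30), "debugging": timedelta(days=7)}


def purge_expired(entries, now):
    """Keep only entries still inside the retention period for their purpose.

    Entries with an unknown purpose get a zero-day period and are dropped.
    """
    return [
        e for e in entries
        if now - e.created_at <= RETENTION.get(e.purpose, timedelta(0))
    ]


def entries_for(entries, requested_purpose):
    """Purpose limitation: release only entries collected for this purpose."""
    return [e for e in entries if e.purpose == requested_purpose]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
logs = [
    LogEntry(now - timedelta(days=2), "debugging", "prompt A"),
    LogEntry(now - timedelta(days=10), "debugging", "prompt B"),
    LogEntry(now - timedelta(days=10), "monitoring", "prompt C"),
]
kept = purge_expired(logs, now)
retraining_view = entries_for(kept, "retraining")  # nothing was collected for this
```

A request for logs under a purpose that was never assessed, such as retraining, returns nothing by construction, which mirrors the requirement for a separately assessed legal basis.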
Privacy Attack Vectors: Where Personal Data Escapes AI Systems
Understanding how AI systems process personal data requires understanding the attack surfaces through which that data can be extracted or inferred — because regulators now expect organizations to assess these risks proactively.
Membership inference attacks — an attacker queries the model with data about a specific individual to determine whether that individual's data was used in training. Research consistently shows that models trained without differential privacy protections are vulnerable to membership inference at rates significantly above random chance. The EDPB has incorporated this attack vector into its anonymization assessment framework.
Model inversion attacks — an attacker reconstructs training data from model outputs and gradients. In a landmark 2015 study, researchers demonstrated that model inversion could reconstruct recognizable facial images from a face recognition system's confidence scores. In medical AI contexts, inversion attacks have been used to reconstruct approximate patient records from diagnostic models.
Data extraction from generative models — large language models can reproduce verbatim training data, including personal information, when prompted in specific ways. A 2023 Google DeepMind study extracted over 10,000 examples of memorized training data from ChatGPT, including names, contact details, and passages from private documents.
These attack vectors require specific technical and organizational measures: differential privacy during training, output filtering for personal data patterns, prompt injection detection, and rate limiting on queries that could be used for systematic data extraction. The CNIL's AI guidelines explicitly state that a model facing a successful privacy attack that exposes personal data constitutes a data breach requiring notification.
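To make the membership inference idea concrete, here is a minimal sketch of a loss-threshold heuristic: guess that a record was in the training set when the model's error on it is unusually low. The toy model is deliberately exaggerated (it memorizes its training labels outright), and every name and value below is a hypothetical chosen for illustration. Real attacks and real models are more subtle, but the signal being exploited is the same.

```python
def train(members):
    """Toy overfit model: memorizes training labels, falls back to their mean."""
    table = dict(members)
    mean = sum(table.values()) / len(table)
    return lambda x: table.get(x, mean)


def squared_loss(model, x, y):
    return (model(x) - y) ** 2


def infer_membership(model, x, y, threshold):
    """Loss-threshold attack: guess 'member' when the loss is suspiciously low."""
    return squared_loss(model, x, y) < threshold


members = [(1, 10.0), (2, 20.0), (3, 30.0)]   # records used for training
non_members = [(4, 45.0), (5, 5.0)]           # records the model never saw
model = train(members)

guesses_members = [infer_membership(model, x, y, threshold=1.0) for x, y in members]
guesses_non_members = [infer_membership(model, x, y, threshold=1.0) for x, y in non_members]
```

The gap between the loss on training records and the loss on unseen records is what the attacker reads; defenses such as differential privacy aim to shrink that gap.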
Data Subject Rights and AI Systems
GDPR's data subject rights — access, erasure, rectification, portability, restriction, objection, and rights related to automated decision-making — apply to AI systems processing personal data. Each right creates specific operational challenges in the AI context.
Right of Access (Article 15)
Individuals can request confirmation that their data is being processed and a copy of that data. For AI systems, this raises the question: does the training corpus count? The EDPB's position is that if personal data was used in training, it was processed — and the individual's right of access applies to that processing activity, even if the data is no longer stored in a retrievable format.
Right to Erasure (Article 17)
The right to erasure in AI systems is one of the most technically demanding compliance obligations in the current landscape. Options include:
- Retraining without the subject's data — viable for smaller models; operationally impractical for large foundation models
- Machine unlearning — algorithmic techniques that attempt to reduce a specific individual's influence on model weights without full retraining. Still experimental; no vendor can currently guarantee effective unlearning for large-scale models
- Anonymization demonstration — showing, via the EDPB's two-part test, that the likelihood of identification from the model is insignificant. Difficult to demonstrate convincingly for most current models
Organizations that cannot demonstrate one of these approaches face a genuine compliance gap. Best practice, confirmed across regulatory guidance, is prevention: robust training data provenance tracking that identifies which individuals' data is in the training corpus before training, enabling removal requests to be honored before the model is built.
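Provenance tracking of this kind can be sketched as an index from data subject to training record, consulted before a training run. The `ProvenanceIndex` class, the `apply_erasure` helper, and the record identifiers are hypothetical names for illustration; linking records to individuals in a real corpus is the hard part and is assumed to have been done upstream.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Maps a data subject identifier to the training records that mention them."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def register(self, record_id, subject_ids):
        for subject_id in subject_ids:
            self._by_subject[subject_id].add(record_id)

    def records_for(self, subject_id):
        return set(self._by_subject.get(subject_id, set()))


def apply_erasure(corpus, index, subject_id):
    """Return the corpus without any record linked to the subject."""
    blocked = index.records_for(subject_id)
    return {rid: text for rid, text in corpus.items() if rid not in blocked}


corpus = {
    "r1": "post by alice",
    "r2": "email from bob",
    "r3": "thread with alice and bob",
}
index = ProvenanceIndex()
index.register("r1", ["alice"])
index.register("r2", ["bob"])
index.register("r3", ["alice", "bob"])

# Erasure request received before training: drop every record linked to alice.
remaining = apply_erasure(corpus, index, "alice")
```

Note that the shared record `r3` is removed as well, so one person's request can reduce the data available about others.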
Automated Decision-Making Rights (Article 22)
GDPR Article 22 gives individuals the right not to be subject to decisions based solely on automated processing, including profiling, that produce legal or similarly significant effects on them. This right applies to any AI system making consequential automated decisions: loan approvals, hiring screens, insurance pricing, content moderation, clinical pathway recommendations.
The obligations triggered by Article 22 include:
- Providing meaningful information about the logic involved
- Providing the ability to request human review of the automated decision
- Providing the ability to contest the decision
- Ensuring the automated processing itself is lawful (with explicit consent, necessity for contract performance, or authorization by law)
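One possible way to carry these obligations in a decision record is sketched below: the record stores a plain-language summary of the logic, and both a review request and a contest route the decision to a human whose outcome can replace the automated one. The `AutomatedDecision` type and the function names are assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field


@dataclass
class AutomatedDecision:
    subject_id: str
    outcome: str
    logic_summary: str            # plain-language explanation of the main factors
    status: str = "automated"
    events: list = field(default_factory=list)


def request_human_review(decision):
    """Queue the decision for a reviewer with authority to change the outcome."""
    decision.status = "pending_human_review"
    decision.events.append("human review requested")
    return decision


def contest(decision, reason):
    """Record the individual's objection; a contested decision goes to a human."""
    decision.events.append(f"contested: {reason}")
    return request_human_review(decision)


def record_review(decision, reviewer, new_outcome):
    """Store the reviewer's outcome, which may differ from the automated one."""
    decision.outcome = new_outcome
    decision.status = "reviewed"
    decision.events.append(f"reviewed by {reviewer}")
    return decision


d = AutomatedDecision("subject-17", "declined", "Income below threshold for requested amount")
contest(d, "income figure is out of date")
record_review(d, reviewer="analyst-3", new_outcome="approved")
```

Keeping the event trail on the record gives an audit history showing that review was requested and actually performed.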
Right to Object to Profiling
Where AI systems use personal data to build behavioral profiles — for targeted advertising, content personalization, or predictive scoring — individuals have the right to object under GDPR Article 21. Consent management infrastructure must capture these preferences and signal them to the AI system's decision logic, not just to the marketing stack.
The Regulatory Framework: What Applies, When, and to Whom
GDPR and the EU AI Act: Simultaneous Obligations
As DLA Piper noted in their 2025 GDPR enforcement survey: "GDPR is now being used as the primary enforcement tool for AI regulation as AI-specific rules are still being phased in." This means AI enforcement is happening now — not waiting for the EU AI Act's full implementation.
The EU AI Act's high-risk obligations (Annex III systems) impose a second, parallel compliance layer from August 2, 2026: technical documentation, risk management systems, human oversight, conformity assessment, and post-market monitoring — all running alongside GDPR, not replacing it.
Organizations deploying high-risk AI systems that process personal data must satisfy both frameworks simultaneously. A hiring AI must comply with GDPR Article 22 automated decision-making rights and EU AI Act Annex III human oversight requirements. Treating these as separate workstreams creates gaps. The practical approach is a unified compliance infrastructure addressing both.
Enforcement Precedents
| Organization | Regulator | Finding | Penalty |
|---|---|---|---|
| OpenAI (ChatGPT) | Italy Garante | Unlawful training data processing; GDPR Article 6 violation | €15 million (2024) |
| Clearview AI | France CNIL | Web scraping for facial recognition training without lawful basis | €20 million (2022) |
| Clearview AI | Italy Garante | Biometric data processing without lawful basis for AI training | €20 million (2022) |
| OpenAI | Poland UODO | Complaint upheld re: accuracy and access rights for AI-generated content | Investigation ongoing (2025) |
| Multiple companies | CPPA (California) | Automated decision-making technology without proper disclosures | Enforcement sweep launched 2025 |
The pattern is consistent: regulators are not waiting for AI-specific laws to enforce against AI processing. GDPR's existing framework — lawful basis, data minimization, purpose limitation, transparency, data subject rights — is being applied directly to AI systems now.
U.S. State Law Landscape
In 2025, U.S. states introduced 1,208 AI-related bills and enacted 145 of them (LinkedIn AI Regulation Analysis, February 2026). The most operationally significant:
- Colorado AI Act (SB 24-205): Effective June 30, 2026. Developers and deployers of high-risk AI making consequential decisions in employment, housing, and lending must implement risk management programs, conduct impact assessments, and provide consumer disclosures
- California CPPA ADMT regulations: Automated decision-making technology regulations requiring pre-use notice and opt-out rights for profiling in significant decisions
- Texas AI law (HB 149): Effective January 1, 2026. Requires developers of high-risk AI
Technical Controls for Privacy-Compliant AI
Privacy by design in AI systems means embedding data protection controls into the architecture before training begins, not retrofitting them after a regulatory inquiry. GDPR Article 25 requires it. The EU AI Act requires it. The technical controls that make it operational:
Differential privacy — a mathematical technique that adds calibrated noise to training data or model outputs such that any individual's contribution to the training dataset cannot be statistically distinguished. It provides a provable privacy guarantee that supports the anonymization assessments the EDPB requires.
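As a minimal sketch of calibrated noise, the example below applies the Laplace mechanism to a counting query rather than to model training (differentially private training uses related but more involved methods). The function names, the `ages` data, and the choice of `epsilon` are illustrative assumptions. A count changes by at most 1 when one person is added or removed, so noise with scale `1 / epsilon` is used.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count for a query with sensitivity 1.

    Smaller epsilon means more noise and a stronger privacy guarantee.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(scale=1 / epsilon, rng=rng)


rng = random.Random(0)                 # seeded only to make the sketch repeatable
ages = [23, 35, 41, 52, 29, 67, 38]    # hypothetical records, one per person
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

Any single released value is noisy, but the average over many releases converges on the true count, which is why repeated queries consume privacy budget in practice.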
Data minimization in feature engineering — selecting only the features necessary for the model's stated purpose before training, and eliminating fields that could enable re-identification or introduce bias. The EU AI Act for CTOs guidance from Secure Privacy notes: "For AI systems, this means training data minimisation, feature selection that avoids processing personal data where statistical patterns can be learned without it, and inference-time controls that limit what data the model is exposed to."
Federated learning — training a model across distributed devices or data silos without centralizing raw personal data. Each device trains locally on its own data; only model updates (gradients), not raw data, are shared. Reduces exposure from a single large training corpus, though does not eliminate all privacy risks.
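The mechanics can be sketched with a one-parameter model, assuming a federated-averaging style round in which each client returns its locally updated weight (rather than raw gradients) and the server averages those weights by sample count. The client datasets, learning rate, and function names are hypothetical.

```python
def local_update(w, data, lr=0.01, steps=20):
    """Gradient descent on mean squared error for y = w * x, on one client's data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """Clients train locally; the server sees only weights, never the (x, y) rows."""
    total = sum(len(data) for data in clients)
    return sum(local_update(w_global, data) * len(data) for data in clients) / total


# Two hypothetical clients whose raw data never leaves the device.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(10):
    w = federated_round(w, clients)

print(f"global weight after 10 rounds: {w:.4f}")
```

The shared weights still carry information about each client's data, which is why federated learning is often combined with other safeguards rather than treated as sufficient on its own.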
Data anonymization and pseudonymization — removing or replacing direct identifiers before training. Pseudonymized data is still personal data under GDPR (because re-identification is possible with additional information); genuinely anonymized data falls outside GDPR entirely. The EDPB's high threshold for anonymization means most "anonymized" training datasets remain personal data.
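A minimal pseudonymization sketch, assuming direct identifiers are dropped and replaced with a keyed hash of the email address. The field names, the `prepare_record` helper, and the key value are hypothetical. Because anyone holding the key can recompute the token for a known email, the output remains personal data, which is exactly the distinction drawn above.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Keyed hash: stable per person, re-linkable by anyone holding the key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_record(record, secret_key, direct_identifiers=("name", "email")):
    """Drop direct identifiers and attach a pseudonymous subject token."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["subject_token"] = pseudonymize(record["email"], secret_key)
    return cleaned


secret_key = b"replace-with-a-managed-secret"  # placeholder; store real keys separately
record = {"name": "Jane Doe", "email": "jane@example.com", "plan": "pro"}
training_row = prepare_record(record, secret_key)
```

The remaining fields can still single someone out in combination, so removing direct identifiers alone does not amount to anonymization.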
Output filtering — scanning model outputs for personal data patterns (email addresses, phone numbers, names, account numbers) and redacting or blocking outputs that reproduce training data verbatim.
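A pattern-based output filter might look like the sketch below. The two regular expressions are simplified assumptions that will miss many real formats and can also produce false positives; names and account numbers are not handled here and typically need a dedicated entity-recognition step.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


model_output = "Contact jane.doe@example.com or +1 415 555 0100."
safe_output = redact(model_output)
print(safe_output)
```

Running the filter on the model's response just before it is returned keeps the check independent of how the text was generated.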
Generative AI: Personal Data at the Prompt Layer
Generative AI — large language models, image generators, and multimodal systems — introduces a personal data processing layer that did not exist in traditional predictive AI: the user prompt.
Every prompt submitted to an LLM may contain personal data. A user drafting an HR letter includes employee names and performance details. A user asking for contract assistance pastes client information. A developer submitting code for review includes API keys and internal system identifiers.
Chatbots and Retrieval-Augmented Generation (RAG) systems introduce additional complexity: RAG systems that pull from internal document stores before generating responses must govern the retrieval sources with the same controls as the model's training data. Germany's Datenschutzkonferenz issued guidance in October 2025 specifically on RAG systems, finding that unrestricted retrieval from external databases containing personal data creates the highest GDPR exposure category.
Operational requirements for generative AI personal data processing:
- User disclosure: individuals must be informed when their inputs will be processed and whether those inputs may be used to improve the model
- Input filtering: systems should flag and handle inputs containing high-risk personal data categories differently from general-purpose text
- Vendor DPA: a Data Processing Agreement must be executed with every LLM API provider before personal data flows through the integration
- Cross-border transfer mechanisms: if the LLM provider processes data outside the EEA, a valid transfer mechanism (Standard Contractual Clauses, adequacy decision) must be in place before prompts containing personal data are transmitted
- Prompt log governance: inference logs containing personal data are subject to purpose limitation, access controls, and retention schedules — not treated as unstructured operational telemetry
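The vendor DPA and cross-border transfer requirements above can be enforced as a gate that runs before any prompt leaves your infrastructure. The sketch below assumes a simple vendor registry; the `Vendor` fields, the `may_transmit` function, and the vendor name are hypothetical, and deciding whether a prompt contains personal data is assumed to happen in a separate input-filtering step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    dpa_signed: bool
    processes_outside_eea: bool
    transfer_mechanism: Optional[str] = None  # e.g. "SCC" or "adequacy"


def may_transmit(vendor, contains_personal_data):
    """Allow transmission only when the contractual prerequisites are in place."""
    if not contains_personal_data:
        return True
    if not vendor.dpa_signed:
        return False
    if vendor.processes_outside_eea and vendor.transfer_mechanism is None:
        return False
    return True


no_transfer_basis = Vendor("example-llm", dpa_signed=True, processes_outside_eea=True)
with_scc = Vendor("example-llm", dpa_signed=True, processes_outside_eea=True,
                  transfer_mechanism="SCC")

blocked = may_transmit(no_transfer_basis, contains_personal_data=True)
allowed = may_transmit(with_scc, contains_personal_data=True)
```

Placing the check in the integration layer means a missing agreement blocks the call itself instead of surfacing later in an audit.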
How Secure Privacy Supports Privacy-Compliant AI
Secure Privacy provides the infrastructure that connects AI governance obligations to operational privacy management — covering the shared compliance layer that GDPR and the EU AI Act both require.
AI system inventory and DPIA workflows: Structured Data Protection Impact Assessment processes for AI systems processing personal data, satisfying both GDPR Article 35 and EU AI Act Article 27 (FRIA) requirements. Built-in DPO review routing, sign-off tracking, and version-controlled documentation.
Training data provenance documentation: Records of training dataset sources, legal bases, consent provenance, and data subject coverage — the documentation that satisfies both GDPR Article 30 (Records of Processing Activities) and EU AI Act technical documentation requirements in a single evidence store.
Consent management for AI systems: For AI systems that make decisions based on user behavioral data, Secure Privacy's AI Governance platform ensures that consent signals flow into the AI system's processing logic in real time. Users who have not consented to profiling cannot have their data used in AI personalization — enforced at the infrastructure level, not just declared in a privacy policy.
Data subject request handling for AI processing: The DSAR module handles requests involving AI-processed personal data — including access requests about AI-generated inferences and deletion requests that require documented responses about training data handling — with audit-ready records for regulatory inquiry.
Vendor management for AI providers: Data Processing Agreement workflows covering LLM API providers and AI service vendors, with cross-border transfer mechanism documentation and periodic reassessment triggers.
AI governance framework tools: Post-deployment monitoring dashboards covering DPIA completion rates, data minimization compliance, and incident status across all registered AI systems — making the continuous governance obligation operational rather than theoretical.
Frequently Asked Questions
Is training data always "personal data" under GDPR?
Not automatically — but in practice, yes for most commercially relevant training datasets. GDPR's definition of personal data is broad: any information relating to an identified or identifiable natural person. Web-scraped text, behavioral logs, enterprise records, and user-generated content all typically contain personal data. Data that has been genuinely anonymized — meeting the EDPB's high threshold of insignificant re-identification likelihood — falls outside GDPR. Pseudonymized data does not. Organizations should not assume their training data is anonymous without conducting the specific assessment the EDPB describes in Opinion 28/2024.
Do we need a DPIA for every AI system we deploy?
A DPIA is mandatory under GDPR Article 35 for any processing "likely to result in a high risk to the rights and freedoms of natural persons." The criteria that trigger this threshold, per EDPB guidelines, include systematic processing of sensitive data, large-scale processing, automated decision-making with significant effects, and monitoring of publicly accessible areas. Most AI systems processing personal data at scale meet at least one of these criteria. The practical guidance from supervisory authorities is to conduct a DPIA for any AI system that makes or contributes to decisions affecting individuals, uses biometric or special category data, or processes behavioral data at scale.
Can we use publicly available data to train AI models without GDPR issues?
Public availability does not equal lawful processing. The France CNIL's fine against Clearview AI was specifically for web-scraping publicly available facial images for AI training without a valid lawful basis. The EDPB and multiple supervisory authorities have confirmed that "publicly available" is not a legal basis under GDPR. A valid Article 6 basis must be established independently of whether the data was publicly accessible at the time of collection.
What is the right approach when a data subject requests erasure of their data from a trained model?
The GDPR does not provide a technical specification for this scenario because it predates modern AI. The regulatory expectation, based on DPA guidance and enforcement decisions, is: first, document all training data provenance so you can assess whether a specific individual's data is in scope; second, honor the erasure right to the extent technically feasible — removing data from training datasets before training where possible, or retraining where not; third, where full erasure is technically infeasible, document why and implement compensating controls. The "harm will nonetheless be done" framing from Tiffany Li captures the regulatory direction of travel: claiming technical impossibility is becoming a less viable defense as machine unlearning techniques mature.
How do we manage GDPR compliance when we use a third-party LLM API?
Three steps are required before any personal data flows to a third-party LLM: (1) execute a Data Processing Agreement with the LLM provider covering GDPR Article 28 requirements — scope of processing, security obligations, sub-processor disclosure, and deletion obligations; (2) establish a valid cross-border transfer mechanism if the provider processes data outside the EEA — Standard Contractual Clauses are the default; (3) assess whether the provider's model improvement practices (whether your prompts may be used to improve their model) are compatible with your users' reasonable expectations, and disclose accordingly in your privacy notice.
What does GDPR Article 22 mean in practice for our AI systems?
Article 22 applies when an AI system makes decisions "based solely on automated processing" that "produces legal or similarly significant effects" on an individual. The practical test: could the output of the AI system, without human review, cause material consequences — financial, employment-related, health-related, legal, or equivalent? If yes, Article 22 requires: proactive disclosure to individuals before the processing occurs; a mechanism for individuals to request human review; a mechanism to contest decisions; and a valid Article 22(2) exemption (consent, contract necessity, or legal authorization). The human review mechanism must be genuine — a human who actually reviews the case — not a nominal "human in the loop" who rubber-stamps every automated output.
Summary: Personal Data Processing Across the AI Lifecycle
| Stage | Personal Data Present | Key GDPR Obligation | Key Risk |
|---|---|---|---|
| Data collection | Training corpus contents | Lawful basis (Art. 6); RoPA (Art. 30) | Unlawful scraping; purpose mismatch |
| Training | Encoded in model weights | Data minimization; DPIA if high-risk | Memorization; unlawful processing tainting the model |
| Embedding/representation | Inference-time vectors | Access controls; DPA with embedding API providers | Information leakage from embedding attacks |
| Inference (input) | User prompts; input records | Transparency; purpose limitation | Prompt logging; unintended data collection |
| Inference (output) | AI-generated predictions about individuals | Article 22 automated decision-making rights | No human oversight; no contestability mechanism |
| Logging | Input/output logs | Retention minimization; access controls | Purpose creep; uncontrolled log access |
| Model retirement | Residual personal data in weights | Erasure obligations; documentation retention | Incomplete deletion; undocumented residual risk |
Secure Privacy is a unified consent management and privacy governance platform supporting GDPR, EU AI Act, and 65+ privacy regulations. Its AI governance capabilities cover DPIA workflows, training data documentation, consent management for AI-driven personalization, and DSAR handling for AI-processed data.