Privacy and AI risk management to protect personal data
There is an undeniably close relationship between AI models and the protection of personal data, and it is to be expected that this relationship will be explored in ever greater depth as the use of AI becomes more widespread.
In this regard, the Hamburg Commissioner for Data Protection and Freedom of Information recently published a discussion paper analyzing and taking a position on the applicability of the General Data Protection Regulation (GDPR) to large language models (LLMs).
The authority argues that these models do not store data in its original format, since the text used in their training is transformed into tokens, or text fragments, from which individuals are not directly identifiable.
Discussion on data protection in language models
Tokenization is a process by which text is converted into sequences of smaller fragments that do not contain the original information in its entirety, making it virtually impossible to identify a person directly from these fragments.
Once training is finished, the model retains only mathematical patterns represented by the weights of its neural connections, without storing the original text. On this basis, the authority concludes that the mere storage of information by an LLM does not constitute processing of personal data within the meaning of the GDPR.
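To make this argument more concrete, the following minimal Python sketch (purely illustrative: the subword vocabulary, the token IDs and the example name are all made up, and real LLM tokenizers are far larger and learned from data) shows how a sentence containing personal data ends up as a sequence of integer IDs, which are only readable again if one also holds the vocabulary mapping.

```python
# Illustrative only: a toy subword vocabulary, not any real LLM tokenizer.
toy_vocab = {
    "An": 101, "na": 102, " Gar": 103, "cia": 104,
    " lives": 105, " in": 106, " Hamburg": 107, ".": 108,
}
id_to_piece = {i: piece for piece, i in toy_vocab.items()}

def encode(text: str) -> list[int]:
    """Greedily match the longest known piece at each position."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):  # try longest pieces first
            if text[pos:end] in toy_vocab:
                ids.append(toy_vocab[text[pos:end]])
                pos = end
                break
        else:
            raise ValueError(f"no token covers position {pos}")
    return ids

sentence = "Anna Garcia lives in Hamburg."
token_ids = encode(sentence)
print(token_ids)  # [101, 102, 103, 104, 105, 106, 107, 108]

# The IDs alone contain no readable name; decoding requires the vocabulary.
print("".join(id_to_piece[i] for i in token_ids))  # reconstructs the sentence
```

During training these IDs are mapped to numerical vectors, and afterwards only the learned weights remain; the original strings are not kept as such, which is precisely the point the discussion paper relies on.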
Although LLMs do not store full texts, they can generate information that matches personal data.
This approach, however, has been criticized by experts such as David Rosenthal, who point out that, although LLMs do not explicitly store data, they can generate consistent output that matches personal data when that data has appeared repeatedly in the training set. This raises risks associated with the use of AI even in the absence of direct storage.
Risk identification
In terms of regulatory compliance, a point the European privacy regulation (GDPR) and the recent AI Regulation have in common is their risk-management approach, which will require organizations to adapt that management to the new realities that arise hand in hand with the adoption of emerging technologies such as Artificial Intelligence.
Lack of transparency and interpretability in LLMs complicates GDPR auditing and makes analysis of how personal data is processed difficult.
A reference framework for identifying risks in this context is the AI Risk Repository developed by MIT (Massachusetts Institute of Technology), a comprehensive, living database of more than 700 AI risks categorized by cause and risk domain on the basis of 43 AI-related risk frameworks. These risks include:
- Compromise of privacy by leaking sensitive information: as mentioned above, although LLMs do not store full text, they can generate information that matches personal data.
- False or misleading information: LLMs can generate incorrect or misinterpreted content about individuals, affecting their reputation and the integrity of the personal data processed.
- Disinformation and manipulation: LLMs can be used in malicious campaigns to manipulate personal information or influence people's behavior, which, even if unintentional, can result in improper data processing.
- Lack of transparency and interpretability: LLMs function as “black boxes”, making it difficult to analyze how personal data is processed and complicating auditing in terms of GDPR compliance.
- Security vulnerabilities: LLMs can be vulnerable to attacks that exploit weaknesses in their infrastructure, which could result in unauthorized access and exposure of personal data.
Compliance and risk reduction
Based on these general risks taken from a reference framework, doubts may arise regarding compliance with the principles relating to processing set out in Article 5 of the GDPR, such as the accuracy principle: if LLMs can generate incorrect information based on the data they were trained on, there could be a spread of inaccurate personal data.
In addition, there is no easy mechanism to correct errors in the data generated by the model, which contravenes the accuracy principle.
Another question that can be raised is how the minimization principle fits in when these models are usually trained on huge amounts of data (often from unstructured datasets that are difficult to control), and whether it is possible to determine that the personal information used to train the model was only that necessary to achieve the model's objectives.
By integrating minimization and access limitation principles by design, risks can be identified and mitigated preemptively.
This reinforces the idea that AI risk management systems will need to be aligned with those already being developed for GDPR compliance.
It will therefore be crucial to keep in mind the principle of privacy by design and by default (PbD) and to implement measures that ensure that only the personal data necessary for each specific purpose is processed. Integrating the principles of data minimization, access limitation and appropriate retention from the very start of the design of systems and processes makes it possible to identify and mitigate risks preventively.
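As a hedged illustration of what minimization by design can look like in practice, the sketch below (hypothetical: the regular expressions, placeholders and example text are assumptions, and real deployments would need far more robust detection of identifiers) masks obvious direct identifiers in free text before it is handed to any AI component, so that only the data needed for the specific purpose is processed.

```python
import re

# Hypothetical pre-processing step: mask direct identifiers before free text
# reaches an AI component (illustrative patterns only, not exhaustive).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d .-]{7,}\d")

def minimize(text: str) -> str:
    """Replace e-mail addresses and phone numbers with neutral placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

ticket = "Customer maria.lopez@example.com called from +34 600 123 456 about invoice 4711."
print(minimize(ticket))
# Customer [EMAIL] called from [PHONE] about invoice 4711.
```

A step of this kind, applied by default at the point where data enters the system, is one concrete way of translating the minimization and access-limitation principles mentioned above into the design of the processing itself.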
✅ At Govertis, part of Telefónica Tech, we recommend considering the practical application of PbD (privacy by design and by default) when adopting technology that involves AI.
Image: DC Studio / Freepik.