Regulator consults on web scraping to train generative AI
The UK’s data protection regulator has launched a consultation on generative AI. The Information Commissioner’s Office will have a key role in regulating AI in the UK, so this consultation is a useful signpost for the direction in which the ICO is heading, as well as an opportunity to influence it.
The first phase of the consultation focuses on the use of web scraping to train generative AI models. AI models are often trained on data scraped from publicly accessible web pages. The developers either scrape the data directly or use databases compiled by third parties who have used the same method.
There are ongoing legal claims relating to AI and web scraping, but they have mostly focused on copyright infringement issues. Getty Images’ ongoing claim against Stability AI, the developer of Stable Diffusion, is one example.
The ICO’s consultation is about data protection compliance and the issues raised when personal data is scraped from the web for AI training purposes. The ICO shares its initial thoughts about whether there is a lawful basis under UK data protection laws to use personal data in that way.
The ICO’s initial conclusion is that there is potentially a lawful basis. However, the ICO seems to have serious reservations about relying on it.
Developers will need to show that they have a ‘legitimate interest’ in using the personal data to train the model. Significantly, the ICO seems to think wanting to build a model is not a legitimate interest in itself; whether the interest is legitimate might depend on what the model is intended to be used for.
The consultation repeatedly highlights the risks of individuals losing control of their data once it is used to train an AI model, as well as the potential risks to those individuals from subsequent use of the model. Those risks have to be balanced against the interest in training the model.
The consultation also highlights that the risks to individuals may differ depending on how a developer makes the model available to third parties:
- Developers who make their models available via API can potentially exercise greater technical control over how the model is deployed. For example, the API could be designed to prevent the model from responding to the types of queries most likely to cause data protection risks.
- Developers who adopt an open-source model have few options for controlling how it is used. If developers simply rely on their contracts with customers, the ICO will apparently expect to see evidence that the contracts are complied with.
The consultation concludes that developers using personal data from web scraping will need to:
- Identify and evidence a valid and clear legitimate interest.
- Consider the potential impact on individuals’ rights particularly carefully when they do not or cannot exercise meaningful control over the use of the model.
- Demonstrate how the interest they have identified will be realised and how the risks to individuals will be meaningfully mitigated, including how individuals can exercise their information rights.
You can read the consultation and submit your views here.