When developing their models, AI providers use various data sets. Sometimes these are provided by their clients, as in the case of tailor-made chatbots, and sometimes the models are trained on licensed or even publicly available data.
In both situations, the data sets almost always include personal data. Thus, AI developers should carefully consider their obligations under the GDPR as well as local privacy law, depending on what applies to them.
The million-dollar question
Privacy compliance, amongst other things, should be considered as soon as training data is collected. The fact that data is publicly available (e.g. data published on YouTube) does not mean that it can be freely used for training purposes. This is a common misconception amongst AI developers. Training on data sets that include personal data can take place only if the developers have a legal basis for processing such data. Under the GDPR, this usually comes down to two options: consent or legitimate interest. While establishing either basis may appear challenging or even impossible, all AI providers training on personal data should consider privacy concerns very carefully.
The first EU guidance on the lawfulness of web scraping was set out in the report of the EDPB's ChatGPT Taskforce, issued in May 2024. This report indirectly supports innovation and stresses that legitimate interest might be considered the only possible legal basis for processing data obtained through web scraping, provided that certain safeguards are applied. A couple of months prior to this, in March 2024, the UK Information Commissioner's Office (the "ICO") issued a consultation exploring the legality of web scraping. It also concluded that legitimate interest is the only remaining lawful basis for web scraping.
Pursuant to both the ICO's consultation and the EDPB's report, legitimate interest might be considered a lawful basis for web scraping if the following criteria are met: (i) a legitimate interest exists; (ii) the processing is necessary, with personal data being adequate, relevant and limited to what is required for the purposes for which they are processed; and (iii) the interests are balanced.
Legitimate interest can serve as a lawful basis for data processing only if the interest is clearly defined and justified. Thus, when training AI models on web-scraped data, this interest should not be broadly defined or vague. According to the EDPB, it is necessary not only to recognise the interest, but also to concretely justify it in terms of the purpose for which the data is collected. If the intended use of the model cannot be clearly defined in advance, it becomes challenging to justify the interest.
Web scraping is often considered necessary due to the volume of data required to train these models. However, according to the EDPB, even when large data sets are used, it must be ensured that unnecessary data is not collected, especially data that is not relevant to the specific training purposes. Therefore, the EDPB emphasises the importance of applying measures during data collection and excluding certain types of data from the collection process, such as public social media profiles.
Balancing interests is perhaps the most complex criterion. It is necessary to assess whether the rights and freedoms of individuals outweigh the legitimate interests of the controller. Web scraping is an invisible processing activity: people are often unaware that their data has been collected and processed in this way. This means that individuals may lose control over their data, which can compromise their privacy rights. It is therefore necessary to apply technical and organisational measures, such as filtering data during collection and excluding certain sources from the process.
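The measures mentioned above (excluding certain sources and filtering data during collection) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a compliance tool: the blocklist of domains, the e-mail pattern and the function names are all hypothetical, and a real pipeline would need far broader detection of personal data. Applying such filters supports the balancing test but does not by itself make the processing lawful.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist: sources the developer has decided not to collect from,
# e.g. sites that mainly host public social media profiles.
EXCLUDED_DOMAINS = {"facebook.com", "instagram.com", "linkedin.com"}

# Deliberately simple pattern (an assumption): it only catches common e-mail shapes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_excluded_source(url, excluded=EXCLUDED_DOMAINS):
    """Return True if the URL's host is an excluded domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in excluded)


def redact_emails(text):
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED]", text)


def filter_pages(pages):
    """Drop pages from excluded sources and redact e-mails in the rest.

    `pages` is an iterable of (url, text) pairs.
    """
    kept = []
    for url, text in pages:
        if is_excluded_source(url):
            continue  # source excluded from the collection process
        kept.append((url, redact_emails(text)))
    return kept
```

For example, calling `filter_pages` on a page from `www.facebook.com` and a page from `example.org` would keep only the second, with any e-mail addresses in its text replaced by `[REDACTED]`.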
Special approach for special categories of personal data
A particular issue arises with the scraping of special categories of personal data, such as data related to health, political views and religious beliefs. Processing this data requires the explicit consent of the individual, which further complicates the legality of web scraping. Without clear and explicit consent, processing such data may directly violate the GDPR, which strictly demands respect for privacy and individual rights.
One example where this issue arises is search engine scraping. This is what Google engages in when it collects data for the sole purpose of indexing and enabling searches. Unique to search engines, this form of scraping may be considered justified in the context of the public's right to information, as recognised by the Charter of Fundamental Rights of the European Union, but each case must still be carefully evaluated to ensure that the fundamental rights of individuals are not violated. This exception can only be justified with strict protective measures and a clear framework that limits processing to what is necessary to achieve legitimate objectives.
But that's not all
One of the key elements in ensuring GDPR compliance, especially in the context of web scraping, is the obligation to inform individuals whose data is being collected, even when consent is not the basis for processing. Article 13 of the GDPR requires that individuals be informed before their data is processed where that data is collected directly from them. However, when data is collected through web scraping, which often involves gathering data from publicly available sources, Article 14 of the GDPR (or Article 24 of the Serbian privacy law) applies. This article governs the obligation to inform individuals about the processing of their data where the data is obtained indirectly and the processing is not immediately apparent to them, as is the case with web scraping.
Depending on the AI product itself, the provider of AI systems might also have other obligations under the GDPR and/or local privacy laws. These obligations include carrying out a legitimate interest assessment (LIA) and a data protection impact assessment (DPIA), possibly with the obligation to obtain prior approval from the competent authority (depending on the AI system itself).
Final remarks
In an era of rapid AI development and widespread digitalisation, the legality of web scraping has become a critical question for AI developers. Despite the potential for innovation that web scraping offers, it is all too often forgotten that every step in this process is subject to a complex legal framework designed to protect individuals' privacy. The EU AI Act will become applicable to generative AI models within a year in the EU (or within three years for models placed on the market before 2 August 2025), and in specific situations outside the EU. Developers collecting data through web scraping should therefore carefully analyse whether their products will be affected by this law. If so, their products and business operations must be promptly adjusted to reflect these developments.
By Marija Vlajkovic, Partner, and Marija Lukic, Senior Associate, Schoenherr