Privacy Concerns in Web Scraping: a GDPR and Serbian Privacy Law Perspective

In both situations, the data sets almost always include personal data. Thus, AI developers should carefully consider their obligations under the GDPR as well as local privacy law, depending on what applies to them.

The million-dollar question

Privacy compliance, amongst other things, should be considered as soon as training data is collected. Even if publicly available data is used for training purposes (e.g. data published on YouTube), it does not mean that such data can be freely used. This is a common misconception amongst AI developers. Training on data sets that include personal data can take place only if the developers have a lawful basis for processing such data. Under the GDPR, this usually comes down to two things: consent or legitimate interest. While establishing either may appear challenging or even impossible, all AI providers training on personal data should consider privacy concerns very carefully.

The first EU guidance on the lawfulness of web scraping was set out in the report of the ChatGPT Taskforce of the European Data Protection Board (the "EDPB"), issued in May 2024. This report indirectly supports innovation and indicates that legitimate interest might be the only possible legal basis for data processing carried out through web scraping, provided that certain safeguards are applied. A couple of months prior to this, in March 2024, the UK Information Commissioner's Office (the "ICO") issued a consultation exploring the legality of web scraping. It also concluded that legitimate interest is the only remaining lawful basis for web scraping.

According to both the ICO and the EDPB's report, legitimate interest might be considered as a lawful basis for web scraping if the following criteria are met: (i) a legitimate interest exists; (ii) the processing is necessary, with personal data being adequate, relevant and limited to what is required for the purposes for which they are processed; and (iii) the interests are balanced.

Legitimate interest can serve as a lawful basis for data processing only if the interest is clearly defined and justified. Thus, when training AI models on web scraped data, this interest should not be broadly defined or vague. According to the EDPB, it is necessary not only to recognise the interest, but also to concretely justify it in terms of the purpose for which the data is collected. If the intended use of the model cannot be clearly defined in advance, it becomes challenging to justify it.

Web scraping is often considered necessary due to the volume of data required to train these models. However, according to the EDPB, even when large data sets are used, it must be ensured that unnecessary data is not collected, especially data that is not relevant to the specific training purposes. Therefore, the EDPB emphasises the importance of applying measures during data collection and excluding certain types of data from the collection process, such as public social media profiles.

Balancing interests is perhaps the most complex criterion. It is necessary to assess whether the rights and freedoms of individuals outweigh the legitimate interests of the controller. Web scraping is an invisible processing activity, where people are often unaware that their data has been collected and processed in this way. This means that individuals may lose control over their data, which can compromise their privacy rights. This necessitates the mandatory application of technical and organisational measures, such as data filtering during collection and excluding certain sources from the process.
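To make the idea of "data filtering during collection and excluding certain sources" more concrete, the following is a minimal sketch in Python. It is an illustration only, not a method prescribed by the EDPB or the ICO: the host names, the `EXCLUDED_HOSTS` list and the function names are hypothetical, and a single e-mail pattern is nowhere near sufficient to detect personal data in practice.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion list: sources the controller has decided not to collect from
# (for example, social media profiles or forums likely to contain sensitive data).
EXCLUDED_HOSTS = {"social.example", "health-forum.example"}

# Deliberately simple pattern for e-mail addresses; an assumption for illustration.
# Real personal data detection requires far more than one regular expression.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_excluded(url: str) -> bool:
    """Return True if the URL's host is an excluded host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS)


def redact(text: str) -> str:
    """Replace e-mail addresses with a placeholder before the text is stored."""
    return EMAIL_RE.sub("[EMAIL]", text)


def filter_pages(pages):
    """Drop pages from excluded sources and redact the remaining text.

    `pages` is an iterable of (url, text) pairs.
    """
    kept = []
    for url, text in pages:
        if is_excluded(url):
            continue  # excluded source: never enters the data set
        kept.append((url, redact(text)))
    return kept
```

Applying the filter at the point of collection, rather than after a data set has been assembled, means that excluded material is never stored in the first place, which is closer to the data minimisation logic described above. A measure of this kind supports the balancing exercise but does not replace it.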

Special approach for special categories of personal data

A particular issue arises with the scraping of special categories of personal data, such as data related to health, political views and religious beliefs. Processing this data is prohibited unless one of the exceptions in Article 9(2) of the GDPR applies, most notably the explicit consent of the individual, which further complicates the legality of web scraping. Without clear and explicit consent, processing such data may directly violate the GDPR, which strictly demands respect for privacy and individual rights.

One example where this issue arises is search engine scraping. This is what Google engages in when it collects data for the sole purpose of indexing and enabling searches. Unique to search engines, this form of scraping may be considered justified in the context of the public's right to information, as recognised by the Charter of Fundamental Rights of the European Union, but each case must still be carefully evaluated to ensure that the fundamental rights of individuals are not violated. This exception can only be justified with strict protective measures and a clear framework that limits processing to what is necessary to achieve legitimate objectives.

But that's not all

One of the key elements in ensuring GDPR compliance, especially in the context of web scraping, is the obligation to inform individuals whose data is being collected, even when consent is not the basis for processing. Article 13 of the GDPR requires that individuals be informed at the time their data is collected directly from them. However, when data is collected through web scraping, which often involves gathering data from publicly available sources, Article 14 of the GDPR (or Article 24 of the Serbian privacy law) applies. Article 14 governs the obligation to inform individuals about the processing of their data when the data has not been obtained from them directly, as is the case with web scraping.

Depending on the AI product itself, the provider of AI systems might also have other obligations under the GDPR and/or local privacy laws. These obligations include carrying out a legitimate interest assessment (LIA) and a data protection impact assessment (DPIA), possibly with the obligation to acquire prior approval from the competent authority (depending on the AI system itself).

Final remarks

In an era of rapid AI development and widespread digitalisation, the legality of web scraping has become a critical question for AI developers. Despite the potential for innovation that web scraping offers, it is all too often forgotten that every step in this process is deeply rooted in a complex legal framework designed to protect individuals' privacy. The EU AI Act will become applicable to generative AI models in the EU within a year (or within three years, depending on whether the models were placed on the market before 2 August 2025), and it may also apply outside of the EU in specific situations. Developers collecting data through web scraping should therefore carefully analyse whether their products will be affected by this law. If so, their products and business operations must be promptly adjusted to reflect these developments.

By Marija Vlajkovic, Partner, and Marija Lukic, Senior Associate, Schoenherr
