Privacy Concerns in Web Scraping: a GDPR and Serbian Privacy Law Perspective

In both situations, the data sets almost always include personal data. Thus, AI developers should carefully consider their obligations under the GDPR as well as local privacy law, depending on what applies to them.

The million-dollar question

Privacy compliance, amongst other things, should be considered as soon as training data is collected. Even if publicly available data is used for training purposes (e.g. data published on YouTube), it does not mean that such data can be freely used. This is a common misconception amongst AI developers. Training on data sets that include personal data can take place only if the developers have a legal basis for processing such data. Under the GDPR, this usually comes down to two options: consent or legitimate interest. While securing either may appear challenging or even impossible, all AI providers training on personal data should consider privacy concerns very carefully.

The first EU guidance on the lawfulness of web scraping was set out in the report of the EDPB's ChatGPT Taskforce, issued in May 2024. This report indirectly supports innovation and indicates that legitimate interest might be considered the only possible legal basis for data processing through web scraping, provided that certain safeguards are applied. A couple of months prior to this, in March 2024, the UK Information Commissioner's Office (the "ICO") issued a consultation exploring the legality of web scraping. It also concluded that legitimate interest is the only remaining lawful basis for web scraping.

According to both the ICO and the EDPB's report, legitimate interest might be considered a lawful basis for web scraping if the following criteria are met: (i) a legitimate interest exists; (ii) the processing is necessary, with personal data being adequate, relevant and limited to what is required for the purposes for which they are processed; and (iii) the interests are balanced.

Legitimate interest can serve as a lawful basis for data processing only if the interest is clearly defined and justified. Thus, when training AI models on web scraped data, this interest should not be broadly defined or vague. According to the EDPB, it is necessary not only to recognise the interest, but also to concretely justify it in terms of the purpose for which the data is collected. If the intended use of the model cannot be clearly defined in advance, it becomes challenging to justify it.

Web scraping is often considered necessary due to the volume of data required to train these models. However, according to the EDPB, even when large data sets are used, it must be ensured that unnecessary data is not collected, especially data that is not relevant to the specific training purposes. Therefore, the EDPB emphasises the importance of applying measures during data collection and excluding certain types of data from the collection process, such as public social media profiles.

Balancing interests is perhaps the most complex criterion. It is necessary to assess whether the rights and freedoms of individuals outweigh the legitimate interests of the controller. Web scraping is an invisible processing activity, where people are often unaware that their data has been collected and processed in this way. This means that individuals may lose control over their data, which can compromise their privacy rights. Controllers must therefore apply technical and organisational measures, such as filtering data during collection and excluding certain sources from the process.

Special approach for special categories of personal data

A particular issue arises with the scraping of special categories of personal data, such as data related to health, political views and religious beliefs. Processing this data requires the explicit consent of the individual, which further complicates the legality of web scraping. Without clear and explicit consent, processing such data may directly violate the GDPR, which strictly demands respect for privacy and individual rights.

One example where this issue arises is search engine scraping. This is what Google engages in when it collects data for the sole purpose of indexing and enabling searches. Unique to search engines, this form of scraping may be considered justified in the context of the public's right to information, as recognised by the Charter of Fundamental Rights of the European Union, but each case must still be carefully evaluated to ensure that the fundamental rights of individuals are not violated. This exception can only be justified with strict protective measures and a clear framework that limits processing to what is necessary to achieve legitimate objectives.

But that's not all

One of the key elements in ensuring GDPR compliance, especially in the context of web scraping, is the obligation to inform individuals whose data is being collected, even when consent is not the basis for processing. Article 13 of the GDPR requires that individuals be informed prior to the processing of data collected directly from them. However, when data is collected through web scraping, which often involves gathering data from publicly available sources, Article 14 of the GDPR (or Article 24 of the Serbian privacy law) applies. This article governs the obligation to inform individuals about the processing of their data when the data has not been obtained from them directly, as is the case with web scraping.

Depending on the AI product itself, the provider of an AI system might also have other obligations under the GDPR and/or local privacy laws. These obligations include carrying out a legitimate interest assessment (LIA) and a data protection impact assessment (DPIA), possibly with the obligation to acquire prior approval from the competent authority (depending on the AI system itself).

Final remarks

In an era of rapid AI development and widespread digitalisation, the legality of web scraping has become a critical question for AI developers. Despite the potential for innovation that web scraping offers, it is all too often forgotten that every step in this process is deeply rooted in a complex legal framework designed to protect individuals' privacy. Given that the EU AI Act will become applicable for generative AI models within a year in the EU (or three years depending on whether the models were placed on the market before 2 August 2025), or outside of the EU in specific situations, developers collecting data through web scraping should carefully analyse whether their products will be affected by this law. If yes, their products and business operations must be promptly adjusted to reflect these developments.

By Marija Vlajkovic, Partner, and Marija Lukic, Senior Associate, Schoenherr
