
How to Navigate AI, Legal, and Web Scraping: Asking a Professional

Roberta Aukstikalnyte

2025-01-07 · 6 min read

In recent years, there has been a lot of debate surrounding AI and legal (or ethical) concerns. With web scraping being one of the main ways to acquire data for, say, AI model training, it became a big part of the conversation, too. Is it legal to use scraped data to train AI models? Does scraping publicly available data violate privacy laws like GDPR or CCPA?

We've sat down with Viktorija Lapėnytė, Head of Product Legal Counsel at Oxylabs, who answered some of the most common questions asked in the context of web scraping, AI, and legal. By the end of this interview, you'll understand the key legal risks of web scraping, how they relate to AI, and practical ways to navigate legal regulations while adopting ethical, compliant practices.

Web scraping operates within a dynamic and ever-evolving legal framework that varies across jurisdictions. With no universal law regulating the practice, its legality depends on multiple factors: the nature of the data being scraped, the methods used, and the intended use of the extracted data.

If we’re talking about more specific laws and regulations, the most common ones would be related to:

  • Copyright law,

  • Violations of contractual terms, such as website terms of service,

  • Privacy regulations (the GDPR in the EU, the CCPA in California).

In the U.S., unauthorized access laws, such as the Computer Fraud and Abuse Act (CFAA), may also play a role in determining whether web scraping activities cross legal boundaries.

The swift rise of AI comes with new legal and ethical challenges, particularly in web scraping, as scraped data often trains AI models. In response, countries are introducing AI regulations to address issues like data privacy and protection, bias and discrimination, safety, intellectual property and others. The EU’s AI Act, now in force and set to become fully effective by 2026, is one of the most comprehensive frameworks to date. Around the world, governments are following suit, reflecting AI’s growing impact and the need for oversight. 

There are a few key risks that are most common and should be carefully considered when implementing web scraping strategies. 

1) Copyright infringement

One of the most prevalent concerns is copyright infringement. It arises when companies scrape and reuse protected content, such as articles, images, or, in some cases, databases. In contrast, scraping factual data, such as product prices or technical specifications, typically carries a lower legal risk.

2) Breach of terms of service

Terms of service violations present significant challenges as well. Generally, courts are more inclined to enforce agreements where users have actively consented to the terms – for example, by checking a box or clicking “Agree.” These are known as "clickwrap" agreements.

In contrast, "browsewrap" agreements, which are typically passive terms located at the bottom of a website, are less likely to be enforced. Therefore, limiting web scraping activities to publicly available data only can be a very effective strategy for mitigating this risk.

3) Privacy violations 

Another common and tricky challenge is privacy violations, which arise whenever personal data (such as names, contact details, or other identifying information) is involved.

Imagine scraping a publicly available contacts database, thinking it’s fair game, only to discover that under GDPR, even publicly available personal information is still protected. This means businesses need to be careful to avoid violations. It’s important that you can justify why you’re collecting and processing personal data (if any). You also have to establish a clear lawful basis, implement all necessary safeguards, and, even then, minimize the amount of data collected while ensuring it is handled securely.

Meanwhile, CCPA in California works differently: it does not apply to personal data that individuals have made public themselves. As a result, scraping publicly available personal data under CCPA carries fewer risks. 

4) Ethical concerns

Ethical concerns should also not be overlooked. Companies engaging in web scraping should act as responsible citizens of the web, ensuring their actions neither strain nor disrupt the targeted website. Overloading servers and extracting excessive amounts of data not only crosses ethical lines but can also significantly harm the company’s reputation.

Data is power, and with great power comes great responsibility. Ethical web scraping isn’t just about adhering to the rules; it’s about respecting the ecosystem you’re benefiting from. This involves sticking to website usage limits, choosing ethical and responsible use cases, and designing systems that prioritize fairness and accountability.
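In practice, the points above – sticking to usage limits and not straining the target site – usually come down to honoring robots.txt and throttling request rates. Below is a minimal, hypothetical sketch of such a “polite scraper” helper; the class name, the example robots.txt rules, and the example.com URLs are all illustrative assumptions, not part of any specific product:

```python
import time
from urllib.robotparser import RobotFileParser

class PoliteScraper:
    """Illustrative helper that honors robots.txt rules and enforces
    a minimum delay between requests to avoid straining a site."""

    def __init__(self, robots_txt: str, user_agent: str = "example-bot",
                 min_delay: float = 1.0):
        # Parse robots.txt rules (here passed in as a string for clarity;
        # in real use you would fetch it from the target site).
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._last_request = 0.0

    def allowed(self, url: str) -> bool:
        # Check whether robots.txt permits this agent to fetch the URL.
        return self.parser.can_fetch(self.user_agent, url)

    def wait(self) -> None:
        # Sleep just long enough to keep at least min_delay seconds
        # between consecutive requests.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()

# Hypothetical robots.txt that disallows the /private/ path for all agents.
robots = "User-agent: *\nDisallow: /private/\n"
scraper = PoliteScraper(robots, min_delay=2.0)

print(scraper.allowed("https://example.com/products"))    # True
print(scraper.allowed("https://example.com/private/x"))   # False
```

Before each request, the caller would invoke `scraper.wait()` and check `scraper.allowed(url)` first. This is only a sketch of the throttling idea; production crawlers typically also back off on error responses and respect any Crawl-delay directive the site publishes.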

The use of scraped data for training AI models has certainly sparked a wave of legal uncertainty. At the heart of the debate lies the question of whether using copyrighted material in AI training qualifies as “fair use” under U.S. copyright law. Courts consider several questions in making this determination, such as:

  • Is the use transformative?

  • What is the quantity of data used?

  • What is the effect of the use on the market for the original work?

  • … and more. 

Several high-profile lawsuits are currently addressing the complex intersection of AI and copyright law. The Authors Guild has sued OpenAI, alleging that ChatGPT was trained on copyrighted works without authorization. In the meantime, Getty Images is pursuing legal action against Stability AI for allegedly using its images without permission to train the Stable Diffusion model. 

Similarly, three artists have filed a class-action lawsuit against Stability AI, MidJourney, and DeviantArt, claiming their AI tools used billions of images scraped from the internet without consent. The New York Times has also sued OpenAI and Microsoft, accusing them of training AI models on its copyrighted articles without approval.

These cases highlight the legal uncertainty surrounding AI training and copyright, with copyright holders accusing AI developers of misusing protected content like images, text, and code to train their models without proper authorization or compensation. Meanwhile, proponents of AI argue that training models constitutes fair use, as AI systems analyze and learn from data rather than directly replicating it.  

They emphasize that access to vast amounts of data is crucial for the development of advanced AI models and the advancement of technology and innovation. Until courts or legislators establish clearer guidelines, the tension between promoting innovation and protecting intellectual property rights will remain one of the most contentious and closely watched issues in the AI field.

The best way to manage legal risks when scraping data for AI is to take a proactive, legally informed, and ethical approach.

My advice is to first clearly identify the type of data you’re scraping – whether it’s publicly available, personal, or copyrighted – and understand the specific laws that apply.

Additionally, you should clearly define the purpose and scale of your scraping activities. For instance, while publicly available data may seem safe to use, privacy laws like GDPR can still impose strict requirements. Similarly, using copyrighted material for AI training demands careful evaluation of fair use or licensing obligations, as legal standards in this area are still evolving.

5. As a scraping company, can you help your clients follow the rules?

I’d say it requires an active effort from both the company and the client. 

Web scraping companies play an important role by providing high-quality products designed with compliance in mind, including ethically sourced proxies. 

However, it’s also crucial for clients to be able to make informed decisions about their own specific use case – understanding what they intend to scrape and why. This involves having an understanding of the legal regulations that apply to their actions. By fostering open communication and a commitment to compliance, both parties can minimize risks and maximize the value of scraped data in a lawful and sustainable way.

6. What are your future predictions for the AI/web scraping/legal landscape?

With the rapid evolution of AI and its reliance on vast amounts of data, the legal landscape is struggling to keep pace. Questions surrounding the legality and ethics of scraping data are causing intense debates in courtrooms and legislatures worldwide. Many of these unresolved issues, such as AI lawsuits over copyright infringement, are bound to be addressed soon, and I believe we'll have more answers and see greater clarity in the legal landscape in the near future.

The road ahead may be complex, but it also offers opportunities for businesses to lead responsibly. Companies that prioritize compliance, transparency, and ethical practices now will be better prepared to thrive as the rules evolve. As courts and lawmakers address the current ambiguities, this transformative period will ultimately define how we balance innovation with legal and ethical boundaries in the AI-driven world.

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
