
Cloudflare Puts a Default Block on AI Web Scraping

The move could reshape how LLM developers gather information — and force new deals between creators and AI companies.


Cloudflare has flipped its block on AI crawling from optional to default, allowing finer-grained crawling but only with the agreement of all parties concerned.

LLMs are what they learn. From their inception the biggest source of learning has been the internet, so there has been a natural tendency for AI developers to scrape the internet as widely as possible.

Cloudflare has now introduced an option for its customers to accept or reject website scraping by AI vendors. Hitherto, internet scraping has been a major part of gathering training data for large gen-AI (LLM) developers; but the process has raised questions and objections over legality, copyright infringement, and accuracy.
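For context, the longstanding opt-out mechanism has been robots.txt, which compliant AI crawlers such as OpenAI's GPTBot say they honor. A minimal example that refuses a training crawler while leaving ordinary search indexing alone (the user-agent tokens shown are publicly documented; the file is illustrative, not a complete policy):

```
# robots.txt -- block OpenAI's training crawler from the whole site
User-agent: GPTBot
Disallow: /

# Allow traditional search indexing
User-agent: Googlebot
Allow: /
```

The weakness of this approach, and part of what motivates infrastructure-level blocking, is that robots.txt is purely advisory: a crawler that ignores it faces no technical obstacle.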

The legality of web scraping is complex, with subtle but important differences between the US and Europe. Apart from questions of copyright (which largely revolve around data usage), the collection itself is often excused by scrapers asserting ‘a legitimate interest’ to blunt accusations of privacy invasion.

The difficulty of finding common ground for a clear ruling for or against internet scraping to feed large LLMs can be seen in the recent history of Meta scraping Facebook and Instagram for its Llama training. On May 21, 2025, the Irish DPA declined to oppose Meta’s plans. On May 23, 2025, the Higher Regional Court of Cologne (Oberlandesgericht Köln) in Germany rejected an emergency injunction against Meta sought by consumer protection groups. On May 27, 2025, it was reported that the German DPA had dropped its own case against Meta, saying: “Given the forthcoming EU-wide evaluation of Meta’s practices, an isolated urgency procedure for Germany is not a suitable path.”


This isn’t the end of the story, but it shows how difficult it is for bureaucratic legislators to get their ducks in a row for a formal decision. Meanwhile, many social media users object to their personal data being scraped, and many website owners object to their material, possibly copyrighted, being made available to LLM-using third parties who have never visited the websites.

This makes it an economic problem as well as a copyright infringement issue: if people use LLMs rather than visiting websites, visitor numbers will fall, and advertisers will be less likely to buy advertising space.


Now Cloudflare has stepped in with a global plan to put the crawling choice in the hands of website owners and AI developers rather than globally disjointed bureaucratic decision-makers. “Cloudflare,” announced the company, “is now the first internet infrastructure provider to block AI-crawlers accessing content without permission or compensation, by default.”

A Cloudflare block isn’t new. In September 2024, it gave its customers an option to block AI-crawlers, and more than one million did so. This is now reversed – the default for all customers is block, and the option is allow. The result is a permission-based system, where AI companies must get website owners’ permission to crawl, likely based on the declared purpose: training, inference, or search. Many websites are content with ‘search’, but dubious about ‘training’.
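The purpose-based logic the article describes can be sketched in a few lines. This is a minimal illustration, not Cloudflare's implementation: the user-agent strings are real published crawler names, but the purpose mapping and policy shape are assumptions for the sake of the example.

```python
# Minimal sketch of a default-block, permission-based crawler gate.
# CRAWLER_PURPOSE and the site_policy dict are illustrative assumptions;
# the user-agent tokens themselves are publicly documented crawler names.

# Illustrative mapping: crawler user-agent token -> declared purpose
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
}

def is_allowed(user_agent: str, site_policy: dict) -> bool:
    """Return True only if the crawler's declared purpose is explicitly
    permitted by the site owner.

    Unknown crawlers and unlisted purposes default to blocked -- the
    'default block' behaviour described above.
    """
    for bot, purpose in CRAWLER_PURPOSE.items():
        if bot in user_agent:
            return site_policy.get(purpose, False)
    return False  # unrecognised AI crawler: blocked by default

# A site content with 'search' but dubious about 'training':
policy = {"search": True, "training": False}
```

With this policy, a search crawler is admitted while a training crawler receives a block, matching the split the article describes between sites happy with ‘search’ but wary of ‘training’.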

“If the Internet is going to survive the age of AI, we need to give publishers the control they deserve and build a new economic model that works for everyone – creators, consumers, tomorrow’s AI founders, and the future of the web itself,” explains Matthew Prince, co-founder and CEO at Cloudflare. “Original content is what makes the Internet one of the greatest inventions in the last century, and we have to come together to protect it.” 

AI-crawlers, he continues, have been scraping content without limits. “Our goal is to put the power back in the hands of creators, while still helping AI companies innovate. This is about safeguarding the future of a free and vibrant Internet with a new model that works for everyone.” 

The Cloudflare option works for website owners and their sites’ original content. It’s doubtful whether it would have much effect on the social media sites, since the major operators are also major LLM developers: LinkedIn (Microsoft’s own AI models and ultimately GPT), Facebook (Llama), X (Grok). However, social media is hardly known as a source of truth, and the quality of the LLM would rapidly deteriorate with too heavy a reliance on social media content – an LLM is grounded by what it uses for training data.

Grok provides several examples of this deterioration; a major falsehood was its May 2025 amplification of accusations of white genocide in South Africa (generally considered to be untrue). Such inaccuracies are almost certainly founded on X social media content.

Cloudflare’s move to give control over original content published by individual websites back to the publishers highlights two facts about modern technology: first, technology advances faster than lawmakers can respond; and second, technology is global while law is local. Cloudflare’s action can be seen as an example of global technology policing itself for the good of all.

Related: Cloudflare Tunnels Abused in New Malware Campaign

Related: Surge in Cyberattacks Targeting Journalists: Cloudflare

Related: ClickFix Attack Exploits Fake Cloudflare Turnstile to Deliver Malware

Related: New Cloudflare Service Provides Real-Time Threat Intelligence

Written By

Kevin Townsend is a Senior Contributor at SecurityWeek. He has been writing about high-tech issues since before the birth of Microsoft. For the last 15 years he has specialized in information security, and has had many thousands of articles published in dozens of different magazines – from The Times and the Financial Times to current and long-gone computer magazines.
