Cloudflare has switched its block on AI crawlers from optional to default, allowing finer-grained crawling but only with the agreement of all parties concerned.
LLMs are what they learn. From their inception the biggest source of learning has been the internet, so there has been a natural tendency for AI developers to scrape the internet as widely as possible.
Cloudflare has now introduced an option for its customers to accept or reject website scraping by AI vendors. Hitherto, internet scraping has been a major source of training data for LLM (gen-AI) developers, but the process has raised questions and objections over legality, copyright infringement, and accuracy.
The legality of web scraping is complex, with subtle but important differences between the US and Europe. Apart from questions of copyright (which largely revolve around how the data is used), the collection itself is often excused by scrapers asserting a ‘legitimate interest’ to blunt accusations of privacy invasion.
The difficulty of reaching a clear ruling for or against internet scraping to train large AI (LLM) models can be seen in the recent history of Meta scraping Facebook and Instagram for its Llama training. On May 21, 2025, the Irish DPA declined to oppose Meta’s plans. On May 23, 2025, the Higher Regional Court of Cologne (Oberlandesgericht Köln) in Germany rejected an emergency injunction against Meta sought by consumer protection groups. On May 27, 2025, it was reported the German DPA had dropped its own case against Meta, saying: “Given the forthcoming EU-wide evaluation of Meta’s practices, an isolated urgency procedure for Germany is not a suitable path.”
This isn’t the end of the story, but it shows how difficult it is for legislators and regulators to get their ducks in a row for a formal decision. Meanwhile, many social media users object to their personal data being scraped, and many website owners object to their material, possibly copyrighted, being made available to LLM-using third parties who have never visited the websites.
This makes it an economic problem as well as a copyright infringement issue: if people query LLMs rather than visit websites, visitor numbers will fall, and advertisers will be less likely to buy advertising space.
Now Cloudflare has stepped in with a global plan to put the crawling choice in the hands of website owners and AI developers rather than globally disjointed bureaucratic decision-makers. “Cloudflare,” announced the company, “is now the first internet infrastructure provider to block AI-crawlers accessing content without permission or compensation, by default.”
A Cloudflare block isn’t new. In September 2024, the company gave its customers an option to block AI crawlers, and more than one million did so. That default has now been flipped: block is the default for all customers, and allowing crawlers is the opt-in. The result is a permission-based system in which AI companies must get the website owner’s permission to crawl, likely based on purpose: training, inference, or search. Many websites are content with ‘search’ but dubious about ‘training’.
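To make the idea of purpose-based, default-deny crawl permissions concrete, here is a minimal sketch in Python. It is illustrative only and not Cloudflare’s implementation; the user-agent-to-purpose mapping and the policy values are assumptions chosen for the example (though GPTBot, CCBot, ClaudeBot, OAI-SearchBot and PerplexityBot are real crawler user agents).

```python
# Illustrative sketch of a default-deny, permission-based crawl policy keyed on
# a crawler's declared purpose (training, inference, or search).
# Not Cloudflare's implementation; mappings and policy values are assumptions.

# Hypothetical mapping of known AI crawler user agents to their primary purpose.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
}

# A site owner's policy: anything not explicitly allowed is blocked.
SITE_POLICY = {
    "search": "allow",     # still discoverable via AI-powered search
    "training": "block",   # but content may not be used as training data
    "inference": "block",
}

def crawl_decision(user_agent: str) -> str:
    """Return 'allow' or 'block' for an incoming crawler request."""
    purpose = CRAWLER_PURPOSE.get(user_agent)
    if purpose is None:
        return "block"  # unknown crawlers are blocked by default
    return SITE_POLICY.get(purpose, "block")

if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot", "UnknownBot"):
        print(bot, "->", crawl_decision(bot))
```

The key design point the sketch illustrates is the inversion of the default: permission must be granted explicitly, and anything unrecognized or unnegotiated is refused.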
“If the Internet is going to survive the age of AI, we need to give publishers the control they deserve and build a new economic model that works for everyone – creators, consumers, tomorrow’s AI founders, and the future of the web itself,” explains Matthew Prince, co-founder and CEO at Cloudflare. “Original content is what makes the Internet one of the greatest inventions in the last century, and we have to come together to protect it.”
AI-crawlers, he continues, have been scraping content without limits. “Our goal is to put the power back in the hands of creators, while still helping AI companies innovate. This is about safeguarding the future of a free and vibrant Internet with a new model that works for everyone.”
The Cloudflare option works for website owners and their sites’ original content. It’s doubtful whether it would have much effect on the social media sites, since the major operators are also major LLM developers: LinkedIn (Microsoft’s own AI models and ultimately GPT), Facebook (Llama), X (Grok). However, social media is hardly known as a source of truth, and the quality of the LLM would rapidly deteriorate with too heavy a reliance on social media content – an LLM is grounded by what it uses for training data.
Grok provides several examples, but a major falsehood can be found in its May 2025 amplification of accusations of white genocide in South Africa (generally considered to be untrue). Such inaccuracies are almost certainly founded on X social media content.
Cloudflare’s move to give control over original content published by individual websites back to the publishers highlights two realities of modern technology: first, technology advances faster than lawmakers can respond; and second, technology is global while law is local. Cloudflare’s action can be seen as an example of global technology policing itself for the good of all.
Related: Cloudflare Tunnels Abused in New Malware Campaign
Related: Surge in Cyberattacks Targeting Journalists: Cloudflare
Related: ClickFix Attack Exploits Fake Cloudflare Turnstile to Deliver Malware
Related: New Cloudflare Service Provides Real-Time Threat Intelligence
