Cloudflare has switched its block on AI crawlers from optional to default, allowing finer-grained crawling but only with the agreement of all parties concerned.
LLMs are what they learn. From their inception the biggest source of learning has been the internet, so there has been a natural tendency for AI developers to scrape the internet as widely as possible.
Cloudflare has now introduced an option for its customers to accept or reject website scraping by AI vendors. Hitherto, internet scraping has been a major source of training data for LLM (gen-AI) developers, but the process has raised questions and objections over legality, copyright infringement, and accuracy.
The legality of web scraping is complex, with subtle but important differences between the US and Europe. Apart from questions of copyright (which largely revolve around how the data is used), the collection itself is often excused by scrapers asserting a ‘legitimate interest’ to blunt accusations of privacy invasion.
The difficulty of reaching a clear ruling for or against internet scraping to train large AI (LLM) models can be seen in the recent history of Meta scraping Facebook and Instagram for its Llama training. On May 21, 2025, the Irish DPA declined to oppose Meta’s plans. On May 23, 2025, the Higher Regional Court of Cologne (Oberlandesgericht Köln) in Germany rejected an emergency injunction against Meta sought by consumer protection groups. On May 27, 2025, it was reported the German DPA had dropped its own case against Meta, saying: “Given the forthcoming EU-wide evaluation of Meta’s practices, an isolated urgency procedure for Germany is not a suitable path.”
This isn’t the end of the story, but it shows how difficult it is for legislators and regulators to get their ducks in a row for a formal decision. Meanwhile, many social media users object to their personal data being scraped, and many website owners object to their material, possibly copyrighted, being made available to LLM-using third parties who have never visited the websites.
This makes it an economic problem as well as a copyright infringement issue: if people query LLMs rather than visit websites, visitor numbers will fall, and advertisers will be less likely to buy advertising space.
Now Cloudflare has stepped in with a global plan to put the crawling choice in the hands of website owners and AI developers rather than globally disjointed bureaucratic decision-makers. “Cloudflare,” announced the company, “is now the first internet infrastructure provider to block AI-crawlers accessing content without permission or compensation, by default.”
A Cloudflare block isn’t new. In September 2024, the company gave its customers an option to block AI crawlers, and more than one million did so. That default has now been flipped: block is the default for all customers, and allowing crawlers is the opt-in. The result is a permission-based system in which AI companies must get the website owner’s permission to crawl, likely based on purpose: training, inference, or search. Many websites are content with ‘search’ but dubious about ‘training’.
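To make the idea of purpose-based, default-deny crawl permissions concrete, here is a minimal sketch in Python. It is illustrative only and not Cloudflare’s implementation; the user-agent-to-purpose mapping and the policy values are assumptions chosen for the example (though GPTBot, CCBot, ClaudeBot, OAI-SearchBot and PerplexityBot are real crawler user agents).

```python
# Illustrative sketch of a default-deny, permission-based crawl policy keyed on
# a crawler's declared purpose (training, inference, or search).
# Not Cloudflare's implementation; mappings and policy values are assumptions.

# Hypothetical mapping of known AI crawler user agents to their primary purpose.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
}

# A site owner's policy: anything not explicitly allowed is blocked.
SITE_POLICY = {
    "search": "allow",     # still discoverable via AI-powered search
    "training": "block",   # but content may not be used as training data
    "inference": "block",
}

def crawl_decision(user_agent: str) -> str:
    """Return 'allow' or 'block' for an incoming crawler request."""
    purpose = CRAWLER_PURPOSE.get(user_agent)
    if purpose is None:
        return "block"  # unknown crawlers are blocked by default
    return SITE_POLICY.get(purpose, "block")

if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot", "UnknownBot"):
        print(bot, "->", crawl_decision(bot))
```

The key design point the sketch illustrates is the inversion of the default: permission must be granted explicitly, and anything unrecognized or unnegotiated is refused.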
“If the Internet is going to survive the age of AI, we need to give publishers the control they deserve and build a new economic model that works for everyone – creators, consumers, tomorrow’s AI founders, and the future of the web itself,” explains Matthew Prince, co-founder and CEO at Cloudflare. “Original content is what makes the Internet one of the greatest inventions in the last century, and we have to come together to protect it.”
AI-crawlers, he continues, have been scraping content without limits. “Our goal is to put the power back in the hands of creators, while still helping AI companies innovate. This is about safeguarding the future of a free and vibrant Internet with a new model that works for everyone.”
The Cloudflare option works for website owners and their sites’ original content. It’s doubtful whether it would have much effect on the social media sites, since the major operators are also major LLM developers: LinkedIn (Microsoft’s own AI models and ultimately GPT), Facebook (Llama), X (Grok). However, social media is hardly known as a source of truth, and the quality of the LLM would rapidly deteriorate with too heavy a reliance on social media content – an LLM is grounded by what it uses for training data.
Grok provides several examples, but a major falsehood can be found in its May 2025 amplification of accusations of white genocide in South Africa (generally considered to be untrue). Such inaccuracies are almost certainly founded on X social media content.
Cloudflare’s move to give control over original content published by individual websites back to the publishers highlights two realities of modern technology: first, technology advances faster than lawmakers can respond; and second, technology is global while law is local. Cloudflare’s action can be seen as an example of global technology policing itself for the good of all.
Related: Cloudflare Tunnels Abused in New Malware Campaign
Related: Surge in Cyberattacks Targeting Journalists: Cloudflare
Related: ClickFix Attack Exploits Fake Cloudflare Turnstile to Deliver Malware
Related: New Cloudflare Service Provides Real-Time Threat Intelligence
