Published in AI

Cloudflare stops AI scraping web content

04 July 2024

Using AI to do it

Cloudflare introduced a new feature in its content delivery network (CDN) that stops AI developers from scraping web content.

According to Cloudflare, the feature is available to both free and paid users and, ironically, uses AI to detect automated content extraction attempts.

Cloudflare says its software can identify bots that scrape content for LLM training projects, even when they try to avoid detection.

"Sadly, we've observed bot operators attempt to appear as though they are real browsers by using a spoofed user agent," Cloudflare engineers wrote in a blog post today. "We've monitored this activity over time and are proud to say that our global machine-learning model has consistently recognised this activity as a bot."

One of the crawlers that Cloudflare managed to detect is a bot that collects content for Perplexity AI, a well-funded search engine startup.

Last month, Wired reported that the way the bot scrapes websites makes its requests look like regular user traffic. As a result, website operators have struggled to block Perplexity AI from using their content.

Cloudflare's feature assigns every website visit that its platform processes a score from 1 to 99. The lower the number, the more likely it is that a bot generated the request.

According to the company blog, requests made by the bot that collects content for Perplexity AI consistently receive a score under 30. Website operators can use that score to identify and block such bots, protecting their content.
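To make the scoring idea concrete, here is a minimal sketch of how a site operator might act on such a score. It is not Cloudflare's code or API: the `Request` class, the `should_block` function and the `BOT_SCORE_THRESHOLD` constant are hypothetical names made up for illustration, and the threshold of 30 is simply borrowed from the figure quoted above.

```python
from dataclasses import dataclass

# Assumption: 30 is taken from the "score under 30" figure in the article,
# not from any documented Cloudflare default.
BOT_SCORE_THRESHOLD = 30


@dataclass
class Request:
    """Hypothetical request record; the field names are illustrative only."""
    path: str
    user_agent: str
    bot_score: int  # 1 to 99, lower means more likely to be automated


def should_block(request: Request, threshold: int = BOT_SCORE_THRESHOLD) -> bool:
    """Return True when the score falls below the blocking threshold."""
    if not 1 <= request.bot_score <= 99:
        raise ValueError(f"bot_score out of range: {request.bot_score}")
    # The user agent is deliberately ignored: it can be spoofed,
    # so the decision rests on the score alone.
    return request.bot_score < threshold


if __name__ == "__main__":
    visits = [
        Request("/article", "Mozilla/5.0 (spoofed browser string)", 12),
        Request("/article", "Mozilla/5.0 (ordinary visitor)", 85),
    ]
    for visit in visits:
        action = "block" if should_block(visit) else "allow"
        print(f"{visit.path} score={visit.bot_score} -> {action}")
```

The sketch keys only on the score because, as the engineers note, a browser-like user agent string is easy to fake.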

"When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint," Cloudflare's engineers wrote. "For every fingerprint we see, we use Cloudflare's network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint."

Cloudflare plans to update the feature over time to keep up with changes in AI scraping bots' technical fingerprints and the emergence of new crawlers.

As part of the initiative, the company said it is rolling out a tool that will enable website operators to report any new bots they may encounter.

Last modified on 04 July 2024