Point 1: Cloudflare, the cloud protection service provider, pointed out that the AI search engine Perplexity used stealth crawlers to bypass robots.txt and WAF rules to scrape restricted webpage content.
Point 2: The crawler disguises itself as a Chrome browser and frequently changes IPs and ASNs, sending approximately 20 to 25 million requests daily to global websites.
Point 3: Cloudflare has revoked Perplexity's 'Verified Bot' status and added blocking rules to stop its stealth crawling.
On the 4th, Cloudflare, the cloud protection service provider, published its latest observations, revealing that after being blocked, Perplexity turned to an undeclared crawler disguised as an ordinary browser to bypass robots.txt and WAF restrictions, still successfully scraping content it had been barred from extracting. This behavior not only violates the internet consensus codified in RFC 9309 (the Robots Exclusion Protocol), but also undermines the basic trust websites place in legitimate crawlers.
In other words, when websites blocked it, Perplexity did not comply and stop; instead, it sent a crawler masquerading as a regular Chrome browser to scrape the data covertly.
Cloudflare pointed out that both the PerplexityBot and Perplexity-User crawlers ignore the 'no scraping' directives written in robots.txt, and even attempt to get around blocks imposed by the WAF (Web Application Firewall).
The result: content the websites never wanted scraped was scraped anyway, and this deliberate violation of internet convention (see RFC 9309) erodes websites' trust that compliant crawlers will identify themselves and honor restrictions.
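For reference, the directives in question follow the plain-text format defined by RFC 9309. A site that wants to bar Perplexity's declared crawlers from every path would publish a robots.txt group like the one below; the blanket Disallow is simply the most common form, and real sites may scope it to specific paths.

    User-agent: PerplexityBot
    User-agent: Perplexity-User
    Disallow: /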
In response to these allegations, Perplexity spokesperson Jesse Dwyer countered that the posts were merely a Cloudflare 'sales pitch' and denied that the crawlers belonged to the company. However, Perplexity was indeed embroiled in a plagiarism controversy last year, after media outlets such as Wired reported that it had quoted their articles in full without authorization, and CEO Aravind Srinivas faced questions over his vague definition of 'plagiarism'.
In recent months, Cloudflare has launched a pay-per-request feature for AI crawlers, letting publishers and websites set clear prices for data access; at the same time, it added new AI-bot blocking rules to its free 'Bot Fight Mode', so that sites can reject or throttle non-paying crawlers with one click and regain bargaining power over paid data access.
How did Cloudflare catch Perplexity?
Cloudflare first received reports from multiple customers that website content was still being scraped even though Perplexity's official crawlers had been blocked both in robots.txt ('no bot scraping') and at the firewall (WAF). After confirming that the customers' settings were correct, the company went a step further: it purchased several brand-new, unpublicized test domains and applied the same 'disallow all crawling' rules to them.
Cloudflare then asked Perplexity directly about the content of these test domains and received detailed answers, confirming that undeclared crawlers had gotten in. Finally, through comparative traffic analysis, Cloudflare found that these crawlers disguised themselves as regular browsers, sent approximately 20 million requests per day to websites worldwide, and frequently rotated their IPs and network identifiers (ASNs).
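To illustrate the kind of signal such traffic analysis can key on, the following is a minimal, hypothetical Python sketch; it is not Cloudflare's actual pipeline, and the log format, threshold, and function name are invented for illustration. The idea: a cluster of requests whose User-Agent claims to be Chrome, but whose source ASNs rotate far faster than any single user's connection would, is a candidate stealth crawler.

    # Hypothetical request records: (client_ip, asn, user_agent).
    # A real detector would also weigh TLS fingerprints, header order,
    # timing patterns, and many other signals.
    LOG = [
        ("203.0.113.5", 64500, "Mozilla/5.0 ... Chrome/126.0 Safari/537.36"),
        ("198.51.100.9", 64511, "Mozilla/5.0 ... Chrome/126.0 Safari/537.36"),
        ("192.0.2.44", 64522, "Mozilla/5.0 ... Chrome/126.0 Safari/537.36"),
    ]

    def looks_like_stealth_crawler(records, asn_threshold=3):
        """Flag traffic that claims to be Chrome yet hops across many ASNs."""
        chrome_asns = {asn for _ip, asn, ua in records if "Chrome" in ua}
        return len(chrome_asns) >= asn_threshold

    print(looks_like_stealth_crawler(LOG))  # True for this toy log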
Based on these three pieces of evidence, Cloudflare determined that Perplexity had violated the public norms for crawlers, promptly revoked its 'Verified Bot' status, and added the related traffic to its blocklist.
Perplexity removed from the 'Verified Bot' list
Cloudflare subsequently removed Perplexity from the 'Verified Bot' list and pushed new detection signatures to all plans, helping sites automatically block or challenge such stealth crawlers. The company also urged the industry to adhere to five principles of good crawling, such as transparency, moderation, and single purpose, and to work with the IETF on an extended version of robots.txt.
Cloudflare also held up OpenAI's ChatGPT-User as a positive industry example, noting that it strictly respects blocking directives and signs its identity with Web Bot Auth.
According to Cloudflare, compliant crawlers must meet the following five principles:
Transparency: Declare a dedicated User-Agent, publicly disclose IP ranges or use Web Bot Auth authentication, and provide contact information so sites can trace the crawler and reach its operator.
Good Netizenship: Do not flood sites with traffic, do not scrape sensitive data, and do not use stealth or disguise to evade detection.
Clear Purpose: Every crawler should state which service it supports, such as a voice assistant, price comparison, or accessibility support, so the site can decide whether to allow access.
Separate Bots for Separate Activities: Different functions should be handled by different crawlers, so sites are not forced into an all-or-nothing choice between fully open and fully closed.
Follow the Rules: Check and respect robots.txt, crawl at a reasonable rate, and never bypass the WAF or other security protections.
These five points capture websites' core requirements that a crawler be identifiable, manageable, and trustworthy, and they form the basis on which Cloudflare judges whether a crawler deserves 'Verified Bot' status.
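As a concrete illustration of the 'Transparency' and 'Follow the Rules' principles, here is a minimal sketch of a compliant fetch using only the Python standard library; the bot name, contact URL, and target site are placeholders rather than a real service.

    import time
    import urllib.robotparser
    import urllib.request

    # Hypothetical bot identity: a descriptive User-Agent with contact info.
    UA = "ExampleBot/1.0 (+https://example.com/bot-info)"

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some/page"
    if robots.can_fetch(UA, url):
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(1.0)  # keep the request rate modest
    # If robots.txt disallows the path, a compliant crawler simply moves on.

Checking robots.txt before every fetch and self-identifying in the User-Agent is precisely the behavior Cloudflare says its 'Verified Bot' program is meant to certify.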
This article is reprinted with permission from Digital Era.
Original title: Is Perplexity a data thief? Cloudflare reveals 'disguised Chrome crawler' sending over 20 million requests daily
Original author: Li Xiantai
'Is Perplexity a data thief? Exposed as a disguised crawler, sending over 20 million requests daily' was first published on Encrypted City.