Perplexity Accused of Scraping Websites That Explicitly Blocked AI Crawlers, Cloudflare Says
Sources: https://techcrunch.com/2025/08/04/perplexity-accused-of-scraping-websites-that-explicitly-blocked-ai-scraping, techcrunch.com
TL;DR
- Cloudflare alleges Perplexity crawled and scraped websites that had explicit blocks against AI scraping.
- Perplexity allegedly masked its crawler by changing its user-agent string and the ASN its traffic came from, and impersonated a Chrome browser when blocked.
- The activity is described as occurring across tens of thousands of domains with millions of requests per day; Perplexity disputes that content was accessed.
- Cloudflare has de-listed Perplexity’s bots and launched a marketplace to charge AI scrapers, as well as a free tool to prevent bot scraping.
Context and background
Cloudflare accuses Perplexity, an AI startup, of crawling and scraping content from sites that had signaled they did not want their pages accessed by AI systems. Cloudflare’s research notes that the behavior persisted even after publishers added robots.txt rules and explicit blocks targeting Perplexity’s known bots. According to TechCrunch, Cloudflare observed Perplexity obscuring its identity while scraping and using techniques to bypass publishers’ stated preferences at scale. The dispute fits a broader industry pattern in which AI products rely on large volumes of data from the open web, often without express permission. The debate over data access for AI training has intensified as publishers increasingly try to defend their content through robots.txt and related controls, with mixed results to date. Perplexity has previously faced public questions about the originality of its content, including criticism from outlets such as Wired last year and questions about plagiarism during an interview at Disrupt 2024.
What’s new
Cloudflare published findings indicating that Perplexity not only crawled under its declared user-agent but also switched to a generic browser signature imitating Google Chrome on macOS when its crawler was blocked. The research states the activity was observed across tens of thousands of domains and involved millions of requests daily. In response, Cloudflare de-listed Perplexity’s bots from its verified list and added new techniques to block them. A Perplexity spokesperson characterized the blog post as a “sales pitch” and claimed in a follow-up message that the screenshots did not show content being accessed and that the bot named in the post isn’t theirs. TechCrunch notes that Cloudflare has recently taken public steps against AI crawlers, including launching a marketplace where website owners can charge AI scrapers and offering a free tool to help block scraping bots.
Why it matters (impact for developers/enterprises)
For websites and publishers, the interplay between AI data needs and content protection is increasingly critical. Cloudflare’s actions reflect a broader trend where operators seek to monetize or restrict access by AI entities that harvest data for training or product development. The use of robots.txt and other access controls has proven imperfect in stopping sophisticated scrapers, prompting platforms and service providers to develop new defenses. For developers and enterprises building or deploying AI that relies on web data, the case highlights ongoing tensions around data provenance, content licensing, and the reliability of public data sources.
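To illustrate why robots.txt alone is an imperfect control, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the agent token `PerplexityBot`, the paths, and the browser string are illustrative assumptions, not details taken from the Cloudflare research. The sketch shows that a rule matches only the name a client chooses to declare, so a request that presents a generic browser string falls under the default rules instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the agent token and paths are
# assumptions, not rules quoted from any real site.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Illustrative generic browser string (Chrome on macOS style).
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    # The declared crawler name is disallowed everywhere.
    print(is_allowed("PerplexityBot", page))
    # The same page is permitted for a client presenting a browser string,
    # because only the wildcard rules apply to it.
    print(is_allowed(CHROME_UA, page))
```

robots.txt is advisory: it tells a cooperative crawler what the publisher wants, but nothing in the protocol enforces it, which is why the article describes additional network-level defenses.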
Technical details or Implementation
The core claims center on techniques used to evade site-level protections:
- User-agent switching: Perplexity reportedly crawled under its declared user-agent but switched to a generic browser signature (Chrome on macOS) when that crawler was blocked.
- ASN changes: The crawler’s traffic reportedly came from different autonomous system numbers (ASNs), obscuring its origin and making network-level blocking harder.
- Behavior across many domains: Cloudflare says the activity spanned tens of thousands of domains and involved millions of requests per day, indicating a broad, automated operation.
- Fingerprinting methods: Cloudflare states it could fingerprint the crawler using a combination of machine learning and network signals to identify Perplexity’s activity even when standard blocks were in place.
- Publisher responses: Some sites reportedly implemented robots.txt rules or blocked known Perplexity bots, yet the activity continued, according to Cloudflare.

| Signal | Description |
|--- |--- |
| Declared user-agent | Perplexity’s own crawler identity was used, with instances of impersonating a generic browser when blocked. |
| ASN changes | Network origin was altered to mask identity and complicate blocking efforts. |

Implementation notes: Cloudflare says it observed the behavior through public-facing web traffic and tested blocking rules to confirm circumvention. The company also stated it has adjusted its systems to block Perplexity’s evasive techniques and removed Perplexity’s bots from its verified list.
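The two signals in the table can be sketched as a simple request classifier. This is a hypothetical illustration, not Cloudflare's method: the agent token, the `Request` type, the category names, and the address range are all assumptions, and a published IP range stands in for an ASN lookup, which would need external routing data. It shows why a rule keyed only on declared identity and known network origin misses a client that changes both.

```python
import ipaddress
from dataclasses import dataclass

# Assumed values for illustration only. 192.0.2.0/24 is a reserved
# documentation range, not a real crawler network.
DECLARED_AGENT_TOKENS = ("perplexitybot",)
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


@dataclass(frozen=True)
class Request:
    user_agent: str
    source_ip: str


def classify(req: Request) -> str:
    """Label a request using only its declared user-agent and source address."""
    ua = req.user_agent.lower()
    declares = any(token in ua for token in DECLARED_AGENT_TOKENS)
    ip = ipaddress.ip_address(req.source_ip)
    in_range = any(ip in network for network in PUBLISHED_RANGES)

    if declares and in_range:
        return "declared-crawler"             # blockable by name or by range
    if declares:
        return "declared-from-unknown-range"  # name claimed, origin unexpected
    if in_range:
        return "undeclared-from-known-range"  # known origin, generic name
    return "unattributed"                     # neither signal identifies it


if __name__ == "__main__":
    browser_like = Request(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0",
        "198.51.100.7",
    )
    print(classify(browser_like))
```

A request that presents a browser string from an unlisted network ends up in the last bucket, where these two signals say nothing about it. That gap is consistent with Cloudflare's statement that it combined machine learning with network signals to fingerprint the crawler instead of relying on the declared identity.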
Key takeaways
- Some AI data-collection practices remain opaque and contested, raising questions about data provenance and licensing.
- Even when publishers signal disallowance via robots.txt or similar controls, determined crawlers may attempt to evade blocks.
- Cloudflare’s stance reflects a growing emphasis on protecting publisher content and monetizing AI scraping activity.
- Perplexity denies that content was accessed and disputes the blog post’s representation of its bot.
- The tech industry continues to explore tools and marketplaces aimed at charging or blocking AI scrapers to address business model disruption for publishers.
FAQ
- What is the core allegation against Perplexity?
  Cloudflare alleges Perplexity crawled and scraped sites that explicitly blocked AI scraping, using evasion techniques such as changing user-agent signals and ASN to hide its identity.
- How did Perplexity allegedly evade blocks?
  By altering its crawler identity and sometimes impersonating a generic browser (Chrome on macOS) when its declared crawler was blocked, according to Cloudflare.
- What has Cloudflare done in response?
  Cloudflare says it de-listed Perplexity’s bots from its verified list and added new techniques to block them; Cloudflare has also promoted a marketplace for charging AI scrapers and released a free bot-blocking tool.
- How has Perplexity responded?
  A Perplexity spokesperson dismissed the blog post as a sales pitch and claimed the screenshots did not show content access; the company also said the bot named in the post is not theirs.
- Are there broader implications for publishers?
  Yes, publishers are increasingly using robots.txt and other mechanisms to defend content, while AI providers seek access to data for training, fueling ongoing debates about data rights and monetization.
References
- TechCrunch article: Perplexity accused of scraping websites that explicitly blocked AI scraping