On August 4, 2025, Cloudflare published a detailed post accusing Perplexity AI of deploying stealth crawlers: bots that ignore robots.txt, don’t identify themselves, and mimic real browsers to evade detection.
Cloudflare claims it has collected traffic logs showing access to sites that had explicitly disallowed PerplexityBot via robots.txt. These bots reportedly used generic browser headers and IP rotation, techniques commonly associated with crawlers trying not to be seen. In response, a Perplexity spokesperson denied the claim, describing the allegations as misleading and commercially motivated, and Perplexity published its own write-up on the incident.
This post isn’t about one company or crawler. It’s a reminder that robots.txt is a voluntary convention, and in 2025, bots have little incentive to ask permission.
A Brief History of robots.txt
The robots.txt file was created in 1994 after a runaway bot took down a website. Developer Martijn Koster proposed a simple fix: let sites place a plain text file at their root (/robots.txt) to declare what bots shouldn't touch.
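For context, a robots.txt file is nothing more than plain directives served from the site root. A minimal sketch (the domain and paths are illustrative) that blocks one named crawler entirely and keeps everyone else out of a single directory:

```
# Served from https://example.com/robots.txt
# Block one named crawler from the whole site:
User-agent: PerplexityBot
Disallow: /

# Everyone else: stay out of /private/, the rest is open.
User-agent: *
Disallow: /private/
```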
It wasn’t a firewall. It wasn’t enforced. It was just... a sign.
A digital version of “Please don’t enter.”
Early search engines and crawlers adhered to it. Over time, it became a de facto standard of the web.
As scraping proved profitable — initially for search, then for SEO, and later for pricing and aggregation — the gap between bots that ask for data and those that take it began to grow.
This Isn’t the First Time
Cloudflare’s blog post is not unique. It’s the latest in a long pattern.
- In 2000, eBay sued Bidder’s Edge for scraping auction listings with automated bots, despite Bidder’s Edge’s compliance with robots.txt restrictions.
- In 2010, Facebook sued Pete Warden for crawling its public pages without obtaining prior written permission.
- In 2017, the Internet Archive warned that retroactive robots.txt restrictions were erasing public history.
- In 2017, Intoli analyzed over 1 million robots.txt files, finding vague and inconsistent usage.
- In 2019, Martijn Koster marked 25 years of the standard and questioned its relevance.
- In 2024, nDash explored its fragility in the age of AI — where web content isn't just indexed, it’s ingested.
Each example ends the same way: bots evolve, and robots.txt stays the same.
The Incentives Are Changing
What’s different now is what the bots want.
In the 2000s, they indexed pages for humans. In the 2010s, they scraped prices and text for SEO and competitive intelligence. In the 2020s, they harvest content to train AI models.
The volume is higher. The value is greater. The appetite is limitless.
And the polite rules of 1994 can’t stop them.
There is no audit trail. There is no digital signature confirming which crawler is which. No API key to identify access. No penalty for ignoring a disallow. Just a voluntary file in a public directory — as easy to ignore as it is to find.
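To make that asymmetry concrete, here is a minimal Python sketch, assuming an illustrative domain, path, and User-Agent strings: the polite crawler runs a permission check it chose to run; the stealth one simply skips it.

```python
# A sketch of why robots.txt is voluntary: honoring it is a check
# the client opts into, and skipping it costs nothing.
import urllib.robotparser
import urllib.request

SITE = "https://example.com"          # illustrative domain
PAGE = SITE + "/private/report.html"  # illustrative disallowed path

# Polite crawler: fetch robots.txt and obey its verdict.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()
print("Allowed for MyBot:", rp.can_fetch("MyBot", PAGE))

# Stealth crawler: same page, no check, browser-like headers.
# Nothing on the wire distinguishes this from a person clicking a link.
req = urllib.request.Request(
    PAGE,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# urllib.request.urlopen(req)  # fetches regardless of any Disallow line
```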
Bot Protection Services: Cloudflare Is Not a Neutral Actor
While Cloudflare’s evidence is compelling, it’s important to acknowledge context: they sell bot protection products. Calling out stealth bots reinforces their narrative: robots.txt isn’t enough, and their tools are the solution.
That doesn’t make them wrong. But it does highlight the need for a broader conversation that doesn’t start and end with vendor marketing.
Because the real question isn’t whether this one crawler played fair.
It’s: what happens when nobody does?
The Web Was Built on Trust
The robots.txt file was designed for an internet of goodwill, and goodwill alone. It has no enforcement. No validation. No incentive model beyond reputation.
But the modern internet runs on different fuel: competition, data extraction, and invisible automation.
Stealth crawling isn’t new. The difference in 2025 is that it's no longer fringe — it’s foundational. Entire business models now quietly depend on it.
And the longer we rely on a polite "Please don’t enter" sign to guard the front door, the more we guarantee it gets ignored.
What Comes Next?
If robots.txt is too weak to enforce no-crawl policies, what replaces it?
Some ideas being explored or proposed include:
- Authenticated Crawling: verified bots use cryptographic tokens or signed headers to prove identity, similar to mTLS or signed API requests (see the verification sketch after this list).
🔗 https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot
🔗 https://developers.cloudflare.com/bots/
- Machine-Readable Licensing: embed copyright and usage rules directly in page metadata or HTTP headers, inspired by CC licenses and W3C ODRL (a sample policy also follows below).
🔗 https://www.w3.org/TR/odrl-model/
🔗 https://wiki.creativecommons.org/wiki/CC_REL
- Crawler Registries: public directories of known, verified bots, optionally tied to reputation scores or crawl behaviors.
🔗 https://certificate.transparency.dev/
- Temporal Access Controls: define what bots can crawl now versus what can be archived or used later; a middle path between live privacy and historical preservation.
🔗 https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
- Legal Frameworks: treat large-scale scraping as subject to copyright, fair use, or digital trespass laws, especially in the context of AI training.
🔗 https://artificialintelligenceact.eu/
🔗 https://crfm.stanford.edu/fmti/May-2024/index.html
🔗 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5159436
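Authenticated crawling isn’t hypothetical; Google already documents a verification handshake for Googlebot. Here’s a minimal Python sketch of that forward-confirmed reverse DNS check (the sample IP comes from Google’s published Googlebot ranges; the function name is mine):

```python
# Forward-confirmed reverse DNS: the check Google documents for
# confirming that a visitor claiming to be Googlebot really is one.
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        # 1. Reverse lookup: IP -> hostname.
        host, _, _ = socket.gethostbyaddr(ip)
        # 2. The hostname must sit under a Google-owned domain.
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 3. Forward lookup must round-trip to the same IP.
        return socket.gethostbyname(host) == ip
    except OSError:  # no PTR record, or hostname doesn't resolve
        return False

print(is_verified_googlebot("66.249.66.1"))  # a crawl-*.googlebot.com IP
```

The same pattern generalizes: any crawler operator who controls reverse DNS for its IP ranges can be verified this way, which is roughly what a crawler registry would formalize.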
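And for machine-readable licensing, the W3C ODRL model already defines a JSON-LD serialization. A sketch of a policy (the URLs are placeholders; the actions are drawn from ODRL’s common vocabulary) that permits reading and indexing a page but prohibits creating derivatives from it:

```json
{
  "@context": "http://www.w3.org/ns/odrl.jsonld",
  "@type": "Set",
  "uid": "https://example.com/policy/42",
  "permission": [
    { "target": "https://example.com/articles/42", "action": "read" },
    { "target": "https://example.com/articles/42", "action": "index" }
  ],
  "prohibition": [
    { "target": "https://example.com/articles/42", "action": "derive" }
  ]
}
```

A policy like this could be linked from an HTTP header or embedded in page metadata. The open question, as with robots.txt, is whether anything compels a crawler to read it.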
Sources
- Cloudflare: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
- Perplexity: Agents or Bots? Making Sense of AI on the Open Web
- TechCrunch: Perplexity accused of scraping websites that explicitly blocked AI scraping
- Wired: EBay Fights Spiders on the Web
- Pete Warden: How I got sued by Facebook
- Martijn Koster: Robots.txt is 25 years old
- Intoli: Analyzing One Million Robots.txt Files
- Internet Archive: Robots.txt meant for search engines don’t work well for web archives
- nDash: The Role of Robots.txt in Modern AI and Web Governance
- robotstxt.org: About /robots.txt
Additional Reading
- Arvind Narayanan & Sayash Kapoor: https://www.aisnakeoil.com/t/ai-safety
- Cory Doctorow: https://doctorow.medium.com/https-pluralistic-net-2024-04-04-teach-me-how-to-shruggie-kagi-caaa88c221f2
- Stanford HAI: https://hai.stanford.edu/news/introducing-foundation-model-transparency-index