On August 4, 2025, Cloudflare published a detailed post accusing Perplexity AI of deploying stealth crawlers: bots that ignore robots.txt, don’t identify themselves, and mimic real browsers to evade detection.
Cloudflare claims it has collected traffic logs showing access to sites that had explicitly disallowed PerplexityBot via robots.txt. These bots reportedly used generic browser headers and IP rotation, techniques commonly associated with crawlers trying not to be seen. In response, a Perplexity spokesperson denied the claim, describing the allegations as misleading and commercially motivated, and Perplexity published its own write-up on the incident.
This post isn’t about one company or crawler. It’s a reminder that robots.txt is a voluntary convention, and in 2025, bots have little incentive to ask permission.
A Brief History of robots.txt
The robots.txt file was created in 1994 after a runaway bot took down a website. Developer Martijn Koster proposed a simple fix: let sites place a plain text file at their root (/robots.txt) to declare what bots shouldn't touch.
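For context, a robots.txt file is nothing more than plain directives served from the site root. A minimal sketch (the domain and paths are illustrative) that blocks one named crawler entirely and keeps everyone else out of a single directory:

```
# Served from https://example.com/robots.txt
# Block one named crawler from the whole site:
User-agent: PerplexityBot
Disallow: /

# Everyone else: stay out of /private/, the rest is open.
User-agent: *
Disallow: /private/
```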
It wasn’t a firewall. It wasn’t enforced. It was just... a sign.
A digital version of “Please don’t enter.”
Early search engines and crawlers adhered to it. Over time, it became a de facto standard of the web.
As scraping proved profitable — initially for search, then for SEO, and later for pricing and aggregation — the gap between bots that ask for data and those that take it began to grow.
This Isn’t the First Time
Cloudflare’s blog post is not unique. It’s the latest in a long pattern.
- In 2000, eBay sued Bidder’s Edge for scraping auction listings with automated bots, despite Bidder’s Edge’s compliance with robots.txt restrictions.
- In 2010, Facebook sued Pete Warden for crawling its public pages without obtaining prior written permission.
- In 2017, the Internet Archive warned that retroactive robots.txt restrictions were erasing public history.
- In 2017, Intoli analyzed over 1 million robots.txt files, finding vague and inconsistent usage.
- In 2019, Martijn Koster marked 25 years of the standard and questioned its relevance.
- In 2024, nDash explored its fragility in the age of AI — where web content isn't just indexed, it’s ingested.
Each example ends the same way: bots evolve, and robots.txt stays the same.
The Incentives Are Changing
What’s different now is what the bots want.
In the 2000s, they indexed pages for humans. In the 2010s, they scraped prices and text for SEO and competitive intelligence. In the 2020s, they harvest content to train AI models.
The volume is higher. The value is greater. The appetite is limitless.
And the polite rules of 1994 can’t stop them.
There is no audit trail. There is no digital signature confirming which crawler is which. No API key to identify access. No penalty for ignoring a disallow. Just a voluntary file in a public directory — as easy to ignore as it is to find.
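To make that asymmetry concrete, here is a minimal Python sketch, assuming an illustrative domain, path, and User-Agent strings: the polite crawler runs a permission check it chose to run; the stealth one simply skips it.

```python
# A sketch of why robots.txt is voluntary: honoring it is a check
# the client opts into, and skipping it costs nothing.
import urllib.robotparser
import urllib.request

SITE = "https://example.com"          # illustrative domain
PAGE = SITE + "/private/report.html"  # illustrative disallowed path

# Polite crawler: fetch robots.txt and obey its verdict.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()
print("Allowed for MyBot:", rp.can_fetch("MyBot", PAGE))

# Stealth crawler: same page, no check, browser-like headers.
# Nothing on the wire distinguishes this from a person clicking a link.
req = urllib.request.Request(
    PAGE,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# urllib.request.urlopen(req)  # fetches regardless of any Disallow line
```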
Bot Protection Services: Cloudflare Is Not a Neutral Actor
While Cloudflare’s evidence is compelling, it’s important to acknowledge context: they sell bot protection products. Calling out stealth bots reinforces their narrative: robots.txt isn’t enough, and their tools are the solution.
That doesn’t make them wrong. But it does highlight the need for a broader conversation that doesn’t start and end with vendor marketing.
Because the real question isn’t whether this one crawler played fair.
It’s: what happens when nobody does?
The Web Was Built on Trust
The robots.txt file was designed for an internet of goodwill, and goodwill alone. It has no enforcement. No validation. No incentive model beyond reputation.
But the modern internet runs on different fuel: competition, data extraction, and invisible automation.
Stealth crawling isn’t new. The difference in 2025 is that it's no longer fringe — it’s foundational. Entire business models now quietly depend on it.
And the longer we rely on a polite "Please don’t enter" sign to guard the front door, the more we guarantee it gets ignored.
What Comes Next?
If robots.txt is too weak to enforce no-crawl policies, what replaces it?
Some ideas being explored or proposed include:
- Authenticated Crawling: verified bots use cryptographic tokens or signed headers to prove identity, similar to mTLS or signed API requests (see the verification sketch after this list).
🔗 https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot
🔗 https://developers.cloudflare.com/bots/
- Machine-Readable Licensing: embed copyright and usage rules directly in page metadata or HTTP headers, inspired by CC licenses and W3C ODRL (a sample policy also follows below).
🔗 https://www.w3.org/TR/odrl-model/
🔗 https://wiki.creativecommons.org/wiki/CC_REL
- Crawler Registries: public directories of known, verified bots, optionally tied to reputation scores or crawl behaviors.
🔗 https://certificate.transparency.dev/
- Temporal Access Controls: define what bots can crawl now versus what can be archived or used later; a middle path between live privacy and historical preservation.
🔗 https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
- Legal Frameworks: treat large-scale scraping as subject to copyright, fair use, or digital trespass laws, especially in the context of AI training.
🔗 https://artificialintelligenceact.eu/
🔗 https://crfm.stanford.edu/fmti/May-2024/index.html
🔗 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5159436
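Authenticated crawling isn’t hypothetical; Google already documents a verification handshake for Googlebot. Here’s a minimal Python sketch of that forward-confirmed reverse DNS check (the sample IP comes from Google’s published Googlebot ranges; the function name is mine):

```python
# Forward-confirmed reverse DNS: the check Google documents for
# confirming that a visitor claiming to be Googlebot really is one.
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        # 1. Reverse lookup: IP -> hostname.
        host, _, _ = socket.gethostbyaddr(ip)
        # 2. The hostname must sit under a Google-owned domain.
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 3. Forward lookup must round-trip to the same IP.
        return socket.gethostbyname(host) == ip
    except OSError:  # no PTR record, or hostname doesn't resolve
        return False

print(is_verified_googlebot("66.249.66.1"))  # a crawl-*.googlebot.com IP
```

The same pattern generalizes: any crawler operator who controls reverse DNS for its IP ranges can be verified this way, which is roughly what a crawler registry would formalize.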
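And for machine-readable licensing, the W3C ODRL model already defines a JSON-LD serialization. A sketch of a policy (the URLs are placeholders; the actions are drawn from ODRL’s common vocabulary) that permits reading and indexing a page but prohibits creating derivatives from it:

```json
{
  "@context": "http://www.w3.org/ns/odrl.jsonld",
  "@type": "Set",
  "uid": "https://example.com/policy/42",
  "permission": [
    { "target": "https://example.com/articles/42", "action": "read" },
    { "target": "https://example.com/articles/42", "action": "index" }
  ],
  "prohibition": [
    { "target": "https://example.com/articles/42", "action": "derive" }
  ]
}
```

A policy like this could be linked from an HTTP header or embedded in page metadata. The open question, as with robots.txt, is whether anything compels a crawler to read it.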
Sources
- Cloudflare: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
- Perplexity: Agents or Bots? Making Sense of AI on the Open Web
- TechCrunch: Perplexity accused of scraping websites that explicitly blocked AI scraping
- Wired: EBay Fights Spiders on the Web
- Pete Warden: How I got sued by Facebook
- Martijn Koster: Robots.txt is 25 years old
- Intoli: Analyzing One Million Robots.txt Files
- Internet Archive: Robots.txt meant for search engines don’t work well for web archives
- nDash: The Role of Robots.txt in Modern AI and Web Governance
- robotstxt.org: About /robots.txt
Additional Reading
- Arvind Narayanan & Sayash Kapoor: https://www.aisnakeoil.com/t/ai-safety
- Cory Doctorow: https://doctorow.medium.com/https-pluralistic-net-2024-04-04-teach-me-how-to-shruggie-kagi-caaa88c221f2
- Stanford HAI: https://hai.stanford.edu/news/introducing-foundation-model-transparency-index