Robots.txt in 2025


On August 4, 2025, Cloudflare published a detailed post accusing Perplexity AI of deploying stealth crawlers — bots that ignore robots.txt, don’t identify themselves, and mimic real browsers to evade detection.

Cloudflare claims to have collected traffic logs showing access to sites that had explicitly disallowed PerplexityBot via robots.txt. The bots reportedly used generic browser headers and rotated IP addresses, techniques commonly associated with crawlers trying to avoid detection. A Perplexity spokesperson denied the claim, describing the allegations as misleading and commercially motivated, and the company published its own write-up in response.
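To make "generic browser headers" concrete, here is an illustrative sketch, not Perplexity's actual traffic: a declared crawler announces itself in its User-Agent string, while a stealth crawler presents headers indistinguishable from a desktop browser. The header values below are typical examples, and the helper function is a hypothetical simplification of what a robots.txt matcher can see.

```python
# Illustrative only: these header sets are typical examples,
# not logs of any real crawler's traffic.

# A well-behaved crawler identifies itself, often linking to its docs:
declared_bot = {
    "User-Agent": "PerplexityBot/1.0 (+https://perplexity.ai/perplexitybot)",
}

# A stealth crawler blends in by presenting ordinary browser headers:
stealth_bot = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

# A robots.txt rule can only match the string the client chooses to send,
# so a browser-mimicking request sails past a
# "User-agent: PerplexityBot" disallow. (Hypothetical heuristic.)
def is_declared_crawler(headers: dict) -> bool:
    return "bot" in headers.get("User-Agent", "").lower()

print(is_declared_crawler(declared_bot))  # True
print(is_declared_crawler(stealth_bot))   # False
```

The asymmetry is the whole story: the server never chooses which rules apply; the client does, by deciding what name to wear.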

This post isn’t about one company or crawler. It’s a reminder that robots.txt is a voluntary convention, and in 2025, bots have little incentive to ask permission.


A Brief History of robots.txt


The robots.txt file was created in 1994, after a runaway crawler overwhelmed a web server. Developer Martijn Koster proposed a simple fix: let sites place a plain text file at their root (/robots.txt) to declare what bots shouldn't touch.
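The format Koster proposed is essentially what sites still serve today. A minimal example (the paths and bot name here are illustrative):

```
# /robots.txt — advisory rules, one group per crawler

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
```

Each group names a crawler (or `*` for everyone else) and lists path prefixes it is asked to stay out of. Nothing in the file, or in HTTP, enforces any of it.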

It wasn't a firewall. It wasn’t enforced. It was just...a sign.

A digital version of “Please don’t enter.”

Early search engines and crawlers adhered to it. Over time it became a de facto web standard, and in 2022 it was finally codified as RFC 9309, though compliance remained as voluntary as ever.
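Compliance has always lived entirely in the client. Python even ships a parser for this in its standard library, `urllib.robotparser`; a compliant crawler checks it before every fetch. Here the rules are fed in directly rather than fetched over the network, and the bot names and URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Rules matching the illustrative robots.txt shown earlier in this post.
rules = """
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The disallowed bot is told "no" everywhere...
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
# ...while everyone else is only asked to avoid /private/.
print(parser.can_fetch("GenericBot", "https://example.com/article"))     # True
print(parser.can_fetch("GenericBot", "https://example.com/private/x"))   # False
```

Note what `can_fetch` actually is: a courtesy check the crawler runs against itself. Skip the call and the server is none the wiser.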

As scraping proved profitable — initially for search, then for SEO, and later for pricing and aggregation — the gap between bots that ask for data and those that take it began to grow.


This Isn’t the First Time

Cloudflare’s blog post is not unique. It’s the latest in a long pattern.

  • In 2000, eBay sued Bidder’s Edge for scraping its auction listings with automated bots, and won a preliminary injunction on a trespass-to-chattels theory.
  • In 2010, Facebook threatened Pete Warden with legal action for crawling public profile data without prior written permission, forcing him to destroy his dataset.
  • In 2017, the Internet Archive warned that retroactive robots.txt restrictions were erasing public history.
  • In 2017, Intoli analyzed over 1 million robots.txt files, finding vague and inconsistent usage.
  • In 2019, Martijn Koster marked 25 years of the standard and questioned its relevance.
  • In 2024, nDash explored its fragility in the age of AI — where web content isn't just indexed, it’s ingested.

Each example ends the same way: the bots evolve; robots.txt stays the same.


The Incentives Are Changing

What’s different now is what the bots want.

In the 2000s, they indexed pages for humans. In the 2010s, they scraped prices and text for SEO and competitive intelligence. In the 2020s, they harvest content to train AI models.

The volume is higher. The value is greater. The appetite is limitless.

And the polite rules of 1994 can’t stop them.

There is no audit trail. There is no digital signature confirming which crawler is which. No API key to identify access. No penalty for ignoring a disallow. Just a voluntary file in a public directory — as easy to ignore as it is to find.
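Because nothing in HTTP proves which crawler is which, the workaround sites actually use is forward-confirmed reverse DNS (FCrDNS): look up the IP's reverse record, check it lands in the crawler operator's domain, then confirm the forward lookup points back to the same IP. Google documents exactly this for verifying Googlebot. The sketch below captures the logic with toy resolvers standing in for DNS; a real check would call `socket.gethostbyaddr` and `socket.gethostbyname`, and the IPs and hostnames here are made up.

```python
# Sketch of forward-confirmed reverse DNS (FCrDNS), the fallback sites use
# because no header cryptographically identifies a crawler.

def fcrdns_ok(ip, ptr_lookup, a_lookup, allowed_suffix):
    """Trust a crawler IP only if its reverse record points back to itself."""
    hostname = ptr_lookup(ip)                # e.g. "crawl-x.googlebot.com"
    if not hostname.endswith(allowed_suffix):
        return False                         # wrong operator domain
    return ip in a_lookup(hostname)          # forward lookup must confirm

# Toy resolvers standing in for live DNS (addresses are illustrative):
ptr = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com",
       "203.0.113.9": "crawl-fake.googlebot.com"}.get
fwd = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"],
       "crawl-fake.googlebot.com": ["198.51.100.7"]}.get

print(fcrdns_ok("66.249.66.1", ptr, fwd, ".googlebot.com"))  # True
print(fcrdns_ok("203.0.113.9", ptr, fwd, ".googlebot.com"))  # False
```

Even this only works for crawlers that want to be verified. A stealth bot on rotating residential IPs simply has no reverse record worth checking.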


Bot Protection Services: Cloudflare Is Not a Neutral Actor

While Cloudflare’s evidence is compelling, it’s important to acknowledge context: they sell bot protection products. Calling out stealth bots reinforces their narrative — that robots.txt isn’t enough, and their tools are the solution.

That doesn’t make them wrong. But it does highlight the need for a broader conversation that doesn’t start and end with vendor marketing.

Because the real question isn’t whether this one crawler played fair.

It’s: what happens when nobody does?


The Web Was Built on Trust

The robots.txt file was designed for an internet of goodwill — and goodwill alone. It has no enforcement. No validation. No incentive model beyond reputation.

But the modern internet runs on different fuel: competition, data extraction, and invisible automation.

Stealth crawling isn’t new. The difference in 2025 is that it's no longer fringe — it’s foundational. Entire business models now quietly depend on it.

And the longer we rely on a polite "Please don’t enter" sign to guard the front door, the more we guarantee it gets ignored.


What Comes Next?

If robots.txt is too weak to enforce no-crawl policies, what replaces it?

Some ideas being explored or proposed include:

  • Cryptographic crawler identity, such as the IETF’s draft Web Bot Auth, which has bots sign their requests so origins can verify who is really knocking.
  • Pay-per-crawl and content licensing schemes, where access for AI crawlers is negotiated and metered rather than assumed.
  • Enforcement at the network edge, with CDNs and WAFs blocking or challenging unverified automation instead of relying on an advisory file.


