Since OpenAI published GPTBot's user agent string in 2023, the robots.txt conversation has changed permanently. Now there are a dozen AI crawlers with named user agents, and every site owner has to decide: allow them, block them, or do nothing.
Before you add a line to your robots.txt, it's worth understanding what that line actually does — and what it doesn't.
The crawlers you need to know about
The major AI crawlers that identify themselves with a named user agent:
- GPTBot — OpenAI's training crawler. Documented since August 2023. Generally respects robots.txt.
- ClaudeBot — Anthropic's crawler. Documented and respects robots.txt.
- Google-Extended — A product token (not a separate crawler) that controls whether Google may use your content to train Gemini models and to ground Gemini's answers. Separate from Googlebot: blocking Google-Extended doesn't affect crawling for Search or your search rankings.
- PerplexityBot — Perplexity AI's crawler. Documented; generally honours robots.txt, though compliance has been reported as inconsistent.
- Bytespider — ByteDance (TikTok parent company). Compliance is less consistent than the others.
- CCBot — Common Crawl. Used by many AI training datasets. Compliance varies widely.
What robots.txt actually does
robots.txt is a voluntary protocol. There is no technical enforcement. A well-behaved crawler reads your robots.txt before crawling and respects the Disallow directives. A poorly-behaved or malicious crawler ignores it entirely.
The major named AI crawlers — GPTBot, ClaudeBot, Google-Extended — are from large companies with reputational stakes and legal teams. They generally respect robots.txt. The long tail of smaller crawlers and scrapers? Much less consistent.
Key point: blocking a crawler in robots.txt prevents future crawling from that crawler. It does not remove your content from training datasets that were built before you added the block. If GPTBot already crawled your site last year, that data is already in the training set.
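The check a well-behaved crawler performs can be sketched with Python's standard-library robotparser. This is a minimal illustration of the voluntary nature of the protocol, not how any particular crawler is implemented; real crawlers add caching, crawl-delay handling, and error fallbacks.

```python
# A minimal sketch of the "voluntary" check a well-behaved crawler
# performs before fetching a page. Nothing enforces this call: a
# badly-behaved crawler simply skips it and fetches anyway.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    # The entire "enforcement" of robots.txt is this one client-side
    # lookup, done by the crawler itself before it issues the request.
    return parser.can_fetch(user_agent, url)

print(polite_fetch_allowed("GPTBot", "https://example.com/post"))       # False
print(polite_fetch_allowed("SomeOtherBot", "https://example.com/post")) # True
```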
Should you block AI crawlers?
This is a genuine strategic decision, not a clear-cut technical one. The right answer depends on what you want:
Block if: your content is proprietary, paywalled, or you're concerned about your work being used to train commercial AI models without compensation. Publishers, news organizations, and content creators with monetized archives are the clearest cases for blocking.
Allow if: you want your content cited in AI-generated answers. When someone asks ChatGPT or Perplexity a question and your content is in the training data or can be crawled, there's a chance your site gets cited. This is called GEO — Generative Engine Optimisation — and it's the emerging counterpart to traditional SEO. Sites that block all AI crawlers are excluded from this entirely.
Selective approach: block training crawlers (GPTBot, CCBot) while allowing retrieval crawlers (PerplexityBot), and decide on Google-Extended separately depending on how you feel about Gemini training and grounding. This attempts to avoid training-data use while preserving citation opportunities.
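As a robots.txt sketch, one version of this selective policy might look like the following. The exact split between "training" and "retrieval" crawlers is a judgment call, and the Google-Extended line reflects one possible choice, not a recommendation:

```
# Selective policy (sketch): block training crawlers,
# allow retrieval crawlers.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /

# One possible choice for Google-Extended:
User-agent: Google-Extended
Allow: /
```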
How to block specific AI crawlers
```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Anthropic's crawler
User-agent: ClaudeBot
Disallow: /

# Block Google's AI training (doesn't affect search rankings)
User-agent: Google-Extended
Disallow: /

# Block ByteDance
User-agent: Bytespider
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /
```
How to allow all AI crawlers explicitly
```
# Explicit allow for AI crawlers (GEO strategy)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /
```
You don't technically need to add Allow directives if you're not blocking anything — crawlers are allowed by default. But explicit Allow entries signal intent and may be read by AI systems evaluating whether your content is available for citation.
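The default-allow behaviour is easy to confirm with Python's stdlib parser. In this sketch, an agent that robots.txt never mentions is permitted everywhere; explicit Allow lines grant no new access, they only make intent visible:

```python
# Demonstrates default-allow: a crawler with no matching group in
# robots.txt is permitted everywhere, with no Allow line needed.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse("""\
User-agent: GPTBot
Disallow: /private/
""".splitlines())

# GPTBot is restricted only where a Disallow rule matches...
print(rules.can_fetch("GPTBot", "https://example.com/private/page"))  # False
print(rules.can_fetch("GPTBot", "https://example.com/blog/post"))     # True
# ...and an unlisted crawler is allowed by default.
print(rules.can_fetch("ClaudeBot", "https://example.com/private/page"))  # True
```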
What about the Google-Extended nuance
Google-Extended is worth understanding separately because it isn't a crawler at all: Google keeps fetching pages with Googlebot, and the Google-Extended token only controls whether that content may be used to train Gemini models and to ground Gemini's answers.
AI Overviews (the AI-generated summaries at the top of search results) are governed by Googlebot, not Google-Extended, so blocking Google-Extended does not remove you from them. If you want to stay in classic search results but keep your content out of AI Overviews, robots.txt can't express that distinction; the closest available controls are snippet settings such as the nosnippet and max-snippet robots meta tags.
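The separation between the two tokens can be sanity-checked with the stdlib parser; this is a sketch showing only that a Disallow for Google-Extended leaves Googlebot's crawl access untouched:

```python
# A Disallow for the Google-Extended token does not match Googlebot,
# because they are separate user-agent tokens in robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Google-Extended
Disallow: /
""".splitlines())

print(rp.can_fetch("Google-Extended", "https://example.com/article"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/article"))        # True
```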
Check your current AI bot coverage
Most robots.txt files were written before AI crawlers existed and don't address them at all. The first step is knowing where you currently stand.
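As an illustration of what such a check involves, here is a rough, hypothetical coverage classifier. The agent list, the parsing shortcuts, and the three-way labels are assumptions for the sketch, not the validator's actual logic:

```python
# Hypothetical "AI bot coverage" sketch: for each known AI crawler,
# report whether a robots.txt addresses it at all.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended",
             "PerplexityBot", "Bytespider", "CCBot"]

def coverage(robots_txt: str) -> dict:
    groups = {}              # lowercased agent -> list of (field, value) rules
    current = []             # agents the current rule group applies to
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not last_was_agent:
                current = []                  # start a new group
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        else:
            for agent in current:
                groups[agent].append((field, value))
            last_was_agent = False
    report = {}
    for agent in AI_AGENTS:
        rules = groups.get(agent.lower())
        if rules is None:
            report[agent] = "unaddressed"     # never mentioned at all
        elif ("disallow", "/") in rules:
            report[agent] = "blocked"         # fully disallowed
        else:
            report[agent] = "allowed"         # mentioned, not fully blocked
    return report

example = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""
cov = coverage(example)
print(cov["GPTBot"], cov["PerplexityBot"], cov["ClaudeBot"])
# blocked allowed unaddressed
```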
Paste your robots.txt to see your AI bot coverage score — which crawlers are explicitly allowed, which are blocked, and which are unaddressed.
Open robots.txt Validator →
The honest answer
robots.txt is a reasonable first line of defence against well-behaved AI crawlers. It's not a legal instrument, it's not technically enforced, and it doesn't reach data that's already been collected. But for the major named crawlers from large companies, it's generally respected and worth using if you have a clear preference.
The most important thing is to make a deliberate decision rather than ignoring the question. An unaddressed robots.txt in 2026 is an implicit "I haven't thought about this" — which is a fine answer but probably not the one you intend.