Fix: robots.txt Missing AI Bot Directives

AI training crawlers such as GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity AI), and Bytespider (ByteDance) operate independently of Google. A robots.txt that only declares rules for User-agent: * and Googlebot leaves these crawlers without explicit directives. This guide covers both blocking and allowing them.

The Problem

Many sites want either to allow AI crawlers (for GEO — Generative Engine Optimisation, i.e. getting cited in AI answers) or to block them (for content protection). A robots.txt without explicit AI crawler directives leaves your intent ambiguous: some AI crawlers respect User-agent: * rules, while others only act on user-agent entries that name them. Missing directives mean inconsistent handling.

The Fix

robots.txt — Allow AI Crawlers (GEO strategy)
User-agent: *
Allow: /

# Allow AI crawlers explicitly for GEO (AI search citation)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Use Allow: / for AI crawlers you want visible in AI search, or Disallow: / to exclude them from training data. GPTBot is OpenAI's training crawler. ClaudeBot is Anthropic's. PerplexityBot is used by Perplexity AI. Bytespider is ByteDance's crawler. Google-Extended controls Google's AI training separately from Googlebot.
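For the opposite strategy, the same user agents can be blocked instead. A minimal blocking variant (the domain is a placeholder):

```txt
# Block AI training crawlers while keeping normal search indexing
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

# Regular crawlers (including Googlebot) remain unaffected
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

Note that the explicit per-agent entries take precedence for those crawlers, so the wildcard Allow at the bottom does not override the blocks.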

Validate your robots.txt live — fetch any URL and get a corrected file in one click.

Open robots.txt Validator →
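The effect of these rules can also be sanity-checked locally with Python's standard-library urllib.robotparser before deploying — a minimal sketch, where the domain, path, and rule content are placeholders:

```python
from urllib import robotparser

# Example rules: block GPTBot explicitly, allow everyone else
robots_txt = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is matched by its explicit entry and blocked
print(parser.can_fetch("GPTBot", "https://yourdomain.com/article"))       # False
# Other agents fall through to the wildcard entry and are allowed
print(parser.can_fetch("SomeCrawler", "https://yourdomain.com/article"))  # True
```

The same parser can load a live file via set_url() and read() if you want to test the robots.txt you actually have deployed.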

Frequently Asked Questions

Should I allow or block AI crawlers?
It depends on your goals. Allow them if you want your content cited in AI answers (the GEO strategy) — blocking AI crawlers excludes your site from future training data and from the sources these engines cite. Block them if your content is proprietary and you're concerned about training data use.
Does blocking GPTBot stop ChatGPT from knowing about my site?
It stops future collection of training data, but if your content was crawled before you added the block, models trained on that data already contain it. Blocking prevents future training use, not retroactive removal.
What is Google-Extended?
Google-Extended is a separate user agent that controls whether your content is used for Google's AI products (Bard/Gemini training, AI Overviews). Setting Disallow for Google-Extended while allowing Googlebot means Google can index your site normally but cannot use your content for AI training.
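The split described above corresponds to a robots.txt like this (index normally, opt out of AI training):

```txt
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
```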
