
AI Crawler Allowlist 2026: Which Bots to Let In, Block, or Ignore
GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider — 28+ AI crawlers want your content in 2026. A practical robots.txt + Cloudflare allowlist that protects revenue without cutting off citations.
In 2026 every credible AI assistant fetches the open web. Each one runs its own crawler, each one announces itself with a user-agent, and each one has a different role: training, indexing, in-session browsing, or citation. Treating them as one undifferentiated "AI scrapers" bucket — which is what Cloudflare's one-click block does by default — leaves citation revenue on the table and trains scrapers to ignore robots.txt entirely.
This is the 2026 cheat sheet: a 28-bot identity table, the three-policy framework (allow / training-only / block), and a working robots.txt + Cloudflare configuration that protects your data without cutting off AI traffic.
The four jobs a crawler does
Before you allow or block, understand which job the bot is doing. A single vendor often runs 2-4 separate crawlers, each for a different purpose.
| Job | What it does | Block means |
|---|---|---|
| Training | Fetches pages to train future models | No future LLM "knows" your domain |
| Search index | Builds a search index the LLM queries at inference | Not cited in answers, period |
| In-session browse | Fetches a URL the user (via the LLM) just asked about | The model can't summarize your live page |
| Embed / RAG | Pulls into a third-party embedding store | Less control of how your content is repackaged |
The mistake most sites make: blocking all four to "protect content" — which deletes the search-index and in-session-browse jobs at the same time. Those two are where 80% of cited-answer traffic comes from.
The 28-bot identity table (May 2026)
| User-agent | Vendor | Job | 2026 policy |
|---|---|---|---|
| GPTBot | OpenAI | Training + index | Training-only (allow if you want to be in training) |
| OAI-SearchBot | OpenAI | Search index | Allow |
| ChatGPT-User | OpenAI | In-session browse | Allow (ignores robots.txt anyway) |
| ClaudeBot | Anthropic | Training | Training-only |
| Claude-SearchBot | Anthropic | Search index | Allow |
| Claude-User | Anthropic | In-session browse | Allow |
| anthropic-ai | Anthropic (legacy) | Training | Block (retired but still seen) |
| Google-Extended | Google | Training + AIO | Allow (AIO is now 60% of Google) |
| Googlebot | Google | Classic search | Allow (the OG) |
| PerplexityBot | Perplexity | Search index + cite | Allow (citation backbone) |
| Perplexity-User | Perplexity | In-session browse | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow |
| Bingbot | Microsoft | Search + Copilot index | Allow |
| bingbot-Extended | Microsoft | Copilot training | Training-only |
| Bytespider | ByteDance | Training | Block (no citation upside, heavy load) |
| Meta-ExternalAgent | Meta | Training | Block |
| FacebookBot | Meta | Embed for AI | Block |
| Amazonbot | Amazon | Alexa AI training | Block |
| cohere-ai | Cohere | Training | Block (no consumer surface) |
| Diffbot | Diffbot | Commercial scraping | Block unless paying |
| Omgilibot | Webz.io | Commercial scraping | Block |
| MistralAI-User | Mistral | In-session browse | Allow |
| Kagibot | Kagi | Premium search index | Allow (small but high-value) |
| Webzio-Extended | Webz.io | Commercial scraping | Block |
| Brave-SearchBot | Brave | Search index | Allow |
| xAI-Bot / Grok-Bot | xAI | Training + Grok index | Allow |
| YouBot | You.com | Search index | Allow |
| Iceberg | Internal LLM tool | Various | Block unidentified by default |
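One way to apply the table in application code is a lookup keyed by user-agent token. The sketch below is illustrative only: `BOT_POLICY` holds a subset of the rows above, and `policy_for` is a hypothetical helper name, not part of any library. User-agent strings can be spoofed, so treat the result as a hint rather than proof of identity.

```python
from typing import Optional

# Subset of the identity table above: user-agent token -> policy.
BOT_POLICY = {
    "GPTBot": "training-only",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "ClaudeBot": "training-only",
    "Claude-SearchBot": "allow",
    "Claude-User": "allow",
    "PerplexityBot": "allow",
    "Perplexity-User": "allow",
    "Bytespider": "block",
    "Meta-ExternalAgent": "block",
    "Amazonbot": "block",
}


def policy_for(user_agent: str) -> Optional[str]:
    """Return the table policy for the first known token found in the UA.

    Matching is a case-insensitive substring test. Longer tokens are tried
    first so a longer token wins over any shorter token it might contain.
    Returns None when no known token is present.
    """
    ua = user_agent.lower()
    for token in sorted(BOT_POLICY, key=len, reverse=True):
        if token.lower() in ua:
            return BOT_POLICY[token]
    return None
```

Calling `policy_for` with a request's `User-Agent` header gives the policy string to act on; a `None` result means the bot is not in the table and falls under the "block unidentified by default" row.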
A working robots.txt
For a B2B site that wants maximum citation eligibility and zero training contribution:
# Citation-grade — allow everything
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: MistralAI-User
Allow: /
User-agent: Kagibot
Allow: /
User-agent: Brave-SearchBot
Allow: /
User-agent: xAI-Bot
Allow: /
User-agent: YouBot
Allow: /
# Training-only — block
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: bingbot-Extended
Disallow: /
# Hostile / no citation upside
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Webzio-Extended
Disallow: /
# Everyone else
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
Flip the GPTBot / ClaudeBot lines to Allow if you want to be in next-generation model training (most B2B brands benefit from this — it teaches the model your terminology). Keep them Disallow if you've published proprietary research you don't want absorbed.
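Before deploying, it can help to check that the file says what you think it says. The sketch below feeds an abbreviated copy of the policy into Python's standard `urllib.robotparser` and asks which agents may fetch a page. `ROBOTS_TXT`, `can_fetch_map`, and the `/pricing` URL are assumptions made for the example, and Python's parser is only one interpretation of the rules: a real crawler may resolve edge cases differently.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the policy above, for illustration only.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch_map(robots_txt: str, agents: List[str], url: str) -> Dict[str, bool]:
    """Parse robots.txt text and report, per agent, whether url is fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "Bytespider", "SomeOtherBot"]
    for agent, allowed in can_fetch_map(
        ROBOTS_TXT, agents, "https://yourdomain.com/pricing"
    ).items():
        print(f"{agent:15} {'allow' if allowed else 'disallow'}")
```

Running the same check against the full file after every edit catches the common mistake of a stray `Disallow` landing in the wrong group.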
Cloudflare enforcement
robots.txt is honored by compliant bots. Bytespider, scrape-as-a-service vendors, and residential-proxy scrapers ignore it. The 2026 enforcement layer is Cloudflare's AI Bot management dashboard (Security → Bots → AI Bots, available on Free tier since 2025):
- Disable the "Block AI Scrapers" one-click toggle: it's the blunt hammer that costs 70% of citation traffic.
- Enable per-bot rules with these settings:
  - PerplexityBot, OAI-SearchBot, Claude-SearchBot, Google-Extended, Applebot-Extended: Allow
  - GPTBot, ClaudeBot: Allow or Managed Challenge depending on training policy
  - Bytespider, Amazonbot, Meta-ExternalAgent, FacebookBot: Block
- Add a WAF custom rule matching `(cf.client.bot eq false) and (http.user_agent contains "ai" or http.user_agent contains "bot")` with action Managed Challenge to catch impersonators using user-agents that look AI-flavoured but aren't on the verified list.
- Drop a honeytrap URL at `/disallow-honeypot/` linked only from `llms.txt`, set rate-limit-per-IP to 1, action Block 24h. This catches scrapers that ignore robots.txt; by design no legitimate AI crawler fetches that path.
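To reason about what the impersonator rule would and would not catch, it can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not Cloudflare's implementation: `matches_impersonator_rule` is a hypothetical name, `verified_bot` stands in for `cf.client.bot`, and the default assumes that `contains` is case-sensitive, as Python's `in` is. Check the Cloudflare Rules language reference before relying on that assumption.

```python
def matches_impersonator_rule(
    verified_bot: bool, user_agent: str, case_sensitive: bool = True
) -> bool:
    """Mirror: (not verified bot) and (UA contains "ai" or UA contains "bot").

    With case_sensitive=True the substrings must appear in lowercase, so a
    user-agent such as "FakeGPTBot/1.0" (capital B) does not match.
    """
    ua = user_agent if case_sensitive else user_agent.lower()
    looks_ai_flavoured = "ai" in ua or "bot" in ua
    return (not verified_bot) and looks_ai_flavoured


if __name__ == "__main__":
    samples = [
        (False, "FakeGPTBot/1.0"),
        (False, "my-ai-crawler/0.1"),
        (True, "PerplexityBot/1.0"),
    ]
    for verified, ua in samples:
        strict = matches_impersonator_rule(verified, ua)
        loose = matches_impersonator_rule(verified, ua, case_sensitive=False)
        print(f"{ua:20} verified={verified!s:5} strict={strict} loose={loose}")
```

If the case-sensitivity assumption holds, most AI crawler names end in a capital-B `Bot`, so a lowercase `"bot"` substring test would miss impersonators of them; comparing against a lowercased user-agent is the usual way to close that gap.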
Verification
After you ship the policy, verify weekly:
- `tail -f /var/log/nginx/access.log | grep -E "PerplexityBot|OAI-SearchBot|Claude-SearchBot"` to confirm citation bots are still fetching.
- Cloudflare Analytics → Security → Bot traffic: chart Allowed/Challenged/Blocked over time. Look for a 30-50% drop in blocked AI traffic after rollout; that's the unwanted training scrapers leaving.
- Perplexity citation share (queries that should cite you) — should rise 2-4 weeks after allowing PerplexityBot if you weren't already.
- Google Search Console → AI Overview impressions (rolling out throughout 2026) — should not crater.
If PerplexityBot fetches stop after rollout, your Cloudflare rule order is wrong (a broader Block rule is matching first). Reorder so per-bot Allows come before Challenge/Block.
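For a weekly check, a count per bot is easier to compare over time than a live `tail`. The sketch below tallies hits for the three citation bots from access-log lines. The function names are hypothetical, and it assumes the user-agent string appears somewhere in each log line, as it does in nginx's default combined format.

```python
import re
from collections import Counter
from typing import Iterable, List

CITATION_BOTS = ("PerplexityBot", "OAI-SearchBot", "Claude-SearchBot")
_BOT_RE = re.compile("|".join(re.escape(bot) for bot in CITATION_BOTS))


def count_citation_hits(lines: Iterable[str]) -> Counter:
    """Count log lines per citation bot, using the first bot name found."""
    hits: Counter = Counter()
    for line in lines:
        match = _BOT_RE.search(line)
        if match:
            hits[match.group(0)] += 1
    return hits


def missing_bots(hits: Counter) -> List[str]:
    """Return the citation bots that did not appear at all."""
    return [bot for bot in CITATION_BOTS if hits[bot] == 0]


if __name__ == "__main__":
    # Path taken from the tail command above; adjust for your server.
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
        hits = count_citation_hits(log)
    print(dict(hits))
    print("no fetches seen from:", missing_bots(hits))
```

A bot that shows up in `missing_bots` after a rollout is the signal to go and check the Cloudflare rule order.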
Why the default block is wrong
Three of the four kinds of AI crawler do something you want: index your content so it gets cited, browse your live page when a user asks, or train future models to know your brand exists. Only one kind, pure scrapers like Bytespider, does nothing useful for you.
The "block all AI" default was right in 2024, when there were no citation pathways and the crawlers were taking content with no return. In 2026 the pathways exist (PerplexityBot citations now drive measurable referral traffic; Google AI Overview cites with link attribution; ChatGPT-User pulls live data into answers) and the asymmetry has reversed: blocking is the expensive choice.
A short robots.txt and four Cloudflare rules are the difference between being cited in the answers your customers get from AI tools and being invisible. The work takes one hour; the lift compounds for years.
Does blocking GPTBot stop ChatGPT from citing my site?
Partially. GPTBot is OpenAI's training/indexing crawler — blocking it stops your content from entering future model training sets and the OpenAI search index. But ChatGPT-User (the in-conversation browse) and OAI-SearchBot (the search index used for citations) are separate. Block GPTBot if you want to opt out of training; leave OAI-SearchBot and ChatGPT-User allowed to keep citation eligibility.
Should I block Cloudflare's 'AI Scrapers' category by default?
No. The default category blocks PerplexityBot, OAI-SearchBot, and ClaudeBot — the three that account for the majority of AI-cited traffic. Use the granular per-bot toggles instead, allow the citation-grade bots, and block only training-only crawlers.
What's the difference between Google-Extended and Googlebot?
Googlebot indexes for classic Google Search. Google-Extended is the separate signal that controls whether your content can be used to train Bard/Gemini and to populate Google AI Overview answers. Disallow Google-Extended ONLY if you also accept being absent from AI Overview; that traffic had climbed to 60% of all Google queries by April 2026.
How do I block bots that ignore robots.txt?
Three layers: (1) Cloudflare WAF / Bot Fight Mode rules matched on user-agent regex, (2) reverse-proxy (nginx, Caddy) rules at the edge, (3) honeytrap URLs in your llms.txt that no legitimate bot will fetch — any IP that hits them gets a 24h block. Bytespider and unidentified residential-proxy scrapers need all three layers; user-agent alone won't catch them.
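The honeytrap layer can be sketched as a small in-memory ban list. This is a minimal illustration, assuming the `/disallow-honeypot/` path from the Cloudflare section and a 24-hour ban; `HoneypotBanList` and `observe` are hypothetical names, and a production setup would keep the state at the edge or in a shared store rather than in one process.

```python
import time
from typing import Dict, Optional

HONEYPOT_PREFIX = "/disallow-honeypot/"
BAN_SECONDS = 24 * 60 * 60  # 24h block, as described above


class HoneypotBanList:
    """Track IPs that touched the honeypot and block them for 24 hours."""

    def __init__(self) -> None:
        self._banned_until: Dict[str, float] = {}

    def observe(self, ip: str, path: str, now: Optional[float] = None) -> bool:
        """Record one request; return True if it should be blocked."""
        now = time.time() if now is None else now
        if path.startswith(HONEYPOT_PREFIX):
            # Any hit on the trap starts (or restarts) the ban window.
            self._banned_until[ip] = now + BAN_SECONDS
            return True
        expires = self._banned_until.get(ip)
        if expires is None:
            return False
        if now >= expires:
            # Ban has run out; forget the IP and let the request through.
            del self._banned_until[ip]
            return False
        return True
```

A reverse proxy or application middleware would call `observe` once per request and return a 403 whenever it yields `True`. The `now` parameter exists only so the expiry logic can be tested without waiting a day.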
Do I need separate rules for ClaudeBot vs Claude-SearchBot?
Yes. ClaudeBot is Anthropic's training crawler (similar role to GPTBot). Claude-SearchBot powers in-chat web access for citations. anthropic-ai is the legacy user-agent retired in 2025 — keep a Disallow on it for safety, but allow the two new ones if you want Claude Sonnet/Opus to cite you.

