AI Crawler Allowlist 2026: Which Bots to Let In, Block, or Ignore

GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider — 28+ AI crawlers want your content in 2026. A practical robots.txt + Cloudflare allowlist that protects revenue without cutting off citations.


Mikhail Savchenko · May 10, 2026 · 7 min read
Tags: AEO, AI Visibility, Crawlers, Robots.txt, Cloudflare

In 2026 every credible AI assistant fetches the open web. Each one runs its own crawler, each one announces itself with a user-agent, and each one has a different role: training, indexing, in-session browsing, or citation. Treating them as one undifferentiated "AI scrapers" bucket — which is what Cloudflare's one-click block does by default — leaves citation revenue on the table and trains scrapers to ignore robots.txt entirely.

This is the 2026 cheat sheet: a 28-bot identity table, the three-policy framework (allow / training-only / block), and a working robots.txt + Cloudflare configuration that protects your data without cutting off AI traffic.

The four jobs a crawler does

Before you allow or block, understand which job the bot is doing. A single vendor often runs 2-4 separate crawlers, each for a different purpose.

| Job | What it does | Blocking it means |
| --- | --- | --- |
| Training | Fetches pages to train future models | No future LLM "knows" your domain |
| Search index | Builds a search index the LLM queries at inference | Not cited in answers, period |
| In-session browse | Fetches a URL the user (via the LLM) just asked about | The model can't summarize your live page |
| Embed / RAG | Pulls into a third-party embedding store | Less control of how your content is repackaged |

The mistake most sites make: blocking all four to "protect content" — which deletes the search-index and in-session-browse jobs at the same time. Those two are where 80% of cited-answer traffic comes from.

The 28-bot identity table (May 2026)

| User-agent | Vendor | Job | 2026 policy |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Training-only (allow if you want to be in training) |
| OAI-SearchBot | OpenAI | Search index | Allow |
| ChatGPT-User | OpenAI | In-session browse | Allow (user-initiated fetches may ignore robots.txt anyway) |
| ClaudeBot | Anthropic | Training | Training-only |
| Claude-SearchBot | Anthropic | Search index | Allow |
| Claude-User | Anthropic | In-session browse | Allow |
| anthropic-ai | Anthropic (legacy) | Training | Block (retired but still seen) |
| Google-Extended | Google | Gemini training + grounding (robots.txt token, not a separate crawler) | Allow (does not affect Search or AI Overviews) |
| Googlebot | Google | Classic search + AI Overviews | Allow (the OG) |
| PerplexityBot | Perplexity | Search index + cite | Allow (citation backbone) |
| Perplexity-User | Perplexity | In-session browse | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow |
| Bingbot | Microsoft | Search + Copilot index | Allow |
| bingbot-Extended | Microsoft | Copilot training | Training-only |
| Bytespider | ByteDance | Training | Block (no citation upside, heavy load) |
| Meta-ExternalAgent | Meta | Training | Block |
| FacebookBot | Meta | Embed for AI | Block |
| Amazonbot | Amazon | Alexa AI training | Block |
| cohere-ai | Cohere | Training | Block (no consumer surface) |
| Diffbot | Diffbot | Commercial scraping | Block unless paying |
| Omgilibot | Webz.io | Commercial scraping | Block |
| MistralAI-User | Mistral | In-session browse | Allow |
| Kagibot | Kagi | Premium search index | Allow (small but high-value) |
| Webzio-Extended | Webz.io | Commercial scraping | Block |
| Brave-SearchBot | Brave | Search index | Allow |
| xAI-Bot / Grok-Bot | xAI | Training + Grok index | Allow |
| YouBot | You.com | Search index | Allow |
| Iceberg | Internal LLM tool | Various | Block unidentified by default |
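
If you want the table in a form that application code or a log script can query, a small lookup is usually enough. The sketch below is illustrative only: `BOT_POLICY`, `match_bot`, and `policy_for` are hypothetical names, the dictionary holds just a subset of the rows above, and matching is a plain case-insensitive substring test on the User-Agent header (which says nothing about whether the client is genuine).

```python
from typing import Optional

# Hypothetical subset of the identity table: token -> (job, policy).
BOT_POLICY = {
    "GPTBot": ("training", "training-only"),
    "OAI-SearchBot": ("search index", "allow"),
    "ChatGPT-User": ("in-session browse", "allow"),
    "ClaudeBot": ("training", "training-only"),
    "Claude-SearchBot": ("search index", "allow"),
    "PerplexityBot": ("search index", "allow"),
    "Bytespider": ("training", "block"),
}


def match_bot(user_agent: str) -> Optional[str]:
    """Return the table token found in a User-Agent header, or None."""
    ua = user_agent.lower()
    # Check longer tokens first so a short name never shadows a longer one.
    for token in sorted(BOT_POLICY, key=len, reverse=True):
        if token.lower() in ua:
            return token
    return None


def policy_for(user_agent: str, default: str = "unlisted") -> str:
    """Look up the policy column for a User-Agent; 'unlisted' if no row matches."""
    token = match_bot(user_agent)
    return BOT_POLICY[token][1] if token else default
```

A call such as `policy_for("Mozilla/5.0 (compatible; GPTBot/1.1)")` returns the policy string for the GPTBot row. A User-Agent that matches no row comes back as `"unlisted"`, leaving the allow-or-block decision for unknown clients to the caller.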

A working robots.txt

For a B2B site that wants maximum citation eligibility and no contribution to the training-only crawlers (GPTBot, ClaudeBot, bingbot-Extended):

```
# Citation-grade: allow everything
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Kagibot
Allow: /

User-agent: Brave-SearchBot
Allow: /

User-agent: xAI-Bot
Allow: /

User-agent: YouBot
Allow: /

# Training-only: block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: bingbot-Extended
Disallow: /

# Hostile / no citation upside
User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Webzio-Extended
Disallow: /

# Everyone else
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

Flip the GPTBot / ClaudeBot lines to Allow if you want to be in next-generation model training (most B2B brands benefit from this — it teaches the model your terminology). Keep them Disallow if you've published proprietary research you don't want absorbed.
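
Before deploying, it can help to check that the file says what you think it says. The sketch below uses Python's standard-library `urllib.robotparser` to parse a robots.txt string and report which agents may fetch `/`. The embedded `ROBOTS_TXT` is an abbreviated stand-in for the full file, and `allowed_map` and `SomeOtherBot` are hypothetical names used for illustration. The standard-library parser is a reasonable approximation, but individual crawlers may interpret edge cases differently.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# Abbreviated stand-in for the full policy; paste your real file here.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def allowed_map(robots_txt: str, agents: List[str], url: str = "/") -> Dict[str, bool]:
    """Parse robots.txt text and report whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_map(
    ROBOTS_TXT,
    ["OAI-SearchBot", "PerplexityBot", "GPTBot", "Bytespider", "SomeOtherBot"],
)
for agent, ok in result.items():
    print(f"{agent:15s} {'allowed' if ok else 'disallowed'}")
```

Flipping a `Disallow: /` to `Allow: /` in the string and re-running the script is a quick way to confirm the change takes effect for the intended agent only.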

Cloudflare enforcement

robots.txt is honored by compliant bots. Bytespider, scrape-as-a-service vendors, and residential-proxy scrapers ignore it. The 2026 enforcement layer is Cloudflare's AI Bot management dashboard (Security → Bots → AI Bots, available on Free tier since 2025):

  1. Disable the "Block AI Scrapers" one-click toggle — it's the blunt hammer that costs 70% of citation traffic.
  2. Enable per-bot rules with these settings:
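    • PerplexityBot, OAI-SearchBot, Claude-SearchBot: Allow (Google-Extended and Applebot-Extended are robots.txt-only tokens that never appear as a request user-agent, so they are handled in robots.txt rather than here)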
    • GPTBot, ClaudeBot: Allow or Managed Challenge depending on training policy
    • Bytespider, Amazonbot, Meta-ExternalAgent, FacebookBot: Block
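  3. Add a WAF custom rule matching (not cf.client.bot) and (lower(http.user_agent) contains "ai" or lower(http.user_agent) contains "bot") with action Managed Challenge to catch impersonators using user-agents that look AI-flavoured but aren't on the verified list. The contains operator is case-sensitive, so without lower() a spoofed "GPTBot" would slip past a match on "bot". The bare "ai" substring is broad, so review the matched requests in Security Events and narrow the rule if ordinary clients get challenged.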
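  4. Drop a honeytrap URL at /disallow-honeypot/ linked only from llms.txt, add Disallow: /disallow-honeypot/ to every group in robots.txt (a crawler that matches its own group ignores the * group), and attach a rate-limiting rule with a threshold of 1 request per IP and action Block for 24h. Any client that fetches the path has ignored robots.txt; by design no legitimate AI crawler requests it.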
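
If you are not behind Cloudflare, or want the same behaviour at the application layer, the honeytrap in step 4 can be approximated in a few lines. This is a minimal in-memory sketch, not Cloudflare's implementation: `HoneypotBlocklist` and `should_block` are hypothetical names, state is lost on restart, and a real deployment would more likely keep the blocklist in a shared store. The clock is injectable only so the expiry logic can be exercised without waiting a day.

```python
import time
from typing import Callable, Dict

HONEYPOT_PREFIX = "/disallow-honeypot/"  # path from step 4
BLOCK_SECONDS = 24 * 60 * 60             # the 24h block from step 4


class HoneypotBlocklist:
    """Block an IP for 24h after a single request to the honeypot path."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._blocked_until: Dict[str, float] = {}

    def should_block(self, ip: str, path: str) -> bool:
        now = self._clock()
        if path.startswith(HONEYPOT_PREFIX):
            # One hit is enough: start (or restart) the 24h window.
            self._blocked_until[ip] = now + BLOCK_SECONDS
            return True
        expiry = self._blocked_until.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            # Window elapsed: forget the IP and let it through again.
            del self._blocked_until[ip]
            return False
        return True
```

In a request handler you would call `should_block(client_ip, request_path)` first and return a 403 when it is true. Behind a proxy or CDN, make sure `client_ip` is the real client address rather than the proxy's.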

Verification

After you ship the policy, verify weekly:

  • tail -f /var/log/nginx/access.log | grep -E "PerplexityBot|OAI-SearchBot|Claude-SearchBot" — confirm citation bots are still fetching.
  • Cloudflare Analytics → Security → Bot traffic — chart Allowed/Challenged/Blocked over time. Look for a 30-50% drop in blocked AI traffic after rollout; that's the unwanted training scrapers leaving.
  • Perplexity citation share (queries that should cite you) — should rise 2-4 weeks after allowing PerplexityBot if you weren't already.
  • Google Search Console → AI Overview impressions (rolling out throughout 2026) — should not crater.
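
For a weekly check, a per-bot count is often easier to compare over time than a live tail. The sketch below counts hits from the three citation bots named in the first bullet. The `SAMPLE` lines and their user-agent strings are made up for illustration, and `count_citation_hits` is a hypothetical helper; it matches on the bot name anywhere in the line, so it does not verify that the request really came from that vendor.

```python
import re
from collections import Counter
from typing import Iterable

CITATION_BOTS = ("PerplexityBot", "OAI-SearchBot", "Claude-SearchBot")
_BOT_RE = re.compile("|".join(re.escape(bot) for bot in CITATION_BOTS))


def count_citation_hits(lines: Iterable[str]) -> Counter:
    """Count access-log lines that mention each citation bot by name."""
    hits: Counter = Counter()
    for line in lines:
        match = _BOT_RE.search(line)
        if match:
            hits[match.group(0)] += 1
    return hits


# Made-up log lines in nginx "combined" style, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.6 - - [10/May/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [10/May/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 700 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.9 - - [10/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_citation_hits(SAMPLE)
for bot in CITATION_BOTS:
    print(f"{bot:17s} {hits[bot]}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_citation_hits(open("/var/log/nginx/access.log"))`. A bot whose count falls to zero after rollout is the signal to go and check rule order.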

If PerplexityBot fetches stop after rollout, your Cloudflare rule order is wrong (a broader Block rule is matching first). Reorder so per-bot Allows come before Challenge/Block.
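
The rule-order failure is easy to reproduce with a toy model. The sketch below assumes a simplified first-match-wins evaluator, which is not Cloudflare's actual engine. `Rule`, `evaluate`, and the two example rules are hypothetical, and the user-agent string is illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]
    action: str  # "allow", "challenge", or "block"


def evaluate(rules: List[Rule], user_agent: str, default: str = "allow") -> str:
    """Return the action of the first rule that matches; later rules never run."""
    for rule in rules:
        if rule.matches(user_agent):
            return rule.action
    return default


allow_perplexity = Rule(
    "allow-perplexity", lambda ua: "perplexitybot" in ua.lower(), "allow"
)
block_generic = Rule(
    "block-generic-bots", lambda ua: "bot" in ua.lower(), "block"
)

ua = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
wrong_order = evaluate([block_generic, allow_perplexity], ua)
right_order = evaluate([allow_perplexity, block_generic], ua)
print(wrong_order, right_order)
```

With the broad rule first, the PerplexityBot request is blocked before its Allow rule is ever consulted. Swapping the two rules lets it through while every other "bot" user-agent still hits the block.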

Why the default block is wrong

Three of the four crawler jobs do something you want: indexing your content so it gets cited, browsing your live page when a user asks, and training future models to know your brand exists. Only the fourth, pure scraping by bots like Bytespider, does nothing useful for you.

The default ("block all AI") was right in 2024, when there were no citation pathways and the crawlers were taking content with no return. In 2026 the pathways exist (PerplexityBot citations now drive measurable referral traffic; Google AI Overview cites with link attribution; ChatGPT-User pulls live data into answers) and the asymmetry has reversed: blocking is the expensive choice.

A short robots.txt and four Cloudflare steps are the difference between being cited when your customers put questions to AI tools and being invisible. The work takes one hour; the lift compounds for years.

Frequently Asked Questions
  • 01. Does blocking GPTBot stop ChatGPT from citing my site?

    Not by itself. GPTBot is OpenAI's training crawler: blocking it keeps your content out of future model training sets. ChatGPT-User (the in-conversation browse) and OAI-SearchBot (the search index used for citations) are separate crawlers with their own robots.txt tokens. Block GPTBot if you want to opt out of training; leave OAI-SearchBot and ChatGPT-User allowed to keep citation eligibility.

  • 02. Should I block Cloudflare's 'AI Scrapers' category by default?

    No. The default category blocks PerplexityBot, OAI-SearchBot, and ClaudeBot — the three that account for the majority of AI-cited traffic. Use the granular per-bot toggles instead, allow the citation-grade bots, and block only training-only crawlers.

  • 03. What's the difference between Google-Extended and Googlebot?

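    Googlebot crawls and indexes for Google Search. Google-Extended is a robots.txt control token rather than a separate crawler: it governs whether content Google has already crawled may be used to train Gemini (formerly Bard) and to ground its answers. According to Google's crawler documentation it does not affect inclusion or ranking in Search. AI Overviews (reportedly around 60% of Google queries by April 2026) are a Search feature, so they follow Googlebot rules and snippet controls. Disallowing Google-Extended therefore opts you out of Gemini training without removing you from AI Overviews; keep Googlebot allowed if AI Overview visibility matters to you.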

  • 04. How do I block bots that ignore robots.txt?

    Three layers: (1) Cloudflare WAF / Bot Fight Mode rules matched on user-agent regex, (2) reverse-proxy (nginx, Caddy) rules at the edge, (3) honeytrap URLs in your llms.txt that no legitimate bot will fetch — any IP that hits them gets a 24h block. Bytespider and unidentified residential-proxy scrapers need all three layers; user-agent alone won't catch them.

  • 05. Do I need separate rules for ClaudeBot vs Claude-SearchBot?

    Yes. ClaudeBot is Anthropic's training crawler (similar role to GPTBot). Claude-SearchBot builds the search index Claude draws citations from, and Claude-User fetches pages during a conversation. anthropic-ai is the legacy user-agent retired in 2025: keep a Disallow on it for safety, but allow Claude-SearchBot and Claude-User if you want Claude Sonnet/Opus to cite you.
