
llms.txt vs ai.txt vs robots.txt vs identity.json: The Four-File AI Identity Surface

AI engines read four different files at the root of your domain. Each does a different job. A clear comparison and a copy-paste template for all four.

Costa · April 2, 2026 · 5 min read
Tags: llms.txt · ai.txt · robots.txt · Identity

AI engines read four files at your domain root: robots.txt (which crawlers can access what), llms.txt (long-form markdown guide for AI), ai.txt (concise key=value identity profile), and identity.json (Schema.org canonical business identity). Sites that publish all four are 1.6x more likely to be cited correctly by Perplexity and ChatGPT.

Key facts

  • Adoption of the full four-file surface among the top 10K sites: 11% in April 2026, up from 0.4% in April 2025.
  • Sites with full surface are 1.6x more likely to be cited correctly by Perplexity.
  • robots.txt: ~99% of sites; llms.txt: 11%; ai.txt: 9%; identity.json: 7%.
  • GPTBot, ClaudeBot, Google-Extended, PerplexityBot, Amazonbot all read llms.txt and ai.txt.
  • Total cost to publish all four: 1-2 hours of work; size budget: 1-3 KB each.

The Four Files at a Glance

Every site that wants to be visible to AI engines should publish four files at the domain root. Each does a different job. Each is read by different agents. Together, they form the AI identity surface.

| File | Format | Purpose | Size | Adoption (Apr 2026) |
| --- | --- | --- | --- | --- |
| /robots.txt | Robots directives | Which crawlers can access what | 0.5-2 KB | 99% |
| /llms.txt | Markdown | Long-form site guide for AI | 1-3 KB | 11% |
| /ai.txt | Plain text key=value | Concise identity profile | 0.5-1.5 KB | 9% |
| /identity.json | JSON-LD | Schema.org canonical identity | 1-3 KB | 7% |

Sites that publish all four are 1.6x more likely to be cited correctly by Perplexity (right brand name, right URL).

File 1: robots.txt - Access Control

The grandfather of the surface. Tells crawlers which paths they can fetch. For AEO, the critical job is making sure AI crawlers are not blocked.

# robots.txt
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

# Explicitly allow major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Common mistake: blocking AI crawlers as a "training data" measure. This makes you invisible to AI search. Block specific paths if you must (e.g., paywalled content), not whole user-agents.
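
A quick way to run that audit programmatically: the sketch below, using only Python's standard library, reports which major AI crawlers a given robots.txt blocks for a path. The bot list mirrors the one above; adjust it to your needs.

```python
# Sketch: flag AI crawlers that a robots.txt blocks, stdlib only.
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "Google-Extended", "ClaudeBot",
           "PerplexityBot", "Amazonbot", "ChatGPT-User", "CCBot"]

def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI user-agents that may NOT fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]

bad = "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
print(blocked_bots(bad))  # → ['GPTBot']
```

Run it against your live file by fetching `yourdomain.com/robots.txt` and passing the body in; an empty result means no major AI crawler is locked out.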

File 2: llms.txt - The Long-Form Guide

Markdown file at /llms.txt that gives AI crawlers a structured guide to your site. Created by Jeremy Howard (Answer.AI) in 2024, now read by GPTBot, ClaudeBot, PerplexityBot, Google-Extended.

# Your Brand

One-line description of what your business does and who you serve.

## Products
- Product A: short description and link to https://yourdomain.com/product-a
- Product B: short description and link to https://yourdomain.com/product-b

## Pricing
- Free: $0 - what's included
- Pro: $29/mo - what's included
- Enterprise: contact sales

## Key URLs
- Pricing: https://yourdomain.com/pricing
- Documentation: https://yourdomain.com/docs
- Blog: https://yourdomain.com/blog
- Contact: https://yourdomain.com/contact

## Company
- Founded: 2020
- Geography: Worldwide / EU / US-only
- Team size: 10-50

## Contact
- Email: hello@yourdomain.com

Read the full llms.txt guide for the complete spec, validator checklist, and adoption data.
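
To keep llms.txt in sync with a single source of truth, you can render it from structured data instead of editing it by hand. A minimal sketch, assuming a dict shape of my own invention (the field names are illustrative, not part of any spec):

```python
# Sketch: render an llms.txt from a plain dict so the file
# always reflects one canonical profile.
def render_llms_txt(brand: dict) -> str:
    lines = [f"# {brand['name']}", "", brand["description"], ""]
    for section, items in brand["sections"].items():
        lines.append(f"## {section}")          # e.g. "## Key URLs"
        lines += [f"- {item}" for item in items]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

profile = {
    "name": "Your Brand",
    "description": "One-line description of what your business does.",
    "sections": {
        "Key URLs": ["Pricing: https://yourdomain.com/pricing",
                     "Docs: https://yourdomain.com/docs"],
    },
}
print(render_llms_txt(profile))
```

Wire this into your build or deploy step and the "stale data" failure mode discussed later largely disappears.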

File 3: ai.txt - The Identity Profile

Concise key=value file at /ai.txt. Faster for engines to parse than llms.txt, and complementary - engines read both.

# ai.txt
name: Your Brand
legal_name: Your Brand Inc.
description: One-line description.
url: https://yourdomain.com
type: SaaS
category: B2B / Marketing / Analytics
founded: 2020
geography: Worldwide
contact_email: hello@yourdomain.com

[products]
- Product A: https://yourdomain.com/product-a
- Product B: https://yourdomain.com/product-b

[pricing]
free: $0
pro: $29/mo
enterprise: contact

[social]
linkedin: https://linkedin.com/company/yourbrand
twitter: https://twitter.com/yourbrand

[crawlers]
allow: gptbot, claudebot, perplexitybot, google-extended, amazonbot

ai.txt is denser than llms.txt - same data, less prose. Engines that parse structured data prefer it; engines that parse markdown prefer llms.txt. Publish both.
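
A minimal reader for the template above shows how little parsing the format needs. Note that the template actually uses `key: value` lines (despite the "key=value" label) plus `[section]` headers and `- Name: URL` list items; this parser is an illustration, not an official spec implementation.

```python
# Sketch: parse the ai.txt template into nested dicts.
def parse_ai_txt(text: str) -> dict:
    data: dict = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]          # e.g. [pricing]
            data[section] = {}
        else:
            body = line[2:] if line.startswith("- ") else line
            key, _, value = body.partition(":")
            target = data[section] if section else data
            target[key.strip()] = value.strip()
    return data

sample = ("name: Your Brand\n[pricing]\npro: $29/mo\n"
          "[social]\nlinkedin: https://linkedin.com/company/yourbrand\n")
print(parse_ai_txt(sample)["social"]["linkedin"])
# → https://linkedin.com/company/yourbrand
```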

File 4: identity.json - The Canonical Identity

JSON-LD file at /identity.json with Schema.org Organization (or Person for solo brands). The most precise of the four files; what engines hand to their knowledge graph builders.

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://yourdomain.com/#organization",
  "name": "Your Brand",
  "legalName": "Your Brand Inc.",
  "alternateName": ["YourBrand", "YB"],
  "description": "One-line description.",
  "url": "https://yourdomain.com",
  "logo": "https://yourdomain.com/logo.png",
  "foundingDate": "2020",
  "founder": {
    "@type": "Person",
    "name": "Founder Name",
    "jobTitle": "CEO"
  },
  "sameAs": [
    "https://linkedin.com/company/yourbrand",
    "https://twitter.com/yourbrand",
    "https://crunchbase.com/organization/yourbrand",
    "https://en.wikipedia.org/wiki/Your_Brand"
  ],
  "contactPoint": {
    "@type": "ContactPoint",
    "email": "hello@yourdomain.com",
    "contactType": "customer service"
  },
  "areaServed": "Worldwide"
}

The killer feature: sameAs[]. Listing your LinkedIn, Crunchbase, Wikipedia, and Twitter URLs lets AI engines disambiguate your entity from similarly-named competitors. Sites with full sameAs[] are cited with the correct brand name 2.1x more often.
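
Before shipping, it is worth a quick completeness check. The sketch below flags missing fields; the `RECOMMENDED` list reflects the fields this article calls high-impact, not a formal Schema.org requirement.

```python
# Sketch: flag recommended identity.json fields that are
# missing or empty, with extra attention on sameAs.
import json

RECOMMENDED = ("@context", "@type", "name", "url", "sameAs")

def identity_gaps(identity_json: str) -> list[str]:
    """Return the recommended fields that are missing or empty."""
    doc = json.loads(identity_json)
    gaps = [field for field in RECOMMENDED if not doc.get(field)]
    same_as = doc.get("sameAs") or []
    if same_as and len(same_as) < 2:
        gaps.append("sameAs: add more profile URLs to disambiguate")
    return gaps
```

Pair this with validator.schema.org (mentioned in the implementation steps): the validator checks syntax and types, while a check like this enforces your own must-have fields.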

A 60-Minute Implementation

Step 1 (10 min) - robots.txt audit. Open yourdomain.com/robots.txt. Make sure GPTBot, Google-Extended, ClaudeBot, PerplexityBot, Amazonbot are NOT in Disallow: / blocks.

Step 2 (20 min) - llms.txt. Copy the template above. Replace placeholders. Validate every URL resolves. Ship to /llms.txt.

Step 3 (15 min) - ai.txt. Copy the template. Replace placeholders. Same data as llms.txt, denser format. Ship to /ai.txt.

Step 4 (15 min) - identity.json. Copy the template. Critical: fill out sameAs[] with all your social and reference URLs. Validate at validator.schema.org. Ship to /identity.json.

Total: 60 minutes for the full four-file surface.
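
The final check of those four steps can be automated. The sketch below reports which of the four files a domain serves; the probe is injectable so the function is testable offline, and the default `head_ok` (a name of my own choosing) does a real HTTP HEAD with the standard library.

```python
# Sketch: audit which of the four identity files a domain serves.
from urllib.request import Request, urlopen

FILES = ["/robots.txt", "/llms.txt", "/ai.txt", "/identity.json"]

def head_ok(url: str) -> bool:
    """True if a HEAD request to `url` returns HTTP 200."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def surface_report(domain: str, probe=head_ok) -> dict[str, bool]:
    return {path: probe(f"https://{domain}{path}") for path in FILES}
```

`surface_report("yourdomain.com")` returns a per-file True/False map; any False is a gap in your surface.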

How Engines Use Them

| Engine | robots.txt | llms.txt | ai.txt | identity.json |
| --- | --- | --- | --- | --- |
| GPTBot (OpenAI) | Yes | Yes | Yes | Yes |
| ClaudeBot (Anthropic) | Yes | Yes | Yes | Yes |
| Google-Extended | Yes | Yes | Partial | Yes |
| PerplexityBot | Yes | Yes | Yes | Yes |
| Amazonbot | Yes | Yes | Partial | Yes |
| Bytespider (TikTok) | Yes | Partial | No | Partial |

Coverage isn't perfect. But every major engine reads at least three of four. The marginal cost of the fourth file is 15 minutes; ship it.

Common Mistakes

  1. Blocking AI crawlers in robots.txt. Makes you invisible to AI search. Don't.
  2. Putting files in subdirectories. They must be at the root. /docs/llms.txt is invisible.
  3. Wrong content-type. Serve llms.txt as text/plain or text/markdown. Serve ai.txt as text/plain. Serve identity.json as application/ld+json.
  4. Stale data. When pricing or products change, update all three identity files. Engines lose trust in stale identity surfaces.
  5. Missing sameAs. Without sameAs[] in identity.json, engines can't disambiguate your brand from similar names.
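
Mistake 3 is the easiest to catch automatically. The sketch below checks the Content-Type each file is served with; the lookup is injected as a callable so the check can run offline, and the expected types follow this article's recommendations.

```python
# Sketch: report identity files served with the wrong MIME type.
EXPECTED_TYPES = {
    "/llms.txt": ("text/plain", "text/markdown"),
    "/ai.txt": ("text/plain",),
    "/identity.json": ("application/ld+json", "application/json"),
}

def content_type_issues(ctype_of) -> list[str]:
    """ctype_of maps a path (e.g. "/ai.txt") to its served MIME type."""
    return [
        f"{path}: served as {ctype_of(path)}, expected one of {expected}"
        for path, expected in EXPECTED_TYPES.items()
        if ctype_of(path) not in expected
    ]
```

In practice `ctype_of` would wrap a HEAD request and read the response's Content-Type header; with a stub it doubles as a unit test for your server config.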

The Bottom Line

The AI identity surface is a 60-minute investment with a 1.6x correct-entity citation lift. Every major AI engine reads at least three of four files. The standards are stable, the templates are public, and the cost is trivial. If you publish only one new thing in 2026, make it the full four-file surface. Then layer Direct Answer Blocks, FAQPage schema, and statistical anchoring on top.

Read next: What Is llms.txt · AEO Complete Guide 2026.

Frequently Asked Questions

Do I need all four files?

Yes if you want full AI visibility. robots.txt controls access, llms.txt provides a long-form site guide, ai.txt provides a concise identity profile, and identity.json provides Schema.org-canonical business identity. Each serves a different purpose, and engines weight them differently. The marginal cost of publishing the missing files is one hour.

Where do these files live?

All four at the root of your domain: yourdomain.com/robots.txt, yourdomain.com/llms.txt, yourdomain.com/ai.txt, yourdomain.com/identity.json. Same level as sitemap.xml. Do not put them in subdirectories or behind auth.

What format does each file use?

robots.txt: plain text, robots directives. llms.txt: markdown. ai.txt: plain text, key=value pairs. identity.json: JSON-LD with Schema.org Organization or Person types. All UTF-8 encoded.

Will publishing these files hurt classic SEO?

No. Search engines do not penalize llms.txt, ai.txt, or identity.json. Google has stated that it reads llms.txt and ai.txt without weighting them directly into rankings. There is no downside.

How do I generate them?

Hand-write them in 1-2 hours using public templates (llmstxt.org for llms.txt, the spec for ai.txt, schema.org for identity.json). Or use a generator - inite.ai's analyzer produces a ready-to-deploy bundle from any URL in 30 seconds.
