You wrote the perfect blog post. ChatGPT will never see it.
That is the situation roughly 27% of B2B SaaS and ecommerce sites are quietly sitting in right now — robots.txt looks fine, the post is indexed in Google, and yet AI crawlers like GPTBot, ClaudeBot, and PerplexityBot are being silently blocked at the CDN layer before they ever reach your origin server (Pravin Kumar). Meanwhile, AI bot traffic just hit 22% of all non-search bot traffic in Q1 2026, and LLM crawlers now hit the average website 3.6x more than Googlebot (Otterly.AI Citation Report 2026).
If your goal is to rank in Google AI Overviews or get cited by ChatGPT, the ranking strategy is downstream of one boring question: can the crawler reach the page? This guide is the technical answer.
## What AI crawler optimization actually means
AI crawler optimization is the practice of configuring your robots.txt, CDN, server, and content structure so that LLM-powered crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others — can fetch, parse, and use your pages as citation sources in AI search results. It is the technical prerequisite to ranking in AI Overviews and earning ChatGPT citations. It has three parts: permitting the right user agents, making sure your CDN does not silently override that permission, and structuring content so a non-JavaScript crawler can actually extract the answer.
This is different from traditional SEO crawl optimization for Googlebot in three ways:
- **The bots multiply.** OpenAI alone runs three distinct user agents (GPTBot, OAI-SearchBot, ChatGPT-User), each with separate rules and separate consequences (OpenAI Bot Docs).
- **They don't run JavaScript.** AI crawlers do not execute client-side code, so anything painted onto the page after load is invisible to them (Aleyda Solis interview).
- **Speed gates extraction.** Pages with First Contentful Paint under 0.4s average 6.7 ChatGPT citations versus 2.1 for pages over 1.13s (Otterly.AI). Slow pages get skipped, not just demoted.
The fix is mostly mechanical. Most teams just have not been told which mechanics matter.
## The 2026 AI crawler landscape (and why it changed)
Two years ago there were two AI bots worth caring about. In May 2026 you should be configuring rules for at least eight. The growth curve is steep enough that AI-blocking by reputable sites jumped from 23% in September 2023 to nearly 60% by May 2025, with the typical "blocking" site now forbidding 15.5 different AI user agents (Originality.ai academic study, cited in PPC Land).
The other half of the story is what is doing the crawling. Cloudflare's 2025 Radar report and follow-up bot analyses show the AI crawler boom is not theoretical:
- 22% of all non-search bot traffic in Q1 2026 came from AI crawlers, up from a negligible share two years ago (TechnologyChecker).
- LLM bots (ChatGPT-User, GPTBot, ClaudeBot, and others) now crawl 3.6x more pages than Googlebot on the average site (Otterly.AI).
- 49.4% of news publishers block GPTBot — the highest block rate of any category — while only 11.7% of general domains do (Cloudflare Radar, summarized by Ahrefs).
- OpenAI's own SearchBot now has 55% coverage of the indexable web, up from near zero in 2024 (ALM Corp analysis).
Here is the cheat sheet for the user agents that actually matter in 2026:
| Bot | Operator | Purpose | What blocking it costs you |
|---|---|---|---|
| GPTBot | OpenAI | Training data for future models | Future model knowledge of your brand |
| OAI-SearchBot | OpenAI | Live ChatGPT citations | Direct citations in ChatGPT search |
| ChatGPT-User | OpenAI | On-demand fetch when a user pastes your URL | Users can't share your link inside ChatGPT |
| ClaudeBot | Anthropic | Training crawler | Future Claude model coverage |
| Claude-User | Anthropic | User-initiated fetch | Same as ChatGPT-User, for Claude |
| Claude-SearchBot | Anthropic | Claude web search citations | Citations in Claude's search tool |
| PerplexityBot | Perplexity | Live answer citations | Citations in Perplexity |
| Google-Extended | Google | Gemini & AI Overviews training | Inclusion in Gemini training data |
Two practical implications. First, Claude-Web and anthropic-ai are dead strings; sites that disallowed only those are not actually blocking Anthropic (ALM Corp). Second, GPTBot and OAI-SearchBot are independent toggles — you can disallow training while allowing live citations, or vice versa.
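For example, a site that wants out of training corpora but still wants live citations can flip the two toggles independently. A minimal robots.txt fragment (paths and agents as documented by OpenAI):

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

# Still allow live ChatGPT search citations
User-agent: OAI-SearchBot
Allow: /
```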
## The 4-layer AI crawler stack
Most AI crawler advice is one tip in a vacuum. The reality is that crawler access is a stack — and a failure at any layer kills the bot's request. Call it the 4-Layer AI Crawler Stack:
1. **Permission layer** — robots.txt, meta tags, and HTTP headers tell well-behaved bots what they can fetch.
2. **Network layer** — your CDN, WAF, and bot management rules decide whether the request reaches your server at all.
3. **Render layer** — your hosting and templating decide whether the bot sees real content or a JavaScript shell.
4. **Content layer** — your markdown, schema, and llms.txt decide whether the crawler can extract a clean answer from what it sees.
Most ranking advice optimizes layer 4 while layer 2 is silently returning a 403. Walk down the stack in order. If a bot is not citing you, the failure is almost always at the lowest unfixed layer.
## Layer 1: The robots.txt file every AI-friendly blog should ship
Robots.txt is the easiest layer and the one teams get wrong most often. Anthropic, OpenAI, and Perplexity all publicly commit to honoring it (Anthropic crawler docs). The trick is naming the right user agents.
Here is the copy-paste starter that allows everything important and explicitly blocks the bots you almost certainly do not want — like archivers and bulk scrapers:
```
# AI search & citation bots — ALLOW
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

# Bulk training scrapers you probably want to block
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

# Sitemap pointer (critical)
Sitemap: https://yourdomain.com/sitemap.xml
```

Three rules of thumb that save you weeks of debugging:

1. **One block per user agent.** Many sites use `User-agent: *` and assume it covers AI bots. It does, but a group that names a bot's specific user agent overrides the wildcard for that bot, so be explicit for every agent you care about.
2. **Include the Sitemap directive.** OpenAI's documentation flags sitemap presence as a discovery signal, and Perplexity's bot uses it for fan-out crawling.
3. **Wait 24 hours.** OpenAI's published propagation time for robots.txt changes is roughly a day (Mersel AI guide).
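Before waiting a day on propagation, you can verify the rules parse the way you intend with Python's standard `urllib.robotparser`. A local sketch — the sample file and URL are placeholders, not a live fetch:

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt mirroring the starter file above
ROBOTS = """
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /
"""

def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(can_fetch(ROBOTS, "GPTBot", "https://yourdomain.com/blog/post"))  # True
print(can_fetch(ROBOTS, "CCBot", "https://yourdomain.com/blog/post"))   # False
```

Swap the `ROBOTS` string for your real file (or `parser.set_url(...)` plus `parser.read()`) and run it for every agent in the table above.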
If you publish to a Quillly subdirectory, the robots.txt at your root domain governs /blog/* automatically. That is the practical advantage of publishing to yourdomain.com/blog rather than a subdomain — your existing robots.txt and domain authority carry over directly to every post.
## Layer 2: The Cloudflare layer most teams forget
This is where the silent failures happen. Approximately 27% of B2B SaaS and ecommerce sites are accidentally blocking major LLM crawlers at the CDN layer despite having a perfectly correct robots.txt (Pravin Kumar analysis). The pattern is consistent: a one-click "Manage AI bots" managed rule, or a default WAF policy that treats any data-center IP plus a non-cookie-accepting client as suspicious.
The result is brutal. AI crawlers hit your CDN, get a 403 or a managed challenge, and never reach your origin. Your robots.txt says "come on in." Your CDN says "denied." The bot never finds out which one is real.
To audit Cloudflare specifically:
1. Open Security → WAF → Managed rules and look for any rule referencing "AI Crawlers" or "AI bots" — disable it or set it to "Log" if you want AI traffic.
2. Open Security → Bots → Configure Super Bot Fight Mode and confirm "AI Scrapers and Crawlers" is set to Allow, not Block or Challenge.
3. Check Caching → Configuration → Caching Level for any AI-bot-specific rules added by recent Cloudflare default changes.
4. Spot-check your origin logs for `User-Agent` strings matching `GPTBot`, `ClaudeBot`, or `PerplexityBot`. If none appear and your CDN logs show 403s on those agents, you have the silent block.
Other common CDN culprits: AWS WAF managed rule groups, Akamai Bot Manager's AI category, Imperva, and Vercel's optional "AI bot challenge." Each has its own toggle.
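Whatever the vendor, the audit reduces to comparing what robots.txt promises with what the edge actually returns. A hedged sketch of that logic — the status-code buckets are assumptions for illustration, not any CDN's documented behavior:

```python
def diagnose_layer(robots_allows: bool, edge_status: int) -> str:
    """Guess which layer of the stack blocked a bot, given whether
    robots.txt allows the agent and the HTTP status the CDN edge
    returned for a request sent with that bot's User-Agent."""
    if not robots_allows:
        return "layer 1: robots.txt disallows this agent"
    if edge_status in (403, 429, 503):
        # robots.txt says yes, edge says no: the silent CDN block
        return "layer 2: CDN/WAF blocks despite robots.txt"
    if 200 <= edge_status < 300:
        return "request reached the origin; check layers 3-4"
    return f"unexpected status {edge_status}; inspect CDN logs"

print(diagnose_layer(True, 403))  # the silent-block case
```

Feed it the status codes you see when curling your own pages with each bot's User-Agent string.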
## Layer 3: Render the answer in HTML, not JavaScript
This is where Aleyda Solís spends most of her keynotes. AI crawlers do not execute JavaScript; they ingest only the raw markup. Sites that rely on client-side rendering lose entire menus, product details, pricing tables, and conversion paths from the bot's view (Humans of Martech interview).
"AI crawlers are exposing the technical debt of a decade of JavaScript-first front-ends. Many teams discover the gap only when their citations disappear." — Aleyda Solís, International SEO and AI Search Consultant (Humans of Martech, Jan 2026)
The practical fix has three flavors:
1. **Server-side rendering (SSR).** Next.js with `getServerSideProps` or app router server components. Nuxt with `ssr: true`. SvelteKit's default.
2. **Static site generation (SSG).** Markdown rendered to HTML at build time. Astro, Hugo, Jekyll, and most headless blog stacks fall here. This is what Quillly's blog renderer ships by default — every published post is fully rendered HTML at the URL the bot fetches.
3. **Pre-rendering.** Tools like Prerender.io or Rendertron serve a pre-rendered HTML snapshot when a known bot user agent appears. Workable, but it adds an extra moving part.
Page speed compounds the render question. Pages with First Contentful Paint under 0.4 seconds average 6.7 ChatGPT citations versus 2.1 citations for pages over 1.13 seconds — likely because ChatGPT's extractor has a soft timeout (Otterly.AI 2026 citation report). Aggressive image lazy-loading and font subsetting are not vanity metrics; they directly affect how much content the bot can grab before moving on.
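A quick way to see your page the way a non-JavaScript crawler does is to check whether the answer text already exists in the raw, un-executed HTML. A minimal sketch — the marker phrase and sample pages are illustrative placeholders:

```python
def raw_html_contains(html: str, marker: str) -> bool:
    """True if the marker text is present in raw HTML, before any JS runs."""
    return marker.lower() in html.lower()

# SSR/SSG output carries the answer; a client-rendered shell does not.
ssr_page = "<html><body><h1>Pricing</h1><p>Plans start at $29/mo.</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(raw_html_contains(ssr_page, "$29/mo"))   # True
print(raw_html_contains(spa_shell, "$29/mo"))  # False
```

In practice you would fetch the live URL with a plain HTTP client (no headless browser) and test for the key sentence each post must surface.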
## Layer 4: llms.txt, schema, and the content layer
Once the bots can reach your page and read the HTML, you decide how easy you make it for them to extract the answer.
llms.txt is the proposed standard for telling AI crawlers what content matters. It is a Markdown file at https://yourdomain.com/llms.txt listing your most important pages with one-line descriptions (Search Engine Land coverage). Adoption is mixed — roughly 10% of analyzed sites have shipped one, and major crawlers do not yet fetch it in volume (AEO.press State of llms.txt 2026). The contrarian read is to ship it anyway. Cost is half a day, every IDE-agent ecosystem already reads it, and the moment a major LLM provider flips the switch you are correct by default. Anthropic, Stripe, Cursor, Cloudflare, and Vercel already ship one.
A minimal blog-friendly llms.txt looks like this:
```
# Yourdomain Blog

> Practical guides for [your niche]. Updated weekly.

## Featured Posts

- [How we cut churn 40% in 90 days](https://yourdomain.com/blog/churn-case-study): Step-by-step playbook with numbers
- [The honest review of [Tool X]](https://yourdomain.com/blog/tool-x-review): Independent benchmark

## Documentation

- [Pricing](https://yourdomain.com/pricing)
- [Changelog](https://yourdomain.com/changelog)
```

Schema markup is the second extraction lever. The correlation with AI citations is small but consistently positive across studies (Megrisoft AI Citation Ranking Factors). Three schemas matter most for blogs: Article (or BlogPosting), FAQPage, and HowTo. Quillly auto-generates FAQPage schema from any Q/A-formatted FAQ section, so writing in that format pays double — readers scan it, machines parse it.
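For a sense of what the machine side looks like, here is FAQPage JSON-LD generated from Q/A pairs. The field names follow schema.org; the helper itself is a hypothetical sketch, not Quillly's implementation:

```python
import json

def faq_jsonld(pairs):
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }, indent=2)

print(faq_jsonld([
    ("Does GPTBot run JavaScript?", "No. It reads the raw HTML only."),
]))
```

The output goes in a `<script type="application/ld+json">` tag in the page head.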
Markdown structure is the underrated lever. Sections of 120–180 words between headings earn 4.6 average ChatGPT citations versus 2.7 for sections under 50 words (Otterly.AI). Lead each section with the answer in the first sentence. Use bullet lists and tables. AI crawlers extract structured chunks; reward them with structure.
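Those length targets are easy to audit mechanically. A rough sketch that splits a markdown draft on headings and flags sections outside the 120-180-word band reported by Otterly.AI (the thresholds and sample document are illustrative):

```python
import re

def section_word_counts(markdown: str) -> dict:
    """Map each heading to the word count of the body text beneath it."""
    counts, heading = {}, "(intro)"
    # Split on heading lines, keeping them via the capturing group
    for chunk in re.split(r"^(#{1,6} .+)$", markdown, flags=re.M):
        chunk = chunk.strip()
        if re.match(r"^#{1,6} ", chunk):
            heading = chunk.lstrip("# ")
        elif chunk:
            counts[heading] = counts.get(heading, 0) + len(chunk.split())
    return counts

doc = "## Short\nToo thin.\n\n## Good\n" + ("word " * 150).strip()
for heading, n in section_word_counts(doc).items():
    verdict = "ok" if 120 <= n <= 180 else "expand or split"
    print(f"{heading}: {n} words -> {verdict}")
```

Run it over a draft before publishing and rework any section the loop flags.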
