
AI Crawlers Now Outnumber Googlebot 4:1 on Brand Pages. Here's the 15-Minute Infrastructure Check.

Optimly's March 2026 baseline shows AI crawlers send 4.3x more requests than Googlebot to brand pages. Most B2B teams haven't audited what those crawlers see. This is the infrastructure check that takes 15 minutes and fixes the gap before it costs you citations.

Christian Lehman

Your marketing team spent two quarters building content for AI search. Structured headings, sourced statistics, answer-first formatting. Then someone ran a server log audit and discovered GPTBot was blocked at the network level before it ever read a single page.

That scenario is playing out across thousands of B2B sites right now, and Optimly's first "State of AI Brand Crawling" baseline, published March 29, 2026, quantifies exactly how much traffic is at stake. Across 5,829 tracked brand pages, AI crawlers sent 19,454 requests per week, against 6,730 from traditional search engine crawlers as a group and roughly 4.3 times the volume from Googlebot alone. OpenAI by itself accounted for more requests than all search engines combined.

Christian Lehman's read on this data: the infrastructure layer is now the single fastest way to gain or lose AI visibility, because it operates before any content optimization matters. You can have the best-structured FAQ page in your category. If your WAF blocks ClaudeBot at the edge, that page does not exist for Claude.

What the Optimly data actually shows

The platform breakdown from the March 22-28 tracking window tells you where the crawl volume is concentrated.

Platform      Weekly requests   Primary crawler      Share of AI traffic
OpenAI        10,816            GPTBot/1.3           55.6%
Anthropic     4,669             ClaudeBot/1.0        24.0%
Amazon        4,366             Amzn-SearchBot       22.4%
Perplexity    1,699             PerplexityBot/1.0    8.7%
ByteDance     15                Bytespider           0.1%

Source: Optimly, State of AI Brand Crawling, March 2026, Cloudflare server logs and directory data, week of March 22-28, 2026.

OpenAI is running two separate crawlers that do different things. GPTBot collects data for model training and improvement. OAI-SearchBot handles real-time retrieval when a ChatGPT user triggers live web search during a conversation. Christian Lehman flags this because most B2B sites that have a GPTBot rule in their robots.txt do not have an OAI-SearchBot directive. That means you can allow model training but still be invisible in ChatGPT's live search results, which is the surface that actually sends referral traffic.
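
Closing that gap is a two-stanza robots.txt change. A minimal sketch, assuming you want to grant both crawlers full access; a site could instead scope each one to specific directories:

# Model training and improvement
User-agent: GPTBot
Allow: /

# Real-time retrieval for ChatGPT live web search
User-agent: OAI-SearchBot
Allow: /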

Cloudflare's own July 2025 analysis of crawler trends confirmed this trajectory. GPTBot's share of AI crawler traffic grew 305% year over year, moving from the ninth-ranked crawler to third. PerplexityBot saw a 157,490% increase in raw request volume over the same period. These are not projections. This is the measured volume hitting brand pages today.

Why most B2B sites are blocking AI crawlers without knowing it

The problem is rarely intentional. It shows up in three places.

Wildcard robots.txt rules. A blanket User-agent: * block with Disallow: / shuts out every AI crawler by default. Unless you add an explicit Allow override for each AI user-agent, as in the sketch below this list, you are invisible. Cloudflare's data showed that of the 3,816 top-10,000 domains with robots.txt files, only 546 had any AI-specific directives at all. The majority of sites have not even addressed the question.

WAF and bot protection settings. Cloudflare, Sucuri, and similar services treat unknown bots the same way they treat scrapers. If your bot management is configured aggressively, OAI-SearchBot and ClaudeBot get blocked at the network level before robots.txt is even consulted. A Fuel Online technical SEO audit of 730 sites found that sites blocking GPTBot were cited 73% less often in ChatGPT responses compared to similar sites that allowed it.

Missing user-agents. There are at least 14 distinct AI crawler user-agents that should have explicit rules. Most sites that have addressed AI crawlers at all have configured one or two, typically GPTBot and maybe Google-Extended. They are missing ChatGPT-User, OAI-SearchBot, Claude-SearchBot, anthropic-ai, Perplexity-User, Meta-ExternalAgent, Meta-ExternalFetcher, Applebot-Extended, and Amazonbot. Each missing entry is a platform where your content cannot be retrieved.
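
The override pattern referenced above looks like this. It is a sketch, not a policy recommendation; whether to allow each crawler is your team's call, and the pattern repeats for every user-agent in the list, since each crawler follows only the most specific User-agent group that matches its name:

# Specific groups take precedence over the wildcard for these crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything else stays blocked
User-agent: *
Disallow: /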

The Optimly data adds a dimension that makes this more urgent. Brand profile pages receive the highest crawl frequency from AI systems. These are the entity pages that models use to build brand representations. If those pages are inaccessible, the AI model's understanding of your brand is assembled from whatever fragments it finds elsewhere, which is how the mispositioning problem starts.

The 15-minute infrastructure check

Run this against your primary domain right now. No tools required beyond a browser and your server access.

Step 1: Read your robots.txt (2 minutes). Navigate to yourdomain.com/robots.txt. Search the file for each of these user-agents: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Amazonbot, Applebot-Extended, Meta-ExternalAgent, Meta-ExternalFetcher. Count how many are present with explicit Allow directives. If fewer than 10 are listed, you have gaps. If a wildcard Disallow rule exists without per-bot overrides, you are blocking everything.
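
If you would rather script the count than eyeball it, here is a quick shell sketch (yourdomain.com is a placeholder, and the check only confirms each name appears; you still need to read whether its rule is Allow or Disallow):

robots=$(curl -s https://yourdomain.com/robots.txt)
for bot in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot anthropic-ai PerplexityBot Perplexity-User Google-Extended Amazonbot Applebot-Extended Meta-ExternalAgent Meta-ExternalFetcher; do
  # Match "User-agent: <bot>" with any spacing, case-insensitively
  echo "$robots" | grep -qi "user-agent: *$bot" && echo "present: $bot" || echo "MISSING: $bot"
done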

Step 2: Check your WAF or CDN bot management (5 minutes). Log into your Cloudflare, Sucuri, or equivalent dashboard. Navigate to the bot management or firewall rules section. Look for any rules that categorize "AI bots" or "scrapers" as threats, and for rate-limiting rules that would throttle high-volume crawlers. If OpenAI's crawlers are sending 10,000+ requests per week, as Optimly's data suggests, aggressive rate limiting reads that volume as an attack and cuts it off.
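
One rough probe you can run from outside the dashboard is to request a page while presenting an AI crawler's user-agent and compare the response to a normal request. Treat the result as a hint, not proof: services like Cloudflare can verify real crawlers by IP range, so a spoofed user-agent from your machine may be handled differently than the genuine bot. The UA string here is simplified; the real crawlers send fuller ones:

curl -s -o /dev/null -w "default UA: %{http_code}\n" https://yourdomain.com/
curl -s -o /dev/null -w "GPTBot UA:  %{http_code}\n" -A "GPTBot/1.3" https://yourdomain.com/

A 403 or 429 on the second request where the first returns 200 points to a user-agent-based block at the edge.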

Step 3: Verify with server logs (5 minutes). If you have access to raw server logs, run a grep for AI crawler user-agents:

grep -oE "GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|ChatGPT-User" /var/log/nginx/access.log | sort | uniq -c | sort -rn

If specific crawlers that should be hitting your site are absent from the logs, the block is happening at the network level, not in robots.txt.
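
If the crawlers do appear, also check what they received. A sketch assuming the default nginx combined log format, where the status code is the ninth field:

grep "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -rn

A crawler that shows up with mostly 403 or 429 responses is being blocked inside your stack even though its requests reach the server.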

Step 4: Test your llms.txt (3 minutes). Check whether yourdomain.com/llms.txt exists. This newer file format gives AI crawlers a plain-text summary of your site structure and preferred content locations. It is not required, but it reduces ambiguity for models trying to understand what content you want them to prioritize. If you do not have one, creating it is a 10-minute task that removes a layer of uncertainty from every AI crawler visit.
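
There is no enforced standard yet, but the commonly proposed llms.txt format is a short markdown file: an H1 title, a one-line summary, and links to the pages you most want models to read. A sketch in which every name and path is a placeholder:

# Acme Analytics
> B2B analytics platform for revenue teams. Core company, product, and pricing pages are listed below.

## Company
- [About Acme](https://yourdomain.com/about): what we do and who we serve
- [Product overview](https://yourdomain.com/product): features and integrations

## Resources
- [Pricing](https://yourdomain.com/pricing): current plans
- [Docs](https://docs.yourdomain.com): technical documentation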

What changes after the fix

The Optimly data shows that AI crawlers are not just reading your content. They are building entity representations from it. Brand profile pages get the highest crawl frequency. Category pages get moderate frequency. Blog content gets lower AI crawl frequency but higher search-retrieval frequency from the real-time search crawlers like OAI-SearchBot and Claude-SearchBot.

That pattern tells you where to prioritize. The pages that define what your company does, what category you belong to, and what problems you solve are the pages AI crawlers visit most. If those pages are structurally clean, schema-marked, and factually dense, the AI model's representation of your brand gets more accurate with every crawl cycle.
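
"Schema-marked" in practice usually means JSON-LD in the page head. A minimal Organization sketch for a brand profile page, with every value a placeholder to swap for your own:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Acme Analytics",
  "url": "https://yourdomain.com",
  "description": "B2B analytics platform for revenue teams.",
  "sameAs": [
    "https://www.linkedin.com/company/acme-analytics",
    "https://www.crunchbase.com/organization/acme-analytics"
  ]
}
</script>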

AirOps research on citation durability found that pages not updated within 90 days are 3x more likely to lose AI citations. The infrastructure layer compounds on top of that: a page that is both fresh and accessible earns citations. A page that is fresh but blocked earns nothing.

Ahrefs' analysis of 75,000 brands found that brand web mentions correlate 0.664 with AI Overview visibility, compared to 0.218 for backlinks. That 3x difference measures what happens after AI crawlers access your content and the content about you. It does not measure the brands that were never crawled at all. Those brands are not in the dataset. They are not in the competition.

The infrastructure layer is now your first citation gate

Christian Lehman's take on where this data lands for operators: content optimization, entity work, and earned media all matter. But none of them produce results if the infrastructure gate is closed. A blocked crawler means every dollar you spent on GEO-optimized content, every editorial placement you earned, and every schema markup you implemented has zero chance of being retrieved by the platform that would have cited it.

This is where the broader framing matters. As Jaxon Parrott has argued, the individual optimization layers, whether you call them GEO, AEO, or technical SEO, are tactics within a larger system. The mechanism that gets a brand cited in AI answers runs through a chain: earned media in publications AI engines trust, entity clarity that AI engines can resolve, and content that AI engines can retrieve. The infrastructure layer is the retrieval step. Skip it, and the chain breaks at the last link.

That chain is what Machine Relations names as a discipline: the operating system for earning AI citations through third-party credibility, entity architecture, and citation infrastructure working together. The infrastructure check you run today determines whether everything upstream of it, the earned authority you have already built, actually reaches the models that would have cited it.

The GEO-16 framework (Kumar et al., arXiv, September 2025) identified metadata freshness (r=0.68), semantic HTML (r=0.65), and structured data (r=0.63) as the three on-page signals most correlated with cross-engine citation. Pages meeting the quality threshold achieved a 78% cross-engine citation rate. But those correlations assume the page was crawled. The infrastructure check is what makes the GEO-16 score matter.

Run the four-step check this afternoon. It takes less time than a standup meeting and it determines whether everything else you are doing for AI visibility actually reaches the platforms where your buyers are researching.

Run an AI Visibility Audit to see how your brand currently appears across AI engines and which platforms are retrieving your content.
