GPTBot, ChatGPT-User, OAI-SearchBot: 4 Classes

4 classes of AI bots crawl your site: training, indexing, real-time fetchers, and autonomous agents. How to configure robots.txt + WAF for each.

Max Tsygankov · Founder, Crawloria

Published May 7, 2026 · 13 min read

TL;DR

AI bots split cleanly into four classes that behave very differently: training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended), search-index crawlers (OAI-SearchBot, PerplexityBot), real-time fetchers (ChatGPT-User, Claude-User, Google-Agent), and autonomous agents (ChatGPT Operator, Claude for Chrome, Perplexity Comet).
Each class treats robots.txt differently: training and search-index bots obey it, real-time fetchers may not (their fetches are user-initiated), and autonomous agents mostly send no distinguishable user-agent because they drive a real browser.
Vercel measured GPTBot at 569M monthly requests and Claude at 370M against Googlebot's 4.5B — AI crawlers now generate roughly 28% of Googlebot's volume, with both ChatGPT and Claude wasting 34% of their fetches on 404 pages.
The right policy is class-specific: allow Class 1 if you want representation in trained models, always allow Class 2 to be cited in AI search, treat Class 3 like ordinary user traffic, and design Class 4 traffic to reach a human payment confirmation. One blanket rule wastes most of the upside.

A site owner opens their server logs and sees GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, and CCBot all hitting the same week. Most existing guides answer "should you block GPTBot?" — a single decision applied to one bot. That framing misses the point. These bots are not interchangeable. Training crawlers, search-index crawlers, real-time fetchers, and autonomous agents are four separate classes with four separate behaviors and four separate appropriate responses.

This article is the taxonomy and the playbook. It defines the four classes against the 2026 landscape (including Google's Google-Agent user-triggered fetcher, added to Google's crawler list on March 20, 2026), shows the user-agent strings, robots.txt behavior, and rendering capability of each, and gives a concrete action checklist by class.

Why distinguish between four classes of AI bots?

Because the same robots.txt rule that helps with one class actively hurts you with another. Disallow GPTBot and you opt out of OpenAI's training set. That same disallow does nothing to stop ChatGPT-User from fetching your page when a user asks ChatGPT a question that lands on you, because those requests are user-initiated and not subject to robots.txt. And neither rule has any effect on Operator or Comet, which drive a real Chromium-based browser and never consult robots.txt.

The four classes differ on five dimensions that matter for site operation:

Class	Examples	User-agent identifies bot?	Respects robots.txt?	JavaScript rendering?	Visible in pageviews?
1. Training crawlers	GPTBot, ClaudeBot, CCBot, Google-Extended	Yes	Yes	Mostly no	No (filtered by bot UA)
2. Search-index crawlers	OAI-SearchBot, PerplexityBot	Yes	Yes	Limited	No
3. Real-time fetchers	ChatGPT-User, Claude-User, Google-Agent	Yes	No (user-initiated)	Mostly no	No
4. Autonomous agents	ChatGPT Operator, Claude for Chrome, Perplexity Comet	Mostly no (regular browser UA; Comet is the exception)	N/A (drives a browser)	Yes (real Chrome)	Yes (looks like human)

The blocking decisions follow from this table, not the other way around. Saying "I want to block AI bots" is incoherent unless you say which class. Below, I walk through each.
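As a rough illustration of how the table maps onto request handling, the sketch below classifies a raw user-agent string by substring match. The tokens come from the table; the names `BOT_CLASSES` and `classify_user_agent` are hypothetical and chosen only for this example. Google-Extended is left out because it is a robots.txt token that never appears in a request header, and Class 4 cannot be detected this way because those agents send an ordinary browser user-agent.

```python
from typing import Optional

# Substring tokens from the table above, mapped to the article's class numbers.
BOT_CLASSES = {
    "GPTBot": 1, "ClaudeBot": 1, "CCBot": 1,
    "OAI-SearchBot": 2, "PerplexityBot": 2,
    "ChatGPT-User": 3, "Claude-User": 3, "Google-Agent": 3,
}


def classify_user_agent(user_agent: str) -> Optional[int]:
    """Return 1, 2, or 3 for a known AI bot token, otherwise None.

    None covers human visitors and Class 4 agents alike.
    """
    ua = user_agent.lower()
    for token, bot_class in BOT_CLASSES.items():
        if token.lower() in ua:
            return bot_class
    return None
```

A `None` result should not be read as "human": it only means the request carried none of the listed tokens.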

Class 1: Training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended)

These crawlers exist to collect data for model training. They do not generate user-facing visibility on their own. What they do generate is what the model "knows" about your brand the next time someone asks ChatGPT, Claude, or Gemini for a recommendation.

The four major training crawlers in 2026:

GPTBot (OpenAI) — UA: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot. Used "to make generative AI foundation models more useful and safe." Respects robots.txt. IP ranges published at openai.com/gptbot.json. As of late 2025, GPTBot was the most-blocked crawler in robots.txt files on the web, with roughly 3.5% of sites blocking it. For the full breakdown of what it fetches, how to verify it, and whether to allow it, see the GPTBot deep dive.
ClaudeBot (Anthropic) — Anthropic's training crawler. Respects robots.txt. Identified by ClaudeBot in the user-agent.
CCBot (Common Crawl) — UA: CCBot/2.0 (https://commoncrawl.org/faq/). Common Crawl is a non-profit that maintains an open repository of web data. Because that data seeds many AI training datasets, blocking CCBot indirectly opts you out of more model training than blocking any single AI vendor's crawler.
Google-Extended — A standalone robots.txt token (not a separate user-agent string in HTTP requests). Lets publishers control whether their content is used to train Bard, Gemini, and Vertex AI generative APIs. Launched September 2023. Critically, blocking Google-Extended does not affect Google Search ranking — it is a separate opt-out lever for AI training only.

Vercel's 2025 measurements show GPTBot at 569 million monthly requests against Googlebot's 4.5 billion. Claude was at 370 million. Combined AI crawler volume reached approximately 28% of Googlebot's. The trend is clear: training crawlers are now a meaningful fraction of total bot traffic.
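The shares behind these figures are easy to recompute. The sketch below uses only the three request counts quoted above; the variable names are illustrative.

```python
# Monthly request counts as quoted in the paragraph above.
GOOGLEBOT = 4_500_000_000
CRAWLERS = {"GPTBot": 569_000_000, "Claude": 370_000_000}


def share_of_googlebot(requests: int) -> float:
    """Express a request count as a percentage of Googlebot's volume."""
    return 100 * requests / GOOGLEBOT


shares = {name: round(share_of_googlebot(n), 1) for name, n in CRAWLERS.items()}
combined = round(share_of_googlebot(sum(CRAWLERS.values())), 1)

print(shares)    # {'GPTBot': 12.6, 'Claude': 8.2}
print(combined)  # 20.9
```

GPTBot and Claude together come to about 21% of Googlebot's volume, so the 28% combined figure presumably also counts AI crawlers that are not itemized here.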

Rendering: most training crawlers do not execute JavaScript. They fetch HTML and parse text. If your homepage is client-side-rendered (a React app that builds the page only in the browser), GPTBot and ClaudeBot see a near-empty document. Vercel reports that 57.7% of the ChatGPT crawler's fetches are HTML and only 11.5% are JavaScript files, which it downloads without executing, so JS-heavy pages get partially indexed at best.
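One way to approximate what a non-rendering crawler receives is to extract the text that exists in the raw HTML before any script runs. The following is a minimal sketch using only the standard library; `visible_text` is a hypothetical helper, and real crawlers may well extract text differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring content a non-JS reader would not see as text."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text present in the raw HTML, without executing anything."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A client-side-rendered shell such as `<div id="root"></div>` followed by a script tag yields an empty string here, while a server-rendered page yields its headings and body copy. Running this against the raw response for your key pages gives a quick signal of how much content depends on JavaScript.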

The decision: allow Class 1 unless you have a specific reason to opt out (legal, content licensing strategy, server cost). Blocking forfeits training-set inclusion, which over the next 1-3 years is the single biggest factor in whether a model "knows" your brand. Block selectively — for example, allow GPTBot but not CCBot if you specifically don't want third-party datasets including your content.

Class 2: Search-index crawlers (OAI-SearchBot, PerplexityBot)

These are the bots that crawl the web specifically to power AI-driven search and citation. When a user asks ChatGPT a question that triggers ChatGPT Search, the answer is built from pages OAI-SearchBot has indexed. Same for Perplexity.

The major search-index crawlers:

OAI-SearchBot (OpenAI) — UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot. Powers ChatGPT search. Respects robots.txt. IP ranges at openai.com/searchbot.json. Importantly, GPTBot and OAI-SearchBot are separate user-agents. A site that allowed GPTBot thinking it covered everything OpenAI does will still block OAI-SearchBot if its robots.txt also has a catch-all User-agent: * group with Disallow: / and no group naming OAI-SearchBot. The position of the catch-all group in the file does not matter. The OAI-SearchBot deep dive walks through this separation and how to confirm both are allowed.
PerplexityBot (Perplexity) — Indexes pages for Perplexity's AI search results. Respects robots.txt. See the PerplexityBot deep dive for its user-agent, verification, and how it differs from Perplexity's user-initiated fetcher.
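The GPTBot versus OAI-SearchBot pitfall can be reproduced locally with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the two robots.txt bodies and the `allowed` helper are made up for the example, and the standard-library parser matches agent tokens by substring, which may differ in detail from how each vendor's crawler parses the file.

```python
from urllib.robotparser import RobotFileParser

# Allows GPTBot by name, then blocks everything else with a catch-all group.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

# Same policy, with OAI-SearchBot added to the allowed group.
FIXED_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Report whether `agent` may fetch `url` under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With the first file, `allowed(ROBOTS_TXT, "GPTBot")` is `True` but `allowed(ROBOTS_TXT, "OAI-SearchBot")` is `False`, because the search crawler falls through to the catch-all group. With the second file both return `True`.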

The blocking calculus here is different from Class 1. Disallowing search-index crawlers has no upside. You do not pay any cost for being indexed in ChatGPT Search or Perplexity — unlike training, there is no IP licensing concern, no model-knows-your-content question. Refusing search-index bots simply removes you from the answer set the next time a relevant question is asked. This is the worst trade in the entire taxonomy. Allow these unconditionally.

The decision: allow all Class 2 crawlers always. The single best free investment in AI visibility is to ensure OAI-SearchBot and PerplexityBot can fetch every public page on your site. If you have a Cloudflare "Block AI Bots and Scrapers" rule enabled, it likely catches these too, so turn it off or add an exception for Class 2.

Class 3: Real-time fetchers (ChatGPT-User, Claude-User, Google-Agent)

This is where most existing AI-bot guides get the framing wrong. Real-time fetchers are not crawlers in the traditional sense — they are user-initiated fetches that happen at inference time, when a specific user asks the AI a specific question that requires a specific page. The model retrieves the page, reads it, and uses it to answer that one query.

The three major real-time fetchers:

ChatGPT-User (OpenAI) — UA: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot. Handles user-initiated requests in ChatGPT and Custom GPTs. Does not reliably respect robots.txt. OpenAI's documentation states: "Because these actions are initiated by a user, robots.txt rules may not apply." This is the most important fact in the entire taxonomy.
Claude-User (Anthropic) — Anthropic's user-initiated fetch agent. Same posture: triggered by a user query, not by a crawl scheduler.
Google-Agent (Google) — Added to Google's crawler list on March 20, 2026. Identifies requests from AI agents running on Google infrastructure, including Project Mariner. User-triggered fetcher.

The robots.txt non-compliance is not malicious. The argument is that when a human user types a question into ChatGPT and that question requires fetching your page, the fetch is logically equivalent to that human pasting your URL into a browser — a personal, one-off retrieval, not a systematic crawl. robots.txt was designed to govern systematic crawling.

Practical implication: you cannot block Class 3 with robots.txt alone. If you are determined to refuse these requests, you need server-side blocks based on the user-agent string or IP range. Most sites should not — refusing real-time fetches means the AI cannot cite you in answer to a real user question that's already directed at you.
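For sites that do decide to refuse Class 3, the server-side check might look like the sketch below. It is illustrative only: `should_refuse` is a hypothetical helper, and the CIDR block is a reserved documentation range standing in for whatever ranges the vendor actually publishes.

```python
import ipaddress

# User-agent tokens for the Class 3 fetchers named above, lowercased.
BLOCKED_UA_TOKENS = ("chatgpt-user", "claude-user", "google-agent")

# Placeholder network (RFC 5737 documentation range), NOT a real vendor range.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def should_refuse(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked token or a blocked network."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

In a real deployment this logic usually lives in the WAF or CDN rather than in application code, and a match would translate to an HTTP 403.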

The decision: treat Class 3 traffic like high-quality referral traffic. Each ChatGPT-User fetch represents a specific user with specific intent who has already decided to learn about your topic. The fetch is happening because the AI is preparing to answer a question. Allowing this traffic and serving it well is the most direct path from AI search to human visit.

Class 4: Autonomous agents (ChatGPT Operator, Claude for Chrome, Perplexity Comet)

The fourth class is structurally different from the first three. Operator, Claude for Chrome, and Comet are not crawlers — they are AI agents that drive a real Chrome browser, click buttons, fill forms, and try to complete a task on the user's behalf. From your server's perspective, the request looks like ordinary Chrome traffic, because it is ordinary Chrome traffic.

The major autonomous agents in 2026:

ChatGPT Operator (OpenAI) — Cloud-hosted browser. Runs in OpenAI infrastructure with a fresh browser session, no cookies, no signed-in account. From your server, it looks like a Chrome request from an OpenAI-controlled IP.
Claude for Chrome (Anthropic) — Browser extension running inside the user's own Chrome. Uses the user's cookies, logins, and session state. From your analytics, it appears as the user's normal traffic.
Perplexity Comet (Perplexity) — Standalone Chromium browser, free since March 2026. Identifies itself in the user-agent string with Comet, so this one is filterable in analytics.

The defining property of Class 4 is that they hit the same code paths as humans. They click "Add to cart." They submit forms. They fail at login walls and CAPTCHA. The fixes that work on them are not robots.txt edits — they are UX changes to the actual page. The Shopify checkout walls article goes deep on this for e-commerce; the screen-resolution article covers the visual side.

The decision: allow Class 4 by default. Robots.txt is irrelevant to them. The only meaningful "block" is a server-side rule against OpenAI's IP ranges, which would refuse legitimate user-initiated agent traffic — almost never the right call. The action items for Class 4 are downstream: design checkout flows that work for non-cookie sessions, surface CTAs above the fold, avoid CAPTCHA on cart pages, and confirm guest checkout is enabled.

Action checklist by class

The right policy is class-specific. Here is the matrix as a decision-support table:

Class	Default policy	robots.txt rule	Cloudflare action	Action items
Training crawlers	Allow (unless legal opt-out)	`User-agent: GPTBot` / `ClaudeBot` / `CCBot` with `Allow: /`	Disable "Block AI Bots and Scrapers"	Make sure your homepage and key pages render in raw HTML — these crawlers don't run JS
Search-index crawlers	Always allow	`User-agent: OAI-SearchBot` / `PerplexityBot` with `Allow: /`	Disable "Block AI Bots and Scrapers"	Add structured data (JSON-LD), Open Graph tags, and an llms.txt — these signals improve citation quality
Real-time fetchers	Allow (cannot block via robots.txt anyway)	No effect — robots.txt does not apply	Whitelist OpenAI/Anthropic/Google IP ranges if you have aggressive WAF rules	Render pages in raw HTML, expose product details server-side, ensure no auth wall on public content
Autonomous agents	Allow	No effect — they drive a real browser	Whitelist verified AI bots above any Bot Fight Mode rule	Guest checkout enabled, no login wall on cart, CAPTCHA only on high-risk events, sticky CTAs above the fold
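The robots.txt column of this table can be generated from a per-class policy. The sketch below covers only Classes 1 and 2, since robots.txt has no effect on the other two; `CLASS_TOKENS` and `render_robots_txt` are hypothetical names for this example.

```python
# robots.txt tokens per class, taken from the class sections above.
CLASS_TOKENS = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"],
    "search_index": ["OAI-SearchBot", "PerplexityBot"],
}


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Render one group per class; True means allow, False means disallow.

    Classes missing from `policy` default to allow.
    """
    groups = []
    for class_name, tokens in CLASS_TOKENS.items():
        rule = "Allow: /" if policy.get(class_name, True) else "Disallow: /"
        lines = [f"User-agent: {token}" for token in tokens] + [rule]
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"
```

Calling `render_robots_txt({"training": False, "search_index": True})` produces a file that opts out of training while keeping the search-index crawlers allowed. Append the output to your existing rules rather than replacing them.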

Two cross-cutting rules apply across all classes:

JavaScript rendering: training crawlers and search-index crawlers mostly do not execute JS. If your content depends on client-side rendering, those classes see a blank page. Switch to server-side rendering or static generation for any page that matters in AI visibility. Vercel's data shows 34% of ChatGPT crawler fetches and 34% of Claude crawler fetches end on 404 pages; part of this may come from single-page apps whose routes resolve only in the browser and return a 404 to a direct fetch.
Bot fingerprinting: Cloudflare's Bot Fight Mode and similar WAF rules use fingerprinting (clean browser session, server-side IP, lack of human interaction history) to flag traffic. Class 4 agents trigger these signals even though they are not malicious. The right response is to whitelist verified AI bots specifically rather than disable the WAF entirely.
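Verifying a bot before allowlisting it usually means checking the source IP against the vendor's published ranges, because a user-agent string is trivial to spoof. The sketch below assumes a JSON payload with a `prefixes` list containing `ipv4Prefix` or `ipv6Prefix` entries. Both the field names and the sample address block are assumptions for illustration, so check the real file's structure before relying on this.

```python
import ipaddress
import json

# Illustrative payload: the field names and the address block are assumptions,
# not copied from any vendor's published file.
SAMPLE_RANGES = json.dumps({"prefixes": [{"ipv4Prefix": "192.0.2.0/28"}]})


def load_networks(ranges_json: str) -> list:
    """Parse a ranges document into ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(remote_ip: str, networks: list) -> bool:
    """True if the source address falls inside one of the published networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in networks)
```

A request that claims a bot user-agent but fails `is_verified` is best treated as ordinary unverified traffic and left to the normal WAF rules.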

How to test which classes can reach your site

Three concrete checks tell you whether each class is being served correctly:

Class 1 + 2 (training and search crawlers): run a Crawloria audit on your homepage. Our bot-access category sends real HTTP requests with each crawler's documented user-agent (GPTBot, ClaudeBot, OAI-SearchBot, PerplexityBot, CCBot, anthropic-ai, Google-Extended) and reports which ones get HTTP 200 and which are blocked. The audit takes 20 seconds and surfaces both robots.txt issues and Cloudflare WAF blocks.
Class 3 (real-time fetchers): this is harder to test directly because the fetches are user-triggered. The closest signal is whether ChatGPT can answer accurate, specific questions about your site after browsing it. Ask ChatGPT directly: "what does [your domain] sell?" If the answer is generic or wrong, ChatGPT-User cannot reliably retrieve usable content. Common cause: client-side rendering hides the actual product details.
Class 4 (autonomous agents): simulate the user journey from a clean browser session. Open your site in an incognito Chrome window with no cookies, no account, no extensions. Try to add a product to cart and reach checkout. The first wall you hit (login modal, cookie banner, CAPTCHA, sign-up prompt) is where Operator and Comet abandon. Fix that wall. Repeat.
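The Class 1 and 2 check can also be approximated by hand. The sketch below requests a URL once per crawler user-agent and records the HTTP status. It is not how the Crawloria audit is implemented; `PROBE_AGENTS`, `http_status`, and `probe` are hypothetical names, and the user-agent strings are the ones quoted earlier in this article.

```python
import urllib.error
import urllib.request
from typing import Callable

# User-agent strings as quoted in the class sections above.
PROBE_AGENTS = {
    "GPTBot": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot",
    "OAI-SearchBot": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot",
    "CCBot": "CCBot/2.0 (https://commoncrawl.org/faq/)",
}


def http_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch `url` with the given user-agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 429 and similar are results here, not failures


def probe(url: str, fetch: Callable[[str, str], int] = http_status) -> dict[str, int]:
    """Return a status code per crawler name for one URL."""
    return {name: fetch(url, ua) for name, ua in PROBE_AGENTS.items()}
```

One caveat: these requests come from your own IP address, so a WAF that verifies bots by IP range may answer differently than it would for the real crawler. A 200 here shows that the user-agent alone is not blocked. It does not prove the real bot gets through.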

Beyond these structural checks, there is no public way to verify how a specific AI vendor processes your content downstream. Logs and audit signals are the strongest available indicators. Run them quarterly — the bot landscape changes faster than annual policy reviews can keep up with.
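For the log side of that quarterly review, a small tally per bot token is often enough to spot changes. The sketch below assumes access logs in the common "combined" format, where the user-agent is the last quoted field; `tally_bots` and the token list are illustrative.

```python
import re
from collections import Counter

# In the combined log format, the user-agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "CCBot", "Google-Agent",
)


def tally_bots(log_lines) -> Counter:
    """Count requests per AI bot token; lines without a known token are ignored."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for token in TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break
    return counts
```

Comparing these counts quarter over quarter shows which classes are actually reaching the site. Class 4 agents will not appear in them, since they send a regular browser user-agent.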

FAQ

Are AI bots ranked by how much traffic they generate to my site?

Mostly no. Training and search-index crawlers (Class 1 + 2) generate no user-facing visits at all. Real-time fetchers (Class 3) generate one fetch per user query, then the user clicks through to your site as a regular visit if they choose. Only autonomous agents (Class 4) generate human-shaped sessions. So the question "is GPTBot driving traffic?" usually has the answer "no, but allowing it lets ChatGPT recommend you, which drives traffic separately."

Should I block AI bots to save server bandwidth?

For most sites, no. Vercel's data shows combined AI crawler volume at roughly 28% of Googlebot's, which is meaningful but not catastrophic. Bandwidth is rarely the binding constraint. The exception: if your site is a static media library or hosts large file downloads, training crawlers can hit you hard. Add per-class rate limits in your CDN before doing a wholesale block.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is a Class 1 training crawler — automated, scheduled by OpenAI, respects robots.txt, used to gather data for future models. ChatGPT-User is a Class 3 real-time fetcher — triggered by a specific user question in ChatGPT, does not respect robots.txt because the fetch is user-initiated. Allowing one does not allow the other. They are separate user-agents that need separate robots.txt entries.

Does blocking Google-Extended affect my Google Search ranking?

No. Google explicitly stated when launching Google-Extended in September 2023 that it controls only training data for Bard, Gemini, and Vertex AI. Google Search ranking uses Googlebot and is governed separately. You can block Google-Extended without losing search rankings.

How do I block Class 4 autonomous agents specifically?

You generally cannot, because they drive a real Chrome browser and from your server they look identical to human Chrome traffic. The exceptions: ChatGPT Operator runs from OpenAI cloud IPs (you could IP-block those, at the cost of refusing legitimate Operator users), and Perplexity Comet identifies itself with Comet in the user-agent string (filterable in analytics or WAF). Claude for Chrome is essentially indistinguishable because it is an extension running in the user's own browser.

What about Bytespider, Amazonbot, Meta-ExternalAgent?

Bytespider (ByteDance), Amazonbot, and Meta-ExternalAgent are Class 1 training crawlers from vendors other than OpenAI, Anthropic, and Google. Same posture as GPTBot: they respect robots.txt, they do not generate visibility traffic, and allowing them puts your content in the relevant model. Whether that's valuable depends on whether your audience uses those vendors' products.

Is llms.txt relevant to any of these classes?

It targets retrieval pipelines for Class 2 (search-index crawlers) and Class 3 (real-time fetchers) most directly. The training crawlers in Class 1 fetch broadly and don't preferentially read llms.txt. Whether to publish one is covered in the llms.txt explainer.

Are autonomous agents the same thing as Generative Engine Optimization (GEO)?

No. GEO is a broader positioning term covering all four classes plus the structured-data and content-strategy work that makes a site cite-able in AI answers. Autonomous agents are one specific class within the AI bot landscape that GEO addresses. Confusing the two is a common source of bad advice.

What's next

The four-class taxonomy is the vocabulary spine for the rest of this site's AI agent coverage. For specific cases:

Visibility prerequisites — start with ChatGPT Not Showing Your Website? 9 Causes and How to Fix Each. Class 1 and Class 2 access is the gating issue there.
Vision-based agent rendering — How AI Agents See Your Website: The 1568-Pixel Rule covers Class 4 specifically, where Claude Computer Use and Operator render screenshots and read pixels.
Commerce specifics — Shopify ChatGPT Integration: 5 Walls Blocking AI Agent Checkout is the Class 4 deep dive for DTC sites.
The retrieval-side hint file — llms.txt: What It Actually Does, Who Uses It, and Whether You Need It covers the optional hint file for Class 2 and Class 3 retrieval, plus a free generator.

Run a free Crawloria audit to see which classes are reaching your site today, with prioritized fixes per class. Most sites have a different problem in each of the four classes, and one blanket policy hides all of them.