GPTBot: What It Is and How to Control It

GPTBot is OpenAI's training crawler. Get the user-agent string, robots.txt syntax, log checks, and a 2026 framework for deciding whether to block it.

Max Tsygankov · Founder, Crawloria

Published June 13, 2026 · 11 min read

Intro

GPTBot is the most argued-about crawler on the web. Cloudflare's July 2025 crawler report found it was both the most-blocked AI bot among the top 10,000 domains and the most explicitly allowed one. Site owners cannot agree on what to do with it, partly because most of the advice that ranks for "gptbot" was written in 2023-2024, when blocking was the only conversation worth having.

The situation changed. ChatGPT now sends measurable referral traffic; in Crawloria's own GA4, ChatGPT is the number-two traffic referrer as of June 2026. That doesn't settle the GPTBot question, because GPTBot isn't the crawler behind those referrals. But it does mean the block/allow call deserves more than a copy-pasted robots.txt snippet. This guide covers what GPTBot does, how to recognize and verify it, how to control it, and how to make the decision with 2026 information instead of 2023 reflexes.

What is GPTBot?

GPTBot is the web crawler OpenAI uses to collect publicly accessible content for training its generative AI models. OpenAI's own crawler documentation describes its purpose as making "our generative AI foundation models more useful and safe." It reads public pages the way a search-engine crawler does, but the destination is a training corpus, not a search index.

Two boundaries matter. First, GPTBot only touches publicly available content; it doesn't bypass paywalls or log into anything. Second, it checks robots.txt and honors a disallow rule addressed to its token.

It is also no longer a fringe crawler. Cloudflare's measurement across its network found GPTBot's share of AI-crawler traffic grew from 5% in May 2024 to 30% in May 2025, with raw requests up 305% over the same period. That moved it from the ninth to the third most active crawler overall, behind only the giants of traditional search. If you run a public website, GPTBot has almost certainly read it.

Which OpenAI bot is which?

OpenAI operates four distinct crawlers, and conflating them is one of the most common mistakes in bot-policy decisions. Per OpenAI's bot documentation (checked June 2026):

Bot	What it does	robots.txt token	What blocking costs you
GPTBot	Collects content for model training	`GPTBot`	Your content stays out of future training runs. No effect on search visibility.
OAI-SearchBot	Indexes sites for ChatGPT's search features	`OAI-SearchBot`	Your site stops appearing in ChatGPT search results.
ChatGPT-User	Fetches pages when a user asks ChatGPT to	n/a (robots.txt rules may not apply)	User-initiated fetches of your pages fail.
OAI-AdsBot	Validates landing pages submitted as ChatGPT ads	`OAI-AdsBot`	Your ad landing pages can't be safety-checked.

The row that surprises people: blocking GPTBot does nothing to your ChatGPT search visibility, and allowing it doesn't help your visibility either. Training and search surfacing run through separate crawlers with separate tokens. These four are also just OpenAI's slice of a bigger picture; for the full taxonomy of training crawlers, search indexers, real-time fetchers, and autonomous agents, see our breakdown of the four classes of AI bots.

What does the GPTBot user agent look like?

The current GPTBot user-agent string, verbatim from OpenAI's documentation as of June 2026:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot

The stable part is the GPTBot token plus the reference URL. The version number after the slash changes over time, so match on GPTBot rather than the full string in any filter or log query you write.
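A minimal sketch of version-agnostic matching, in Python. The function name `claims_gptbot` and the non-GPTBot example string are ours, purely for illustration. It only tells you what a request claims to be, not whether the claim is true.

```python
import re

# Match the product token only; whatever version follows the slash is ignored.
GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def claims_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be GPTBot, any version."""
    return bool(GPTBOT_TOKEN.search(user_agent))


documented = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
    "compatible; GPTBot/1.3; +https://openai.com/gptbot"
)
# Hypothetical future version and an unrelated, made-up bot for contrast.
future = documented.replace("GPTBot/1.3", "GPTBot/2.0")
other = "ExampleBot/1.0 (+https://example.com/bot)"

for ua in (documented, future, other):
    print(claims_gptbot(ua), ua[-40:])
```

Matching on the token keeps the filter working when OpenAI bumps the version, which is the failure mode of a full-string comparison.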

A user-agent string is just a text header, and anyone can fake one. Scrapers impersonate GPTBot to borrow its reputation with sites that allow it. OpenAI publishes the authoritative IP ranges at https://openai.com/gptbot.json; a request claiming to be GPTBot from an IP outside those ranges is an impostor and fair game for your firewall. The other OpenAI bots have their own range files (searchbot.json, chatgpt-user.json) at the same path pattern.
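The range check can be sketched in a few lines of Python. Two assumptions to flag: the JSON shape used here (a `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix`) is our reading of the file, so confirm it against the live document before relying on the key names, and the sample ranges below are reserved documentation addresses, not OpenAI's.

```python
import ipaddress
import json


def load_networks(ranges_json: str):
    """Parse a ranges document into ip_network objects.

    Assumed shape: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}.
    """
    doc = json.loads(ranges_json)
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def is_from_published_range(ip: str, networks) -> bool:
    """True if the address falls inside any of the published networks."""
    addr = ipaddress.ip_address(ip)
    # Membership is False (not an error) when IPv4 and IPv6 are mixed.
    return any(addr in net for net in networks)


# Placeholder data using RFC 5737 / RFC 3849 documentation ranges.
SAMPLE_RANGES = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'

networks = load_networks(SAMPLE_RANGES)
print(is_from_published_range("192.0.2.17", networks))
print(is_from_published_range("198.51.100.5", networks))
```

In production you would fetch the real file on a schedule and cache it rather than hard-coding ranges, since published ranges can change.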

How do you block GPTBot in robots.txt?

To block GPTBot from your entire site, add this to robots.txt:

User-agent: GPTBot
Disallow: /

To keep it out of specific sections only:

User-agent: GPTBot
Disallow: /members/
Disallow: /research/

OpenAI's documentation states the effect plainly: "Disallowing GPTBot indicates a site's content should not be used in training generative AI foundation models." Three caveats before you ship that file:

It's forward-looking. A disallow rule stops future crawls. It does not pull content out of models that were already trained on it.
It's GPTBot-specific. The rule above leaves OAI-SearchBot and every other vendor's crawler untouched. Each needs its own user-agent block.
It's a request, not a wall. Compliant crawlers honor it; spoofed ones don't. For enforcement, you need the IP-range check above, or edge filtering of the kind we cover in our guide to Cloudflare Bot Fight Mode and AI agents on Shopify. Be careful with blunt WAF rules, though: aggressive bot-fight settings routinely catch the crawlers you want alongside the ones you don't.
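Before shipping a rule, it helps to confirm it does what you think. The sketch below uses Python's standard-library `urllib.robotparser` to test the path-scoped rules above against two tokens. The helper name `allowed` and the example.com URLs are ours. Python's parser is one interpretation of robots.txt and may differ from a given crawler's own matching on edge cases, so treat it as a sanity check.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /members/
Disallow: /research/
"""


def allowed(robots_txt: str, token: str, url: str) -> bool:
    """Ask the standard-library parser whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


checks = [
    ("GPTBot", "https://example.com/members/plan"),
    ("GPTBot", "https://example.com/products/shoe"),
    ("OAI-SearchBot", "https://example.com/members/plan"),
]
for token, url in checks:
    print(token, url, allowed(ROBOTS_TXT, token, url))
```

The third check makes the second caveat concrete: a group addressed to GPTBot says nothing about OAI-SearchBot, so that crawler is still allowed everywhere.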

Should you block GPTBot in 2026?

For most sites that want any form of AI visibility, no. For sites whose content is the product, often yes. The honest answer depends on a cost-benefit calculation that 2023-era articles never had to make, because back then there was no benefit side to weigh.

The case for blocking is real. If you sell access to your content (journalism, original research, licensed data), GPTBot crawls transfer value to a model vendor without compensation, and a disallow rule is the documented way to refuse. Several major publishers did exactly that, and licensing deals between publishers and AI labs followed. Blocking preserves your negotiating position.

The case against blocking has grown quietly. Content that enters training data shapes what models know about your brand, your product category, and your claims. A model that has never read you can't describe you accurately when a user asks. And while GPTBot itself doesn't drive referrals, a blanket "block everything OpenAI" reflex (which is how many sites implement it) takes down OAI-SearchBot in the same robots.txt edit and silently removes the site from ChatGPT search. We've seen what that costs: ChatGPT is the second-largest referrer to crawloria.com in our GA4 data as of June 2026, and merchants who block indiscriminately are cutting off a channel they never measured. If your pages already aren't surfacing, the cause is usually diagnosable; we wrote up the common failure modes in why ChatGPT isn't showing your website.

Cloudflare's data confirms there is no industry consensus to hide behind: GPTBot leads both the most-blocked and the most-allowed lists. The split tracks business models, not best practices. Decide from your numbers, not from someone else's robots.txt.

How do you check GPTBot activity in your logs?

One grep against your access logs answers whether GPTBot visits and how hard. On any server with standard combined-format logs:

grep -i "gptbot" access.log | wc -l

For the verified picture, pull the requesting IPs and compare against OpenAI's published ranges:

grep -i "gptbot" access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Anything claiming GPTBot from outside the gptbot.json ranges is spoofed traffic wearing a respectable name. On Shopify or other hosted platforms where raw logs aren't exposed, a CDN analytics layer (Cloudflare's bot reports, for instance) gives the same view.
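If you'd rather do the comparison in one pass, here is a rough Python sketch that splits GPTBot-claiming hits into in-range and out-of-range IPs. It assumes combined-format lines with the client IP first and the user agent as the last quoted field. The function name, the sample log lines, and the CIDR list are illustrative; the ranges are documentation addresses, not OpenAI's.

```python
import ipaddress
import re
from collections import Counter

# Combined log format: client IP first, user agent as the last quoted field.
LINE_RE = re.compile(r'^(?P<ip>\S+) .* "(?P<ua>[^"]*)"\s*$')


def gptbot_hits_by_ip(lines, cidrs):
    """Return (in_range, out_of_range) Counters of GPTBot-claiming hits per IP."""
    networks = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    in_range, out_of_range = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or "gptbot" not in m.group("ua").lower():
            continue
        try:
            addr = ipaddress.ip_address(m.group("ip"))
        except ValueError:
            continue  # first field is a hostname or malformed; skip it
        bucket = in_range if any(addr in n for n in networks) else out_of_range
        bucket[m.group("ip")] += 1
    return in_range, out_of_range


UA = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot"
SAMPLE_LOG = [
    f'192.0.2.10 - - [13/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "{UA}"',
    f'192.0.2.10 - - [13/Jun/2026:10:00:02 +0000] "GET /a HTTP/1.1" 200 812 "-" "{UA}"',
    f'203.0.113.7 - - [13/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "{UA}"',
    '198.51.100.9 - - [13/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

in_range, out_of_range = gptbot_hits_by_ip(SAMPLE_LOG, ["192.0.2.0/24"])
print("in range:", dict(in_range))
print("out of range:", dict(out_of_range))
```

Unlike the grep, this matches on the user-agent field only, so a URL or referrer that happens to contain "gptbot" doesn't inflate the count.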

What you'll typically find: GPTBot hits arrive in bursts rather than a steady trickle, and frequency varies widely from site to site. The useful signals are direction and load. Rising hit counts suggest OpenAI's crawler considers your site worth revisiting. Hit counts heavy enough to show in your server costs are an argument for rate limiting at the edge rather than an outright block.
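To see direction rather than a single total, bucket the hits by day. This is a small Python sketch under the same combined-log assumption, with timestamps in the usual `[13/Jun/2026:10:00:00 +0000]` form. The function name and sample lines are ours.

```python
import re
from collections import Counter
from datetime import datetime

# Captures the date part of a combined-log timestamp, e.g. 13/Jun/2026.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def gptbot_hits_per_day(lines):
    """Count lines mentioning GPTBot per calendar day, keyed by ISO date."""
    counts = Counter()
    for line in lines:
        if "gptbot" not in line.lower():
            continue
        m = TIMESTAMP_RE.search(line)
        if not m:
            continue
        # %b expects English month abbreviations under the default C locale.
        day = datetime.strptime(m.group(1), "%d/%b/%Y").date().isoformat()
        counts[day] += 1
    return dict(sorted(counts.items()))


SAMPLE_DAYS = [
    '192.0.2.10 - - [12/Jun/2026:09:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "GPTBot/1.3"',
    '192.0.2.10 - - [12/Jun/2026:09:00:01 +0000] "GET /a HTTP/1.1" 200 1 "-" "GPTBot/1.3"',
    '192.0.2.10 - - [13/Jun/2026:11:30:00 +0000] "GET /b HTTP/1.1" 200 1 "-" "GPTBot/1.3"',
    '198.51.100.9 - - [13/Jun/2026:11:31:00 +0000] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
]

for day, hits in gptbot_hits_per_day(SAMPLE_DAYS).items():
    print(day, hits)
```

A few weeks of these daily counts is enough to tell a one-off burst from a sustained climb, which is the distinction that matters for a rate-limiting decision.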

A decision framework for DTC merchants

Three questions settle the GPTBot policy for a typical DTC store, in order:

Is ChatGPT already referring traffic to you? Check GA4 referrers for chat.openai.com and chatgpt.com. If yes, do not touch OpenAI bots with blanket rules; an over-broad block here costs you a working channel. Scope any rule to the GPTBot token alone.
Is your content itself the asset? Product pages, collection pages, and a merchant blog exist to be found and recommended. That content gains from being known to models. Block GPTBot only for the parts of the site that hold genuinely proprietary material, like paid research or member content, using path-scoped disallow rules.
Is crawl load a real cost? For most stores it isn't. If your logs show otherwise, rate-limit at the CDN before reaching for a full block.

The default that falls out for most merchants: allow GPTBot site-wide, or disallow it on a narrow path set, and never let a training-bot decision spill over onto the search and fetch bots that carry actual revenue.
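As a sketch of that default, the Python helper below emits a robots.txt group addressed to the GPTBot token and nothing else. The function name `gptbot_rules` and the example paths are hypothetical. Crawl-load concerns are deliberately left out, since the framework handles those at the CDN rather than in robots.txt.

```python
def gptbot_rules(proprietary_paths=()):
    """Build a robots.txt group for the GPTBot token only.

    With no proprietary paths, GPTBot is explicitly allowed site-wide;
    otherwise each path gets its own Disallow line. No other bot token is
    ever mentioned, so search and fetch crawlers are unaffected.
    """
    lines = ["User-agent: GPTBot"]
    if proprietary_paths:
        lines += [f"Disallow: {path}" for path in proprietary_paths]
    else:
        lines.append("Allow: /")
    return "\n".join(lines) + "\n"


# A store with paid research and member content (illustrative paths).
print(gptbot_rules(["/members/", "/research/"]))

# A store with nothing proprietary to protect.
print(gptbot_rules())
```

Keeping the generator scoped to one token makes the spillover mistake structurally hard: there is no code path that writes a rule for OAI-SearchBot or ChatGPT-User.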

FAQ

Does GPTBot respect robots.txt?

Yes. OpenAI documents that GPTBot honors robots.txt rules addressed to its token, and that a disallow signals content should not be used in training. The compliance is voluntary, though; enforcement against non-compliant or spoofed crawlers requires IP-level verification against openai.com/gptbot.json.

Does blocking GPTBot remove my site from ChatGPT?

No. ChatGPT search visibility runs through OAI-SearchBot, a separate crawler with its own robots.txt token, and user-initiated page fetches run through ChatGPT-User. Blocking GPTBot only opts your content out of future model training. The reverse also holds: allowing GPTBot does not improve your ChatGPT search presence.

Does blocking GPTBot remove content already used in training?

No. A robots.txt disallow stops future crawls. Content collected before the rule existed stays in whatever training runs already used it. There is no retroactive removal mechanism via robots.txt.

How often does GPTBot crawl a site?

There's no published schedule, and frequency varies by site. Network-level data shows the overall volume climbing steeply: Cloudflare measured a 305% year-over-year increase in GPTBot requests between May 2024 and May 2025. Your own access logs are the only reliable answer for your site.

Where to start

If you've never made a deliberate GPTBot decision, do these four things this week:

Grep your access logs (or CDN analytics) for GPTBot and note the volume.
Check GA4 for ChatGPT referral traffic, so you know what an over-broad block would cost.
Write a robots.txt policy per bot token, not per vendor: GPTBot gets its own decision, separate from OAI-SearchBot.
Run a free Crawloria audit to see which AI crawlers can actually reach, render, and extract your pages, because a site that's technically open to GPTBot but unreadable to it gets none of the benefits of either choice.