79% of Major News Sites Now Block AI Bots - Data Shows a New Divide

Reviewed by Andrii Daniv · 9 min read · Jan 8, 2026

AI Bot Blocking by Top News Publishers: 79% Limit Training Crawlers, 71% Limit Retrieval Bots

Large news organizations are rapidly updating robots.txt to control how AI systems access their content. New data from BuzzStream, combined with technical reports from Cloudflare and public statements from Google, indicates that most leading US and UK news sites now restrict AI training crawlers and, increasingly, the bots that fetch content for live AI answers.[S1][S2][S7]

Chart: Most major news publishers now block one or more AI training and retrieval crawlers.

Executive Snapshot: AI bot blocking trends

The latest robots.txt analysis across 100 top US and UK news publishers shows:

  • 79% of publishers block at least one AI training bot via robots.txt directives.[S1]
  • 71% of sites block at least one retrieval or live-search bot, limiting how often they appear as cited sources in AI-generated answers.[S1]
  • Among training bots, CCBot is blocked by 75% of sites, Anthropic-ai by 72%, ClaudeBot by 69%, and GPTBot by 62%.[S1]
  • For retrieval bots, Claude-Web is blocked by 66% of sites, OpenAI's OAI-SearchBot by 49%, ChatGPT-User by 40%, and Perplexity-User by 17%.[S1][S5][S6]
  • 14% of publishers block all AI bots tracked in the study, while 18% block none. Cloudflare's network data shows GPTBot, ClaudeBot, and CCBot among the most fully disallowed bots across large domains, while Googlebot and Bingbot are usually only partially restricted.[S1][S7]

Implication for marketers: visibility in AI products is now shaped as much by bot access rules as by traditional SEO, so content strategies increasingly depend on which training and retrieval crawlers are allowed or refused.

Method & Source Notes on AI crawler blocking data

BuzzStream analyzed robots.txt files for 100 high-traffic news publishers across the US and UK, starting from the top 50 sites in each market by SimilarWeb traffic share and then deduplicating overlaps.[S1] Each robots.txt file was scanned for rules targeting named AI crawlers, which the researchers grouped into three categories: model training bots (for corpus building), retrieval or live-search bots (for real-time queries), and indexing bots (for AI search corpora).[S1] The analysis records whether a bot or group of bots received Disallow directives, not how often those bots actually crawl or whether they comply.
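For teams that want to reproduce this kind of robots.txt audit on their own domain list, the check can be approximated with Python's standard urllib.robotparser module. The sketch below is illustrative only, not BuzzStream's tooling: the bot names follow the crawler documentation cited in this report, and can_fetch against the site root is used as a rough proxy for a sitewide Disallow, so partial restrictions are not captured.

    # Rough robots.txt audit for named AI crawlers (illustrative sketch).
    # can_fetch("/") approximates a sitewide Disallow and ignores partial rules.
    from urllib.robotparser import RobotFileParser

    AI_BOTS = {
        "training": ["CCBot", "anthropic-ai", "ClaudeBot", "GPTBot", "Google-Extended"],
        "retrieval": ["Claude-Web", "OAI-SearchBot", "ChatGPT-User", "Perplexity-User"],
        "indexing": ["PerplexityBot"],
    }

    def audit_site(domain: str) -> dict:
        """Return, per category, the AI bots this domain's robots.txt blocks from its root."""
        parser = RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")
        parser.read()  # fetches and parses the live robots.txt file
        return {
            category: [
                bot for bot in bots
                if not parser.can_fetch(bot, f"https://{domain}/")
            ]
            for category, bots in AI_BOTS.items()
        }

    if __name__ == "__main__":
        for domain in ["example.com"]:  # replace with your own publisher list
            print(domain, audit_site(domain))

Running this over a site list yields, per domain, which training, retrieval, and indexing bots are disallowed at the root, which is close in spirit to how the study's percentages were tallied.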

In this report, source IDs refer to: BuzzStream's publisher blocking study [S1]; Search Engine Journal's coverage of Google's position on robots.txt and unauthorized scraping [S2]; Search Engine Journal's write-up of Cloudflare's decision to delist and block Perplexity from its verified bots program [S3]; Perplexity's public response on Reddit [S4]; OpenAI's crawler documentation for GPTBot and OAI-SearchBot [S5]; Perplexity's documentation for PerplexityBot and Perplexity-User [S6]; and Search Engine Journal's summary of Cloudflare's Year in Review on bot traffic and AI crawlers.[S7]

Key limitations include: the sample is restricted to large English-language news sites in two markets; robots.txt is a voluntary protocol that many bots ignore; and the BuzzStream data captures a single moment in time rather than long-term trends.[S1][S2] Cloudflare reports that some services, including Perplexity, have used IP rotation and user-agent spoofing to bypass robots.txt on certain sites, highlighting that stated policy and real bot behavior can diverge.[S3] Perplexity has published a response disputing these claims, so the scale of any non-compliance remains uncertain.[S4]

Findings on how news publishers block AI crawlers

Taken together, the datasets show a clear shift by major news publishers toward managed access for AI systems rather than an all-or-nothing stance. BuzzStream's robots.txt scan quantifies which specific AI services are most often blocked, while Cloudflare's network-level view shows how those policies appear at scale across many large domains.[S1][S7]

Training bots: high block rates for CCBot, Anthropic, Claude, GPTBot, and Google-Extended

BuzzStream reports that 79% of the 100 news sites in its sample block at least one AI training crawler.[S1] Within this category, blocks are concentrated on non-search-engine bots: 75% of sites disallow Common Crawl's CCBot, 72% disallow Anthropic-ai, 69% disallow ClaudeBot, and 62% disallow OpenAI's GPTBot.[S1] These bots gather large text corpora that help power models such as Claude and ChatGPT.

Google-Extended, which controls whether Google can use content for Gemini model training, is the least blocked training crawler at 46% across the full sample.[S1] The study notes a regional split: 58% of US publishers block Google-Extended compared with 29% of UK publishers.[S1] The dataset does not include direct information on why these regional differences exist.
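In robots.txt terms, a publisher can opt out of Gemini training without affecting its presence in Google Search, because Google-Extended and Googlebot are addressed as separate user agents. A minimal, purely illustrative example:

    User-agent: Google-Extended
    Disallow: /

    User-agent: Googlebot
    Allow: /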

Retrieval bots: 71% of publishers limit live AI search access

Retrieval or live-search bots are also widely restricted. BuzzStream finds that 71% of sites block at least one such bot, which governs whether AI assistants can fetch and cite fresh content when responding to user queries.[S1] Claude-Web, Anthropic's web retrieval crawler, is blocked by 66% of sites, while OpenAI's OAI-SearchBot, which powers ChatGPT's live search, is blocked by 49%.[S1][S5] ChatGPT-User, a separate bot that handles certain user-initiated retrieval requests, is blocked by 40% of publishers in the sample.[S1]

Perplexity's retrieval agent, Perplexity-User, is the least blocked retrieval bot, disallowed by 17% of sites.[S1][S6] OpenAI separates its crawlers into training and retrieval functions: GPTBot collects training data, while OAI-SearchBot and ChatGPT-User serve live user queries.[S5] Perplexity makes a similar distinction between PerplexityBot, which indexes pages for its search corpus, and Perplexity-User, which handles retrieval for user questions.[S6] This separation allows a publisher to permit model training while preventing live AI products from querying or citing its site, or apply the reverse policy.
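As an illustration of that split, a publisher that wanted to remain in training corpora while staying out of live AI answers could publish directives along these lines; the user-agent tokens match the OpenAI and Anthropic documentation cited above, but the policy itself is hypothetical:

    User-agent: GPTBot
    Allow: /

    User-agent: OAI-SearchBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

Reversing the Allow and Disallow lines produces the opposite stance: no contribution to future training runs, but continued eligibility for live retrieval and citation.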

Indexing bots and overall blocking strategies

For indexing bots that feed AI search corpora, BuzzStream reports that PerplexityBot is disallowed by 67% of the sampled news sites.[S1][S6] When policies are viewed as a whole, 14% of publishers block every tracked AI bot, while 18% do not block any of them.[S1] The majority sit between these extremes, with different rules for different services and functions.

Cloudflare's network-wide data aligns with this picture of selective blocking.[S7] Across large domains it monitors, Cloudflare notes that GPTBot, ClaudeBot, and CCBot are among the bots most likely to face full Disallow directives, while Googlebot and Bingbot are more often subject to partial restrictions.[S7] Search Engine Journal's coverage of Google's Gary Illyes reports that many site owners now discuss Googlebot in the context of both search indexing and AI training.[S2][S7]
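The practical difference between those two treatments comes down to the scope of the Disallow rules. The paths below are hypothetical and only illustrate the pattern Cloudflare describes:

    # Full disallow: the bot is barred from every path
    User-agent: GPTBot
    Disallow: /

    # Partial restriction: only specific sections are off-limits
    User-agent: Googlebot
    Disallow: /subscriber-only/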

Interpretation & Implications for SEO, AI visibility, and traffic

The points below interpret the data for marketers and are not direct measurements.

  • Likely: AI products will have shallower access to up-to-date content from large news publishers, especially for services linked to Anthropic, OpenAI, and Perplexity, given the 60-75% block rates on key training and retrieval bots in this sector.[S1]
  • Likely: Because 71% of sampled sites block at least one retrieval bot, branded exposure through AI answer citations for news content will depend heavily on each publisher's bot policy, not only on established SEO ranking signals.[S1][S5][S6]
  • Likely: The relatively lower block rate for Google-Extended (46% overall, with only 29% of UK publishers blocking it) suggests many news organizations still prioritize compatibility with Google's AI ecosystem, even if they restrict other bots, although the available data does not measure this preference directly.[S1]
  • Likely: For non-news brands, these patterns indicate that AI visibility is becoming a policy decision. Leaving retrieval bots unblocked increases the odds of being referenced in AI answers, while blocking them favors control of content reuse over potential AI-driven referral traffic.
  • Tentative: The higher US block rate on Google-Extended compared with the UK may reflect different legal risk calculations, licensing discussions, or dependency on Google traffic across markets, but the available data does not isolate specific drivers.[S1]
  • Likely: Because robots.txt is not an enforcement mechanism, organizations that are serious about limiting AI reuse of their content will need technical controls beyond robots directives, such as CDN or firewall rules, IP reputation checks, and bot fingerprinting, as indicated by Cloudflare's move to block Perplexity at the network edge (a minimal application-level sketch follows this list).[S2][S3]
  • Likely: Search and AI channel planning now requires two parallel decisions - whether to allow training crawlers and whether to allow retrieval crawlers - since OpenAI and Perplexity treat them as separate systems.[S5][S6] Allowing training but blocking retrieval can keep a site out of live AI citations even if model weights already contain its content.
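Because declared user agents can be faked, any enforcement layered on top of robots.txt has to combine signals. The sketch below shows only the simplest application-level layer, assuming a Python WSGI stack; it matches declared user-agent strings and would not catch the spoofing behavior described later in this report, which is why CDN rules, IP reputation, and fingerprinting remain necessary.

    # Illustrative WSGI middleware: returns 403 for declared AI crawler user agents.
    # User-agent strings can be spoofed, so this is only one enforcement layer.
    BLOCKED_AI_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

    def block_ai_crawlers(app):
        """Wrap a WSGI app and refuse requests from listed AI crawler user agents."""
        def middleware(environ, start_response):
            user_agent = environ.get("HTTP_USER_AGENT", "")
            if any(token.lower() in user_agent.lower() for token in BLOCKED_AI_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"AI crawling is not permitted on this site.\n"]
            return app(environ, start_response)
        return middleware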

Contradictions and gaps in the AI bot blocking evidence

The current evidence base has several gaps. BuzzStream's study covers only 100 large news sites in two English-speaking markets, so the results may not reflect mid-sized publishers, non-news sectors, or other languages.[S1] The analysis is based solely on robots.txt configurations at one point in time; publishers can and do update these files as AI and commercial conditions change, which this snapshot does not capture.[S1]

Robots.txt also records stated policy rather than actual crawling behavior. Google's Gary Illyes has confirmed that robots.txt cannot prevent unauthorized access and functions more like a voluntary code that compliant bots choose to respect.[S2] Cloudflare's description of Perplexity's alleged stealth crawling - IP rotation, ASN changes, and user-agent spoofing - illustrates how actors can work around robots directives, while Perplexity's rebuttal means the extent of such behavior is contested.[S3][S4]

Finally, none of the cited sources directly link bot-blocking patterns to measurable changes in AI answer share, referral traffic, or subscription revenue. The strategic interpretations above therefore rely on logical inference from access controls and stated crawler functions, rather than on outcome-based experiments.

Data appendix: AI crawler block rates from the BuzzStream sample

The table below summarizes the key percentages reported in BuzzStream's robots.txt analysis of 100 major US and UK news publishers.[S1] All figures represent the share of sites that issue a Disallow directive for the named bot.

Metric / bot | Category | % of sites blocking | Notes | Source
Any training bot | Training | 79% | Blocks at least one AI training crawler | [S1]
CCBot (Common Crawl) | Training | 75% | High block rate among non-search crawlers | [S1]
Anthropic-ai | Training | 72% | Used for Anthropic model training | [S1]
ClaudeBot | Training | 69% | Anthropic-related training crawler | [S1]
GPTBot (OpenAI) | Training | 62% | OpenAI model training crawler | [S1]
Google-Extended (overall) | Training | 46% | 58% block rate in US, 29% in UK | [S1]
Any retrieval / live-search bot | Retrieval | 71% | Blocks at least one retrieval crawler | [S1]
Claude-Web | Retrieval | 66% | Anthropic web retrieval bot | [S1]
OAI-SearchBot (OpenAI) | Retrieval | 49% | Powers ChatGPT live search | [S1][S5]
ChatGPT-User | Retrieval | 40% | Handles certain user-initiated retrieval requests | [S1][S5]
Perplexity-User | Retrieval | 17% | Perplexity's retrieval agent | [S1][S6]
PerplexityBot | Indexing | 67% | Indexes pages for Perplexity's search corpus | [S1][S6]
Sites blocking all tracked AI bots | Overall policy | 14% | Full blocking across all bots in the study | [S1]
Sites blocking none of the tracked AI bots | Overall policy | 18% | No AI-specific Disallow directives | [S1]

Cloudflare's Year in Review separately identifies GPTBot, ClaudeBot, and CCBot as among the most frequently and fully disallowed bots on large domains, which is directionally consistent with BuzzStream's publisher-level findings.[S7]

Author: Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for the etavrian.com website.
Reviewed by: Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e-commerce businesses.