AI recommendation systems in tools like ChatGPT, Claude, and Google's AI Overviews are now a visible part of how people find brands, yet new data shows these systems return highly unstable recommendation lists across repeated queries.
AI Recommendation Stability in ChatGPT, Claude, and Google AI Overviews (SparkToro Study Review)
Executive Snapshot
- ChatGPT and Google's AI in Search returned the same list of brands for a given prompt less than 1% of the time across repeated runs.[S1][S2]
- The same list in the same order appeared less than 0.1% of the time across runs of the same prompt.[S1][S2]
- Across 2,961 total runs (12 core prompts, each run 60-100 times per platform), nearly every AI response was unique in list composition, ordering, and number of recommendations.[S1][S2]
- Despite phrasing differences, a handful of headphone brands (Bose, Sony, Sennheiser, Apple) appeared in 55-77% of 994 responses to varied, user-written prompts.[S1][S2]
- A separate Ahrefs study found Google AI Mode and AI Overviews cited different URLs 87% of the time for the same query.[S3]
Implication for marketers: AI "ranking position" is unstable as a metric; visibility frequency across many runs and prompts is a more meaningful signal than any single AI answer.
Method & Source Notes
- SparkToro / Gumshoe.ai study - Led by Rand Fishkin (SparkToro) and Patrick O'Donnell (Gumshoe.ai). Tested 12 prompts asking for brand or product recommendations (for example, chef's knives, headphones, cancer care hospitals, digital marketing consultants, science fiction novels). Each prompt was run 60-100 times on ChatGPT, Claude, and Google Search's AI Overviews/AI Mode, for a total of 2,961 runs across platforms. Conducted over November-December with hundreds of volunteers using their typical settings. Full methodology and raw data were published on a public mini-site, and SparkToro's accompanying report provides additional context and interpretation.[S1][S2]
- User prompt variability sub-study - 142 participants each wrote a prompt about headphones for a traveling family member. Semantic similarity across prompts averaged 0.081 (low similarity), yet outputs were drawn from a relatively stable brand pool.[S1][S2]
- Ahrefs study on Google AI Mode vs AI Overviews - Large-scale measurement of URL citations found that AI Mode and AI Overviews cited different sources for the same query in 87% of Google queries.[S3]
Key limitations: SparkToro's work is not peer-reviewed, covers 12 prompt types, and focuses on brand/product recommendation queries. Respondents' AI settings were not standardized, which increases ecological validity but introduces uncontrolled variation.[S1][S2]
How consistent are AI recommendations across repeated queries?
Across the 12 brand recommendation prompts, SparkToro found very low repeatability when the same prompt was run multiple times on the same AI system.[S1][S2] The team defined "reliable repeatability" as returning the same brand list and ordering at a meaningful rate, and none of the platforms met that bar. Key numeric results:
- Probability that ChatGPT or Google AI in Search returned the same set of brands (ignoring order) across runs of the same prompt: <1%.[S1][S2]
- Probability that the same set of brands in the same order appeared again: <0.1%.[S1][S2]
- Nearly every one of the thousands of responses differed in at least one of three dimensions:[S1][S2]
- Which brands were named
- The order in which brands were recommended
- How many brands were listed in the answer
Differences across models were modest. Claude was slightly more likely than ChatGPT or Google's AI to repeat the same brand list, but less likely to repeat the same order.[S1][S2] Even so, no model produced the kind of stability associated with traditional search rankings or paid search ad positions.
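To make these repeatability figures concrete, the sketch below (Python, with hypothetical run data rather than the study's actual exports) computes the two statistics over repeated runs of a single prompt: the share of run pairs that return the same brand set, and the share that return the same ordered list.

```python
from itertools import combinations

# Hypothetical runs of ONE prompt on ONE system: each entry is the ordered
# brand list that run returned. Real data would come from the study's exports.
runs = [
    ["Bose", "Sony", "Sennheiser"],
    ["Sony", "Bose", "Apple", "Sennheiser"],
    ["Bose", "Sony", "Sennheiser"],
    ["Sennheiser", "Bose", "Sony"],
]

def repeatability(runs):
    """Share of run pairs with (a) identical brand sets, (b) identical order."""
    pairs = list(combinations(runs, 2))
    same_set = sum(set(a) == set(b) for a, b in pairs) / len(pairs)
    same_order = sum(a == b for a, b in pairs) / len(pairs)
    return same_set, same_order

set_rate, order_rate = repeatability(runs)
print(f"same set: {set_rate:.1%}  same ordered list: {order_rate:.1%}")
```

On SparkToro's data, the same computation yields the <1% (same set) and <0.1% (same ordered list) rates cited above.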
Taken together with Ahrefs' finding that Google AI Mode and AI Overviews chose different URLs 87% of the time for identical queries on the same engine, the pattern indicates high variability both across runs and across AI features even when user intent and platform are held constant.[S3]
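Ahrefs' comparison can be expressed the same way. The following sketch uses hypothetical per-query citation sets; it illustrates the comparison, not Ahrefs' published methodology or data.

```python
# Hypothetical per-query citation sets for two Google AI features.
ai_mode = {
    "best travel headphones": {"a.com/review", "b.com/guide"},
    "chef knife recommendations": {"c.com/knives"},
    "cancer care hospitals": {"d.com/rankings"},
}
ai_overviews = {
    "best travel headphones": {"a.com/review", "b.com/guide"},
    "chef knife recommendations": {"e.com/top10"},
    "cancer care hospitals": {"f.com/list"},
}

# Share of queries where both features cite exactly the same URL set.
same = sum(ai_mode[q] == ai_overviews[q] for q in ai_mode)
print(f"identical citation sets: {same / len(ai_mode):.0%} of queries")
```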
Which AI visibility metrics give more stable brand exposure?
SparkToro's data indicates that visibility frequency - how often a brand appears across many runs and similar prompts - is more stable than any single ranking position or specific slot in an AI answer.[S1] While individual lists changed almost every time, some brands surfaced repeatedly across runs.
In tighter categories where there are relatively few widely recognized options (for example, cloud infrastructure providers), the same leading brands tended to appear in a majority of responses, even as order and exact lists shifted.[S1] In broader or more subjective categories (for example, science fiction novels), the distribution of recommendations was more scattered, with many different titles appearing only occasionally.[S1]
This distinction matches the headphone sub-study. The researchers asked 142 participants to write their own natural-language prompts around a shared scenario (headphones for a traveling family member). The prompts varied widely in wording - semantic similarity averaged just 0.081, which the authors likened to the distance between two very different food dishes - but the response sets overlapped heavily.[S1][S2] Across 994 AI responses to these varied prompts:
- Bose, Sony, Sennheiser, and Apple appeared in 55-77% of answers.[S1][S2]
Despite unstable ordering and list structure, these brands remained in a relatively consistent "consideration set" across different queries. This supports Fishkin's contention that "any tool that gives a 'ranking position in AI' is full of baloney," while metrics based on share of appearances or presence vs absence across many runs appear more meaningful.[S1][S2]
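The study does not document how the 0.081 similarity score was computed; a common approach is averaged pairwise cosine similarity over sentence embeddings. Below is a minimal sketch along those lines, assuming the sentence-transformers library and an off-the-shelf embedding model - one plausible pipeline, not necessarily the researchers' own.

```python
from itertools import combinations

# Assumes the sentence-transformers package (pip install sentence-transformers).
# The study's actual embedding model and pipeline are not published.
from sentence_transformers import SentenceTransformer, util

prompts = [
    "What headphones should I get for my dad? He travels constantly.",
    "Recommend durable noise-cancelling headphones under $300 for long flights.",
    "Best wireless headphones for a consultant who flies every week?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts)

# Average cosine similarity over all prompt pairs.
scores = [float(util.cos_sim(embeddings[i], embeddings[j]))
          for i, j in combinations(range(len(prompts)), 2)]
print(f"mean pairwise similarity: {sum(scores) / len(scores):.3f}")
```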
User prompt variation and AI brand consideration sets
The headphone experiment highlights real-world prompt variation. When people were asked simply to request headphone recommendations for a scenario, almost no two prompts looked similar beyond the core topic.[S1][S2] Participants varied:
- The level of detail (budget, noise cancellation, wireless vs wired, brand preferences)
- The narrative ("for my dad who travels a lot" vs "for a frequent-flyer consultant")
- Constraints (comfort, battery life, durability, price caps)
Even with this wide spread of phrasing, the AI tools repeatedly selected from a relatively small set of large, well-known headphone brands.[S1][S2] This suggests that for many commercial categories, the underlying model and its training data create a stable "shortlist" of likely brands, even though the specific combination and presentation change from one response to the next.
From a measurement standpoint, this means prompt-level variability adds noise on top of an already variable recommendation surface. Any attempt to track AI visibility that only tests a handful of narrowly phrased prompts will likely miss how people actually query these systems day to day.[S1]
Interpretation and implications for AI brand visibility tracking
Likely: ranking-style AI metrics are unreliable for performance reporting.
Given the <1% chance of seeing the same brand list and the <0.1% chance of seeing the same list in the same order for a repeated prompt, "position" in AI answers is far less stable than search rankings or ad slots.[S1][S2] Tools that promise static "AI rank tracking" for a small set of keywords are probably measuring noise more than signal.
Likely: presence/absence and share-of-voice metrics are more meaningful.
Across both the main SparkToro study and the headphone prompt sub-study, how often a brand appeared across many runs and prompt variants showed more consistency than exact placement.[S1][S2] For marketers, this suggests that metrics such as "% of runs where the brand appears somewhere in the AI answer" or "share of responses mentioning the brand vs competitors" will better reflect AI visibility than any single-run output.
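As an illustration of such a metric (the data and brand mixes below are hypothetical), presence rate per brand is simply the fraction of runs in which the brand appears anywhere in the answer:

```python
from collections import Counter

# Hypothetical responses: each entry is the set of brands one AI run mentioned.
responses = [
    {"Bose", "Sony", "Sennheiser"},
    {"Sony", "Apple"},
    {"Bose", "Sony", "Apple", "Anker"},
    {"Bose", "Sennheiser"},
]

appearances = Counter(brand for resp in responses for brand in resp)

# Presence rate: share of runs where the brand appears anywhere in the answer.
for brand, count in appearances.most_common():
    print(f"{brand}: {count / len(responses):.0%} of runs")
```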
Tentative: category structure shapes AI stability.
In narrow markets with a small number of dominant providers (cloud platforms, major software vendors), the same brands tend to be present across a majority of AI responses.[S1] In broad or taste-driven categories, AI outputs are more diffuse. Measurement strategies and expectations for "coverage" likely need to adjust by category type.
Tentative: user prompt diversity weakens narrow tracking strategies.
The headphone experiment shows that real users rarely phrase prompts the same way, even when their intent is similar.[S1][S2] Visibility measured on one or two "lab-designed" prompts may misrepresent actual exposure. A more realistic approach would sample multiple natural-language prompts per intent, then aggregate results into a distribution of brand appearances.
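A minimal sketch of that sampling design, again with hypothetical data: collect several user-phrased prompt variants for one intent, run each multiple times, and aggregate brand appearances into a distribution rather than a rank.

```python
from collections import Counter

# Hypothetical sampling design: several user-phrased prompt variants for one
# intent, each run multiple times; entries are the brand sets each run returned.
intent_runs = {
    "headphones for a traveling parent": [{"Bose", "Sony"}, {"Sony", "Apple"}],
    "durable travel headphones under $300": [{"Bose", "Sennheiser"}, {"Bose", "Sony"}],
    "wireless headphones for frequent flyers": [{"Sony", "Anker"}, {"Bose", "Sony"}],
}

all_runs = [brands for runs in intent_runs.values() for brands in runs]
distribution = Counter(b for brands in all_runs for b in brands)

# Report a distribution of appearances across the whole intent, not a rank.
for brand, count in distribution.most_common():
    print(f"{brand}: {count}/{len(all_runs)} runs")
```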
Speculative: large brands may enjoy structural advantages in AI recommendations.
In the headphone data, the same well-known brands dominated a majority of answers across diverse prompts.[S1][S2] While the study does not directly test causation, it is reasonable to suspect that factors such as extensive web coverage, review volume, and frequent mentions in training data increase the likelihood of being included in AI recommendation sets.
Contradictions and gaps in current AI recommendation data
- Limited scope of query types. The SparkToro study concentrates on brand and product recommendations and does not analyze informational, navigational, or support-style queries.[S1] Performance and variability may differ for those intents.
- Manual vs API prompts. The researchers explicitly note uncertainty over whether API calls produce the same degree of variation as manual prompts through consumer UIs.[S1] Many enterprise tools use APIs, so this remains a significant unknown.
- Settings and personalization effects. Participants used their usual AI tool settings; these were not controlled or logged in detail.[S1] Temperature, personalization, and regional differences may all affect variability but are not isolated in the data.
- Comparisons across more models. The study tests ChatGPT, Claude, and Google Search AI Overviews/AI Mode.[S1] Other emerging systems (for example, model variants and specialized vertical AIs) may exhibit different stability patterns.
- External validation. The research is transparent and data-rich but not peer-reviewed. Independent replication with larger samples and additional categories would strengthen confidence in the findings.[S1]
At the same time, the general direction of SparkToro's results - high variability in AI outputs across runs - is consistent with Ahrefs' separate observation that Google's AI Mode and AI Overviews cited different URLs for the same query 87% of the time.[S3] Both datasets point toward a recommendation environment that is inherently unstable at the level of specific rankings or URL citations.
Data appendix for AI recommendation stability studies
Key numeric highlights (SparkToro / Gumshoe.ai):[S1][S2]
| Metric | Value / Range | Notes |
|---|---|---|
| Total prompt runs across platforms | 2,961 | 12 core prompts, each run 60-100 times per AI system |
| Number of core brand/product recommendation prompts | 12 | Categories included knives, headphones, hospitals, consultants, etc. |
| Probability of identical brand list (same prompt, same system) | <1% | Order ignored |
| Probability of identical list and identical order | <0.1% | Very low repeatability |
| Number of human-written headphone prompts | 142 | Each from a different participant |
| Semantic similarity score across headphone prompts | 0.081 | Indicates very low textual similarity |
| Total headphone responses analyzed | 994 | Combined across tested AI tools |
| Appearance rate of key headphone brands in responses | 55-77% | Bose, Sony, Sennheiser, Apple |
Key numeric highlights (Ahrefs):[S3]
| Metric | Value | Notes |
|---|---|---|
| Queries where AI Mode & AI Overviews cited same URLs | 13% | For 87% of queries, cited URLs differed between the two features |
| Queries where AI Mode & AI Overviews differed | 87% | Indicates substantial variability even within one search platform |
Sources
- [S1] SparkToro / Gumshoe.ai - "New research: AIs are highly inconsistent when recommending brands or products; marketers should take care when tracking AI visibility" (methodology mini-site and raw data).
- [S2] Matt G. Southern, Search Engine Journal - "AI Recommendations Change With Nearly Every Query: SparkToro" (summary article describing the study and its topline numbers).
- [S3] Ahrefs - Report on Google AI Mode vs AI Overviews citing different URLs for the same query (showing an 87% difference rate in cited sources).