Generative engines that answer queries by ranking and summarizing items with large language models (LLMs) are emerging as new search surfaces alongside classical web search. A recent research project introduced CORE, a method for systematically shifting how these models rank candidate results such as product listings, using only controlled edits to item text. The study targets Claude 4, GPT-4o, Gemini 2.5, and Grok-3 accessed via API and reports that both reasoning-style explanations and review-style language can move a target item from last place to the top position in many test cases [S1][S2].
Executive snapshot of the LLM ranking experiment
- Query-based optimization, which repeatedly edits a product description and re-queries the target LLM, raised a last-ranked item to #1 in about 77-82% of tested cases across Claude 4, GPT-4o, Gemini 2.5, and Grok-3 [S1][S2].
- A shadow-model approach using Llama-3.1-8B as a local proxy achieved lower promotion success, moving a last-ranked item to first position in about 30-34% of cases [S1][S2].
- For GPT-4o, reasoning-focused edits achieved an 81.0% rate of moving a last-ranked product to #1, while review-style edits achieved 79.0%. Review-style content reached up to 91% success when the goal was to move the item into the top 5 [S1][S2].
- Across models, GPT-4o and Claude 4 reacted more to added reasoning, while Gemini 2.5 and Grok-3 reacted more to added review-like descriptions. Review-style edits overall reached roughly 79-83.5% success in promoting items from last to first in the tests [S1][S2].
- A purely synthetic character string produced by optimization increased rankings in 33% of cases but was flagged as spammy by human raters in 98.5% of cases. Reasoning-style edits were judged artificial in 62.1% of cases [S1][S2].
Implication for marketers: LLM-based search and answer systems are highly sensitive to how product information is written, especially to explicit reasoning and realistic-sounding reviews, but aggressive manipulation is obvious to humans and raises compliance and trust risks.
Method and source notes on CORE LLM ranking research
The CORE study, Controlling Output Rankings in Generative Engines for LLM-based Search (CORE), examines whether and how one can systematically control the ranking order of items returned by LLMs when those models are used as rankers for search-like tasks [S1][S2]. The work targets four proprietary models accessed via API: Claude 4, GPT-4o, Gemini 2.5, and Grok-3.
All experiments were run in a controlled environment where the researchers supplied the query and a fixed list of candidate items directly in the prompt. Retrieval, browsing, and other external tools were disabled; the LLMs only saw the item descriptions provided by the researchers [S1][S2].
The core question was whether edits to a single target item's description could reliably move it from the bottom of the ranking to a higher position when evaluated by these LLMs, and which kinds of edits were most effective. The researchers compared two optimization setups:
- Query-based black-box method - repeatedly calls the target LLM, edits the target item text based on feedback, and checks the new ranking.
- Shadow-model method - trains a local surrogate (Llama-3.1-8B) to mimic the target LLM's ranking, then runs optimization against that surrogate and tests transfer of the edits back to the real model [S1][S2].
They tested three content strategies: (1) synthetic character strings, (2) reasoning-style expansions, and (3) review-style expansions. Human annotators assessed whether the resulting content seemed artificial or spammy [S1][S2].
Key reporting limitation: the public summary does not specify query counts, product counts, domains beyond products and travel, or details of the human-rating protocol. All numeric results cited here come from the Search Engine Journal write-up of the paper, not from direct review of the preprint [S1][S2].
Sources used
- [S1] Controlling Output Rankings in Generative Engines for LLM-based Search (CORE), arXiv:2602.03608 (original research paper referenced by Search Engine Journal).
- [S2] Roger Montti, "How Researchers Reverse-Engineered LLMs For A Ranking Experiment," Search Engine Journal, accessed via user-supplied text.
Findings on query-based and shadow model LLM ranking optimization
The CORE study reports separate findings for the query-based approach, the shadow-model approach, and the three content strategies. All of the results below refer to rankings over fixed candidate lists supplied by the researchers, not to live web search or AI Overviews [S1][S2].
Performance of query-based versus shadow model optimization
In the query-based setup, the researchers treat each target LLM as a true black box. They:
- Start with a query and a list of candidate items, including a target item initially ranked at the bottom.
- Call the LLM to produce a ranking over the candidates.
- Use another LLM call to suggest text to add to the target item (content expansion rather than full rewriting).
- Insert the new text into the target item, resubmit the list, and record the updated ranking.
- Iterate until the target item meets a desired rank or a stop condition [S1][S2].
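The loop above can be sketched as a minimal Python harness. Everything here is an illustrative stand-in: `rank_items` ranks by description length and `suggest_addition` appends fixed text, whereas the study's actual pipeline issued API calls to the target LLM for both steps, and none of these function names come from the paper.

```python
def rank_items(items):
    """Toy stand-in for an LLM ranking call: longer descriptions rank higher.

    The study instead prompted Claude 4, GPT-4o, Gemini 2.5, or Grok-3
    with the query and the full candidate list.
    """
    return sorted(items, key=len, reverse=True)

def suggest_addition(text):
    """Toy stand-in for the LLM call that proposes content to append."""
    return text + " Added reasoning about why this item matches the query."

def optimize_target(items, target_idx, goal_rank=0, max_iters=10):
    """Expand the target item's text until it reaches goal_rank or gives up."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for _ in range(max_iters):
        current_rank = rank_items(items).index(items[target_idx])
        if current_rank <= goal_rank:
            break
        items[target_idx] = suggest_addition(items[target_idx])
    return items[target_idx], rank_items(items).index(items[target_idx])
```

Replacing the two stubs with real model calls recovers the study's structure: the target model stays a black box, and only the target item's text changes between iterations.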
Using this loop, query-based optimization moved the last-ranked item to #1 in roughly 77-82% of test cases, depending on the target model and content style [S1][S2]. This was the highest promotion rate among all methods tested.
In the shadow-model setup, the team first trains Llama-3.1-8B as a surrogate for a target LLM such as GPT-4o. They feed the same query-candidate pairs into both models and adjust Llama-3.1-8B until its rankings match the target's as closely as possible. They then run optimization only against the surrogate and test whether the resulting edits still move the item up when evaluated by the real LLM [S1][S2].
Reported outcomes [S1][S2]:
- Llama-3.1-8B reached an average similarity score of 4.5 out of 5 when compared with GPT-4o's rankings, indicating a high but not perfect match.
- Despite approximation error, shadow-model optimization still transferred, but with lower promotion success: roughly 30-34% of runs moved the target item from last to first.
- Performance degrades smoothly as the surrogate becomes less accurate, rather than collapsing entirely, which suggests that partial matching can still be useful for offline experimentation.
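How closely a surrogate tracks the target can be quantified with a pairwise rank-agreement measure. The summary does not define the study's 1-5 similarity score, so the Kendall-tau-style fraction below is only an illustrative stand-in, not the paper's metric.

```python
from itertools import combinations

def pairwise_agreement(rank_a, rank_b):
    """Fraction of item pairs that two rankings order the same way.

    rank_a and rank_b map item id -> position (0 = top). Returns 1.0 for
    identical orderings and 0.0 for a full reversal. Illustrative only;
    the study reports an unspecified 1-5 similarity score instead.
    """
    agree, total = 0, 0
    for x, y in combinations(sorted(rank_a), 2):
        total += 1
        if (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y]):
            agree += 1
    return agree / total
```

A measure like this, averaged over many query-candidate pairs, is one plausible way to track how surrogate accuracy relates to the smooth degradation in transfer success the study describes.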
These results indicate that both direct black-box optimization and surrogate-based optimization can systematically change how LLMs rank items in controlled lists, with a significant but lower effect when relying on a proxy model.
Impact of content type and detection rates
The study contrasts three styles of content modification: a synthetic string, reasoning-style expansions, and review-style expansions, applied through both query-based and shadow-model pipelines [S1][S2].
String-based optimization. The team begins with a 20-character string of exclamation marks attached to the target item, then applies gradient-like updates (2,000 iterations) using the surrogate model to search for a character sequence that maximizes the item's rank [S1][S2]. The final string appears meaningless to humans but still improves rankings. This approach achieved 33% success at promoting a last-ranked item to #1, but human raters flagged this pattern as spammy or artificial in 98.5% of cases, so it is highly detectable [S1][S2].
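As a rough illustration of searching for a high-scoring string against an opaque objective, the sketch below hill-climbs a 20-character string against a toy scoring function. This is a deliberate simplification: the study ran roughly 2,000 gradient-like updates against Llama-3.1-8B, whereas this sketch uses random mutation with a vowel-counting score, and every name here is hypothetical.

```python
import random
import string

def surrogate_score(chars):
    """Toy stand-in for the surrogate model's score for the target item.

    Counts vowels; the real objective was the rank that Llama-3.1-8B
    assigned to the target item in the candidate list.
    """
    return sum(c in "aeiou" for c in chars)

def hill_climb_string(length=20, iters=500, seed=0):
    """Greedy random search: mutate one character, keep it if the score holds."""
    rng = random.Random(seed)
    chars = ["!"] * length  # the study also started from 20 exclamation marks
    best = surrogate_score(chars)
    for _ in range(iters):
        i = rng.randrange(length)
        old = chars[i]
        chars[i] = rng.choice(string.ascii_lowercase)
        new = surrogate_score(chars)
        if new >= best:
            best = new
        else:
            chars[i] = old  # revert a worsening mutation
    return "".join(chars), best
```

Like the study's optimized string, the output is meaningless to a human reader, which is consistent with the 98.5% spam-detection rate reported for this strategy.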
Reasoning-based optimization. Here, generated text explains why the target product is a strong match, often by comparing it to alternatives and organizing information under topic headings (for example, "Understanding Air Fryer Types," "Explaining Key Features") [S1][S2]. This style mirrors a detailed buying guide: it breaks down product categories, key features, model differences, and purchase strategies. In tests, the reasoning-based method delivered the highest promotion rates overall among the three content strategies. For GPT-4o specifically, reasoning-style edits moved the last-ranked item to #1 in 81.0% of cases [S1][S2]. Human raters judged this content as artificially structured in 62.1% of cases, suggesting a trade-off between machine effectiveness and natural feel [S1][S2].
Review-based optimization. This style generates past-tense, first-person review text, as if an actual customer had purchased and tested the product, often mentioning comparisons and long-term use (for example, "After 6 months of testing, the Gourmia Air Fryer Oven (GAF486) is my #1 recommendation...") [S1][S2]. The content mimics common review patterns: overall impression, feature breakdown, differences across models, buying tips, and a verdict. Across models, review-style edits achieved about 79-83.5% success in moving a last-ranked item to #1 in the lists used for the experiment [S1][S2]. For GPT-4o, review-style changes reached 79.0% last-to-first promotion and up to 91% success in moving the item into the top 5 [S1][S2]. Gemini 2.5 and Grok-3 were especially responsive to review-like language, while GPT-4o and Claude 4 were more responsive to reasoning-like explanations [S1][S2].
These findings show that LLMs respond in measurable, model-specific ways to the style and framing of added content, not only to its length or keyword presence.
Interpretation and implications for marketers using AI search
This section summarizes what the reported results likely mean for marketing and SEO planning, with explicit labels for confidence levels based on the available data.
Likely: AI search favors detailed explanations and experiential language
The strong promotion rates from reasoning-style and review-style expansions suggest that LLM rankers attend to explicit explanations of how an item fits the query and to realistic user-experience details, not only to surface-level keywords [S1][S2]. For product pages and informational content, this supports a focus on:
- Clear explanations of how and why an item meets a given need.
- Comparisons with alternatives at the feature level.
- Concrete experience markers (time of use, typical use cases, pros and cons).
Likely: Different LLMs have distinct content preferences
The observation that GPT-4o and Claude 4 respond more strongly to reasoning content, while Gemini 2.5 and Grok-3 react more to review-like language, indicates that model-specific copy strategies may matter for AI answer surfaces [S1][S2]. Marketers planning for multiple AI search channels should expect some variation in what each model treats as persuasive evidence.
Tentative: Open-weight models can support offline testing of AI search copy
Because Llama-3.1-8B reached a 4.5/5 similarity score to GPT-4o for ranking behavior in these tests, and optimizations against the surrogate partially transferred, there is early evidence that open models can serve as offline "labs" for AI-search-oriented content experiments [S1][S2]. However, the performance drop from about 77-82% (direct) to 30-34% (via surrogate) shows that differences between models still matter. This makes transfer a tool for exploration, not a guarantee.
Likely: Aggressive manipulation is risky and visible to users
The synthetic string approach shows that purely technical exploitation is both possible and extremely obvious to humans (98.5% detection) [S1][S2]. Even the more natural-sounding reasoning-style expansions were flagged as artificial in 62.1% of cases [S1][S2]. Generating made-up reviews or over-structured explanations just to move up in AI rankings:
- Conflicts with platform policies and consumer-protection rules on several major marketplaces and ad platforms.
- Damages trust if users notice repetition and exaggerated claims.
For marketers, a safer reading of the study is to emphasize authentic reviews and genuine explanatory content, not synthetic or fabricated material.
Speculative: Similar techniques may contribute to AI search spam
The study was run on static, researcher-supplied candidate lists, not live search. Still, the high success rates in closed settings suggest that determined actors could adapt comparable methods to content targeting AI-generated overviews or curated product panels. The author of the Search Engine Journal article notes the possibility that some existing AI search spam may already rely on these ideas [S2]. This remains unproven but is consistent with the incentives facing low-quality operators.
Contradictions, gaps, and open questions in LLM ranking control
Several aspects of the CORE work limit how far marketers can generalize the findings:
- No live retrieval layer - the study bypasses web-scale retrieval and feeds items directly to the LLM [S1][S2]. In real search products, additional ranking layers, quality filters, and safety checks could dampen or override the effects seen in this experiment.
- Unknown corpus scale and diversity - the public summary mentions product search and notes that effects generalize to travel, but does not describe dataset sizes, language mix, or vertical diversity [S2]. Without that, it is hard to know whether similar promotion rates would hold on messy, real-world catalogs.
- Limited visibility into human-rating protocol - while detection percentages are given (98.5% for string-based, 62.1% for reasoning-style content), the number of raters, guidelines, and rating tasks are not described in the summary [S2]. That makes it unclear how these detection rates would compare to user perception on commercial platforms.
- Unclear durability over time - the study does not address whether model updates or revised safety policies would break these optimizations. AI providers frequently change ranking heuristics, prompt handling, and content filters.
- Ethical and policy boundaries - the review-style method, as described, generates reviews without real product testing, which conflicts with guidelines on many marketplaces and search platforms. The research context is controlled and academic, but any operational use of similar methods would raise compliance issues.
For business decision-makers, these gaps mean the results should be treated as evidence of sensitivity, not as a direct recipe for manipulating live AI search.
Data appendix for CORE LLM ranking study
All figures below are reported in the Search Engine Journal summary of the CORE paper [S1][S2]. Exact sample sizes are not provided in that summary.
| Aspect | Setting | Metric | Reported result | Source |
|---|---|---|---|---|
| Query-based optimization | Claude 4, GPT-4o, Gemini 2.5, Grok-3 | Share of runs where last-ranked item moved to rank 1 | ≈77-82% | [S1][S2] |
| Shadow-model optimization | Llama-3.1-8B surrogate for target LLMs | Share of runs where last-ranked item moved to rank 1 | ≈30-34% | [S1][S2] |
| Surrogate similarity | Llama-3.1-8B vs GPT-4o | Ranking similarity (1 = divergence, 5 = high match) | 4.5 / 5 | [S1][S2] |
| String-based edits | Surrogate-driven optimization | Promotion from last to first | 33% of test cases | [S1][S2] |
| String-based detection | Human raters | Share of cases flagged as spammy/artificial | 98.5% | [S1][S2] |
| Reasoning-style edits (GPT-4o) | Query-based optimization | Promotion from last to first | 81.0% | [S1][S2] |
| Review-style edits (GPT-4o) | Query-based optimization | Promotion from last to first | 79.0% | [S1][S2] |
| Review-style edits (GPT-4o) | Query-based optimization | Promotion from last to top 5 | Up to 91% | [S1][S2] |
| Review-style edits (all models) | Query-based / shadow, product lists | Promotion from last to first | ≈79-83.5% range | [S1][S2] |
| Reasoning-style detection | Human raters | Share of cases judged artificial | 62.1% | [S1][S2] |
From a marketing and SEO perspective, the quantitative message is consistent: when LLMs act as rankers over provided candidates, how you describe an item, through structured explanations or realistic user language, can dramatically affect its position. Translating that insight into live AI search strategies, while staying truthful and policy-compliant, is the main challenge.