Google has outlined a new way to infer user intent from on-device interactions. The method uses small models that run locally and reportedly outperforms at least one large multimodal model baseline on this task, while keeping raw interaction data on the device.[S1][S2][S3]
Executive snapshot: Google’s new user intent extraction method
- Google proposes a two-stage intent extraction system that runs on devices, summarizing each UI interaction and then inferring a higher-level goal from those summaries, without sending raw screenshots or clicks to Google servers.[S1][S2][S3]
- The approach reportedly achieves superior performance to both other small models and a state-of-the-art multimodal large language model (MLLM) across datasets and model types, and handles noisy interaction logs better than standard supervised fine-tuning.[S3][S1]
- Human agreement on intent labels is limited: prior work cited in the paper reports 80% agreement on web trajectories and 76% on mobile trajectories, showing that even humans often disagree on the intent behind the same interaction sequence.[S3][S1]
- Experiments cover Android and web environments with English-speaking users in the United States. The work is framed for autonomous on-device agents providing Proactive Assistance and Personalized Memory, not classic web search ranking.[S1][S2][S3]
Implication for marketers: Google is investing in modeling user intent from entire interaction journeys on devices, suggesting a future where task-level behavior across apps and screens may matter at least as much as individual queries or clicks.
Method & source notes on Google’s intent extraction research
Google’s method is documented in two primary artifacts: an academic paper and a companion blog post, both summarized by Search Engine Journal (SEJ).
- [S3] EMNLP 2025 paper: “Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition” (PDF) - technical description of the two-stage architecture, datasets, and evaluation.
- [S2] Google Research blog: Small models, big results: Achieving superior intent extraction through decomposition - higher-level explanation and positioning of the work for on-device assistance.
- [S1] SEJ summary: Roger Montti’s article “Google’s New User Intent Extraction Method” - secondary coverage that quotes and paraphrases the paper and blog.
What was measured
- Task: extract a concise, actionable description of a user’s intent (goal) from a trajectory - a sequence of interactions in mobile apps or web UIs, captured as screenshots plus textual representations of actions (taps, clicks, text entry).[S1][S3]
- Inputs: for each interaction step, (1) the visual state of the screen (screenshot) and (2) the user’s action on that screen.[S1][S3]
- Output: an extracted intent that is faithful, comprehensive, and relevant enough to reproduce the same trajectory.[S3][S1]
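As a concrete sketch, the trajectory structure described above can be modeled as a sequence of (observation, action) steps paired with one extracted intent. The class and field names below are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class InteractionStep:
    """One step in a trajectory: what the screen showed and what the user did."""
    screenshot_path: str   # visual state of the screen at this step
    action: str            # textual representation, e.g. "tap 'Add to cart'"

@dataclass
class Trajectory:
    """An ordered sequence of UI interactions in one app or web session."""
    steps: list[InteractionStep]

@dataclass
class ExtractedIntent:
    """Concise, actionable description of the user's overall goal."""
    description: str       # e.g. "Buy a blue t-shirt"

# A toy trajectory with two steps
traj = Trajectory(steps=[
    InteractionStep("step1.png", "type 'blue t-shirt' into search box"),
    InteractionStep("step2.png", "tap 'Add to cart' on the first result"),
])
```

The point of the sketch is the shape of the data: intent extraction maps a `Trajectory` to an `ExtractedIntent`, not a single query to a label.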
Methodology
- Modeling approach: a two-stage decomposition, with per-interaction summarization via prompting followed by a fine-tuned model that infers overall intent from those summaries.[S3][S1]
- Model scale and placement: small models optimized to run on-device in mobile or browser environments, rather than large models in a data center.[S2][S3][S1]
- Data: trajectories recorded from Android and web usage for English-speaking users in the United States.[S1][S3]
Key limitations and caveats
- Geographic and language scope is narrow (US, English), which may limit generalization across markets and languages.[S1][S3]
- Testing is limited to Android and web; the paper notes that results may not transfer directly to Apple or other ecosystems.[S1][S3]
- Intent labels are inherently ambiguous; prior work cited in the paper shows human annotators only agree 80% of the time on web trajectories and 76% on mobile.[S3][S1]
- Public materials do not specify sample sizes or detailed benchmark numbers; performance is described qualitatively as “superior” to baselines.[S2][S3][S1]
- Google does not state that the method is used in Search or any production agent today; it is described as a building block for future assistive features.[S2][S1]
Findings: on-device user intent extraction from UI interactions
This section summarizes factual findings from the paper, blog, and SEJ coverage.
1. Two-stage decomposition of intent extraction
Google’s researchers split the problem into two linked tasks that small models can handle locally.[S3][S1]
Stage 1 - interaction-level summaries
- For each interaction in a trajectory, a prompt-based model produces a structured summary with at least two explicit parts:
  - a description of what is displayed on the screen, and
  - a description of the user’s action on that screen.[S3][S1]
- The prompt also asks the model to speculate about user intent, but this speculative intent field is explicitly discarded before later stages.[S3][S1]
- Allowing the model to speculate and then stripping that speculation was empirically found to improve the quality of the factual summaries of screen and action.[S3][S1]
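A minimal sketch of the Stage 1 post-processing described above, assuming the model returns a structured summary; the field names (`screen_description`, `action_description`, `speculated_intent`) are hypothetical, but the key step matches the paper: the speculative intent is dropped before Stage 2.

```python
def stage1_summary(model_output: dict) -> dict:
    """Keep only the factual parts of a Stage 1 summary.

    The prompt asks the model to describe the screen, describe the
    action, and speculate about intent; the speculation is discarded
    before the summaries reach Stage 2. Field names are assumptions.
    """
    return {
        "screen": model_output["screen_description"],
        "action": model_output["action_description"],
        # "speculated_intent" is intentionally dropped here
    }

raw = {
    "screen_description": "Product page for a blue t-shirt, size selector visible",
    "action_description": "User taps 'Add to cart'",
    "speculated_intent": "User wants to buy this t-shirt",  # discarded
}
clean = stage1_summary(raw)
```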
Stage 2 - overall intent generation
- A second model is fine-tuned to take the full sequence of interaction-level summaries and produce a single overall intent description for the trajectory.[S3][S1]
- Training pairs consist of (1) all summaries for a trajectory and (2) a ground truth description of the user’s overall intent.[S3][S1]
The researchers report that this decomposition yields better results than directly prompting or fine-tuning models to map raw trajectories to intents in a single step, especially for smaller models.[S3][S1]
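The two-stage flow can be sketched end to end as below. The `summarize_interaction` and `infer_intent` functions are placeholder stand-ins for the prompt-based summarizer and the fine-tuned intent model; their names and signatures are my assumptions, not the paper's API.

```python
def summarize_interaction(screenshot: str, action: str) -> str:
    """Stage 1 stand-in: in the real system, a small prompted model
    describes the screen and the action (with speculation stripped)."""
    return f"Screen: {screenshot}; Action: {action}"

def infer_intent(summaries: list[str]) -> str:
    """Stage 2 stand-in: in the real system, a small fine-tuned model
    maps the summary sequence to one overall intent description."""
    return "Overall intent inferred from %d summaries" % len(summaries)

def extract_intent(trajectory: list[tuple[str, str]]) -> str:
    """Decomposed pipeline: per-step summaries, then one intent."""
    summaries = [summarize_interaction(s, a) for s, a in trajectory]
    return infer_intent(summaries)

intent = extract_intent([
    ("search results for flights", "tap cheapest flight"),
    ("flight details page", "tap 'Book now'"),
])
```

The decomposition keeps each model call simple: Stage 1 sees one screen at a time, and only Stage 2 has to reason over the whole sequence, via short text summaries rather than raw screenshots.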
2. Addressing reasoning limits and hallucinations in small models
Google evaluated other techniques before settling on this two-stage flow.[S3][S1]
- Chain-of-thought (CoT) reasoning - where the model reasons step by step - was tested, but small language models struggled to maintain reliable reasoning quality across long, multi-step trajectories.[S3][S1]
- The two-stage method emulates some benefits of CoT (breaking down the problem and then reasoning over a sequence) while keeping individual steps simpler so small models can manage them on device.[S3][S1]
The team also reports an initial issue with hallucinations in Stage 2:[S3][S1]
- Because the target intent descriptions contained more detail than was present in the input summaries, the second-stage model learned to fill gaps by inventing missing details.
- To counter this, they refined the target intents by removing any detail not supported by the input summaries, which trained the model to base its output only on what was actually observed.[S3][S1]
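One way to approximate the target refinement described above is to drop any sentence of the ground-truth intent whose content words are not supported by the input summaries. This is a simplistic heuristic of my own for illustration, not the paper's actual procedure.

```python
def _words(text: str) -> set[str]:
    """Lowercase tokens with surrounding punctuation stripped."""
    return {w.strip(".,:!?'\"").lower() for w in text.split()}

def refine_target(intent_sentences: list[str], summaries: list[str]) -> list[str]:
    """Keep only intent sentences whose content words all appear in the
    summaries, so the Stage 2 model is never trained to invent details.
    A crude stand-in for the paper's (unspecified) refinement step."""
    vocab = set().union(*(_words(s) for s in summaries))
    kept = []
    for sentence in intent_sentences:
        content = [w for w in _words(sentence) if len(w) > 3]
        if all(w in vocab for w in content):
            kept.append(sentence)
    return kept

summaries = ["Screen: flight search results. Action: user taps the cheapest flight to book."]
target = ["Book the cheapest flight.", "Pay with a Visa card ending 1234."]
refined = refine_target(target, summaries)  # unsupported payment detail is dropped
```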
3. Definition and evaluation of “good” extracted intent
The work defines three properties for a high-quality extracted intent.[S3][S1]
- Faithful - only refers to events that actually occur in the trajectory.
- Comprehensive - contains all information necessary to re-enact the trajectory.
- Relevant - omits extraneous information that is not needed to understand or reproduce the trajectory.
Evaluating those properties is challenging.[S3][S1]
- Intent descriptions often include concrete details (dates, quantities, financial values), which are easy to misstate.
- User motivations are partially hidden: observers see the sequence of actions, but not the internal reason for each choice (for example, price vs features when choosing a product).[S3][S1]
- Prior work cited in the paper shows that even human labelers only match each other’s intents 80% of the time on web trajectories and 76% on mobile, underscoring that there is rarely a single undisputed “true” intent.[S3][S1]
4. Reported performance versus MLLM baselines
The authors compare their approach against:[S3][S1]
- smaller models without decomposition, and
- at least one state-of-the-art multimodal large language model (MLLM) running in a data-center setting.
They report that:[S3][S1]
- their two-stage method delivers superior performance to both those baselines,
- this advantage holds across different datasets and model types, not just for one specific configuration, and
- the method “naturally handles scenarios with noisy data that traditional supervised fine-tuning methods struggle with,” suggesting improved robustness to messy real-world interaction logs.[S3][S1]
Public summaries do not provide exact accuracy or error rates, so the magnitude of the improvement is not known from the provided text.
5. On-device processing and privacy positioning
The work is explicitly framed around on-device processing.[S2][S1]
- User interactions (screenshots and actions) are processed locally by small models, avoiding the need to send raw data back to Google servers.
- This design is presented as privacy-preserving, aligning with scenarios where legal or user expectations restrict sharing detailed behavioral data with the cloud.[S2][S1]
- The blog notes that as small models improve and devices gain more processing power, on-device intent understanding could become a building block for assistive features on mobile devices.[S2][S1]
6. Stated use cases: autonomous agents, not classic search
The paper and blog describe the method in the context of autonomous agents that observe and assist the user on the device.[S2][S3][S1]
- Proactive Assistance - an agent that monitors what the user is doing and supports enhanced personalization and improved work efficiency.[S3][S1]
- Personalized Memory - the device can recall past activities in terms of user intents, helping users return to or reuse previous workflows.[S3][S1]
Neither source ties this work directly to classic web search ranking or AI-powered search results.[S2][S3][S1]
Interpretation & implications for marketers and product leaders
This section reflects interpretation and strategic implications, not direct claims from Google.
Likely implications for how “intent” is modeled
- Shift from query-level to task-level intent (Likely): since the method models entire trajectories of interactions, Google’s research direction suggests a growing emphasis on task completion rather than single queries or clicks. For marketers, this points toward experiences designed around end-to-end tasks (for example, plan, compare, transact, follow-up) rather than isolated touchpoints.
- More context-aware personalization (Likely): if on-device agents gain a reliable picture of what users are trying to achieve across apps and screens, recommendations, reminders, and assistance can be tuned to that ongoing task. This could mean more contextually relevant prompts to re-engage with content, apps, or offers when users resume a task they previously started.
- Privacy-centric behavior modeling (Likely): because processing happens locally, fine-grained behavioral data may remain on the device, with only higher-level signals or aggregated insights ever leaving it. That could shift how platforms justify data collection and personalization, emphasizing models that work well under stricter data-sharing constraints.
Impact on search, ads, and analytics strategies
- Attribution and funnel modeling (Tentative): if similar trajectory-level intent models are applied more broadly, traditional funnel views (impression → click → conversion) may become less central than understanding multi-step tasks across properties. Marketers may benefit from mapping and measuring complete task flows within their own apps and sites, as this is closer to how Google is formalizing intent.
- Content and UX for assistive agents (Tentative): agents that help with Proactive Assistance or Personalized Memory will favor flows that are easy to observe and summarize: clear page states, unambiguous actions, and consistent UI patterns. Designs that minimize ambiguous steps or dead ends are more legible to both users and agents, increasing the chance that assistants can support the task effectively.
- Ads and targeting opportunities (Speculative): over time, trajectory-based intent signals could inform when users are early in a task, mid-decision, or completing it. If such signals are surfaced (in privacy-preserving form) to ad systems, budgeting and creative strategies might move from static audience segments toward task-stage-aware strategies. There is no explicit evidence of this integration yet.
Product and data decisions inside organizations
- Instrumentation of user journeys (Likely): the research formalizes trajectories as ordered observations and actions. Teams that already capture high-quality interaction logs (screen states, actions, timestamps) will be better positioned to experiment with similar models on their own properties.
- Localization and device diversity (Tentative): because current results are limited to US English users on Android and web, organizations should expect non-trivial work for localized versions of intent models. Markets with different UI conventions or languages may need their own training data and tuning.
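For teams that want trajectory-style logs of their own, each interaction can be recorded as a structured event pairing an observation with an action. The schema below is an illustrative sketch, not a Google specification; all field names are assumptions.

```python
import json
import time

def log_interaction(screen_id: str, action_type: str, target: str) -> str:
    """Serialize one trajectory step as a structured JSON event.

    Mirrors the paper's framing of a step as (observation, action):
    screen_id captures the observation; action_type and target capture
    the action. Schema is illustrative, not a Google specification.
    """
    event = {
        "timestamp": time.time(),     # when the step happened
        "screen_id": screen_id,       # observation: which UI state
        "action_type": action_type,   # e.g. "tap", "type", "scroll"
        "target": target,             # element or text acted on
    }
    return json.dumps(event)

record = json.loads(log_interaction("checkout_page", "tap", "Place order"))
```

Ordered events like these, grouped by session, are exactly the kind of trajectory data that trajectory-level intent models consume.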
Contradictions & gaps in Google’s intent extraction work
Lack of public quantitative benchmarks
- The paper and blog, as summarized, state that the two-stage approach beats both small-model and state-of-the-art MLLM baselines, and handles noisy data better than standard fine-tuning.[S3][S2][S1]
- No public numbers (accuracy, F1, human parity metrics) are given in the provided material, so the size of the performance gain and how close the system comes to human agreement levels are unknown.
- For marketers and analysts, this means the method’s effectiveness is directionally positive but not measurable from currently available summaries.
Unclear connection to Search and ads ecosystems
- The sources explicitly frame the work as an enabler for on-device agents, not for Search ranking or ad serving.[S2][S3][S1]
- Google has not publicly linked this method to ranking signals, quality scores, or ad relevance. Any connection between on-device trajectory intent and off-device platforms (Search, YouTube, Ads) remains speculative at this time.
Open questions on governance and user control
- The paper flags ethical risks, including agents taking actions that are not in the user’s interest, and the need for guardrails.[S3][S1]
- Public materials do not yet describe:
- how users will see, edit, or delete stored personalized memories of their intents,
- how consent for trajectory observation will be obtained or managed over time, or
- how conflicting intents (for example, multiple users on a shared device) will be handled.
These gaps matter for businesses operating in regulated sectors or strict privacy regimes, where agent-style monitoring and assistance may trigger additional compliance requirements.
Data appendix: definitions, sources, and key numbers
Key definitions
- Trajectory: a sequence of user interactions within a mobile or web application, where each step consists of an observation (screen state) and an action (user input on that screen).[S3][S1]
- Extracted intent: a text description of what the user is trying to achieve, which should be faithful, comprehensive, and relevant enough to allow someone (or an agent) to reproduce the same trajectory.[S3][S1]
Selected quantitative figures
| Metric | Web trajectories | Mobile trajectories | Source |
|---|---|---|---|
| Human agreement on intent annotations | 80% | 76% | [S3][S1] |
These agreement rates come from prior research cited in the EMNLP 2025 paper and are referenced in the SEJ summary, illustrating the inherent ambiguity in intent labeling even for human experts.[S3][S1]
Source index
- [S1] Roger Montti, “Google’s New User Intent Extraction Method,” Search Engine Journal, 2026.
- [S2] Google Research Blog, “Small models, big results: Achieving superior intent extraction through decomposition.”
- [S3] Google Research, “Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition,” EMNLP 2025 main proceedings.