Google and Included Health are moving conversational medical AI from lab simulations to a nationwide randomized trial embedded in real virtual care workflows. The program described below shows how they are building evidence on diagnostic reasoning, safety, and patient experience before broad deployment, and what that signals for healthcare and digital health businesses.
AI in virtual care randomized study: key takeaways for healthcare businesses
Executive snapshot
- Google and Included Health plan a prospective, consented, nationwide randomized controlled trial (RCT) of conversational AI within real virtual care workflows, pending IRB approval, comparing AI-assisted care with standard practice for real patients and concerns.[S1]
- The RCT is the fourth step in a sequence: (1) simulated cases, (2) patient-actor studies vs primary care physicians, (3) a single-center feasibility and safety study at Beth Israel Deaconess Medical Center, and now (4) a multisite, randomized national study.[S1][S3]
- Earlier work with the AMIE system showed clinician-level diagnostic reasoning and conversational diagnosis in simulations, where the AI matched or exceeded primary care physicians on diagnostic accuracy and conversation quality in actor-based consultations.[S1]
- Parallel work on a Personal Health Agent and Fitbit Labs pilots tests AI coaching on wearable data, medical records navigation, and visit preparation, while a wayfinding agent experiments with guiding users to better health information online.[S1]
- The program suggests a full-stack approach to AI in health: from symptom checks and information search, to virtual visits, to longitudinal management, tested through staged studies rather than immediate mass rollout.[S1]
One-line implication for marketers and operators:
Regulated healthcare AI is shifting toward drug-like evidence standards; claims about virtual care AI will increasingly need RCT-grade backing rather than demos or retrospective case studies.
Method and source notes
The main source is Google’s 3 February 2026 research blog outlining the upcoming nationwide randomized study with Included Health and summarising prior work on AMIE, the Personal Health Agent (PHA), Fitbit Labs experiments, a wayfinding AI, and a feasibility study at Beth Israel Deaconess Medical Center.[S1] This work sits within Google’s broader health initiatives and research efforts, including teams such as Google DeepMind, Google Platforms and Devices, and Google for Health.
Key referenced components:
- Nationwide virtual care RCT - Planned randomized controlled trial with consented participants recruited across the US, assessing conversational AI within real-world virtual care workflows compared with standard practice (see the illustrative randomization sketch after this list).[S1] Study design details such as exact sample size, endpoints, and stratification are not disclosed in the blog.
- AMIE diagnostic and conversational studies - Nature papers and supporting Google Research work evaluating AI’s diagnostic reasoning and its performance versus primary care physicians in simulated consultations with patient actors, as well as studies on physician-centered asynchronous oversight.[S1][S2] The blog does not state specific sample sizes or effect sizes.
- Beth Israel Deaconess feasibility trial - A single-center feasibility study (ClinicalTrials.gov NCT06911398) focused on safety, using outcome measures such as the number of times a safety supervisor intervened in response to safety concerns.[S1][S3] Results are described only as showing “strong indications of safety,” with no numeric data yet shared.
- Personal Health Agent and Fitbit Labs - Retrospective research on multimodal models that reason over wearable data (sleep, activity) and experiments such as Symptom Checker, Medical Records Navigator, and Plan for Care, used to observe how users seek symptom assessment and prepare for visits.[S1][S5]
- Wayfinding AI agent - Research agent based on Gemini models studying how conversational AI can guide users toward higher-quality health information through goal understanding and dialogue.[S1][S6]
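Because the blog does not disclose the trial's allocation scheme, the following is a purely illustrative sketch of consented 1:1 randomization to an AI-assisted versus standard virtual care arm. All names, the allocation ratio, and the seeding scheme are assumptions for illustration, not details from the study protocol.

```python
import random

ARMS = ("ai_assisted", "standard_care")  # two-arm parallel design (assumed)

def randomize(participant_id: str, consented: bool, seed: int = 2026) -> str:
    """Assign a consented participant to a study arm with equal probability.

    Illustrative only: the actual trial's allocation ratio, blocking,
    and stratification are not disclosed in the source blog.
    """
    if not consented:
        raise ValueError("Only consented participants may be randomized.")
    # Deterministic per-participant draw so assignments are reproducible.
    rng = random.Random(f"{seed}:{participant_id}")
    return rng.choice(ARMS)

print(randomize("P-0001", consented=True))
```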
Limitations
- The nationwide RCT is prospective and has not reported outcomes; all impact assessments for that study are necessarily forward-looking.
- The blog summarises several Nature and internal studies but does not provide raw data, effect sizes, or full methodologies; independent evaluation would require consulting the underlying publications.
- All results described involve Google’s systems and partners; generalisation to other vendors or models is unknown.
Findings on AI in virtual care workflows and medical reasoning
Google’s program describes a staged path from lab prototypes to real-world virtual care, with increasing complexity and external oversight at each step.[S1]
Early work used simulated cases and synthetic scenarios to test whether AMIE could perform diagnostic and management reasoning at a clinician level. Training via simulated self-play produced a model that, in tests with patient actors and synthetic cases, could match or exceed primary care physicians on diagnostic accuracy and perceived conversation quality in controlled settings.[S1] These studies were published in Nature and were framed as establishing capability in environments where ground truth and safety are easier to control.[S1][S2]
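Evaluations of this kind typically score a model's ranked differential diagnosis against a ground-truth condition per case. The sketch below shows a generic top-k accuracy metric of that style; it is not the scoring procedure from the Nature papers, which rely on expert adjudication rather than the exact string matching used here.

```python
def top_k_accuracy(differentials: list[list[str]], truths: list[str], k: int = 3) -> float:
    """Fraction of cases where the true diagnosis appears in the top-k differential.

    Illustrative only: the published AMIE evaluations use expert-adjudicated
    matching, not exact string comparison.
    """
    hits = sum(truth in ddx[:k] for ddx, truth in zip(differentials, truths))
    return hits / len(truths)

# Toy example: two simulated cases, top-3 accuracy = 0.5
cases = [["influenza", "covid-19", "strep pharyngitis"],
         ["migraine", "tension headache", "sinusitis"]]
print(top_k_accuracy(cases, truths=["covid-19", "cluster headache"], k=3))
```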
Subsequent research extended AMIE from one-off diagnostic conversations to longitudinal disease management. The system was adapted to reason over clinical guidelines and a patient’s historical data to suggest investigations and treatments over time, and to incorporate multimodal inputs such as clinical images.[S1][S4] This shifts the focus from “what is likely going on?” to “how should care progress over multiple interactions?”, which is closer to chronic care management workflows common in virtual settings.
In parallel, a physician-centered oversight model was studied. Rather than replacing clinicians, AMIE’s outputs are reviewed asynchronously by physicians who retain decision authority.[S1] This design is intended to address safety and liability concerns by keeping clinicians in the loop while testing whether AI can handle parts of history-taking, documentation, or preliminary reasoning.
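One way to picture this asynchronous oversight pattern is a hold-and-release queue in which no AI output reaches the patient until a physician signs off. The structure below is a hypothetical sketch of that pattern under the assumptions just described, not Google's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AIDraft:
    patient_id: str
    content: str            # AI-generated history summary or preliminary assessment
    released: bool = False  # only True after physician sign-off

@dataclass
class OversightQueue:
    """Hold AI outputs until a physician reviews them (illustrative sketch)."""
    pending: list[AIDraft] = field(default_factory=list)

    def submit(self, draft: AIDraft) -> None:
        self.pending.append(draft)  # the AI never releases directly to the patient

    def review(self, approve: bool) -> AIDraft | None:
        if not self.pending:
            return None
        draft = self.pending.pop(0)
        if approve:
            draft.released = True   # the physician retains final decision authority
            return draft
        return None                 # rejected drafts stay inside the clinician workflow
```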
The first real-world deployment step was a single-center feasibility study with Beth Israel Deaconess Medical Center.[S1][S3] In this specialist-care context, the main safety measure mentioned is the count of interruptions by a safety supervisor responding to concerns during AI-supported interactions.[S1] Google reports “strong indications of safety” from this study but has not yet released quantitative outcomes or secondary measures such as clinician workload, visit length, or patient understanding.
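Since supervisor interruptions are the only named safety measure, the headline feasibility statistic presumably reduces to an interruption rate per AI-supported interaction. The numbers below are hypothetical; the study has not released quantitative results.

```python
def interruption_rate(interruptions: int, interactions: int) -> float:
    """Safety-supervisor interruptions per AI-supported interaction (illustrative)."""
    if interactions == 0:
        raise ValueError("No interactions recorded.")
    return interruptions / interactions

# Hypothetical values only, chosen for illustration.
print(f"{interruption_rate(interruptions=2, interactions=150):.3f}")  # 0.013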
Beyond direct clinical reasoning, Google reports work on a Personal Health Agent that analyses wearable data such as sleep and activity to provide personalised health coaching, using a multi-agent architecture that combines roles like data scientist, clinician, and health coach.[S1][S5] Fitbit Labs pilots (Symptom Checker, Medical Records Navigator, Plan for Care) test how consumers use such tools at home when assessing symptoms or preparing for a visit.[S1] Separately, a wayfinding AI agent is studied as a conversational layer on top of web search, designed to guide users toward higher-quality health resources based on their goals and questions.[S1][S6]
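The blog names the PHA's roles (data scientist, clinician, health coach) but not its orchestration. As a hedged sketch of what role-based routing in such a multi-agent architecture could look like, the role names below come from the source while the dispatch logic, intents, and stub handlers are invented for illustration.

```python
from typing import Callable

# Role handlers are stubs; a real system would back each with a specialised model.
def data_scientist(query: str) -> str:
    return f"[analysis of wearable signals for: {query}]"

def clinician(query: str) -> str:
    return f"[medically grounded answer for: {query}]"

def health_coach(query: str) -> str:
    return f"[behavioural coaching plan for: {query}]"

ROLES: dict[str, Callable[[str], str]] = {
    "analyze": data_scientist,   # e.g. "how did my sleep trend last month?"
    "explain": clinician,        # e.g. "is a resting heart rate of 58 normal?"
    "coach": health_coach,       # e.g. "help me build a running habit"
}

def route(intent: str, query: str) -> str:
    """Dispatch a query to one role agent (illustrative routing; the actual
    PHA orchestration is not described at this level of detail)."""
    return ROLES[intent](query)

print(route("coach", "improve my sleep consistency"))
```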
The new nationwide RCT with Included Health is positioned as the first large-scale, prospective, consented randomized test of conversational AI within everyday virtual care workflows across multiple conditions and geographies.[S1] Its stated aim is to compare AI-managed patient interactions within those workflows to standard virtual care, measuring safety, utility, and experience for both patients and clinicians.
Interpretation: how AI in virtual care may affect digital health strategy and marketing
Likely implications (supported by current evidence)
The staged research plan described by Google suggests that health AI, especially when touching diagnosis or treatment decisions, will be judged less on demo quality and more on trial-grade evidence.[S1] For virtual care providers, that makes RCTs and feasibility trials part of the go-to-market path for AI features rather than optional extras. Claims about triage quality, visit efficiency, or patient satisfaction will increasingly need references to controlled or prospective studies.
The scope of Google’s work - diagnostic reasoning (AMIE), ongoing management, wearable-based coaching (PHA, Fitbit Labs), and information navigation (wayfinding AI) - indicates an intent to cover the full patient journey with different AI agents.[S1] For telehealth and digital health businesses, this points to a near-term environment where multiple AI touchpoints may surround each encounter: pre-visit symptom checking, AI-assisted intake, clinician-assist tools during the visit, and post-visit coaching.
Given the focus on physician-centered oversight and the measurement of safety supervisor interventions in the feasibility study, it is reasonable to expect that regulators and hospital buyers will ask not only “does it work?” but also “how is clinician oversight configured, and how is safety monitored in real time?”[S1][S3] From a messaging perspective, transparency about supervision models, escalation paths, and safeguards is likely to carry as much weight as accuracy metrics.
Tentative implications (plausible but awaiting RCT data)
If the nationwide RCT demonstrates that conversational AI can maintain or improve clinical quality while reducing clinician time per virtual visit, virtual care providers may restructure workflows so that AI handles more of intake, history gathering, and patient education before or after the clinician touchpoint. That could change how capacity, staffing, and pricing are framed in sales and payer discussions, but this depends on trial outcomes not yet available.[S1]
The integration of wearable data and medical records into PHA and Fitbit Labs experiments points toward richer, context-aware virtual consultations.[S1][S5] For marketers, this suggests that differentiation may come from how seamlessly a service connects AI across devices, records, and visits, rather than from any one chatbot feature. However, real-world user engagement and adherence for these tools have not yet been reported at scale.
Speculative implications (strategy signals, not yet evidenced)
Should RCTs validate safety and effectiveness, payers may begin to reimburse AI-supported virtual visits differently from traditional telehealth, for example by creating new billing codes or outcome-based contracts. Vendors that can show high-quality prospective data could have stronger positions in payer negotiations and enterprise sales. This is a strategic possibility rather than a documented trend, as no such commercialisation details are included in the current sources.[S1]
Contradictions, gaps, and open questions in AI for virtual care
The program as presented puts heavy weight on safety and diagnostic reasoning, but current public summaries leave several gaps.
- Outcome measures for the national RCT are not yet known. The blog states that the study will assess “helpfulness and safety” in virtual care workflows compared to standard practice.[S1] It does not specify whether primary outcomes will be diagnostic accuracy, patient satisfaction, time savings, downstream utilisation, cost, or clinical outcomes. Different choices would affect how persuasive the results are for different stakeholders.
- Generalisability beyond partners and conditions is unclear. All described real-world work so far involves Google’s systems, Fitbit devices, Included Health, and Beth Israel Deaconess Medical Center.[S1][S3][S5] It is unclear how models perform in health systems with different EHRs, patient populations, or documentation norms, or in low-resource settings.
- Equity and bias measures are not detailed. The blog does not describe subgroup analyses (for example by age, language, race or ethnicity, or insurance status) or efforts to mitigate biased performance.[S1] For marketing and product strategy, this is relevant because many health organisations now require documented fairness and accessibility assessments before procurement.
- Patient trust and adoption metrics are not reported. While the wayfinding and Fitbit Labs efforts focus on user interactions, the sources do not yet describe sustained engagement, opt-out rates, or comparative satisfaction versus human-only care.[S1][S5][S6] These factors will influence whether AI features are framed as core service elements or optional add-ons.
- Comparisons with non-Google AI systems are absent. The Nature evaluations compare AMIE to primary care physicians in simulations, not to other commercial or open-source medical models.[S1][S2] Buyers weighing multiple vendors will likely look for head-to-head or at least benchmark-style assessments; such data is not yet public in the described work.
Until the nationwide RCT reports results, any claims about business impact on visit volumes, costs, or outcomes remain hypothetical. For now, the main concrete signal is the methodological bar being set - multisite randomized trials, staged feasibility work, and clear safety monitoring - rather than specific performance numbers.
Data appendix: key AI health studies and initiatives referenced
[S1] Schaekermann M, Chen C. “Collaborating on a nationwide randomized study of AI in real-world virtual care.” Google Research Blog, 3 Feb 2026.
Describes a forthcoming nationwide, consented, randomized controlled trial of conversational AI within Included Health virtual care workflows, plus summaries of prior AMIE, PHA, Fitbit Labs, wayfinding AI, and Beth Israel feasibility studies.
[S2] AMIE diagnostic reasoning and conversational studies (Nature 2025, Google Research blogs 2024-2025).
Evaluated AI’s diagnostic reasoning and conversational performance, including comparisons with primary care physicians in simulated consultations using patient actors, and studies of AI as an assistive tool for clinicians.[S1]
[S3] Beth Israel Deaconess feasibility trial - ClinicalTrials.gov NCT06911398.
Single-center, specialist-care feasibility study testing conversational AI in real clinical settings, with safety supervisor interruptions as a main safety outcome. Reported as showing “strong indications of safety,” with full results pending.[S1]
[S4] “From diagnosis to treatment: advancing AMIE for longitudinal disease management.” Google Research.
Extends AMIE to reason over clinical guidelines and longitudinal patient history for disease management, and adds multimodal reasoning such as image interpretation.[S1]
[S5] “The anatomy of a Personal Health Agent” and Fitbit Labs experiments (Symptom Checker, Medical Records Navigator, Plan for Care).
Retrospective research on multi-agent, multimodal models that analyse wearable and personal health data to provide personalised coaching and help users prepare for clinical encounters.[S1]
[S6] “Towards better health conversations: research insights on a wayfinding AI agent based on Gemini.” Google Research.
Describes a conversational agent that guides users toward better-quality health information through goal-aware dialogue, informing design choices for patient-facing health search and triage experiences.[S1]