
Inside Google's AMIE Study: What 100 Real Patients Revealed About Diagnostic AI Safety

Reviewed by: Andrii Daniv
14 min read · Mar 12, 2026

This report summarizes early real-world results from Google's AMIE conversational diagnostic AI, used for pre-visit history taking in an ambulatory primary care clinic, and highlights what those results suggest about safety, patient trust, and workflow fit for healthcare and AI product teams.

Real-World Feasibility of AMIE: 100-Patient Conversational Diagnostic AI Study in Primary Care

Executive snapshot: conversational diagnostic AI in primary care

  • 100 adult patients completed a pre-visit AMIE chat; 98 went on to see a primary care provider (PCP). All interactions were supervised in real time by a physician, and zero safety stops were triggered across four predefined safety criteria.[S1]
  • In blinded review, clinicians rated AMIE and PCPs as similar on overall differential diagnosis (DDx) and management plan quality, with no significant differences in DDx quality or management appropriateness and safety. PCPs scored higher on practicality and cost-effectiveness of management plans.[S1]
  • Measured against chart-reviewed final diagnoses eight weeks later, AMIE's DDx contained the final diagnosis in 90% of cases (its top-7 coverage), with 75% top-3 and 56% top-1 accuracy.[S1]
  • Patient attitudes toward AI, measured with the General Attitudes towards AI Scale, improved significantly after the AI interaction and remained elevated after the clinician visit; patients reported high satisfaction with politeness and clarity.[S1][S5]
  • Clinicians reported that AMIE's transcripts were useful for visit preparation, and interviews suggested visits shifted from "data gathering" to "data verification" and shared decision-making.[S1]

Implication for marketers: supervised conversational diagnostic AI can be framed credibly as a pre-visit intake and reasoning aid that fits clinical workflows without measurable harm to safety or trust in this early deployment, rather than as a clinician replacement.

Method and source notes for the conversational diagnostic AI study

Google Research, Google DeepMind, and Beth Israel Deaconess Medical Center ran a prospective, single-arm, single-center feasibility study of AMIE in an urgent care clinic within an academic ambulatory primary care setting.[S1] The study was IRB-approved and pre-registered on ClinicalTrials.gov (NCT06911398).[S1][S6] Patients were adults presenting with new, non-emergency episodic complaints, either in person or via telehealth.[S1]

Key design elements:[S1]

  • Sample: 100 adults completed AMIE chats; 98 attended the subsequent PCP visit.
  • Interaction mode: Text-only chat via secure web link before the visit.
  • Oversight: A physician ("AI supervisor") observed each chat on live video with screen-sharing and could trigger a safety stop under four criteria: harm to self or others, significant emotional distress related to the AI, potential for clinical harm, or explicit patient request to stop.
  • Outputs to clinicians: AMIE produced a transcript and summary before the PCP visit; with patient consent, these were provided to the PCP as pre-visit information.
  • Evaluation: Three independent clinical evaluators (not the treating PCPs) reviewed DDx lists and management (Mx) plans from AMIE and PCPs in a blinded, randomized fashion; quality scores per case were aggregated using medians across the three evaluators (see the sketch after this list).[S1]
  • Outcome definition: Final diagnoses were determined via chart review eight weeks after the encounter, classed as presumptive (PCP judgment) or confirmatory (test or specialist).[S1]
  • Attitudes: Patient attitudes to AI were measured at three timepoints using the General Attitudes towards AI Scale (GAAIS).[S1][S5]
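
To make the median aggregation concrete, below is a minimal sketch of per-case scoring, assuming hypothetical case IDs, rating values, and field names (the study's actual data schema is not published in the blog).

```python
from statistics import median

# Hypothetical per-case quality ratings (e.g., DDx quality on an ordinal
# scale) from three independent, blinded evaluators. All case IDs, values,
# and field names here are illustrative, not taken from the study's data.
case_ratings = {
    "case_001": {"amie": [4, 5, 4], "pcp": [5, 4, 4]},
    "case_002": {"amie": [3, 4, 4], "pcp": [4, 4, 5]},
}

# Reduce each case to the median across the three evaluators, mirroring
# the study's stated approach of aggregating per-case scores by median.
case_scores = {
    case_id: {source: median(scores) for source, scores in by_source.items()}
    for case_id, by_source in case_ratings.items()
}

print(case_scores)
# {'case_001': {'amie': 4, 'pcp': 4}, 'case_002': {'amie': 4, 'pcp': 4}}
```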

Limitations and caveats:[S1]

  • Single center, 100-patient sample - limits generalizability across health systems and geographies.
  • No control arm - no direct quantitative comparison to a "no-AI" workflow.
  • Text-only AMIE - no physical exam, vitals, imaging, or EHR context; PCPs had richer information.
  • Supervised use only - every interaction had a dedicated physician supervisor, which may not reflect scalable deployment.
  • Participation bias - study patients skewed younger than the clinic's overall urgent care population.


Findings on AMIE's real-world performance and user perceptions

The study moves AMIE from simulated environments (diagnostic vignettes and trained patient actors)[S2][S3] into direct patient use in clinical workflow, under supervision.[S1] This section summarizes what actually happened: who used the system, how safe it was, how well it reasoned, and how patients and clinicians perceived it.

Study sample, clinical workflow, and patient characteristics

Patients were recruited during appointment booking for urgent, non-emergency primary care visits at Beth Israel Deaconess Medical Center.[S1] They were informed that participation was voluntary and would not affect care.

Workflow:[S1]

  • Patient books an urgent care slot.
  • Before the visit, the patient completes a text chat with AMIE via secure link, supervised on live video by a physician.
  • AMIE produces a summary and transcript.
  • The treating PCP receives this material before seeing the patient.
  • After the visit, independent clinical evaluators review AMIE and PCP outputs, and charts are reviewed eight weeks later for final diagnosis.

Sample characteristics:[S1]

  • 100 adults completed AMIE; 98 attended their scheduled visit.
  • The study cohort skewed younger than the clinic's full urgent care volume, in which over half of visits during the study period were by patients aged 60+.
  • Demographics (gender, race and ethnicity) of participants were generally consistent with the clinic's broader urgent care population, which skewed female and white.
  • Participants varied in health literacy, technology literacy, language, and prior chatbot exposure, based on survey measures.[S1]

The blog summary does not give exact percentages for literacy levels or chatbot use, but reports that the sample covered a range across these dimensions.

Safety and feasibility of supervised conversational diagnostic AI

Safety was defined using four specific criteria that would trigger a safety stop by the supervising physician: immediate risk of harm to self or others, marked emotional distress tied to the AI interaction, potential for clinical harm detected by the supervisor, or a patient request to terminate the session.[S1]
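
As an illustration only (the study describes human judgment by the supervising physician, not software gating, and publishes no tooling details), the four criteria can be thought of as a simple any-of check:

```python
from enum import Enum, auto

class StopCriterion(Enum):
    """The four predefined safety-stop criteria from the study protocol."""
    HARM_TO_SELF_OR_OTHERS = auto()
    EMOTIONAL_DISTRESS_FROM_AI = auto()
    POTENTIAL_CLINICAL_HARM = auto()
    PATIENT_REQUESTED_STOP = auto()

def should_stop(observed: set[StopCriterion]) -> bool:
    # Any single criterion is sufficient to halt the session. In the study
    # this call was the supervising physician's judgment, not code; this
    # function is purely illustrative.
    return bool(observed)

# Across all 100 interactions the observed set stayed empty: zero stops.
assert should_stop(set()) is False
assert should_stop({StopCriterion.PATIENT_REQUESTED_STOP}) is True
```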

Across all 100 AMIE-patient interactions:[S1]

  • 0 safety stops were required.
  • The summary reports no events during AI-patient interactions that met the predefined safety thresholds.

AMIE operated purely via text, without access to records, test results, or physical examination. Supervisors were present on a live video call but intervened only if needed; none of the interactions crossed the defined safety thresholds.[S1]

The authors interpret this as evidence that, under these narrowly defined conditions (adult, non-emergency, supervised, academic primary care clinic), AMIE was conversationally safe and feasible to deploy in real workflow.[S1]

Clinical reasoning: diagnostic accuracy and management quality

Clinical performance was evaluated through blinded review of AMIE and PCP outputs for each case.[S1] Three separate clinicians rated, for each source (AI and PCP):

  • Overall DDx quality
  • Management plan appropriateness and safety
  • Practicality of the plan
  • Cost-effectiveness

Key findings:[S1]

  • Overall quality: AMIE and PCPs were rated on par for overall DDx and Mx plan quality. The study reports no statistically significant differences in DDx quality or in the appropriateness and safety of Mx plans.
  • Practicality and cost: PCPs outperformed AMIE on practicality and cost-effectiveness of management plans. This is attributed to PCPs having EHR access, physical exams, and local system experience, none of which AMIE had.

Diagnostic accuracy vs final chart diagnosis:[S1]

  • AMIE's DDx included the final diagnosis in 90% of cases.
  • Top-3 accuracy: 75% of cases had the final diagnosis within the top three AI hypotheses.
  • Top-1 accuracy: 56% of cases had the final diagnosis as AMIE's single most likely diagnosis.
  • Top-7 coverage: AMIE listed the eventual diagnosis within its top seven hypotheses in 90% of cases.
  • In a subset of 46 patients whose final diagnosis was confirmed by tests or specialist input, AMIE's accuracy remained high, though the blog does not report exact percentages.
  • Accuracy was higher for cases where the final diagnosis was presumptive (PCP judgment without confirmatory testing) than for those requiring objective confirmation, but performance remained strong in both groups.[S1]

These figures describe how often AMIE's reasoning overlapped with eventual clinician judgment and test results, not whether use of AMIE improved those outcomes compared with usual care (there was no control arm).
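
The top-k figures above are plain coverage statistics over ranked DDx lists. Here is a minimal sketch of how such metrics are computed, using made-up toy cases (the diagnosis names and rankings are illustrative, not study data):

```python
def top_k_accuracy(ranked_ddx: list[list[str]], finals: list[str], k: int) -> float:
    """Fraction of cases whose final diagnosis appears in the top k of the ranked DDx."""
    hits = sum(final in ddx[:k] for ddx, final in zip(ranked_ddx, finals))
    return hits / len(finals)

# Toy data: three cases, each with a ranked differential diagnosis list.
ddx_lists = [
    ["viral URI", "allergic rhinitis", "sinusitis"],
    ["GERD", "gastritis", "peptic ulcer"],
    ["ankle sprain", "fracture", "tendinopathy"],
]
final_dx = ["sinusitis", "GERD", "fracture"]

print(top_k_accuracy(ddx_lists, final_dx, k=1))  # ~0.33: only GERD is ranked first
print(top_k_accuracy(ddx_lists, final_dx, k=3))  # 1.0: all finals fall within top 3
```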

Trust, satisfaction, and experience for patients and clinicians

Patients completed the General Attitudes towards AI Scale (GAAIS) before using AMIE, immediately after the AMIE interaction, and after seeing their PCP.[S1][S5] The scale measures perceived utility and concerns, which combine into an overall attitude score.
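
The blog does not state which statistical test the authors used for the pre/post comparison; as a hedged sketch, one common choice for paired scale data is a Wilcoxon signed-rank test, shown here on simulated scores (all values are fabricated for illustration):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Simulated paired GAAIS overall scores for 100 patients. The study's
# actual scale values were not reported in the blog summary; these
# numbers are fabricated purely to illustrate the paired comparison.
pre_ai = rng.normal(3.4, 0.5, size=100)
post_ai = pre_ai + rng.normal(0.3, 0.4, size=100)  # simulated positive shift

# Wilcoxon signed-rank test on the paired differences (one reasonable
# choice for paired ordinal/scale data; not necessarily the study's test).
stat, p_value = wilcoxon(post_ai, pre_ai)
print(f"W = {stat:.1f}, p = {p_value:.3g}")
```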

Findings on attitudes and satisfaction:[S1]

  • Attitudes towards AI became more positive after the AMIE chat, with statistically significant improvement in:
    • Perceived utility
    • Concerns (reduced or remained around neutral)
    • Overall attitude score
  • These more positive attitudes persisted after the clinician visit, indicating that the AI experience did not erode trust once the patient saw a human provider.
  • Surveys and interviews reported high patient satisfaction, with frequent comments that AMIE was polite and explained medical issues clearly.

Clinician experience:[S1]

  • PCPs reviewing AMIE summaries and transcripts generally found them useful for visit preparation.
  • In interviews, PCPs reported that AMIE helped shift the visit from primarily collecting information toward verifying details and engaging in shared decision-making.
  • Conversation quality ratings from both patients and clinical evaluators were largely in the most favorable categories across domains such as eliciting information, explaining conditions, and managing concerns, based on the graphs provided.[S1]

These findings align with earlier simulated AMIE studies where clinicians often preferred AI-assisted interactions on empathy and communication metrics, though those earlier results come from standardized scenarios rather than real patients.[S2][S3]

Interpretation and implications for strategy and product positioning

This section contains interpretation beyond the raw study results. Certainty labels reflect how directly the data support each point.

Likely: supervised conversational diagnostic AI can be safely positioned as pre-visit intake in defined use cases

Interpretation (Likely): Given zero safety stops and no reported adverse conversational events in 100 supervised interactions, it is reasonable to infer that pre-visit, text-based diagnostic conversations with adult, non-emergency patients can run safely under real-time clinician oversight in settings similar to BIDMC.[S1]

For healthcare product and marketing teams, this supports a positioning of conversational diagnostic AI as:

  • A pre-visit history and reasoning layer in urgent or episodic primary care, under explicit clinician oversight
  • Limited initially to non-emergency complaints and adults, consistent with the study scope
  • An input that generates structured summaries and DDx lists to help clinicians start visits better prepared

Messaging that emphasizes "augments clinician assessment" rather than "autonomous diagnosis" is most consistent with the evidence. Claims about improvement in accuracy, outcomes, or cost would require controlled comparison data that this study does not provide.

Likely: well-designed AI interactions can improve patient attitudes and reduce adoption friction

Interpretation (Likely): The statistically significant improvement in GAAIS scores after using AMIE, with sustained gains after the PCP visit, indicates that direct experience with a carefully designed, supervised medical AI can move patient attitudes towards greater acceptance.[S1][S5]

For marketers and UX leads, this suggests:

  • First-use experience quality matters: politeness, clear explanations, and visible connection to a clinician context likely contribute to improved trust.
  • Supervision and consent are features, not just safeguards: explicitly showing that a physician supervises the AI interaction may reassure users and reduce fear of "AI replacing doctors."
  • Trials and pilots can function as trust-building campaigns: structured pilots that measure attitudes pre- and post-use, as in this study, can help health systems demonstrate to internal stakeholders that patient trust does not necessarily fall when AI is introduced.

Any claims about long-term trust should remain cautious; the study captures short-term attitude changes only.

Tentative: workflow impact and value proposition for providers

Interpretation (Tentative): PCPs reported that AMIE shifted visits from gathering information to verifying and jointly discussing plans and found AI-generated summaries helpful for preparation.[S1] Combined with AMIE's DDx overlap with final diagnoses (90% coverage, 56% top-1), this points toward a value proposition where AI:

  • Front-loads history collection and problem structuring, so clinicians can allocate more visit time to interpretation and communication
  • Provides a "second set of eyes" on likely diagnoses, which clinicians can accept, refine, or reject based on their exam and context

For provider-facing messaging and sales, that implies emphasizing:

  • Time reallocation, not time savings: the study does not report shorter visits, but describes a change in how visit time is used.
  • Support in documentation and reasoning: AMIE produced usable summaries without access to EHR data, suggesting potential to reduce manual note-taking if integrated with clinical systems in future iterations.

Because the study lacks quantitative measures on visit length, throughput, or clinician burnout, any claims about efficiency gains should currently be framed as hypotheses to be tested, not proven benefits.

Tentative: constraints around practicality and cost highlight product design priorities

Interpretation (Tentative): PCPs outperformed AMIE on practicality and cost-effectiveness of management plans.[S1] The likely reasons are straightforward: clinicians know local formularies, service availability, insurance constraints, and patient context; AMIE in this study had no EHR access, no exam findings, and no system-level cost inputs.

For product development and positioning, this suggests:

  • Current conversational diagnostic AI is better framed as a clinical reasoning aid than as a care-pathway optimizer.
  • Claims about cost savings should wait for tools that:
    • Incorporate local service availability and cost data
    • Integrate tightly with EHRs and care pathways
    • Are tested in controlled trials that track resource use

In marketing materials, over-promising on efficiency or cost optimization risks conflict with the evidence. Clear framing as "suggests medically reasonable options; human clinician adapts to practical constraints" is more defensible.

Speculative: user segmentation and interface strategy for broader populations

Interpretation (Speculative): The study sample skewed younger than the clinic's overall urgent care population and included a mix of tech and health literacy levels.[S1] Older patients, who account for a large share of healthcare use, were under-represented relative to their proportion of visits. This might reflect recruitment patterns, comfort with technology, or other unmeasured factors.

Speculative implications for market strategy:

  • Adoption may lag among older and low-tech-literacy groups, so relying on early pilot data to forecast uptake could overestimate real usage in core high-need segments.
  • Adding voice interfaces, simpler flows, or caregiver-assisted modes may be important to reach those segments; the current study's text-only chat does not test that.
  • Health systems rolling out such tools may need targeted education campaigns and support for less tech-confident cohorts.

Because the study did not analyze outcomes by literacy or age strata in detail (at least in the blog summary), these points should be treated as design hypotheses, not confirmed patterns.

Contradictions and gaps in the current evidence

Several important questions remain open, and some findings limit how far businesses can generalize from this study.

  • No counterfactual: Without a control group, it is unknown whether AMIE improved diagnostic accuracy, time to diagnosis, visit length, or patient outcomes compared with standard intake methods.[S1]
  • Short follow-up horizon: The eight-week chart review window may miss longer-term diagnostic revisions or late-presenting conditions.[S1]
  • Generalizability: Results come from a single academic center and English-dominant context; behavior in community clinics, other countries, or different languages is not established.
  • Supervision load and scalability: Every interaction had a dedicated physician supervisor. The operational feasibility and safety profile of asynchronous or lighter-touch oversight remain under study in separate work.[S1]
  • Unmeasured operational metrics: The study does not report impacts on visit duration, clinician workload, documentation burden, or no-show rates - metrics that matter directly for ROI calculations.
  • Equity and subgroups: While the sample spanned multiple demographic groups, the blog summary does not detail performance or satisfaction by race, language, literacy, or prior chatbot exposure. Equity and bias questions remain largely unanswered.
  • Clinical severity boundaries: The study is limited to non-emergency episodic complaints; performance and safety in more complex chronic or high-acuity scenarios are unknown.

For decision-makers, these gaps mean the study should be read as evidence of feasibility and short-term acceptability, not as proof of outcome improvement or cost savings.

Data appendix: key quantitative results from the AMIE clinical feasibility study

(All figures are as reported in the Google Research blog and accompanying paper summary.[S1])

Sample and flow

  • 100 adult patients completed AMIE chat.
  • 98 of those attended their urgent care visit.
  • 1,452 urgent care visits occurred during the study period overall.

Safety

  • 4 predefined safety criteria (self or other harm, emotional distress from AI, potential clinical harm, patient request to stop).
  • 0 safety stops triggered.

Diagnostic performance vs final diagnosis

  • 90% of cases: final diagnosis included somewhere in AMIE DDx.
  • 75% of cases: final diagnosis in AMIE's top-3 diagnoses.
  • 56% of cases: final diagnosis ranked #1 by AMIE.
  • 90% of cases: final diagnosis within AMIE's top-7 diagnoses.
  • Subset of 46 test- or specialist-confirmed cases: accuracy described as "high," exact percentages not reported.

Comparative quality vs PCPs (blinded evaluator ratings)

  • Overall DDx quality: AMIE ≈ PCPs (no significant difference).
  • Mx plan appropriateness and safety: AMIE ≈ PCPs (no significant difference).
  • Mx plan practicality: PCPs > AMIE.
  • Mx plan cost-effectiveness: PCPs > AMIE.

Attitudes and experience

  • GAAIS overall scores: statistically significant improvement from pre-AI to post-AI, maintained post-PCP (exact scale values not reported).[S1][S5]
  • Patient and evaluator ratings of conversation quality: majority in the most favorable categories across eliciting information, explaining conditions, and managing concerns.[S1]

This work reflects a collaboration between Google Research, Google DeepMind, and clinical partners at Beth Israel Deaconess Medical Center, part of Beth Israel Lahey Health, within the broader context of health AI initiatives such as Google for Health.

Author
Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for the etavrian.com website.
Reviewed by
Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e-commerce businesses.