AfriMed‑QA is a pan‑African benchmark for medical question answering that combines clinician exam items and consumer health queries to test how large language models perform beyond Western contexts. Its protocol measures accuracy for multiple‑choice and short‑answer tasks and adds blinded human ratings for safety and localization. This brief distills what is new, how it was measured, and where it matters for health AI products and go‑to‑market plans.
AfriMed‑QA Executive Snapshot
AfriMed‑QA is an open evaluation dataset for African health question answering with expert‑labeled items and consumer questions, plus tooling for side‑by‑side model comparison. Built by the AfriMed‑QA consortium with contributions from 60+ medical schools and community partners, it has been used to evaluate 30 general and biomedical LLMs as of May 2025. See the ACL 2025 Paper and the public Benchmark Datasets for details.
- Scope: ~15,000 Q&A items - 4,000+ expert MCQs, 1,200+ short‑answer questions with references, and ~10,000 consumer queries; 621 contributors; 60+ medical schools; 12 to 16 countries; 32 specialties.
- Scale effects: Larger models were more accurate than smaller models on AfriMed‑QA; general‑purpose LLMs outperformed biomedical‑tuned peers of similar size in this setting.
- Human ratings (n=3,000 items): In blinded tests on consumer health questions, consumers and clinicians rated frontier LLM responses as more complete, informative, and relevant than clinician answers, with fewer omissions and hallucinations reported by raters.
- Localization: Raters judged whether answers were locally appropriate, alongside safety axes adapted from MedLM work - inaccuracy, omission, demographic bias, and potential harm. See the MedLM paper for the rubric.
- Openness and impact: Dataset and evaluation code are public, enabling model submissions and comparison. The work has informed training and evaluation of newer medical models such as MedGemma.
Implication for marketers: General LLMs paired with localized evaluation and human review can meet consumer information needs in African health contexts, but model choice, governance, and deployment constraints require region‑specific testing. For broader deployment guidance, see the brief from PATH and the Gates Foundation.
Method and sources: AfriMed‑QA data and evaluations
AfriMed‑QA combines clinician exam items and consumer‑style questions to measure correctness, completeness, and contextual appropriateness of LLM outputs in African healthcare. Google Research and the AfriMed‑QA consortium collected and curated data; analysis spans automatic accuracy metrics and blinded human preference and safety ratings. Full methods appear in the ACL 2025 Paper.
- What was measured: MCQ accuracy (single‑label match), SAQ similarity (semantic similarity and sentence overlap against reference answers), and human ratings on correctness, localization, omissions, hallucinations, demographic bias, and potential harm for a 3,000‑item subset (see the metric sketch after this list).
- Who and when: Consortium partners across Africa; experiments reported as of May 2025. The work was presented at ACL 2025 and received the Best Social Impact Paper Award.
- Sample size: ~15,000 total items - 4,000+ MCQs, 1,200+ SAQs, and ~10,000 consumer queries; 621 contributors from 60+ medical schools across 12 to 16 countries; 32 specialties.
- Methods: Web platform sourcing with blinded review; rater scale 1 to 5 per axis; raters evaluated mixed sets of model and human responses without source labels.
- Tools: Public dataset on Hugging Face (Benchmark Datasets) and open‑source evaluation code on GitHub (AfriMed‑QA Evaluation Code); a loading sketch follows this list.
- Key limitations: English‑only text in the current release; more than 50% of expert MCQs sourced from Nigeria; expansion to non‑English languages and multimodal content is in progress.
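A minimal sketch of the two automatic metrics described above - single‑label MCQ accuracy and an embedding‑based SAQ similarity proxy. The helper names and the embedding model are illustrative assumptions, not the consortium's scoring code; see the AfriMed‑QA Evaluation Code repository for the reference implementation.

```python
# Illustrative scoring sketch - not the official AfriMed-QA evaluation code.
# Assumes sentence-transformers is installed; the embedding model is a placeholder choice.
from sentence_transformers import SentenceTransformer, util

def mcq_accuracy(predictions: list[str], answer_keys: list[str]) -> float:
    """Single-label match: fraction of MCQ items where the predicted option equals the key."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answer_keys))
    return correct / len(answer_keys)

def saq_similarity(predictions: list[str], references: list[str],
                   model_name: str = "all-MiniLM-L6-v2") -> float:
    """Mean cosine similarity between model answers and references (semantic-overlap proxy)."""
    model = SentenceTransformer(model_name)
    pred_emb = model.encode(predictions, convert_to_tensor=True, normalize_embeddings=True)
    ref_emb = model.encode(references, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(pred_emb, ref_emb).diagonal()  # one similarity per (prediction, reference) pair
    return float(sims.mean())
```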
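A quick‑start sketch for pulling the benchmark from Hugging Face. The dataset identifier below is a placeholder assumption; confirm the exact ID, configuration names, and splits on the Benchmark Datasets card before use.

```python
# Illustrative loader - dataset ID and split names are assumptions; check the
# Benchmark Datasets card on Hugging Face for the exact identifiers.
from datasets import load_dataset

DATASET_ID = "intelligence-lab/AfriMed-QA"   # placeholder ID - verify on the dataset card

# If the benchmark exposes several configurations (e.g. MCQ vs consumer queries),
# pass the configuration name as the second argument to load_dataset.
ds = load_dataset(DATASET_ID)
print(ds)                      # lists split names and column schema
first_split = next(iter(ds))   # DatasetDict iterates over split names
print(ds[first_split][0])      # inspect one item: question text, options, answer key, etc.
```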
Findings on LLM performance in African health question answering
AfriMed‑QA isolates how models handle distribution shifts that matter in African contexts - disease prevalence, guideline differences, drug nomenclature, and resource availability - by using locally derived exam items and consumer questions. The suite also introduces human ratings that explicitly score localization and potential harm, which are not emphasized in older medical QA corpora such as USMLE MedQA.
Model performance patterns
- Scale: Larger general models achieved higher MCQ accuracy and stronger SAQ similarity than smaller models, indicating benefits from parameter count and training breadth.
- Domain specialization: General LLMs outperformed comparably sized biomedical‑tuned models in this benchmark, potentially due to overfitting in domain‑specific models or size limits in the open biomedical checkpoints evaluated.
- Deployment trade‑offs: The scale advantage conflicts with the on‑device or edge deployment needs common in low‑resource settings, where smaller models are preferred; compression or distillation approaches will need to preserve localization and context handling.
- Evaluation coverage: Thirty general and biomedical models, both open and closed, were tested with consistent scoring. MCQ accuracy and SAQ semantic overlap were complemented by blinded human ratings for consumer Q&A.
Human ratings and safety signals
- Blinded preference: In a 3,000‑item study, consumers and clinicians rated frontier LLM answers to consumer questions as more complete, informative, and relevant than clinician‑authored answers, with fewer omissions and hallucinations reported for LLM outputs on these items.
- Safety axes: Ratings adapted from the MedLM paper covered inaccuracy, omission, demographic bias, and potential harm; clinicians also judged localization appropriateness. A 1 to 5 scale captured degree of adherence, with raters blind to source (see the aggregation sketch after this list).
- Localization: The evaluation checks alignment with local standards, drug availability, and care pathways - factors often absent in Western‑centric medical QA sets.
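A hedged sketch of how per‑axis rater scores on the 1 to 5 scale might be aggregated and compared by response source once the blind is lifted. The column names and toy values are illustrative, not the consortium's schema or results.

```python
# Illustrative aggregation of blinded human ratings - column names and values are assumed.
import pandas as pd

AXES = ["inaccuracy", "omission", "demographic_bias", "potential_harm", "localization"]

def summarize_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Mean 1-5 score per axis, grouped by (unblinded) response source, e.g. 'llm' vs 'clinician'."""
    return ratings.groupby("response_source")[AXES].mean().round(2)

# Toy usage with made-up scores:
toy = pd.DataFrame({
    "response_source": ["llm", "clinician", "llm", "clinician"],
    "inaccuracy": [5, 4, 4, 4],
    "omission": [5, 3, 4, 4],
    "demographic_bias": [5, 5, 5, 5],
    "potential_harm": [5, 4, 5, 4],
    "localization": [4, 5, 4, 5],
})
print(summarize_ratings(toy))
```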
Interpretation and implications for health AI and marketing
- Likely: For consumer‑facing health information services in African markets, well‑configured general LLMs can produce responses that users rate as more comprehensive and relevant than clinician‑written short replies, provided there is governance for medical safety, localization checks, and escalation pathways.
- Likely: Product teams should validate any model on region‑specific data before deployment; specialization alone does not guarantee superior performance under local distribution shifts.
- Likely: Budget for human review remains necessary. Use AfriMed‑QA‑style axes - accuracy, omission, harm, bias, and localization - as acceptance criteria and monitoring targets across languages and specialties (a monitoring sketch follows this list).
- Tentative: Given scale advantages, consider a hybrid architecture - server‑side inference for complex queries and smaller distilled models for triage or on‑device tasks - to balance quality, latency, and cost in bandwidth‑constrained settings (see the routing sketch after this list).
- Tentative: Content strategy for SEO and owned channels can prioritize consumer Q&A aligned to prevalent local conditions and terminology. Check outputs for drug naming variants and local referral guidance to avoid misinformation.
- Speculative: As the dataset expands to non‑English and multimodal items, expect shifts in model rankings. Teams planning multilingual launches should reserve capacity for re‑evaluation by language and modality.
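One way to operationalize the acceptance‑criteria idea above: a small gate that flags any evaluation batch whose mean axis scores fall below agreed floors. The thresholds and axis names are placeholders a deploying team would set, not values from the paper.

```python
# Illustrative acceptance gate - thresholds are hypothetical, chosen by the deploying team.
ACCEPTANCE_THRESHOLDS = {        # minimum acceptable mean score per axis (1-5 scale)
    "accuracy": 4.0,
    "omission": 4.0,
    "potential_harm": 4.5,
    "demographic_bias": 4.5,
    "localization": 4.0,
}

def passes_acceptance(mean_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (pass/fail, list of failing axes) for a batch of rated responses."""
    failing = [axis for axis, floor in ACCEPTANCE_THRESHOLDS.items()
               if mean_scores.get(axis, 0.0) < floor]
    return (not failing, failing)

ok, failures = passes_acceptance({"accuracy": 4.3, "omission": 3.8, "potential_harm": 4.7,
                                  "demographic_bias": 4.9, "localization": 4.2})
print(ok, failures)   # False ['omission'] -> block release or trigger human review
```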
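A minimal sketch of the hybrid‑routing idea in the tentative bullet above: send queries judged complex or risky to a server‑side frontier model and keep simple triage on a smaller local model. The complexity heuristic, keyword list, and model handles are assumptions for illustration only.

```python
# Illustrative query router - the heuristic and model handles are placeholders.
RISK_KEYWORDS = {"dosage", "overdose", "pregnan", "chest pain", "bleeding", "seizure"}

def looks_complex(query: str) -> bool:
    """Crude heuristic: long queries or ones touching risk keywords go to the larger model."""
    q = query.lower()
    return len(q.split()) > 40 or any(k in q for k in RISK_KEYWORDS)

def route(query: str, small_model, frontier_model) -> str:
    """small_model / frontier_model are any callables mapping a prompt to a response string."""
    target = frontier_model if looks_complex(query) else small_model
    return target(query)

# Toy usage with stub models:
answer = route("What malaria prophylaxis dosage is safe in pregnancy?",
               small_model=lambda q: "small-model reply",
               frontier_model=lambda q: "frontier-model reply")
print(answer)  # routed to the frontier model because of risk keywords
```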
Contradictions and gaps
- Country count discrepancy: Materials cite both 16 countries overall and 12 countries for the v2 data release. Confirm the latest dataset card for the definitive figure per release.
- Geographic skew: More than half of expert MCQs originate from Nigeria, which may bias specialty distributions and guideline norms. Expansion is planned but not yet reflected in published splits.
- Metrics detail: The paper summarizes model group performance but per‑model scores are not reproduced here. Consult the leaderboard and code for exact metrics and confidence intervals.
- Rating reliability: Inter‑rater agreement statistics are not reported in the summary; variance in harm and localization judgments is unknown.
- Language and modality: Current release is English‑only text; non‑English and multimodal expansions are pending, leaving gaps for high‑priority languages and imaging or audio tasks.
- Training leakage risk: AfriMed‑QA has informed development of models such as MedGemma. Ensure evaluation sets remain unseen during model training or finetuning to avoid contamination (a generic overlap‑check sketch follows).
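A hedged sketch of one common contamination check: look for near‑verbatim benchmark questions inside a training corpus via normalized n‑gram overlap. This is a generic technique, not a procedure described in the AfriMed‑QA paper; thresholds and n‑gram size are assumptions to tune.

```python
# Illustrative n-gram overlap check for training-data contamination.
# Generic technique - not an official AfriMed-QA procedure.
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_question: str, training_doc: str,
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag if a large share of the question's n-grams appear verbatim in a training document."""
    q_grams = ngrams(benchmark_question, n)
    if not q_grams:
        return False
    overlap = len(q_grams & ngrams(training_doc, n)) / len(q_grams)
    return overlap >= threshold
```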
Data appendix: counts and links
- Items: ~15,000 total; 4,000+ MCQs; 1,200+ SAQs; ~10,000 consumer queries.
- Contributors and coverage: 621 contributors; 60+ medical schools; 12 to 16 countries; 32 specialties.
- Evaluation: 30 models evaluated as of May 2025; human ratings on n=3,000 items with a 1 to 5 scale across safety and localization axes.
- Access: Paper | Benchmark Datasets | AfriMed‑QA Evaluation Code | MedGemma | USMLE MedQA