Frontier LLMs Outperform Specialized Clinical AI Tools in Medical Evaluations
General-purpose large language models (LLMs) outperformed specialized clinical AI tools in medical knowledge, expert alignment, and real-world clinical use, according to a quantitative comparison. Frontier models including Google Gemini, OpenAI GPT, and Anthropic Claude achieved higher accuracy and clinician ratings than domain-specific tools like OpenEvidence and UpToDate Expert AI.
The study compared these systems in three stages: 500 US Medical Licensing Examination-style MedQA questions, 500 HealthBench items, and 100 real clinical queries (RCQ) evaluated by 12 blinded US clinicians. The results placed general-purpose LLMs in a higher performance tier across all dimensions.
How did general-purpose AI perform against clinical tools?
Frontier LLMs consistently beat clinical AI tools on medical knowledge tests. In MedQA testing, Google Gemini reached the highest accuracy at 97.4%, followed by OpenAI GPT at 94.2% and Anthropic Claude at 90.2%.

Clinical tools scored lower, with OpenEvidence achieving 89.6% accuracy and UpToDate reaching 88.4%. The analysis found that Gemini outperformed all other models tested.
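To make these percentages concrete, the short Python sketch below converts each reported MedQA accuracy into an approximate count of correct answers. It assumes every system was scored on all 500 MedQA questions described above. The variable and function names are illustrative and do not come from the study.

```python
# Reported MedQA accuracies (percent), as quoted in this article.
# Assumption: each system answered the full set of 500 questions.
MEDQA_QUESTIONS = 500

reported_accuracy = {
    "Google Gemini": 97.4,
    "OpenAI GPT": 94.2,
    "Anthropic Claude": 90.2,
    "OpenEvidence": 89.6,
    "UpToDate Expert AI": 88.4,
}


def implied_correct(accuracy_percent: float, total: int = MEDQA_QUESTIONS) -> int:
    """Convert a percentage into the nearest whole number of correct answers."""
    return round(accuracy_percent * total / 100)


correct_counts = {
    name: implied_correct(acc) for name, acc in reported_accuracy.items()
}

for name, count in correct_counts.items():
    print(f"{name}: about {count} of {MEDQA_QUESTIONS} correct")

# Gap between the top general-purpose model and the top clinical tool,
# expressed in questions rather than percentage points.
gap = correct_counts["Google Gemini"] - correct_counts["OpenEvidence"]
print(f"Gemini vs. OpenEvidence: about {gap} more questions answered correctly")
```

Under that assumption, the 7.8-point gap between Gemini and OpenEvidence corresponds to roughly 39 of the 500 questions.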
On the HealthBench evaluation, which measures agreement with expert clinicians, GPT scored highest at 88.0. Gemini followed at 79.3 and Claude at 77.0. OpenEvidence and UpToDate scored 62.6 and 61.3, respectively.
Why did specialized clinical AI tools lag behind?
The research suggests that the scale, alignment, and cross-domain reasoning of frontier models may outweigh the benefits of domain-specific tuning. Frontier LLMs benefit from larger training corpora and faster iteration cycles than specialist systems.

The study noted that retrieval-augmented generation (RAG), a technique likely used by OpenEvidence and UpToDate Expert AI, can negatively affect performance if the model retrieves irrelevant material or integrates it poorly.
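The retrieval failure mode can be illustrated with a toy Python sketch. It is not how OpenEvidence or UpToDate Expert AI work, since their internals are not described here. The passages, the word-overlap scoring, and the `min_score` threshold are all hypothetical and chosen only to show how a fixed top-k retriever can pass irrelevant text to a model.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


def overlap_score(query: str, passage: str) -> float:
    """Jaccard word overlap, a crude stand-in for a real relevance model."""
    q, p = tokenize(query), tokenize(passage)
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def retrieve(query: str, passages: list[str], k: int = 2,
             min_score: float = 0.0) -> list[str]:
    """Return up to k passages, best first, that score at least min_score."""
    scored = sorted(
        ((overlap_score(query, p), p) for p in passages), reverse=True
    )
    return [p for score, p in scored[:k] if score >= min_score]


# Hypothetical mini-corpus: one relevant passage, one irrelevant one.
passages = [
    "metformin is a first line treatment for type 2 diabetes",
    "hospital parking is available on level two",
]
query = "first line treatment for type 2 diabetes"

# Plain top-k retrieval returns both passages, including the irrelevant one.
unfiltered = retrieve(query, passages, k=2)

# A relevance threshold keeps only the passage that actually matches.
filtered = retrieve(query, passages, k=2, min_score=0.2)

print(len(unfiltered), len(filtered))
```

In this sketch the unfiltered call hands the model a passage about parking alongside the clinical one. That is the kind of irrelevant context the study suggests can degrade answers.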
Clinician reviews of real-world queries revealed that OpenEvidence scored lowest on clarity, with a mean rating of 2.84 on a 4-point scale. This suggests a weakness in communication rather than a lack of medical knowledge.
Are general-purpose models safer for medical use?
Safety outcomes did not differ significantly across the models tested. According to the study’s findings, no model produced significantly more hallucinations or harmful content than any other.
When evaluating real clinical queries, frontier models formed a top tier with mean aggregate ratings between 3.52 and 3.62. Clinical tools and Google Search AI Overview formed a lower tier, scoring between 3.17 and 3.35.
Google Search AI Overview, an auto-enabled feature, performed similarly to the specialized clinical AI tools in this benchmark.
What happens next for AI in medical practice?
The current advantages of general-purpose models may reflect heavy investment and rapid development. If the returns on scaling these models diminish, domain-specific tuning and clinician-in-the-loop optimization could increase in value.

Future development may shift toward hospital-specific LLMs that use institutional data to reduce external harm. Deeply subspecialized medical tasks may also continue to favor more sophisticated, domain-specific adaptations.
The study suggests that combining frontier models for less-sensitive tasks with institution-grounded frameworks for local workflows could be a path forward.
Frequently Asked Questions
Which AI model showed the highest accuracy in medical knowledge?
Google Gemini achieved the highest accuracy on MedQA questions at 97.4%.
Did specialized clinical AI tools provide safer answers than general LLMs?
The study found no significant difference between the models regarding the production of harmful content or hallucinations.
How did Google Search AI Overview compare to clinical AI tools?
Google Search AI Overview performed similarly to the clinical AI tools in the real-world clinical query benchmark, with both falling in the lower tier of mean ratings (3.17 to 3.35).