Large Language Models in Medicine: A Comprehensive Review of Recent Advances and Applications

Recent research has unveiled a new generation of large language models (LLMs) that operate at the scale of entire health systems, acting as versatile prediction engines for a range of clinical tasks (Jiang et al., 2023). Parallel work demonstrated that these models inherently encode medical knowledge, enabling them to answer clinical questions without task‑specific fine‑tuning (Singhal et al., 2023).

Foundational Models Extend Into Specialty Domains

Building on system‑wide capabilities, researchers have crafted foundation models for narrow specialties. A general‑purpose model for computational pathology was introduced, allowing the generation of pathology reports directly from slide images (Chen et al., 2024). Similarly, a multimodal generative AI “copilot” was shown to assist human pathologists by integrating visual and textual cues (Lu et al., 2024).

Clinical Pilots Demonstrate Real‑World Impact

In mental health, a personalized self‑referral chatbot reduced barriers to care by guiding patients through intake steps (Habicht et al., 2024). Outpatient reception was streamlined when nurses collaborated with an LLM, improving triage efficiency in a randomized trial (Wan et al., 2024). A clinician‑centered drug‑repurposing model suggested novel therapeutic candidates by reasoning over biomedical literature (Huang et al., 2024). Integrated image‑based deep learning combined with language models also enhanced primary diabetes management, delivering actionable insights from retinal scans (Li et al., 2024).

Performance Gains and Evaluation Frameworks

Adapted LLMs have outperformed human experts in clinical text summarization, producing concise and accurate discharge notes (Van Veen et al., 2024). A dedicated evaluation framework now benchmarks LLMs on patient‑interaction tasks such as counseling and instruction delivery (Johri et al., 2025). A generalist medical LLM released in 2025 offered disease‑diagnosis assistance across a broad spectrum of conditions (Liu et al., 2025).

Did You Know? The first health‑system‑scale language model was reported in Nature in 2023 as an “all‑purpose prediction engine,” marking a shift from narrow AI tools to platform‑level intelligence in medicine.

Challenges and Ethical Considerations

Despite impressive gains, studies have highlighted limitations in clinical decision‑making, noting that LLMs can propagate errors if not properly mitigated (Hager et al., 2024). Ethical analyses stress that accuracy, rather than abstract notions of fairness, must guide AI deployment in diagnostics (Sabuncu et al., 2024). Research into model “sycophancy” reveals susceptibility to misleading prompts, prompting calls for robust defense strategies (Rrv et al., 2024).

Expert Insight: Samantha Carter, senior health‑technology editor, notes that while LLMs are rapidly approaching expert‑level performance, their integration will require layered oversight—combining human expertise, rigorous validation, and transparent prompt engineering—to ensure patient safety and equitable outcomes.

Looking Ahead: Toward Integrated, Adaptive AI

Future scenarios envision LLMs operating within retrieval‑augmented generation pipelines, pulling up‑to‑date evidence from electronic health records to support real‑time reasoning (Lewis et al., 2020). Multimodal models that fuse imaging, genomics, and narrative data could enable “digital twins” of patient health trajectories (Makarov et al., 2025). AI agents trained with reinforcement learning may soon assist clinicians in therapeutic planning across complex drug‑combination spaces (Gao et al., 2024).

Frequently Asked Questions

What are health system‑scale language models?

They are large AI models trained on massive, system‑wide clinical data sets and designed to perform many predictive and reasoning tasks across a health network, effectively serving as a universal prediction engine (Jiang et al., 2023).

How are LLMs currently being used in clinical settings?

Examples include mental‑health self‑referral chatbots, nurse‑LLM collaboration for outpatient triage, drug‑repurposing suggestions, pathology report generation, and diabetes care assistance that combines image analysis with language understanding (Habicht et al., 2024; Wan et al., 2024; Huang et al., 2024; Li et al., 2024).

What are the main limitations of using LLMs in medicine?

Research shows that LLMs can make inaccurate clinical decisions, be vulnerable to misleading prompts, and raise ethical concerns about accuracy versus fairness. Ongoing work focuses on mitigation strategies, evaluation frameworks, and transparent prompt engineering (Hager et al., 2024; Sabuncu et al., 2024; Rrv et al., 2024).

How do you think these emerging AI tools will reshape patient care in the next few years?