1. A comparative evaluation tested five publicly available LLMs on 2044 oncology questions covering a comprehensive range of topics in the field. The responses were compared with a human benchmark.
2. Only one of the five models tested performed above the 50th percentile, with worse performance observed on clinical oncology subcategories and on questions involving female-predominant malignancies.
Evidence Rating Level: 2 (Good)
Study Rundown: Many medical professionals have begun to use large language models (LLMs), such as ChatGPT, as augmented search engines for medical information. LLMs have demonstrated high performance on subspecialty medical examinations across multiple medical specialties, but the utility of LLMs in clinical applications of oncology remains unexplored.
Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice clinical oncology questions against a random-guess algorithm and the performance of radiation oncology trainees. The authors assessed the accuracy of the models, their self-appraised confidence, and the consistency of their responses across three independent replicates of each question. The LLMs were asked to provide an answer to each question, a confidence score, and an explanation of the response. Each LLM was evaluated on 2044 unique questions across three independent replicates.
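As an illustration of the kind of query protocol described above, a minimal sketch is given below. This is not the authors' actual code or prompt wording; the ask_llm helper and the prompt template are hypothetical placeholders for whichever model API is being evaluated.

```python
# Minimal sketch of the described protocol: each question is posed three times,
# and the model is asked for an answer, a 1-4 confidence score, and an explanation.
def ask_llm(model: str, prompt: str) -> str:
    """Hypothetical helper: send a prompt to the named model and return its reply."""
    raise NotImplementedError("Replace with a real API call for the chosen model.")

PROMPT_TEMPLATE = (
    "Answer the following multiple-choice oncology question.\n"
    "Return: (1) the letter of your answer, (2) a confidence score from 1 (random guess) "
    "to 4 (maximal confidence), and (3) a brief explanation.\n\nQuestion:\n{question}"
)

def evaluate_question(model: str, question: str, n_replicates: int = 3) -> list[str]:
    """Query the same question across independent replicates, as in the study design."""
    return [
        ask_llm(model, PROMPT_TEMPLATE.format(question=question))
        for _ in range(n_replicates)
    ]
```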
The study found that only one of the five LLMs (GPT-4) scored higher than the 50th percentile when compared with human trainees, despite all models showing high self-appraised confidence. The remaining LLMs had much lower accuracies, some similar to the random-guess strategy. LLMs scored higher on foundational topics and lower on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs. Overall, this study demonstrated a need to assess the safety of implementing LLMs in clinical settings and highlighted the presence of training bias in the form of medical misinformation related to female-predominant malignancies.
Click here to read the study in NEJM AI
Click to read an accompanying editorial in NEJM AI
Relevant Reading: Performance of ChatGPT on a primary FRCA multiple choice question bank
In-Depth [cross-sectional study]: In this study, Rydzewski and colleagues assessed the accuracy of five LLMs (LLaMA, PaLM 2, Claude-v1, GPT-3.5, and GPT-4) on 2044 multiple-choice questions and aimed to identify strategies to help end users recognize reliable LLM outputs. The questions were sourced from the American College of Radiology in-training radiation oncology examinations from 2013-2017, 2020, and 2021. Each question was repeated across three independent replicates. The authors compared LLM performance with a random-guessing strategy and with human scores for the questions sourced from the 2013 and 2014 examinations. The authors also assessed the self-appraised confidence of the LLMs by prompting for a confidence score ranging from 1 to 4, with 1 indicating a random guess and 4 indicating maximal confidence.
The five LLMs had mean accuracies ranging from 25.6% to 68.7%, compared with 25.2% for the random-guess strategy. When compared against humans, only GPT-4 scored higher than the 50th percentile, achieving the 69th and 89th percentiles. The overall performance of the LLMs was positively correlated with their performance on individual topics (Pearson's r = 0.630; p < 0.001). Other than LLaMA 65B, all LLMs performed better on foundational topics (e.g., medical statistics, cancer biology) than on clinical subcategories (p < 0.02). LLMs performed worst on questions involving breast and gynecologic malignancies. All LLMs produced a confidence score of 3 or 4 in more than 94% of responses. Finally, by combining self-assessed confidence and output consistency, the authors achieved accuracies of 81.7% and 81.1% with Claude-v1 and GPT-4, respectively.
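The combined reliability strategy reported above can be illustrated with a short sketch. This is an interpretation of the described approach rather than the authors' code: an output is kept only when the model reports high self-appraised confidence and the replicates agree; the exact thresholds used in the study are not reproduced here and are illustrative.

```python
from collections import Counter

def reliable_answer(replicate_answers: list[str],
                    replicate_confidences: list[int],
                    min_confidence: int = 4,
                    min_agreement: int = 3) -> str | None:
    """Return the consensus answer only when confidence and consistency criteria are met.

    Thresholds are illustrative; the study combined model choice, self-appraised
    confidence, and output consistency, but exact cutoffs are not reproduced here.
    """
    answer, count = Counter(replicate_answers).most_common(1)[0]
    if count >= min_agreement and min(replicate_confidences) >= min_confidence:
        return answer  # treated as a "reliable" output
    return None  # flag for human review instead of trusting the model

# Example: unanimous replicates with maximal confidence are kept; others are flagged.
print(reliable_answer(["B", "B", "B"], [4, 4, 4]))  # -> "B"
print(reliable_answer(["B", "C", "B"], [4, 3, 4]))  # -> None
```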
In conclusion, the authors assessed the ability of five LLMs to answer clinical oncology examination questions. This work demonstrated the need for further LLM safety evaluations before routine clinical implementation and provided insight into a potential strategy for using LLM outputs more reliably.
Image: PD
©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.