The following is a summary of “Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy,” published online in May 2024 in the British Journal of Ophthalmology by Cheong et al.
Researchers conducted a prospective study comparing the efficacy of generative and retrieval-based chatbots in answering patient queries on age-related macular degeneration (AMD) and diabetic retinopathy (DR).
They examined four chatbots: ChatGPT-4, ChatGPT-3.5, and Google Bard (generative models), and OcularBERT (a retrieval-based model). The evaluation was based on each chatbot’s accuracy in responding to 45 questions (15 about AMD, 15 about DR, and 15 others). Retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline), or 0 (poor, with significant inaccuracies). Graders’ scores were aggregated for each response, yielding totals ranging from 0 to 6.
The results showed that ChatGPT-4 and ChatGPT-3.5 performed best, with median (IQR) scores of 6 (1), compared with Google Bard’s 4.5 (2) and OcularBERT’s 2 (1) (all P≤8.4×10⁻³). Under the consensus approach, ChatGPT-4 and ChatGPT-3.5 received ‘Good’ ratings for 83.3% and 86.7% of responses, respectively, outperforming Google Bard (50%) and OcularBERT (10%) (all P≤1.4×10⁻²). Neither ChatGPT-4 nor ChatGPT-3.5 produced any ‘Poor’ rated responses, whereas Google Bard had 6.7% and OcularBERT had 20%. Across question types, ChatGPT-4 outperformed Google Bard only on AMD questions, while ChatGPT-3.5 outperformed Google Bard on DR and other questions.
Investigators concluded that ChatGPT-4 and ChatGPT-3.5 performed best, demonstrating potential for answering domain-specific patient questions. However, further validation studies are needed before these models can be used in real-world settings.
Source: bjo.bmj.com/content/early/2024/05/15/bjo-2023-324533