Large Language Models Show Comparable Response Performance but Vary in Readability Regarding Patient Questions on Hip Arthroscopy

Purpose

To compare the quality of large language model (LLM) responses to frequently asked questions regarding hip arthroscopy, assess the incorrect response rate of LLMs, and compare the readability among different LLM outputs.

Methods

Three LLMs, including OpenAI Chat Generative Pre-Trained Transformer (ChatGPT) 3.5, Microsoft Co-Pilot, and Google Gemini, were each queried with 10 frequently asked questions regarding hip arthroscopy. Two high-volume hip arthroscopists graded the responses on a 4-point Likert scale (1 = excellent, requiring no clarification; 2 = satisfactory, requiring minimal clarification; 3 = satisfactory, requiring moderate clarification; and 4 = unsatisfactory, requiring substantial clarification). Additionally, the 2 graders ranked the responses from the 3 different LLMs for each of the 10 questions on a 3-point Likert scale (1 = best, 2 = intermediate, 3 = worst). Readability was assessed using the Flesch-Kincaid Grade Level and Flesch Reading Ease metrics.

Results

Commonly used LLMs performed at a similar level of response accuracy and adequacy (mean ± SD: ChatGPT: 3.0 ± 1.0 vs Microsoft: 2.9 ± 1.1 vs Gemini: 2.6 ± 1.1, P = .481). Reviewers had no preference for one LLM’s responses over another (mean ± SD: ChatGPT: 2.0 ± 0.8 vs Microsoft: 2.1 ± 0.9 vs Gemini: 2.0 ± 0.8, P = .931). The overall incorrect response rate among LLMs was 20%. ChatGPT responses were at a significantly more difficult reading level compared to Gemini and Microsoft outputs (Flesch-Kincaid Grade Level mean ± SD: ChatGPT: 11.0 ± 2.2 grade reading level vs Microsoft: 8.6 ± 2.3 vs Gemini: 6.6 ± 2.2, P = .003; Flesch Reading Ease mean ± SD: ChatGPT: 36.6 ± 19.0 vs Microsoft: 57.7 ± 13.3 vs Gemini: 65.0 ± 14.7, P = .001).

Conclusions

Hip arthroscopists find LLM outputs on patient questions regarding hip arthroscopy satisfactory but requiring moderate clarification and show no preference for one LLM’s responses over another. LLMs produce a substantial number of incorrect responses. ChatGPT outputs had a significantly worse reading level compared to those of Microsoft and Gemini.

Clinical Relevance

This study provides insights into the accuracy and readability of LLM-generated responses to commonly asked questions about hip arthroscopy. As patients increasingly turn to artificial intelligence tools for health information, understanding the quality and potential risks of misinformation becomes essential.

Patient access to accurate and comprehensible health care information is crucial to a patient’s ability to participate in shared decision-making with their physicians. Increased usage of the Internet has led to an exponential growth in use of online resources by patients for medical information. Currently, more than two-thirds of patients use the Internet as their main source of health information.

Recently, large language models (LLMs), novel artificial intelligence tools that are trained by deep learning algorithms to process information inputted by users, have emerged as a popular web-based source of health care information. The most popular LLM, OpenAI Chat Generative Pre-Trained Transformer (ChatGPT), has 1 billion monthly users and reached a record 1 million users within 5 days of its launch in November 2022. Consequently, numerous publications within the orthopaedic literature have emerged on its current and future utility to improve patient care.

Since the launch of ChatGPT, a wide array of other web-based LLMs have become publicly available, including Google Gemini, Microsoft Co-Pilot, Meta Llama, and Cohere Command. These LLMs have also gained significant traction, with Google Gemini alone recording 313.3 million monthly visits. With rapid growth and availability of various LLMs for patients, evaluating their comparative performance on answering questions about medical information is imperative.

Relative to other areas of orthopaedic surgery, hip arthroscopy for the treatment of acetabular labral tears secondary to femoroacetabular impingement is a newer field. Patients undergoing hip arthroscopy encounter a wide range of online information about the procedure that varies in readability and accuracy. Recent studies in orthopaedic sports medicine suggest ChatGPT responses are satisfactory in accuracy and adequacy when answering common patient questions on hip arthroscopy, anterior cruciate ligament reconstruction surgery, and shoulder stabilization surgery. However, there is a lack of information on the comparative accuracy and readability of different LLM responses to commonly asked patient questions on hip arthroscopy.

The purposes of this study were to compare the quality of LLM responses to frequently asked questions regarding hip arthroscopy, assess the incorrect response rate of LLMs, and compare the readability among different LLM outputs. We hypothesized that all 3 LLMs would provide accurate but insufficient answers to patient questions regarding hip arthroscopy and that all 3 LLM responses would be at an inappropriate reading level relative to current National Institutes of Health readability guidelines for medical information.

Methods

This study was granted exemption by our institutional review board. In May 2024, the 30 most frequently web-searched questions on hip arthroscopy were sourced using the People Also Ask feature from SearchResponse.io (https://searchresponse.io/people-also-ask). SearchResponse.io provides a comprehensive database that ranks questions on key terms (“hip arthroscopy”) based on popularity, and it has been previously used in the literature. Specifically, the People Also Ask feature uses metadata from more than 100 million questions globally to rank the popularity of a question asked on Google for a specific topic by using the number of search engine result pages on which the question appears. The senior author (S.D.M.) and second reviewer (R.M.W.) narrowed these questions down to the 10 most relevant to patients in their clinical practices.

Three commonly used LLMs were selected for analyses, including ChatGPT 3.5, Microsoft Co-Pilot, and Google Gemini. Selection of the 3 LLMs was based on popularity and accessibility. The 3 LLMs were then each queried in May 2024 by a single author (J.S.M.) with 10 patient-relevant questions on hip arthroscopy, with no follow-up questions or repetition (Appendix Table 1, available at www.arthroscopyjournal.org). Furthermore, to prevent bias of the LLMs from previous questions, the cache was cleared and the browser was closed after each output.

Two fellowship-trained orthopaedic surgeons (S.D.M., R.M.W.) with high-volume hip arthroscopy practices graded the performance (accuracy and adequacy) of responses on a previously created 4-point Likert scale (1 = excellent, requiring no clarification; 2 = satisfactory, requiring minimal clarification; 3 = satisfactory, requiring moderate clarification; and 4 = unsatisfactory, requiring substantial clarification). Accuracy was defined by the factual correctness of the information provided, while adequacy reflected the relevance and completeness of the response in addressing the question. Surgeons (S.D.M., R.M.W.) were blinded to which LLM produced each response. The scale has been adopted for assessing LLM performance, including recent studies evaluating responses to questions on anterior cruciate ligament reconstruction and ulnar collateral ligament injury. Additionally, the 2 hip arthroscopists assessed performance of the 3 different LLMs by ranking their preference of responses for each of the 10 questions on a 3-point Likert scale (1 = best, 2 = intermediate, 3 = worst) (Appendix Table 2, available at www.arthroscopyjournal.org).

The readability of the different LLM outputs was assessed using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) metrics. FKGL and FRE are validated forms of assessing readability developed by the US Navy. Scores are calculated based on word and sentence length. Lower FKGL scores and higher FRE scores are associated with easier-to-read text (Appendix Table 3, available at www.arthroscopyjournal.org). These 2 readability tools were chosen because they are commonly used tools to assess readability of online health education material in the literature.
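As an illustration of how these two scores depend on word and sentence length, the following is a minimal sketch of the standard published FRE and FKGL formulas. The syllable counter and the sentence and word splitting are rough heuristics assumed here for illustration only; the source does not state which calculator the authors used, so scores from this sketch may differ from theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard FRE formula: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula: the result approximates a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for non-empty English text using the heuristics above."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = len(words)
    return (
        flesch_reading_ease(n_words, sentences, syllables),
        flesch_kincaid_grade(n_words, sentences, syllables),
    )


if __name__ == "__main__":
    fre, fkgl = readability("The cat sat. The dog ran.")
    print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
```

Both formulas use the same two ratios (words per sentence and syllables per word), which is why longer sentences and longer words lower FRE and raise FKGL together.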

Statistical Analyses

One-way analysis of variance testing was used to compare multiple means for performance (Mika et al. 4-point Likert scale and Hip Arthroscopist Preference 3-point Likert scale) and readability (FKGL and FRE scores). Post hoc comparison analyses were performed using Tukey’s honest significant difference test to assess intergroup differences in performance and readability. A P value of less than .05 was deemed statistically significant. All statistical analyses were performed using SPSS v25 (IBM Corporation).

Results

ChatGPT 3.5, Microsoft Co-Pilot, and Google Gemini received similar scores for response accuracy and adequacy (mean ± SD: ChatGPT: 3.0 ± 1.0 vs Microsoft: 2.9 ± 1.1 vs Gemini: 2.6 ± 1.1, P = .481) (Table 1). Mean scores for all 3 LLMs were closest to 3, indicating that overall responses were satisfactory but required moderate clarification. Furthermore, reviewers had no preference for one LLM’s responses over another (mean ± SD: ChatGPT: 2.0 ± 0.8 vs Microsoft: 2.1 ± 0.9 vs Gemini: 2.0 ± 0.8, P = .931). Fifty percent of Gemini responses received the highest preference ratings (5/10), while 40% of ChatGPT responses (4/10) and 30% of Co-Pilot responses (3/10) received the highest preference ratings (Table 2). The inter-rater agreement was 33%.

Table 1

Comparison of ChatGPT 3.5, Microsoft Co-Pilot, and Google Gemini Artificial Intelligence Model Accuracy and Adequacy, Hip Arthroscopist Preference, and Readability Level

Variable	OpenAI ChatGPT 3.5, Mean ± SD	Microsoft Co-Pilot, Mean ± SD	Google Gemini, Mean ± SD	P Value
Accuracy and adequacy	3.0 ± 1.0	2.9 ± 1.1	2.6 ± 1.1	.481
Preference ranking	2.0 ± 0.8	2.1 ± 0.9	2.0 ± 0.8	.931
FKGL score	11.0 ± 2.2	8.6 ± 2.3	6.6 ± 2.2	.003
FRE score	36.6 ± 19.0	57.7 ± 13.3	65.0 ± 14.7	.001

FKGL, Flesch-Kincaid Grade Level; FRE, Flesch Reading Ease.

Table 2

ChatGPT 3.5, Microsoft Co-Pilot, and Google Gemini Artificial Intelligence Model Response Ranking by Hip Arthroscopist Preference (1 = Best to 3 = Worst)

Questions (After hip arthroscopy…)	OpenAI ChatGPT 3.5	Microsoft Co-Pilot	Google Gemini
When can I begin driving?	2	2	1
How long will I need to be on crutches?	3	2	1
How long is recovery?	1	2	2
When can I start exercising?	2	2	1
What restrictions do I have after hip arthroscopy?	1	2	2
When can I start having sexual intercourse?	2	1	3
How do I sleep?	1	1	3
How long is my stay?	3	1	2
How do I shower?	3	2	1
How do you sit?	1	3	1
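The share of top-ranked responses per model can be tallied directly from the rankings in Table 2. The short sketch below transcribes the table columns in row order; the dictionary and variable names are illustrative choices, not part of the study.

```python
# Preference rankings from Table 2, in row order (1 = best, 3 = worst).
rankings = {
    "ChatGPT 3.5": [2, 3, 1, 2, 1, 2, 1, 3, 3, 1],
    "Co-Pilot": [2, 2, 2, 2, 2, 1, 1, 1, 2, 3],
    "Gemini": [1, 1, 2, 1, 2, 3, 3, 2, 1, 1],
}

# Count how many of the 10 questions each model was ranked best on.
top_counts = {model: ranks.count(1) for model, ranks in rankings.items()}

for model, n_best in top_counts.items():
    n_questions = len(rankings[model])
    print(f"{model}: {n_best}/{n_questions} ranked best ({n_best / n_questions:.0%})")
```

The counts sum to more than 10 because two questions ("How do I sleep?" and "How do you sit?") have two models tied for the best ranking.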

Twenty percent of responses among all LLMs were determined to be unsatisfactory and requiring substantial clarification (6/30) (Appendix Tables 4 and 5, available at www.arthroscopyjournal.org; Table 3). Thirty percent of ChatGPT (3/10), 20% of Co-Pilot (2/10), and 10% of Gemini (1/10) responses were deemed unsatisfactory (Appendix Tables 4 and 5, available at www.arthroscopyjournal.org; Table 3).

Table 3

ChatGPT 3.5, Microsoft Co-Pilot, and Google Gemini Artificial Intelligence Model Response Performance (1 = Excellent Requiring No Clarification to 4 = Unsatisfactory Requiring Substantial Clarification)

Questions (After hip arthroscopy…)	OpenAI ChatGPT 3.5	Microsoft Co-Pilot	Google Gemini
When can I begin driving?	3.5	2.5	2.5
How long will I need to be on crutches?	4	4	2.5
How long is recovery?	4	4	4
When can I start exercising?	3	3	1.5
What restrictions do I have?	2	3.5	3
When can I start having sexual intercourse?	2.5	1.5	3
How do I sleep?	2	3	3
How long is my stay?	3.5	1.5	1.5
How do I shower?	4	3	3.5
How do you sit?	1.5	3	1.5
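The unsatisfactory response rates reported above can be checked against Table 3 with a short tally. This sketch assumes that a response counts as unsatisfactory when its tabulated score equals 4; that threshold is inferred from the reported counts rather than stated explicitly in the text.

```python
# Performance scores from Table 3, in row order (1 = excellent, 4 = unsatisfactory).
scores = {
    "ChatGPT 3.5": [3.5, 4, 4, 3, 2, 2.5, 2, 3.5, 4, 1.5],
    "Co-Pilot": [2.5, 4, 4, 3, 3.5, 1.5, 3, 1.5, 3, 3],
    "Gemini": [2.5, 2.5, 4, 1.5, 3, 3, 3, 1.5, 3.5, 1.5],
}

UNSATISFACTORY = 4  # assumed cutoff: the tabulated score equals 4

# Number of unsatisfactory responses per model.
unsatisfactory = {
    model: sum(1 for s in vals if s == UNSATISFACTORY)
    for model, vals in scores.items()
}

total_unsatisfactory = sum(unsatisfactory.values())
total_responses = sum(len(vals) for vals in scores.values())
overall_rate = total_unsatisfactory / total_responses

for model, n_bad in unsatisfactory.items():
    print(f"{model}: {n_bad}/{len(scores[model])} unsatisfactory")
print(f"Overall: {total_unsatisfactory}/{total_responses} ({overall_rate:.0%})")
```

Under this assumption the tally matches the reported 3/10, 2/10, and 1/10 per model and 6/30 overall.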