Widely Available Large Language Models Are Not a Reliable Source to Address Medical Treatment Recommendations of Patients After a First-Time Anteroinferior Shoulder Dislocation

Purpose

To assess the ability of ChatGPT 3.5 to aid in the treatment planning process of first-time anteroinferior shoulder dislocation.

Methods

Forty fictional patient cases were created varying in 15 different characteristics, whose distribution was randomized. Six orthopaedic surgeons (3 residents and 3 specialists in shoulder surgery) were then asked to determine the best treatment option for these patient cases. Their answers were compared with the treatment recommendations proposed by ChatGPT in 2 different sessions on the basis of preselected literature. To counteract the wide dispersion of responses, tendencies towards nonoperative, open surgical, or arthroscopic treatment were subsequently defined. The results were then analyzed descriptively.

Results

The mean age of the fictional patients was 44 years (13-80 years), with 57.5% of the patients female. The agreement between the ChatGPT responses in the 2 sessions was 70.0%. In contrast, the 3 residents agreed with each other in 35% of all cases and the 3 specialists agreed in 32.5% of all cases. The ChatGPT responses exactly matched all human assessments in 12.5% of cases. In 65.0% of all cases, the physicians showed similar tendencies in their choice of therapy, resulting in a 55.0% match between ChatGPT and the surgeons.

Conclusions

There was no clear consensus regarding the treatment for first-time anteroinferior dislocations of the shoulder, neither among physicians nor with ChatGPT 3.5. However, ChatGPT 3.5 and physicians showed similar tendencies regarding the treatment in over half of the cases. Because of the inconsistent responses of ChatGPT 3.5, it cannot yet be considered a reliable tool for therapy planning.

Clinical Relevance

ChatGPT 3.5, widely available and free of charge, is increasingly used in clinical settings. However, it’s crucial to highlight its limitations in treatment planning for pathologies, especially when there’s no clear consensus even among experienced surgeons.

There is no clear consensus on the optimal treatment strategy for a first-time anterior shoulder dislocation. A variety of treatments are discussed, ranging from nonoperative to surgical treatment. Specifically, arthroscopic labral repair and open or arthroscopic bony augmentations are addressed in the literature. Various patient-specific factors are weighed before making a decision. Next to imaging, gender and activity level are the main factors leading to the choice of nonoperative or surgical therapy. Men are approximately 3 times more likely to experience an acute shoulder dislocation than women. Structural factors such as the degree of the glenoid defect, the extent of the Hill-Sachs and Bankart lesion, and associated additional soft-tissue injuries are described as factors influencing decisions. In addition, the patient’s treatment goal must be considered, which can range from the ability to use the injured shoulder in daily activities to the expectation of a return to sport at a professional level.

Given the large number of different factors that must be taken into consideration, the treatment decision should usually be made by an experienced orthopaedic surgeon. In the context of today’s digital innovations, the question arises whether this decision-making process can be facilitated by using artificial intelligence (AI). AI-based solutions are currently emerging in almost every industry, with substantial impact in fields such as informatics, technology, and finance.

Different large language models exist, with ChatGPT (OpenAI, San Francisco, CA), Google Gemini (formerly known as Google Bard) (Google LLC, Mountain View, CA), and the Microsoft Bing AI (Microsoft Corporation, Redmond, WA) being among the best known. Like the other programs, ChatGPT is an advanced large language model (LLM) and has been accessible online since November 2022 with a well-known, easy-to-use chat function. “GPT” stands for “Generative Pre-trained Transformer,” meaning that this AI has been pretrained with extensive datasets to carry out its function.

To date, ChatGPT 3.5, the latest freely available version, has not been used as a tool for treatment planning in orthopaedic surgery. Garg et al. have illustrated the use of ChatGPT in medical research. According to these authors, ChatGPT is not designed for treatment planning in general.

The purpose of this study was to assess the ability of ChatGPT 3.5 to aid in the treatment planning process of first-time anteroinferior shoulder dislocation. It was hypothesized that ChatGPT 3.5 as an LLM would be a reliable tool for treatment planning of first-time anteroinferior traumatic shoulder instability.

Methods

General Study Design

Forty fictional patient cases were generated. Those cases were presented to ChatGPT and 6 orthopaedic surgeons. The surgeons formed 2 subgroups of 3 residents and 3 specialists in shoulder surgery. The surgeons and the LLM were presented with each case and were asked to select 1 of 5 given therapy options: nonoperative therapy; open surgical bony augmentation (Latarjet technique or J-span technique); or arthroscopic approach (anterior stabilization with or without remplissage). The general study design is shown in Figure 1.

Fig 1. General study design.

Literature Search

To prevent the LLM from using fake or inaccurate sources, papers were selected through a search of the current scientific literature. The LLM was asked to use only those papers for answering the given questions (see below). The literature search was performed on PubMed using the strings “anterior shoulder dislocation” AND “treatment” AND “algorithm” as well as “primary anterior shoulder dislocation” AND “treatment” to mimic the way in which a physician seeking information for clinical decision making would gather it. The publications found were discussed and included by all 6 physicians. The inclusion criteria for these studies were the following: Level of Evidence I or II; open-access full text; and availability in English.

Use of the LLM

The LLM product and version used in this study was ChatGPT 3.5, as it is widely available and free of charge. Because the LLM version used could only process up to 10,000 characters per request, only the introduction, results, and discussion of the chosen papers were provided. To achieve better comparability between the answers of the surgeons and ChatGPT, the LLM was used twice by 2 independent raters who asked the exact same questions. After obtaining all answers, tendencies toward nonoperative, open surgical, or arthroscopic treatment were determined. A tendency was defined as given if there was a simple majority of responses in the subgroups. The answers and tendencies given by the physicians and the AI were compared within each subgroup and between subgroups.
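The tendency rule described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The option labels and the mapping from the 5 therapy options to the 3 tendency groups are hypothetical names chosen for illustration, and "simple majority" is interpreted here, as an assumption, as more than half of the responses in a subgroup.

```python
from collections import Counter
from typing import Optional

# Hypothetical labels: map each of the 5 answer options to 1 of 3 tendencies.
TENDENCY = {
    "nonoperative": "nonoperative",
    "latarjet": "open surgical",
    "j-span": "open surgical",
    "arthroscopic stabilization": "arthroscopic",
    "arthroscopic stabilization with remplissage": "arthroscopic",
}


def subgroup_tendency(answers: list[str]) -> Optional[str]:
    """Return the tendency chosen by more than half of the raters, else None.

    Assumption: "simple majority" means strictly more than 50% of responses.
    """
    counts = Counter(TENDENCY[answer] for answer in answers)
    label, votes = counts.most_common(1)[0]
    return label if votes > len(answers) / 2 else None


# Two raters chose different open procedures, one chose nonoperative therapy:
# the exact answers differ, but the subgroup still shows an open surgical tendency.
example = subgroup_tendency(["latarjet", "j-span", "nonoperative"])
```

Under this reading, a subgroup of 3 raters has a tendency whenever at least 2 of them fall into the same group, even if none of their exact answers match.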

Design of Patient Cases

Each of the 40 fictional cases contained 15 different characteristics such as age, gender, dominant arm, affected arm, level of physical activity, comorbidities, alcohol and drug use, course of accident, hypermobility, frozen shoulder, shoulder instability, bony defects, Hill-Sachs defects, and rotator cuff tears. The qualities of the characteristics were randomly defined by using the RANDBETWEEN function in Microsoft Excel for Mac (Version 16.91, 2024; Microsoft, Redmond, WA), which generates a random integer in a given range. After a first evaluation of all fictional patient cases, the characteristics were adjusted by the shoulder specialists to depict more realistic cases. Each fictional patient was assigned to an age group: “12-18 years,” “19-40 years,” or “41-80 years.” The final qualities and probabilities are shown in Table 1. To make the patient cases accessible for the LLM and the physicians, the given characteristics per fictional case were summarized in case reports in a standardized way.
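The random assignment of qualities can be approximated outside Excel. The following Python sketch is an illustration only, not the procedure used in the study: it covers a hypothetical subset of the characteristics, uses invented dictionary keys, and omits both the conditional rules and the manual adjustment by the shoulder specialists. Python's `random.randint(a, b)` includes both endpoints, which makes it a close counterpart to `RANDBETWEEN(a, b)`.

```python
import random

# Hypothetical subset of characteristics; levels follow the wording of Table 1.
CHARACTERISTICS = {
    "gender": ["female", "male"],
    "dominant_arm": ["left", "right"],
    "affected_arm": ["left", "right"],
    "course_of_accident": ["traumatic", "atraumatic"],
    "physical_activity": [
        "no physical activity",
        "hobby sport",
        "competitive sport",
        "overhead sport",
    ],
}


def draw_case(rng: random.Random) -> dict:
    """Draw one fictional case with equally likely levels per characteristic."""
    case = {"age": rng.randint(12, 80)}  # age range as listed in Table 1
    for name, levels in CHARACTERISTICS.items():
        # randint is inclusive on both ends, like RANDBETWEEN(1, n) in Excel.
        case[name] = levels[rng.randint(1, len(levels)) - 1]
    return case


rng = random.Random(0)  # fixed seed so the draw is reproducible
cases = [draw_case(rng) for _ in range(40)]
```

With only 40 draws, observed frequencies will usually deviate from the expected rates, which is consistent with the differences between the two rate columns in Table 1.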

Table 1

Incidence Rates and Expected Incidence Rates of All Patient Characteristics

Quality	Incidence Rate, %	Probability/Expected Rate, %	Comment
Gender
Female	57.5	50.0
Male	42.5	50.0
Age
12-80 yr
Dominant arm
Left arm	42.5	50.0
Right arm	57.5	50.0
Course of accident
Traumatic	42.5	50.0
Atraumatic	57.5	50.0
Affected arm
Left arm	50.0	50.0
Right arm	50.0	50.0
Level of physical activity
No physical activity	22.5	25.0
Hobby sport	27.5	25.0
Competitive sport	20.0	25.0
Overhead sport	30.0	25.0
Comorbidity
No comorbidity	57.5	55.6
Diabetes mellitus type I or II	10.0	11.1
Diabetes mellitus type I or II, high blood pressure	12.5	11.1
Diabetes mellitus type I or II, high blood pressure, bleeding tendency	7.5	11.1
Diabetes mellitus type I or II, high blood pressure, bleeding tendency, stroke, or myocardial infarction	12.5	11.1
Alcohol consumption
None	20.0	25.0	Only patients with an age of 16 years or older could have alcohol consumption.
Low alcohol consumption	17.5	25.0
Moderate alcohol consumption	30.0	25.0
High alcohol consumption	27.5	25.0
Drug use
No	52.5	50.0	Only patients with an age of 16 years or older could have drug use.
Yes	47.5	50.0
Hypermobility
No (Beighton-score <4)	50.0	50.0
Yes (Beighton-score ≥4)	50.0	50.0
Frozen shoulder
No	60.0	50.0
Yes	40.0	50.0
Feeling of instability
No	50.0	50.0
Yes	50.0	50.0
Imaging: glenoid defect
No glenoid defect	55.0	50.0
Small glenoid defect	45.0	50.0
Imaging: Hill-Sachs defect
No Hill-Sachs defect	55.0	33.3	Patients without a glenoid defect have no Hill-Sachs defect with a probability of 100%.
Small Hill-Sachs defect	30.0	33.3	Patients with a glenoid defect could have a small Hill-Sachs defect with a probability of 50.0%.
Big Hill-Sachs defect	15.0	33.3
Imaging: rotator-cuff injury
No	50.0	50.0
Yes	50.0	50.0