The performance of artificial intelligence models in generating responses to general orthodontic questions: ChatGPT vs Google Bard





Introduction


This study aimed to evaluate and compare the performance of 2 artificial intelligence (AI) models, Chat Generative Pretrained Transformer-3.5 (ChatGPT-3.5; OpenAI, San Francisco, Calif) and Google Bard (Bard Experiment, Google, Mountain View, Calif), in terms of response accuracy, completeness, generation time, and response length when answering general orthodontic questions.


Methods


A team of orthodontic specialists developed a set of 100 questions across 10 orthodontic domains. One author submitted the questions to both ChatGPT and Google Bard. The AI-generated responses from both models were randomly assigned to 2 forms and sent to 5 blinded, independent assessors. The quality of the AI-generated responses was assessed with a newly developed accuracy of information index and a completeness rating. In addition, response generation time and response length were recorded.


Results


The accuracy and completeness of responses were high in both AI models. The median accuracy score was 9 (interquartile range [IQR]: 8-9) for ChatGPT and 8 (IQR: 8-9) for Google Bard (median difference: 1; P <0.001). The median completeness score was similar in both models: 8 (IQR: 8-9) for ChatGPT and 8 (IQR: 7-9) for Google Bard. The odds of accuracy and completeness were 31% and 23% higher, respectively, in ChatGPT than in Google Bard. Google Bard’s response generation time was significantly shorter than that of ChatGPT, by 10.4 seconds per question. However, the 2 models generated responses of similar length.


Conclusions


Responses generated by both ChatGPT and Google Bard to the posed general orthodontic questions were rated as having a high level of accuracy and completeness. However, answers were generally obtained faster with the Google Bard model.


Highlights





  • ChatGPT and Google Bard models generated responses with a high level of accuracy and completeness.



  • ChatGPT and Google Bard models demonstrated relative consistency in their performance.



  • Response generation was faster with Google Bard.



  • ChatGPT and Google Bard models generated responses of similar length.



  • A newly developed accuracy of information index was used in this study.



Artificial intelligence (AI) technologies and their applications have increased exponentially in the last few years. Major components of these AI technologies are machine learning and large language models, which have been employed in various fields, including dentistry. These AI-based technologies play a significant role in improving dental practice by enhancing accuracy in diagnostics, treatment planning, and patient management, leading to increased efficiency of patient care across the different fields of dentistry. Large language model programs, such as AI-powered chatbot models, are based on intelligent human-computer interaction. These chatbot models were designed to simulate conversations with human users over the Internet.


In late 2022, OpenAI (San Francisco, Calif) introduced Chat Generative Pretrained Transformer-3.5 (ChatGPT-3.5) as one of the publicly available AI-powered chatbot models (OpenAI, ChatGPT). The ChatGPT model gained significant popularity shortly after its launch, attracting 1 million users within a few days. Subsequently, OpenAI introduced ChatGPT-4 on March 14, 2023, as an updated, subscription-based model claimed to perform better than ChatGPT-3.5. As AI language models such as ChatGPT gained huge popularity among AI users, another model, Google Bard, was introduced by Google (Bard Experiment, Mountain View, Calif) on March 21, 2023. Both ChatGPT-3.5 and Google Bard are available for public use, with a key distinction between their approaches: ChatGPT relies on preexisting training data, whereas Google Bard uses real-time access to the Internet to incorporate up-to-date information when generating responses.


Cyberspace platforms, such as social media, have the potential to assist patients seeking information related to their health issues by providing access to a variety of medical information, including information about orthodontics. A nationwide survey reported that 57% of patients prefer to turn to the Internet first for health-related information. Furthermore, the proportion of orthodontic patients visiting social media platforms for orthodontic information increased from 6.7% in 2013 to 30% in 2021. There are various reasons for this, such as the need to gather initial information, convenience, real-time accessibility, and privacy. In orthodontics, the use of these platforms to provide educational information can affect the relationship between orthodontists and both active and potential patients.


Many recent studies have extensively examined the capabilities of AI models, such as ChatGPT, to answer general questions and pass certification examinations across various fields, including medicine. However, a notable gap exists in the literature concerning the evaluation of Google Bard’s performance in these contexts. In dentistry, a recent study evaluated the quality of AI-generated responses to questions in the field of oral and maxillofacial surgery (OMFS) and found that the ChatGPT model can be used by patients as a reliable source of information in OMFS, whereas its use for dental clinical training should be approached with caution. OMFS and other medical fields are broad specialties that are interconnected with various other disciplines, making the training data potentially sufficient to address a wide range of related questions in these fields. However, the same may not hold for other fields, such as orthodontics. The potential of ChatGPT to provide accurate information was initially investigated by Professor Kevin O’Brien using 6 straightforward orthodontic questions. However, a thorough assessment of the performance of different AI models in orthodontics is still lacking.


The dimensions of information quality include aspects related to the message (ie, accuracy and completeness) and aspects related to the receiver (ie, number of words). Therefore, this study aimed to assess the performance of 2 AI models (ChatGPT and Google Bard) by evaluating the accuracy and completeness of their responses to orthodontic questions commonly asked by patients, while also considering response generation time and response length.


Material and methods


This study was performed in 3 stages, presented in a diagram of the workflow from inception to completion (Fig 1). In the first stage, a panel of 5 orthodontists collected open-ended questions that are frequently asked by orthodontic patients. A consensus was reached on 100 different questions encompassing a wide range of common topics within orthodontics (Supplementary Table I). These questions were categorized into 10 specific domains (10 questions per domain): age-specific considerations for orthodontic treatment, clear aligners, combined orthodontic-orthognathic surgery treatment, digital orthodontics, expansion, extraction-based treatment, impacted teeth, mini-implants, orthodontic treatment information, and retention.




Fig 1


A diagrammatic representation of the workflow of the study.


In the second stage, a single investigator submitted each question (without modifications) to both AI models, ChatGPT (GPT-3.5; available at: https://chat.openai.com/ ) and Google Bard (available at: https://bard.google.com/?hl=en ), on July 16, 2023. All questions were submitted on the same day using the same laptop (MacBook Air, M1 chip, 8 GB RAM; Apple, Cupertino, Calif), 5G Internet connection, and virtual private network (VPN) server (version 3.9; Astrill Systems Corp, Santa Clara, Calif). The time taken to generate each response after successful submission of the inquiry was recorded using a stopwatch. Afterward, the generated responses from both AI models were collected and randomly allocated using electronic randomization ( https://www.random.org/lists/ ). Simple randomization at the question level was undertaken within each domain. Consequently, responses from ChatGPT and Google Bard were typed into 2 separate forms (A and B), with all words identifying either AI model removed to keep the evaluators blinded. Then, the 200 randomly distributed responses were sent to each of the 5 assessors for further evaluation. At this stage, the total number of words in each response was counted, and the structure of each AI-generated response, such as the inclusion of explanatory figures or tables, was noted.
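
As an illustration of this allocation step, the following is a minimal Python sketch of simple randomization at the question level within each domain; the study itself used the random.org list randomizer, and the domain names, placeholder texts, and form-assignment rule below are hypothetical, not the authors' procedure.

```python
import random

# Hypothetical sketch of question-level simple randomization used for blinding;
# the study used the random.org list randomizer, so this is illustrative only.
random.seed(2023)  # fixed seed so the example is reproducible

domains = ["clear aligners", "expansion"]  # the study used 10 domains
questions_per_domain = 10

form_a, form_b, allocation_key = [], [], []
for domain in domains:
    for q in range(1, questions_per_domain + 1):
        # Two de-identified responses to the same question; the source model is
        # kept only in the allocation key, never shown to the assessors.
        responses = [("ChatGPT", f"{domain} Q{q}: de-identified response"),
                     ("Google Bard", f"{domain} Q{q}: de-identified response")]
        random.shuffle(responses)        # simple randomization at the question level
        form_a.append(responses[0][1])   # one response per question goes to form A
        form_b.append(responses[1][1])   # the other goes to form B
        allocation_key.append((domain, q, responses[0][0], responses[1][0]))

print(len(form_a) + len(form_b), "responses distributed across forms A and B")
```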


In the third stage, 5 independent and blinded evaluators, each with expertise in research and clinical orthodontics, evaluated the accuracy and completeness of the AI-generated responses for both models. The evaluators were chosen based on their diversity in orthodontic training, including the American, British, and Chinese orthodontic schools, which helped to reduce selection and confirmation biases.


It has been reported that ChatGPT may potentially provide different and faster responses when asked the same question again or at different time points, whereas Google Bard generates 3 versions, or drafts, of each response. Consequently, all questions were posed only once, and the initial response was selected for further evaluation. To ensure consistency and standardization, each question was posed consecutively to both ChatGPT and Google Bard, and to minimize the influence of previous responses, the chat window was refreshed for every question. All collected data were entered in Microsoft Excel (Microsoft Corp, Redmond, Wash) for further analysis.


Regarding the quality of the AI-generated responses, the evaluators used a newly developed accuracy of information (AOI) index to assess response accuracy and a 10-point visual analog scale to assess response completeness.


The validation process followed the organized steps for tool validation. A comprehensive search was carried out to retrieve studies assessing the quality of information on orthodontic Web sites. The tools used were subsequently extracted, and their validity for evaluating the quality of information was investigated. We found that the existing tools designed for assessing Web sites pertinent to orthodontic education were suboptimal for evaluating the quality of information. In this regard, previous studies evaluated the quality of information through a subjective assessment of accuracy and completeness without an existing guide, an approach that led to an arbitrary assessment of the information’s accuracy.


The researchers decided to develop the AOI index on the basis of their expertise in medical education while following a web intelligence publication. As such, the AOI index included 5 raw items: factual accuracy, corroboration, consistency, clarity and specificity, and relevance of the response (Table I). Each item was scaled from 0 to 2, representing poor to excellent scores, and the total AOI score was the sum of the 5 item scores. The completeness of information focused on whether the responses were comprehensive and sufficient to address all relevant aspects expected for each question, rated on a 10-point visual analog scale. Thus, the higher the score an examiner assigned, the more accurate and complete the response. If an AI model failed to respond to a question, the response was assigned a score of 0 for both accuracy and completeness; the data associated with these responses were not excluded from the analysis.



Table I

AOI index


Item | Definition | Maximum score
Factual accuracy | The response aligns with known facts, data, or established knowledge on the subject | 2
Corroboration | The response is based on evidence from textbooks, studies, or guidelines | 2
Consistency | The response is internally consistent and does not contain contradictory statements | 2
Clarity and specificity | The response is clear and specific, avoiding vague or ambiguous language | 2
Relevance | The response directly addresses and adheres to the question or topic posed | 2
Total AOI score | The sum of all scores | 10
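
To make the scoring arithmetic in Table I concrete, below is a brief sketch of how a total AOI score can be computed; the item ratings shown are invented for illustration and do not come from the study data.

```python
# Each AOI item in Table I is rated 0 (poor), 1, or 2 (excellent);
# the total AOI score is the sum of the 5 item ratings (maximum 10).
AOI_ITEMS = ("factual accuracy", "corroboration", "consistency",
             "clarity and specificity", "relevance")

def total_aoi(ratings: dict) -> int:
    """Return the total AOI score for one AI-generated response."""
    assert set(ratings) == set(AOI_ITEMS), "all 5 items must be rated"
    assert all(r in (0, 1, 2) for r in ratings.values()), "items are scored 0-2"
    return sum(ratings.values())

# Hypothetical ratings for a single response (illustrative values only)
example = {"factual accuracy": 2, "corroboration": 1, "consistency": 2,
           "clarity and specificity": 2, "relevance": 2}
print(total_aoi(example))  # prints 9
```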


The index was piloted using a random sample of 30% of the questions and compared with the accuracy evaluation of an experienced specialist orthodontist (gold standard) to confirm or modify the weight of each item. A consensus was reached on the actual score of the tool against the gold standard. Subsequently, the evaluators underwent additional training, accompanied by a detailed discussion to address clarifications regarding the AOI index.


Five experts (3 university professors and 2 PhD candidates) participated in the content validity process. According to Lawshe’s method, the minimum acceptable content validity ratio for a 5-member panel is 0.99 for each item; as such, only items with unanimous consensus were kept in the index.
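
For context, Lawshe’s content validity ratio for an item is CVR = (n_e − N/2)/(N/2), where n_e is the number of panelists rating the item essential and N is the panel size; with a 5-member panel, the tabulated critical value of 0.99 is reached only when all 5 experts agree. A minimal sketch of this calculation (the function name is ours, not from the study):

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# With 5 panelists, only unanimous agreement clears the 0.99 critical value.
print(content_validity_ratio(5, 5))  # 1.0 -> item retained
print(content_validity_ratio(4, 5))  # 0.6 -> item dropped
```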


The reliability of the tool was evaluated by comparing the scores of the 5 evaluators using Cronbach’s α as a measure of interrater agreement. Intrarater agreement was assessed by having the same evaluators rescore 30 random questions for both accuracy and completeness after an interval of 4 weeks and was measured with the intraclass correlation coefficient. Cronbach’s α values ranged from 0.83 to 0.88, and intrarater agreement ranged from 0.77 to 0.94.
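
As an aside on the computation, Cronbach’s α for interrater agreement can be obtained from a responses-by-raters score matrix; the sketch below uses the standard formula with invented scores (the study used its own data and, for intrarater agreement, the intraclass correlation coefficient).

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a responses-by-raters matrix:
    alpha = k/(k-1) * (1 - sum of per-rater variances / variance of total scores)."""
    k = scores.shape[1]                           # number of raters (here, 5 evaluators)
    rater_var = scores.var(axis=0, ddof=1).sum()  # sum of each rater's variance
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return k / (k - 1) * (1 - rater_var / total_var)

# Hypothetical accuracy scores: 6 responses (rows) rated by 5 evaluators (columns).
demo = np.array([[9, 8, 9, 9, 8],
                 [8, 8, 7, 8, 8],
                 [9, 9, 9, 8, 9],
                 [7, 8, 7, 7, 8],
                 [8, 9, 8, 9, 9],
                 [9, 9, 8, 9, 8]])
print(round(cronbach_alpha(demo), 2))  # about 0.82 with these invented ratings
```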


Statistical analysis


Descriptive statistics for the accuracy and completeness of each response were calculated as median and interquartile range (IQR), whereas response generation time and word count were presented as mean and standard deviation. Because 5 evaluators assessed each response (with 10 questions per domain), a univariable generalized estimating equation (GEE) model was used to examine the potential association of response accuracy and completeness with the AI model (ChatGPT or Google Bard) and the main domains. A multivariable GEE model was planned to include the significant predictors from the univariable GEE, but this was not feasible. For continuous data without clustering, t tests were used to examine the difference between the AI models when the data were normally distributed; otherwise, the Mann-Whitney U test was applied. The normality assumption was checked using a Q-Q plot and the Shapiro-Wilk test. All analyses were performed using Stata software (version 15.1; StataCorp, College Station, Tex) and R software (version 4.3.0; R Foundation for Statistical Computing, Vienna, Austria), with a 2-sided 5% level of statistical significance. To assess agreement in response generation time and number of words, 20 random questions were resubmitted to both models after 2 months using the same settings, and the Bland-Altman plot was used to measure the agreement.
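
The study’s analyses were run in Stata and R; as a rough Python analog of the approach described above, the sketch below fits a univariable ordinal GEE (ratings clustered within questions, with a global odds-ratio dependence structure, following statsmodels’ documented usage) and applies the Mann-Whitney U test to an unclustered, non-normal outcome. The file name, column names, and exact model specification are assumptions, not the authors’ code.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import mannwhitneyu

# Hypothetical long-format data: one row per evaluator rating of one response.
# Assumed columns: question_id, model ("ChatGPT" or "Google Bard"), domain,
# accuracy (0-10), completeness (0-10), gen_time (seconds), n_words.
df = pd.read_csv("ai_ratings_long.csv")  # placeholder file name

# Univariable ordinal GEE: association of accuracy with the AI model,
# treating the repeated ratings of each question as a cluster.
gee_result = smf.ordinal_gee(
    "accuracy ~ model",
    groups="question_id",
    data=df,
    cov_struct=sm.cov_struct.GlobalOddsRatio("ordinal"),
).fit()
print(gee_result.summary())  # exponentiated coefficients give odds ratios

# Mann-Whitney U test for a non-normally distributed, unclustered outcome
# (one generation time per question per model).
per_question = df.drop_duplicates(["question_id", "model"])
chatgpt_time = per_question.loc[per_question.model == "ChatGPT", "gen_time"]
bard_time = per_question.loc[per_question.model == "Google Bard", "gen_time"]
print(mannwhitneyu(chatgpt_time, bard_time, alternative="two-sided"))
```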


Results


The AI-generated responses from both ChatGPT and Google Bard were evaluated for accuracy, completeness, response generation time, and response length (Tables II and III; Supplementary Table II).



Table II

Accuracy, completeness, generation time, and number of words obtained through responses generated with the ChatGPT and Google Bard models


Variable | ChatGPT, median (IQR) or mean ± SD (95% CI) | Google Bard, median (IQR) or mean ± SD (95% CI) | Difference, MD (95% CI) | P value
Accuracy | 9.0 (8.0-9.0) | 8.0 (8.0-9.0) | 1.00 (0.68-1.30) | <0.001
Completeness | 8.0 (8.0-9.0) | 8.0 (7.0-9.0) | 0.00 (−0.16 to 0.16) | 0.18
Generation time, s | 16.2 ± 5.9 (15.0-17.4) | 5.8 ± 1.3 (5.6-6.1) | 10.40 (9.20-11.60) | <0.001
No. of words | 304.7 ± 85.6 (287.8-321.7) | 311.4 ± 83.5 (294.8-327.9) | −6.60 (−30.20 to 17.00) | 0.580

CI, confidence interval; IQR, interquartile range; MD, mean difference; SD, standard deviation.

Mann-Whitney U test.


t test.

