Examination of the reliability and readability of Chat Generative Pre-trained Transformer’s (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version





Introduction


This study aimed to assess the reliability and readability of Chat Generative Pre-trained Transformer (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version.


Methods


Questions about orthodontics frequently asked by laypeople on Web sites were identified using the Google search tool. These questions were posed to both the March 23 and May 24 versions of ChatGPT, on April 20, 2023, and July 12, 2023, respectively. Responses were assessed for readability and reliability using the Flesch-Kincaid and DISCERN instruments.


Results


In the first and second evaluations, respectively, the mean DISCERN score was 2.96 ± 0.05 and 3.04 ± 0.06 for general questions and 2.38 ± 0.27 and 2.82 ± 0.31 for treatment-related questions; the mean Flesch Reading Ease score was 29.28 ± 8.22 and 25.12 ± 7.39 for general questions and 47.67 ± 10.77 and 41.60 ± 9.54 for treatment-related questions; and the mean Flesch-Kincaid Grade Level was 14.52 ± 1.48 and 14.04 ± 1.25 for general questions and 11.90 ± 2.08 and 11.41 ± 1.88 for treatment-related questions ( P = 0.001).


Conclusions


In the second evaluation, the reliability of the answers to both general questions and treatment-related questions increased. However, in both evaluations, the reliability of the answers was moderate according to the DISCERN tool. In the second evaluation, Flesch Reading Ease scores for both general and treatment-related questions decreased, meaning that the new response texts were more difficult to read. Flesch-Kincaid Grade Level results were at the college graduate level in both evaluations for general questions and at the high school level in both evaluations for treatment-related questions.


Highlights





  • ChatGPT responses showed remarkable improvement even in a short-term evaluation.



  • Readability of the content accessed via ChatGPT is as important as its reliability.



  • ChatGPT provided information about orthodontics without citing references.



In recent years, artificial intelligence (AI) large language models (LLMs) based on machine learning and natural language processing have permeated many areas of science. Accordingly, the relationship between AI systems and dentistry and the medical sciences is increasingly explored in the literature. The capabilities of LLMs are impressive: they can generate fluent and coherent text, answer questions in dialogue, translate languages, and perform many other language-related tasks.


A new AI model called Chat Generative Pre-trained Transformer (ChatGPT) has been created by OpenAI, a San Francisco-based AI research company, and is available in 2 versions: the free ChatGPT and ChatGPT Plus, a paid monthly subscription. ChatGPT has received a lot of attention in the media and scientific communities for its ability to process and respond to prompts in a human-like manner. It is fine-tuned from an LLM trained on large volumes of text data from the internet through reinforcement and supervised learning methods. According to previous research, ChatGPT appeared to be omniscient and could respond quickly and fluently even to odd requests.


Informing patients about treatments can help streamline the treatment process. However, it is not always possible for patients to reach doctors or their staff with questions about their concerns. AI-based chatbots have great potential to inform patients about topics that interest them, and beyond patients and laypeople, the available LLMs are already useful tools for dentists and students alike. However, LLMs do not always fully address both the positive and negative aspects of medical topics, and the fact that they can give wrong answers, produce nonsensical content, and present false information and disinformation as if it were real raises serious concerns in critical areas such as health.


The reliability and quality of Web-based information are very important because they can affect patients’ cooperation and harmonious progress in treatment, as well as their communication with their doctor and doctor-patient trust. Several previous studies have evaluated online content on different health-related topics, such as dental caries information on Web sites, pain information during orthodontic treatment, maxillofacial trauma information on the internet, and the development and evaluation of a reliable patient-based Web site.


Concerns about errors and inaccurate health information have led to the development of testing tools to assess the readability and reliability of health-related written content on the internet. In addition to reliability, the readability of such content should also be evaluated, as the level of readability may limit the usefulness of Web site content. Ease and grade of readability have been explored in both the medical and dental literature.


Health literacy refers to the ability of individuals to read and understand health-related information so that they can make decisions on health-related issues and manage treatment processes. The readability of relevant texts forms the basis of health literacy. Health literacy rates are low throughout the world. The average adult in the United States is reported to read at an eighth-grade level, and up to 20% of the adult population has been reported to have difficulty reading and understanding information written in English. The United States Department of Health and Human Services has recommended that patient-targeted texts not exceed a sixth-grade reading level to promote health literacy. The Flesch Reading Ease tool and the Flesch-Kincaid Grade Level tool are valid and repeatable methods for assessing readability and grade-level comprehension difficulty.


Given that nonprofessionals can access AI programs such as ChatGPT as easily as professionals from anywhere, the reliability of ChatGPT’s answers to health-related questions gains importance. There are various tools to evaluate and rate the quality and reliability of health information available on the internet, such as the LIDA instrument (Minervation Ltd, Oxford, UK), the Journal of the American Medical Association benchmarks, and the DISCERN instrument (Institute of Health Sciences, University of Oxford, UK, www.discern.org.uk ). These tools are freely available and validated for evaluating reliability. The DISCERN tool is a reliable and valid instrument developed for both health care providers and patients for use in assessing the reliability of health-related texts.


To our knowledge, this is the first study to examine ChatGPT’s responses to questions asked by laypeople on orthodontics Web sites. Therefore, this study aimed to evaluate the reliability and readability of ChatGPT’s responses to orthodontics-related questions and the evolution of these responses in an updated version. The null hypotheses of the study were that (1) ChatGPT’s responses to orthodontic frequently asked questions (FAQs) were both easily readable for laypeople and reliable in both evaluations, and (2) there was no difference in the reliability and readability levels between the old version (first evaluation) and the new version (second evaluation) of ChatGPT’s responses.


Material and methods


Ethical approval was not required as human or animal materials were not used in the study.


With the search term “frequently asked questions about orthodontics,” Web sites answering these questions were identified using the Google search tool, and the first 60 Web sites were reviewed for the study. Irrelevant, duplicate, and non-English Web sites were excluded, leaving 41 eligible Web sites for evaluation. A total of 307 questions from these 41 Web sites were collected in a question pool. Repetitive, similar, and irrelevant questions were then excluded, leaving 34 questions. All exclusion procedures were performed by one of the study’s authors, an orthodontist with 21 years of experience.


The questions were divided into 2 sections: general questions and treatment-related questions ( Table I ). Subsequently, each of the 34 FAQs was posed to ChatGPT on both April 20, 2023 (March 23, 2023 version) and July 12, 2023 (May 24, 2023 version), and the answers to each question were recorded. The study’s 2 orthodontist investigators (D.D.K. and D.M.) then scored the responses for reliability using the DISCERN tool. In addition, the readability of each response text was measured by one researcher (D.D.K.) using the Flesch-Kincaid calculator in Microsoft Word for Mac (version 16.66.1 [22101101]; Microsoft, Redmond, Wash). The flowchart of the study is given in Figure 1 .



Table I

Questions used in the study

Questions
General
What is orthodontics?
What is an orthodontist?
What is the difference between an orthodontist and a dentist?
Why should you choose an orthodontic specialist?
What do braces cost?
Treatment-related
What does “Phase I and Phase II” treatment mean?
What are braces?
Are there different types of braces?
How long does it take to get braces off?
How long do people have to wear braces?
Are there any risks/complications associated with braces?
Are there any risks to orthodontic treatment?
Do you have to lose all of your baby teeth before you can get braces?
What is the best age to get braces?
Can adults get braces?
What is the best age to start orthodontic treatment?
Are there any alternatives to braces?
What age groups can benefit from clear aligners?
Invisalign vs metal braces, which is faster?
What are retainers?
How long do you wear a retainer after braces?
Do braces change your face?
Will braces hurt?
How do braces move your teeth?
Can my teeth move after braces?
Do I have to avoid certain foods with braces?
What can you eat with braces?
What is the best way to brush your teeth with braces?
What is the best toothbrush for braces?
When wearing braces, do I have to brush my teeth?
Should I brush my teeth more often with braces?
Do I still need to have a dental checkup every 6 months if I have braces?
Will dental braces interfere with playing sports?
Will braces change the way I talk or activities like singing and playing instruments?



Fig 1


Flow chart of the study.


DISCERN consists of 16 questions addressing the following areas: (1) reliability (questions 1-8), (2) information about treatment (questions 9-15), and (3) a final question (question 16) giving an optional overall quality rating.


Because only general content was evaluated in the general questions section, only DISCERN questions 1-8 were applied to it. Treatment-related questions were assessed with questions 1-15, through which DISCERN also evaluates treatment-related content. The final question (question 16) was not included in this study.
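As an illustration of how item ratings summarize into the scores reported in the Results, a response’s DISCERN score can be computed as the mean of its applicable item ratings. The following is a minimal sketch, assuming each item is rated on DISCERN’s 1-5 scale and that per-response scores are averaged over items; the function name and example ratings are hypothetical, not part of the DISCERN instrument.

```python
# Illustrative sketch: summarizing DISCERN item ratings for one response.
# Assumes each item is rated 1-5 and the per-response score is the mean
# over the applicable items (1-8 for general questions, 1-15 for
# treatment-related questions). Names and ratings are hypothetical.

def discern_score(item_ratings):
    """Mean DISCERN rating over the items applied to one response."""
    if not all(1 <= r <= 5 for r in item_ratings):
        raise ValueError("DISCERN items are rated on a 1-5 scale")
    return sum(item_ratings) / len(item_ratings)

general_score = discern_score([3, 3, 2, 3, 3, 3, 3, 3])    # items 1-8
treatment_score = discern_score([3, 2, 2, 3, 3, 2, 2, 3,
                                 2, 3, 2, 2, 3, 2, 2])     # items 1-15
```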


A blank page was opened in Microsoft Word, and the text to be analyzed was pasted into it. “Spelling and Grammar” and “Readability Statistics” were selected in the Tools tab, and the Flesch Reading Ease and Flesch-Kincaid Grade Level results were reported in this section.


The Flesch Reading Ease score was first used in 1948 to indicate the readability of a text. The score indicates the approximate level of education a person would need to read a particular text easily. In the Flesch Reading Ease score, the clarity of a document is indicated by a number between 0 and 100: scores near 100 mean the document is extremely easy to read, whereas scores near 0 mean it is complex and difficult to understand. Conversion tables are used to convert scores to educational levels. The formula for the Flesch Reading Ease score is as follows:


206.835 − 1.015 × (total words/total sentences) − 84.6 × (total syllables/total words)
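For illustration, this formula can be computed directly from raw counts. The sketch below is a minimal Python version, assuming a naive vowel-group syllable counter; it is an approximation for demonstration and will not exactly match the counts produced by Microsoft Word’s calculator.

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count runs of vowels (approximation)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))
```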


The Flesch-Kincaid Grade Level indicates the education level a person needs to understand a particular text. It is calculated from the numbers of words, sentences, and syllables a document contains. The formula for the Flesch-Kincaid Grade Level is as follows:


0.39 × (total words/total sentences) + 11.8 × (total syllables/total words) − 15.59
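Reusing `re` and `count_syllables` from the sketch above, the grade level follows from the same three counts:

```python
def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words))
            - 15.59)
```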


Flesch-Kincaid scores were interpreted using the information in Table II .



Table II

The interpretation of Flesch-Kincaid scores

Reading Ease score   Reading level           Estimated reading grade level
0-29                 Very difficult          Graduated college
30-49                Difficult               Attended college
50-59                Fairly difficult        High school (10th to 12th grade)
60-69                Standard and/or plain   Eighth to ninth grade
70-79                Fairly easy             Seventh grade
80-89                Easy                    Fifth to sixth grade
90-100               Very easy               Fourth to fifth grade
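Read as code, Table II is a simple threshold lookup from the Reading Ease score to an estimated level. A minimal sketch, with band boundaries taken directly from the table (the function name is our own):

```python
def interpret_reading_ease(score):
    """Map a Flesch Reading Ease score to the Table II reading level."""
    bands = [
        (30, "Very difficult (graduated college)"),
        (50, "Difficult (attended college)"),
        (60, "Fairly difficult (high school, 10th to 12th grade)"),
        (70, "Standard and/or plain (eighth to ninth grade)"),
        (80, "Fairly easy (seventh grade)"),
        (90, "Easy (fifth to sixth grade)"),
        (float("inf"), "Very easy (fourth to fifth grade)"),
    ]
    for upper, label in bands:
        if score < upper:
            return label

interpret_reading_ease(29.28)  # "Very difficult (graduated college)"
                               # general questions, evaluation 1
```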


Statistical analysis


Data were analyzed with SPSS (version 23; IBM, Armonk, NY). Conformity to the normal distribution was evaluated using the Shapiro-Wilk test. The independent 2-sample t test was used to compare normally distributed data between the groups, and the Mann-Whitney U test was used to compare nonnormally distributed data. The paired 2-sample t test was used to compare normally distributed data between the first and second evaluations within groups, and the Wilcoxon signed-rank test was used to compare nonnormally distributed data. Relationships between normally distributed variables were examined with the Pearson correlation coefficient, and relationships between nonnormally distributed variables with Spearman’s rho correlation coefficient. Interobserver agreement was evaluated with the intraclass correlation coefficient. Analysis results were presented as mean ± standard deviation and median (minimum to maximum) for quantitative data. The significance level was taken as P <0.050.
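For readers who want to reproduce this test-selection logic outside SPSS, the sketch below illustrates it with SciPy. The α = 0.05 normality cutoff and the variable names are our assumptions; the intraclass correlation coefficient is not shown because SciPy does not provide it (a package such as pingouin would be needed).

```python
from scipy import stats

ALPHA = 0.05  # assumed normality cutoff for the Shapiro-Wilk test

def compare_groups(a, b):
    """Between-group comparison (general vs treatment-related questions):
    independent t test if both samples pass Shapiro-Wilk, else Mann-Whitney U."""
    if stats.shapiro(a).pvalue > ALPHA and stats.shapiro(b).pvalue > ALPHA:
        return stats.ttest_ind(a, b)
    return stats.mannwhitneyu(a, b)

def compare_evaluations(first, second):
    """Within-group comparison (first vs second evaluation): paired t test
    if the paired differences pass Shapiro-Wilk, else Wilcoxon signed-rank."""
    diffs = [s - f for f, s in zip(first, second)]
    if stats.shapiro(diffs).pvalue > ALPHA:
        return stats.ttest_rel(first, second)
    return stats.wilcoxon(first, second)
```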


Results


Very good statistical agreement was obtained between the researchers’ (D.D.K. and D.M.) DISCERN scores in the first evaluation (intraclass correlation = 0.926; P <0.001), and good agreement was obtained between their DISCERN scores in the second evaluation (intraclass correlation = 0.748; P <0.001). Because there was statistically acceptable agreement between the 2 researchers’ DISCERN evaluations, the average of their DISCERN scores was used as the data for the first and second evaluations.


There was a statistically significant difference between the groups in the median first-evaluation DISCERN values ( P = 0.001), whereas the median second-evaluation DISCERN values did not differ between general questions and treatment-related questions ( P = 0.098). The mean Flesch Reading Ease scores differed between the groups in both the first and second evaluations ( P = 0.001). The mean Flesch-Kincaid Grade Level values also differed between general questions and treatment-related questions in the first evaluation ( P = 0.011) and in the second evaluation ( P = 0.005) ( Table III ).



Table III

Comparisons by groups

Evaluation                                  General questions                    Treatment-related questions          Test statistic   P value
                                            Mean ± SD      Median (min-max)      Mean ± SD      Median (min-max)
DISCERN, evaluation 1                       2.96 ± 0.05    3.00 (2.88-3.00)      2.38 ± 0.27    2.40 (1.70-2.83)      145.000          0.001
DISCERN, evaluation 2                       3.04 ± 0.06    3.00 (3.00-3.13)      2.82 ± 0.31    2.77 (2.20-3.43)      107.000          0.098
Flesch Reading Ease score, evaluation 1     29.28 ± 8.22   27.10 (21.30-42.50)   47.67 ± 10.77  48.40 (16.50-67.30)   −3.622           0.001
Flesch Reading Ease score, evaluation 2     25.12 ± 7.39   22.80 (17.50-37.20)   41.60 ± 9.54   39.80 (23.00-64.30)   −3.659           0.001
Flesch-Kincaid Grade Level, evaluation 1    14.52 ± 1.48   14.70 (12.30-16.30)   11.90 ± 2.08   11.80 (6.90-18.20)    2.685            0.011
Flesch-Kincaid Grade Level, evaluation 2    14.04 ± 1.25   13.90 (12.30-15.70)   11.41 ± 1.88   11.30 (7.70-15.70)    2.986            0.005
