Abstract
This chapter summarizes the most suitable patient-reported outcome measures (PROMs) for assessing patients with hand and wrist arthroplasties, explains the measurement properties ideally required for a PROM and provides guidance on choosing the appropriate outcome measure for various aims and settings. Furthermore, the challenges of incorporating PROMs into daily clinical practice are described and some tips are given for the interpretation of outcome scores.
Key words
Patient-reported outcome measure – measurement properties – reliability – validity – responsiveness – Michigan Hand Outcomes Questionnaire – Patient-Rated Wrist Evaluation – core set – minimal important change – patient acceptable symptom state
4 Outcome Measurement in Hand and Wrist Arthroplasty
4.1 Introduction
Without measuring treatment outcome, we can neither improve our hand surgical interventions nor demonstrate their effectiveness. Outcome measures help to not only quantify patient benefit but also identify problems and limitations of a particular intervention.
Besides the objective clinical measures such as range of motion, strength, or the evaluation of radiographs, subjective patient-reported outcome measures (PROMs) have become indispensable. The systematic use of information from PROMs leads to better communication and decision making between doctors and patients, and improves patient satisfaction with care. 1
Standardized and validated PROMs are essential to monitor the disease process and to evaluate its outcome as well as the associated socioeconomic consequences. In the modern world with its increasing focus on containment of healthcare costs, measurable outcome data can form the basis for negotiations with health authorities.
4.2 Frequently Used PROMs Suitable for Patients Undergoing Hand Arthroplasty
A wide variety of outcome measures are available for assessing patients with hand disorders. A literature review analyzing studies including patients with osteoarthritis (OA) of the thumb carpometacarpal (CMC) joint, for example, revealed that there were 21 different questionnaires in use. 2 Similar findings have been reported for Dupuytren studies, whereby only 14% used validated PROMs. 3 The diversity in reporting outcomes makes it difficult to compare results among studies, e.g., in meta-analyses. Table 4.1 highlights the most common and validated PROMs suitable for patients with hand or wrist arthroplasty.
4.2.1 Michigan Hand Outcomes Questionnaire (MHQ) 8 /Brief MHQ 16
The MHQ is a 37-item questionnaire, which is divided into six subscales: hand function, activities of daily living (ADL), pain, work performance, aesthetics, and satisfaction with hand function. It takes about 15 minutes to complete and yields results for each hand separately. The total score ranges from 0 to 100, with a higher score indicating better hand performance. The MHQ has been translated and cross-culturally adapted into several languages. Overall, it shows good measurement properties for many hand disorders. 9
In order to reduce responder burden, the brief Michigan Hand Outcomes Questionnaire (brief MHQ) was developed as a shorter version of the original tool with only 12 items. 16 Similar to the original MHQ, the brief version has excellent measurement properties for various hand disorders. 16 However, with the brief version it is not possible to derive subscale scores or to distinguish between the right and left hand. The brief MHQ also yields a summary score between 0 and 100 with higher scores indicating better overall hand function.
More information and questionnaire templates (freely available) can be found here: http://mhq.lab.medicine.umich.edu/
4.2.2 Patient-Rated Wrist Evaluation (PRWE) 4 , 5
The PRWE is a 15-item scale specifically designed for patients with wrist disorders. The 15 items are divided into the two subscales of pain and function. The items are scored on a 0 to 10 numeric rating scale. Both subscales can be calculated independently and a total score combining these constructs can also be calculated. In contrast to the MHQ, a higher total PRWE score (maximum 100) indicates higher levels of pain and disability.
Recently, a decision-tree version has been developed 19 that allows faster completion of the questionnaire. Based on the answer given to a question, the computer selects the most appropriate subsequent question. As a result, the patient only has to answer six questions instead of 15, while the resulting score remains highly similar to that of the original version.
The Patient-Rated Wrist/Hand Evaluation (PRWHE) 20 was introduced later by replacing the term “wrist” with “wrist/hand,” allowing for a broader assessment of hand conditions. The PRWHE has two additional questions on hand aesthetics. Both questionnaires are commonly used in hand studies and their measurement properties are known to be sound for various hand conditions. 6
More information and questionnaire templates (freely available) can be found at the following link: https://srs-mcmaster.ca/research/musculoskeletal-outcome-measures
4.2.3 Disabilities of the Arm, Shoulder, and Hand Questionnaire (DASH) 21 /QuickDASH 22
The DASH is an upper extremity-specific 30-item questionnaire; the QuickDASH is the shortened form with 11 items. They are the most commonly used questionnaires in hand surgery. Several studies attest to their sound measurement properties. 23 , 24 , 25 , 26 , 27 However, the DASH and QuickDASH are intended to measure function of the entire upper extremity and not just the hand. The questionnaires contain items relevant to shoulder and elbow function that significantly influence the total score. Therefore, it is recommended to use a hand-specific questionnaire for the primary evaluation of hand surgical procedures. The DASH might still have its place, for example, in assessing more widespread conditions such as rheumatoid arthritis, where the whole upper extremity is involved and the effect of a single intervention on global function requires evaluation.
More information and questionnaire templates (freely available) can be found here: http://www.dash.iwh.on.ca
4.2.4 Patient-Reported Outcomes Measurement Information System (PROMIS) 28
The PROMIS tool consists of an item bank related to physical, mental, and social health. The items can be administered either as a fixed short form or as a computer adaptive test (CAT). The CAT uses an algorithm based on item response theory that selects successive questions based on the answer to the previous item. There are many tools available for different health conditions. For patients with hand conditions, the PROMIS Upper Extremity (UE) instrument is most relevant. The CAT version includes 46 items and the short form consists of seven items, all answered on a 5-point Likert scale. The result is a T-score, standardized so that the reference population has a mean (± standard deviation) of 50 (± 10).
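To make the T-score metric more tangible, the following minimal sketch shows the linear relationship between a standardized (z) score and a T-score with a mean of 50 and a standard deviation of 10. It illustrates only the metric. Actual PROMIS scores are estimated with item response theory by the official scoring tools, and the function names used here are hypothetical.

```python
def t_score_from_z(z: float) -> float:
    """Rescale a standardized (z) score to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * z


def sd_units_from_reference_mean(t_score: float) -> float:
    """Express a T-score as the number of SDs above or below the reference mean."""
    return (t_score - 50.0) / 10.0


if __name__ == "__main__":
    # Hypothetical patient scoring 35 on the PROMIS UE T-score metric.
    print(sd_units_from_reference_mean(35.0))
```

Read this way, a T-score of 35 lies 1.5 standard deviations below the mean of the reference population.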
The main benefit of the PROMIS lies in its speed of completion. It is reliable and highly correlated to the DASH/QuickDASH. 29 , 30 However, like the DASH, the tool includes items influenced by shoulder function. Responsiveness of the PROMIS UE is considerably lower than that of the MHQ or carpal tunnel questionnaire. 31
More information and questionnaire templates (freely available for individual research) can be found at the following link: http://www.healthmeasures.net/explore-measurement-systems/promis
4.2.5 Patient-Specific Functional Scale (PSFS) 32
The PSFS is an individual outcome measure allowing patients to rate their individual health problems. The patient indicates at least three specific activities that they are unable to do or have difficulty with and rates them on a scale ranging from 0 to 10, where 0 indicates being unable to perform the activity and 10 means the patient is able to perform the activity at the same level as prior to their injury or disorder, i.e., normally for them.
The advantage of this scale is that health issues not considered by traditional questionnaires can be assessed. For example, the PSFS might be useful for patients with unique functional demands such as athletes who would score highly in traditional outcome measures, but still experience problems specific to their discipline. On the other hand, the PSFS may also be used to assess patients with greater activity restrictions who are not able to perform the tasks given in traditional questionnaires.
The disadvantage is that PSFS scores cannot easily be compared across patients and, in particular, across studies. Although the score has been shown to be reliable, there is only a weak correlation with the DASH. Therefore, it is suggested that the PSFS is used as a complementary tool in conjunction with traditional outcome measures. 33 , 34
4.2.6 Single Assessment Numeric Evaluation (SANE) Score 35
The SANE is a global, single-question PROM. Patients indicate their answer on a scale ranging from 0 to 100 based on the question: How would you rate your [e.g., hand] today as a percentage of normal? (100% = normal). Originally developed for patients with shoulder conditions, the SANE has also been used to assess the outcomes of knee surgery. It correlates moderately to well with other shoulder- and knee-specific scores. 36 Although it is a very quick and easy evaluation of the patient’s subjective condition, it cannot replace existing comprehensive questionnaires. By using only one global question, the domains affecting the answer cannot be distinguished. The clinician is unable to conclude if the score is based on pain, function, appearance, or another related factor. Therefore, the SANE is recommended only as a supplementary evaluation. 36
4.2.7 Quality-of-Life Measures
Quantifying quality of life can be used as a secondary outcome measure and is essential for economic evaluations. In such studies, estimating the quality-adjusted life years (QALYs) is required. The most popular questionnaires from which QALYs can be derived include the EuroQol EQ-5D 37 and the Short Form-36 (SF-36) 38 or its brief version, the SF-12. 39
Two versions of the EQ-5D are available, the first of which includes three response options per question (EQ-5D-3L), while the second and more sensitive version consists of five response options (EQ-5D-5L). Each version addresses the five dimensions of mobility, self-care, normal activity, pain/discomfort, and anxiety/depression. Its measurement properties have been investigated widely for different musculoskeletal disorders. 40 The final score ranges from –0.285 to 1.0 (English value set) 41 with higher scores indicating better health status. It is freely available for noncommercial research.
The SF-36 and SF-12 include 36 and 12 questions, respectively, which generate two component summary measures of physical and mental health. The scores range from 0 to 100 with higher scores representing better health, the norm value being a mean of 50 (± 10). These health surveys show sound measurement properties in patients with various musculoskeletal disorders, e.g., in patients with OA and rheumatoid arthritis, 42 distal radius fractures, 43 and carpal tunnel syndrome. 44 License fees apply for the Short Form questionnaires.
More information on the EQ-5D can be found at the following link: https://euroqol.org
More information on the Short Form questionnaires can be found at the following link: https://www.optum.com/solutions/life-sciences/answer-research/patient-insights/sf-health-surveys.html
4.2.8 Further Validated Hand-Specific PROMs
Apart from the PROMs suitable for patients undergoing hand arthroplasty described above, other PROMs are freely available for specific hand conditions.
The Unité Rhumatologique des Affections de la Main (URAM) scale is a 9-item questionnaire specifically designed for patients with Dupuytren’s disease. 45 The Boston Carpal Tunnel Questionnaire (BCTQ) or Levine scale 46 covers the domains relevant for patients with carpal tunnel syndrome. Both PROMs have sound measurement properties and are recommended for the assessment of these specific populations.
4.3 Core Sets
Outcome measures should cover all domains of interest to comprehensively assess the health status of a patient. For example, the dimensions of objective data (clinical measures, radiological criteria), functional outcome (hand- and extremity-specific), and patient-rated subjective data (quality of life, function, and pain) together with socioeconomic data and comorbidities are of interest. This implies the use of many different outcome measure tools, which may contribute to a large administrative burden for both the healthcare provider and patient. If available, core sets assessing clinical and patient-reported outcomes as well as complications are recommended, such as the already established core sets for assessing patients with distal radius fractures 47 or hand OA. 48
4.4 Measurement Properties
Colloquially, a “validated” outcome measure indicates that a tool has sound measurement or psychometric properties of reliability, validity, responsiveness, and interpretability. The COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) group has established the following categories and definitions 49 , 50 , 51 , 52 :
4.4.1 Reliability
Reliability is defined as the degree to which the measurement is free from measurement error and is usually established by test–retest reliability (intraclass correlation coefficient, ICC), internal consistency (Cronbach’s alpha), and measurement error (standard error of measurement, SEM). An ICC of greater than or equal to 0.7 is considered acceptable, with values of 0.8 or higher considered optimal. Cronbach’s alpha values lying between 0.7 and 0.9 indicate good internal consistency; higher values may demonstrate redundancy among the questionnaire items.
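The reliability statistics named above can be illustrated with a short calculation. The following is a minimal sketch, assuming a small, entirely hypothetical data set. It computes Cronbach's alpha from item variances and uses one common formula for the SEM, namely the standard deviation multiplied by the square root of (1 − ICC). The function names are illustrative, and the label for alpha values below 0.7 is an assumption that goes beyond the thresholds stated in the text.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item,
    each with one score per respondent (same respondent order)."""
    k = len(items)
    # Total questionnaire score of each respondent (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_variances = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def standard_error_of_measurement(sd, icc):
    """One common SEM formula: SD * sqrt(1 - ICC)."""
    return sd * sqrt(1 - icc)


def internal_consistency_label(alpha):
    """Apply the 0.7 to 0.9 range given in the text (label below 0.7 is assumed)."""
    if alpha < 0.7:
        return "below the acceptable range"
    if alpha <= 0.9:
        return "good internal consistency"
    return "possible item redundancy"


# Hypothetical scores of five respondents on a three-item questionnaire.
example_items = [
    [2, 4, 6, 8, 10],
    [3, 4, 7, 8, 9],
    [1, 5, 6, 7, 10],
]

if __name__ == "__main__":
    alpha = cronbach_alpha(example_items)
    print(round(alpha, 2), internal_consistency_label(alpha))
    print(standard_error_of_measurement(sd=10.0, icc=0.84))
```

In this invented example, the three items move almost in parallel, so alpha exceeds 0.9 and the items would be flagged as potentially redundant.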
4.4.2 Validity
Validity can be subdivided into three separate components of content, construct, and criterion validity:
Content validity is the degree to which the content of an instrument is an adequate reflection of the construct to be measured. For example, an instrument aiming at assessing obesity should include information on both weight and height and not solely on the weight of a person.
Construct validity measures the degree to which the scores of an instrument are consistent with hypotheses. These hypotheses may include correlations with other tools measuring the same construct or differences between relevant groups. A hypothesis for a pain questionnaire might be that it is highly correlated (i.e., correlation coefficient >0.7) with the pain visual analog scale.
Criterion validity is often confused with construct validity by testing correlations to similar outcome measures. It correctly refers to the degree to which an instrument adequately reflects a “gold standard.” There are no existing gold standard PROMs, except for the long-version form of any short-version questionnaire; for example, the MHQ is the gold standard for the brief MHQ and the DASH for the QuickDASH.
4.4.3 Responsiveness
Responsiveness is defined as the ability of an instrument to detect change over time. In a similar manner to validity testing, responsiveness has to be assessed using a construct approach with the formulation of a priori hypotheses. These hypotheses may include assumptions about the expected effect, which can be analyzed by calculating effect sizes (ES) or the standardized response mean (SRM). Values of 0.2 and 0.5 indicate a small and a medium effect, respectively, whereas values greater than or equal to 0.8 indicate a large effect. 53 Such a hypothesis might be that the instrument under investigation yields an ES of >0.8 or that the ES is higher than that of the comparative instrument.
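The two statistics can be sketched in a few lines. The example below assumes the usual definitions, with the ES being the mean change divided by the standard deviation of the baseline scores and the SRM being the mean change divided by the standard deviation of the change scores. The data are hypothetical, the function names are illustrative, and labeling values between the stated thresholds by the next lower category is an assumption.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def magnitude_label(value):
    """Label an ES or SRM using the 0.2 / 0.5 / 0.8 thresholds."""
    size = abs(value)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


if __name__ == "__main__":
    # Hypothetical scores of three patients before and after surgery.
    baseline = [40, 50, 60]
    follow_up = [55, 60, 80]
    es = effect_size(baseline, follow_up)
    srm = standardized_response_mean(baseline, follow_up)
    print(es, magnitude_label(es))
    print(srm, magnitude_label(srm))
```

An a priori hypothesis such as "the ES exceeds 0.8" can then be checked directly against the computed value.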
4.4.4 Interpretability
The interpretability of a questionnaire is defined as the degree to which qualitative meaning can be ascribed to its quantitative scores. It includes the minimal important change (MIC), the minimal important difference (MID), and floor/ceiling effects (see below).
A detailed summary of the criteria for sound measurement properties is outlined by Prinsen et al. 54
4.5 Choosing an Appropriate Outcome Measure
The selection of an appropriate outcome measure is challenging because the tool must not only focus on the specific domain to be measured (e.g., pain), but also needs to be suitable for the target population. Using an outcome measure designed for patients with Dupuytren’s disease is inappropriate for patients with distal radius fractures. Furthermore, the measurement properties of the instrument have to be suitable for the aim and target population, and it is important that the instrument has been frequently applied and documented in the literature to allow for adequate comparisons to be made with one’s own results. Last but not least, the feasibility of the outcome measure needs to be evaluated for its regular use and integration into daily practice with low administrative burden. Patients should be able to complete the questionnaire quickly and without difficulties. A roadmap for the selection of a suitable outcome measure is outlined in Table 4.2, and a decision tree to find the appropriate PROM for patients with hand or wrist arthroplasties is shown in Fig. 4.1.
4.6 Collecting and Processing Outcome Measures
Having chosen an appropriate outcome measure, the next challenge awaits: the integration into daily business and the processing of the resultant data. It is often forgotten that the standardized measurement of an outcome requires time and money. Prior to data collection, it is important to define what is going to be done with the data. It is our duty to patients to analyze all data provided by them. The collection of data without analysis is only a burden to the patient and is therefore unethical. Therefore, it should be defined at the outset whether the data are going to be used internally for patient monitoring or quality assurance, or whether they are intended for routine documentation in a registry or for a clinical trial.
Data collection requires a multidisciplinary team comprising a clinician, a study nurse, a data manager, and IT staff who are familiar with good clinical practice (GCP) guidelines. If the data are part of a clinical trial, a statistician, monitor, and medical writer may also be necessary.
A professional database that conforms to international laws and regulations as well as protects patients’ data has to be developed a priori. The use of Excel for research purposes is outdated, since data can be easily manipulated or misplaced. The development and maintenance of such a database, e.g., REDCap (www.project-redcap.org) 55 or secuTrial (www.secutrial.com), requires the expertise of a research associate and IT staff.
It is preferable to distribute the questionnaires to patients electronically. If patients are unable to complete electronic questionnaires, a study nurse is required to hand out paper forms for completion and to transfer the patient information to the database, which increases the administrative burden and the potential for transcription errors.
4.7 Interpretation of Outcomes
Traditionally, study outcomes were interpreted based on p-values. If a p-value was below the “magic” threshold of 0.05, the treatment was considered significantly effective. However, as highlighted by the American Statistical Association, the p-value does not measure the magnitude of an effect. 56 It is influenced by the sample size, whereby a small difference in a large study population will most likely yield a significant p-value, although the effect is small.
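The dependence of the p-value on sample size can be illustrated numerically. The sketch below assumes a simple two-sided, two-sample z-test with equal group sizes and a common, known standard deviation. This is a simplification chosen for illustration and not the analysis of any particular study. The 2-point difference and the standard deviation of 20 points are hypothetical values.

```python
from math import erfc, sqrt


def two_sided_p_value(mean_difference, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test (equal group sizes, known common SD)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_difference) / standard_error
    # For a standard normal variable, P(|Z| > z) equals erfc(z / sqrt(2)).
    return erfc(z / sqrt(2.0))


if __name__ == "__main__":
    # Same small between-group difference, evaluated for two study sizes.
    for n in (50, 5000):
        p = two_sided_p_value(mean_difference=2.0, sd=20.0, n_per_group=n)
        print(n, p, "significant" if p < 0.05 else "not significant")
```

With 50 patients per group, the 2-point difference is far from significant, whereas with 5,000 patients per group the same difference falls well below 0.05, although its magnitude and its relevance to the patient are unchanged.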
Therefore, the interpretation of study results based on what is important to the patient has become increasingly popular. 57 , 58 There are several underlying concepts looking at the patient’s perspective of a successful treatment.
4.7.1 Minimal Important Difference (MID) and Minimal Important Change (MIC)
The MID is the smallest difference between patients or groups that is considered important. 51 , 59 , 60
The MIC is the smallest change in score which patients perceive as important. 51 , 59 Changes exceeding this value can be considered relevant for the patient. There are several studies in hand surgery investigating the MIC and MID for several hand conditions and two reviews summarizing available data. 58 , 61
Apart from the MID and MIC, the related term of Minimal Clinically Important Difference (MCID) has also been defined. 62 However, it is suggested to adhere to the established terminology promoted by the COSMIN group, 51 , 59 whereby the MID considers differences between groups or patients and the MIC defines the “within group or patient” differences. For example, in a randomized controlled trial (RCT) investigating two different surgeries in patients with thumb CMC OA, it is important to look at the differences in outcome scores. If the difference at follow-up is higher than a defined MID value, which is 12 points for the brief MHQ, 63 a clinically relevant difference between the two interventions can be assumed.
In an observational study, the MIC helps in the interpretation of a treatment effect between baseline and follow-up. If the change in the brief MHQ, for example, is higher than the defined MIC of 16 points, 63 it can be concluded that the intervention resulted in a subjectively relevant improvement for the patient.
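The two comparisons described above can be written as simple checks. The following sketch uses the brief MHQ thresholds quoted in the text (MID of 12 points, MIC of 16 points). The function names and the example scores are hypothetical, and the code assumes a score on which higher values mean better function, as is the case for the brief MHQ.

```python
BRIEF_MHQ_MID = 12.0  # between-group threshold quoted in the text
BRIEF_MHQ_MIC = 16.0  # within-patient or within-group threshold quoted in the text


def exceeds_mid(mean_score_group_a, mean_score_group_b, mid=BRIEF_MHQ_MID):
    """Between-group comparison at follow-up, e.g., the two arms of an RCT."""
    return abs(mean_score_group_a - mean_score_group_b) > mid


def exceeds_mic(baseline_score, follow_up_score, mic=BRIEF_MHQ_MIC):
    """Within-patient (or within-group) improvement from baseline to follow-up."""
    return (follow_up_score - baseline_score) > mic


if __name__ == "__main__":
    # Hypothetical RCT: mean brief MHQ scores of the two arms at follow-up.
    print(exceeds_mid(78.0, 70.0))
    # Hypothetical patient: brief MHQ before surgery and at follow-up.
    print(exceeds_mic(45.0, 68.0))
```

In this invented example, the 8-point difference between the two arms does not exceed the MID, whereas the patient's 23-point improvement exceeds the MIC.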