Outcome measures need to be both reliable and valid.
Therapists should perform outcome evaluation using procedures that incorporate currently accepted standardized methods, both in the clinical setting and as described in manuals or research papers.
Outcome measures are typically used for evaluation over time, but they can also be used to discriminate between patient groups and predict future status such as return to work.
The implementation of outcome measures within clinical practice requires information that may require purchase or permissions from developers and always requires knowledge about proper scoring and interpretation of scores.
Selecting the appropriate outcome measure starts with determining the purpose and scope of measuring and then requires matching the patient’s problem, level of difficulty, and communication capacity with the properties of the tool.
What Is a Health Outcome Measure?
A health outcome measure is any measurement of a patient’s health status. That view of health status can be broad, such as when we measure overall health or quality of life. We can also focus on very specific aspects of health. Pain and function are specific aspects of health that are of particular interest to hand therapists. Health can change as a result of time, treatment, or disease. Patients’ perceptions of their health status can change because of anatomic and physiologic changes that alter body functions, psychological changes that affect perception or calibration of health, or social changes that alter the experience of living with a specific health status. For this reason, measuring outcomes can be complex and requires a theoretical foundation, as well as different instruments to account for different perspectives and purposes.
The most internationally accepted standard of health is proposed by the World Health Organization. This organization produces both The International Classification of Diseases (ICD) ( www.who.int/classifications/icd/en/ ) and an International Classification of Functioning, Disability and Health (ICF) ( www.who.int/classifications/icf/en/ ). The latter is increasingly being used as a framework by which outcome measures are classified ( Fig. 16-1 ).
Body functions are physiologic or psychological functions of body systems. Body structures are anatomic parts of the body, such as organs, limbs, and their components. Impairment is the loss or abnormality of psychological, physiologic, or anatomic structure or function. Examples of impairments that hand therapists typically measure include hand size, appearance, strength, range of motion (ROM), volume, sensory threshold, and pain. Methods and interpretation for measuring impairments of the hand are the traditional focus of hand therapy and are detailed in many of the chapters in this book discussing evaluation.
Activity is the execution of a task or action by an individual. Participation is involvement in a life situation. Inability in these areas can be termed activity limitations or participation restrictions. Tests that measure performance of specific tasks, such as the TEMPA (Test d’Evaluation des Membres Supérieurs de Personnes Agées), Jebsen Test of Hand Function, Purdue Pegboard, Minnesota Rate of Manipulation Test, the RNK dexterity test, and other “hand function tests,” can measure activity limitations. Activity limitations can also be measured by self-report by asking individuals whether they can perform a specific activity like lifting a grocery bag. Indicators that focus on resuming roles, like returning to work, reflect participation and can be measured by actual status or by self-report. Many self-report functional scales contain both activity and participation types of items. In fact, a number of studies have now classified items of specific upper extremity scales or hand problems using the ICF. However, moving the ICF into practice has been slow. Recently, core and brief core measures were developed for hand conditions using an international evidence-based consensus process. These codes are available for all to use ( www.icf-research-branch.org/research/Hand.htm ).
In this chapter, we discuss principles that apply to all outcome measures, but emphasize self-report because many of the chapters in this book have focused on impairment measures.
How Can Clinicians Use Outcome Measures?
Scores from outcome measures serve three basic functions: evaluation of change over time, discrimination between groups of patients, and prediction of future status. Hand therapy is characterized by the development of advanced evaluative measures of hand impairment. Publication of Clinical Assessment Recommendations was one of the first accomplishments of the American Society of Hand Therapists, and the second edition remained in print for 20 years. This guide has traditionally focused on measuring hand impairments. However, it is increasingly becoming standard practice to include functional measures, particularly those involving self-report. In fact, some payers now mandate this practice. A new expanded version of the Clinical Assessment Recommendations is expected in 2010 and will include self-report measures and information on the ICF.
A standardized outcome measure is one that has specific properties: it is published; there are detailed instructions on how to administer, score, and interpret the test; it has a defined purpose; it was designed for a specific population; and there are published data indicating acceptable reliability and validity. Standardization in clinical measurement is essential to ensure that outcome measures are capable of providing valid information about a patient’s health status.
Evaluation over Time
The most common application in clinical practice is evaluation over time. Optimally, evaluation over time includes using outcome measures to set goals and then determining whether detectable important changes have occurred.
When designing a treatment plan, hand therapists typically determine which pathologic processes or physical impairments are contributing to the patient’s complaint or compromised health. Treatment programs are then designed to allow optimal recovery and to minimize any residual impairment or functional limitation. Before and after intervention, examination is required to determine the effectiveness of the selected intervention. For example, if a therapist evaluates a tendon repair and is concerned about tendon gliding, then active range of motion (AROM) is appropriate to measure. Treatment for this patient might include a variety of interventions that are expected to improve tendon glide. It is essential that hand therapists evaluate impairment measures that are expected to change, such as AROM in this case. However, AROM may not be relevant to function for all patients, problems, or stages of recovery. Therapists should avoid “rote” use of any measure without considering the context and whether the measurement is able to provide useful information. If we think about a flexor tendon patient who achieves improved glide as a result of hand therapy, the goal should be a concomitant improvement in the patient’s ability to perform activities such as gripping a handle and more successful participation in work. These effects can be measured using standardized outcome instruments. However, a review of outcome measures used when assessing patients with tendon and nerve repairs indicates a primary focus on impairments, particularly range of motion.
The first step in using outcome measures to evaluate change over time begins at the initial assessment. The therapist must select an appropriate measure that is both relevant to the patient’s problem and also has capacity to detect change. (See Table 16-1 and companion Web site for examples.) A short-term goal for improvement is then set using evidence from the literature about the minimal detectable change (MDC). Short-term goals are set to exceed the minimal detectable change so that you can be confident that the patient has improved beyond the amount that might occur due to random fluctuation in patient status. Longer-term goals can be set using your clinical experience about reasonable targets for that patient population or surgical procedure, with assistance from published outcome studies that provide scores for different patient subgroups and stages of recovery. Using MDC helps us determine whether patients have changed in subsequent reassessments. Using published outcome data we can determine if patients have met acceptable targets.
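The goal-setting logic described above can be sketched in a few lines of Python. This is a minimal illustration rather than a clinical tool: the function names, the assumption of a disability scale on which lower scores are better, and the MDC of 11 points are all hypothetical choices made for the example. In practice, the MDC should be taken from the published literature for the specific measure and patient population.

```python
def short_term_goal(baseline, mdc, lower_is_better=True):
    """Score threshold that a short-term goal must go beyond to exceed the MDC."""
    return baseline - mdc if lower_is_better else baseline + mdc


def classify_change(baseline, follow_up, mdc, lower_is_better=True):
    """Label a reassessment score relative to the MDC."""
    change = follow_up - baseline
    if lower_is_better:
        change = -change  # after this, a positive value always means improvement
    if change > mdc:
        return "improved"
    if change < -mdc:
        return "worsened"
    return "no detectable change"


# Hypothetical patient: baseline disability score of 60, assumed MDC of 11 points.
threshold = short_term_goal(60, 11)            # the goal must be below this score
first_recheck = classify_change(60, 52, 11)    # 8-point change, within the MDC
second_recheck = classify_change(60, 45, 11)   # 15-point change, beyond the MDC
```

In this sketch a change that does not exceed the MDC is reported as "no detectable change," which mirrors the idea that smaller differences may reflect random fluctuation rather than real improvement.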
|Measure||Purpose||Description||Reliability||Validity||Responsiveness|
|Visual Analogue Scales (VAS): Pain||To measure quantity of pain||10-cm line in a variety of formats; requires careful instruction; score is distance from 0 (no pain) to end (worst pain imaginable); commonly used for pain but has been adapted to other constructs (revalidated)||Can vary, but high test–retest has been demonstrated||High correlation between VAS and numeric pain-rating scale, finger dynamometer, and a verbal description of pain||Able to detect 21 levels of just noticeable differences; some believe that it is less sensitive to change in acute pain than in chronic pain|
|Numeric Rating Scales (NRS): Pain||To measure quantity of pain||0–10 scale for pain, administered verbally or on paper; typically 0 = no pain and 10 = worst pain imaginable; has also been applied to other constructs and used as a rating scale within questionnaires||Not specifically reported but high when used as a subscale in other measures||High correlation with VAS, finger dynamometer, and other outcome measures||High in distal radius fractures as subscale|
|Sickness Impact Profile (SIP)||Health status across demographic and cultural groups||136 items that are divided into two dimensions (physical and psychosocial) and 12 categories||Internal consistency, test–retest and inter-rater reliability (interview) are all high||Construct, concurrent validity reported; has not been widely used for upper extremity||Not as responsive as regional measures in upper extremity conditions|
|(Medical Outcomes) Short-Form SF-36||Health status across demographic and cultural groups||36 items; 8 subscales (physical function, physical role, vitality, bodily pain, general health, social function, emotional role, and mental health); two summary component scales are calculated (physical and mental), which standardize the patient’s score to the U.S. population norms||Has been reported to be high||Numerous validity studies and normative data abundant; instrument well supported by Medical Outcomes Trust; may be preferable to SIP; authors suggest that for upper extremity it should be combined with a more specific instrument||Less responsive than the DASH or PRWE in evaluating wrist fractures; most commonly used generic measure|
|(Medical Outcomes) Short-Form SF-12||Health status across demographic and cultural groups||12 items; two summary component scales are calculated (physical and mental), which standardize the patient’s score to the U.S. population norms||Has been reported to be high||Numerous validity studies and normative data abundant; instrument well supported by Medical Outcomes Trust; may be preferable to SIP||Summary scores less responsive than the DASH or PRWE in evaluating wrist fractures|
|Musculoskeletal Function Assessment (MFA)||Health status instrument for use with a broad range of musculoskeletal disorders; to complement SF-36||100 items; subscales include self-care, sleep/rest, hand/fine motor, mobility, housework, employment/work, leisure/recreation, family relationships, cognition/thinking, emotional adjustment||High||Appropriate correlations with other instruments and other clinical measures; normative and comparison data reported||Moderate to large effect sizes reported|
|Disabilities of the Arm, Shoulder and Hand (DASH)||Upper extremity disability||30 items rated 1–5; majority of questions assess upper extremity function||Construct validity has been demonstrated||More responsive than generic measures for patients with upper extremity pathology; shown to be reliable, valid, and responsive for a variety of upper extremity problems|
|Quick Disabilities of the Arm, Shoulder and Hand (QuickDASH)||Upper extremity disability||11 items rated 1–5; derived from the original DASH; addresses regional symptoms and function of the upper extremity||Reliability has exceeded 0.90 for both the common 0–5 and VAS rating scales||Construct validity has been demonstrated||More responsive than generic measures for patients with upper extremity pathology and equivalent to full DASH|
|Upper Extremity Function Scale (UEFS)||To measure effect of upper extremity disorders on function||8 items scored 0–10||Internal consistency high||Excellent convergent and discriminative validity when compared with measures of symptom severity||More responsive than grip and pinch in CTS|
|Shoulder Pain and Disability Index (SPADI)||Shoulder pain and disability||Pain and disability scored as 50% each; items scored on a visual or numeric analog scale; 5 pain questions, 8 function; a numeric (0–10) version has been shown to be highly correlated with VAS version||High internal consistency and moderate test–retest||Construct and criterion validity have been evaluated||More responsive than regional or generic measures|
|Shoulder Rating Questionnaire||Severity of symptoms and functional status of the shoulder||21 questions in total, including 1 global rating on a VAS and 18 questions in a Likert format; 4 pain, 5 activities of daily living (ADL), 3 recreation, 4 work, 1 satisfaction||High internal consistency and test–retest reliability||Moderate to high validity coefficients compared with Arthritis Impact Measurement Scales||Responsive|
|Western Ontario Rotator Cuff Index (WORC)||Quality of life in patients with rotator cuff pathology||21 questions on visual analog scales; sections on physical symptoms, sports/recreation, work, lifestyle, and emotions||High||Moderate correlation with strength and range of motion in patients with rotator cuff pathology||Manuscript in progress; presented but not published|
|Western Ontario Shoulder Instability Index (WOSI)||Quality of life in patients with shoulder instability||21 questions on VAS; sections on physical symptoms, sports/recreation, work, lifestyle, and emotions||High||Correlated appropriately with other instruments||More responsive than 5 other instruments (DASH, ASES, UCLA, Constant score, and Rowe Rating)|
|American Shoulder and Elbow Surgeons (ASES) Shoulder Form||Patient-rated pain and disability for a wide variety of shoulder problems||1 pain question (VAS); 10 function questions (0–3)||Shown to have high reliability and internal consistency||Construct, content, and discriminative validity demonstrated||Insufficient data available|
|Shoulder Pain Score||For assessing pain in patients with shoulder pathology||7 questions; 5 on a 4-point Likert scale and 1 VAS for global rating of pain||Not published||Factor analysis used to determine that the questions address two factors, considered passive and active situations by the authors||Not published|
|Constant Shoulder Score (patient component)||Used for a variety of shoulder problems, including instability||Pain rated 0–4; function (work, recreation, sleep, position of arm)||Has been validated in a number of studies, although some aspects of its validity have been questioned, such as the appropriateness of age correction, standardization of test methods, and strength rating processes|
|Shoulder Disability Questionnaire||Functional disability in patients with shoulder disorders||16 yes/no questions on pain, related disability||Not published||Responsive in 349 primary care patients with shoulder disorders|
|Subjective Shoulder Rating Scale||To briefly measure subjective shoulder complaints||Multiple-choice question; 1 pain, 1 motion, 1 stability, and 1 activity||Not published||Highly correlated to Constant–Murley score but much faster to complete||Not published|
|Simple Shoulder Test||To assess shoulder function||Yes/no responses; 2 pain, 7 function, 3 motion questions||Discriminant in patients with rotator cuff pathology; appears to be more discriminative than other shoulder measures||Acceptable responsiveness in a number of shoulder studies|
|American Shoulder and Elbow Surgeons (ASES) Elbow Form||To measure pain, disability, and patient satisfaction in patients with elbow pathology||Patient rating scales for pain ranked 0–10 for 5 pain items, 0–3 for 12 function items, and 0–10 for 1 satisfaction question||High||Appropriate (high) correlation with PREE||Not published|
|Patient-Rated Tennis Elbow Evaluation (PRTEE)||To measure pain and disability in patients with lateral epicondylitis||5 pain questions scored 0–10; 10 function questions scored 0–10||Reliability coefficients exceed 0.90||Appropriate (high) correlation with ASES elbow form||More responsive than a variety of other measures, including ones devised for tennis elbow|
|Patient-Rated Elbow Evaluation (PREE)||To measure pain and disability in patients with elbow pathology||5 pain questions scored 0–10; 15 function questions scored 0–10||High||Appropriate (high) correlation with ASES elbow form||Not published|
|Patient-Rated Ulnar Nerve Evaluation||For patients with symptoms of ulnar nerve compression||Pain, sensory/motor scale, and function subscales||Not published||Not published; format similar to previous questionnaires by same author||Not published|
|Patient-Rated Wrist Evaluation (PRWE) and Patient-Rated Wrist/Hand Evaluation (PRWHE)||To measure pain and disability in patients with wrist pathology||Items scored 0–10; 5 pain items; 6 specific function tasks; 4 items on usual ability in personal care, work, household work, and recreation||High test–retest||Content based on expert survey/patient interviews; construct and criterion validity evaluated; has been validated in a variety of wrist and hand conditions||More responsive than DASH or SF-36 for wrist fractures; equally appropriate for wrist or hand pathology and formatted as PRWHE|
|Carpal Tunnel Symptom [CTS] Severity Scale and Functional Scale||To measure severity of symptoms in patients with CTS; to measure functional problems in patients with CTS||5-point Likert score questions in 2 subscales; symptom severity scale has 11 items; function subscale has 8 items||High in original format and a modified Swedish version||Symptom severity scales differentiated between patients with CTS and without CTS; a modified version added 2 items on palmar pain, 8 items on satisfaction, and 4 items on patients’ opinions on satisfaction with surgery; this scale was translated into Swedish and shown to be valid||More responsive than generic measures or impairment scores; responsive in Swedish version; more responsive than the Michigan Hand Questionnaire in patients with CTS; more responsive than other self-report scales; a clinically important difference is indicated by a change of one point|
|Michigan Hand Questionnaire||Health domains in patients with hand disorders||37 items; domains: overall hand function, activities of daily living, pain, work performance, aesthetics, and patient satisfaction||Substantial test–retest reliability||Factor analysis supported subscales; appropriate correlations between subscales and with SF-12; discriminant validity demonstrated||Appropriate responsiveness; responsive in patients with rheumatoid arthritis|
|Patient-Specific Scale||To measure severity of problems on items self-selected by patients||Patients select up to 5 items and rate these on a scale of 0–10||Acceptable; insufficient data for hand conditions||Has been validated for a number of different musculoskeletal conditions; has been associated with time to return to work in injured workers; some concerns around the interpretability of patient-specific measures||Shown to be responsive when compared to the Michigan Hand Questionnaire and DASH for conditions like hand tumor, finger contracture, and CTS; likely to be most responsive in individual patients because the items are ones with which the person has difficulty and that are important to that person|
Discrimination between groups is required when the purpose is to discern different subgroups within a population. For example, the Katz hand diagram discriminates between individuals having carpal tunnel syndrome (CTS) and those who do not. Others have developed a diagnostic scale to assess the probability of CTS. Diagnostic tests are not outcome measures, but rather are designed to differentiate groups (e.g., those having a pathology versus those who do not). In general, measures designed for diagnosis are not useful for evaluating change over time. For example, Phalen’s test is useful for diagnosing CTS but not for assessing treatment effectiveness or outcome. Discrimination can also be performed for purposes other than diagnosis. It can be important to differentiate among clinical subgroups that require different treatment approaches or have a different prognosis. For example, with constructs like readiness or capability to return to work, safety during mobility, or return to home, determining if illiteracy is a factor can help in deciding how best to optimize treatment planning and patient outcomes. Increasingly, we are seeing a move toward differentiating patient subgroups that require different treatment approaches. Scales (or clinical prediction rules) devised for this purpose are an example of discriminative measures.
Finally, outcome measures can be used to predict future outcomes. What will be the final strength 1 year after a fracture? Who will return to work? Who will require surgery for median nerve compression? These are prediction questions that might interest clinicians. When we predict outcomes, we use scores on rating scales at some preliminary stage to predict future scores or outcomes. For example, we demonstrated that patients presenting for conservative management of carpal tunnel syndrome who subsequently proceeded to have a surgical release had higher initial symptom severity scores than those whose conditions were successfully managed conservatively (3.3 vs. 2.9). Similarly, high baseline scores after a distal radius fracture were indicative of patients less likely to return to work at 6 months. In fact, baseline score is commonly a predictor of final status, suggesting that patients presenting with unusually poor scores are usually at higher risk of poor outcomes.
Outcome measures have specific measurement properties that determine how well they function in evaluating change, discriminating, or predicting. These measurement properties can be competing; therefore, an instrument designed for one purpose may not be suited to others. Generally, hand therapists are most interested in evaluative measures. Unfortunately, in some cases, clinicians use measures designed for discrimination as evaluative outcome measures without realizing they may not be appropriate for this purpose. For example, hand diagrams are useful for assessment, but not for evaluating treatment. Similarly, evaluative measures may not be predictive. For example, AROM may be used to evaluate a change in tendon glide over time, but does AROM predict the ability to return to work? We demonstrated that physical impairments were less predictive of time to return to work following distal radius fracture than were self-report measures. Although there is some evidence in the clinical literature on the evaluative aspects of AROM as an outcome measure, we really do not have much evidence on its predictive or discriminative properties. It is clear that it is important to “pick the right tool for the job.”
What Are the Important Measurement Properties of Outcome Measures?
The three measurement properties fundamental to how a tool can be used for clinical measurement are reliability, validity, and responsiveness (ability to detect clinical change over time).
Reliability is the consistency or repeatability of a measurement. Reliability is fundamental to other measurement properties because without stability, the utility of any measure is compromised. However, high reliability, in itself, does not ensure that other measurement properties are also acceptable. Therefore, both the reliability and validity of outcome measures should be documented before clinicians use them to make decisions.
Measurements can be repeated by the same therapist (intrarater), by different therapists (inter-rater), or on different occasions (test–retest). Generally, intrarater reliability is higher than other forms of reliability analysis because the measurement error attributable to differences between testers and occasions is not considered. However, for evaluating patients over time, it is important to know that a measure remains constant over time if the patient remains stable (i.e., test–retest reliability). When we expect to share our measurements with others through clinical assessment notes, progress reports, or research studies, it is important to note that the status we report is consistent with what others would have determined (i.e., inter-rater reliability). Some clinicians mistakenly assume that impairment measures are more reliable than self-report measures because they consider the latter subjective. Certainly, patient perceptions of their status have some random fluctuation based on factors like recent functional demands and mood. However, in general, self-report measures have higher reliability coefficients than many impairment measures.
Furthermore, a variety of factors affect measured impairments like grip strength. These include consistency of instructions and positioning, and interaction effects with the tester, time of day, fatigue, nutritional status, mood, and motivation. Hand therapists should be aware of methods to make their measurements more comparable to those described in the literature and consistent over time. Standardization, including elements like consistent technique, landmarking instructions, positioning, and instrument calibration, is used by hand therapists to reduce measurement error. It has been demonstrated that certain clinical measures such as ROM can be performed reliably by both novice and experienced therapists when a standardized method is used. By using methods described in reliability studies, therapists can be more confident of comparing their results with those of others.
Reliability can be assessed using different statistics. A basic understanding of these statistics is important for clinicians because it helps them comprehend how to use published reliability studies to improve clinical expertise. The intraclass correlation coefficient (ICC) is commonly used in the hand therapy literature to describe the relative reliability (the ratio between variability observed on repeated measurements within individuals compared with the variability between individuals). Reliability coefficients can be compared with benchmarks. Various benchmarks have been proposed. Fleiss suggested that less than 0.40 indicates poor reliability, that 0.40 to 0.75 is moderate, and that greater than 0.75 is excellent reliability. The problem with this approach is that it suggests that reliability is a pass/fail criterion that measures should achieve. However, a better way for hand therapists to think about measurement error is that it exists for all measures and it is more important to understand the extent to which it affects a given assessment. Statistics like the standard error of measurement (SEM), or mean error, allow therapists to view measurement error in more quantitative terms. For example, it has been shown that ROM measurements of elbow flexion and extension vary 3 to 5 degrees on average for the same tester and 5 to 8 degrees for different testers. SEM is important because it can be used to calculate the MDC. Minimal detectable change is a useful target for short-term improvement as it represents the amount of change that is likely to indicate a real change in status. MDC has been established for many self-report measures. Exemplars of how to apply these principles to setting goals and evaluating change are available in the hand therapy literature.
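To make the link between the ICC, the SEM, and the MDC concrete, the following Python sketch works through one commonly used set of formulas. These formulas are an assumption on our part, since the text above does not specify them: SEM is estimated as the standard deviation of scores multiplied by the square root of (1 − ICC), and the MDC at 90% confidence as 1.645 × √2 × SEM. The function names and the example values (a standard deviation of 20 points and an ICC of 0.91) are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def mdc_from_sem(sem, z=1.645):
    """Minimal detectable change, assuming MDC = z * sqrt(2) * SEM.

    z = 1.645 gives a 90% confidence level; z = 1.96 gives 95%.
    The sqrt(2) term reflects error in both of the two measurements compared.
    """
    return z * math.sqrt(2.0) * sem


def fleiss_category(icc):
    """Benchmark label using the Fleiss cut-offs quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "moderate"
    return "excellent"


# Hypothetical questionnaire: between-patient SD of 20 points, ICC of 0.91.
sem = sem_from_icc(20.0, 0.91)      # about 6 points
mdc90 = mdc_from_sem(sem)           # about 14 points
label = fleiss_category(0.91)       # "excellent"
```

Under these assumed values, a patient would need to change by roughly 14 points before the therapist could be reasonably confident that the change was real. The same reliability coefficient therefore yields a far more interpretable number than the benchmark label alone.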
Validity is the extent to which the measure accurately portrays the aspect of health status that it was intended to describe. It can be thought of as the “trueness” of the measure. Validity is difficult to ascertain because in many concepts of interest to hand therapists, such as pain or disability, there is no single or measurable true answer. A measure may be valid for one purpose, but not for other purposes. For example, a general health instrument may be a valid indication of overall health but may not be valid when assessing change in upper extremity function after certain hand injuries. Therefore, validity needs to be assessed by a variety of methods and in various situations. Validity is the cumulative evidence provided to support the use of outcome instruments in specific situations to perform specific analytic functions. For this reason, various forms of validity are recognized.
Content validity is the extent to which a measure represents an adequate sampling of the concept being measured. This can be measured in the development of patient questionnaires by using focus groups or patient surveys to determine which items should contribute to the outcome scale. It can be determined through consensus reviews or expert panels that review existing items. For example, we expect a carpal tunnel instrument to include questions about classic symptoms of CTS, such as numbness, tingling, and waking at night. By looking at the items of the Symptom Severity Scale, we make a judgment that it has content validity.
Construct validity is the extent to which scores obtained agree with the theoretical underpinnings of that scale. Testing constructs derived from the theoretical underpinnings of an instrument requires that relationships be investigated. Does the instrument relate to other instruments the way one would expect? Scales measuring similar constructs should be correlated (convergent validity); whereas dissimilar constructs should not demonstrate a significant relationship (divergent validity). Another type of construct that is tested is evaluating whether subgroups expected to be different based on the theoretical construct or existing evidence demonstrate this difference on the outcome measure being evaluated (known groups validity). For example, do people with more severe fractures score more poorly? Do patients in a nursing home have lower scores than patients living independently?
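Convergent and divergent validity are usually examined with a correlation coefficient. The short Python sketch below computes Pearson's r by hand so that the arithmetic is visible. The choice of Pearson's r is our assumption (rank-based coefficients are also common), and all of the scores are invented for illustration; they do not come from any published study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores for five patients on a new hand function scale.
new_scale = [10, 20, 30, 40, 50]
# An established scale measuring a similar construct (expected to correlate).
similar_scale = [12, 24, 33, 41, 55]
# A measure of a dissimilar construct (expected not to correlate).
dissimilar_measure = [5, 1, 4, 1, 5]

convergent_r = pearson_r(new_scale, similar_scale)
divergent_r = pearson_r(new_scale, dissimilar_measure)
```

With these invented data the convergent correlation is high and the divergent correlation is essentially zero, which is the pattern one would hope to see if the new scale behaves as its underlying theory predicts.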
The process of demonstrating whether an instrument is valid and reliable is ongoing and requires multiple studies to ensure that measurement scales can be applied to different clinical populations and examination needs. Clinicians are often tempted to devise their own instruments or modify existing instruments to make something that is directly applicable to their own clinical situation. This is not advisable as the new instrument is not validated, nor comparable to other scores.
The ability to detect change over time is critical to determining whether patients improve with treatment or deteriorate over time. Responsiveness is the measurement property that reflects this. Numerous studies in the hand therapy literature compare the responsiveness of different self-report and impairment measures. It is important for hand therapists to know about the relative responsiveness of different tools they might use in their practice. If an instrument is not able to pick up change, then insurers, patients, or members of the health-care team may not believe that the treatment efforts are effective. As a first step, therapists should consult the literature to find out about the relative responsiveness of different tools. As a general rule, the more specific the measure is to the problem or condition that is being treated, the more responsive the tool usually is. As an example, the Short-Form 36 (SF-36) is a common and important indicator of general health status. However, it is generally not very responsive in hand conditions. This might be expected when looking at the items since few of them relate to upper extremity function. However, therapists must also consider whether an instrument will be responsive in specific patients. A common example is patients with higher levels of functioning (younger, healthier patients) or higher demands (athletes, musicians, workers) who are unable to perform their normal roles but generally do not have difficulty with the lower-level items of many common functional scales. As a general rule, if a patient scores near the upper or lower range of a score, therapists should think about the potential for “ceiling” or “floor” effects. If the score is at a range where an MDC could not be achieved, then the tool is not appropriate for that patient. A different tool or a patient-specific tool might be indicated in these cases.
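The ceiling and floor check described above reduces to simple arithmetic, sketched here in Python. The function name, the 0 to 100 scale on which lower scores mean less disability, and the MDC of 11 points are hypothetical and chosen only for illustration; the appropriate MDC depends on the instrument and population.

```python
def has_room_to_improve(score, mdc, scale_min, scale_max, lower_is_better=True):
    """True if the score can still improve by more than the MDC on this scale."""
    if lower_is_better:
        room = score - scale_min   # distance to the best possible score
    else:
        room = scale_max - score
    return room > mdc


# Hypothetical high-functioning patient scoring 8 on a 0-100 disability scale:
# only 8 points of improvement are possible, less than the assumed MDC of 11.
athlete_ok = has_room_to_improve(8, 11, 0, 100)

# Hypothetical patient scoring 60 on the same scale: ample room to improve.
typical_ok = has_room_to_improve(60, 11, 0, 100)
```

When the check fails, as it does for the first patient in this sketch, no amount of real improvement could register as a detectable change, so a more demanding or patient-specific measure would be the better choice.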