Effectiveness Evaluation of the Shoulder




Evaluating the effectiveness of medical treatment is an important component of the healthcare system. Patients and their families, healthcare providers, administrators, and governments and shareholders within the healthcare system need to be informed as to whether medical treatments are effective. This process is called “effectiveness evaluation,” and relies on the accurate assessment of a patient’s level of disability that is related to an injury or disease process. Assessing a patient’s level of disability before and following treatment provides a means to objectively evaluate the effectiveness of an intervention. Over the last 20 to 30 years, evidence-based medicine has become recognized as an integral part of medical and surgical practice. Accordingly, there has been increased interest in the objective assessment of health outcomes, including those following treatment of shoulder disorders.


The methodology of outcome assessment has evolved in recent decades. However, the ideal outcome instrument for the shoulder remains elusive. A recent review of all shoulder-related publications in the Journal of Bone and Joint Surgery (JBJS) between the years 2004 and 2014 concluded the following: “A consensus is needed in shoulder research for more consistent application of validated patient reported outcome measurement tools.” This chapter details the history of outcome assessment of the shoulder and the methodology used to assess outcome measures. This chapter also provides a critical analysis of the outcome measures currently available and recommendations regarding the proper use of outcome measures in clinical practice. These concepts are part of the greater field of study known as “psychometrics,” which measures the abilities, knowledge, attitudes, and psychological disposition of a patient.


History


Emery Codman, an orthopedic shoulder surgeon from Boston, is credited with introducing the concept of outcome assessment to the medical community in the early 1900s. He advocated the “end result idea,” wherein the clinician would critically evaluate the surgical results of all cases to identify and understand treatment failures with the goal of improving the care of future patients. This concept was adopted throughout healthcare and is the foundation of evidence-based medicine. For nearly a century, the effectiveness of intervention was based on objective clinical assessments such as range-of-motion (ROM), stability, and deformity, combined with the radiographic assessment of other variables such as fracture union and prosthetic position. While initially considered the best method to evaluate a patient’s outcome after treatment, such observer-based methodology ignores the patient’s perception of his or her condition and fails to address the larger concept of the patient’s general health and the role that the shoulder disorder plays in the well-being of the individual.


Several studies have demonstrated that objective assessment of shoulder function can be inconsistent, with large margins of error. Hayes et al. assessed six different shoulder motions using visual estimation, goniometry, still photography, and reach and found that the standard error of measurement ranged from 14 degrees to 25 degrees between raters and from 11 degrees to 23 degrees for the same rater across different time points. Terwee et al. found that the interobserver agreement was low for assessing active and passive elevations of the shoulder, particularly in patients who have severe pain and disability. In a prospective study, Rudiger et al. found a poor correlation between patients and surgeons for objective measurements of shoulder mobility. Dowrick et al. compared self-reported and independently observed disability levels in an orthopedic trauma population. They found that the disability-level ratings varied greatly and that observers consistently rated the disability levels lower than participants (i.e., patients). Furthermore, Hickey et al. found that experienced manipulative physiotherapists had difficulty determining the symptomatic status of patients using observational motion analysis.


As the awareness regarding the limitations of observer-based measurements has increased in the last 20 to 25 years, patient-reported (i.e., self-reported) outcome techniques and tools have emerged. Patient-based general health outcome measures have been used to determine the effect of shoulder conditions on the patient’s well-being; more specific instruments (e.g., shoulder specific and those for the entire upper extremity) have also been developed. Cook et al. assessed four patient-based shoulder outcome measures and found them to exhibit good internal consistency across surgical status. In contrast, the observer-based systems assessed in this study revealed inconsistencies across surgical status.


The typical method used for patient-based, self-reported outcomes is a questionnaire. A new method of assessment is emerging, known as Computer Adaptive Testing (CAT). This tool has been used for many types of nonmedical examinations but can also be used for medical questionnaires and functional outcome assessments. A computer asks the questions, and each successive question adapts to the responses that have come before it so that the most relevant next question is generated. This format enables shorter and more relevant testing; however, it is new to modern medicine, and outcome instruments that use it are still in development and have yet to be widely implemented.




Development of Outcome Measures and Tools


In the specialty of orthopedic surgery, health-related quality of life (HRQoL) is the main outcome measure of interest because orthopedic interventions for the shoulder are designed to improve the quality of life rather than prolong the lifespan. To understand HRQoL, one must understand the level of disability that a particular condition causes and appreciate the change in disability and function that a treatment renders.


Measuring HRQoL requires asking the right questions and performing the appropriate tests; therefore selecting the appropriate questions and tests is an important and rigorous process. The development of an outcome instrument involves the following steps:



  1. Identification of a specific patient population

  2. Generation of items (questions)

  3. Item (question) reduction

  4. Pretesting of the outcome instrument

  5. Determination of the instrument’s measurement properties (validity, reliability, and responsiveness)





Assessing an Outcome Measure


Assessing the measurement properties of different instruments is necessary to ensure that the clinical outcome of interest is accurately appraised when using a particular instrument. The properties of an outcome measure that are most important to clinicians include the validity, reliability, and responsiveness; such properties enable the objective assessment of the quality of an outcome measure. Given the plethora of outcome measures available to clinicians, the use of an outcome measure for which data are available is advisable. Background is provided here on the psychometric principles used to compare different outcome instruments published in the literature.


Validity


An instrument is considered valid if it measures what it was intended to measure. Several different terms are used to describe validity. Face validity assumes an instrument is valid if it appears to be measuring what it is intended to measure; this form of validity is the simplest and yet the weakest method of demonstrating validity. Content validity is satisfied when the items that constitute a measure are inclusive of all aspects of the condition of interest. Construct validity is the degree to which a theoretical construct (e.g., shoulder function) is measured by an instrument. For example, because direct measurement of all aspects of shoulder function is not possible, we must also measure other variables associated with shoulder function, such as pain and ROM. Testing construct validity builds confidence in an outcome measure. Validity can be tested by comparing data from a patient-derived assessment with data from a physician-derived assessment of impairment severity, level of pain, ability to perform normal activities of daily living (ADLs), and responses on other contemporary patient-completed questionnaires. Two measures of construct validity are convergent validity and discriminant validity. Convergent validity refers to the extent to which different ways of measuring the same attribute correlate with one another. In other words, convergent validity determines if one outcome measure correlates with another outcome measure that claims to measure the same attribute (e.g., shoulder function). In discriminant (or divergent) validity, a measure does not correlate with dissimilar scales or dimensions of a scale. Furthermore, discriminant validity is the ability of an outcome measure to discriminate across different levels of severity in a patient population. Various statistical measures, including Pearson’s product moment correlation coefficient, are used to determine validity.
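As a rough illustration of how convergent validity might be checked, the sketch below computes Pearson's product moment correlation coefficient between two sets of scores. The function name and all score values are hypothetical and are not taken from any study cited in this chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares about the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for five patients on a new shoulder instrument and on
# an established instrument that claims to measure the same attribute.
new_instrument = [35, 50, 62, 70, 88]
established = [30, 55, 60, 75, 85]

print(round(pearson_r(new_instrument, established), 3))
```

A coefficient close to 1 would support convergent validity; a coefficient near 0 against a scale measuring a dissimilar attribute would support discriminant validity.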


Reliability


Reliability is the extent to which an instrument yields the same results following repeated administration in a population with stable health status. Shoulder outcome instruments should be sufficiently reliable such that the score derived from the use of an outcome measure does not change, even if the questionnaire is completed on different occasions, provided there has been no change in the patient’s clinical status. The term “intrarater reliability” (also known as test-retest reliability and intraobserver reliability) refers to the ability of an instrument to yield the same result (i.e., the agreement between scores) when a single rater or observer repeatedly administers the test to the same patients with stable health conditions. “Interrater reliability” (also known as interobserver reliability) refers to the ability of an instrument to yield the same result when administered to the same patient by two or more raters or observers when no change in the health status of the patient has occurred.


Reliability is assessed in the following manner: (1) recruit a cohort of patients appropriate to the outcome measure being examined, (2) administer the outcome measure, and (3) repeat the process at a predetermined time interval. At the time of the second testing, patients are questioned as to whether their condition has changed since the first test session; if so, they are excluded from the analysis because their score would be expected to change. Reliability is determined using one of the several statistical techniques such as one-way intraclass correlation coefficient, two-way analysis of variance to determine the intraclass correlation coefficient, and Spearman rank correlation coefficients. Furthermore, clinicians can assess the percentage of patients with identical scores and the percentage of patients who have moved between response categories between testing sessions.
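The test-retest procedure above can be sketched in a few lines. This is a minimal example, assuming a one-way random-effects intraclass correlation coefficient for single measurements; the record layout (`test`, `retest`, `changed`) and the scores are hypothetical.

```python
def stable_pairs(records):
    """Keep only patients who report no change in condition between sessions."""
    return [[r["test"], r["retest"]] for r in records if not r["changed"]]

def icc_one_way(scores):
    """One-way random-effects ICC for single measurements.

    scores: list of per-patient lists, each holding k repeated scores.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least 2 patients")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every patient needs the same number (>= 2) of scores")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Mean square between patients and mean square within patients
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical test and retest scores; the third patient reports a change
# in condition and is therefore excluded from the analysis.
records = [
    {"test": 42, "retest": 45, "changed": False},
    {"test": 60, "retest": 58, "changed": False},
    {"test": 55, "retest": 80, "changed": True},
    {"test": 75, "retest": 77, "changed": False},
]
print(round(icc_one_way(stable_pairs(records)), 3))
```

Values close to 1 indicate that most of the score variance lies between patients rather than between sessions.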


Responsiveness


Responsiveness to change is the ability of an instrument to detect true changes in a patient’s status (i.e., differences between pretreatment and posttreatment scores) that exceed random variability. Some authors believe that responsiveness may be the most important property for evaluating a health status instrument. Previous knowledge regarding an instrument’s responsiveness aids in the selection of an appropriate outcome measure and also permits estimation of the sample size in clinical studies that use the instrument to ensure that adequate statistical power is achieved.


To determine responsiveness, the difference between the pretreatment and posttreatment scores (i.e., change score) is determined. For example, patients with advanced rotator cuff disease or glenohumeral joint osteoarthritis could be compared before and after rotator cuff repair or total shoulder arthroplasty, respectively. A statistical measure called the standardized response mean (mean change score divided by the standard deviation of the change score) can be used to assess the relative responsiveness of an outcome measure. The standardized response mean transforms the change score into a standard unit of measurement, allowing comparison between different instruments. The standardized response mean is frequently used to assess the relative responsiveness of health status instruments used for determining the impact of musculoskeletal conditions. A higher standardized response mean indicates greater sensitivity to clinical change. As previously discussed, the responsiveness of an instrument is an important consideration in research study design because it can serve to minimize the sample size.
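The standardized response mean defined above (mean change score divided by the standard deviation of the change score) can be computed as follows. This is a minimal sketch; it assumes the sample standard deviation is used, and the pretreatment and posttreatment scores are hypothetical.

```python
from statistics import mean, stdev

def standardized_response_mean(pre_scores, post_scores):
    """Mean change score divided by the sample SD of the change scores."""
    if len(pre_scores) != len(post_scores) or len(pre_scores) < 2:
        raise ValueError("need paired scores for at least 2 patients")
    changes = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return mean(changes) / stdev(changes)

# Hypothetical scores on a 0 to 100 scale before and after treatment
pre = [40, 50, 60, 45]
post = [60, 65, 85, 60]
print(round(standardized_response_mean(pre, post), 2))
```

Because the result is unitless, the same calculation applied to two instruments in the same cohort gives directly comparable values.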


The responsiveness of an instrument is related to an important parameter known as the minimal clinically important difference (MCID). The MCID of an instrument refers to the smallest change in a test score that a patient would consider clinically relevant or important. For example, the MCID of the Constant score was recently determined to be 10.4 for patients with rotator cuff tears. Two other important parameters related to the responsiveness of an instrument are ceiling effects and floor effects. A ceiling effect occurs when an instrument cannot reliably differentiate patients who score very high on a scale, while a floor effect occurs when an instrument cannot reliably differentiate patients who score very low on a scale.
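To make these parameters concrete, the sketch below reports the fraction of patients whose change score reaches a given MCID and the fraction sitting at the lowest and highest possible scores. The function names and patient scores are hypothetical; the MCID of 10.4 is the Constant score value quoted above.

```python
def fraction_reaching_mcid(pre_scores, post_scores, mcid):
    """Fraction of patients whose improvement is at least the MCID."""
    changes = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return sum(1 for c in changes if c >= mcid) / len(changes)

def floor_ceiling_fractions(scores, min_score, max_score):
    """Fractions of patients at the lowest and highest possible scores."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == min_score) / n
    at_ceiling = sum(1 for s in scores if s == max_score) / n
    return at_floor, at_ceiling

# Hypothetical Constant scores (0 to 100) before and after treatment
pre = [40, 55, 62, 70, 90]
post = [65, 60, 80, 100, 100]

print(fraction_reaching_mcid(pre, post, mcid=10.4))
print(floor_ceiling_fractions(post, min_score=0, max_score=100))
```

A large fraction of patients at the maximum suggests a ceiling effect: further improvement in those patients cannot be registered by the instrument.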




Application of Outcome Measures


Clinician Versus Patient Reported Outcomes


Outcome measures are of three main types: patient reported, clinician reported, and performance based. Performance-based outcome measures are typically a subtype of a clinician-reported outcome measure because performance, such as muscle strength or ROM, is usually assessed by a clinician. Clinician-reported and performance-based outcome measures were the mainstay of clinical outcome assessments for the first 75 years of this field. Such measures initially appeared logical because most scientific fields respected the objectivity of observer-based measurement and acknowledged the subjectivity in taking measurements. Performance outcome assessments are important to clinicians and allow the comparison of outcomes between different treatments for the same condition. Therefore, observer-based measurements are still useful to clinicians and patients alike for determining how a particular treatment affects functional outcomes.


Observer-based assessments fail to determine whether patients actually feel better after a given treatment. Patients’ perceptions of their disabilities are more relevant than observer-based measurements. However, a clinician’s involvement in the administration of a patient-based outcome instrument can be a problem. When an observer (e.g., clinician) questions a patient with regard to function and then records the response, the possibility of observer bias is introduced. Such an effect reduces the validity of the assessment tool.


Patient-reported functional outcomes have been developed and tested using both psychometric and clinometric methods and have several advantages over observer-based assessments: (1) assessment of the patient’s perception regarding his or her condition, (2) elimination of bias related to clinician observation, (3) ease of administration by telephone or mail, (4) physical examination not required, (5) cost-effectiveness, and (6) less time required to administer.


Furthermore, recent evidence points to the ability of patients to measure their own performance-based outcomes. In 2014, Yang et al. compared the functional results obtained from patient-reported and clinician-reported outcomes. Patients (n = 120) were mailed instructions explaining how to measure their own ROM (using diagram-based questions) and strength at home before visiting the clinic 1 year after shoulder arthroplasty. Comparison of these results with clinician-measured values indicated substantial to almost perfect agreement in strength measurements (κ statistic, 0.62 to 0.92); the majority (88%) of patients were within 1 grade of the physician-measured ROM, with patients tending to slightly overestimate their ROM.
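For readers unfamiliar with the κ statistic, the following sketch computes an unweighted Cohen's κ between two sets of ratings. Whether the study above used a weighted or unweighted κ is not stated here, so this is only an illustration; the strength grades are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical strength grades: patient self-measured vs clinician-measured
patient_grades = [3, 4, 4, 5, 5, 3, 4, 5]
clinician_grades = [3, 4, 5, 5, 5, 3, 4, 4]
print(round(cohen_kappa(patient_grades, clinician_grades), 2))
```

Unlike simple percent agreement, κ discounts the agreement that two raters would reach by chance alone.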


Response Methodology


Response methodologies for outcome instruments can be either “yes” or “no” answers, Likert scales, or visual analog scales. Visual analog scales take longer to score than other response methodologies if the value of the patients’ responses must be manually determined. Simple responses such as “yes” or “no” limit the number of possible responses and thus may also limit the instrument’s responsiveness. Many contemporary outcome instruments use Likert scales; however, the ideal number of response options for a Likert scale has not been determined.


Scoring an Outcome Measure: Categoric Rankings Versus Aggregate Scores


The categoric rankings used to describe outcomes following treatment of a shoulder disorder differ among studies. Most commonly, outcomes are described using terms such as “poor,” “fair,” “good,” or “excellent.” The use of categoric ranking may lead to the erroneous assumption that independently designed scoring systems describe similar levels of impairment. The variability in the definition of the terms used for categoric rankings hinders accurate communication between investigators and is an impediment to the objective comparison of results between different studies. For example, different scoring systems have different weights assigned to each domain and different values assigned to each categoric ranking. The outcome measures currently used for the shoulder are based on clinical domains, such as ROM, pain, and ability to perform daily activities; each domain is separately scored. Scores are then aggregated and assigned categoric rankings that range from “poor” to “excellent.” Thus a patient with a given raw score may have different categoric rankings depending on which scoring system is used.


The differences between scoring systems are so pervasive that categoric rankings cannot be relied on to provide meaningful comparisons either within or between patient cohorts. Furthermore, study results that use categoric rankings from different scoring systems cannot be combined. Therefore, when reporting categoric rankings for research purposes, the aggregate score should also be included.
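The problem can be illustrated with two invented scoring systems. The cutoffs below are hypothetical and do not correspond to any published instrument; they only show how the same aggregate score can receive different categoric rankings.

```python
# Hypothetical cutoffs: (minimum aggregate score, categoric ranking)
SYSTEM_A_CUTOFFS = [(90, "excellent"), (75, "good"), (60, "fair")]
SYSTEM_B_CUTOFFS = [(95, "excellent"), (85, "good"), (70, "fair")]

def categoric_rank(score, cutoffs):
    """Map an aggregate score to a ranking using descending cutoffs."""
    for minimum, label in cutoffs:
        if score >= minimum:
            return label
    return "poor"

aggregate_score = 78
print(categoric_rank(aggregate_score, SYSTEM_A_CUTOFFS))
print(categoric_rank(aggregate_score, SYSTEM_B_CUTOFFS))
```

Reporting the aggregate score alongside the ranking lets a reader re-derive the category under any other system's cutoffs.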


Computer Adaptive Testing


CAT is a computerized test administration process that is based on the test-taking philosophy known as item-response theory, in contrast to classical test theory, in which all subjects must answer the same questions. With CAT, subsequent questions are dependent on the answers provided to previous questions. Apart from the first question, which is identical for all, the test can be tailored to the patient’s needs and ability levels by removing redundancy within the test. For example, if a patient cannot lift a pen, the patient will not be asked if he or she can lift a 10-pound weight. The potential benefits of CAT include (1) reduced test time and decreased patient response burden, (2) reduced floor and ceiling effects, (3) one-dimensional testing that may clarify the interpretation of results, and (4) the ability to add or subtract questions from the item bank without the need to recreate and validate an entirely new scale.
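The adaptive idea can be sketched with a toy item bank. This is a deliberately simplified illustration and not how PROMIS or any validated CAT engine works: a real system estimates ability with an item-response-theory model, whereas this sketch just moves a running estimate up or down by a shrinking step. The item names, difficulty values, and function name are all hypothetical.

```python
def run_adaptive_test(item_bank, can_do, max_items=3):
    """Ask up to max_items questions, each chosen from the previous answers.

    item_bank: dict mapping a task description to a difficulty value
               (higher means a harder task).
    can_do:    function taking a task description, returning True or False.
    """
    remaining = dict(item_bank)
    ability = 0.0  # same starting estimate, so the first item is the same for all
    step = 2.0
    asked = []
    while remaining and len(asked) < max_items:
        # Pick the unused item whose difficulty is closest to the estimate
        item = min(remaining, key=lambda name: abs(remaining[name] - ability))
        del remaining[item]
        answer = can_do(item)
        asked.append((item, answer))
        # Move the estimate toward harder or easier tasks, then shrink the step
        ability += step if answer else -step
        step /= 2
    return ability, asked

ITEM_BANK = {
    "lift a pen": -3.0,
    "lift a full cup": -1.5,
    "lift a gallon jug": 0.0,
    "lift a 10-pound weight": 1.5,
    "lift 20 pounds overhead": 3.0,
}

# A severely limited patient who cannot perform any of the tasks
estimate, asked = run_adaptive_test(ITEM_BANK, can_do=lambda item: False)
for item, answer in asked:
    print(item, answer)
```

In this run the patient is never asked about the 10-pound weight, because each failed task steers the next question toward an easier one.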


CAT is the testing platform used by the National Institutes of Health’s Patient Reported Outcomes Measurement Information System (PROMIS), which was introduced in 2007. The PROMIS Physical Function score was the first score to be developed, followed by the PROMIS Upper Extremity Score. PROMIS testing of the upper extremity has been shown to be reliable, with psychometric qualities comparable with other common upper extremity measures (e.g., the DASH score) for various upper extremity conditions. The test can be completed by the patient in one-tenth of the time required by other instruments. PROMIS and CAT are still in a nascent phase of development and proliferation. The potential of CAT is gradually being recognized, and it is likely to become a prominent element of future instruments developed for effectiveness evaluation of the shoulder.




Types of Outcome Measures


A wide variety of outcome measures are available for use in the shoulder (Box 5-1). In increasing order of specificity, these measurement tools include generic health instruments or health utility measures used to evaluate a patient’s overall health, limb-specific instruments used to evaluate upper extremity function, joint-specific instruments used to evaluate shoulder function, and disease-specific instruments used to evaluate patients with a certain disease, such as shoulder instability. The following are examples of instruments used to assess outcomes in patients with shoulder-related pathology.



Box 5-1

Outcome Instruments Used to Evaluate the Shoulder


Generic





  • Arthritis Impact Measurement Scale



  • Duke Health Profile



  • EuroQol 5-D (EQ-5D)



  • Index of Well-Being



  • Nottingham Health Index



  • Short Form 12 (SF-12)



  • Short Form 36 (SF-36)



  • Sickness Impact Profile



Limb Specific





  • Disabilities of the Arm, Shoulder, and Hand (DASH)



  • QuickDASH



  • Modified American Shoulder and Elbow Surgeons (M-ASES)



  • Musculoskeletal Functional Assessment



  • Toronto Extremity Salvage Score



  • Upper Extremity Function Scale



  • Upper Limb Functional Index



  • Functional Impairment Test–Hand and Neck/Shoulder/Arm



  • Gallon-Jug Shelf-Transfer Test



Shoulder Specific





  • American Shoulder and Elbow Surgeons Standardized Assessment



  • Constant-Murley



  • Abbreviated Constant score



  • Disability questionnaire



  • L’Insalata Shoulder Rating Questionnaire



  • Neer Rating Sheet



  • Oxford Shoulder Score



  • Shoulder-Arm Disability Questionnaire



  • Shoulder Disability Questionnaire



  • Shoulder Function Assessment Scale



  • Shoulder Pain and Disability Index



  • Shoulder Pain Score



  • Shoulder Questionnaire



  • Shoulder Rating Scale



  • Shoulder Severity Index



  • Simple Shoulder Test



  • Subjective Shoulder Rating Scale



  • UCLA End-Result Score



  • University of Pennsylvania Shoulder Score



  • Wheelchair User’s Shoulder Pain Index



Disease Specific





  • AC Separation Scoring System



  • Melbourne Instability Shoulder Scale



  • Rowe’s Rating for Bankart Repair



  • Rotator Cuff Quality of Life



  • Western Ontario Rotator Cuff Index



  • Western Ontario Shoulder Instability Index



  • Western Ontario Osteoarthritis of the Shoulder Index




Generic Health-Related Quality of Life Instruments


Generic health instruments provide information regarding the impact of a specific condition of interest and any coexisting conditions on a patient’s general health. These instruments can be used for almost any type of disorder and can also be used to evaluate a patient’s health outcome following treatment of a specific disorder. Such instruments also permit scores to be compared across other orthopedic and nonorthopedic conditions. Therefore the outcomes of different treatments as well as the disability associated with various conditions can be compared using such a score. This type of instrument has important implications for studying health economics in addition to allowing comparative analysis between specialties of the amount of disability that patients perceive themselves to endure. Using this technique, patient-perceived disability due to shoulder conditions was shown to rank similar in severity to five major medical conditions: hypertension, congestive heart failure, acute myocardial infarction, diabetes mellitus, and clinical depression.


Short Form 36 and Short Form 12


The Short Form 36 (SF-36) is the most commonly used tool in orthopedic surgery and clinical medicine to assess general health-related quality of life. This tool measures eight health concepts: physical functioning (10 items), social functioning (two items), role limitations due to physical problems (four items), role limitations due to emotional problems (three items), mental health (five items), energy/fatigue (four items), bodily pain (two items), and general health perception (five items). The reliability, validity, and responsiveness of the SF-36 for conditions of the upper extremity have been demonstrated. An abbreviated version of the SF-36, the Short Form 12 (SF-12), has been developed and has been validated to a lesser extent for use in a variety of shoulder conditions, with good correlation to the longer version.


EuroQol 5 Dimensions Questionnaire


The EuroQol group was founded in 1987 and is described as “an international group of researchers dedicated to the measurement of health outcome.” Initially a European collaboration, the group now has active members worldwide. The evolution of their efforts eventually resulted in the development of the EuroQol 5 Dimensions Questionnaire (EQ-5D), which is an internationally validated general measure of health-related quality of life. The EQ-5D consists of a questionnaire and a self-rated health status assessment using a visual analogue scale (EQ-VAS). The self-assessment questionnaire evaluates the patient’s current health in five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The patient is asked to grade his or her current level of function for each dimension into one of three degrees of disability (severe, moderate, or none). The combination of these grades with the conditions “death” and “unconscious” enables description of 245 different health states. Each health state can be ranked and transformed into a single score (known as the health utility score). The utility score is used to calculate quality-adjusted life years and is commonly used to make evidence-based decisions in analyses of cost effectiveness. Therefore the EQ-5D can be used for health outcomes and economic studies, making it a highly useful and versatile tool for health outcomes and quality improvement research.
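The figure of 245 health states follows from simple counting: five dimensions with three levels each give 3^5 = 243 profiles, and the two additional conditions bring the total to 245. A short enumeration (with variable names chosen only for this illustration) confirms the arithmetic.

```python
from itertools import product

DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = ("none", "moderate", "severe")

# Every combination of one level per dimension
profiles = list(product(LEVELS, repeat=len(DIMENSIONS)))

# Add the two conditions that are not described by the five dimensions
health_states = profiles + ["death", "unconscious"]

print(len(profiles), len(health_states))
```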


The EQ-5D has been used in multiple shoulder studies, and while it is more commonly used in Europe, its use is increasing in North America. One criticism of the EQ-5D is that it may not have high enough responsiveness to detect small but clinically relevant differences in health status.


Limb-Specific Outcome Instruments


Limb-specific (i.e., regional) instruments are based on the presumption that the upper extremity functions as a kinematic chain. According to this paradigm, the shoulder as well as the elbow, forearm, and wrist position the hand for grasping and manipulating objects in the environment. Studies of limb-specific instruments have shown close correlation with joint-specific and disease-specific measures, although they tend not to have the same degree of responsiveness. Nevertheless, a limb-specific instrument may be appropriate when the diagnosis is less certain or when more than one part of the extremity is affected by injury or disease. For example, a condition involving any one of the shoulder, elbow, wrist, or hand can affect one’s ability to use a telephone.


DASH and QuickDASH Questionnaires


The Disabilities of the Arm, Shoulder and Hand (DASH) Questionnaire was developed for assessment of single or multiple musculoskeletal disorders affecting the upper limb. The DASH is the third most commonly used upper extremity outcome measure in publications over the last 2 years, after the Constant-Murley and the American Shoulder and Elbow Surgeons (ASES) scores. The DASH is a patient-administered, 30-item questionnaire that was designed to quantify physical disability and symptoms in individuals with upper limb disorders. The DASH is divided into three domains, namely physical function (21 items), symptoms (six items), and social/role functions (three items), plus two optional modules: sports/music (four items) and work activities (four items). The response to each item is scored on a 5-point Likert scale, and a function/symptom score (from 0 to 100) is calculated; a higher score indicates greater disability. The MCID of the DASH is 10.2 points, with a minimal detectable change of 12 points. The DASH has demonstrated excellent responsiveness compared to joint-specific questionnaires and has been validated for use in both proximal and distal upper limb disorders. It has also been shown to correlate well with the SF-36, but with fewer ceiling and floor effects, supporting its use as a valid measure of health status in patients with a wide variety of upper extremity complaints. This instrument has been validated in many languages including Canadian French, Spanish, Portuguese, Dutch, Chinese, and Taiwanese.
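The chapter does not spell out how the 0 to 100 score is calculated. The sketch below assumes the commonly published DASH scoring rule, in which the mean item response (1 to 5) is reduced by 1 and multiplied by 25, and in which a score is not calculated when more than three of the 30 items are unanswered. Both rules are assumptions brought in for illustration and should be checked against the official scoring instructions; the function name is hypothetical.

```python
def dash_score(responses):
    """DASH function/symptom score from 30 item responses.

    responses: list of 30 entries, each an int from 1 to 5,
               or None for an unanswered item.
    Returns a score from 0 (no disability) to 100 (greatest disability),
    or None when more than 3 items are missing (assumed rule).
    """
    if len(responses) != 30:
        raise ValueError("the DASH has 30 items")
    answered = [r for r in responses if r is not None]
    if any(r not in (1, 2, 3, 4, 5) for r in answered):
        raise ValueError("each response must be an integer from 1 to 5")
    if len(answered) < 27:
        return None
    # Mean response rescaled so that 1 maps to 0 and 5 maps to 100
    return (sum(answered) / len(answered) - 1) * 25

print(dash_score([2] * 30))
```

Rescaling the mean rather than the sum is what allows a score to be computed when a few items are left blank.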


An abbreviated form of the DASH, the QuickDASH, was developed in 2005. The QuickDASH was developed using three item-reduction techniques on a cross-sectional field of data derived from a study of 407 patients with various upper-limb conditions. This instrument has been found to have similar discriminant ability and test-retest reliability to the DASH.


Shoulder-Specific Outcome Instruments


Joint-specific health instruments have been found to be more responsive than generic health status instruments in assessing conditions of the upper extremity. This greater responsiveness is probably because they include specific items that are relevant to the patient population being studied. The disadvantage of joint-specific instruments is the need for numerous instruments (i.e., one for each joint) and the inability to compare outcomes across various conditions, populations, or interventions. A systematic review of 610 articles published between 1992 and 2002 identified 44 different shoulder or upper-limb specific outcome instruments in active use in shoulder surgery research. A more recent review of the literature published in the JBJS (American and British), the American Journal of Sports Medicine, and the Journal of Shoulder and Elbow Surgery during the years 2013 to 2014 found that the most common shoulder-specific functional outcome instruments used are as follows (from most to least popular): the Constant-Murley score, the ASES, the Simple Shoulder Test, and the University of California–Los Angeles score. Interestingly, another recent review of literature from the past 10 years in JBJS shoulder-related publications found the exact same order of popularity of these instruments.


Constant-Murley Score


The Constant-Murley shoulder score (also known as the Constant score) is a 100-point scoring system (100 points representing the best score) in which 35 points are derived from the patient’s reported pain and function and the remaining 65 points are allocated to assessment of ROM and strength. The combination of patient-reported, clinician-reported, and performance outcome measures into a single numerical score makes it relatively unique and is a major strength compared with the other outcome measures commonly used. On the other hand, some investigators believe that patient- and observer-based measurements should not be combined into a single score and are best reported separately. The Constant score has been found to be affected by both patient age and sex, with the score tending to decrease beginning at age 50 for men and age 30 for women. Age- and sex-matched normative data are available for the Constant score and have permitted the development of an age- and sex-adjusted Constant score. The MCID for the Constant score was recently reported as 10.4 in the rotator cuff tear population and 15 for patients with general subacromial pain.


A major advantage of the Constant score is its popularity, which allows for comparison between various studies. However, despite its popularity, there have been several criticisms of this score, including its only “fair” reliability. This lower reliability is likely related to the objective measurements (e.g., strength and ROM components) that make up 65% of the score. In the initial publication, measurement techniques were not described in a standardized format, leading to subsequent publication of guidelines on appropriate techniques for administration of the instrument with specific reference made to the strength measurement component. Compared with patient-recorded outcome measures, the Constant score is also more difficult to administer since a large portion of the score is derived from observer-based measurement; if patients are unable to return for follow-up examination, data collection is considered incomplete as telephone or mail-in questionnaires are not possible. Floor effects for trauma patients and ceiling effects for instability patients have also been noted. Ceiling effects are common with most instability-specific scores due to the unique clinical presentation of shoulder instability compared with other shoulder pathologies.


There is only one pain scale in the Constant score, which likely yields an inadequate representation of a patient’s pain since patients often experience different amounts of pain depending on their activities. Improving the accuracy of the Constant score by using the uninjured (i.e., contralateral) shoulder as a reference to yield an “individual relative Constant score” has been suggested; however, this method would be problematic for patients with bilateral shoulder disorders. Due to some of these potential problems with the Constant score, at least two authors have advised against its use in randomized controlled trials.


American Shoulder and Elbow Surgeons Score


The ASES Score outcome measure was developed in 1994. This instrument consists of a patient self-assessment section that can be completed independently and a clinician-assessment section regarding the patient’s ADLs; however, the final score only includes the patient self-assessment section (100 points total: pain, 50 points; function, 50 points). The MCID has been reported as 12 to 17 points for rotator cuff pathology, while a lower estimate of 6.4 has been reported for general shoulder conditions.
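The 50/50 split between pain and function can be sketched as follows. The text gives only the point allocation; the conversion used here (a 0 to 10 pain rating and ten ADL items each graded 0 to 3) is the commonly cited form of the calculation and should be treated as an assumption, as should the function and parameter names.

```python
from typing import Sequence


def ases_score(pain_vas: float, adl_items: Sequence[int]) -> float:
    """Patient self-assessment score out of 100 (pain 50 + function 50).

    Assumptions not stated in the text: pain is rated 0 (none) to 10 (worst),
    and each of the 10 ADL items is graded 0 (unable) to 3 (not difficult).
    """
    if not 0 <= pain_vas <= 10:
        raise ValueError("pain_vas must be between 0 and 10")
    if len(adl_items) != 10 or any(not 0 <= item <= 3 for item in adl_items):
        raise ValueError("adl_items must hold 10 values, each between 0 and 3")
    pain_points = (10 - pain_vas) * 5             # 0 to 50 points
    function_points = sum(adl_items) * 5 / 3      # 0 to 50 points
    return pain_points + function_points
```

For example, a patient reporting a pain level of 4 and grading every ADL item as 2 would score about 63 under these assumptions.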


The ASES assessment form has been found to be a valid, reliable, and responsive outcome tool for general shoulder dysfunction and for specific shoulder conditions such as instability, rotator cuff disease, and glenohumeral joint arthritis. A major advantage of this instrument is its ease of administration; however, this benefit is slightly offset by the requirement to perform calculations afterwards to obtain the final score. The ASES score correlates well with the Constant shoulder score (r = 0.871; P < .01) and is a highly convenient method of outcome assessment.


One major criticism of the ASES score, as with the Constant score, is that only one pain scale is used. This particular scale asks “What is your pain level today?”, which for many patients may not accurately reflect their true pain level because pain level typically varies with the present activity level and time of day. Moreover, three of the 10 questions regarding ADLs pertain to sport or ability level and are not relevant to many patients. Finally, a ceiling effect has been noted in higher-functioning patients.


Simple Shoulder Test


The Simple Shoulder Test (SST) was developed at the University of Washington in 1992 with the belief that outcome instruments could be dramatically simplified yet still retain the same psychometric properties of longer measurement tools. The questionnaire consists of 12 yes/no questions, with two related to pain, seven to function, and three to ROM.


The SST has been shown to have acceptable test-retest reliability, content validity, and construct validity. Responsiveness has been found to be lower in younger patients and in patients with shoulder instability; however, this tool has demonstrated good responsiveness in rotator cuff pathology. Despite this instrument’s respectable psychometric properties, the small number of questions and the dichotomous nature of the responses limit its ability to detect small but clinically important changes in patient functionality, both between different patients and for the same patient followed longitudinally. The MCID of the SST is 2 for patients with rotator cuff pathology and 3 for patients undergoing shoulder arthroplasty.
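The granularity problem described above can be made concrete with a short sketch. It assumes the usual convention that the SST is scored as the number of “yes” responses, which the text does not state explicitly; the function names are hypothetical.

```python
from typing import Sequence

SST_QUESTIONS = 12  # number of yes/no items, per the text


def sst_score(answers: Sequence[bool]) -> int:
    """Count of 'yes' answers (assumed scoring convention), 0 to 12."""
    if len(answers) != SST_QUESTIONS:
        raise ValueError("the SST requires exactly 12 answers")
    return sum(1 for answer in answers if answer)


def sst_percent(answers: Sequence[bool]) -> float:
    """Express the score as a percentage of the maximum."""
    return 100 * sst_score(answers) / SST_QUESTIONS
```

Because only 13 distinct totals are possible, each changed answer moves the result by roughly 8% of the scale, so any improvement smaller than one full item goes undetected.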


Single Assessment Numerical Evaluation Test


Taking simplicity to the extreme, the Single Assessment Numerical Evaluation (SANE) test attempts to convey everything that the more detailed shoulder scores attempt to accomplish. The SANE test is a one-question test that simply asks the patient, “How would you rate your shoulder today as a percentage of normal (0% to 100% scale with 100% being normal)?”


The SANE test has shown increasing popularity in recent years due to suspicions that the more complicated outcome instruments unnecessarily complicate the assessment process. If the most important element of outcome is patients’ perceptions of how they are doing, perhaps we should simply ask patients this question directly.


This practical approach has resonated with some in the orthopedic community. This test is easily administered and does not require a personal encounter, and patients easily comprehend the question and do not require guidance on how to complete the instrument. The SANE test has been used in several recent shoulder publications to assess patients with rotator cuff pathology, instability, superior labrum anterior to posterior (SLAP) tears, and arthritis and has demonstrated a good correlation with both the ASES and Rowe scores for rotator cuff and SLAP pathology; however, this test does not correlate strongly with the QuickDASH or Numeric Pain Rating Scale. The SANE test has not been validated, and the MCID has not been reported.


University of California–Los Angeles End-Result Score


The UCLA end-result score was first described in 1981, making it one of the first published shoulder-specific outcomes measures. This test was developed before the current methodology that applies psychometric principles to develop outcome measures. Although it remains a relatively popular score, there are several criticisms related to the content of this instrument.


The UCLA score is a 35-point scale that evaluates five clinical domains: pain (10 points), function (10 points), active forward elevation (five points), forward flexion strength (five points), and patient satisfaction (five points). The scale has been criticized for its use of descriptive items for pain and function, and one of the items asks whether the patient is “satisfied,” which is not a meaningful question for patients who have not yet undergone treatment. Combining patient-based and observer-based items into a single score (similar to the Constant score) may not be appropriate or meaningful. Finally, in the original description, this single score is then reported in terms of a categoric ranking (poor, fair, good, or excellent), which negatively affects the responsiveness of the tool. When using this scale, it is best to report numeric results and avoid reporting categoric rankings. In psychometric testing, the UCLA scale has demonstrated good internal consistency, although the standard error of measurement for the scale was high and factor loading was inconsistent, suggesting that patients may not distinguish between pain and function. The MCID has not been established for this instrument.
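A minimal sketch of numeric reporting for this scale follows. The domain maxima are taken from the description above; the dictionary keys and function name are hypothetical, and lower bounds for individual items are not checked beyond being non-negative because the text does not give them.

```python
# Maximum points per domain, as described in the text (total 35).
UCLA_MAX = {
    "pain": 10,
    "function": 10,
    "active_forward_elevation": 5,
    "forward_flexion_strength": 5,
    "satisfaction": 5,
}


def ucla_score(domains: dict) -> int:
    """Return the numeric UCLA total (0 to 35) without a categoric ranking."""
    if set(domains) != set(UCLA_MAX):
        raise ValueError("all five UCLA domains must be supplied")
    for name, value in domains.items():
        if not 0 <= value <= UCLA_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {UCLA_MAX[name]}")
    return sum(domains.values())
```

The function deliberately returns only the number and the per-domain inputs remain available, in keeping with the advice to report numeric results rather than poor/fair/good/excellent categories.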


Other Shoulder-Specific Tests


Many other shoulder-specific instruments are currently in use (see Box 5-1). The popularity of any given test should ideally be based on the qualities of the test for the specific purpose of use; however, other factors such as regional preferences, ease of administration, and other nonpsychometric factors play into this decision. Furthermore, the popularity of a test does not necessarily translate into appropriate use of a test, as several instances have been reported in which a shoulder outcome instrument was used inappropriately.


Disease-Specific Outcome Instruments of the Shoulder


Disease-specific tools, designed to assess specific conditions in a specific joint, are the most specific outcome instruments. The assumption is that a person with a shoulder dislocation has different needs and perceptions of disability than a person with a rotator cuff tear. As a result, such instruments are theoretically more likely to exhibit higher validity, reliability, and sensitivity for a specific disease compared to an instrument with the same psychometric properties that is applied across several different diseases. As a general rule, disease-specific outcome measures are very responsive to small changes in the condition for which they were designed. However, a major limitation of these tools is that a specific tool for each shoulder disorder is required and comparison with other disorders is not possible. The disease-specific instruments most commonly used in shoulder surgery were created by the same working group and include the Western Ontario Shoulder Instability (WOSI) Index for instability, the Western Ontario Rotator Cuff (WORC) Index for rotator cuff disease, and the Western Ontario Osteoarthritis of the Shoulder (WOOS) Index for patients with glenohumeral joint arthritis.


Western Ontario Shoulder Instability Index


Shoulder instability is a particular diagnosis with features that distinguish it from other shoulder conditions. Patients with instability typically have long periods of normal function punctuated by episodes of extreme discomfort related to dislocation events or instability episodes. Instability-specific tools were the first type of disease-specific shoulder tools developed. In 1978, the Rowe score was the first of its kind to be developed. Since then, other scores with better psychometric properties than the Rowe score have been developed, including the WOSI Index, the Oxford Shoulder Instability Score (OSIS), and the Melbourne Instability Shoulder Scale (MISS). Of these, the WOSI Index has demonstrated the greatest validity, reliability, and responsiveness and is currently the recommended disease-specific instrument for shoulder instability. The MCID for the WOSI Index is 220.


Western Ontario Rotator Cuff Index and Western Ontario Osteoarthritis Shoulder Index


The same group that developed the WOSI Index also developed disease-specific outcome instruments for patients with rotator cuff disorders (WORC Index) and glenohumeral joint osteoarthritis (WOOS Index). The MCID for the WORC Index is 245 and that of the WOOS Index has not been determined. Both of these scores have demonstrated desirable psychometric properties and are the most frequently used disease-specific instruments for their respective disorders. However, to date, neither instrument has been used as often as the shoulder-specific instruments described earlier in this chapter.
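The MCID values of 220 (WOSI) and 245 (WORC) are easier to interpret relative to the size of the scale. The sketch below assumes that both indices are totalled out of 2100 (21 items, each marked on a 100-point scale); that maximum is not stated in the text and should be verified against the instruments themselves before use.

```python
# Assumption, not from the text: both indices have 21 items scored 0 to 100.
ASSUMED_MAX_TOTAL = 21 * 100

# MCID values quoted in the text.
MCID = {"WOSI": 220, "WORC": 245}


def mcid_as_percent_of_range(instrument: str) -> float:
    """Express an instrument's MCID as a percentage of its assumed range."""
    return 100 * MCID[instrument] / ASSUMED_MAX_TOTAL
```

Under this assumption, the MCID corresponds to a change of roughly 10% to 12% of the full scale for either instrument.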


Rotator Cuff Quality of Life Questionnaire


The Rotator Cuff Quality of Life (RC-QOL) is a 34-item questionnaire developed at the University of Calgary that consists of five domains: physical symptoms, sports/recreation, work-related concerns, lifestyle concerns, and social/emotional concerns. Each item is scored with a 100-point visual analog scale. The sum of all items is divided by 34 to give a final score out of 100, with 100 being the highest and 0 being the lowest functional level. The RC-QOL was recently shown to be predictive of the need for surgery in patients with rotator cuff tears and was found to have similar construct validity to the WORC score. The MCID has not been determined for this instrument.
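The scoring rule just described (sum of all items divided by 34) can be sketched in a few lines of Python. The function name and the input validation are illustrative additions rather than part of the published instrument.

```python
from typing import Sequence

RC_QOL_ITEMS = 34  # number of questionnaire items, per the text


def rc_qol_score(items: Sequence[float]) -> float:
    """Mean of the 34 visual analog items, giving a score from 0 to 100."""
    if len(items) != RC_QOL_ITEMS:
        raise ValueError("the RC-QOL requires exactly 34 item scores")
    if any(not 0 <= item <= 100 for item in items):
        raise ValueError("each item must be between 0 and 100")
    return sum(items) / RC_QOL_ITEMS
```

A patient marking every item at 50 would therefore receive a final score of 50, with 100 representing the highest functional level.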


Outcome Measures That Assess Pain


Pain is a common, subjective, patient-reported measure that is somewhat controversial when included as part of a total score for functional outcome assessment. The amount of pain experienced by patients is certainly one of the most important determinants of outcome for musculoskeletal disorders before and after treatment. However, correlations between self-reported pain and both self-reported disability and observer-based measures of function are sometimes significant but often weak. The measurement of pain often yields much greater treatment effect sizes, or responsiveness, than do physical variables or condition-specific instruments. Furthermore, since the perception of pain is widely variable between patients with the same or similar conditions, some believe that pain scores should not be included in the scoring of functional limitation. This issue can be addressed by scoring and reporting pain and function separately.


Taking these concepts into consideration, the outcome of therapies designed for the treatment of shoulder disease should be described ideally on the basis of a separate assessment of pain, in addition to any generic, region-specific, or condition-specific outcome instruments. Several different scales are available for the measurement of pain intensity, but the most commonly used are the Visual Analog Scale (VAS), Numerical Rating Scales (NRS), and Verbal Rating Scales (VRS).
