Introduction
The last century has brought remarkable advances in the field of orthopaedic traumatology. The creation and dissemination of this new knowledge could not have been possible without the ability of surgeon scientists to measure and communicate their patient outcomes. Just as the delivery of skeletal trauma care has evolved through the decades, so have the techniques used to quantitate and report these advances.
Outcomes Research—Assessment of Clinical Outcomes
The concept of outcome measurement is not new to orthopaedic surgery. In 1914, the term “end result” was coined by Dr. Ernest Codman, an orthopaedist who is most recognized for his contributions to shoulder rehabilitation. In addition to his work on the shoulder, Codman carried the strong belief that all patients should be reevaluated at the end of 1 year to learn from the patient’s “end result” of treatment. Unfortunately the medical community was not receptive to these initiatives and the concept lay dormant for decades.
As advances in skeletal trauma care burgeoned during the latter half of the twentieth century, pioneers recognized the need for improvements in the ways in which injuries were described and results reported. Contributing to the dramatic improvement in skeletal trauma care was the formation of the Swiss fracture study group Arbeitsgemeinschaft für Osteosynthesefragen (AO). Since its inception in 1958, one of the central tenets of this group has been the rigorous documentation of cases. As the principles espoused by this group gained acceptance internationally, there soon arose a need to report the outcomes of this novel approach to fracture care. Unfortunately, the standardized fracture classification scheme of the AO was not initially combined with standardization or consistency in outcome measurement. Those authors sharing their experience in the medical literature were forced to devise their own outcome measures. A prototypical example of this is Anderson’s classic 1975 article describing his group’s success with compression plating of both-bone forearm fractures. In this article, the authors “arbitrarily” grouped patients into those with union, delayed union, and nonunion. Additionally, they characterized functional outcome as excellent, good, unsatisfactory, or failed without precise definitions of these categories. This article, like the vast majority of works from that era, was retrospective in design and lacked consistency in the manner in which results were reported. Standard data from that era focused on parameters such as range of motion, stability, alignment, radiographic union, lost fixation, and infection. These outcomes were used because they were the ones important to the orthopaedist and available by retrospective chart review. Clinical outcomes were simply grouped into large categories arbitrarily defined by the authors.
While this generation of outcome measurement played an important role in the advancement of skeletal trauma care, it had substantial inherent shortcomings. Comparing rates of “excellent” results in one study to those in another was fraught with difficulty because of the lack of consistent standards. Additionally, objective clinical outcomes are known to have significant intra- and interexaminer inconsistencies, bringing the validity of these results further into question. As orthopaedic clinical research design improved to include multicenter trials and prospective randomized design, a distinct need was recognized for improvement in outcome measures.
Region-Specific Outcome Measures
With these acknowledged shortcomings, the orthopaedic community zealously addressed this issue with the introduction of dozens of outcome measures directed at specific anatomic sites. These regional scores typically combined physician-assessed parameters, functional abilities, and the patient’s perception of pain. Some of the more common tests that evolved included the Michigan Hand Outcomes Questionnaire, American Shoulder and Elbow Surgeons Elbow Questionnaire, Constant-Murley Shoulder Outcome Score, American College of Foot and Ankle Surgeons scoring scales, Iowa Knee Score, and the Harris Hip Score. These measures would typically take between 10 and 30 minutes to complete and varied from those that were completed exclusively by the patient to those that required an adjunct examiner. With this proliferation, the clinical research landscape rapidly transitioned from an environment void of published outcome measures to one in which there were soon too many to choose from. With this ever-increasing number of outcome instruments, a new body of literature emerged directed at choosing the best instrument for some of the various anatomic regions. Martin and Irrgang and coworkers have studied the performance of 14 patient-completed foot and ankle scores. They concluded that there were significant advantages and disadvantages for each of the 14 but that a consistently dominant choice was not available. Others examining the various measures for the hand, elbow, and hip have drawn similar conclusions. In pursuit of greater consistency and comparability between studies, scores have been developed that cover the upper or lower extremity as a whole rather than a single anatomic site. The Disabilities of the Arm, Shoulder, and Hand (DASH) questionnaire is an example of such an instrument that applies to the upper extremity. This instrument is a 30-question, self-administered evaluation of physical activity, pain, symptom severity, and impact of upper extremity disease on everyday activities. Answers are weighted and compiled to produce a single score for intra- and interpatient comparison. Because it has been well studied in a wide range of upper extremity applications, translated into numerous languages, and recently made available in a shortened form, it has become one of the more popular regionally specific outcome measures in orthopaedic trauma.
Patient-Reported Outcome Measures
Beyond focused regional outcome measures, it is becoming increasingly clear that these joint-specific measures do not always provide insight into outcomes that are of importance to patients. A significant amount of literature supports the observation that good clinical outcomes as determined by physicians do not necessarily indicate good functional outcomes from the patient’s perspective. This realization has led to an exponential growth in use of “patient-reported outcomes” (PROs) of general health in important prospective clinical trials. The driving philosophy of patient-derived outcome measures is that they reflect health and well-being across domains of life that extend well beyond those captured in a regional outcome measure. While data procurement may be facilitated through interviews, data collected this way qualify as a PRO only if the interviewer is recording the patient’s own views, not if the interviewer uses patient responses to make a professional assessment or judgment of the impact of the patient’s condition. Because, by definition, general health outcome measures have application to a wide variety of medical and social conditions, their associated population norms, domain specificity, and reproducibility have been extensively studied. Some of the most useful measurement tools used in orthopaedic outcome measurement include the Medical Outcomes Study Short Form 36-item health survey (SF-36), the Quality of Well-Being (QWB) scale, the Sickness Impact Profile (SIP), and the EuroQol Group’s five-dimension (EQ-5D) questionnaire. The use of generic instruments to obtain physical, mental, and functional outcome data has become so commonplace that some granting institutions require the incorporation of a generic questionnaire into the design of clinical projects. Each of these instruments assesses domains of human activity, including physical, psychological, social, and role functioning. The ultimate effect is a global evaluation of the patient as a whole being rather than a disease, an injury, or an organ system.
The SF-36 was developed by Ware and colleagues and the Rand Corporation as a part of the Medical Outcomes Study. The SF-36 is the most widely applied general health status instrument and has certain features that make it particularly appealing for studying musculoskeletal injury. The SF-36 consists of 36 scaled-response questions (0 = poor, 100 = best) concerning eight different functional subscales: bodily pain, role function–physical, role function–emotional, social function, physical function, energy/fatigue, mental health, and general health perceptions. Each scale is scored separately. These subscales can be combined into the Physical Component Score and the Mental Component Score. The SF-36 has been published with normative values for the U.S. population, which vary with age, gender, and comorbidities.
The SIP, a 136-question endorsable statement (yes/no) questionnaire, requires trained interviewers for administration and takes 25 to 35 minutes to complete. The SIP inquires about 12 different domains, which are first scored independently, then combined into physical and psychosocial subscales, as well as one aggregate score. The scale is 0 to 100 points—the higher the score, the worse the disability. Patients with scores in excess of the mid-30s have significantly diminished quality of life. The SIP has been used in patients with multiple health conditions and allows for comparisons of impact of disease on health. The SIP has been used in musculoskeletal trauma with good success. Lesser degrees of musculoskeletal dysfunction are not identified, and therefore, the SIP also suffers from the ceiling effect. Because of the difficulty and length of its administration, the SIP may be most useful for well-funded outcome studies or controlled trials.
The EQ-5D is so named because it assesses five dimensions of health status—mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The first three dimensions reflect physical functioning. For pain level estimation, patients are asked to choose between three responses for level of pain and discomfort (none, moderate, or extreme). However, if chronic pain is well controlled, the patient’s best response may be most appropriately described as “mild,” which is not an option. The answers within each dimension are then weighted, based on preferences from 0 to 1 that correlate to worst versus best health, and then used to calculate a final score.
Although the SF-36, SIP, and EQ-5D are some of the most widely used general health outcome measures, a common concern over their use in measuring outcomes of musculoskeletal injury is the ceiling effect. This form of measurement failure occurs when a patient’s level of function scores at the top end (ceiling) of a given tool. This suggests that the range of function tested by the particular tool is not high enough for the condition under consideration. In response to this concern, the Musculoskeletal Functional Assessment (MFA) and the Short Musculoskeletal Functional Assessment (SMFA) questionnaires were developed. While continuing to focus on the patient’s perceived general health outcome, the content of these tools was directed at function derived from use of the extremities. The MFA is a generic, 101-item instrument that assesses function in 10 domains, with emphasis on musculoskeletal function: self-care, emotional status, recreation, household work, employment, sleep and rest, relationships, thinking, activities using arms and legs, and activities using hands. The MFA requires approximately 15 to 20 minutes to complete and can be either self- or interviewer-administered.
The drawback of the MFA is that its detail becomes time-consuming for patients and staff. To address this concern, the SMFA was developed. Investigators selected questions from the longer MFA based on universality, applicability, uniqueness, reliability, and validity. The SMFA is a 46-question, self-administered instrument that can be completed in approximately 10 to 15 minutes. This instrument is divided into two parts. The first part has four categories: daily activities, emotional status, arm/hand function, and mobility, with an accompanying five-point scale for patients to estimate their function; part 1 questions are then totaled to create the “dysfunction index.” The second part contains 12 questions that assess the degree to which patients are bothered in recreation and leisure, sleep and rest, work, and family, also with an accompanying five-point scale; responses from the second part are combined to create the “bother index.” Table 27-1 provides an overview of some of the more common outcome measures used in the reporting of results in orthopaedic trauma.
Term | Definition |
---|---|
Performance | What is done and how well it is done to provide healthcare (JCAHO 2002) |
Performance Measurement * | The use of both outcomes and process measures to understand organizational performance and effect positive change to improve care (Nadzam and Nelson 1997) |
Performance Indicator † | Markers or signs of things you want to measure but which may not be directly, fully or easily measured (Alberta Government 1998) |
Performance Measure | A quantitative tool, such as rate, ratio or percentage, that provides an indication of an organization’s performance in relation to a specified process or outcome (JCAHO 2002) |
Process Measure | A measure focusing on a process that leads to a certain outcome, meaning that a scientific basis exists for believing that the process, when executed well, will increase the probability of achieving a desired outcome (JCAHO 2002) |
Outcome Measure | Not simply a measure of health, well-being or any other state; rather, it is a change in status confidently attributable to antecedent care (intervention) (Donabedian 1968) |
* Like this one, many performance measurement (PM) definitions included the use of measurement results for organizational improvement that implies performance management—and resulted in these two terms being used interchangeably in the literature.
† Despite these distinctions, the terms performance measure and performance indicator were usually used interchangeably in most general discussions about PM because either or both are used in PM.
Patient Reported Outcome Measurement Information System
Although the sophistication of outcome measures has rapidly advanced, there appears opportunity for further refinement through the use of modern item response theory (IRT) and computer adaptive testing (CAT). Recognizing the importance of PRO measures in demonstrating value for healthcare expenditures, the National Institutes of Health (NIH) has embraced these sciences with the Patient Reported Outcome Measurement Information System (PROMIS) initiative. PROMIS measures domains related to physical health, mental health, and social health for both adults and children. Upon this structure, specific tests have been developed for pain, fatigue, emotional distress, physical functioning, social role participation, and global health perceptions. For each of these areas of potential interest, PROMIS offers domain-specific short forms and expanded forms, which are similar to other existing PRO measures. PROMIS becomes novel in its use of CAT. Using computer-based “smart” algorithms, the questions presented to an individual patient are determined by the responses to previous questions. This IRT innovation more efficiently identifies domain performance while avoiding ceiling and floor effects by selectively presenting patient questions germane to the individual’s determined functional level. Evidence of improved precision and instrument efficiency has been demonstrated in the physical functioning domain in patients with rheumatoid arthritis. Given the rapidly expanding use of digital technologies in orthopaedic research and the growing experience with these publicly available outcome measures, it is anticipated that PROMIS methodologies will play an expanding role in outcome measures in orthopaedic trauma.
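The adaptive question selection described above can be illustrated with a minimal sketch. It assumes a simple dichotomous Rasch (one-parameter IRT) model, a hypothetical five-item bank with made-up difficulty values, and a grid-search estimate of the functional level with a standard normal prior. None of these details come from PROMIS, whose calibrated item banks and item models are more elaborate; the sketch only shows how each answer steers the choice of the next question.

```python
import math

def p_endorse(theta, difficulty):
    """Rasch model: probability that a respondent at level theta endorses an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Item information under the Rasch model; largest when difficulty is near theta."""
    p = p_endorse(theta, difficulty)
    return p * (1.0 - p)

def estimate_theta(responses):
    """Grid-search estimate of theta with a standard normal prior (assumed here).

    responses is a list of (difficulty, endorsed) pairs.
    """
    grid = [g / 10.0 for g in range(-40, 41)]

    def log_posterior(theta):
        total = -0.5 * theta * theta  # log of the prior, up to a constant
        for difficulty, endorsed in responses:
            p = p_endorse(theta, difficulty)
            total += math.log(p if endorsed else 1.0 - p)
        return total

    return max(grid, key=log_posterior)

def run_cat(item_bank, answer, max_items=3):
    """Ask up to max_items questions, each chosen to be most informative so far."""
    theta = 0.0
    remaining = dict(item_bank)
    responses, asked = [], []
    while remaining and len(asked) < max_items:
        name = max(remaining, key=lambda n: item_information(theta, remaining[n]))
        difficulty = remaining.pop(name)
        responses.append((difficulty, answer(name)))
        asked.append(name)
        theta = estimate_theta(responses)
    return theta, asked

# Hypothetical item bank: item text -> difficulty (illustrative values only).
ITEM_BANK = {
    "walk across a room": -2.0,
    "climb one flight of stairs": -1.0,
    "walk one mile": 0.0,
    "run one mile": 1.0,
    "run five miles": 2.0,
}

high_theta, high_asked = run_cat(ITEM_BANK, lambda item: True)   # endorses everything
low_theta, low_asked = run_cat(ITEM_BANK, lambda item: False)    # endorses nothing
print(high_asked, high_theta)
print(low_asked, low_theta)
```

Both simulated respondents start with the same mid-range question, after which the high-functioning respondent is routed to harder items and the low-functioning respondent to easier ones. This routing is how an adaptive test avoids spending questions outside the patient's functional range.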
Choosing the most appropriate outcome measure requires careful framing of the questions to be answered combined with a detailed understanding of the relative assets and liabilities of the measures under consideration. In addition to the mechanics of a given measure, a significant amount of consideration needs to be given to the degree to which the measure has proven to be reliable, valid, and responsive in peer-reviewed testing. Reliability refers to a test’s ability to be free of measurement error. Several facets of this concept exist and include interobserver, intraobserver, and test/retest consistency. Validity refers to a test’s ability to accurately measure the domain that it purports to reflect. Many components of validation exist, including a subjective assessment of the methods by which instrument items were developed, the demonstrated ability of the measure to positively confirm the hypothesis behind the development, and the degree to which the test under consideration performs similarly to previously accepted “gold standards.” Despite common reference to a particular instrument being “validated,” this concept is not an all-or-none phenomenon. Rather, the validation process is a continuous one and improves as population norms are expanded and segmented. As an example, the Lower Extremity Assessment Project (LEAP) found that factors other than treatment, such as level of education and socioeconomic factors, were far more important in determining outcome as measured by SIP scores. This increased appreciation of patient demographics as a driver of outcome has led investigators in subsequent trials to seek standardization in the collection of demographic data in addition to standardization of protocols and outcome measures. Finally, an instrument needs to be responsive to clinically significant change. This requires that a given instrument contain enough items capable of differentiating between functional levels and that the patient’s functional level not lie above or below the target range for a given instrument, which would cause a ceiling or floor effect.
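One simple way to screen a data set for ceiling or floor effects is to count how many respondents score at the extremes of the instrument. The sketch below assumes a scale on which higher scores mean better function, uses made-up scores, and applies a 15% cutoff that is only an assumed rule of thumb, not a figure taken from this text.

```python
def ceiling_floor_rates(scores, min_score, max_score):
    """Return (floor_rate, ceiling_rate): fractions of scores at each extreme."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_rate = sum(1 for s in scores if s == min_score) / n
    ceiling_rate = sum(1 for s in scores if s == max_score) / n
    return floor_rate, ceiling_rate

# Hypothetical scores on a 0 to 100 scale (illustrative data only).
example_scores = [100, 100, 100, 95, 80, 100, 60, 100, 100, 90]
floor_rate, ceiling_rate = ceiling_floor_rates(example_scores, 0, 100)

THRESHOLD = 0.15  # assumed rule-of-thumb cutoff, not from the source text
print("ceiling effect suspected:", ceiling_rate > THRESHOLD)
print("floor effect suspected:", floor_rate > THRESHOLD)
```

In this made-up sample, six of ten respondents sit at the maximum score, so the instrument could not register any further improvement in those patients.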
Unfortunately, the literature that assesses the performance of outcome measures in these areas of reproducibility, validity, and responsiveness is necessarily heavily laden with advanced statistics that often challenge clinician reviewers in their attempts to make prudent choices. Recognizing the need to provide clinical researchers with standardized criteria for assessing instrument performance, a number of organizations including the NIH, American Academy of Orthopaedic Surgeons (AAOS), and Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) have dedicated themselves to the interpretation and dissemination of instrument-validating reviews.
Economic Outcome Measures
In addition to regionally specific outcome measures and patient-reported general health outcome measures, increasing importance is being applied to the economic outcome of medical interventions. In the United States, where healthcare expenditures consume 18% of the gross domestic product, the projections of continually escalating medical costs produce grave concern. On this background, those financing the provision of healthcare are appropriately asking for demonstration of the value returned by these expenditures. Important steps in considering economic factors in healthcare treatment decisions have already been made in orthopaedic trauma. In their prospective analysis of intraarticular calcaneal fractures, Brauer and colleagues drew the conclusion that when considering direct and indirect economic factors, open reduction and internal fixation was a more cost-effective approach than closed treatment. Similar work has shown economic value to the early fixation of nondisplaced scaphoid fractures and distal tibial fractures. In another economic analysis, MacKenzie and colleagues projected future cost estimates for a group of patients with severe lower extremity injury who were prospectively studied. In comparing future cost estimates for those treated with limb-salvage to those treated by amputation, their analysis projected that lifetime costs of amputation were substantially greater than those associated with limb preservation.
Beyond predictions of total cost, it is becoming increasingly clear that the resources available to allocate to healthcare are not unlimited. If choices are to be made regarding fundable interventions, some measure of the cost-effectiveness of a particular intervention is required. Conceptually, meaningful estimates of societal value of a given intervention are dependent on a quantitated health improvement, the estimated remaining years of life, and the costs associated with a particular treatment. Of these three measures, a single quantitative health measure (utility score) is the most challenging to determine. Current methodologies used by most health economists define a range for the utility score between 0.0 and 1.0, with a score of 1.0 representing perfect health and 0.0 representing death. Determining the utility score of a given treatment has been the subject of much debate. Common mechanisms include the use of the QWB PRO scale, a population-derived “time trade-off” (TTO), or “standard gamble” (SG) methodologies. In the TTO method, respondents are asked to trade a year at a compromised level of health for a selected shorter period of perfect health. For example, if respondents chose 9 months of perfect health in exchange for 12 months of compromised health, this would produce a utility index of 0.75 for that type of compromised health. In the SG methodology, respondents are offered a gamble, with variable probabilities, between restoration of perfect health and death. If respondents would accept an 80% chance of being “cured” versus a 20% chance of death, this would produce a utility index of 0.8. Because of the intense interest in extrapolating utility scores from other studies, “maps” converting other outcome measures such as the SF-36 and EQ-5D into utility scores have been created and accepted by health economists. Once a utility score is obtained, the quality-adjusted life-years (QALYs) are determined by multiplying the utility score by the patient’s life expectancy. Cost-effectiveness of a given procedure can then be derived by dividing the cost of the intervention by the calculated QALYs, giving a cost per QALY. Alternative treatment options can be compared by means of an incremental cost-effectiveness ratio, in which the difference in cost between procedures is divided by the difference in QALYs.
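The arithmetic in this paragraph can be worked through in a short sketch. The TTO and SG figures are the ones quoted in the text. The life expectancy, the costs, and the two treatment labels are hypothetical values chosen only for illustration. The incremental ratio is computed in the form conventional in health economics, as the difference in cost divided by the difference in QALYs.

```python
def tto_utility(months_perfect_health, months_compromised_health):
    """Time trade-off: accepted time in perfect health over time in compromised health."""
    return months_perfect_health / months_compromised_health

def sg_utility(accepted_probability_of_cure):
    """Standard gamble: utility equals the chance of cure the respondent accepts."""
    return accepted_probability_of_cure

def qalys(utility, life_expectancy_years):
    """Quality-adjusted life-years: utility multiplied by remaining years of life."""
    return utility * life_expectancy_years

def cost_per_qaly(cost, qaly):
    """Cost-effectiveness expressed as cost per QALY."""
    return cost / qaly

def icer(cost_a, qaly_a, cost_b, qaly_b):
    """Incremental cost-effectiveness ratio of option A relative to option B."""
    delta_qaly = qaly_a - qaly_b
    if delta_qaly == 0:
        raise ValueError("options yield identical QALYs; the ratio is undefined")
    return (cost_a - cost_b) / delta_qaly

# Figures from the text: 9 months traded for 12, and an 80% accepted chance of cure.
u_tto = tto_utility(9, 12)   # 0.75
u_sg = sg_utility(0.8)       # 0.8

# Hypothetical inputs: 20 remaining years, and made-up costs for two treatments.
years = 20
qaly_a = qalys(u_sg, years)    # treatment A, utility 0.8
qaly_b = qalys(u_tto, years)   # treatment B, utility 0.75
cost_a, cost_b = 30_000, 20_000

print("cost per QALY, A:", cost_per_qaly(cost_a, qaly_a))
print("cost per QALY, B:", cost_per_qaly(cost_b, qaly_b))
print("ICER of A versus B:", icer(cost_a, qaly_a, cost_b, qaly_b))
```

With these made-up inputs, treatment A costs 10,000 more and yields one additional QALY, so the incremental ratio is about 10,000 per QALY gained. Whether that represents good value depends on a willingness-to-pay threshold that this sketch does not set.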
Choosing an Outcome Measurement Tool
From the single-surgeon private practitioner to the most active participant in a multicenter randomized trial, all those involved in the delivery of orthopaedic trauma care should be active in the measurement of their outcomes. For the former group, disciplined monitoring of complication rates may be the most appropriate approach and is an effective means of providing a level of quality assurance within a practice. For the interested subspecialist, the routine application of validated site-specific outcome measures will provide a quantitative measure of patient outcome, allowing for individual and practice comparisons to published standards. For the active clinical researcher, choosing both a regionally specific measure and a PRO general health measure provides the necessary site specificity while placing the injury in the context of overall health. Use of currently accepted outcome measures such as the MFA, SMFA, SF-36, EQ-5D, and SIP allows the advantage of comparison to historic controls. If there is concern over a ceiling effect for a given musculoskeletal condition, consideration should be given to either the MFA or SMFA. As the science and experience with IRT and CAT advance, it is anticipated that measures such as those from the PROMIS initiative will build on existing validation data and create computer-aided algorithms that will eventually supplant today’s standards. Finally, in our modern climate of cost awareness, health outcomes should be considered alongside cost data so that patients, providers, and payers can allow reliable value analysis to enter into their treatment-decision algorithms.