Assessing the effectiveness of treatment

Already in 1900, I had become interested in what I have called the End Result Idea, which was merely the common-sense notion that every hospital should follow every patient it treats, long enough to determine whether or not the treatment has been successful, and then to inquire ‘if not, why not?’ with a view to preventing similar failures in future. —E.A. Codman


The concept of measuring outcomes and implementing quality standards is not new. Evidence-based quality improvement was present as early as the 1850s, when Florence Nightingale demonstrated that basic sanitation and hygiene standards decreased mortality among soldiers wounded in the Crimean War. However, the modern outcomes movement began with E.A. Codman, who is best known for his work on the shoulder but whose most lasting contribution to medicine was his commitment to transparent outcomes reporting, which he called his “End Result Idea.” In 1910, Codman and a small group of American surgeons, then called the Society of Clinical Surgeons, visited London, where he presented his End Result Idea, the first organized concept of clinical outcomes assessment. Instituting this concept consumed his life and led to his resignation from Massachusetts General Hospital in Boston to help found a hospital in which he could study his own outcomes and practice medicine as he thought it should be practiced. Codman believed that his outcomes system should be applied to every patient, with follow-up until the result of the procedure was known. He believed that these results should be used to compare surgeons, allowing them to specialize in what they did best, thereby benefitting both patient and surgeon. His ideas were initially well received, and he was appointed Chair of the Committee on Hospital Standardization in the then-fledgling American College of Surgeons (ACS). In 1917 the ACS adopted his End Result System for its Hospital Standardization Program. This program would establish “minimum standards” for hospitals, including the following:

  • Organizing hospital medical staffs

  • Limiting staff membership to well-educated, competent, and licensed physicians and surgeons

  • Framing rules and regulations to ensure regular staff meetings and clinical reviews

  • Keeping medical records that included the history, physical examination, and laboratory findings

  • Establishing supervised diagnostic and treatment facilities such as clinical laboratory and radiology departments

Unfortunately, Codman received little credit for these innovations; 10 years later, in a brief narration of the events surrounding the founding of the ACS, he was not even mentioned.

By 1952 the American College of Physicians, the American Hospital Association, the American Medical Association, and the Canadian Medical Association had combined with the ACS to form the Joint Commission on Accreditation of Healthcare Organizations. This concept was not initially addressed by the government, and the original Social Security Act, passed in 1935, failed to address medical benefits. It was not until 1965, when Titles XVIII and XIX created Medicare and Medicaid, respectively, that United States healthcare began to fall under federal supervision. Under Title XVIII, Congress enacted a set of rules called “Conditions of Participation.” These rules set hospital mandates, such as medical staff credentials and nursing services. In the late 1980s the Joint Commission on Accreditation of Healthcare Organizations implemented accreditation standards that reflected concepts presented by Avedis Donabedian in his 1966 article “Evaluating the Quality of Medical Care.” In this article, he noted three key aspects of quality measurement: structure—the characteristics of healthcare delivery systems; process—what and how care is provided; and outcomes—the consequences of care. These concepts of structure, process, and outcomes continue to form the basis of quality measurement today.

By the 1980s, one of the more important developments in the healthcare field was the recognition of the centrality of the patient’s point of view in monitoring the quality of medical care outcomes. In 1992, Ware et al. published one of the first patient-reported outcome (PRO) measures of quality of life, the Short Form-36 (SF-36). This score allowed a validated measure of outcomes from the patient perspective and once again revolutionized the quality assessment movement. For the first time, PROs could be tracked within and across disease processes. A multitude of similar scores have followed, and with the evolution of information technology, computer adaptive testing (CAT) has enabled the PRO to be reduced to a small number of questions that reflect the well-being of patients.

One final evolution is worthy of mention. In 2006, Michael Porter published the book Redefining Health Care: Creating Value-Based Competition on Results. In it he championed the idea of value—that is, patient-relevant outcomes divided by the cost per patient of achieving those outcomes—and connected the ideas of outcomes and cost. Value-based healthcare was born and is currently one of the dominant metrics of quality measurement. The cost of providing healthcare to achieve quality outcomes is central to the current quality movement. Unfortunately, PRO scores are not currently part of the so-called value equation. This is a particular challenge for the shoulder surgeon because “outcome” measures such as mortality and readmission rates do not apply. Most value measures are process or patient-experience measures that, although important, may not capture the true depth of value of surgical care to the patient. Furthermore, the measurement of cost is even more challenging. It can mean different things to different stakeholders and is often purposefully vague. Is it the amount paid, the charges, or the cost of delivery? Should “cost” include the patient’s time away from work or the cost of complications? Transparency is not the law of the land, and thus an accurate measure of cost and its effect on value remains elusive.

The idea of following every patient through a treatment was and is a noble one, but it is also a process that requires commitment on an individual, institutional, and even a societal level. Codman’s pioneering ideas on the End Result opened new roads that have forced us to grapple with questions about the very definitions of “end” and “result.” These new roads connect with others in the fields of information technology, economics, and even psychology, deepening our understanding of what makes up an “End Result” and from whose perspective. The very concepts of outcome and quality continue to evolve.

Development of outcomes tools

Outcomes assessment in the shoulder can be traced back to Codman. For most of the history of shoulder surgery since, outcomes were generally recorded as reviews of the physician’s notes in the medical record. These reports generally focused on physician impressions of pain, functional outcomes, and range of motion (ROM). This changed in 1978, when Carter Rowe published his landmark paper on outcomes after Bankart repair. It should not be lost on the reader that the subtitle of his article was “A Long-Term End Results Study,” a nod to Codman’s legacy. Part of what made this paper a landmark was that Rowe presented an outcome score that took into account stability, motion, and function (including at least a blunt assessment of pain and patient-reported functional outcome) and assigned patients a point total based on this assessment. To our knowledge, this was the first outcome score reported in the orthopedic literature, and it combined clinician-reported and performance-based measures. Performance-based measures, which include ROM and muscle strength, are generally recorded by the examiner. They remain a valuable assessment tool, providing objectivity and the ability to measure how a particular treatment affects functional outcomes. However, observer-based assessments have several drawbacks. Most importantly, they do not measure the patient’s interpretation of his or her own outcome. Even when the patient’s opinion is solicited, observer bias is introduced, reducing the validity of the assessment tool. Another limitation of traditional observer-based assessments is the requirement that the patient be physically present for the test. This presents logistical challenges that result in inconsistent examinations and poor follow-up. Mitigating steps can be taken, such as telephone interviews, and patients can even accurately estimate their own ROM in certain situations.

As the science around health-related quality of life has advanced, the PRO has overtaken the traditional clinician report as the central method for assessing outcomes. Patient-reported functional outcomes have been developed and tested using both psychometric and clinometric methods and have several advantages over observer-based assessments: (1) assessment of the patient’s perception regarding his or her condition, (2) elimination of bias related to clinician observation, (3) ease of administration by telephone or mail, (4) physical examination not required, (5) cost-effectiveness, and (6) less time required to administer.

Development of patient-reported outcomes

The development of a patient-based outcome instrument was first described in 1985 by Kirshner and involves the following five steps:

  1. Identification of a specific patient population

  2. Generation of items (questions)

  3. Item (question) reduction

  4. Pretesting of the outcome instrument

  5. Determination of the instrument’s measurement properties (validity, reliability, and responsiveness)

Creating a new outcome measure is an exhaustive process that should be carefully monitored and reserved for important and common conditions. The most important characteristic of a PRO is that it is developed with direct input from its target patient population. Item generation and reduction are the most critical steps in the development process because they “guarantee” that patients have communicated what is important to them, thereby establishing content validity.

Perhaps the best way to understand what goes into the development of an outcomes measure is to use a case example, and perhaps the best case example comes from the creation of the Western Ontario Shoulder Instability Index, described by Kirkley et al. In that paper, the authors followed Kirshner’s five-step model and developed what is considered the gold standard score in the assessment of shoulder instability and, in so doing, established the paper as a landmark in outcomes development.

Kirkley et al. identified a specific patient population (step 1), namely patients with shoulder instability, with specific inclusion (e.g., apprehension) and exclusion (e.g., psychiatric illness) criteria. Item generation (step 2) was accomplished in three substeps. The first was a thorough literature search for established outcomes measures; items from the Constant, University of California at Los Angeles, and American Shoulder and Elbow Surgeons (ASES) scores were included as “anchor” measures for comparison. The second substep was to interview colleagues with specific experience in shoulder instability to ensure that the score asked clinically relevant questions. The third substep, cited as most important by the authors, was to interview patients across different ages, genders, and types and severities of symptoms. The authors decided to stop when no new items emerged after five consecutive patients. At the end of item generation, they had interviewed 33 patients and collected 291 separate items.

The process of item reduction (step 3) involved a focus group that removed duplicate, incomprehensible, or irrelevant items. The remaining items were then given to a group of patients with shoulder instability, who answered whether they had experienced each item and how important it was to their shoulder function and well-being. In the Kirkley study, the top 50 items were statistically analyzed, and those with high correlations were considered duplicates and further reduced. This item reduction resulted in the 21 individual questions that make up the Western Ontario Shoulder Instability Index (WOSI) in its current form.
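The correlation-based duplicate screen described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the Kirkley group’s actual analysis; the item names, patient ratings, and the 0.8 threshold are all invented for the example.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two rating lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def flag_duplicates(responses, threshold=0.8):
    """responses: {item_name: [patient ratings]}. Item pairs whose
    ratings correlate above the threshold are flagged as likely duplicates."""
    names = list(responses)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson_r(responses[a], responses[b])) > threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical importance ratings from five patients
responses = {
    "pain at night": [5, 4, 5, 3, 4],
    "pain while sleeping": [5, 4, 5, 3, 5],  # near-duplicate of the item above
    "fear of dislocation": [2, 5, 1, 4, 3],
}
print(flag_duplicates(responses))  # → [('pain at night', 'pain while sleeping')]
```

Only the two near-identical pain items exceed the threshold, so one of the pair would be removed in the reduction step.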

Pretesting of the outcomes instrument (step 4) came next. The authors gave the questionnaire to two separate groups of 10 patients. Issues with question clarity or wording were addressed and changed with the first group, and no further changes were required with the second.

The final step (step 5) in the creation of an appropriate outcome measure is the determination of the instrument’s measurement properties including validity, reliability, and responsiveness.

Validity, in its simplest form, is whether the instrument measures what it is supposed to measure. The simplest way to assess this is to compare the instrument with a gold standard and see how well the two correlate. In the development of a new outcome measure, however, there is no gold standard. In the case of the WOSI score, Kirkley et al. evaluated construct validity by administering it, along with the Disabilities of the Arm, Shoulder and Hand (DASH), ASES, Constant, Rowe, SF-12, shoulder ROM, and shoulder global rating of change scores, to a group of 47 patients, establishing construct validity by demonstrating that the scores correlated significantly.

Reliability refers to the consistency of an outcomes measure in the absence of a change in a patient’s clinical condition. In other words, a patient should score the same on an instrument at different time points, provided their condition has not changed. In the Kirkley example, instability patients were given the WOSI at 2 weeks and 3 months. They were asked if their condition had changed at each time point, and if it had, they were excluded. Reliability was therefore established by stable scores in patients who did not perceive a change in their status.

The final step in evaluating an outcomes tool is responsiveness: the ability of a tool to measure change over time, meaning that the instrument’s scores change when the patient’s condition changes. Several statistical analyses can measure this, including the effect size (mean change in score divided by the standard deviation of the baseline scores) and the standardized response mean (mean change in score divided by the standard deviation of the change scores). In the Kirkley study, the responsiveness of the WOSI was higher than that of any of the other scales. The main advantage of a highly responsive scale is that fewer subjects are required to show a statistically significant difference between treatment groups.
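The two statistics just defined are straightforward to compute. The pre- and post-treatment scores below are invented for illustration (not data from the Kirkley study); only the formulas follow the text.

```python
import statistics

def effect_size(baseline, followup):
    """Effect size: mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return statistics.mean(changes) / statistics.stdev(baseline)

def standardized_response_mean(baseline, followup):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return statistics.mean(changes) / statistics.stdev(changes)

# Hypothetical pre/post scores on a 0-100 instrument
pre = [40, 35, 50, 45, 38, 42]
post = [65, 60, 70, 72, 58, 66]
print(round(effect_size(pre, post), 2))                # → 4.42
print(round(standardized_response_mean(pre, post), 2)) # → 8.16
```

Because the change scores here vary less than the baseline scores, the SRM is larger than the effect size; which denominator is appropriate depends on the study design.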

Evaluation of scores for clinical meaningfulness

Even when an outcomes tool has been shown to be statistically valid, reliable, and responsive, one must still determine whether the applied score is clinically meaningful. For example, Jeong et al. published a large series of 415 patients comparing single-row versus transosseous-equivalent repair of rotator cuff tears. The only outcome score that showed a statistically significant difference was the visual analog scale function score, which favored the transosseous-equivalent group (8.47 ± 1.70) over the single-row repair group (7.91 ± 1.66). Although the headline from this paper may be that “double row outperformed single row for rotator cuff repair,” it is important to realize that the visual analog scale function scores differed by roughly 0.5 points, which is not likely to be clinically meaningful. This paper illustrates the importance of differentiating between a result that is statistically significant and one that is clinically significant.

Several methods have been proposed to assess outcomes scores for meaningfulness. Jaeschke et al. described the “minimal clinically important difference,” the smallest amount an outcome must change to be meaningful to patients. The minimal clinically important difference (MCID) is considered patient centered because it evaluates both the magnitude of the improvement and the importance patients place on the change. This concept has caught on, and MCIDs are available for many of the common outcomes scores used in shoulder assessment (Tables 68.1 and 68.2). However, there are issues with the MCID. First, it may differ when an outcomes tool is applied across different patient populations. For example, the MCID for the ASES score has been reported between 6.4 and 17 depending on the shoulder pathology tested. Furthermore, the MCID itself may be misleading. Consider the Simple Shoulder Test (SST) in rotator cuff surgery, which has an MCID of 2. Improvements in SST from 0 to 3 and from 9 to 12 both meet the MCID but are quite different outcomes. One can consider this problem differently by using the percent of maximal possible improvement, defined as:

(Posttreatment score − Pretreatment score) / (Perfect score − Pretreatment score)

TABLE 68.1

Generic Health Outcomes Instruments

| Instrument | Description | Scoring | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| SF-36 | 36 items in 8 domains: physical functioning (10), social functioning (2), role limitations physical (4), role limitations emotional (3), mental health (5), energy/fatigue (4), bodily pain (2), and general health perception (5) | Each domain scored 0–100; total can be converted to a 0–100 score | Most widely used and validated instrument of its kind; widely adopted across many health conditions | Time consuming; score and scoring system are proprietary and not freely available to the public |
| SF-12 | Shorter version (12 items) of the SF-36 | Similar to SF-36 | Validated in several shoulder conditions and demonstrates good correlation with the SF-36 | Score and scoring system are proprietary and not freely available to the public |
| VR-12 | 12 items, transferrable to SF-12 | Similar to SF-12 | Comparable with the SF-12; in the public domain and therefore free to use | Not widely adopted |
| EQ5D | 5 items: 3-level Likert scale for mobility, self-care, usual activities, pain/discomfort, and anxiety/depression | Utility score that can be expressed as a quality-adjusted life year (QALY) | Can be used for cost-effectiveness decisions because it is easily converted to a QALY | Not widely adopted in North America; may lack responsiveness in some studies |
| PROMIS 10 | 10-question survey used to assess health-related quality of life in the general population | Multiple domains including overall health, pain, fatigue, social, mental, and physical health | Validated and usable across numerous healthcare disciplines; in the public domain; measures health-related quality of life | Not appropriate for all populations; unclear how well it will measure disease-specific outcomes |

EQ5D, EuroQol 5 Dimensions Questionnaire; MCID, minimal clinically important difference (variable depending on condition assessed); PF CAT, Physical Function Computer Adaptive Testing; PROMIS, Patient-Reported Outcomes Measurement Information System; QALY, quality-adjusted life year; SF, Short Form; VR, Veterans RAND.

The difference between these two scores is meaningful. The patient who improves from 0 to 3 achieves only 25% of the maximal possible improvement, whereas the patient who improves from 9 to 12 achieves 100% of the maximal possible improvement (F. Matsen, personal communication; 2018).
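The arithmetic behind this comparison can be made explicit with a short sketch; the SST’s perfect score of 12 is implied by the 9-to-12 example in the text.

```python
def percent_maximal_improvement(pre, post, perfect):
    """Percent of maximal possible improvement:
    (posttreatment - pretreatment) / (perfect - pretreatment) * 100."""
    return 100 * (post - pre) / (perfect - pre)

# SST example from the text (perfect score 12, MCID 2):
print(percent_maximal_improvement(0, 3, 12))   # patient improving 0 -> 3: 25.0
print(percent_maximal_improvement(9, 12, 12))  # patient improving 9 -> 12: 100.0
```

Both patients improve by 3 points and meet the MCID of 2, yet the percent of maximal possible improvement separates them clearly.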

More recently, the Patient Acceptable Symptom State has been proposed to determine the highest symptom level at which patients consider themselves well. Briefly stated, the difference between the Patient Acceptable Symptom State and the MCID equates to the difference between feeling “good” and feeling “better.” Tubach argues that the former is of more value to the patient. There is much work to do in determining the critical level of change in PROs across different disease states and demographics.

Two other aspects are important in determining the clinical meaningfulness of an outcomes score: ceiling and floor effects. A ceiling effect occurs when an instrument cannot reliably differentiate patients who score very high on a scale, whereas a floor effect occurs when an instrument cannot reliably differentiate patients who score very low on a scale. For example, administering the Kerlan-Jobe Orthopedic Score, designed to evaluate high-level throwing athletes, to a population of patients with massive irreparable rotator cuff tears would not make much sense. In this case, one might expect floor effects because many of the patients would score low on the test. Conversely, applying the SST to a population of gymnasts might exhibit a ceiling effect given the level of function required of these athletes.
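Floor and ceiling effects are simple to screen for: count the fraction of respondents at the scale’s extremes. The scores below and the 15% rule of thumb are illustrative assumptions, not figures from this chapter.

```python
def extreme_score_rates(scores, min_score, max_score):
    """Fraction of respondents at the instrument's floor and ceiling.
    A commonly used (assumed) rule of thumb flags >15% at either extreme."""
    n = len(scores)
    floor = sum(s == min_score for s in scores) / n
    ceiling = sum(s == max_score for s in scores) / n
    return floor, ceiling

# Hypothetical SST scores (0-12) in a high-functioning cohort of gymnasts
scores = [12, 12, 11, 12, 10, 12, 12, 9, 12, 11]
floor, ceiling = extreme_score_rates(scores, 0, 12)
print(f"floor: {floor:.0%}, ceiling: {ceiling:.0%}")  # → floor: 0%, ceiling: 60%
```

With 60% of respondents at the maximum score, this instrument could not distinguish further improvement in most of the cohort, which is exactly the ceiling effect described above.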

Types of outcomes measures

There is a plethora of shoulder outcomes scores available to the practicing clinician, and choosing the correct score depends on the question being asked. These scores can be broadly divided into general health assessment instruments, anatomic- (shoulder-) specific instruments, and disease-specific instruments.

General health-related quality of life instruments

These instruments provide information on the impact of a health condition on a patient’s general health. One of the most valuable aspects of these instruments is that they allow comparison across different health conditions. For example, Gartsman et al. used a general health assessment questionnaire to demonstrate that patients with shoulder conditions had lower patient-perceived health than US norms. Furthermore, these shoulder conditions were shown to rank similarly to congestive heart failure, diabetes, clinical depression, hypertension, and acute myocardial infarction in terms of patient-perceived disability. Such methodology allows the comparative analysis of disability across disease conditions.

Several general health instruments are used in the evaluation of shoulder outcomes; these are summarized in Table 68.1. The first and most commonly used is the SF-36, which was developed in 1992 by the RAND Corporation and has been validated across multiple health conditions and diverse populations. It has eight domains: physical functioning, social functioning, role limitations due to physical problems, role limitations due to emotional problems, mental health, energy/fatigue, bodily pain, and general health perception. An abbreviated version of the SF-36, the SF-12, has been developed and validated in several conditions, with good correlation to the longer version.

The Veterans RAND 12 (VR-12) score was developed with the same methodology used to create the SF-12 but applied to an ambulatory population of US veterans. Its main advantage is that it is in the public domain and thus freely available for use in research and quality assessment programs.

In an effort to establish efficient, valid, and generalizable PROs, the National Institutes of Health developed the Patient-Reported Outcomes Measurement Information System (PROMIS) in 2004. PROMIS instruments can use CAT, which allows efficient administration while maintaining the power and sensitivity of a large question bank. CAT assists in item reduction by removing redundant questions based on patient responses in real time. For example, a person who answers that he or she cannot lift the arm overhead would not be asked whether he or she could lift a 10-lb weight overhead. PROMIS instruments cover many independent and unique domains; the one most commonly applied in shoulder surgery is the Physical Function CAT (PF CAT). The PF CAT was compared with the ASES and SST in the evaluation of patients with rotator cuff disease. In this study, the PF CAT was found to have the best person reliability of the measures studied. Furthermore, it had far fewer floor effects than the SST (3.2% vs. 21%) and required significantly fewer questions (4.3) than the ASES (11) or the SST (12). Similar findings have been reproduced in a cohort of patients with subacromial impingement syndrome. CAT is a promising technology for making outcomes collection and analysis as simple as possible but no simpler. A few disadvantages do exist, however. The most important is that CAT requires the patient to have access to a computer and the Internet, which, although seemingly ubiquitous, may pose an access problem, especially for older, low-income patients, and may introduce selection bias.
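To make the skip-logic idea concrete, here is a toy adaptive loop. It is emphatically not PROMIS’s actual item-response-theory algorithm; the item names, difficulty values, and binary-search-style update are all invented for illustration.

```python
# Toy adaptive-testing loop: each item has an assumed difficulty, the next
# item asked is the one closest to the current ability estimate, and the
# estimate moves up or down with each answer (like a binary search).
items = {
    "lift arm overhead": 2.0,
    "lift 10-lb weight overhead": 3.0,
    "comb hair": 1.0,
    "sleep on affected side": 1.5,
}

def run_cat(answers, n_questions=3):
    ability, step = 2.0, 1.0
    remaining = dict(items)
    asked = []
    for _ in range(n_questions):
        # choose the unasked item whose difficulty is nearest the estimate
        item = min(remaining, key=lambda k: abs(remaining[k] - ability))
        asked.append(item)
        ability += step if answers(item) else -step
        step /= 2  # narrow the estimate after each answer
        del remaining[item]
    return ability, asked

# A respondent who can perform only the easier tasks
ability, asked = run_cat(lambda item: items[item] <= 1.5)
print(asked)  # → ['lift arm overhead', 'comb hair', 'sleep on affected side']
```

Note that the hardest item (“lift 10-lb weight overhead”) is never asked: once the respondent fails the overhead-lift question, the loop moves to easier items, mirroring the example in the text.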

Shoulder specific outcomes measures

There are many well-studied region-specific shoulder-related outcome scores that have reliability and validity measures reported. The most commonly used scores are summarized in Table 68.2 . Each generally contains a mixture of pain, function, and ROM. There are a few unique aspects of each score that are worthy of mention. The DASH is unique in that it is not shoulder specific but evaluates the entire upper extremity regardless of whether the pathology comes from the shoulder, elbow, or hand. The DASH has demonstrated excellent responsiveness compared with other joint-specific questionnaires and has been validated for use in both proximal and distal upper limb disorders. It has also been shown to correlate well with the SF-36 but with fewer ceiling and floor effects, supporting its use as a valid measure of health status in patients with a wide variety of upper extremity complaints.
