Chapter 93 Scoring Systems and Their Validation for the Arthritic Knee
Background and Rationale
Total knee replacement has proved to be a highly effective surgical intervention for improving the health-related quality of life of patients suffering from knee arthritis. Postoperatively, patients have reported reduced pain, restored range of motion, high satisfaction, and the ability to return to a more active lifestyle.12,31,34,61,62 Moreover, detailed economic analysis has confirmed the favorable cost-effectiveness of the procedure.52 It is projected that the demand for total knee replacement in the United States will approach 3.5 million cases per year by 2030, an increase of 673% over current usage.47 Although the aging population is contributing to this increase, the obesity epidemic and the resulting prevalence of osteoarthritis are driving younger patients (i.e., younger than age 65) to seek knee replacement at astounding rates.23 In fact, Kurtz and associates have projected that, while younger patients make up half of all revised knee replacements, by 2016, they will also constitute the majority (i.e., >50%) of the demand for revision by 2011.48
As increasing numbers of patients elect to have their knee replaced, the need for revision surgery will also increase. As a result, measures to control the risk of complications and the resulting financial burden have to become top priorities in the formulation of health care policy.47,80 Today, the effectiveness of total knee replacement and other costly surgical interventions is being subjected to increased scrutiny. In this environment, it is of paramount concern to patients, health care providers, device manufacturers, and the community alike that the health-related benefit and cost-effectiveness of total knee replacement be demonstrated in an objective and scientifically valid manner.66 Quantification of the success of these procedures can be achieved in a number of ways. The simplest measure of the clinical success of knee arthroplasty is crude survivorship, defined as the cumulative number of initial procedures that have not been revised over a set period of time postoperatively, typically 5 or 10 years. However, to establish any insight into surgical outcomes, especially as they relate to subjective evaluations, more comprehensive assessment methods are required. The outcomes movement, a collection of efforts by investigators to address this concern, has spurred the development of several assessment systems that measure patient health status.11
Certainly, one can gather patient information and present the attributes of successful surgeries as a series of case reports. However, any conclusions drawn from such an exercise would lack broad applicability and would not allow valid comparisons between variations in patients and treatments. For continued improvements to be made in clinical decision making, implant design, and patient counseling, assessment tools are needed that apply uniformly to the appropriate patient group. In attempting to measure the subjective and objective results of the procedure in a clear and understandable manner, physicians and researchers have introduced numerous assessment instrument systems. Most often, these systems consist of questionnaires, surveys, or interviews that the patient and/or surgeon can complete during clinic visits or research studies, along with some data derived from the clinical examination. The purpose of any knee arthritis assessment system is to provide an objective evaluation of the patient’s condition, both before and after surgery, to measure the level of benefit to the patient achieved by the surgery.17
Traditional methods of evaluating different treatments have relied on objective endpoints, with little relevance to how the patient was affected by the treatment itself.11 Wright posited that a fundamental task for clinicians when evaluating patients or interpreting the results of clinical trials is to decide whether a particular outcome assessment relates to an important improvement in patients’ health. Because total knee replacement is generally an elective procedure, the patients’ goals and expectations need to be clearly determined if the outcome is to be accurately appraised.81 Gartland has called for well-designed, valid outcome assessment instruments to measure the effects of treatments on those who were treated, as well as the quality of their lives after the treatment is received.28 Rather than focusing on improvements in objective measures of joint function (such as range of motion and stability) or rates of postoperative complications, including revision, contemporary outcome assessment instruments aim to identify and quantify outcomes in terms of patients’ priorities, including pain relief, physical function, and long-term satisfaction.31 This emphasis on patients’ perspectives has led to the development of several patient-based outcome instruments that can elucidate factors affecting decisions about arthroplasty, as well as those that influence the presence and extent of benefit to the patient.
Although no instrument has been universally accepted for evaluating the outcomes of patients after total knee replacement, it is generally agreed that a quantifiable rating (usually a scoring system, e.g., 0 to 100 points) provides the most concise, clear measure, especially for comparing large groups of patients and treatments. Scoring systems are generally categorized as overall generic health measures (i.e., global health) or specific health measures (i.e., disease-specific), and each system may contain one or more domains that address a particular facet of health (e.g., functional capabilities, pain, mental health). Numerous outcome instruments have been developed, each with its own specific focus.4,11,17,19 The process used to determine each instrument’s composition, scoring system, and reporting method has been the source of much debate. Variations between these rating systems make it difficult for them to achieve their purpose, that is, to compare patients’ outcomes and assess the merits of prostheses and treatments across studies.42 Despite methodologic differences, scoring systems that undergo formal validation against standard psychometric principles and item response theory become generally accepted and can be compared across studies.
Scoring System Structure
The effects of specific treatments on the patient’s health consist of numerous factors, many of which are subjective perceptions. As a result, investigators have had difficulty in reaching a consensus as to how outcomes of treatment should be measured. Given the existence of a multitude of instruments that use an assortment of methods for patient evaluation, it can be difficult to compare treatment outcomes directly, or to draw appropriate or practical conclusions from data collected using different outcome instruments.19,42 In the case of total knee replacement, the ability to compare outcomes across different studies is imperative for the following:
As early as 1975, Kettelkamp and Thompson recognized the potential for discord and postulated that a uniform rating system should fulfill the following requirements41:
Regardless of the type of instrument used, the responses elicited will be influenced by patients themselves (i.e., their personality, educational and cultural background, health history, etc.), as well as by the design and content of the instrument itself.33 Therefore, an understanding of the structure and scoring method adopted by an outcome assessment instrument is critical for meaningful interpretation of its results.
Item Development
In practice, relief of symptoms, restoration of function, promotion of satisfaction, and a feeling of well-being are among the most important outcomes of total knee arthroplasty and are universally recognized as central concerns of patients and doctors alike.22 Consequently, any instrument that attempts to measure the outcome of treatment of the arthritic knee must assess these multidimensional facets of each patient’s experience.75 Patient outcome measures, especially self-assessed measures, are all subject to patients’ attitudes, abilities, expectations, and motivations to undergo surgery.75 Whether the assessment instrument is designed to evaluate the patient’s general health or only the physiologic function of the knee in isolation, optimally designed instruments clearly define these health dimensions and measure, with full coverage, the underlying trait or condition of interest.14
Designers of outcome instruments start by identifying the intended scope of their instrument and what it intends to measure. Those interested in the overall impact of an arthritic knee or a knee replacement on a patient’s health consider different outcome variables than those interested in the knee in isolation from the patient’s other health issues. In general, instruments’ scoring systems contain items that assess pain, function, range of motion, and satisfaction, among other components.12,13,19 Certain outcome instruments assess each of these components of patient health in different ways. For example, assessments interested in patient function may include any one or more of the following traits: the ability to walk, dependence on supports or walking aids, the presence of a limp, the ability to sit for an extended period, the ability to rise from a chair, the ability to ascend or descend stairs, and the ability to run.19
However, no causal relationship is evident between these variables. In other words, symptoms could limit physical activity, or abstaining from activity may exacerbate persistence of symptoms. Further, the true relationship between these components and a patient’s satisfaction or overall assessment is variable. Certain patients may have very different, even dynamic, expectations of their knee function and may change their personal definition of a satisfactory outcome. All interrelated health factors contribute to patient priorities, their internal processes of assessment of the need for surgery, and the value of the level of restored function and symptom relief that the procedure provides.40 Therefore, the items in an assessment instrument must distinguish between each of these contributing factors in striving to characterize patient outcome.22,58,75,81
Designers of assessment instruments must establish clear definitions and parameters for each variable they will use to measure outcome. This step is critical, yet it is subject to much debate. For instance, instruments commonly contain items asking patients about any persistent pain (e.g., “Have you had any sudden, severe pain—‘shooting,’ ‘stabbing,’ or ‘spasms’—from the affected joint”),18 but only a few consider whether patients mitigate their symptoms through the use of analgesics.17 Knee assessment instruments often focus on function as an outcome as well, but some inquire about the patient’s capabilities or limitations (“What degree of difficulty do you have bending to the floor?”),5,6 while others ask about the patient’s actual participation habits (“How often do you participate in squatting?”).9,63,84 Because patient-based assessment instruments are subject to interpretation and context, it is important that subtle differences are accounted for by the use of accurately worded items.65
It is also important that the items in an outcome instrument can reflect the clinical outcomes of all patients within its intended scope. With the demand for total knee replacements expanding, especially among young and active patients, and with technologies and techniques in constant development, a growing spectrum of patients will be reporting a myriad of conditions and outcomes.47,48,80 Items must be designed for all age groups, diagnoses, and treatments within their scope. Broad applicability allows differentiation to be made, even at extremes of patient distribution, such as those who are very young or very old; very active or very sedentary; or very healthy or burdened with complications or comorbidities.17,32,43,50 This means that investigations that quantify outcomes in terms of ability to perform physical activities should include items that distinguish among those who are limited and those who are relatively active.37,63,79 For example, in 1998 Amstutz and colleagues introduced the UCLA activity level rating, which distinguishes between patients’ activity levels as follows84:
Regularly participate in impact sports
Sometimes participate in impact sports
Regularly participate in very active events
Sometimes participate in mild activities
Mostly inactive: restricted to minimal activities of daily living
Wholly inactive: dependent on others, cannot leave residence
From an array of patients with a broad spectrum of conditions, instruments aim to identify accurately and distinctly those who would benefit from treatment, as well as those who would be better served by an alternative approach.12,17,46,65
It is difficult for one item to measure completely a patient’s true condition or trait in isolation from the patient’s other related conditions. Each of the items in an instrument is designed to assess a separate underlying trait. However, although they may appear distinct, multiple items have a high probability of overlapping one another because patient traits are often interdependent. Most outcome instruments are based on questionnaires with multiple items that assess manifestations of the same symptom (e.g., pain) and are not completely independent; therefore, they have a significant statistical association with each other.22 This characteristic of patients’ conditions is referred to statistically as covariance. Thus, when an attempt is made to measure covariant traits, the risk of sampling the same underlying trait repeatedly through questionnaire items that are not entirely independent is high.65 This redundancy in questions, whether overt or subtle, can distort the measurement of the patient’s true condition and lead to spurious conclusions, such as overrepresentation of an underlying dimension of outcome. All of these considerations speak to the importance of ensuring that individual items measure only the condition of interest, without simultaneously measuring other conditions assessed by items elsewhere in the instrument.65,75
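One simple way to screen for the redundancy described above is to inspect pairwise correlations between item responses. The following is a minimal sketch under stated assumptions: the item names, the response data, and the 0.9 cutoff are all hypothetical and are not taken from any published instrument.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of responses."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 0-4 responses from six patients to three items.
responses = {
    "pain_walking": [0, 1, 2, 3, 4, 4],
    "pain_stairs": [0, 1, 2, 3, 4, 3],
    "knee_swelling": [2, 0, 3, 1, 4, 0],
}


def redundant_pairs(items, cutoff=0.9):
    """Return item pairs whose correlation exceeds the assumed cutoff."""
    flagged = []
    for a, b in combinations(items, 2):
        r = pearson(items[a], items[b])
        if r > cutoff:
            flagged.append((a, b, round(r, 2)))
    return flagged


print(redundant_pairs(responses))
```

In this made-up data set only the two pain items are flagged, which suggests that they may be sampling the same underlying trait. A high correlation alone does not prove redundancy; it only marks a pair for closer review.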
Scoring Methods
To achieve their purpose, outcome measures quantify or summarize patients’ responses into a score or classification. Although most instruments adopt a systematic approach to scoring a patient’s response, many features of their scoring systems were established arbitrarily and can vary widely from one another.19 Two factors contribute to variations in scoring methods: (1) the response format, and (2) the scoring allocation or classification system.
In most outcome instruments, test items may have nominal, partial-credit, graded, or other polytomous (more than two response categories) response formats when assessing multidimensional abilities.75 Current instruments contain items whose responses are scaled in many different ways, sometimes with too many response categories, other times with too few.14 When a question is asked, such as, “Do you have swelling in your knee?” some instruments offer responses in a Likert scale format, with ordinal answer choices corresponding to increments in symptom frequency or severity (e.g., “never/rarely/sometimes/often/always”).68,69 Other systems use a dichotomous (“yes/no”) response format for the same question. Items with too few response categories may not differentiate between patients with clinical differences, whereas those with too many response categories may introduce unnecessary error caused by variations in patient definitions of the gradations corresponding to the ordinal scale. Both of these issues can confound data and mask a patient’s true condition.82 Moreover, most Likert scale data are condensed into fewer categories in subsequent analysis, depending on whether clinicians view subjective assessment of symptoms in dichotomous (e.g., present/absent) or possibly trichotomous (e.g., absent/mild/intense) terms.
This raises the following question: “Is there value in the use of multiple-level options if responses are subsequently analyzed and discussed in a condensed format without weighting for frequency or severity within each category?” The appropriateness of alternative response formats can be assessed through statistical testing. For instance, Bach and coworkers used interobserver correlation analysis to conclude that pain is most reliably assessed using a simple four-point scale (“no pain/mild or occasional pain/moderate pain restricting activity/severe pain disturbing rest”), as in the Bristol Score or the Hospital for Special Surgery Score, as opposed to a more complex six-point or seven-point scale.4,35–38,54,60 Because the responses to each item contribute to a patient’s score, the item’s response format can have a significant impact on distinguishing patient outcomes.
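The condensation of Likert responses into fewer categories can be sketched as follows. The groupings are assumptions chosen for illustration; a real study would need to justify its own cut points.

```python
# Five-level frequency scale quoted in the text.
LIKERT = ["never", "rarely", "sometimes", "often", "always"]

# Hypothetical groupings, for illustration only.
DICHOTOMOUS = {
    "never": "absent",
    "rarely": "present",
    "sometimes": "present",
    "often": "present",
    "always": "present",
}
TRICHOTOMOUS = {
    "never": "absent",
    "rarely": "mild",
    "sometimes": "mild",
    "often": "intense",
    "always": "intense",
}


def condense(responses, mapping):
    """Map each Likert response onto a coarser category."""
    return [mapping[r] for r in responses]


answers = ["never", "rarely", "often", "always", "sometimes"]
print(condense(answers, DICHOTOMOUS))
print(condense(answers, TRICHOTOMOUS))
```

Under the dichotomous mapping, a patient answering "rarely" and one answering "always" become indistinguishable, which is the loss of information the question above is concerned with.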
Despite the proliferation of many different outcome scores in the literature, a movement toward standardization of scoring systems is under way. This has occurred because the scores allocated by different systems have proved difficult to compare, causing uncertainty regarding how outcome scores truly relate to patients’ clinical outcomes. If a scoring system uses a scale from 0 to 100 points, as many do, those points are allocated, or awarded, on the basis of responses to items throughout the instrument. This point allocation has been shown to vary widely by instrument.19 In a review of 34 different rating systems used to evaluate patients after total knee replacement, Drake and associates reported that the total points awarded for a perfect outcome ranged from 10 to 110, with scoring based on summation of points, deductions from a baseline score, or a combination thereof.19 The contribution (or weight) of certain components to the total score was notably diverse. For example, the contribution of pain symptoms to the total score could range from 7% to 69%, while the range of motion of the joint could be assigned between 4% and 30% of the total score. Diverse weightings for patients’ ability to perform functional tasks were also reported, with items such as “ability to climb stairs” accounting for 4% to 50% of the total outcome score.19 With the advent of instruments focused on patient satisfaction and expectations, investigators can potentially determine the primary subjective factors affecting patients’ postoperative satisfaction and their motivations for undergoing surgery. As investigators elucidate these and other factors affecting outcomes, guidelines will be developed to allow more consistent and appropriate weighting of each component of outcome scores.
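To make the idea of component weighting concrete, the sketch below computes each component's share of the maximum total score for two rating systems. Both systems and their point allocations are invented for illustration and do not correspond to any instrument reviewed in the text.

```python
def component_weights(max_points):
    """Return each component's percentage share of the maximum total score."""
    total = sum(max_points.values())
    return {name: round(100 * pts / total, 1) for name, pts in max_points.items()}


# Hypothetical maximum points per component in two rating systems.
system_a = {"pain": 50, "function": 25, "range_of_motion": 15, "stability": 10}
system_b = {"pain": 30, "function": 50, "range_of_motion": 20, "stability": 10}

print(component_weights(system_a))
print(component_weights(system_b))
```

Here pain carries half of the total in the first system but little more than a quarter in the second, so equal totals from the two systems would not reflect the same clinical picture.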
In addition to component weighting, an understanding of a system’s point value allocation to specific responses is critical. In other words, two different systems can assign “pain” a maximum of 50 of the scale’s 100 total points, but a response of pain “once a month” may earn 40 points in one scale and 50 points in another. Moreover, the point scores necessary to achieve a particular categorical designation (e.g., 90 to 100 points is “Excellent,” 80 to 90 points is “Good,” etc.) are somewhat arbitrary and inconsistent across studies.19 Component point values may be the same in different scoring systems, but interpretation of responses may vary.19 In one system, an “excellent” or “acceptable” outcome may correspond to a score of at least 90 out of 100, but it may correspond to at least 80 out of 100 in another.1,19 Therefore, it is important to consider the weighting, allocation, and categorization method of the scoring system when interpreting a patient’s score.
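The effect of differing category thresholds can be illustrated with a short sketch. The two sets of cutoffs below are assumptions made for the example, not the thresholds of any specific published system.

```python
def categorize(score, cutoffs):
    """Return the label of the first cutoff the score meets.

    cutoffs is a list of (minimum score, label) pairs, highest first.
    """
    for minimum, label in cutoffs:
        if score >= minimum:
            return label
    return "Poor"


# Hypothetical categorization schemes for two 100-point systems.
system_x = [(90, "Excellent"), (80, "Good"), (70, "Fair")]
system_y = [(80, "Excellent"), (70, "Good"), (60, "Fair")]

score = 85
print(categorize(score, system_x), categorize(score, system_y))
```

The same score of 85 is reported as "Good" under the first scheme and "Excellent" under the second, which is why the categorization method has to be stated alongside the score.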
Because patient outcomes are multidimensional, consideration of the outcome total score in and of itself does not necessarily shed light on the patient’s true condition. By analyzing a patient’s raw total outcome score (e.g., 70 out of 100), an investigator gains no insight into which conditions are primarily affecting the patient’s outcome. With this limitation, a raw total outcome score cannot readily describe or relate to a clinically relevant outcome. Jones and colleagues affirmed that a simple summative score may dilute the important effects of confounding conditions (e.g., other symptomatic joints, psychiatric disorder), masking the true effects of other conditions on functional recovery.40 Therefore, several investigators have emphasized the need for component subscores in addition to, or in place of, a summed raw total score.* Separation of health outcome components, sometimes called a dual rating system, can eliminate the falsely inflated or deflated scores associated with covariables in assessment systems that aggregate different parameters into a global score.37,42 Components of health, such as function, pain, and mental health, can be separated so that their component scores can be considered independently and as an aggregate composite score. Based on trends and comparisons of component scores, one can identify patient conditions that benefit most from treatment or even those that tend to predict success. With application of a distinct component scoring system, relating scores to clinically relevant outcomes becomes more feasible and appropriate.
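The limitation of a raw total can be shown with two hypothetical patients who both score 70 out of 100. The two-component structure (pain and function, 50 points each) is an assumption made for this sketch.

```python
# Hypothetical component scores, each out of 50 points.
patient_a = {"pain": 45, "function": 25}
patient_b = {"pain": 25, "function": 45}


def total_score(subscores):
    """Aggregate composite score across all components."""
    return sum(subscores.values())


def weakest_component(subscores):
    """Name of the component with the lowest subscore."""
    return min(subscores, key=subscores.get)


for name, patient in (("A", patient_a), ("B", patient_b)):
    print(name, total_score(patient), weakest_component(patient))
```

Both patients have the same total, yet one is limited mainly by function and the other by pain. Only the component subscores reveal the difference.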
Statistical Requirements
Psychometric Principles
Outcome instruments are most useful as a means to compare large sets of data (sets of patient subgroups, surgical techniques, component designs, rehabilitation protocols, etc.), so they must adhere to statistical principles that ensure broad applicability and comparison across groups. McDowell stated that a well-constructed and acceptable rating system uses “statistically correct procedures to refine an instrument whose content is based on clinical wisdom and common sense.”58 The conventional method of statistical testing, generally referred to as classical testing theory (CTT), states that valid methods of outcome assessment must adhere to three psychometric properties: validity, reliability, and responsiveness.*
The ability of an instrument to accurately measure the health of a patient or the effectiveness of a total knee replacement is termed the “validity” of the instrument. In other words, a valid instrument is one that measures what was intended, yielding results corresponding to the true state of the trait being measured.22,58
Many established assessment instruments consist of questionnaires designed to measure traits such as symptoms or pain, which are grouped together to form scales. Scaling systems allow subjective reports of health to be quantified and analyzed. Three general forms of validity—content validity, criterion validity, and construct validity—are tested in assessing whether or not a scale is truly valid.22,58 These forms of validity are defined as follows:
In the case of instruments assessing patient outcomes after treatment of knee arthritis, no instrument has been universally accepted. Therefore, constructs are sometimes validated through comparison against other questionnaires or against surgeons’ conceptual definitions of patient health. This circuitous logic is problematic and should be considered when construct validity is examined.21
Reliability, or reproducibility, is the extent to which repeated measurements of a stable phenomenon produce similar results across time, patients, and observers. According to classical testing theory, every assessment instrument has an inherent error because of its design, wording, response formats, and other contextual factors that influence individual responses.33,58 Consequently, as individuals complete outcome instruments, the responses elicited, and any scores derived from them, are a combined function of the patient’s true responses and this inherent measurement error. The proportion of variation in observed scores that is attributable to the true outcome is defined as the reliability of an instrument.58 In practice, the reliability of an outcome instrument can be measured through its test-retest reproducibility, but it is most frequently estimated using an indicator of internal consistency called Cronbach’s alpha. Cronbach’s alpha reflects the average intercorrelation among the items of an instrument, adjusted for the number of items. It is used when the instrument’s items have more than two response options (e.g., a five-point Likert scale), and it indicates the degree to which a set of items measures an underlying latent trait.58
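Cronbach's alpha is commonly computed from the item variances and the variance of the summed score: alpha = k/(k - 1) × (1 - sum of item variances / variance of total score), where k is the number of items. The sketch below applies that standard formula to invented responses; the data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of per-item response lists.

    Each inner list holds one item's responses, in the same patient order.
    """
    k = len(items)
    totals = [sum(patient) for patient in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical five-point Likert responses: three items, five patients.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]

print(round(cronbach_alpha(items), 3))
```

Values near 1 indicate that the items vary together. A very high alpha can also signal redundant items, so it is usually read alongside the item content rather than maximized for its own sake.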
Reliability and validity are not altogether independent. Reliability is a necessary but insufficient quality of valid instruments; an unreliable measurement cannot be valid, and a valid instrument must be reliable.22,58 When a measurement is reliable, any change noted in a patient’s scores can be attributed confidently to a true change in the clinical status of the patient. Because the most direct measure of patient health outcomes after total knee replacement often includes subjective measurements (e.g., pain, function, satisfaction), a measurement tool that minimizes sources of variability and bias, while maintaining reliability, can instill confidence in those depending on its measurements.82
Responsiveness is a crucial component of outcome measures in distinguishing those patients who benefit from a procedure from those who do not. A more responsive test is more sensitive to subtle changes in patient health status.46 Thus, highly responsive scales allow clinical trials to be performed using fewer patients.83 Several methods of determining responsiveness have been introduced. Generally, the change in mean raw scores of a given sample of patient responses divided by the standard deviation of the sample will yield a measure of responsiveness called the standard effect size.58 Alternatively, the change in raw score may be normalized with respect to its standard error (t-test approach) or to the standard deviation of the changes in scores of all respondents (standardized response mean).58 Although no clear consensus has been reached as to how responsiveness should be demonstrated, it is generally agreed that a responsive instrument will be sensitive enough to detect changes in health or physical functioning that are revealed by existing established measures of the same or similar traits. Accordingly, the responsiveness of new knee function scales can be assessed by comparing the distribution of scores derived from the new instrument against those obtained with a basket of conventional instruments (e.g., Knee injury and Osteoarthritis Outcome Score [KOOS], Short Form-36 Health Status Questionnaire [SF-36]).22,58
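The three responsiveness indices described above can be sketched as follows. The preoperative and postoperative scores are invented, and the effect size here divides by the standard deviation of the baseline scores, which is one common convention and an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def responsiveness(pre, post):
    """Return three common responsiveness indices for paired scores."""
    change = [after - before for before, after in zip(pre, post)]
    mean_change = mean(change)
    sd_change = stdev(change)
    return {
        # Mean change over the SD of baseline scores (assumed convention).
        "effect_size": mean_change / stdev(pre),
        # Mean change over the SD of the individual changes.
        "standardized_response_mean": mean_change / sd_change,
        # Mean change over its standard error (paired t statistic).
        "paired_t": mean_change / (sd_change / sqrt(len(change))),
    }


# Hypothetical 0-100 knee scores for five patients.
pre = [40, 50, 45, 55, 60]
post = [70, 75, 80, 85, 90]

for name, value in responsiveness(pre, post).items():
    print(name, round(value, 2))
```

Because the three indices use different denominators, they can rank instruments differently. Reports of responsiveness should therefore state which index was used.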
For an outcome measure to be effective, it must accurately and reliably quantify a set of data that often consists of subjective and objective measurements describing the health of a variety of patients, spanning a broad range of conditions and perspectives. To date, a number of established methods have been proposed to assess patient-related quality of life and procedure cost-effectiveness, yet few have proved valid and reliable.4,19 When an instrument’s outcome measures have been demonstrated to be valid and reliable, its responsiveness can be a factor in deciding which instrument should be selected to measure the outcome variables of interest.58
Item Response Theory
The multidimensional character of patients’ perspectives influences the score of the instrument by introducing error and variability in responses. Classical statistical testing theory (CTT), including psychometric testing, although valuable and essential, has its limitations. To determine validity and responsiveness, CTT uses estimates based on correlational data, most often maximizing Cronbach’s alpha. Resulting analyses of response data using CTT may reflect only a small part of the underlying condition and may depend on the size and nature of the sample of patients and the set of items available to elicit patient responses.65 CTT assumes that systematic differences between responses of patients are due only to variation in patients’ underlying conditions of interest; other sources of variation are ignored and are assumed to be constant or random by nature.75 CTT is also limited by the fact that patients’ abilities and item difficulties cannot be estimated separately. CTT yields only a single reliability estimate and corresponding standard error of measurement, despite the fact that precision of measurement is known to vary according to patients’ ability levels.33 In other words, CTT methods do not capture the notion that different types of patients (e.g., male vs. female, active vs. sedentary) exhibit different response patterns; this could affect the estimate of instrument reliability.
Recent development of a modern statistical analysis method, called item response theory (IRT), has stemmed from recognition of these limitations. IRT addresses the need for deeper insights into patients’ response processes and into the interaction between each item of an assessment instrument and the abilities of the respondent.75 Using specified mathematical models, IRT can describe the association between a respondent’s underlying condition and the probability of a particular item response.33,75 IRT models are designed to cope with natural experimental error, noise, and other confounding variables that are unavoidable under less rigorous standardization of assessment, such as self-administered surveys.75 Application of IRT to outcome instruments has begun to enhance investigators’ ability to evaluate, modify, compare, and score existing instruments that reveal more useful health outcome information. Investigators have even used IRT methods to create and validate outcome instruments. One such instrument was administered by Parsley and colleagues to total knee replacement patients who received prostheses of two different designs. Patients receiving a posterior cruciate ligament (PCL) sacrificing design reported less ability to perform specific functional activities than those receiving PCL retaining designs.15 These results were unexpected because in the same patient groups, conventional instruments developed using CTT testing regimens showed no difference in clinical outcomes with the two prosthesis designs.15 This shows that instruments developed using IRT may facilitate the identification of clinically significant factors that would otherwise go unnoticed. Going forward, IRT can be used to develop briefer, more flexible, more efficient, and more precise instruments than could be constructed using classical approaches.14,33,43
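One of the simplest IRT models is the dichotomous Rasch model, in which the probability of endorsing an item depends only on the difference between the respondent's trait level and the item's difficulty. The sketch below evaluates that standard formula; the item names and difficulty values are hypothetical.

```python
from math import exp


def rasch_probability(theta, difficulty):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


# Hypothetical item difficulties on the logit scale.
items = {"walk one block": -1.5, "climb stairs": 0.0, "run": 2.0}

for theta in (-1.0, 0.0, 1.0):
    row = {name: round(rasch_probability(theta, b), 2) for name, b in items.items()}
    print(theta, row)
```

A respondent whose trait level equals the item difficulty has a probability of 0.5. Easier items are endorsed with higher probability at every trait level, which is how the model separates patient ability from item difficulty.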
Although the content of the questions in an outcome measure is certainly important, IRT models are explicitly directed at the rating scale.14 IRT models (e.g., Rasch, generalized partial-credit models) describe how difficult an activity is and the degree to which the patient’s response varies as his or her condition changes. With sets of patient response data, these models are also used to conduct an item fit analysis. An item fit analysis measures the level of agreement between observed and model-predicted probabilities of selecting each response option. IRT models can also support an analysis of differential item functioning, which identifies inefficient, inaccurate, or inappropriate items whose responses differ systematically across groups of respondents.43 For example, in an IRT analysis of common assessment items, responses to the item, “need help with grooming,” differ significantly between males and females. Responses for “difficulty getting in and out of bed” are shown to differ significantly between patient age groups.43 These items, although they seem to have face validity, may be inappropriate for a standard instrument applied to a diverse set of total knee patients. After an instrument has been tested using IRT models, its items should measure a single concept, should fit the chosen IRT model, and should not function differently across groups.43
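A rough illustration of the item fit idea is to compare the observed endorsement rate for an item with the mean probability predicted by the model, separately for each group. The sketch below does this for a dichotomous Rasch item. It is a crude screen, not a formal fit or differential item functioning statistic, and the trait estimates, responses, and item difficulty are all hypothetical.

```python
from math import exp
from statistics import mean


def rasch_probability(theta, difficulty):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


def fit_residual(thetas, responses, difficulty):
    """Observed endorsement rate minus the mean model-predicted probability."""
    predicted = mean(rasch_probability(t, difficulty) for t in thetas)
    observed = mean(responses)
    return observed - predicted


# Hypothetical data: both groups share the same trait estimates,
# but their responses to the item (1 = endorsed) differ.
thetas = [-1.0, 0.0, 1.0, 2.0]
group_a = [0, 1, 1, 1]
group_b = [0, 0, 0, 1]
difficulty = 0.0  # assumed item difficulty

print(round(fit_residual(thetas, group_a, difficulty), 3))
print(round(fit_residual(thetas, group_b, difficulty), 3))
```

Residuals of opposite sign in groups with matched trait levels suggest that the item may behave differently across those groups and deserves a formal analysis.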
Compared with CTT estimates, IRT models may better depict patients’ actual response patterns, and IRT estimates may more accurately reflect patients’ true conditions.14,32 Thus, use of IRT for large sets of patient responses should lead to an outcome assessment that is more sensitive to true cross-sectional differences and more responsive to changes in health over time.14,32 These models use measurement units with interval measurement properties that remove bias at the extreme ends of a patient condition. Therefore, they can be used to discriminate among patients who are very active or very disabled.14,43 This capability of discriminating between levels of an underlying trait enhances the usefulness of IRT models in relating the results to clinically relevant outcomes, especially within single diagnostic groups.14
Compared with conventional statistical analyses, IRT is more difficult to perform and understand, although its use results in discriminative, informative, nonredundant items that adhere to psychometric principles. Therefore, IRT could best be used in conjunction with CTT.65 For future development of outcome measures using IRT, it is essential that collaborative efforts between statisticians and clinical investigators be successful, leading to a consensus regarding acceptable standards for use and reporting of outcome data.32