Knee Rating Scales for Clinical Outcome
David C. Ayers, MD
Patricia D. Franklin, MD, MPH, MBA
Matthew E. Deren, MD
KNEE RATING SCALES FOR CLINICAL OUTCOME
The evaluation of orthopedic surgical treatments dates back to Ernest Amory Codman in the early 20th century at Massachusetts General Hospital. Traditionally, measures of success after surgery were based on physical examination and radiographic parameters. Since the 1980s, outcome assessment after orthopedic surgery has focused increasingly on the patient’s perspective. While this evolution toward the incorporation of patient-based measures is appropriate, traditional measures of outcome, including physical examination, imaging studies, and measures of knee laxity, are complementary and should not be viewed as unnecessary.
Knee surgery is generally performed for symptoms and disability. Pain is the most common symptom for which surgery is performed. Disability varies among patients who undergo knee surgery and depends to a large extent on the individual. Disability for an elite athlete may involve inability to perform at their desired level of competition. For an elderly individual with knee arthrosis, disability may involve difficulties with activities of daily living (ADLs) or walking.
The objective of treatment must be taken into account when selecting a measure with which to evaluate an orthopedic procedure or treatment. If an inappropriate outcome is used to evaluate the result of anterior cruciate ligament (ACL) reconstruction or total knee arthroplasty (TKA), incorrect treatment decisions may be made for future patients. It is therefore critical to use measures of clinical outcome that are of importance to the patients who are evaluated, while also being relevant to the surgeon.
This chapter discusses measures of clinical outcome that may be used to evaluate different treatments for patients with disorders of the knee. The measurement properties of reliability, validity, and responsiveness are reviewed. Last, general health status measures, joint- and condition-specific instruments, and measures of activity level are reviewed.
Reliability, Validity, and Responsiveness
A measure of any kind is only useful if it is reproducible (reliable) and accurate (valid). In the assessment of health status, measures must also be able to detect improvement or worsening (termed responsiveness or sensitivity to change). This section is devoted to the concepts of reliability, validity, and responsiveness.
Reliability
An instrument is reliable if it is measuring something in a reproducible fashion.1 Reliability is also known as reproducibility, because repeated administrations of the same questionnaire to stable patients should produce more or less the same results.2
There are two approaches to the measurement of reliability for health status instruments. The first is test-retest reliability, which involves having patients who are in a stable state respond to the questionnaire at two points in time. The interval must not be too short, because subjects may remember their prior responses, nor too long, because clinical change may occur in the interim. In general, a time period ranging from 2 days to 2 weeks is used.
Measures of agreement, such as the intraclass correlation coefficient3 or the limits of agreement statistic,4,5,6 or both, are typically used to compare the scores.7 The intraclass correlation coefficient is an index of concordance for dimensional measurements ranging between 0 and 1, where 0.75 or more is adequate for patients enrolled in a clinical trial.8 This statistic is important to differentiate from measures of correlation, such as the Spearman or Pearson correlation coefficients, which do not measure agreement. These statistics may indicate excellent correlation in situations in which agreement is poor, and, therefore, they should not be used for studies of reliability. For example, if the first measure is twice as high as the second measure for all subjects in a study of reliability, the correlation would be perfect but the agreement would be poor.
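The distinction between correlation and agreement can be illustrated with a short sketch of the doubled-score example. The scores below are hypothetical, and the choice of a two-way random-effects, absolute-agreement, single-measurement form of the intraclass correlation coefficient is an assumption made for illustration, because the text does not specify which form is intended.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient (measures association, not agreement)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_agreement(x, y):
    """Absolute-agreement ICC for two administrations (assumed form:
    two-way random effects, single measurement)."""
    n, k = len(x), 2                      # n patients, k administrations
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    col_means = [mean(x), mean(y)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)           # between-patient mean square
    ms_cols = ss_cols / (k - 1)           # between-administration mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: the first measure is exactly twice the second.
second = [10, 20, 30, 40, 50]
first = [2 * s for s in second]

r = pearson_r(first, second)
icc = icc_agreement(first, second)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f}")
```

With these values the correlation is perfect (1.00), whereas the intraclass correlation coefficient is approximately 0.48, well below the 0.75 threshold cited above.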
The limits of agreement statistic is a descriptive measure of reproducibility. This value is the mean difference between the two tests ±2 standard deviations of the differences.5 Approximately 95% of the differences between the two test administrations are expected to lie within this interval,5 providing the investigator with an estimate of the precision of the measure.
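The calculation can be sketched as follows. The scores are hypothetical, the function name is illustrative, and the multiplier of 2 follows the description in the text.

```python
from statistics import mean, stdev


def limits_of_agreement(test, retest, multiplier=2.0):
    """Return (lower, upper): mean difference +/- multiplier * SD of differences."""
    diffs = [a - b for a, b in zip(test, retest)]
    centre = mean(diffs)
    spread = multiplier * stdev(diffs)    # sample standard deviation
    return centre - spread, centre + spread


# Hypothetical scores from two administrations to four stable patients.
test = [50, 60, 70, 80]
retest = [52, 58, 73, 79]

low, high = limits_of_agreement(test, retest)
print(f"Limits of agreement: {low:.2f} to {high:.2f}")
```

A narrow interval centered near zero suggests that the two administrations agree closely; a wide interval or one shifted away from zero suggests imprecision or a systematic difference.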
Internal consistency is another method for measuring the reliability of rating scales. This concept was borrowed by clinicians from the field of psychometrics. The latter discipline involves the measurement of psychologic phenomena (e.g., depression or anxiety) or educational achievement.9
The concepts evaluated by psychometric scales are difficult to define or may involve learning or both. In these situations, it would not be possible to have the patients complete the questionnaire on two separate occasions, owing to recall or learning effects. The calculation of internal consistency involves a measurement of the intercorrelation of the responses to the questions on a single administration. The statistic generally used to describe internal consistency is termed Cronbach alpha, which ranges from 0 to 1, with 1 indicating perfect reliability.10 Cronbach alpha has been used to evaluate the reliability of rating scales in orthopedic surgery11; however, it is questionable whether the principles of psychometric theory apply to the measurement of symptoms and disability. In practice, orthopedic scales that measure a wide variety of clinical phenomena have also been demonstrated to have high internal consistency.12
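As an illustration of how internal consistency is computed from a single administration, the sketch below uses the standard formula for Cronbach alpha: k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of questions. The responses are hypothetical and the function name is illustrative.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of responses per question, in the same patient order."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # per-patient totals
    item_variance_sum = sum(variance(question) for question in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses: three questions answered by five patients.
question_1 = [1, 2, 3, 4, 5]
question_2 = [2, 2, 3, 5, 5]
question_3 = [1, 3, 3, 4, 4]

alpha = cronbach_alpha([question_1, question_2, question_3])
print(f"Cronbach alpha = {alpha:.2f}")
```

When every question yields identical responses across patients, the formula returns 1, which corresponds to the upper bound described in the text.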
Validity
An instrument is valid if it measures what it is intended to measure. There are several types of validity that are reviewed briefly below.
The simplest way of validating a rating scale is to provide evidence that its results match a gold standard.13 This is known as criterion validity, although it is generally not possible for instruments that measure quality of life. In such situations, we must rely on face validity, content validity, and construct validity.
Face validity is present when an expert clinician reviews the questions in the scale and believes that they appear to measure the concept in question. This form of validity is rather simple; however, it is important nevertheless.
Content validity is a more formal application of face validity. Content validity measures whether the scale includes representative samples of the concept that the investigator is attempting to measure. For example, if a rating scale was measuring quality of life, the content of the scale should include measures of physical, mental, and social health to provide adequate content validity.
Construct validity determines whether the questionnaire behaves in relation to other measures as would be expected. Establishing it requires formulating several hypotheses about how the results of the questionnaire should correlate (positively or negatively) with other related or unrelated measures, and then testing these hypotheses.
Responsiveness
Orthopedic surgeons generally use rating scales to measure improvement in health-related quality of life after treatment. An instrument that is not able to measure improvement in a patient who has been treated successfully would not be useful for clinical research or evaluation. Therefore, the characteristic of responsiveness is critical for the practical application of a rating scale.
There are many statistics that are available to determine responsiveness.14,15 The standardized response mean (observed change/standard deviation of change) is most commonly used in orthopedic research.16,17,18 This statistic incorporates the response variance, allowing statistical testing of the response means.19
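The standardized response mean can be computed directly from paired scores, as in the following sketch. The preoperative and postoperative scores are hypothetical and the function name is illustrative.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """Observed change divided by the standard deviation of change."""
    changes = [post - pre for pre, post in zip(before, after)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for four patients before and after treatment.
before = [40, 50, 60, 70]
after = [60, 65, 85, 80]

srm = standardized_response_mean(before, after)
print(f"Standardized response mean = {srm:.2f}")
```

A larger value indicates that the mean change is large relative to its variability across patients, that is, that the instrument is more responsive in the sample studied.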
Generic and Specific Measures
Specific measures may pertain to a certain pathologic entity (disease-specific), condition (condition-specific), or anatomic location (joint-specific). These measures not only focus on specific aspects of the condition (or anatomic location) but also usually attribute complaints to the disorder (or anatomic location).13,20,21 For example, a joint-specific instrument for the knee may ask patients if they have difficulty dressing because of their knee problem.
Generic tools have a broader perspective, including emotional, social, mental, and physical health, and do not restrict attribution to a particular disorder.13,21 The advantage of generic health status instruments compared with specific instruments is that they allow comparisons across conditions and treatments. The disadvantage of these tools is that they may not be responsive to clinically important change, because a change in an isolated problem may not be reflected in the score of this more global measure.13,21,22,23 The advantage of disease- or joint-specific measures is that they are generally more responsive to change in the specific phenomenon of interest, and they are more relevant to patients.
The most commonly used generic health status instrument is the Short Form 36 (SF-36). It is a 36-item questionnaire that measures general health.24,25,26 Its use has been encouraged in conjunction with knee-specific instruments for studies of ACL-injured patients,27 and it is commonly used in studies of TKA to describe the patients’ overall status.21 A physical component scale (PCS) and a mental component scale (MCS) can be derived from the SF-36, SF-12, VR-12, or PROMIS global. The PCS provides a summary score of the patient’s physical function. The MCS provides a summary score of the patient’s emotional function and accurately measures a patient’s emotional health. The MCS is an excellent screening tool for a patient’s emotional fitness for surgery. For example, a patient who has subclinical depression with trait anxiety disorder will have an MCS score less than 45.
KNEE RATING SCALES FOR ATHLETIC PATIENTS
There are many rating scales available to measure outcome in athletic patients with disorders of the knee. What defines an athletic individual may not always be clear. The activity level of the patient is an important prognostic variable, because active patients place greater demands on their knees than sedentary individuals and have different expectations of the results of treatment. Activity level is not always directly related to symptoms and disabilities and should be measured separately. This topic is discussed at the end of the chapter. A review of eight commonly used rating scales for athletic patients with disorders of the knee is presented.
The modified Lysholm scale28 is an eight-item questionnaire that was originally designed to evaluate patients after knee ligament surgery.29 It is scored on a 100-point scale, with 25 points attributed to knee stability; 25 to pain; 15 to locking; 10 each to swelling and stair climbing; and 5 each to limp, use of a support, and squatting.28 Although this scale was developed without patient input, it has been used extensively for clinical research studies.27,30,31,32 It has been demonstrated to have adequate test-retest reliability and good construct validity.29,33
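The point allocation described above can be expressed as a small scoring sketch. The maximum points per item are taken from the text; the points attached to each individual response option are not given here, so the function below assumes that item points have already been assigned from the published questionnaire. The item keys and function name are illustrative.

```python
# Maximum points per item, as listed in the text (total of 100).
LYSHOLM_MAX_POINTS = {
    "stability": 25,
    "pain": 25,
    "locking": 15,
    "swelling": 10,
    "stair_climbing": 10,
    "limp": 5,
    "support": 5,
    "squatting": 5,
}


def lysholm_total(points):
    """Sum already-assigned item points after checking each against its maximum."""
    if set(points) != set(LYSHOLM_MAX_POINTS):
        raise ValueError("all eight items must be scored")
    for item, value in points.items():
        if not 0 <= value <= LYSHOLM_MAX_POINTS[item]:
            raise ValueError(f"{item} is outside 0 to {LYSHOLM_MAX_POINTS[item]}")
    return sum(points.values())


# Hypothetical patient with full marks on every item except pain and squatting.
example = dict(LYSHOLM_MAX_POINTS, pain=15, squatting=4)
total = lysholm_total(example)
print(f"Lysholm total = {total} / {sum(LYSHOLM_MAX_POINTS.values())}")
```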
The first version of the Cincinnati Knee Rating System was published in 1983 with additional modifications that were developed for occupational activities, athletic activities, symptoms and functional limitations with sports, and daily activities.34,35 There are 11 components in the Cincinnati Knee Rating System. In addition to measuring symptoms and disability, there are sections of this rating system that measure physical examination, laxity of the knee based on instrumented testing, and radiographic evidence of degenerative joint disease.36 This instrument is reliable, valid, and responsive to clinical change.33,36
The American Academy of Orthopaedic Surgeons Sports Knee Rating Scale37 was included in the Musculoskeletal Outcomes Data Evaluation and Management System for athletic patients with disorders of the knee. There are five parts and 23 questions in this instrument: a core section, including stiffness, swelling, pain, and function (seven questions); a locking or catching on activity section (four questions); a giving way on activity section (four questions); a current activity limitations due to the knee section (four questions); and a pain on activity due to the knee section (four questions).
The five subscales are independent and are meant to be reported separately. In addition, this scale offers the response “cannot do for other reasons” for many questions. The scoring manual states that an item should be “dropped” if the patient selects that response, which may be interpreted as “scored as missing.” These factors may lead to practical difficulties when using this questionnaire.33 Despite these concerns, the measurement properties of this instrument were found to be satisfactory when the five subscales were combined and the mean was calculated.33
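One way to handle dropped items and to combine the subscales is sketched below. This is an illustration under stated assumptions, not the published scoring algorithm: item responses are assumed to have already been converted to a common 0 to 100 scale, a dropped item is represented by None and excluded from its subscale mean, and the combined score is taken as the mean of the available subscale scores. The subscale keys and function names are illustrative; the question counts are those given in the text.

```python
from statistics import mean

# Number of questions per subscale, as described in the text (23 in total).
SUBSCALE_SIZES = {
    "core": 7,
    "locking_catching": 4,
    "giving_way": 4,
    "activity_limitation": 4,
    "pain_on_activity": 4,
}


def subscale_score(items):
    """Mean of answered items; None marks a dropped item (scored as missing)."""
    answered = [value for value in items if value is not None]
    return mean(answered) if answered else None


def combined_score(responses):
    """Mean of the available subscale scores (assumed combination rule)."""
    scores = []
    for name, size in SUBSCALE_SIZES.items():
        if len(responses[name]) != size:
            raise ValueError(f"{name} expects {size} items")
        score = subscale_score(responses[name])
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None


# Hypothetical responses; None = "cannot do for other reasons".
responses = {
    "core": [80] * 7,
    "locking_catching": [100, None, 100, 100],
    "giving_way": [60] * 4,
    "activity_limitation": [None, 40, 40, 40],
    "pain_on_activity": [70] * 4,
}

combined = combined_score(responses)
print(f"Combined score = {combined}")
```

Excluding dropped items from the denominator, rather than scoring them as zero, prevents a patient from being penalized for an activity that is avoided for reasons unrelated to the knee.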
The Activities of Daily Living Scale of the Knee Outcome Survey was published with an evaluation of its reliability, validity, and responsiveness.11 It was developed based on a review of relevant instruments and clinician input. This scale is designed for patients with disorders of the knee ranging from ACL injury to arthrosis. It includes 17 multiple-choice questions divided into two sections: one for symptoms (7 questions) and one for functional disability (10 questions). This instrument was found to have slightly higher correlations with the Lysholm, Cincinnati, and American Academy of Orthopaedic Surgeons scales, as well as other measures of disability, indicating excellent construct validity.33 It was also found to be slightly more sensitive to clinical improvement (responsive) than the three other scales in a group of athletic patients.33 The questions that make up this tool are presented in Appendix A.
The single assessment numeric evaluation was devised to evaluate college-aged patients after ACL reconstruction.38 The single assessment numeric evaluation asks the patient how they would rate their knee, from 0 to 100, with 100 being normal. This score was found to correlate well with the Lysholm scale in this patient population.38 The advantage of this single question is its simplicity and the ease with which it can be administered. One potential pitfall is that a single, relatively broad question may be interpreted differently by patients with different disorders and varying levels of symptoms and disability. In the setting of a very homogeneous cohort, such as college-aged patients recovering from a specific procedure (such as ACL reconstruction), the range of pathology is relatively narrow and the instrument correlates well with a standard measure of knee function. The applicability of this tool to patients with a variety of diagnoses is unknown.