CHAPTER 4 Health outcome assessment
Introduction
Evaluation of the effect of treatment is an essential component of clinical practice. Such evaluations, or outcome assessments, range from simple measures like death (i.e. the patient is dead or alive) to more complex assessments relating to a broader, more holistic view of a person’s life that includes physical, psychological and social issues. Assessments such as these allow clinicians to monitor the progress of a patient in response to advice or treatment. They also provide important objective information for the patient about whether their condition is improving or worsening. Equally important, however, is the role of outcome assessment in research evaluating health interventions (Spilker 1991), including clinical trials assessing treatments for musculoskeletal disorders (Keller et al 1993). Accordingly, health outcome assessment spans both clinical practice and clinical research, and is therefore an issue that both clinicians and researchers need to understand clearly.
Outcome measurement
Surrogate outcome measurement
Until relatively recently, clinician-based or laboratory measures (e.g. foot alignment or plantar pressure), referred to as surrogate outcomes (Guyatt & Rennie 2002), were often used to evaluate outcomes of treatment. Unfortunately, patients do not generally consider these surrogate measures important for understanding their health. Surrogate outcomes place little or no emphasis on an appraisal from the person being treated (Jenkinson & McGee 1998), whose only concern might be that they want their pain to reduce. Consequently, decisions regarding the effect of an intervention based on surrogate outcomes rely on the health professional’s judgement, rather than on the patient’s appraisal of their own health (Friedman et al 1998, Staquet et al 1999). Reasons for using surrogate outcomes include:
• they are usually easier to measure than a broader evaluation of an individual’s health
• they often have the effect of decreasing sample size requirements in clinical trials.
To illustrate this issue further, a common example of a surrogate outcome from medicine is the measurement of tumour size as the primary outcome of success of a cancer treatment (Gebski et al 2002). Tumour size and progression of the disease are, of course, likely to be highly correlated. However, a decrease in tumour size does not necessarily correspond to a good therapeutic outcome from the patient’s perspective. Indeed, adverse effects of treatment can have a negative impact on the individual’s quality of life, in both the short and long term. This clinician-based focus is a historical problem in medicine and medical research, where patient input had customarily been considered too subjective.
Due to the limitations of surrogate outcome measures, which tend to focus on disease, health outcome assessment has developed rapidly over the past two decades. Not surprisingly, health outcome assessment focuses more on what is important to the patient. The World Health Organization defines ‘health’ as ‘not merely the absence of disease or infirmity, but a state of complete physical, mental, and social well-being’ (World Health Organization 1958). Clearly, this definition of health is more holistic and consumer-centred than clinician-based measures (Bowling 2005). However, in their early development, health outcome measures were created by clinicians, incorporating issues that they thought were important to patients. Whether these clinician-generated health outcome measures were, indeed, important to patients is a matter of considerable conjecture. In the light of this, health status measurement has developed to include a broader account of ‘health’, which encompasses the patient’s (or consumer’s) perspective of their health (Jenkinson & McGee 1998). Put more simply, health status (and health-related quality of life) measurement attempts to evaluate what is important to patients, ideally from the patient’s perspective.
Health status and quality-of-life measurement
With the aforementioned concerns about surrogate outcomes in mind, health status measurement has taken on increasing importance. Health status incorporates the notion of health-related quality of life, which is an important, and some would argue fundamental, measure of the impact of a disease or abnormality on an individual (Testa & Simonson 1996). Health-related quality of life characterises and measures what consumers experience as a result of healthcare (Staquet et al 1999). Emphasis is shifted away from the disease process and towards the broader appraisal of an individual’s health by that individual (Muldoon et al 1998). Health-related quality of life is usually measured by self-report (i.e. self-reported questionnaire), although in paediatrics it can be acceptable to use a proxy (e.g. a parent/guardian). Health-related quality of life includes such areas as somatic sensation (e.g. pain), physical function, cognitive function and psychological well-being, social function and life satisfaction (Jenkinson & McGee 1998, Bowling 2005). A summary of health outcome terms used thus far in this chapter can be found in Table 4.1.
Term | Explanation |
---|---|
Health outcome | A measurement that evaluates the health of an individual, in particular as a result of an intervention (Kane 1997). Can be a laboratory or clinician-generated measure, or one that is a subjective measure from the patient |
Surrogate outcome | A proxy measurement that is generally known to be associated with a construct of interest (Herbert et al 2005). Tends to reflect ‘disease’, rather than ‘ill-health’ and its consequences (Bowling 2005). For example, a simple clinical or laboratory test may be used rather than a detailed assessment of subjective issues that are important to patients (e.g. health-related quality of life) (Guyatt & Rennie 2002) |
Health status | A broad health construct that encompasses an individual’s subjective perception of their health, including physical, psychological and social issues that impact on well-being (Jenkinson & McGee 1998, Bowling 2005). Sometimes used interchangeably with health-related quality of life. Provides a broader interpretation of an individual’s health compared with a surrogate outcome |
Health-related quality of life | A subjective measure of one’s ability to perform usual tasks and their impact on one’s everyday physical, emotional and social well-being (Fayers & Hays 2005, Jenkinson & McGee 1998). Sometimes used interchangeably with health status. Provides a broader interpretation of an individual’s health compared with a surrogate outcome |
Quality of life | A comprehensive measure of the quality of an individual’s life. It incorporates, but is not limited to, health-related quality of life, and takes into account factors such as the environment that one lives in (Bowling 2005) |
Measurement of health status and quality of life via self-reported questionnaire is a complex area that is developing rapidly (Scientific Advisory Committee of the Medical Outcomes Trust 2002). These questionnaires or instruments pose relevant questions relating to a person’s quality of life. Each question or item is then scored individually or added together to produce a summary score. Scores can also be weighted to place more importance on certain sections of the instrument (Streiner & Norman 1995). In addition, many questionnaires group items together into like areas, known as domains or subscales (e.g. ‘pain’, ‘function’, ‘vitality’, etc.). Domains can also be scored individually or combined to give an overall numerical score of health and well-being, although this feature is not available for all instruments (McDowell 2006). Questionnaire development is a science of its own and rigorous methods should be employed to develop an instrument that will measure appropriately and accurately. Questions and their grouping into domains need to be developed using sound measurement principles, including item inclusion, psychometrics, validity and reliability testing (Streiner & Norman 1995). Appropriate development will ensure that instruments measure what they are supposed to measure and that they do so in a reproducible manner (Muldoon et al 1998).
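The scoring steps described above (item scores, domain scores, optional weighting and an overall summary score) can be sketched in a few lines of code. This is only an illustrative sketch: the item names, the two domains, the 0 to 4 response range and the domain weights are all hypothetical and are not taken from any published instrument.

```python
# Hypothetical instrument: item names, domains, response range and weights
# are illustrative assumptions, not taken from any published questionnaire.
ITEM_MAX = 4  # each item is assumed to be answered on a 0-4 scale

DOMAINS = {
    "pain": ["pain_rest", "pain_walking"],
    "function": ["stairs", "shopping", "housework"],
}

# Optional weights that give some domains more influence on the summary score.
DOMAIN_WEIGHTS = {"pain": 2.0, "function": 1.0}


def domain_scores(responses):
    """Sum the items in each domain and rescale the result to 0-100."""
    scores = {}
    for domain, items in DOMAINS.items():
        raw = sum(responses[item] for item in items)
        scores[domain] = 100.0 * raw / (ITEM_MAX * len(items))
    return scores


def summary_score(responses):
    """Weighted mean of the domain scores, still on a 0-100 scale."""
    scores = domain_scores(responses)
    total_weight = sum(DOMAIN_WEIGHTS.values())
    return sum(DOMAIN_WEIGHTS[d] * s for d, s in scores.items()) / total_weight


answers = {"pain_rest": 3, "pain_walking": 1,
           "stairs": 2, "shopping": 4, "housework": 3}
print(domain_scores(answers))   # pain = 50.0, function = 75.0
print(round(summary_score(answers), 1))
```

Rescaling each domain to a common 0 to 100 range before combining is one common design choice; it stops domains with more items from dominating the summary score simply because they contain more questions.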
Health status measures can be classified under two broad categories: (i) generic measures, which assess universal aspects of general health and well-being; and (ii) specific measures, which assess a specific medical condition or region of the body (Patrick & Deyo 1989). It is generally accepted that when measuring health status, both a generic and a specific measuring instrument should be used, as they measure distinct but complementary aspects of patients’ health (Kantz et al 1992, Bombardier et al 1995, Hawker et al 1995). Generic instruments allow some commonality of measurement, and as such, comparison between different conditions (Patrick & Deyo 1989). However, due to their non-specific nature, they are generally less responsive to change compared with specific instruments (Patrick & Deyo 1989). Table 4.2 presents a summary of the categories discussed above.
Category | Subcategory | What does it measure? |
---|---|---|
Generic | None | Universal aspects of general health and well-being |
Specific | Condition-specific or region-specific | Health status relative to a region of the body or medical condition |
Before discussing specific examples of health outcome measures in more detail, a brief look at the important issues of validity and reliability is needed.
Choosing an outcome measure: issues of validity and reliability
It is not the intention of this chapter to discuss in detail the concepts of validity and reliability. Readers are referred to many excellent texts if further detail is required (Bowling 2005, Bowling & Ebrahim 2005, McDowell 2006, Portney & Watkins 2008, Streiner & Norman 1995). However, some mention of the issues that relate to validity and reliability is needed, because these issues are important when deciding whether to use an outcome measure in research or practice, or when evaluating the results of research that has used a particular outcome measure.
When measuring the outcome of an intervention, either for research purposes or in the clinical setting, it is necessary to ensure the outcome measure being used is appropriate (Boynton & Greenhalgh 2004). For health outcome measures, often self-reported questionnaires, to be ‘appropriate’ they need to be both valid and reliable. Validity refers to whether the outcome measure or the instrument measures what it is intended to measure (Portney & Watkins 2008). Reliability is the degree of consistency with which an outcome measure or an instrument measures a variable (Portney & Watkins 2008). When choosing an outcome measure the researcher or clinician should ensure it has been subjected to appropriate validity and reliability testing; that is, it has been proved through rigorous investigation to be both valid and reliable. Importantly, an instrument or questionnaire that has been shown to have good validity and reliability is confirmed to be of good quality and can be used with confidence (McDowell 2006).
Validity takes a number of forms, including: face validity, content validity, criterion validity and construct validity (Peat 2001). Briefly, face validity (also known as measurement validity) refers to the extent that an outcome measure appears, on the face of it, to measure what it is intended to measure. Content validity relates to the extent that the items included in an outcome measure cover the area being researched. Criterion validity refers to whether the outcome measure agrees with a gold standard (i.e. does it agree with another measure currently considered to be the best). Construct validity relates to whether the outcome measure agrees with other tests used for measuring the construct in question.
Reliability of an outcome measure refers to the consistency or repeatability of an instrument or questionnaire. That is, if the outcome measure were used multiple times on one patient, would the same answer be arrived at, provided that the patient’s health status did not change (Bowling 2005)? Reliability encompasses both test–retest reliability (often measured via intraclass correlation coefficients (ICCs) and measurement error) as well as internal consistency (often measured via Cronbach’s alpha) (Streiner & Norman 1995).
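As an illustration of internal consistency, Cronbach’s alpha can be computed from the item variances and the variance of the total score: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below uses the standard formula with sample variances; the five respondents and three items are invented data, used only to show the arithmetic.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents.

    rows: one list of item scores per respondent (same items, same order).
    """
    k = len(rows[0])  # number of items
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented responses: 5 respondents answering 3 items on a 1-5 scale.
responses = [
    [1, 2, 1],
    [2, 3, 3],
    [3, 3, 4],
    [4, 5, 4],
    [5, 5, 5],
]
print(round(cronbach_alpha(responses), 2))
```

In this made-up data set the three items rise and fall together across respondents, so alpha is close to 1. Items that vary independently of one another would pull alpha towards 0.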
Issues relating to responsiveness to change (an instrument’s ability to detect clinically important changes over time), floor and ceiling effects (where many patients achieve the lowest or highest score), and interpretability are also important (Terwee et al 2007). However, because of the complex nature of these issues they will not be covered in depth in this chapter, and the interested reader is referred to many excellent texts on this subject (Fayers & Hays 2005, McDowell 2006, Streiner & Norman 1995). Clearly, the aforementioned issues of validity and reliability are relatively complex, particularly if one is new to the area of health outcome measurement. In such a case, the vast array of statistics associated with these issues can deter a clinician or researcher from taking the plunge into health outcomes assessment. With this in mind, Suk and colleagues (Suk et al 2005) recommend the following checklist to assist in choosing (or even developing) an outcome measuring instrument for use in clinical practice or research:
1. Is the instrument internally consistent?
2. Is the instrument reproducible (reliable)?
3. Does the instrument demonstrate criterion validity?
4. Does the instrument demonstrate construct validity?
5. Does the instrument demonstrate content validity?
6. Does the instrument detect changes over time that matter to patients?
7. Will the instrument be deemed acceptable to patients?
Ideally, an instrument should satisfy each of the questions in the above checklist. In reality, some instruments will not, in which case the most appropriate of the available instruments should be chosen. If available instruments are simply not appropriate, or none exist, development of a new instrument is warranted. However, many instruments already exist, so a new instrument should not be developed until it is absolutely clear that one is needed.
Specific outcome measures
Visual analogue scale
Often clinicians and researchers want to assess a single (uni-dimensional) construct, such as pain, in as simple a manner as possible. One such method that is well accepted is the visual analogue scale (VAS). The VAS is a 100 mm straight line, which has at each end labels that define the extreme limits (i.e. range) of the sensation or response being measured (McDowell 2006). The line is most frequently presented horizontally on a page, but it can be presented vertically (Scott & Huskisson 1979b).
Although primarily used to measure pain, the VAS can be used to measure other constructs as well (e.g. anxiety, satisfaction, comfort). When used for the measurement of pain, the extreme left end of the scale (i.e. 0 mm) is labelled ‘no pain’ and the right end of the scale (i.e. 100 mm) is labelled ‘pain as bad as it could be’ (Huskisson 1974, Scott & Huskisson 1976). Alternatively, the labels ‘worst pain imaginable’ or ‘agonising pain’ can be used at the extreme right end of the scale (McDowell 2006). Figure 4.1 shows a typical example of a VAS formatted for the measurement of pain.
The VAS has been validated as a measure for the following types of pain: experimentally induced pain; pain intensity; unpleasantness of pain; and chronic clinical pain (Duncan et al 1989, Price et al 1983, Price & Harkins 1987). When used to measure pain, the VAS has also been shown to be reliable, with test–retest correlations ranging from 0.95 to 0.99 (Revill et al 1976). With respect to serial assessment of pain over longer periods of time (e.g. every month for a year), one study found that it is acceptable to show patients their initial scores (Scott & Huskisson 1979a). Caution is needed with non-literate patients where reliability, while still acceptable, is not as good as with literate patients (Ferraz et al 1990).
Clearly, the VAS has been extensively validated and been shown to be a reliable tool to measure pain, but how are the results obtained interpreted? For example, what scores on a VAS relate to mild, moderate or severe pain? One study pooled the findings of 11 randomised trials (1080 patients) that evaluated the effects of analgesia using VAS (Collins et al 1997). The results showed that patients who scored their pain as ‘moderate’ had a mean of 49 mm, with 85% of them scoring over 30 mm. Those reporting ‘severe’ pain had a mean of 75 mm, and 85% scored over 54 mm.
In addition, what amount of change in pain can be viewed as positive for the patient? Although this is a highly complex and contentious area, there has been some research on this issue. For example, one study found that a 33% decrease in postoperative pain represented a reasonable standard for determining a meaningful change from the patient’s perspective (Jensen et al 2003). Another study established that patients experiencing pain postoperatively of less than 40 mm on a VAS were ‘adequately’ satisfied, and the complete elimination of pain was not required for good patient satisfaction (Jensen et al 2005).
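The two findings above can be applied to a single patient’s scores with simple arithmetic. The sketch below is illustrative only: the function names and the example scores (72 mm falling to 45 mm) are hypothetical, and the 33% and 40 mm figures are taken from the postoperative studies just cited, so they may not transfer to other settings.

```python
def percent_reduction(baseline_mm, follow_up_mm):
    """Percentage fall in VAS pain relative to the baseline score."""
    if baseline_mm <= 0:
        raise ValueError("baseline pain must be greater than 0 mm")
    return 100.0 * (baseline_mm - follow_up_mm) / baseline_mm


def summarise_change(baseline_mm, follow_up_mm):
    """Compare a change in VAS pain with the two reported benchmarks."""
    reduction = percent_reduction(baseline_mm, follow_up_mm)
    return {
        "percent_reduction": reduction,
        # 33% decrease: meaningful change reported by Jensen et al 2003
        "meets_33_percent": reduction >= 33.0,
        # below 40 mm: 'adequate' satisfaction reported by Jensen et al 2005
        "below_40_mm": follow_up_mm < 40.0,
    }


# Hypothetical patient: 72 mm before treatment, 45 mm afterwards.
print(summarise_change(72, 45))
```

For this hypothetical patient the reduction is 37.5%, which exceeds the 33% benchmark, even though the follow-up score of 45 mm is still above 40 mm. The two benchmarks answer different questions (relative change versus absolute level) and need not agree.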
The issue of meaningful change has received some attention of late, with researchers attempting to determine what change in pain level is important to patients. More broadly, the amount of change required on any outcome measure – not just pain – to be considered important to a patient has been recently referred to as the ‘minimal important difference’ (Schunemann & Guyatt 2005), although it has also been referred to by many other terms including: minimal clinically important difference, clinically significant difference and meaningful change. Unfortunately, there are few examples of this type of research (i.e. calculation of the minimal important difference) in the foot and ankle literature. The literature on emergency medicine indicates the minimal important difference for the VAS ranges between 9 mm and 13 mm (Kelly 1998, Kelly 2001). That is, a change in pain of 9–13 mm on a VAS is, on average, considered important by patients. The minimal important difference is a valuable piece of information, not only when interpreting findings from instruments such as the VAS, but also for clinical trials, in which it can be used to assist in prospective sample size calculations.
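To show how the minimal important difference feeds into a prospective sample size calculation, the sketch below uses the common normal-approximation formula for comparing two group means, n per group = 2 × ((z for alpha + z for power) × SD / difference)². The standard deviation of 25 mm, the two-sided alpha of 0.05 and the power of 80% are assumptions chosen for illustration; they do not come from the studies cited above, and a real trial would also allow for drop-outs.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mid_mm, sd_mm, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a difference
    of mid_mm between two group means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd_mm / mid_mm) ** 2)


ASSUMED_SD_MM = 25  # hypothetical between-patient SD of VAS pain scores

for mid in (9, 13):  # range of minimal important differences reported above
    print(mid, "mm:", n_per_group(mid, ASSUMED_SD_MM), "per group")
```

Under these assumptions, aiming to detect the smaller difference of 9 mm requires roughly twice as many participants per group as aiming to detect 13 mm, which is why the choice of minimal important difference matters so much at the planning stage.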
Although the VAS is one of the most frequently used instruments to measure constructs such as pain, there are other methods such as the numerical rating scale, the verbal rating scale and the Oucher scale. The numerical rating scale is a simple mechanical device where the patient moves a slider and the numerical rating is read from the back of the device (McDowell 2006). Alternatively, the verbal rating scale can be used where the patient indicates their level of pain, for example, using the numbers 0–10. This simple and clinically useful method correlates well with the VAS (Murphy et al 1988). It has also been shown to have adequate validity and reliability (Jensen et al 1999). The Oucher scale is a pain scale that is specifically used for children between the ages of 3 and 12 years (Knott et al 1994). To indicate their level of pain, children can choose a face from a series of faces with varying expressions of pain.
Finally, more complex, multi-dimensional measures of pain have also been developed, and the McGill Pain Questionnaire (Melzack 1975) is one of the best known. In contrast with the VAS, the McGill Pain Questionnaire measures more than just pain intensity; however, in its long form it is significantly more time consuming to use (Kahl & Cleland 2005). Moreover, the newer health status measures also generally have a pain component embedded in them, making more complex pain measurements, such as the McGill Pain Questionnaire, somewhat redundant. If a specific component of pain is required to be measured in addition to the more generalised pain assessed in the newer health status measures, a simple VAS would probably be sufficient. For example, in a randomised trial evaluating low-Dye taping for plantar heel pain (Radford et al 2006), ‘first step’ pain – the pain experienced in the heel on first stepping out of bed in the morning – was measured using a VAS, while more generalised foot pain was measured using a foot-specific health status measure (the Foot Health Status Questionnaire, covered below).
Clinician-based outcome measures
American Orthopaedic Foot and Ankle Society clinical rating scales
The American Orthopaedic Foot and Ankle Society (AOFAS) first reported on these scales in the medical literature in 1994 (Kitaoka et al 1994). Since this time, they have been widely adopted in orthopaedic foot and ankle research. The scales, or ‘clinical rating systems’, were generated by clinicians (i.e. members of the AOFAS) and hence, are clinician-based outcome measures. Four rating systems were originally developed: the Ankle-Hindfoot Scale; the Midfoot Scale; the Hallux Metatarsophalangeal-Interphalangeal Scale; and the Lesser Toe Metatarsophalangeal-Interphalangeal Scale.
However, numerous studies have evaluated the validity and reliability of the AOFAS scales and have revealed important concerns. For example, one study found that the AOFAS scales correlated poorly with the extensively validated Medical Outcomes Study Short Form-36 (SF-36) (SooHoo et al 2006a). In this study of 91 participants, Pearson correlation coefficients ranged from 0.02 to 0.36 for the overall study sample. Correlations of this magnitude suggest poor construct validity of the scales. In addition, a small study of 25 participants showed that components of the SF-36, a generic outcome measure, had levels of responsiveness approaching that of the AOFAS and as such, the SF-36 could be used instead to monitor outcomes without loss of sensitivity (SooHoo et al 2006a). Further validity and reliability testing of the AOFAS Hallux Metatarsophalangeal-Interphalangeal Scale and Lesser Toe Metatarsophalangeal-Interphalangeal Scale was done in a small study of 11 people with rheumatoid arthritis (Baumhauer et al 2006). Although the scales were reliable, their validity was questionable, with certain subscales correlating poorly with another foot-specific outcome measure, the well-validated Foot Function Index (FFI).
In addition, the agreement between prospective and retrospective AOFAS Hallux Scale assessments has been assessed. Poor agreement was found between prospective and retrospective evaluations of hallux surgery, indicating they could not be used interchangeably (i.e. a prospective assessment should not be compared with a retrospective assessment) (Schneider & Knahr 2005). Retrospectively acquired AOFAS data have also been shown to overestimate the benefit of surgery (Toolan et al 2001, Schneider & Knahr 2005). Accordingly, these studies highlight that prospective rather than retrospective study designs are preferable. Another study showed that population distributions of the AOFAS scales could be badly skewed; therefore the use of parametric statistical tests to analyse AOFAS scores should be viewed with great caution (Guyton 2001).
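The concern about skewed distributions can be checked before any parametric test is applied. The sketch below computes a simple moment-based skewness coefficient and compares the mean with the median. The ten scores are invented for illustration (a 0 to 100 scale with most patients clustered near the top); they are not AOFAS data from any study.

```python
from statistics import mean, median


def skewness(values):
    """Moment-based sample skewness: about 0 for a symmetric sample,
    negative when the long tail is towards low scores."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Invented scores: most patients near the top of the scale, a few much lower.
scores = [100, 100, 95, 95, 90, 90, 85, 80, 60, 35]

print("mean:", mean(scores))
print("median:", median(scores))
print("skewness:", round(skewness(scores), 2))
```

Here the mean sits well below the median and the skewness coefficient is clearly negative, so the mean is a poor summary of a typical patient. With data like these, medians and non-parametric tests are usually the safer choice.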
One recent study has reported validity of the AOFAS scales in a more positive light. This study compared the subjective component of the AOFAS scales with the Foot Function Index in an attempt to evaluate their criterion validity (Ibrahim et al 2007). The researchers found that AOFAS scales demonstrated moderate correlation (Pearson correlation coefficient = −0.68) suggesting acceptable criterion validity, although this component of the study had a small sample of 45. The study also attempted to evaluate test–retest reliability on 37 participants (8 of the original 45 participants dropped out). They reported no significant difference (p = 0.27) in the group mean AOFAS scores measured at baseline and then again after 2 weeks. However, this analysis did not use the more modern ICC approach, in which individual scores are correlated (i.e. true agreement), so these findings should be viewed with caution.
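The reason for this caution can be shown with a small worked example. In the sketch below, two invented retest data sets have exactly the same group mean as the baseline scores, so a comparison of group means would find no difference in either case, yet only one of them shows agreement at the level of the individual patient. The ICC is calculated here as the two-way, absolute-agreement, single-measurement form (often written ICC(2,1)); the choice of this particular form, and all of the scores, are assumptions made for illustration.

```python
from statistics import mean


def icc_2_1(scores):
    """Two-way, absolute-agreement, single-measurement ICC.

    scores: one (session 1, session 2, ...) tuple per patient.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Invented scores for five patients; every session has a group mean of 70.
baseline = [60, 70, 80, 90, 50]
retest_consistent = [62, 69, 81, 88, 50]   # each patient scores much the same
retest_shuffled = [90, 50, 60, 70, 80]     # same mean, different patients high

for label, retest in (("consistent", retest_consistent),
                      ("shuffled", retest_shuffled)):
    pairs = list(zip(baseline, retest))
    print(label, "mean difference:", mean(retest) - mean(baseline),
          "ICC:", round(icc_2_1(pairs), 2))
```

Both retests give a mean difference of zero, but the ICC is close to 1 for the consistent retest and below 0 for the shuffled one. A non-significant difference in group means therefore says little about whether an instrument gives reproducible scores for individual patients.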
In summary, it seems that the AOFAS scales were not constructed using sound methods and, as such, have questionable validity. This is supported by a publication by Suk and colleagues that extensively assessed and compared the AOFAS scales with other foot and ankle outcome measures (Suk et al 2005). The AOFAS scales were rated highly for their ease of use (clinical utility); however, they were rated poorly for their methodological qualities. Clearly, although widely used, the AOFAS scales (Clinical Rating Systems) require further development and validation. Accordingly, these instruments cannot currently be recommended for assessing outcomes relating to the foot and ankle.