Outcome Assessment





Steven Yeomans

Craig Liebenson

Richard Nicol






Introduction

Valid, reliable outcome assessments (OAs) are essential in modern health care, contributing to improvements in clinical practice and research.1 OAs, including patient-reported outcome measures (PROMs), enable clinicians and researchers to gather patient-specific health information, such as quality of life or functional status, and to gain insight into the health experience of a patient. Information gathered from these measures can be used in many ways. Beyond evaluating how a patient is progressing, PROMs can guide patient care, drive changes in health care delivery, and facilitate performance comparison and monitoring.2 PROMs are inexpensive, easy to administer, and quick to complete and score, making them invaluable to both researchers and clinicians.

Patient-centered care improves the quality of care and empowers patients to be actively involved in their health care.3 The use of PROMs helps clinicians gain an insight into the complex biopsychosocial aspects of a patient’s experience and thus allows the two to work together to achieve better outcomes.4,5,6

A questionnaire cannot assess all aspects of the health experience of a patient, making the selection of appropriate OAs important.7,8 PROMs need to be able to gather the pertinent information, be easy to administer, and accurately reflect the health experience of the patient while being valid, responsive, and reliable.1,7 In general, there are two types of PROMs, generic health questionnaires and condition-specific questionnaires, each covering multiple domains such as pain and/or function. Additionally, questionnaires come in a variety of formats, using visual analogue scales, numeric rating scales (NRSs), or descriptive answers from which respondents select.9,10

Depending on the needs and goals of a patient, such as reducing activity limitations (walking, sitting, standing, etc.), increasing participation (social events, sports, etc.), and/or reducing pain, it is important to establish an appropriate baseline to aid in goal setting. PROMs are a valid and reliable method of establishing this baseline. They are typically administered during the initial visit and then re-administered at a later, often predetermined, time point during the course of treatment to assess how a patient is progressing. The results can guide care delivery and enhance quality, efficiency, and patient satisfaction while improving accountability.1,4,11


What Outcomes?

Traditional medical care has emphasized “objective” measures such as imaging and laboratory modalities. However, multifactorial conditions such as low back pain are best explained by a biopsychosocial model of illness rather than a biomedical model of disease. This necessitates that “subjective” measures of pain, distress, and perception of functional abilities (i.e., disability) also be utilized.12,13,14

In the spine care field in which patient dissatisfaction with care runs high, measuring outcomes that matter to the patients themselves is of paramount importance.12 Patient-centered outcomes include pain severity, distress, and ability to perform common activities of daily living (ADL) (disability). Interestingly, the “objective” tests such as imaging, laboratory tests, and physical impairments (muscle strength and range of motion) correlate poorly with self-reported symptoms and functional status (i.e., disability).15,16,17 Thus, patient-centered outcomes derived from self-administered questionnaires have achieved a surprisingly high level of significance.18,19,20

Because of the multidimensionality of the biopsychosocial model, a broad spectrum of outcomes can be potentially measured. Deyo et al and Bombardier have listed the following domains as most relevant to a patient’s clinical status—pain, disability, well-being, work status, and satisfaction.21,22 Psychosocial status, especially fear-avoidance beliefs, is also an important and relevant area to document.23,24,25,26


Criteria Regarding Outcomes Assessment

An effective outcome measure (OM) should be valid, reliable, responsive to clinical change, and practical.27,28,29,30,31 Many OAs are also very specific to certain populations. For example, the Kerlan-Jobe Orthopaedic Clinic (KJOC) Shoulder and Elbow questionnaire is specific to overhead throwing athletes (pro baseball players),32 whereas the Oslo Sports Trauma Research Center (OSTRC) overuse injury questionnaire has been validated in elite athlete populations.33,34 Both PROMs could be used in clinical practice, but a clinician should be aware of when each is appropriate in order to obtain the most useful information.32 Ironically, the so-called subjective measures have been shown to be more psychometrically valid than the so-called objective measures. Many of the latter, such as muscle strength and mobility, are vulnerable to submaximal effort and impairment exaggeration35,36,37 (see Chapters 10 and 12).



Validity

Validity refers to the ability of an OM to accurately quantify what it purports to measure.

Face validity: The extent to which a test appears to measure a purported construct.

Content validity: The extent to which the OM incorporates all relevant features of the domain in question.

Criterion validity: Generally refers to a comparison of a measure against some sort of “gold standard,” or criterion measure. There are no gold standards for health status measures because health is a latent (or nonobservable) trait, so one can never quantify it with certainty. In such a case, validity is established by testing a construct.

Construct validity: The extent to which the measurement corresponds to theoretical concepts (constructs) concerning how the phenomenon under study is expected to react.38

Concurrent validity: The comparison of two measures completed at the same time. There are two subtypes: (i) convergent: the expectation that the scores between two related variables will be correlated. In other words, the scores of the two measures will increase (or decrease) together; and (ii) divergent or discriminative: this is tested when two or more variables that measure something totally unrelated are studied. Good discriminative validity is shown if two unrelated measures do not correlate with one another. For example, if anxiety is independent of intelligence, then we should not find a strong correlation between the two.38

Predictive validity: The ability of a test to predict a future event/state (e.g., readmission rates to a hospital).


Reliability

Reliability is the amount of error associated with a measurement. It is defined as “the degree of stability exhibited when a measurement is repeated under identical conditions.”39 Thus, if a reliable measure is used, any change that occurs over time is caused by an actual change in patient status.

Test-retest reliability: demonstrated when repeated test scores on an individual whose health status is unchanged give the same result. This is a measure of an instrument’s standard error of measurement (SEM).40

Interobserver reliability: reflects the consistency of measurement application when different observers measure the same phenomenon.

Intraobserver reliability: a specific type of test-retest reliability in which the degree of consistency within the same examiner is evaluated.


Responsiveness

Responsiveness is defined as “the accurate detection of change when it has occurred.”41 If a tool is responsive to change, the score on a questionnaire should improve as a person’s health status improves. This is clinically significant change that is not caused by a random occurrence. Responsiveness is essential when an outcomes assessment (OA) tool is used by a health care practitioner (HCP) to show clinical improvement in health status as a result of care over time.

Minimal clinically important change (MCIC): The change in score that most accurately separates patients who have changed by an important amount (improved or worsened) from those who have not. The term “clinically” in MCIC represents that which is important to the patient and/or those associated with the patient (such as an employer). Although each individual tool often has specific psychometric properties reported, as a general rule based on literature review and expert consensus, a 30% change in score in most scales can be considered to be “meaningful” and a 50% change to be “substantial.”42
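The 30% and 50% rule of thumb can be applied with very little arithmetic. The following is a minimal sketch, assuming a scale on which a lower score is better (such as a 0-to-10 pain NRS); the function names and the example scores are hypothetical and are not taken from any published scoring manual.

```python
def percent_improvement(baseline: float, follow_up: float) -> float:
    """Percent improvement on a scale where a LOWER score is better
    (e.g., a 0-10 pain NRS). Negative values indicate worsening."""
    if baseline <= 0:
        raise ValueError("baseline score must be greater than zero")
    return (baseline - follow_up) / baseline * 100.0


def classify_change(percent: float) -> str:
    """Apply the general 30% / 50% rule of thumb quoted in the text."""
    if percent >= 50.0:
        return "substantial"
    if percent >= 30.0:
        return "meaningful"
    return "below the 30% threshold"


# Hypothetical patient: pain NRS of 8 at intake, 5 at re-examination.
change = percent_improvement(8, 5)
print(f"{change:.1f}% improvement: {classify_change(change)}")
```

Tool-specific MCIC values, where they have been published, should take precedence over this general rule.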


Several methodological categories of determining the MCIC have been proposed, of which the single anchor method is used most frequently.44 Here, the patients’ Global Rating of Change (GROC) is utilized posttreatment to distinguish important from trivial change. However, the MCIC has been reported to suffer from recall bias and other weaknesses, which reduce its validity.45 This prompted a study designed to compare chronic low back pain (cLBP) patients’ view of an acceptable change before treatment began (MCIC-pre) to that reported by patients after treatment (MCIC-post).46 Utilizing the Oswestry Disability Index (ODI), Back Bournemouth Questionnaire (BBQ), and 0 to 10 numeric rating scale for pain (NRS-pain) at baseline and follow-up in 147 cLBP patients, the MCIC was calculated before and after treatment. The results showed that the pretreatment MCIC was larger than the posttreatment MCIC (4.5 times larger for the ODI versus 1.5 times larger for the BBQ and NRS-pain). The authors concluded that cLBP patients overestimated their acceptable change before treatment in function and psychological/affective aspects but less so for pain. In the end, the MCIC is most accurate when treatment effects are assessed posttreatment rather than predicted pretreatment.

Another common way in which responsiveness is determined is by the effect size. This is the size of an effect from a treatment intervention.47 It is determined from a comparison of different instruments measuring the same thing. The larger the effect size, the greater the treatment effect (signal) as related to the variability (noise) in the sample. An effect size of 0.2 is small, 0.5 is moderate, and 0.8 or more is large.

Different methods are used to calculate effect size. They each use a ratio with the same numerator of the mean pretreatment score minus posttreatment score across the study population. The denominator is usually the range of scores or standard deviation (SD) of the entire group.

In individuals who classify themselves as having improved greatly, a responsive instrument should have a large effect size, whereas in individuals who classify themselves as not improving, the effect size should be small. Thus, it would be expected that in chronic patients (who are less likely to show improvement) an instrument’s effect size would be much smaller than in acute patients (who are more likely to show improvement).
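The ratio described above can be illustrated with a short sketch. It assumes one of the variants mentioned in the text, namely the mean pretreatment score minus the mean posttreatment score divided by the SD of the group's pretreatment scores; other denominators are in use. The function names, the "negligible" label for values below 0.2, and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(pre_scores, post_scores):
    """Mean change (pre minus post) divided by the SD of the pretreatment scores."""
    if len(pre_scores) != len(post_scores) or len(pre_scores) < 2:
        raise ValueError("need matched pre/post scores for at least two patients")
    baseline_sd = stdev(pre_scores)  # sample SD of the whole group at baseline
    if baseline_sd == 0:
        raise ValueError("baseline scores show no variability")
    return (mean(pre_scores) - mean(post_scores)) / baseline_sd


def describe_effect_size(es):
    """Label an effect size using the 0.2 / 0.5 / 0.8 conventions quoted in the text."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # label assumed here; the text does not name this range


# Hypothetical disability scores (0-100, lower is better) for four patients.
pre = [40, 50, 60, 70]
post = [30, 40, 50, 60]
es = effect_size(pre, post)
print(f"effect size = {es:.2f} ({describe_effect_size(es)})")
```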

Minimum detectable change (MDC) is another psychometric property of a tool. As with the MCIC, a change in score must be equal to or greater than the MDC to be considered significant. However, the MDC tends to vary significantly between different questionnaires because it represents a statistical or mathematical assessment of change and may not correlate with what is “clinically important” to the patient. Therefore, the MCIC of 30% to 50% improvement, rather than the MDC, has become the focus and trend. In the end, this information must be correlated with the remaining factors unique to each individual case, as the meaning of MCIC may vary significantly between the patient and other stakeholders such as family, employers, insurers, etc.
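This chapter does not give a formula for the MDC. The sketch below uses one commonly cited formulation, which is an assumption here rather than the authors' method: the SEM is estimated as the baseline SD multiplied by the square root of one minus the test-retest reliability coefficient, and the MDC at the 95% confidence level is 1.96 × √2 × SEM. The function names and the numbers are hypothetical.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), with reliability (e.g., ICC) between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def minimum_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC = z * sqrt(2) * SEM; z = 1.96 corresponds to 95% confidence."""
    return z * math.sqrt(2.0) * sem


# Hypothetical instrument: baseline SD of 10 points, test-retest reliability of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
mdc = minimum_detectable_change(sem)
observed_change = 12.0
print(f"SEM = {sem:.2f}, MDC95 = {mdc:.2f}")
print("exceeds measurement error" if observed_change >= mdc else "within measurement error")
```

A change that exceeds the MDC is unlikely to be measurement error alone, but, as noted above, it is not necessarily important to the patient.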


Ceiling and Floor Effects

A ceiling effect occurs when a respondent begins at a high level of function, so that if they improve, the instrument cannot accurately detect the improvement. An example would be an athlete. A floor effect occurs when a respondent begins at a low level of function and further deterioration in function cannot be detected by the measure. An example is a frail or postoperative person. Ceiling or floor effects are caused by the inability of the instrument to discriminate at the higher or lower end of the dimension being measured. The impact of ceiling and floor effects is that clinically important change will not be measured or detected.


Practicality

An outcome tool should be simple to administer and understand, time efficient, and easy to score and interpret. Disability questionnaires should have simple and unambiguous wording so that patients will easily complete the entire form. Scoring should be possible with a simple computer program that shows a percent improvement over time. “Yes” and “no” responses are ideal for research questionnaires because they are easier to administer with telephonic follow-up. However, HCPs may prefer forms with 0-to-10 visual analog scales (VASs) that give patients more options for their answers. A practical tool is time- and cost-efficient as well as valid, reliable, and responsive.


Likelihood Ratio

The likelihood ratio (LR) is defined as the likelihood that a given test result would be expected in a patient with a known disorder versus the likelihood that the same result would be expected in a patient without the target disorder. It is used to determine how good a diagnostic test is and to help select appropriate test(s).48

When a test is ordered to rule in or rule out a disease in a patient, we want to run the tests that are most likely to do this accurately. Basically, we make an initial assessment of the likelihood of a disease (termed “pre-test probability”), run a test to shift our decision in one direction or another, and then make a final determination as to the likelihood of that disease (termed “post-test probability”) (http://omerad.msu.edu/ebm/diagnosis/diagnosis6.html). The following chart helps define the LR interpretation49 (Table 8.1).

Because tests can be positive or negative, there are at least two LRs for each test. The higher the “positive likelihood ratio” (LR+), the more a positive result raises the probability of the disease; the lower the “negative likelihood ratio” (LR−), the more a negative result lowers it. Some tests are most valuable for ruling in the target disorder, whereas others are most helpful for ruling it out; both are therefore important.
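The step from pre-test to post-test probability can be worked through numerically. The sketch below relies on the standard relationships LR+ = sensitivity / (1 − specificity), LR− = (1 − sensitivity) / specificity, and post-test odds = pre-test odds × LR. These formulas are not spelled out in this chapter and are supplied here as an assumption; the function names and the example test characteristics are hypothetical.

```python
def positive_lr(sensitivity: float, specificity: float) -> float:
    """LR+ : how much more likely a positive result is in patients with the disorder."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in [0, 1)")
    return sensitivity / (1.0 - specificity)


def negative_lr(sensitivity: float, specificity: float) -> float:
    """LR- : how much more likely a negative result is in patients with the disorder."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity <= 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1]")
    return (1.0 - sensitivity) / specificity


def post_test_probability(pre_test_probability: float, lr: float) -> float:
    """Convert probability to odds, multiply by the LR, and convert back."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


# Hypothetical test: sensitivity 0.90, specificity 0.80; pre-test probability 20%.
lr_pos = positive_lr(0.90, 0.80)
lr_neg = negative_lr(0.90, 0.80)
print(f"LR+ = {lr_pos:.2f}, post-test probability = {post_test_probability(0.20, lr_pos):.2f}")
print(f"LR- = {lr_neg:.3f}, post-test probability = {post_test_probability(0.20, lr_neg):.2f}")
```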








Table 8.1 Likelihood Ratio

| LR | Interpretation |
| --- | --- |
| >10 | Large and often conclusive increase in the likelihood of disease |
| 5-10 | Moderate increase in the likelihood of disease |
| 2-5 | Small increase in the likelihood of disease |
| 1-2 | Minimal increase in the likelihood of disease |
| 1 | No change in the likelihood of disease |
| 0.5-1.0 | Minimal decrease in the likelihood of disease |
| 0.2-0.5 | Small decrease in the likelihood of disease |
| 0.1-0.2 | Moderate decrease in the likelihood of disease |
| <0.1 | Large and often conclusive decrease in the likelihood of disease |


Reprinted with permission from Ebell MH, Barry HC. OMERAD online evidencebased medicine course, diagnosis module. Michigan State University. http://omerad.msu.edu/ebm/diagnosis/diagnosis6.html. Accessed August 17, 2017.



Domains

There are two broad categories into which OA tools can be assigned: subjective and objective.50 Subjective OA tools are patient-driven, whereas objective measures are driven by the HCP. This chapter discusses the subjective OA tools; the objective tools are discussed in Chapters 10 and 12. There are several outcomes assessment tools included in the appendix to this chapter. When available, the MDC score is reported.

There have been several classifications of the various domains or groups of OA tools.51 Bombardier describes a core set of measures that should be considered when managing patients with spinal disorders—pain, generic health status, disability or functional status, work status, and patient satisfaction (Table 8.2).21 Psychological distress is a sixth domain that should also be addressed and emphasized.


Pain

In the assessment of pain, there are several measures to consider, including the pain severity, pain affect, pain location, and pain persistence (chronicity). The severity of pain is related to how much a person hurts, whereas pain affect measures the mental or emotional component of pain. When assessing pain severity for chronic and recurrent pain conditions, assessing the pain severity during a specified time period such as 1 week, 1 month, 6 months, etc., may be more important than reporting the pain status at a particular point of time.52 Von Korff describes key parameters of pain status based on a retrospective report to include: (a) the number of days pain is experienced during a specified time frame; (b) the average or usual pain intensity (PI) when in pain; (c) average interference with activities; and (d) the cumulative number of activity limitation days caused by pain.52








Table 8.2 An Example of Recommended Outcomes Assessment Tools for Low Back Pain Patients

| Domain | Instrument | Number of Items | Score (Best to Worst) | Time to Complete |
| --- | --- | --- | --- | --- |
| Pain | NRS | 1 | 0-10; clinically meaningful change = 30% | <1 minute |
| Generic health status (including well-being) | Single self-rated health question | 1 | 0-10 | <1 minute |
| Function/disability | PSFS | 3 | 0-30 | 1 minute |
| Function/disability | Oswestry | 10 (6 levels) | 0-100 | 3-5 minutes |
| Work status | Time off work | 1 | Number of days | <1 minute |
| Satisfaction | Satisfaction with care | 10 | 1-5 | 2-3 minutes |

NRS, numerical rating scale; PSFS, Patient-Specific Functional Scale.


Pain Severity/Intensity: Measuring PI can be accomplished using verbal rating scales, VAS, and/or NRSs. Von Korff concludes that “… 0-10 NRSs have many advantages over the alternatives for clinical use and for research in clinical populations in which a simple and robust measurement method is needed.”53 Hence, a 0-to-10 NRS anchored by “no pain” at the “0” end and “extreme pain” at the “10” end (or vice versa) is a commonly used and practical approach.

A VAS of current pain has been shown to be less responsive than a rating of pain over the past 24 hours, week, or 2 weeks.29,54,55,56 Therefore, when asking a patient to rate pain, the usual or average pain level may be the best choice when limiting the number of questions asked regarding PI to one. The report of average PI has been found to correlate with a 3-month daily pain diary in a number of studies.57,58,59 These validity studies support using measures of average or usual PI for up to a 3-month recall period with acceptable discrimination. An example of a simple 0-to-10 NRS for PI using the “usual,” “typical,” or “average” is depicted in Figure 8.1.


Hagg and colleagues found a change of 18 to 19 out of 100 in the VAS of cLBP patients to be clinically significant.40 Turner studied the correlation of pain with disability.60 If the initial VAS was 5 or more, a change of at least 2 points was needed to influence disability scores significantly. If the initial VAS score was <5, then a VAS change of at least 1 point would have a clinically relevant effect on functioning.
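Turner's thresholds amount to a simple decision rule, sketched below. This is an illustration only, assuming a 0-to-10 scale on which lower is better; the function name and example scores are hypothetical.

```python
def pain_change_likely_affects_function(initial_vas: float, follow_up_vas: float) -> bool:
    """Apply the thresholds described above: an initial score of 5 or more
    needs a drop of at least 2 points; an initial score below 5 needs at least 1."""
    for score in (initial_vas, follow_up_vas):
        if not 0 <= score <= 10:
            raise ValueError("scores must lie on the 0-10 scale")
    required_drop = 2.0 if initial_vas >= 5 else 1.0
    return (initial_vas - follow_up_vas) >= required_drop


# Hypothetical patients.
print(pain_change_likely_affects_function(6, 4.5))  # initial >= 5, drop of 1.5
print(pain_change_likely_affects_function(4, 3))    # initial < 5, drop of 1
```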







Figure 8.1 The numerical rating scale for usual, typical, or average pain intensity.

Regarding the validity of weekly recall ratings of PI in neck pain patients, Bolton and colleagues reported the following: average pain over the prior week (Pearson r = 0.95); worst pain over the prior week (Pearson r = 0.93); least pain over the prior week (Pearson r = 0.92).61 This was calculated by having 78 patients with nonspecific neck pain complete a 7-day neck pain diary, rating their pain levels four times a day on an 11-point (0 to 10) NRS. From the 28 ratings, each patient’s “actual average” pain was computed. On day 8, they were asked to rate their current pain as well as their pain “on average,” at its “worst,” and at its “least” over the prior week. Recall of average pain over the prior week was shown to be a valid measure using the data as stated above.
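The comparison described above, a diary-derived average set against a recalled average, can be sketched as follows. The diary length of 28 ratings follows the study design described in the text; the function names and all data are hypothetical, and the Pearson coefficient is computed directly from its definition.

```python
from math import sqrt
from statistics import mean


def diary_average(ratings):
    """'Actual average' pain from a 7-day diary with four ratings per day."""
    if len(ratings) != 28:
        raise ValueError("expected 28 ratings (4 per day for 7 days)")
    return mean(ratings)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical data: three patients' diaries and their day-8 recalled averages.
diaries = [[3] * 14 + [5] * 14, [6] * 28, [1] * 20 + [2] * 8]
actual_averages = [diary_average(d) for d in diaries]
recalled_averages = [4, 7, 2]
print(f"r = {pearson_r(actual_averages, recalled_averages):.2f}")
```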

The MCIC for PI was reported in a study of 1,349 subacute and cLBP patients with and without leg pain (LP), seen in routine clinical practice over a 12-week time frame in Spain and taken to represent the Southern European LBP patient population.62 Three different methods of calculating the MCIC were used: (a) the mean change score (MCS); (b) the MDC; and (c) the optimal cutoff point (OCP) in receiver operating characteristic (ROC) curves. The external criterion was the patient’s own “global perceived effect.” The effect on the MCIC of initial scores, duration of pain, and presence of LP was addressed. Each method yielded different results, with the OCP giving the smallest values. The MCIC for LBP ranged from 1.5 to 3.2 PI-NRS points when baseline scores were <7/10; from 2.5 to 4.3 points with a baseline score of ≥9/10.

Pain Affect: PI may be defined as the amount a person hurts, whereas pain affect can be defined as the emotional arousal and disruption created by the pain experience.63,64,65,66,67 The McGill pain questionnaire68,69 includes 20 category scales of verbal pain descriptors categorized in order of severity and clustered into four subscales:



  • Sensory discrimination
  • Affective
  • Evaluative
  • Miscellaneous

A detailed description for scoring this instrument is available elsewhere.51

Pain Diagrams: The pain diagram or drawing is perhaps the best way to obtain the patient’s perception of the location of their symptoms.51,70,71 Improvement or exacerbation can quickly be determined by comparing current to previously completed pain diagrams. Pain diagrams enhance the HCP’s ability to differentiate among mechanical low back, nerve root, and psychogenic problems (Fig. 8.2).

Abnormal illness behavior or somatization is suggested if the pain diagram shows multiple types of pain qualities (achy, stabbing, burning, numbness, pins and needles, etc.) in all four extremities and the trunk, and/or if markings outside of the body such as lightning bolts are present. This can then be correlated with other subjective information such as psychometric “yellow flags” that include poor coping strategies, depression, and anxiety, as well as objective tools such as the Waddell Non-Organic Low Back Pain signs (see Chapter 7).

Though the pain drawing is usually used qualitatively, there are several validated methods for scoring pain drawings.72,73,74,75,76,77 One scoring method is accomplished by overlapping the patient’s pain drawing with a transparency that includes the same drawing but with grid lines and adding up points based on the number of body regions/extremities marked and the quantity of pain quality markings used.

Summary: The NRS and pain diagram have the greatest utility for the typical practitioner.


General Health

Patient-based general health OMs can be classified into two general categories: generic and disease or condition-specific measures.78,79 Generic measures include global ratings of health status and multidimensional measures of health-related quality of life, which include the Sickness Impact Profile (SIP),80,81 SF-36 Health Survey,82 Nottingham Health Profile,83 Dartmouth COOP Health Charts,83,84,85 and others. The strength of generic measures of general health is that these are not specific to any one condition or disease and, therefore, are applicable across populations regardless of their health status. However, this is also a weakness because they are not as responsive to change over time compared with condition-specific tools.51 An example of a highly responsive condition-specific version of the SIP General Health Questionnaire is the Roland-Morris questionnaire.






Figure 8.2 Pain diagram. (A) Example of a well-delineated, anatomically correct depiction. (B) Example of a poorly delineated, anatomically incorrect, exaggerated depiction.

The SF-36 is a popular generic outcome tool that has been used in outcomes-based research, has been translated into >40 languages as part of the International Quality of Life Assessment, and is often utilized in clinical settings.86,87 The strength of the SF-36 lies in the fact that normative data exist for healthy and nonhealthy populations.88,89,90

Both versions 1.0 and 2.0 are divided into eight scales representing different aspects of general health.82,91,92 Utilizing the eight individual scales, version 2.0 yields two composite scores, which include mental health and physical health. Table 8.3 lists the eight scale titles, the number of items or questions that are used to compute the score, the specific scale items, and the minimum number of items needed to compute a score.

The physical component summary (PCS) is made up of the following four scales: Physical Function, Role Physical, Bodily Pain, and General Health. The Mental Health component (MHC) is made up of Mental Health, Role Emotional, Social Function, and Vitality. Grouping all 36 questions into two rather than eight scales improves reliability. The mean score for a healthy adult population on both scales is 50 ± 10 points, with a reliability level of 0.92 and 0.88 for the PCS and the MHC, respectively.

The SF-36 has been shown to be responsive to clinically meaningful change in individuals with low back pain (LBP) and sciatica in some studies,47,93,94 but not in others.95 Even in the Taylor et al study in which it was found to be responsive, it was not, overall, as responsive as the ODI.47 The scales with the greatest responsiveness were Physical Function, Bodily Pain, and Social Function. In fact, the Physical Function scale on its own was more sensitive to change than the ODI.47

The SF-12 is an abbreviated version derived from the SF-36 that was designed to improve on the practicality and utility of the longer 36-item version.96,97,98

The SF-12 can also be utilized to form two distinct scales, the physical function and mental health scales. The advantage of the SF-12 over the SF-36 is that the length of time needed to complete the form is only 2 to 5 minutes. Standard and acute versions of the SF-12 and SF-36 are available in multiple languages.87

Summary: If the clinician is planning to assess other outcome domains, it may be more practical to use the SF-12 instead of the SF-36 for measuring general health status. If the time required is still deemed excessive, a single question about self-perceived health can be utilized, as is done in the “yellow flags” questionnaire99 (see Chapter 8).









Table 8.3 The SF-36 Subscales

| Scale (SF-36 Scale Titles in Parentheses When Different) | No. of Items | Scale Items | Minimum No. of Items Needed to Compute a Score |
| --- | --- | --- | --- |
| Health perception (general health)* | 5 | 1, 33, 34, 35, 36 | 3 |
| Physical functioning* | 10 | 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 | 5 |
| Role limitations caused by physical health* | 4 | 13, 14, 15, 16 | 2 |
| Role limitations caused by emotional problems** | 3 | 17, 18, 19 | 2 |
| Social functioning** | 2 | 20, 32 | 1 |
| Mental health** | 5 | 24, 25, 26, 28, 30 | 3 |
| Bodily pain* | 2 | 21, 22 | 1 |
| Energy/fatigue (vitality)** | 4 | 23, 27, 29, 31 | 2 |

* Four scales used to calculate the physical component.

** Four scales used to calculate the mental component.


Republished with permission from Yeomans SG. The Clinical Application of Outcomes Assessment. New York, NY: McGraw-Hill Education; 2000.
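The "minimum number of items" column in Table 8.3 lends itself to a simple completeness check before a form is scored. The sketch below encodes only the item numbers and minimums listed in the table; it does not implement SF-36 scoring itself, and the dictionary and function names are hypothetical.

```python
# Scale name -> (item numbers, minimum answered items needed), as listed in Table 8.3.
SF36_SCALES = {
    "General health": ((1, 33, 34, 35, 36), 3),
    "Physical functioning": ((3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 5),
    "Role physical": ((13, 14, 15, 16), 2),
    "Role emotional": ((17, 18, 19), 2),
    "Social functioning": ((20, 32), 1),
    "Mental health": ((24, 25, 26, 28, 30), 3),
    "Bodily pain": ((21, 22), 1),
    "Vitality": ((23, 27, 29, 31), 2),
}


def scorable_scales(answered_items: set[int]) -> dict[str, bool]:
    """Report, for each scale, whether enough of its items were answered."""
    return {
        name: len(answered_items.intersection(items)) >= minimum
        for name, (items, minimum) in SF36_SCALES.items()
    }


# Hypothetical partially completed form.
answered = {1, 33, 34, 3, 4, 5, 6, 21}
for scale, ok in scorable_scales(answered).items():
    print(f"{scale}: {'can be scored' if ok else 'too many items missing'}")
```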



Patient-Reported Outcomes Measurement Information System

The Patient-Reported Outcomes Measurement Information System (PROMIS) is a continually evolving set of PROMs that are reliable and precise. The PROMIS Health Organization collects information from multiple domains within global, physical, mental, and social health.100 All of the items in each of the PROMIS tools have been extensively tested to ensure that the responsiveness, validity, and other psychometric properties are appropriate for use across diverse populations.11,101,102,103,104,105,106,107

Assessment can be delivered in a number of different ways with the option of different short or long forms in each domain, and the option to conduct computer adaptive testing, which will tailor the questions depending on the responses given.102,103,107,108,109,110 PROMIS also includes pediatric and parent proxy instruments, in multiple languages, enabling data to be captured across all age groups.103,111,112,113,114

Each of the domains in PROMIS has extensive item banks. For example, in the domain of mental health, there are item banks for the Profile Domains of Depression and Anxiety, and further item banks in the Additional Domains of Anger, Cognitive Function, Alcohol Use, Consequences & Expectancies, Smoking, Substance Abuse, Psychosocial Illness Impact, and Self-efficacy. Each item bank includes a large number of items that can be used to assess the specific domain.101,102,104,106,107,110,111,115 The variety of assessments available enables a researcher or clinician to appropriately select an instrument that will capture the data that is relevant. Factors that may influence instrument selection could include time available, the level of precision required, method of collection, the age of the patient, specific domains, etc.


Apr 17, 2020 | Posted by in PHYSICAL MEDICINE & REHABILITATION | Comments Off on Outcome Assessment
