Principles and Applications of Measurement Methods


Steven R. Hinderer

Kathleen A. Hinderer



Objective measurement provides a scientific basis for communication between professionals, documentation of treatment efficacy, and scientific credibility within the medical community. Federal, state, private third-party payer, and consumer organizations increasingly are requiring objective evidence of improvement as an outcome of treatment. Empirical clinical observation is no longer an acceptable method without objective data to support clinical decision making. The lack of reliability of clinicians’ unaided measurement capabilities is documented in the literature (1, 2, 3, 4, 5, 6, 7), further supporting the importance of objective measures. In addition, comparison of alternative evaluation or treatment methods, when more than one possible choice is available, requires appropriate use of measurement principles (8, 9, 10, 11).

Clinicians and clinical researchers use measurements to assess characteristics, functions, or behaviors thought to be present or absent in specific groups of people. The application of objective measures uses structured observations to compare performances or characteristics across individuals (i.e., to discriminate), or within individuals over time (i.e., to evaluate), or for prognostication based on current status (i.e., to predict) (12, 13). It is important to understand the principles of measurement and the characteristics of good measures to be an effective user of the tools. Standards for implementation of tests and measures have been established within physical therapy (14, 15), psychology (16), and medical rehabilitation (17) to address quality improvement and ethical issues for the use of clinical measures.

The purpose of this chapter is to discuss the basic principles of tests and measurements and to provide the reader with an understanding of the rationale for assessing and selecting measures that will provide the information required to interpret test results properly. A standardized test is one that is administered and scored in a consistent, predetermined manner: the items, the conditions for administration, the scoring procedures, and the interpretation of scores are all fixed in advance. A critical starting point is to define what is to be measured, for what purpose, and at what cost. Standardized measurements meeting these criteria should then be assessed for reliability and validity pertinent to answering the question or questions posed by the user. Measurements that are shown not to be valid or reliable provide misleading information that is ultimately useless (18).

The initial section of this chapter discusses the psychometric parameters used to evaluate tests and measures. Principles of evaluation, testing, and interpretation are detailed in the second section. A later section provides guidelines for objective measurement when a standardized test is not available to measure the behavior, function, or characteristic of interest.

The complexity and diversity of the tests and measures used in rehabilitation medicine clinical practice and research preclude itemized description in a single chapter. A prerequisite for a measure to be objective is that adequate levels of reliability have been demonstrated (18). Measures that have acceptable levels of reliability must also be shown to have appropriate types of validity to ultimately be labeled as objective. It is, therefore, imperative that the user be able to recognize the limitations of the tests he or she employs to avoid inadvertent misuse or misinterpretation of test results.


PSYCHOMETRIC PARAMETERS USED TO EVALUATE TESTS AND MEASURES

The methods developed primarily in the psychology literature to evaluate objective measures generally are applicable to the standardized tests and instruments used in rehabilitation medicine. The topics discussed in this section are the foundation for all useful measures. Measurement tools must have a defined level of measurement for the trait or traits to be assessed and a defined purpose for obtaining the measurements. Additionally, tests and measures need to be practical, reliable, and valid.


Levels of Measurement

Tests and measures come in multiple forms because of the variety of parameters measured in clinical practice and research.

Despite the seemingly overwhelming number of measures, all can be classified by level of measurement, which determines how test results should be analyzed and interpreted (19). The four basic levels of measurement are nominal, ordinal, interval, and ratio. Nominal and ordinal scales are used to classify discrete measures because the scores produced fall into discrete categories. Interval and ratio scales are used to classify continuous measures because the scores produced can fall anywhere along a continuum within the range of possible scores.

A nominal scale is used to classify data that do not have a rank order. The purpose of a nominal scale is to categorize people or objects into different groups based on a specific variable. An example of nominal data is diagnosis.

Ordinal data are operationally defined to assign individuals to categories that are mutually exclusive and discrete. The categories have a logical hierarchy, but it cannot be assumed that the intervals are equal between each category, even if the scale appears to have equal increments. Ordinal scales are the most commonly used level of measurement in clinical practice. Examples of ordinal scales are the manual muscle test scale (20, 21, 22, 23, 24) and functional outcome measures (e.g., functional independence measure [FIM]) (25).

Interval data, unlike nominal and ordinal scales, are continuous. An interval scale has sequential units with numerically equal distances between them. Interval data often are generated from quantitative instrumentation as opposed to clinical observation. It is important to note that it is statistically possible to transform ordinal data into interval data using Rasch logit scale methods. This has most notably been utilized in medical rehabilitation for analysis of FIM data. Detailed information regarding how to perform this transformation can be found elsewhere (26).

Examples of interval measurements are range-of-motion scores reported in degrees and the visual analogue pain scale (continuum from 0 to 10).
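As a rough illustration of the logit metric underlying the Rasch transformation mentioned above, the sketch below re-expresses an ordinal raw sum score on a log-odds scale. This is a minimal sketch only: it corresponds to the special case in which every item is assumed to have the same difficulty, whereas a full Rasch analysis jointly estimates person measures and item difficulties and checks model fit (26). The scale range and raw totals are hypothetical.

```python
import math

def raw_score_to_logit(raw_score, max_score):
    """Re-express an ordinal raw sum score on a log-odds (logit) metric.

    Illustration only: this simple transform assumes all items share the
    same difficulty.  A real Rasch analysis estimates item difficulties
    and person measures jointly.
    """
    p = raw_score / max_score          # proportion of the maximum possible score
    p = min(max(p, 0.01), 0.99)        # avoid infinite logits at the extremes
    return math.log(p / (1 - p))

# Hypothetical raw totals on a 0-100 ordinal scale
for raw in (20, 50, 80, 95):
    print(raw, round(raw_score_to_logit(raw, 100), 2))
```

Note how the transform stretches the ends of the raw scale: a given raw-score change near the ceiling or floor corresponds to a larger logit change than the same raw change near the middle of the scale.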

A ratio scale is an interval scale on which the zero point represents a total absence of the quantity being measured. An example is force scores obtained from a quantitative muscle strength-testing device.

Analysis of nominal and ordinal scales requires special consideration to avoid misinference from test results (27, 28). The major controversies surrounding the use of these scales are the problems of unidimensionality and whether scores of items and subtests can be summed to provide an overall score. Continuous scales have a higher sensitivity of measurement and allow more rigorous statistical analyses to be performed. Sensitivity in this context refers to a scale's ability to detect small differences or changes; it should not be confused with diagnostic sensitivity, the proportion of people with a condition or trait who test positive for it.


Purpose of Testing

After the level of measurement has been determined, the purpose of testing must be examined. Tests generally serve one of two purposes: screening or in-depth assessment of specific traits, behaviors, functions, or prognosis.


Screening Tests

Screening tests have three possible applications:



  • To discriminate between “suspect” and “normal” patients


  • To identify people needing further assessment


  • To assess a number of broad categories superficially

One example of a screening test is the Test of Orientation for Rehabilitation Patients, administered to individuals who are confused or disoriented secondary to traumatic brain injury, cerebrovascular accident, seizure disorder, brain tumor, or other neurologic events (29, 30, 31, 32). This test screens for orientation to person and personal situation, place, time, schedule, and temporal continuity. Another well-developed screening test is the Miller Assessment for Preschoolers (MAP) (33). This test screens preschoolers for problems in the following areas: sensory and motor, speech and language, cognition, behaviors, and visual-motor integration.

The advantages of screening tests are that they are brief and sample a broad range of behaviors, traits, or characteristics. They are limited, however, because of an increased frequency of false-positive results that is due to the small sample of behaviors obtained. Screening tests should be used cautiously for diagnosis, placement, or treatment planning. They are used most effectively to indicate the need for more extensive testing and treatment of specific problem areas identified by the screening assessment.


Assessment Tests

Assessment tests have five possible applications:



  • To evaluate specific behaviors in greater depth


  • To provide information for planning interventions


  • To determine placement into specialized programs


  • To provide measurements to monitor progress


  • To provide information regarding prognosis

An example of an assessment measure is the Boston Diagnostic Aphasia Examination (34). The advantages of assessment measures are that they have a lower frequency of false-positive results; they assess a representative set of behaviors; they can be used for diagnosis, placement, or treatment planning; and they provide information regarding the functional level of the individual tested. The limitations are that an extended amount of time is needed for testing, and they generally require specially trained personnel to administer, score, and interpret the results.


Criterion-Referenced Versus Norm-Referenced Tests

Proper interpretation of test results requires comparison with a set of standards or expectations for performance. There are two basic types of standardized measures: criterion-referenced and norm-referenced tests.


Criterion-Referenced Tests

Criterion-referenced tests are those for which the test score is interpreted in terms of performance on the test relative to the continuum of possible scores attainable (18). The focus is on what the person can do or what he or she knows rather than how he or she compares with others (35). Individual performance is compared with a fixed expected standard rather than a reference group. Scores are interpreted based on absolute criteria, for example, the total number of items successfully completed. Criterion-referenced tests are useful to discriminate between successive performances of one person. They are conducted to measure a specific set of behavioral objectives. The Tufts Assessment of Motor Performance (which has undergone further validation work and has been renamed the Michigan Modified Performance Assessment) is an example of a criterion-referenced test (36, 37, 38, 39, 40). This assessment battery measures a broad range of physical skills in the areas of mobility, activities of daily living, and physical aspects of communication.


Norm-Referenced Tests

Norm-referenced tests use a representative sample of people who are measured relative to a variable of interest. Norm referencing permits comparison of a single person’s measurement with those scores expected for the rest of the population. The normal values reported should be obtained from, and reported for, clearly described populations. The normal population should be the same as those for whom the test was designed to detect abnormalities (35). Reports of norm-referenced test results should use scoring procedures that reflect the person’s position relative to the normal distribution (e.g., percentiles, standard scores). Measures of central tendency (e.g., mean, median, mode) and variability (e.g., standard deviation, standard error of the mean) also should be reported to provide information on the range of normal scores, assisting with determination of the clinical relevance of test results. An example of a norm-referenced test is the Peabody Developmental Motor Scale (41). This developmental test assesses fine and gross motor domains. Test items are classified into the following categories: grasp, hand use, eye-hand coordination, manual dexterity, reflexes, balance, nonlocomotor, locomotor, and receipt and propulsion of objects.
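For example, a person's standing relative to normative data is commonly expressed as a standard (z) score or a percentile. The sketch below assumes an approximately normal distribution of scores in the reference population; the normative mean and standard deviation shown are hypothetical.

```python
from statistics import NormalDist

def standard_score_and_percentile(raw_score, norm_mean, norm_sd):
    """Express a raw score relative to a normative sample (assumed ~normal)."""
    z = (raw_score - norm_mean) / norm_sd      # standard (z) score
    percentile = NormalDist().cdf(z) * 100     # percent of the norm group scoring lower
    return z, percentile

# Hypothetical norms: mean = 50, SD = 8
z, pct = standard_score_and_percentile(raw_score=42, norm_mean=50.0, norm_sd=8.0)
print(f"z = {z:.2f}, percentile = {pct:.0f}")  # z = -1.00, percentile = 16
```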


Practicality

A test or instrument should ideally be practical, easy to use, insensitive to outside influences, inexpensive, and designed to allow efficient administration (42). For example, it is not efficient to begin testing in a supine position, switch to a prone position, and then return to supine. Test administration should be organized to complete all testing in one position before switching to another. Instructions for administering the test should be clear and concise, and scoring criteria should be clearly defined. If equipment is required, it must be durable and of good quality. Qualifications of the tester and additional training required to become proficient in test administration should be specified. The time to administer the test should be indicated in the test manual. The duration of the test and level of difficulty need to be appropriate relative to the attention span and perceived capabilities of the patient being tested. Finally, the test manual should provide summary statistics and detailed guidelines for appropriate use and interpretation of test scores based on the method of test development.


Reliability and Agreement

A general definition of reliability is the extent to which a measurement provides consistent information (i.e., is free from random error). Granger et al. (43) provide the analogy “it may be thought of as the extent to which the data contain relevant information with a high signal-to-noise ratio versus irrelevant static confusion.” By contrast, agreement is defined as the extent to which identical measurements are made. Reliability and agreement are distinctly different concepts and are estimated using different statistical techniques (44). Unfortunately, these concepts and their respective statistics often are treated synonymously in the literature.
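The distinction can be made concrete with a small numeric sketch using hypothetical ratings: two raters whose scores differ by a constant offset are perfectly consistent in relative terms, yet never agree exactly.

```python
# Two raters score the same five patients on a 7-point ordinal scale;
# rater B is consistently one point higher than rater A (hypothetical data).
rater_a = [2, 3, 4, 5, 6]
rater_b = [3, 4, 5, 6, 7]

n = len(rater_a)
mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b)) / n
sd_a = (sum((a - mean_a) ** 2 for a in rater_a) / n) ** 0.5
sd_b = (sum((b - mean_b) ** 2 for b in rater_b) / n) ** 0.5

relative_consistency = cov / (sd_a * sd_b)                           # Pearson r = 1.0
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # 0.0

print(relative_consistency, exact_agreement)
```

Here the correlation-based index is perfect while exact agreement is zero; the reverse pattern can arise when scores are tightly clustered in a homogeneous sample, as discussed below.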

The level of reliability is not necessarily congruent with the degree of agreement. It is possible for ratings to cluster consistently toward the same end of the scale, resulting in high reliability coefficients, and yet these judgments may or may not be equivalent. High reliability does not indicate whether the raters absolutely agree. It can occur concurrently with low agreement when each rater scores patients differently, but the relative differences in the scores are consistent for all patients rated. Conversely, low reliability does not necessarily indicate that raters disagree. Low reliability coefficients can occur with high agreement when the range of scores assigned by the raters is restricted or when the variability of the ratings is small (i.e., in a homogeneous population). In instances in which the scores are fairly homogeneous, reliability coefficients lack the power to detect relationships and are often depressed, even though agreement between ratings may be relatively high. The reader is referred to Tinsley and Weiss for examples of these concepts (45). Both reliability and agreement must be established on the target population or populations to which the measure will be applied, using typical examiners. There are five types of reliability and agreement:



  • Interrater


  • Test-retest


  • Intertrial


  • Alternate form


  • Population specific

Each type will be discussed below, along with indications for calculating reliability versus agreement and their respective statistics.


Interrater Reliability and Agreement

Interrater or interobserver agreement is the extent to which independent examiners agree exactly on a patient’s performance. By contrast, interrater reliability is defined as the degree to which the ratings of different observers are proportional when expressed as deviations from their means; that is, the relationship of one rated person to other rated people is the same, although the absolute numbers used to express the relationship may vary from rater to rater (45). The independence of the examiners in the training they receive and the observations they make is critical in determining interrater agreement and reliability. When examiners have trained together or confer when performing a test, the interrater reliability or agreement coefficient calculated from their observations may be artificially inflated.

An interrater agreement or reliability coefficient provides an estimate of how much measurement error can be expected in scores obtained by two or more examiners who have independently rated the same person. Determining interrater agreement or reliability is particularly important for test scores that largely depend on the examiner's skill or judgment. An acceptable level of interrater reliability or agreement is essential for comparison of test results obtained from different clinical centers. Interrater agreement or reliability is a basic criterion for a measure to be called objective. If multiple examiners consistently obtain the same absolute or relative scores, then it is much more likely that the score is a function of the measure, rather than of the collective subjective bias of the examiners (18).

Pure interrater agreement and reliability are determined by having one examiner administer the test while the other examiner or examiners observe and independently score the person's performance at the same point in time. For some parameters, however, when the skill of the examiner administering the test plays a vital role (e.g., sensory testing, range-of-motion testing) or when each examiner must administer the test directly (e.g., strength testing), it is impossible to assess pure interrater agreement and reliability. In these instances, each examiner must test the individual independently. Consequently, these interrater measures are confounded by factors of time and variation in patient performance.


Test-Retest Reliability and Agreement

Test-retest agreement is defined as the extent to which a patient receives identical scores during two different test sessions when rated by the same examiner. By contrast, test-retest reliability assesses the degree of consistency in how a person’s score is rank ordered relative to other people tested by the same examiner during different test sessions. Test-retest reliability is the most basic and essential form of reliability. It provides an estimate of the variation in patient performance on a different test day, when retested by the same examiner. Some of the errors in a test-retest situation also may be attributed to variations in the examiner’s performance. It is important to determine the magnitude of day-to-day fluctuations in performance so that true changes in the parameters of interest can be determined. Variability of the test or how it is administered should not be the source of observed changes over time. Additionally, with quantitative measuring instruments, the examiner must be knowledgeable in the method of and frequency required for instrument calibration.

The suggested test-retest interval is 1 to 3 days for most physical measures and 7 days for maximal effort tests in which muscle fatigue is involved (46). The test-retest interval should not exceed the expected time for change to occur naturally. The purpose of an adequate but relatively short interval is to minimize the effects of memory, practice, and maturation or deterioration on test performance (47).


Intertrial Reliability and Agreement

Intertrial agreement provides an estimate of the stability of repeated scores obtained by one examiner within a test session. Intertrial reliability assesses the consistency of one examiner rank-ordering repeated trials obtained from patients using the same measurement tool and standardized method for testing and scoring results within a test session. Intertrial agreement and reliability also are influenced by individual performance factors such as fatigue, motor learning, motivation, and consistency of effort. Intertrial agreement and reliability should not be confused with test-retest agreement and reliability. The latter involves test sessions usually separated by days or weeks as opposed to seconds or minutes for intertrial agreement and reliability. A higher level of association is expected for results obtained from trials within a test session than those from different sessions.


Alternate Form Reliability and Agreement

Alternate form agreement refers to the consistency of scores obtained from two forms of the same test. Equivalent or parallel forms are different test versions intended to measure the same traits at a comparable level of difficulty. Alternate form reliability refers to whether the parallel forms of a test rank order people’s scores consistently relative to each other. A high level of alternate form agreement or reliability may be required if a person must be tested more than once and a learning or practice effect is expected. This is particularly important when one form of the test will be used as a pretest and a second as a posttest.


Population-Specific Reliability and Agreement

Population-specific agreement and reliability assess the degree of absolute and relative reproducibility, respectively, that a test has for a specific group being measured (e.g., Ashworth scale scores for rating severity of spasticity from spinal cord injury). A variation of this type of agreement and reliability refers to the population of examiners administering the test (18).


Interpretation of Reliability and Agreement Statistics

Because measures of reliability and agreement are concerned with the degree of consistency or concordance between two or more independently derived sets of scores, they can be expressed in terms of correlation coefficients (35). The reliability coefficient is usually expressed as a value between 0 and 1, with higher values indicating higher reliability. Agreement statistics can range from −1 to +1, with +1 indicating perfect agreement, 0 indicating chance agreement, and negative values indicating less than chance agreement. The coefficient of choice varies, depending on the data type analyzed. The reader is referred to Bartko and Carpenter (48), Hartmann (49), Hollenbeck (50), Liebetrau (51), and Tinsley and Weiss (45) for discussions of how to select appropriate statistical measures of reliability and agreement. Table 8-1 provides information on appropriate statistical procedures for calculating interrater and test-retest reliability and agreement for discrete and continuous data types. No definitive standards for minimum acceptable levels of the different types of reliability and agreement statistics have been established; however, guidelines for minimum levels are provided in Table 8-1. The acceptable level varies, depending on the magnitude of the decision being made, the population variance, the sources of error variance, and the measurement technique (e.g., instrumentation vs. behavioral assessments). If the population variance is relatively homogeneous, lower estimates of reliability are acceptable. By contrast, if the population variance is heterogeneous, higher estimates of reliability are expected. Critical values of correlation coefficients, based on the desired level of significance and the number of subjects, are provided in tables in measurement textbooks (52, 53). It is important to note that a correlation coefficient that is statistically significant does not necessarily indicate that adequate reliability or agreement has been established, because the significance level only provides an indication that the coefficient is significantly different from zero (see Table 8-1).
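As one concrete example of the reliability statistics referenced in Table 8-1, the sketch below computes a single-measure, two-way random-effects intraclass correlation (ICC[2,1] in Shrout and Fleiss notation) from a complete subjects-by-raters score matrix. The choice among ICC forms depends on the study design, and the ratings shown are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores[i][j] is the rating of subject i by rater j (complete data).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((scores[i][j] - grand) ** 2 for i in range(n) for j in range(k))
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    bms = ss_subjects / (n - 1)              # between-subjects mean square
    jms = ss_raters / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square

    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

ratings = [[4, 5], [2, 3], [6, 6], [3, 4], [5, 6]]  # 5 subjects, 2 raters (hypothetical)
print(round(icc_2_1(ratings), 2))                   # 0.83
```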








TABLE 8.1 Interrater Reliability, Test-Retest Reliability, and Agreement Analysis: Appropriate Statistics and Minimum Acceptable Levels

                         Reliability Analysis                 Agreement Analysis
Data Type                Appropriate Statistic   Level        Appropriate Statistic   Level
Discrete
   Nominal               ICC or κw               >0.75        κ                       >0.60
   Ordinal               ICC                     >0.75        κw                      >0.60
Continuous
   Interval              ICC                     >0.75        Χ² and T                P < 0.05
   Ratio                 ICC                     >0.75        Χ² and T                P < 0.05

References: ICC: discrete (47,55), ordinal (47), continuous (44,47), minimal acceptable level (56); Cohen's κ (44,47,57,58); κw (47,59,60); κw equivalence with ICC for reliability analysis of nominal data (61-64); minimal acceptable level (65); Lawlis and Lu's Χ² and T: statistic and minimal acceptable level (43,44).

ICC, intraclass correlation coefficient; κ, kappa; κw, weighted kappa; T, T index.


Agreement and reliability both are important for evaluating patient ratings. As discussed earlier, these are distinctly different concepts and require separate statistical analysis. Several factors must be considered to determine the relative importance of each. Decisions that carry greater weight or impact for the people being assessed may require more exact agreement. If the primary need is to assess the relative consistency between raters, and exact agreement is less critical, then a reliability measure alone is a satisfactory index. By contrast, whenever the major interest is either the absolute value of the score or the meaning of the scores as defined by the points on the scale (e.g., criterion-referenced tests), agreement should be reported in addition to the reliability (45). Scores generated from instrumentation are expected to have a higher level of reliability or agreement than scores obtained from behavioral observations.

A test score actually consists of two different components: the true score and the error score (35, 54). A person’s true score is a hypothetical construct, indicating a test score that is unaffected by chance factors. The error score refers to the unwanted variation in the test score (55). All continuous scale measurements have a component of error, and no test is completely reliable. Consequently, reliability is a matter of degree. Any reliability coefficient may be interpreted directly in terms of percentage of score variance attributable to different sources (18). A reliability coefficient of 0.85 signifies that 85% of the variance in test scores depends on true variance in the trait measured and 15% depends on error variance.
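In classical test theory notation (a standard identity, shown here for clarity rather than quoted from this chapter), this variance decomposition can be written as:

```latex
X = T + E, \qquad
r_{xx} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = \frac{\sigma_T^{2}}{\sigma_T^{2} + \sigma_E^{2}},
\qquad r_{xx} = 0.85 \;\Rightarrow\; \sigma_T^{2} = 0.85\,\sigma_X^{2},\; \sigma_E^{2} = 0.15\,\sigma_X^{2}.
```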


Specific Reliability and Agreement Statistics

There are several statistical measures for estimating interrater agreement and reliability. Four statistics commonly used to determine agreement are the frequency ratio, point-by-point agreement ratio, kappa (κ) coefficients, and Lawlis and Lu's Χ² and T-index statistics. For reliability calculations, the most frequently used correlation statistics are the Pearson product-moment (Pearson r) and intraclass correlation coefficients (ICCs). When determining reliability for dichotomous or ordinal data, specific ICC formulas have been developed. These nonparametric ICC statistics have been shown to be the equivalent of the weighted kappa (κw) (56, 57, 58, 59). Consequently, the κw also can be used as an index of reliability for discrete data, and the values obtained can be directly compared with equivalent forms of ICCs (57). The method of choice for reliability and agreement analyses partially depends on the assessment strategy used (45, 48, 51, 60). In addition to agreement and reliability statistics, standard errors of measurement (SEM) provide a clinically relevant index of reliability expressed in test score units. Each statistic is described below.
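Because the SEM is referenced above but not detailed in the subsections that follow, a minimal sketch is shown here, assuming the conventional classical test theory formula SEM = SD × √(1 − reliability); the standard deviation and reliability values below are hypothetical.

```python
def standard_error_of_measurement(sd, reliability):
    """SEM in test score units: sd * sqrt(1 - reliability)."""
    return sd * (1 - reliability) ** 0.5

sem = standard_error_of_measurement(sd=8.0, reliability=0.85)
print(round(sem, 1))                               # ~3.1 score units
# An observed score of 60 would carry an approximate 95% confidence band of
print(round(60 - 1.96 * sem), round(60 + 1.96 * sem))   # roughly 54 to 66
```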


Frequency Ratio

This agreement statistic is indicated for frequency count data (47). A frequency ratio of the two examiners’ scores is calculated by dividing the smaller total by the larger total and multiplying by 100. This statistic is appealing because of its computational and interpretive simplicity. There are a variety of limitations, however. It only reflects agreement of the total number of behaviors scored by each observer; there is no way to determine whether there is agreement for individual responses using a frequency ratio. The value of this statistic may be inflated if the observed behavior occurs at high rates (60). There is no meaningful lower bound of acceptability (49).
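A minimal sketch of the frequency ratio as described above; the observer totals are hypothetical.

```python
def frequency_ratio(total_observer_1, total_observer_2):
    """Agreement on total frequency counts: (smaller total / larger total) * 100."""
    smaller = min(total_observer_1, total_observer_2)
    larger = max(total_observer_1, total_observer_2)
    return smaller / larger * 100

print(round(frequency_ratio(47, 52), 1))  # 90.4 - totals agree closely, but individual
                                          # responses may still have been scored differently
```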


Point-by-Point Agreement Ratio

This statistic is used to determine if there is agreement on each occurrence of the observed behavior. It is appropriate when there are discrete opportunities for the behavior to occur or for distinct response categories (47, 61, 62). To calculate this ratio, the number of agreements is totaled by determining the concurrence between observers regarding the presence or absence of observable responses during a given trial, recording interval, or for a particular behavior category. Disagreements are defined as instances in which one observer records a response and the other observer does not. The point-by-point agreement percentage is calculated by dividing the number of agreements by the number of agreements plus disagreements and multiplying by 100 (62). Agreement generally is considered to be acceptable at a level of 0.80 or above (62).

The extent to which observers are found to agree is partially a function of the frequency of occurrence of the target behavior and of whether occurrence and/or nonoccurrence agreements are counted (61). When the rate of the target behavior is either very high or very low, high levels of interobserver agreement are likely for occurrences or nonoccurrences, respectively. Consequently, if the frequency of either occurrences or nonoccurrences is high, a certain level of agreement is expected simply owing to chance. In such cases, it is often recommended that agreements be included in the calculation only if at least one observer recorded the occurrence of the target behavior. In this case, intervals during which none of the observers records a response are excluded from the analysis. It is important to identify clearly what constitutes an agreement when reporting point-by-point percentage agreement ratios because the level of reliability is affected by this definition.
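A minimal sketch of the point-by-point agreement ratio for interval recording, showing both the all-intervals calculation and the occurrence-only variant described above; the observation records are hypothetical (True = response recorded).

```python
obs_a = [True, False, True, True, False, True, False, False]
obs_b = [True, False, False, True, False, True, False, True]

def point_by_point_agreement(a, b, occurrence_only=False):
    """Percentage agreement = agreements / (agreements + disagreements) * 100."""
    pairs = list(zip(a, b))
    if occurrence_only:
        # keep only intervals in which at least one observer recorded the behavior
        pairs = [p for p in pairs if p[0] or p[1]]
    agreements = sum(x == y for x, y in pairs)
    return agreements / len(pairs) * 100

print(point_by_point_agreement(obs_a, obs_b))                        # 75.0 (all intervals)
print(point_by_point_agreement(obs_a, obs_b, occurrence_only=True))  # 60.0 (occurrence only)
```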


Kappa Coefficients

The κ coefficient provides an estimate of agreement between observers, corrected for chance agreement. This statistic is preferred for discrete categorical (nominal and ordinal) data because, unlike the two statistics discussed above, it corrects for chance agreements. In addition, percentage agreement ratios often are inflated when there is an unequal distribution of scores between rating categories. This often is the case in rehabilitation medicine, in which the frequency of normal characteristics is much higher than abnormal characteristics (63, 64). By contrast, κ coefficients provide accurate estimates of agreement, even when scores are unequally distributed between rating categories (64).

Kappa coefficients are used to summarize observer agreement and accuracy, determine rater consistency, and evaluate scaled consistency among raters (60). Three conditions must be met to use κ:



  • The patients or research subjects must be independent.


  • The raters must independently score the patients or research subjects.


  • The rating categories must be mutually exclusive and exhaustive (63, 64).

The general form of κ is a coefficient of agreement for nominal scales in which all disagreements are treated equally (45, 48, 51, 65, 66, 67, 68). The κw statistic was developed for ordinal data (48, 51, 69, 70), in which some disagreements have greater gravity than others (e.g., the manual muscle testing scale, in which the difference between a score of 2 and 5 is of more concern than the difference between a score of 4 and 5). Refer to the references cited above for formulas used to calculate κ and κw.
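A minimal sketch of the unweighted κ for two raters and nominal categories (formulas for κw are given in the references cited above); the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_1)
    categories = set(rater_1) | set(rater_2)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

r1 = ["normal", "normal", "abnormal", "normal", "abnormal", "normal"]
r2 = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]
print(round(cohens_kappa(r1, r2), 2))   # 0.67: observed agreement 0.83, chance-expected 0.50
```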

Several other variations of κ have been developed for specific applications. The kappa statistic κv provides an overall measure of agreement, as well as separate indices for each subject and rating category (71). This form of κ can be applied in situations in which subjects are not all rated by the same set of examiners. The variation of κ described by Fleiss et al. is useful when there are more than two ratings per patient (58); a computer program is available to calculate this statistic (63). When multiple examiners rate patients and a measure of overall conjoint agreement is desired, the kappa statistic κm is indicated (72). Standard κ statistics treat all raters or units symmetrically (58). When one or more of the ratings are considered to be a standard (e.g., scores from an experienced rater), alternate analysis procedures should be used (72, 73, 74).
