Our ability to develop rational treatments for sports medicine injuries results from an understanding of the general laws that regulate the musculoskeletal system. Building such an understanding requires that experiments be conducted to observe how the system operates under different conditions. The way such observations are made depends on the type of phenomenon observed. In certain circumstances, a phenomenon can be considered deterministic—that is, every time it is observed, the outcome is exactly the same. For example, a ball with a known size and weight dropped from a certain height will always hit the ground with essentially the same force, no matter how many times it is observed. In this situation, few observations are needed to know what the results will be. However, certain phenomena may occur only sometimes, and each time one of these phenomena occurs or is tested, different results can be obtained. Because these results occur with a certain probability, such phenomena are considered probabilistic. To determine whether some consistency exists in the effects produced, or whether relationships or associations occur, such phenomena must be observed many times, and the data obtained must be analyzed statistically.
This chapter defines and describes several terms and techniques used in statistics. Its purpose is to give the reader a general understanding and appreciation of clinical and basic science statistical analysis, along with the ability to perform it. In addition, this chapter will provide the reader with the means to recognize both the validity and the limitations of the conclusions that are drawn. Not all statistical topics are covered here; only those that are current or closely related to the study of musculoskeletal phenomena are included. It is hoped that this chapter will serve as a springboard for the reader to seek a more advanced understanding of statistical analytic methods.
Hypothesis Testing
To examine a phenomenon and extract information from observations of it, we must first describe it in a quantitative form. We often have a preconceived notion of how a group of persons will respond to a treatment, and we thus formulate a hypothesis. A hypothesis, simply stated, is a supposition that appears to describe a phenomenon and acts as a basis of reasoning and experimentation. Hypotheses may be extremely basic (e.g., “growth factor binding to extracellular matrix proteins activates cellular protein synthesis”) or very applied (e.g., “use of the patellar tendon for surgical reconstruction of the anterior cruciate ligament results in loss of quadriceps strength”).
A hypothesis is an untested statement based on previous information, a hunch, or an intuition. It can be stated for any group of observations. However, to ensure that the observations noted are appropriate, the measurements made are accurate, the experiment is performed efficiently, and the conclusions drawn are valid, the hypothesis must be stated in a quantitative form. Clearly stating the hypothesis focuses attention on the central issues and ensures that a given phenomenon can be properly evaluated and reexamined.
A hypothesis to be tested using statistical methodology is presented in terms of the null hypothesis. The null hypothesis states that experimental treatment has no effect—that is, it is null. As an example, we could propose the following null hypothesis: There is no difference in ultimate tensile strength between a patellar tendon that is a surgically repaired substitute for the anterior cruciate ligament and a normal anterior cruciate ligament. To test this hypothesis, ultimate tensile strength from a control group that has no injury to the ligament is compared with ultimate strength from an experimental group that underwent surgical repair using the patellar tendon.
In the aforementioned example, we would then test the null hypothesis that there is no difference in tensile strength between the patellar tendon autograft and normal ligament. If we find that the hypothesis is not true, a difference exists, which is the most important thing we want to know. If we had chosen any other hypothesis—for example, there is a 25% loss in strength—and we found that this hypothesis was not true, we still would not know whether a 50%, 10%, 5%, 4%, or no difference in strength may exist. Without more testing and analysis, we would not have a significant conclusion.
In one of the most common embodiments, an experiment based on a null hypothesis is designed with a control group that receives no treatment and one or more experimental groups that receive treatment ( Fig. 9-1 ).
Four outcomes are possible when the null hypothesis is tested ( Table 9-1 ). The null hypothesis can be either true or false, and in each case it can be accepted or rejected (that is, believed or not believed). Of the four potential decisions, two are correct and two are incorrect ( Table 9-1 ). Suppose, in our ligament strength example, that the null hypothesis is actually true; that is, no difference in ligament strength exists between experimental and control groups. But also suppose that, based on our analysis or limited sample, we conclude that a difference does exist (that is, we reject a true null hypothesis). In statistics, rejection of a true null hypothesis is an incorrect conclusion known as a type I error. In this experiment, committing a type I error would imply that a significant effect of surgical repair exists when, in fact, it does not exist. This error can be viewed in more clinical terms as a false-positive result ( Table 9-2 ).
Table 9-1

| Null Hypothesis | Accepted | Rejected |
| --- | --- | --- |
| True | Correct decision | Type I error |
| False | Type II error | Correct decision |
Table 9-2

| Condition | Greek Symbol | Clinical Meaning | Controlled Using |
| --- | --- | --- | --- |
| Type I error | α | False positive | Significance level |
| Type II error | β | False negative | Statistical power |
An alternate possibility is that the null hypothesis is false; that is, surgical repair has a significant effect on ligament strength. If, for similar reasons, we chose to believe that there was no effect on ligament strength, we would be falsely accepting the null hypothesis. Accepting a false null hypothesis is known as a type II error. In this experiment, committing a type II error would imply that surgery had no effect when, in fact, it actually had an effect, which can be viewed in clinical terms as a false-negative result ( Table 9-2 ).
In the aforementioned experiment, what would it mean clinically to commit a type I or type II error, and if one did commit such errors, how serious would they be? First, if we committed a type I error, we would state that a difference in strength exists with surgical repair using the patellar tendon when, in fact, no difference exists. We would be stating that something is different when it really is not different. We then would probably try to find another tendon that showed no difference in strength, and we would modify our rehabilitation program when we used the patellar tendon in the procedure. We would continue to test different tendons until we found one that showed no difference in strength. However, if we kept committing the same type of error, we might never find the right tendon. Then we would be forced to accept a procedure as a compromise—a less than desirable condition. In spite of the time and expense involved in such experimentation, we still would be left with the incorrect conclusion.
If we made a type II error, we would be stating that no difference in strength exists when the patellar tendon is used, when in fact a difference does exist. We would be missing something that is really there, which is a more serious error in this example, not only because it can lead us to abandon the attempt to find a better substitute, but because it can lead us to perform a procedure that could subsequently produce complications. In effect, we would continue to offer our patients a compromised procedure while believing it performs as well as the original ligament.
The seriousness of each type of error depends on the situation; in the best situation, we must try not to commit either type of error. If we cannot eliminate errors, we must at least try to reduce the chance of committing either type of error. The probability or chance of committing a type I error—that is, the probability of finding an effect when one does not really exist—is known as the significance level of a statistical test, termed “alpha” (α). The probability or chance of committing a type II error—that is, the probability of failing to find an effect when one really exists—is known as “beta” (β). In statistical testing, the term “significance level” is used to describe a type I error, and “statistical power” (defined as 1-β) is used to describe a type II error. The calculation of α, β, or power is based on the science of probability and need not be detailed here because it can be found readily in statistical tables and/or automatically calculated by a computer program (an excellent “shareware” power calculation program can be found at http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/download-and-register ). Its importance to readers of this text lies in the awareness that both types of errors can exist and in the understanding of how they are avoided.
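To make the meaning of the significance level concrete, the short Python sketch below (a minimal illustration, not from the original text; the group size, mean, and variability are arbitrary assumptions) simulates many experiments in which the null hypothesis is true and counts how often a two-sample t test nevertheless declares significance at α = .05. The observed false-positive (type I error) rate should hover near 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05            # chosen significance level (allowed type I error rate)
n_experiments = 10_000
n_per_group = 15        # hypothetical sample size per group

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the SAME population, so the null hypothesis is true.
    control = rng.normal(loc=240, scale=50, size=n_per_group)
    treated = rng.normal(loc=240, scale=50, size=n_per_group)
    _, p_value = stats.ttest_ind(control, treated)
    if p_value < alpha:
        false_positives += 1   # rejecting a true null hypothesis = type I error

print(f"Observed type I error rate: {false_positives / n_experiments:.3f}")  # ~0.05
```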
Statistical Definitions and Descriptions of Observations
Our ability to extract the truth from a set of observations rests on our ability to accurately quantify the phenomenon and use the appropriate tools to analyze it, which requires that we have terms that describe our data quantitatively. A few of the most important terms are reviewed here.
Statistical Terms
Mean
The sample mean, designated $\bar{X}$, is the average of all variates for a sample and is an unbiased estimator of the population mean µ. The population mean is the most probable value within the population and the one that the investigator is trying to estimate based on the mean of the data sample. The sample mean estimates the population mean well when the population is normally distributed. The values within a normally distributed sample fit the classic bell-shaped curve ( Fig. 9-2 ). The shape of the normal distribution is very specific; the curve cannot be too tail heavy if it is to be considered normally distributed. Many natural measurements, such as length, height, and mass, are usually normally distributed, whereas many others, especially ratios and percentages, are almost never normally distributed. One of the first concerns in the statistical analysis of experimental data is therefore whether the sample variates are normally distributed.
The units of the mean are the same as the units of the variable. Sample mean is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

where $\sum_{i=1}^{n}$ = the arithmetic summation of all n values of $X_i$; $X_i$ = the value of an individual variate (read as the "ith" variate); and n = the sample size.
Variance
The sample variance s² is a measure of the spread of data about the sample mean and is an estimator of the population variance σ². As shown in Figure 9-2 , each population not only has its most probable value, the mean µ, but it also has a certain variability or variance σ² about that value. It is more difficult to extract the "truth" from a population that has a large variance because the likelihood of obtaining a variate near the mean is lower when sample variability is high.
As an example, to determine whether a particular exercise caused an increase in quadriceps strength after surgery, two groups of persons were studied—one group treated with traditional therapy, and one group treated with therapy using the new exercise. Average group strengths can be compared, and the results of the comparison will indicate whether the exercise was efficacious. Suppose that the results in one population were highly variable ( Fig. 9-3 ). The high variability might make it difficult to determine whether the two samples are truly different because the group using the new exercise has many values that overlap with the traditional therapy group. It is fairly easy to detect a difference between the two means $\mu_1$ and $\mu_2$ in Figure 9-3, top ; however, with increased population variability ( Fig. 9-3 , bottom ), this difference is less easy to demonstrate.
The proper units used to express variance are variable units squared. The calculation of variance uses a “sum of squared terms” as a type of expression:
$$s^2 = \frac{\sum_{i=1}^{n} (\bar{X} - X_i)^2}{n - 1}$$
The variance term contains a squared difference between the sample mean and the individual variate. This squared difference represents the "distance" from the mean to the value of a particular variate; it is squared to eliminate the sign of the difference (positive or negative) so that variability of either sign is summed over the entire sample. The entire summed, squared difference is divided by (n − 1) to yield a sort of "average" difference. The term "n − 1" is used instead of the more intuitively appealing "n" because statisticians have determined that dividing by n tends to slightly underestimate the population variance, particularly when sample sizes are small. An example of a variance might be 44 (N × m)², which describes the variability of a sample of 12 individual knee extension torque variates.
Standard Deviation
Standard deviation (SD) is the square root of the sample variance. SD is used more often than the variance to describe variability because the units of SD are the same as the original variate units. An example of an SD is 6.6 N × m; reported together with the mean (as mean ± SD), it conveys the estimated population mean along with information about population variability.
The calculation of SD is simply:
$$\text{SD} = \sqrt{s^2}$$

where $s^2$ = the calculated sample variance.
SD has a very useful property for normally distributed data: approximately 68% of the variates lie within one SD of the mean, approximately 95% lie within two SDs of the mean, and more than 99% lie within three SDs of the mean ( Fig. 9-2 ). Because these values refer to the variability of the original sample, they can be used to make powerful predictions regarding the variability of the original population if the sample properly represents the entire population.
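These percentages come directly from the shape of the normal distribution and can be verified with a few lines of Python (a quick check using scipy's standard normal distribution):

```python
from scipy import stats

# Fraction of a normal population lying within k standard deviations of the mean
for k in (1, 2, 3):
    fraction = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} SD of the mean: {fraction:.3f}")
# Prints approximately 0.683, 0.954, and 0.997
```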
An example of this type of approach was published in a surgical study by investigators who were interested in understanding the surgical approaches that would avoid damage to the superficial branch of the radial nerve (SBRN). It was important to know where the SBRN became subcutaneous from the interval between the brachioradialis and the extensor carpi radialis longus, because external fixator pins are frequently inserted in this area and thus surgical approaches must avoid the SBRN. In cadaveric specimens the researchers measured the distance between the subcutaneous SBRN and an external bony landmark such as the radial styloid process, which can be palpated. They found that this distance was mean ± SD = 9.0 ± 1.4 cm, which enabled them to conclude that, in approximately 95% of persons from the general population, the subcutaneous SBRN region extended from 6.2 cm (mean − 2 SD) to 11.8 cm (mean + 2 SD) proximal to the radial styloid process. Knowledge of this 95% range can therefore help us avoid surgical complications of nerve injury.
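The same interval can be reproduced in Python (a minimal sketch that assumes the measurements are normally distributed and uses the published mean and SD of 9.0 and 1.4 cm):

```python
from scipy import stats

mean_cm, sd_cm = 9.0, 1.4   # reported mean and SD of the subcutaneous SBRN location

# Range expected to contain ~95% of individuals (mean ± about 2 SD)
low, high = stats.norm.interval(0.95, loc=mean_cm, scale=sd_cm)
print(f"~95% of SBRN locations: {low:.1f} to {high:.1f} cm proximal to the radial styloid")
# Prints roughly 6.3 to 11.7 cm; using exactly ±2 SD gives the 6.2 to 11.8 cm quoted above
```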
Standard Error of the Mean
The standard error of the mean (SEM) is the variability associated with the estimate of the population mean. It describes the level of confidence we have that the mean, which is determined from a sample of a given population, represents the mean of the entire population. SEM is calculated as:
$$\text{SEM} = \frac{\text{SD}}{\sqrt{n}}$$
where SD = standard deviation and n = sample size.
SD is a relatively stable estimate of population variability that does not change systematically with sample size, whereas SEM shrinks as sample size grows and does not estimate population variability at all. Because SEM represents the accuracy of the estimate of the mean, it is preferable to use SEM when comparisons are made between means. SD, which is related to population variability, is preferable when it is necessary to express the variability of the original population. For example, SD might be preferred when describing the baseline characteristics of a group of experimental subjects because it would provide the reader with an idea of the level of variability in the population from which the sample was obtained. However, when comparing a treatment group with a control group, SEM may be preferable because the accuracy of the individual mean values is of interest.
Coefficient of Variation
The coefficient of variation (CV) is a unitless indicator of population variability expressed relative to the mean. CV is calculated as:
$$\text{CV} = \frac{\text{SD}}{\bar{X}} \times 100\%$$
so that the CV is expressed without the original units of measurement. Because CV is independent of units and absolute variate magnitude, it provides a general feel for a population’s variability and can be used to describe experimental datasets “generically.” No “acceptable” level of variability exists for a particular population. Thus in a clinical experiment involving complex treatment of persons who have variable characteristics, a CV of 50% to 100% might be expected and accepted, whereas in a laboratory experiment involving a more homogeneous species and a clearly defined procedure, a CV of 10% to 25% would be more likely. It is much easier to determine whether significant effects of a particular treatment exist when the CV of the sample is low.
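To make these descriptive statistics concrete, the following Python sketch computes the sample mean, variance, SD, SEM, and CV for a small set of invented knee extension torque values (the numbers are hypothetical and serve only to illustrate the calculations):

```python
import numpy as np

# Hypothetical knee extension torques (N x m) for a sample of n = 12 subjects
torques = np.array([205, 218, 231, 242, 199, 225, 237, 210, 246, 222, 215, 230])

n = len(torques)
mean = torques.mean()              # sample mean, X-bar
variance = torques.var(ddof=1)     # sample variance s^2 (divides by n - 1)
sd = torques.std(ddof=1)           # standard deviation, same units as the data
sem = sd / np.sqrt(n)              # standard error of the mean
cv = sd / mean * 100               # coefficient of variation, in percent

print(f"mean = {mean:.1f} N*m, s^2 = {variance:.1f} (N*m)^2, "
      f"SD = {sd:.1f} N*m, SEM = {sem:.1f} N*m, CV = {cv:.1f}%")
```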
Choice of Significance and Power Values
Terms such as SD, SEM, and CV and the calculation of such parameters illustrate that an experiment does not always work the way we expect it to work. It is the variation that these terms represent that causes us to question or believe the results we obtain, and these terms give us an indication of how much we can believe the conclusions that have been drawn. However, there is no true or ideal answer to the question of how much variability or error we can accept before we will no longer believe or disbelieve the results. The investigator must decide what to accept, and decisions will vary depending on the nature of the study. Statistics provide a means of quantifying what we wish to accept, as expressed in the P value and α and β levels.
The P value is closely tied to α, the probability of committing a type I error in a given experiment ( Table 9-2 ). When a report states that the results were significant ( P < .05), the investigator is saying that, if the null hypothesis were true, a result at least this extreme would be obtained less than 5% of the time. Often we conclude that if a type I error would be committed only 5% of the time, we can believe the results; that is, we expect the conclusions drawn to be found not just in the representative sample but in the entire population as well. The problem with this automatic use of P < .05 as the level for statistical significance is that many times, especially in clinical situations, it may not be reasonable or even safe to commit a type I error 5% of the time, whereas in other cases it might be acceptable to commit a type I error a greater percentage of the time. The significance level α should be determined based on its meaning in the context of the experiment performed.
The P level chosen by the investigator as determining significance for the results obtained from a particular experiment is called the critical P level; this level may be different from the one that actually is obtained when the study is run. Whereas most investigators are familiar with setting limits for type I error by choosing a critical P value, they are not as familiar with limiting type II error. However, controlling type II error can be as important as or more important than controlling type I error, as described in the next example.
Many of us have observed presentations where a small sample size was used (for example, n = 3), statistical analysis was performed, and a P value greater than .05 was obtained. The speaker concluded that the treatment had no effect. Immediately, a protestor, believing the sample size to be too small, claimed that the speaker committed a type II error.
In another presentation, we may observe a surgeon who performed an experiment using a small sample size in which he or she attempted to compare a new surgical technique with the standard technique. Based on a high P value, the surgeon concluded that no significant difference existed between the new and standard methods and that the new method should be used because it is easier and cheaper. Is this conclusion appropriate?
Although this conclusion might be correct, we would also want to be sure that if a P value greater than .05 were obtained, we are not committing a type II error by incorrectly accepting a false null hypothesis. In the aforementioned example, we may wish to design the experiment with a power of 95%. In that case, we would be 95% sure that if the surgical repair had an effect on ligament strength (i.e., the null hypothesis was false), we would not falsely conclude that it did not have such an effect.
Several methods that use graphs, tables, and equations have been developed to allow the experimenter to set the significance level (α, the critical P value) and the statistical power for an experiment and then to determine the sample size required to achieve that design. Using these methods, the experimenter chooses α and β, estimates the sample variance, and anticipates the magnitude of the treatment effect ( Fig. 9-4 ).
A survey of the scientific literature, especially that related to biology and medicine, reveals that an overwhelming majority of investigators set the critical α value as 0.05. It should be obvious that nothing is magic about an α value of 0.05. This value simply indicates that the investigator is willing to accept committing a type I error 5% of the time and still believe that the results obtained are true. However, situations may exist in which the investigator is not willing to commit a type I error 5% of the time or even 1% of the time. In such cases, the critical α value should be adjusted accordingly—that is, made lower. An example of this concept is an experiment in which the investigator attempts to demonstrate a significant decrease in knee laxity using a new surgical procedure compared with an established procedure. If the critical α value is 0.05, the investigator is willing to conclude 5% of the time that the new surgical procedure is more effective, even if it actually is not. If the new procedure represents an increased risk to the patient or a significant increase in expense or rehabilitation time, the surgeon may be willing to commit a type I error only 1% of the time or a fraction of a percent of the time. In such a case, a critical α value of 0.05 may be too high.
At times, type II error may be more important to an investigator than type I error. For example, suppose that a safe experimental drug was administered to prevent thrombophlebitis after knee surgery. In this case, a type I error would indicate that the drug had an effect when in fact it did not. The detriment to the patient is that he or she would take a drug that had no effect and subsequently would be at risk for experiencing the problem that the drug should have prevented. However, suppose a type II error were committed in the same study. A type II error would indicate that the drug had no effect when in fact it did have an effect. In this case, an effective drug would be withheld from the patient, which could represent a much larger problem. In this example, it might be appropriate to demand a power of 99.9% while relaxing the critical α value to 0.1. The interpretation of the meaning of the α value is therefore paramount in selecting its value and in guarding against a cookbook application of statistical methods.
Calculation of Sample Size
To properly test a hypothesis and ensure that the conclusions drawn accurately represent the phenomenon, we must be confident that our sample adequately represents the behavior of the population. Clearly, the greater the "n" of the sample, the greater the likelihood that our conclusions will, in general, be correct. However, the greater the sample size, the greater the amount of testing and cost required to perform the investigation. Therefore a balance must be struck between adequate representation and available resources.
A number of statistical methods have been developed to calculate sample size for various experimental designs. Each method is specific for the experimental model used. Several of these methods are presented in the Suggested Readings section at the end of this chapter. It follows logically that sample size will increase as we design an experiment to be more and more selective (in other words, to decrease the probability of either type I or type II error) or as we try to resolve very small experimental effects (also known as effect size). Therefore, to design any experiment, we must specify the type I error rate, that is, the acceptable probability that a false-positive result will be found. We then choose the type II error rate by specifying the power, and we compute the sample size given the experimental variability and the effect size we wish to detect. Having specified both type I and type II error rates, we can interpret the data in a straightforward manner: if the P value exceeds .05, we conclude that the treatment has no effect, and we can be confident that this conclusion reflects a truly ineffective treatment rather than an inadequate sample size.
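Statistical software automates this calculation. The sketch below (a minimal example; the anticipated effect size, α, and power are arbitrary illustrative choices) uses the statsmodels library to solve for the per-group sample size of a two-group comparison, playing the same role as the G*Power program mentioned earlier:

```python
from statsmodels.stats.power import TTestIndPower

# Anticipated standardized effect size = (expected difference in means) / SD.
# Suppose we expect a 30 N*m difference in strength and an SD of about 40 N*m.
effect_size = 30 / 40    # = 0.75

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,          # acceptable type I error rate
                                   power=0.80,          # 1 - beta (limits type II error)
                                   alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.1f}")   # roughly 29 subjects
```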
Examples of Common Statistical Tests
Once an experiment has been designed and performed, the data must be analyzed. Scores of computer programs are available to perform these tasks. Only a few methods and their associated experimental designs will be discussed in this chapter. Importantly, these methods serve as models for the many methods in use. The reader is encouraged to try to perform these analyses using the software to which he or she has access.
Student t Test
One of the simplest experimental designs involves the comparison of two groups—one group treated experimentally and one that serves as an untreated control group ( Fig. 9-1 ). A characteristic of the experimental group is measured and then compared with the same characteristic of the control group to determine whether the particular treatment had a significant effect. Consider a case in which the experimental sample represents the quadriceps extension strength of 15 persons who underwent conservative treatment for a femoral fracture. At the end of 4 weeks of cast immobilization, quadriceps strengths of the treated persons are measured and compared with the quadriceps strengths of the normal legs of 15 untreated persons. It is hoped that these persons would be matched for physical and socioeconomic factors.
Suppose that the average strength of the immobilized leg was 210 ± 15 N × m (mean ± SEM) and the average strength of the control leg was 240 ± 13 N × m. Are these leg strengths significantly different? When the study includes one or two experimental groups, the traditional statistical analysis involves the use of the Student t test. Note that analysis of variance (ANOVA) yields exactly the same results and is also generally applicable to more than two groups and to more complex designs. It is thus preferable to learn how to use ANOVA. However, for the sake of simplicity, we present this example of the use of the t test.
The null hypothesis for the t test is $H_0: \mu_1 = \mu_2$, where $\mu_1$ and $\mu_2$ represent the means of the first and second groups, respectively. If the sample sizes are equal, the statistic used to compare the two means is the t statistic, which is calculated as:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{s/\sqrt{n}} = \frac{\bar{X}_1 - \bar{X}_2}{\text{SEM}}$$

where $\bar{X}_1$ and $\bar{X}_2$ = the sample means for groups 1 and 2, respectively, and SEM = the average SEM for the two groups.
If the sample sizes are not equal, the equation is only slightly modified and can be found in most statistical texts. Thus the t statistic calculates “how many” standard errors two means are apart from one another. In the current example,
$$t = \frac{240 - 210}{14} = 2.14$$
The degrees of freedom for this experimental design is a(n − 1) = 2(14) = 28, and the critical t value for a significance level of .05 is 2.048. Thus the calculated t value of 2.14 is (barely) statistically significant at the .05 level. The two-sample t test is easily modified to treat the one sample case where a particular sample mean is compared with a hypothetical mean value. Sometimes this is of interest because an experimental value is compared with a predefined level the investigator chose. In this case, the t statistic is calculated as:
$$t = \frac{\bar{X} - \mu}{\text{SEM}}$$

where $\bar{X}$ = the sample mean and µ = the hypothetical mean to which the sample mean is compared.
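Both calculations are easy to reproduce with scipy. The sketch below first recomputes the two-sample worked example from its summary statistics (the means, SEMs, and group sizes quoted above) and then runs a one-sample t test on a small set of invented torque values compared against a hypothetical target mean:

```python
import numpy as np
from scipy import stats

# --- Two-sample example, from summary statistics (simplified formula in the text) ---
mean_control, mean_immobilized = 240, 210      # N x m
sem_avg = (15 + 13) / 2                        # average SEM of the two groups = 14
n_per_group, n_groups = 15, 2

t_value = (mean_control - mean_immobilized) / sem_avg    # = 2.14
df = n_groups * (n_per_group - 1)                         # = 28
t_critical = stats.t.ppf(1 - 0.05 / 2, df)                # two-tailed critical t = 2.048
p_value = 2 * stats.t.sf(abs(t_value), df)                # two-tailed P value
print(f"t = {t_value:.2f}, critical t = {t_critical:.3f}, P = {p_value:.3f}")

# --- One-sample example, from raw (hypothetical) data ---
torques = np.array([228, 241, 219, 235, 247, 230, 224, 238])   # invented values, N x m
t_one, p_one = stats.ttest_1samp(torques, popmean=240)         # compare with 240 N x m
print(f"one-sample t = {t_one:.2f}, P = {p_one:.3f}")
```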
For the cases in which more than two groups are to be considered, ANOVA is required to properly extract the appropriate information.
Analysis of Variance
The purpose of ANOVA is to determine whether a significant difference exists between two or more sample means. This statistical test is often used in the experimental setting to determine whether an experimental treatment has a significant effect. In practice, this analysis tests the null hypothesis that the means of “a” groups are equal. In other words, the null hypothesis for ANOVA is that:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_a$$

where $\mu_a$ represents the mean of the $a$th group and $H_0$ is again the abbreviation for the null hypothesis.
As previously mentioned, when ANOVA involves only two groups, the analysis is mathematically equivalent to the Student t test.
ANOVA Assumptions
ANOVA assumes that the various sample groups are normally distributed and that the groups have equivalent variances (homogeneity of variance). These assumptions are important because deviations from them can invalidate ANOVA results. As previously mentioned, not all bell-shaped curves are normally distributed. A population that is normally distributed can be described in terms of its mean µ and its variance σ² ( Fig. 9-2 ). The population mean of each group can be estimated by the arithmetic average of its sample, and each sample variance describes each group's variability.
ANOVA Table
An example is used to explain ANOVA and the ANOVA table. Suppose we are interested in determining whether a difference exists in average muscle fiber area between three quadriceps muscles. In this experiment we would obtain three groups of data from, for example, the vastus medialis (VM), the vastus lateralis (VL), and the rectus femoris (RF) muscles. These raw data are plotted in Figure 9-5 as mean ± SEM. In this example, we have three groups and six samples per group. The null hypothesis in this experiment is stated as:
$$H_0: \mu_{VL} = \mu_{VM} = \mu_{RF}$$

where $\mu_{VL}$ = the mean of the sample obtained from the vastus lateralis muscle, $\mu_{VM}$ = the mean of the sample obtained from the vastus medialis muscle, and $\mu_{RF}$ = the mean of the sample obtained from the rectus femoris muscle.
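In practice, this one-way ANOVA can be run with a single library call; the section that follows unpacks what such a call computes. The sketch below uses scipy's `f_oneway` with invented fiber-area values (six samples per muscle, as in the example) purely to illustrate the mechanics:

```python
from scipy import stats

# Hypothetical mean fiber areas (um^2), six samples per muscle
vastus_lateralis = [4650, 4820, 4510, 4930, 4700, 4780]
vastus_medialis  = [4380, 4290, 4460, 4520, 4350, 4410]
rectus_femoris   = [4050, 4180, 3990, 4120, 4210, 4060]

f_value, p_value = stats.f_oneway(vastus_lateralis, vastus_medialis, rectus_femoris)
print(f"F = {f_value:.2f}, P = {p_value:.4f}")
# A small P value leads us to reject H0 that the three population means are equal.
```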
Calculation of the ANOVA Statistics
To determine whether a significant difference exists in average muscle fiber area between the three muscles using ANOVA, we use a computer program of some type to determine whether the measured difference in means among the groups is significantly greater than the “noise” in the data. The details of this calculation are actually very straightforward. As an estimate of the “noise,” we first calculate the variance within each group relative to its average by:
$$\sigma_i^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$
which is simply a modified form of the variance equation previously presented. The variance of the ith group is calculated from the difference between each individual observation $Y_i$ and the mean of that sample, $\bar{Y}$. Those differences, which are calculated for each variate, are squared and summed, and the sum of squares (SS) is divided by the sample size to calculate the variance within that group. (Actually, for mathematical reasons, the SS is divided by n − 1 to obtain the within-group variance.) As we saw previously, the procedure in which one value is subtracted from another and the difference squared is extremely common in statistical equations.
To estimate the overall “noise level” of the entire dataset across all three samples, the variances of all the individual groups are averaged. This overall noise is termed the mean squared error (MSE):
$$\text{MSE} = \frac{\sum_{j=1}^{a} \left( \dfrac{\sum_{i=1}^{n} (Y_{ij} - \bar{Y}_j)^2}{n - 1} \right)}{a}$$
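The same quantities can be computed directly with numpy; the sketch below reuses the invented fiber-area values from the previous code example to show that the MSE is simply the average of the three within-group variances:

```python
import numpy as np

# Hypothetical fiber-area data, one row per muscle (a = 3 groups, n = 6 samples each)
groups = np.array([
    [4650, 4820, 4510, 4930, 4700, 4780],   # vastus lateralis
    [4380, 4290, 4460, 4520, 4350, 4410],   # vastus medialis
    [4050, 4180, 3990, 4120, 4210, 4060],   # rectus femoris
])

# Within-group variance for each group: sum of squared deviations / (n - 1)
within_group_variances = groups.var(axis=1, ddof=1)

# Mean squared error: the average of the within-group variances (equal group sizes)
mse = within_group_variances.mean()
print("Within-group variances:", np.round(within_group_variances, 1))
print(f"MSE = {mse:.1f}")
```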