32 Clinical Trial Design and Analysis
An appropriate trial design is decisive in maximizing the likelihood that a trial yields results that are interpretable, robust, and applicable. The decisions made during the design phase reflect a continuous balance between internal validity and external validity.
Several disrupting factors, such as missing data, dropouts, and confounding, may jeopardize the interpretation of trial results. The consumers of trial results (investigators, pharmaceutical industry workers, readers of medical journals) must interpret the results of a trial in the light of these potentially disturbing factors.
Clinical trials are studies designed to assess the efficacy and toxicity of drugs or other interventions. Although the term clinical trial often refers to “randomized clinical trial (RCT),” every study in which patients are exposed to an intervention and in which data are systematically collected can be considered as a clinical trial. Clinical trials play an important role in new drug development, but can also serve to further explore existing drugs or interventions or to refine their use, as in determination of predictive factors for treatment efficacy, or to test treatment strategies. This chapter broadly confines itself to RCTs. It discusses methodologic principles of clinical trials, as well as RCT analysis, in the context of rheumatology, and it unveils limitations of RCTs while briefly discussing alternative solutions in design and analysis. This chapter is intended as an introduction for clinicians and researchers working in the field of rheumatology.
The classical template of an RCT includes two (or more) trial arms comparing the drug or intervention of interest (e.g., the new drug) with a control intervention. The latter may include a placebo or sham intervention, or an intervention that is considered to represent standard care. By definition, the treatment arms are created through the process of randomization, which is pivotal and will be outlined later in greater detail.
To better understand differences in trial design, it is often helpful to distinguish explanatory RCTs from pragmatic RCTs.1–3 Trials of new drugs, such as those designed for drug registration, aimed at showing efficacy and short-term safety, belong to the group of explanatory RCTs. In general, all elements of trial design, such as selection of patients, sample size, choice of the comparative intervention, and duration of the trial, are chosen in such a manner that the trial can optimally demonstrate a treatment effect, that is, a difference in efficacy between the new drug and the control intervention. The methodologic robustness of a trial, which depends on these elements of trial design, is referred to as internal validity. Explanatory trials do not always resemble clinical practice. As an example, for methodologic reasons they often include patients with a high level of disease activity, who form only a minority in clinical practice. The extent to which clinical trial results can be extrapolated to common clinical practice is referred to as external validity. As a rule of thumb, explanatory trials have a high level of internal validity, which may, however, jeopardize external validity to some extent. Pragmatic trials more closely resemble the clinical situation. Such trials aim to optimize treatment by further exploring existing drugs or treatment strategies. Pragmatic trials incorporate fundamental principles of RCTs, such as randomization, but include a more realistic representation of patients, may have a longer duration, and may allow co-interventions. In general, pragmatic trials have a lower level of internal validity than explanatory trials, but a higher level of external validity. Explanatory trials are often initiated and sponsored by the pharmaceutical industry, whereas most pragmatic trials are (academic) investigator-driven initiatives.
Randomization, the process by which patients are assigned to treatment by chance, is the most important methodologic characteristic of an RCT and deserves some explanation. Randomization makes treatment arms similar for all variables except treatment; in other words, it distributes all known and unknown variables that may or may not be of prognostic importance equally across treatment groups, thus reducing the probability that factors other than treatment influence the results. It is important to realize that randomization does not completely preclude imbalances. Differences in variables that are of prognostic importance may simply occur by chance. However, randomization precludes intentional imbalances (e.g., dissimilarities created by physicians who consider a particular treatment more appropriate for a particular patient [selection]).
From a statistical perspective, chance differences occur more frequently in small study samples than in larger ones, and may be of greater magnitude in small trials. It is therefore necessary to compare treatment groups at baseline with respect to important prognostic variables, and to adjust for differences in the statistical analysis in case of doubt. Usually, computer-generated randomization lists are used to randomize patients in an RCT. Technically, randomization is often performed in “blocks,” so that in every block of, say, four or ten patients, there will be equal numbers of patients in the treatment and control groups. Randomizing in blocks ensures that, even if the sample size turns out smaller than expected, approximately equal numbers of patients will be included in each treatment group. Often, in multicenter trials, one center is assigned one or more blocks, ensuring that the numbers of patients receiving the new drug and the control drug are evenly distributed per center.
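The logic of block randomization can be sketched in a few lines of code. This is an illustrative sketch only (the function name, arm labels, and block size are arbitrary choices, not taken from any real trial system); actual trials use validated randomization software.

```python
import random

def block_randomize(n_patients, block_size=4, arms=("treatment", "control"), seed=None):
    """Generate a block-randomized allocation list.

    Each block contains an equal number of assignments to every arm,
    so the arms stay (nearly) balanced even if enrollment stops early.
    """
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_patients:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # permute assignments within the block
        allocations.extend(block)
    return allocations[:n_patients]

# With blocks of four, the arms can never differ by more than two patients
schedule = block_randomize(10, block_size=4, seed=42)
```

Note that when the total sample size is a multiple of the block size, the arms are exactly equal; otherwise they differ by at most half a block.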
Some trials randomize patients within strata (stratification) of equal or unequal size. Stratification (a better wording is stratified randomization) makes sense only if the variable subject to stratification represents a prognostically important feature. An appropriate example is a situation in which circumstantial evidence suggests that the efficacy of a treatment differs in males as compared with females. Stratified randomization with “males” and “females” as strata implies that randomization to treatment groups occurs after assignment to the appropriate stratum. This allows a justifiable comparison between treatment groups within each stratum because there is prognostic similarity at baseline. Appropriate stratified randomization requires a trial design and a sample size that indeed allow such a comparison (see later discussion on statistical power).
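Operationally, stratified randomization amounts to keeping an independent block-randomized list per stratum. The sketch below is a hypothetical illustration (function name, strata, and block size are invented for this example), not trial software:

```python
import random
from collections import defaultdict

def stratified_assign(patients, stratum_of, arms=("treatment", "control"),
                      block_size=4, seed=None):
    """Assign each patient to an arm using a separate block-randomized
    list per stratum, so that arms stay balanced within every stratum
    (e.g., within males and within females), not just overall."""
    rng = random.Random(seed)
    pending = defaultdict(list)   # unused assignments per stratum
    assignment = {}
    for patient in patients:
        stratum = stratum_of(patient)
        if not pending[stratum]:  # start a fresh block for this stratum
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
            pending[stratum] = block
        assignment[patient] = pending[stratum].pop()
    return assignment

# Hypothetical example: 8 male and 8 female patients, stratified by sex
patients = [f"M{i}" for i in range(8)] + [f"F{i}" for i in range(8)]
allocation = stratified_assign(patients, stratum_of=lambda p: p[0], seed=7)
```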
Stratified randomization should be distinguished from post hoc subgroup analysis, in which the “strata” are determined during analysis of the trial. In such post hoc comparisons, prognostic similarity cannot be assured, and statistical adjustments can account for this only rarely.
A fundamental choice in designing a study is the decision between a superiority design and a nonsuperiority design. The latter can theoretically be further categorized into a noninferiority design and an equivalence design. The basis for this choice is the null hypothesis underlying the study, and its consequences are important. If a new treatment is tested against placebo, the a priori hypothesis is that this new drug is more effective than placebo, and a superiority design is a rational choice. If treatments are already available for a particular disease or condition, it is often not ethically justifiable to subject patients to a placebo treatment for longer periods. It is not always rational to assume that a new treatment will be better than the best treatment available at that moment, and a superiority design would have a high likelihood of failure. In such situations, one can opt for a nonsuperiority design. These designs have the underlying hypothesis that the treatment to be tested is at least not worse than (or is equivalent to) the comparative treatment, which can be the standard of care or, alternatively, the best currently available treatment.
In a superiority design, the question is whether the new treatment is more efficacious than the control intervention (e.g., placebo). Formally, such a study tests whether the null hypothesis of no difference between the treatment groups can be rejected. To do so, investigators agree on a minimal clinically important difference (MCID) between the intervention of interest and the control intervention that the study should be able to demonstrate, and they design the study so that this difference can be demonstrated with high likelihood (statistical power) when it truly exists (see later). In a noninferiority design, the reasoning is reversed: the null hypothesis is that the new treatment is less efficacious than the control intervention.4,5 Even if the new intervention and the control intervention are truly similarly effective, a trial will almost never yield a treatment effect of exactly zero (no difference). There will be variation around zero, and it is the task of investigators to decide, in the design phase of the study, how large a deviation from zero they will accept while still concluding that the interventions are equivalent: the noninferiority margin. Determination of the MCID in a superiority design and of the noninferiority margin in a noninferiority design is a subjective decision with important consequences for the sample size. When it is important in a superiority design to demonstrate very small treatment effects with a high likelihood, large sample sizes are needed; the same is true of a very narrow noninferiority margin in a noninferiority design. Especially with a noninferiority design, considerations other than efficacy alone may guide the choice of the noninferiority margin. If a new drug is less toxic or less costly than existing drugs on the market, and as such may provide additional benefits, one could be more lenient in determining the noninferiority margin.
In general, noninferiority designs require (far) more patients than are required by superiority designs.
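The dependence of sample size on the margin can be illustrated with the standard normal-approximation formula for comparing two means, n per group = 2(z_alpha + z_beta)^2 * sd^2 / delta^2. The numbers below (an outcome standard deviation of 1.2, an MCID of 0.6, and a noninferiority margin of 0.3) are purely hypothetical and chosen only to show the mechanics:

```python
from statistics import NormalDist

def n_per_group(sd, delta, alpha, beta, one_sided=False):
    """Normal-approximation sample size per group for comparing two means:
    n = 2 * (z_alpha + z_beta)^2 * sd^2 / delta^2.
    delta is the MCID (superiority) or the margin (noninferiority);
    noninferiority tests are conventionally one-sided."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2

# Hypothetical planning numbers, 80% power (beta = 0.2)
superiority = n_per_group(sd=1.2, delta=0.6, alpha=0.05, beta=0.2)        # MCID 0.6
noninferior = n_per_group(sd=1.2, delta=0.3, alpha=0.025, beta=0.2,
                          one_sided=True)                                 # margin 0.3
# With identical z values, halving delta quadruples the required n
```

Because the noninferiority margin is typically chosen narrower than the MCID, the required sample size grows roughly with the inverse square of the margin, which is why noninferiority trials tend to be much larger.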
Subjects who are entered into clinical studies should meet accepted criteria for the disease or disorder under study. Most rheumatologic conditions lack single and unequivocal diagnostic tests, and classification criteria have been developed to identify patients with similar characteristics.6 These classification criteria serve as eligibility criteria in an RCT. To homogenize patient populations for scientific purposes, classification criteria are designed to be highly specific. As a consequence, sensitivity may fall short, and classification criteria are often of limited use in diagnosis. The high specificity of classification criteria has implications for the makeup of the trial population. In general, patients with classic, often severe disease are overrepresented, and those with early, less typical disease are underrepresented.
In many trials in rheumatology, patients must meet certain criteria for disease activity or duration. Some trials require that the patient experience a flare after withdrawal of medication as evidence of active disease. Other studies define disease activity before withdrawal of medication as evidence of lack of response to current treatment. Disease severity can be defined by accepted clinical criteria or by lack of response to previous treatments. For example, RA studies may be limited to patients who have not yet received methotrexate (who presumably have relatively early or mild disease) or to patients who have failed treatment with at least one other disease-modifying antirheumatic drug (DMARD) (greater disease severity). There are ethical and methodologic reasons for the use of such activity and/or severity criteria. Ethical arguments may dictate that a novel intervention first be tested in patients with severe, sometimes intractable disease in which common alternatives have failed. The methodologic argument is that a treatment effect can best be demonstrated in a population of patients prone to change. Most inflammatory rheumatic diseases have a cyclic course characterized by exacerbations and remissions. Patients with a high level of disease activity will tend to improve over time, even without an intervention, a phenomenon known as regression toward the mean; the additional effect of a new intervention in comparison with a control treatment can be demonstrated more easily in such a context.
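Regression toward the mean can be demonstrated with a small simulation: if patients are enrolled because a noisy measurement of disease activity exceeds a threshold, the group mean improves at follow-up even with no treatment at all. All distributions, cutoffs, and scales below are illustrative assumptions, not real clinical data:

```python
import random

def simulate_rtm(n=10000, cutoff=1.0, seed=0):
    """Each patient has a stable true disease level plus measurement noise.
    Enrolling only those whose *measured* activity exceeds the cutoff
    preferentially selects patients whose noise was positive; at follow-up
    the noise re-draws, so the group mean drops without any intervention."""
    rng = random.Random(seed)
    baseline, follow_up = [], []
    for _ in range(n):
        true_level = rng.gauss(0, 1)            # stable underlying activity
        visit1 = true_level + rng.gauss(0, 1)   # noisy screening measurement
        if visit1 > cutoff:                     # trial enrolls only 'active' patients
            baseline.append(visit1)
            follow_up.append(true_level + rng.gauss(0, 1))
    return sum(baseline) / len(baseline), sum(follow_up) / len(follow_up)

entry_mean, followup_mean = simulate_rtm()
# followup_mean is clearly lower than entry_mean despite zero treatment effect
```

This is exactly why placebo arms in trials enrolling highly active patients typically show apparent improvement, and why the comparison of interest is always between arms, never within one arm over time.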
Exclusion criteria usually include conditions such as cancer; cardiac, hepatic, or renal disease; abnormalities in hematologic parameters; medication allergies; and pregnancy. Exclusion criteria serve to decrease background noise, or variability due to differences in patient characteristics. In general, inclusion and exclusion criteria homogenize the trial patient population and contribute to an environment that is optimal for demonstrating a treatment effect. Inclusion and exclusion criteria also prevent entry of patients in whom an adverse response is more likely to occur and those for whom the experimental treatment could be dangerous a priori. As such, inclusion and exclusion criteria contribute to a high level of internal validity but jeopardize external validity. Explanatory trials usually have a comprehensive set of inclusion and exclusion criteria. Pragmatic trials are more lenient in this regard because they should better reflect common clinical practice.
Ethical considerations determine whether eligible subjects participate in a clinical trial. Governmental agencies of most countries require that institutions involved in human research have a local institutional review board (IRB). The IRB reviews all protocols before implementation and monitors ongoing studies at its institution. A crucial element in the review of a trial is the informed consent process.7 The consent form should explain to the study participant the purpose of the study, all potential benefits and risks (including risks to pregnant mother and fetus), alternatives to participation, and who is responsible for conducting the study. Patient confidentiality should be ensured. The consent form should clearly state that participation is completely voluntary, and that refusal to participate or withdrawal from the study will not affect future care. If compensation is provided, this must be documented in the consent form. Participants should be given contact information for questions or in case of injury and a statement about whether any medical treatment will be given if injury occurs. Investigators are responsible for ensuring that the risk to subjects is minimized and appropriate for the anticipated benefits.
The optimal duration of the trial represents a compromise between economic, ethical, and methodologic considerations. A trial should not be too short, because an intervention needs time to exert its potentially advantageous (but also deleterious) effects; in particular, a short trial does not reflect the clinical reality of most rheumatologic conditions. Equally, a trial should not be too long, because RCTs are expensive, patients should not be subjected to experimental interventions with uncertain adverse events for an excessive period, and prolonged follow-up increases the risk of longitudinal bias. Longitudinal bias may occur if, during the trial, the treatment groups increasingly become dissimilar as the result of selective dropout, co-interventions, or other behavior of patients or physicians. Selective dropout may occur if patients with a particular profile preferentially withdraw from one of the treatment groups, thus creating prognostic imbalance. A common example is that of an RA trial comparing an effective drug versus placebo. Patients with relatively severe and active disease in the placebo group may preferentially discontinue trial medication and drop out because they experience no benefit, while patients with less severe and less active disease remain in the trial. Co-interventions, whether allowed or not, may similarly jeopardize prognostic similarity if they occur in an unbalanced (i.e., unequal) manner across treatment groups. A common example is a trial that tests a nonsteroidal anti-inflammatory drug (NSAID) versus placebo with respect to the relief of pain. Simple analgesics, if preferentially used in the placebo group, may inadvertently influence pain scores, leading to incorrect conclusions. The treating physician may contribute to prognostic imbalance by prescribing co-interventions, or more generally by treating patients differently according to their clinical response or the occurrence of adverse events.
A relatively short follow-up will decrease the likelihood that unintended events will occur and as such contributes to maintaining prognostic similarity and increasing internal validity.
Explanatory trials usually have a follow-up duration that is as short as possible, and co-interventions are prohibited. The most important limitation of short-term trials in lifelong rheumatologic diseases is that they do not appropriately reflect the course of the disease encountered in clinical practice. Sometimes, RCTs, especially pragmatic trials, have a long trial duration that better reflects the clinical reality. In such a trial, internal validity is deliberately sacrificed to some extent in favor of external validity (generalizability) and the yield of long-term information.
In double-blind studies, neither the patient nor the investigator is aware of the treatment group assignment. In single-blind studies, the investigator is aware of the treatment allocation, but the patient is not. In open-label studies, both patient and investigator are aware of the treatment assignment. The most important reason to blind treatment allocation is to prevent expectations about the type of treatment provided from influencing the measured outcome (expectation bias), especially (but not exclusively) if the measured outcome includes subjective components. Note that subjective refers to the patient (e.g., pain scores) as well as to the investigator (e.g., joint scores). To avoid the latter, many drug trials make use of independent (joint) assessors, who are not responsible for decisions regarding patient care and are blinded to the treatment. Another common example of a blinded independent assessor in rheumatology is the reader of radiographs in imaging studies in RA, psoriatic arthritis, or ankylosing spondylitis (AS).
Regardless of any precautions taken, unblinding may inadvertently occur because of identifiable adverse reactions or minor side effects, lack of efficacy, or changes in laboratory parameters. A meaningful effect of such unblinding is not easy to prove, nor can it be adjusted for in the analysis.
Any clinical trial has one or more outcome variables of interest. Outcome is broadly defined and refers to a clinical situation, or a change in a clinical situation, that is quantifiable by using assessment instruments. Outcome variables can measure real outcome that directly affects the patient (e.g., vertebral fracture in osteoporosis), or alternatively can reflect a situation that is associated with real outcome but does not (yet) affect the patient (e.g., low bone mineral density in osteoporosis). The latter type of outcome is often referred to as surrogate outcome. Surrogate outcome measures are used rather than real outcome measures because they occur (far) earlier and more frequently and can often be assessed on a continuous scale (a statistical advantage), whereas real outcome measures often describe an event (the presence or absence of a clinical situation), with negative implications for statistical power.
The Outcome Measures in Rheumatology Clinical Trials (OMERACT) initiative was created to bring unanimity to the multitude of outcome measures in rheumatology on the basis of expert consensus.8 Its activities were initiated in RA and have been expanded to include most other rheumatologic diseases. At the core of the OMERACT framework is the so-called OMERACT filter, which describes the methodologic prerequisites that an outcome measure should fulfill to be considered valid for clinical trials. The OMERACT filter prescribes three validation requirements: an outcome measure should be truthful, discriminatory, and feasible.
Truthful refers to whether an outcome measure truly measures what it is intended to measure, and approximates the concepts of face, content, and construct validity. It means, for example, that the disease activity score (DAS) in RA should truly measure what is considered important in RA, such as swelling and tenderness of joints (content validity), and that it represents a relevant construct for describing the disease process of RA, for example because the DAS is associated with radiographic progression and limited physical function (construct validity). Discriminatory refers to whether an outcome measure can be measured reliably (intraobserver and interobserver variation), whether it can distinguish between two stages of the disease (e.g., RA with high vs. low disease activity), whether it should be applied to groups of patients or to an individual patient, whether the measure is sensitive to change (e.g., whether the DAS measurably decreases if the disease improves), and whether the measure can discriminate between groups of patients on effective therapy and those on placebo or less effective therapy. Feasible refers to whether an outcome measure is easily applicable and inexpensive in the setting in which it is intended to be used.
Ten biennial OMERACT conferences have resulted in sets of outcome measures for almost all inflammatory rheumatologic diseases and for numerous noninflammatory disorders. These so-called core sets have substantially improved homogeneity across clinical trials, thus favoring comparability. In the design phase of a clinical trial, it is highly recommended to choose a primary outcome measure from these core sets, and to measure all components of the core set as secondary outcome measures. Reporting of all core-set measures prevents selective reporting of only positive results with respect to a few variables.
Increasingly, indices are replacing single-outcome variables in rheumatology. An index is a weighted or unweighted combination of single variables that together reflect a particular domain of outcome.9 A general rule is that indices perform better than single-item variables only if they consist of variables that correlate moderately with each other. If variables correlate too strongly, there is redundancy of information. If variables do not correlate at all, they reflect different domains; this complicates interpretability, and it is better to describe them separately. Important examples of useful indices in rheumatology are the already mentioned disease activity score (DAS),10 the ankylosing spondylitis disease activity score (ASDAS),11 and the American College of Rheumatology (ACR) response criteria in RA.12
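As an illustration of such a weighted index, the ESR-based DAS28 combines four moderately correlated components into one score. The coefficients below are those commonly reported in the literature for the DAS28; verify them against the primary reference before any real use, and note that the example patient values are invented:

```python
import math

def das28_esr(tender28, swollen28, esr, patient_global):
    """DAS28 (ESR version) as commonly reported in the literature:
    0.56*sqrt(TJC28) + 0.28*sqrt(SJC28) + 0.70*ln(ESR) + 0.014*GH,
    where TJC28/SJC28 are tender/swollen counts over 28 joints,
    ESR is in mm/h, and GH is the patient global assessment (0-100 mm VAS)."""
    return (0.56 * math.sqrt(tender28)
            + 0.28 * math.sqrt(swollen28)
            + 0.70 * math.log(esr)
            + 0.014 * patient_global)

# Hypothetical patient: 8 tender joints, 6 swollen joints, ESR 30 mm/h, GH 50 mm
score = das28_esr(8, 6, 30, 50)
print(round(score, 2))  # prints 5.35
```

The square-root and logarithmic transforms down-weight extreme joint counts and ESR values, so no single component dominates the index, which is one reason a composite like this can discriminate better than any of its items alone.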