Types of variables

*Continuous variables* can take an infinite number of values between the minimum and maximum. For example, WOMAC can take any score from 0 to 2400 mm, as in the example above. Another example is a visual analog pain score, which can fall anywhere between 0 and 10. Age is also a continuous variable.

*Categorical variables* are also known as discrete or qualitative variables. These variables can be put into categories/groups. Categorical variables can be further characterized as nominal, ordinal, or binary.

*Nominal variables* are categorical variables where the groups are related, but there is no natural ordering among them, for example, race, type of insurance, or geographical regions (North, South, East, West).

*Ordinal variables* are categorical variables where there is a natural order among the groups, such as ranking scales or letter grades. Differences between groups are not precisely meaningful. Consider a pain scale whose options are mild, moderate, and severe. If one patient scores her pain as mild and another as moderate, we cannot state the precise difference between their scores as we could with a continuous variable; we can only say that moderate is greater than mild.

*Binary variables* are also called dichotomous variables. A variable is said to be binary when there are only two possible levels. Variables that can be phrased as a yes/no question are in this category. For example, if we were looking at gender, we would most probably categorize somebody as either “male” or “female.” This example of a binary variable is also a nominal variable. In another example, if the education level of patients were categorized into less than high school and greater than high school, it would be a binary variable. Since there is some order to this classification, it is also an ordinal variable. Thus, when a categorical nominal or ordinal variable has just two possible outcomes, it can be labeled as a binary variable.

Continuous variables can be put into categories for analysis. For example, DAS28-ESR is a continuous variable whose value ranges from 0 to 9.4. It was developed and validated by EULAR (European League Against Rheumatism) to measure the progress and improvement of rheumatoid arthritis. DAS28 is often categorized in clinical trials as clinical remission (0 to < 2.6), low disease activity (2.6 to < 3.2), moderate disease activity (3.2 to 5.1), and high disease activity (> 5.1 to 9.4). Because these categories have a natural order, the result is a categorical ordinal variable.
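As a minimal sketch of how a continuous score can be binned, the function below applies the DAS28 cut-offs quoted above. The function name `das28_category` and the sample scores are hypothetical and used only for illustration.

```python
def das28_category(score: float) -> str:
    """Map a DAS28 score to the disease-activity category quoted in the text."""
    if not 0 <= score <= 9.4:
        raise ValueError(f"DAS28 score out of range: {score}")
    if score < 2.6:
        return "clinical remission"
    if score < 3.2:
        return "low disease activity"
    if score <= 5.1:
        return "moderate disease activity"
    return "high disease activity"


# Hypothetical scores, for illustration only
scores = [1.9, 2.6, 3.1, 5.1, 6.4]
categories = [das28_category(s) for s in scores]
print(categories)
```

Note that the boundary values follow the intervals as written: 2.6 falls in low disease activity, and 5.1 falls in moderate disease activity.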

- 2.

While studying social determinants of health affecting disease activity in rheumatoid arthritis patients, it was discovered that patient income was not normally distributed in the particular cohort. Around 65% of the cohort had an income level below $30,000 per year.

While describing the distribution of this data, which of the following measures of central tendency and dispersion would be most appropriate?

- A.

Mean and standard deviation

- B.

Mean and interquartile range

- C.

Median and standard deviation

- D.

Median and interquartile range

Correct answer: D

Non-normally distributed data are generally presented as median with interquartile range, because both measures are less affected by extreme values than the mean and standard deviation.

Numerous methods are used to summarize the distribution of data. The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. Two descriptive statistics can provide key information efficiently. The first is the measure of central tendency, which describes where the center of the distribution of the variable lies. The second is the measure of dispersion, which describes the spread of the variable. Since measures of central tendency alone are not adequate to describe data, measures of dispersion add value.

- A.

*Mean*: the arithmetic average of the observations in a sample. Add up all the observations and divide by the number of observations. This is usually used to describe normally distributed data.

*Median*: the middle observation, i.e., half of the observations are smaller and half are larger. To find the median, the observations must first be listed in numerical order from smallest to largest. As in the question above, this is usually used for non-normally distributed data.

*Mode*: the observation that occurs most frequently in a sample. If no number in the list is repeated, then there is no mode for the list.

*Range*: the difference between the largest and the smallest observation. This can be used to describe any type of data, and is most useful when one is interested in knowing the most extreme (i.e., highest and lowest or largest and smallest) observations in a sample. The prime advantage of this measure of dispersion is that it is easy to calculate. On the other hand, it has several disadvantages: it is very sensitive to outliers and does not use all the observations in a data set.

*Interquartile Range*: the difference between the 75th and 25th percentiles; in other words, the spread of the central 50% of all observations. This is most often used for non-normally distributed data, but can be used to describe the middle 50% of any sample. If the interquartile range is large, the middle 50% of observations are spaced widely apart. The important advantage of the interquartile range is that it can be used as a measure of variability even if the extreme values are not recorded exactly. Its main disadvantage as a measure of dispersion is that it is not amenable to mathematical manipulation.

*Standard Deviation (SD)*: the square root of the sum of squared deviations from the mean divided by the number of observations. This is used for normally distributed data to describe the spread of the observations about the mean. In other words, it tells you how close the data are to the average value in the sample. The advantage of SD is that, along with the mean, it can be used to detect skewness. The disadvantage is that, even though it can help detect skewness, it is not an ideal measure of dispersion for skewed data.
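The verbal definition of the standard deviation can be written compactly as a formula. The symbols are introduced here for illustration and do not appear in the text: $x_i$ is the $i$-th observation, $\bar{x}$ is the sample mean, and $n$ is the number of observations.

```latex
\[
\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
```

Many textbooks and software packages divide by $n - 1$ rather than $n$ when the SD is estimated from a sample, so reported values can differ slightly depending on the convention used.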

Modes, medians, and interquartile ranges are used to describe non-normally distributed data.
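The measures defined above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up, right-skewed income sample (in thousands of dollars); the data are hypothetical and chosen only to show how a few large values pull the mean above the median.

```python
import statistics

# Hypothetical annual incomes (thousands of dollars), right-skewed
incomes = [12, 15, 18, 20, 20, 25, 28, 29, 45, 60, 95, 250]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
mode = statistics.mode(incomes)
data_range = max(incomes) - min(incomes)

# quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3)
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

sd_population = statistics.pstdev(incomes)  # divides by n, as in the text
sd_sample = statistics.stdev(incomes)       # divides by n - 1

print(f"mean={mean:.1f}, median={median}, mode={mode}")
print(f"range={data_range}, IQR={iqr:.1f}")
print(f"SD (n)={sd_population:.1f}, SD (n-1)={sd_sample:.1f}")
```

In this sample the mean is roughly twice the median, which illustrates why the median and interquartile range are preferred for skewed data such as income.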

- 3.

To study the mortality and morbidity in patients with antiphospholipid syndrome, 1000 patients were studied starting in 1999. Disease activity, complications, hospitalizations, and deaths were assessed systematically in these patients over time [4].

What type of study design is described above?

- A.

Case control study

- B.

Retrospective cohort study

- C.

Prospective cohort study

- D.

Randomized controlled study

Correct answer: C

There are several types of study designs as summarized in Fig. 2.3.

- A.

Our first distinction is whether the study is experimental (i.e., there is an active intervention as a part of the study design) or is observational [5].

In *experimental studies*, the researcher manipulates the exposure; that is, he or she allocates subjects to the intervention or exposure group.

*Randomized controlled trials*: subjects are randomly divided into groups. One group receives the intervention (patients and researchers may be blinded to treatment), and all groups are followed forward in time. At the end of the study, the frequency of the outcome is compared between groups. This study design reduces the effect of unmeasured (confounding) variables that may influence the outcomes of a study.

*Non-randomized controlled trials*: these are similar to the randomized controlled trials described above, except that they lack random assignment to treatment or control.
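Random allocation can be illustrated with a short sketch. The example below performs simple 1:1 randomization by shuffling subject identifiers; the function name `randomize`, the identifiers, and the seed are all hypothetical, and real trials typically use more elaborate schemes such as block or stratified randomization.

```python
import random


def randomize(subject_ids, seed=None):
    """Simple 1:1 randomization: shuffle the subjects and split them in half."""
    rng = random.Random(seed)      # seeded generator for a reproducible allocation
    shuffled = list(subject_ids)   # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}


# Hypothetical subject identifiers, for illustration only
subjects = [f"P{i:03d}" for i in range(1, 21)]
arms = randomize(subjects, seed=42)
print(len(arms["intervention"]), len(arms["control"]))
```

Because assignment depends only on chance, measured and unmeasured characteristics tend to be balanced across the two arms as the sample size grows.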

An *observational study* draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

*Cohort studies*: subjects are divided into groups based on the presence or absence of an exposure and observed over a period; the frequency of the outcome is then compared between groups. When the subjects are followed forward in time, it is known as a prospective cohort study. A retrospective cohort study (also called a historic cohort) looks back at a group of individuals who shared a common exposure and compares them with individuals who did not, to determine the influence of the exposure on the incidence of a condition such as disease or death.

*Case control studies*: subjects are divided into groups based on the presence or absence of the outcome of interest, and then the frequency of risk factors in each group is compared.

*Cross-sectional studies*: in this type of study, the presence of the presumed risk factor and the presence of the outcome are measured at the same time in a population.

*Case report/case series*: these studies give a detailed report of the signs, symptoms, diagnoses, treatment, and follow-up of an individual or a group of individuals.

- 4.

A rheumatologist designed a screening questionnaire for the diagnosis of fibromyalgia. A blinded comparison was made with clinician diagnosis of fibromyalgia, based on chart review, for 200 patients. Among the 50 patients who had fibromyalgia according to the standard (clinician diagnosis), 35 were positive on the screening questionnaire. Among the 150 patients who did not have fibromyalgia per clinician diagnosis, 30 had a positive screening questionnaire.

Which of the following statements is *false*?

- A.

Specificity was 80%.

- B.

Positive predictive value was 70%.

- C.

Negative predictive value was 88%.

- D.

The prevalence was 25%.

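Correct answer: B

The positive predictive value is 35/65 ≈ 54%; the value of 70% is the sensitivity (35/50). The other statements are consistent with the data: specificity is 120/150 = 80%, negative predictive value is 120/135 ≈ 88.9%, and prevalence is 50/200 = 25%.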

Using the information, Table 2.1 is generated, and diagnostic tests are evaluated as described below.

- A.

2 × 2 table of the data presented above

Total = 200 | Fibromyalgia present | Fibromyalgia absent |
---|---|---|
Tests positive | 35 | 30 |
Tests negative | 15 | 120 |
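The standard measures of diagnostic test performance follow directly from the four cells of the 2 × 2 table. The sketch below computes them from the counts in the question; the variable names (`tp`, `fp`, `fn`, `tn`) are illustrative shorthand for true positives, false positives, false negatives, and true negatives.

```python
# Counts from the 2 x 2 table (clinician diagnosis is the reference standard)
tp, fp = 35, 30    # questionnaire positive: fibromyalgia present / absent
fn, tn = 15, 120   # questionnaire negative: fibromyalgia present / absent

total = tp + fp + fn + tn

sensitivity = tp / (tp + fn)      # positives among those with disease
specificity = tn / (tn + fp)      # negatives among those without disease
ppv = tp / (tp + fp)              # disease among those testing positive
npv = tn / (tn + fn)              # no disease among those testing negative
prevalence = (tp + fn) / total    # disease among all patients

for name, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Positive predictive value", ppv),
    ("Negative predictive value", npv),
    ("Prevalence", prevalence),
]:
    print(f"{name}: {value:.1%}")

# Sensitivity: 70.0%
# Specificity: 80.0%
# Positive predictive value: 53.8%
# Negative predictive value: 88.9%
# Prevalence: 25.0%
```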