The use of high-throughput omics may help to understand the contribution of genetic variants to the pathogenesis of rheumatic diseases. We discuss the concept of missing heritability: that the genetic variants identified to date do not fully explain the heritability of rheumatoid arthritis and related rheumatologic conditions. In addition to an overview of how integrative data analysis can lead to novel insights into mechanisms of rheumatic diseases, we describe statistical approaches to prioritizing genetic variants for future functional analyses. We illustrate how analyses of large datasets provide hope for improved approaches to the diagnosis, treatment, and prevention of rheumatic diseases.
Key points
- •
Large genetic studies of rheumatic diseases have implicated many risk loci.
- •
Within risk loci, the identity and function of the pathogenic variants that underlie rheumatic diseases remain largely unknown, but methods in development will address these gaps in knowledge.
- •
Integrative analysis of omics datasets will yield new insights into the molecules, cells, tissues, and pathways that initiate and perpetuate rheumatic diseases.
- •
Functional characterization of prioritized genetic variants will pave the way for better diagnosis, treatment, and prevention of rheumatic diseases.
Introduction
The study of rheumatic diseases draws on many genome-scale technologies. Box 1 defines relevant terms that will be used in this discussion. Genome-wide association studies (GWAS) and other genetic studies have identified and replicated numerous loci associated with rheumatic diseases. Although these findings have led to increased awareness of particular pathogenetic pathways, there are multiple impediments to the translation of these results to the clinic. First, and as expected, the variants identified thus far do not account for the entirety of the heritable basis of any given rheumatic disease. Second, genetic variants in close physical proximity tend to be inherited together (linkage disequilibrium, or LD; see Box 1). As a result, a rheumatic disease risk locus usually contains multiple associated variants, from which the actual pathogenic variants are difficult to separate. This is most pronounced in the major histocompatibility complex region, where there are hundreds of associated variants, many of which are in strong LD. However, new techniques that leverage transethnic and annotation data will help narrow the search for single-nucleotide polymorphisms (SNPs) that are directly pathogenic. Finally, determining the mechanisms of action of pathogenic variants is challenging, due to interaction effects, cell type–specific gene expression, the local tissue milieu, the temporal course of gene expression, and complicating environmental factors.
Box 1. Glossary of relevant terms
- •
5′ untranslated region (5′-UTR): The region directly upstream of the initiation codon and translation start site. In mRNA, the sequence of this region strongly influences translation, and likewise the corresponding regions of template DNA contain many elements that can produce a marked effect on transcription.
- •
ATAC-Seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing): This technique is used to study chromatin accessibility (accessible or protected), which is related to transcription factor binding and gene expression.
- •
Copy number variation (CNV): A form of genetic variation resulting in a change in the number of copies of a gene or genomic element. Deletion and insertion of DNA by a variety of mechanisms can produce genetic variants affecting as little as a few kilobases (kb) or as much as an entire chromosome. CNVs have been difficult to assay using common technologies, affect a substantial portion of the genome, and influence a variety of diseases, including rheumatic diseases.
- •
CpG site: A DNA sequence consisting of a cytosine nucleotide followed in the 5′-to-3′ direction by a guanine nucleotide, the two joined by a phosphate group. Cytosines in CpG sites can be methylated to form 5-methylcytosine, which can change the expression of nearby genes.
- •
DNA methylation: Modification of DNA by attachment of a methyl group to DNA nucleotides. One common site of methylation is CpG sites (see previous definition).
- •
Epigenetics: The study of genetic effects produced by mechanisms that do not alter the primary sequence of DNA. For instance, methylation of DNA (see previous definition) or of histones can produce differences in gene regulation. Epigenetic modifications may result in changes to gene expression and regulation.
- •
Extrinsic filtering: Data filtering based on information outside of the dataset, such as the inclusion of genomic annotations from the National Institutes of Health Roadmap Epigenomics Mapping Consortium (ROADMAP) or the Encyclopedia of DNA Elements (ENCODE).
- •
Genome-wide association study (GWAS): Examination of a genome-wide set of genetic variants (typically single-nucleotide polymorphisms [SNPs]) to uncover associations between genotypic variation and a phenotype or trait. Similarly, epigenetic variation such as DNA methylation can be investigated in epigenome-wide association studies.
- •
Haplotype: A set of SNPs on the same DNA strand that are inherited together due to linkage disequilibrium (see definition later in this glossary).
- •
Heritability: The proportion of phenotypic variation that can be accounted for based on genotypic variation.
- •
Imputation: Statistical inference of unobserved data, such as predicting the most likely allele of a particular SNP based on known linkage disequilibrium (LD)/haplotype structure. Imputation methods are best established for genotyping data.
- •
Intrinsic filtering: Filtering of data based on information calculated from the dataset itself, such as filtering genetic variants based on linkage to another variant strongly associated in that dataset.
- •
Linkage disequilibrium (LD): The nonrandom association of 2 or more genetic variants. Genetic recombination during meiosis allows for independent assortment of alleles and genetic variants. Genomic proximity, as well as forces like selection, population structure, and genetic drift, can maintain the association of 2 variants over a considerable period.
- •
Long noncoding RNAs (lncRNAs): lncRNAs are molecules of RNA greater than 200 nucleotides in length that do not code for protein products. These RNAs act at several levels of gene regulation, including gene-specific transcription, splicing, translation, and posttranslational modification. RNA-Seq studies that target a genomic locus at high sequencing depth can identify associated lncRNAs for further analysis.
- •
Mendelian randomization: An epidemiologic method in which genetic variation in genes of known function is used to examine whether a modifiable exposure has a causal effect on disease in nonexperimental studies. This method can be used to test for causal effects between 2 phenotypes (often an intermediate phenotype and a disease outcome) without conducting a randomized controlled trial.
- •
Metaorganism: A community of organisms, comprising the host and the other organisms associated with it, whose collective genetic material makes up the metagenome. The metagenome comprises all the genetic material associated with a human being, including, for example, host DNA, microbial DNA, and the virome.
- •
Multiple enhancer variant hypothesis: The hypothesis, based on observations in diseases such as rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and multiple sclerosis, that multiple variants in linkage disequilibrium may act cooperatively to regulate the expression of a target gene.
- •
Nonadditive genetic effects: Effects for which the contributions of alleles influencing a trait are not independent of one another, or not independent of the environment.
- •
Metabolomics: The study of metabolites (small molecules left behind as part of specific cellular processes) within cells, fluids, tissues, or organisms. Collectively, these small molecules are referred to as the metabolome.
- •
Pathogenic variant: A variant that contributes to the pathogenesis of a specified disease state. Such variants also may contribute to or protect against other phenotypes. Pathogenic variants need be neither necessary nor sufficient to produce a disease state due to incomplete penetrance.
- •
Phased haplotype: With short read sequencing, it is uncertain whether variants are inherited from the maternal or paternal copy of a given chromosome. Algorithms have been devised to deduce phased haplotypes, or the most likely assignment of variants in a region to one or the other parental copy of a chromosome, enabling inference of haplotypes.
- •
Phenome: The phenome refers to the set of all phenotypic states for a given biological unit of interest, such as an organism or population.
- •
Polygenic traits: Traits influenced by genetic variation in several or many genes or genetic loci. Recent studies of rheumatic diseases suggest that thousands of genetic variants of small effect may modify disease risk.
- •
Proteomics: Analysis of the full complement of proteins produced by a given biological entity of interest, such as a cell, tissue, or organism, including those modified through splicing or posttranslational modification.
- •
Quantitative trait locus (QTL): A genetic variant that is associated with a quantitative difference in the measurement of a phenotype or trait. For instance, an expression quantitative trait locus (eQTL) is a genetic variant correlated with expression level of either local genes (<5 Mb; a cis-eQTL) or faraway genes (>5 Mb; a trans-eQTL). An SNP that correlates with the methylation state of 1 or more genomic elements, such as nearby CpG sites, is referred to as a methylQTL or meQTL.
- •
RNA-Seq: A next-generation sequencing technology that allows quantitative profiling of the transcriptome (eg, identifying the presence and amount of messenger RNA in a sample of cells or tissue).
- •
Single-nucleotide polymorphism (SNP): A DNA sequence variation affecting only 1 nucleotide, typically present in at least 1% of a given population. For instance, in the hypothetical sequence AGT(C)TA, the substitution of cytosine by thymine resulting in a sequence of AGT(T)TA would define an SNP.
- •
Structural variation (SV): Large-scale DNA sequence variants. Copy number variants (see previous definition) producing deletion or duplication of a genomic segment are structural variants, as are genomic rearrangements not resulting in a gain or loss of genetic material, such as an inversion or translocation.
- •
Transcriptomics: Study of the set of all RNA transcripts produced by the genome, usually studied in particular tissues or organs (eg, blood) or cell types (eg, CD4+ T lymphocytes).
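To make the glossary definition of linkage disequilibrium more concrete, the following is a minimal sketch of how the nonrandom association of 2 biallelic variants could be quantified from phased haplotypes. The function name `ld_stats`, the two-character haplotype encoding, and the choice of the statistics D and r² are illustrative assumptions; they are not taken from the studies discussed here.

```python
from collections import Counter


def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic variants.

    `haplotypes` is a list of 2-character strings (hypothetical encoding):
    the first character is the allele at variant 1 ("A" or "a"),
    the second is the allele at variant 2 ("B" or "b").
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)

    # Allele frequencies at each variant
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n

    # Observed frequency of the A-B haplotype
    p_ab = counts["AB"] / n

    # D: deviation of the observed haplotype frequency from the
    # frequency expected if the two variants were independent
    d = p_ab - p_a * p_b

    # r2: squared correlation between the two variants
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Two variants that are always inherited together
print(ld_stats(["AB"] * 50 + ["ab"] * 50))  # (0.25, 1.0)
```

An r² near 1 means that the genotype at one variant almost fully predicts the other, which is why association signals within a risk locus are hard to assign to a single pathogenic variant.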
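The distance rule in the quantitative trait locus entry can be sketched as a small classifier. This is an illustration only: the function name `classify_eqtl` and its parameters are hypothetical, the 5 Mb window is the value given in the glossary, and two choices the glossary does not specify are assumptions here (a variant on a different chromosome from the gene is called trans, and a distance of exactly 5 Mb is called trans).

```python
CIS_WINDOW_BP = 5_000_000  # 5 Mb threshold from the glossary definition


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_pos,
                  window_bp=CIS_WINDOW_BP):
    """Label a variant-gene pair as a "cis" or "trans" eQTL by distance.

    Positions are base-pair coordinates on the named chromosomes.
    """
    # Assumption: different chromosomes are always treated as trans
    if snp_chrom != gene_chrom:
        return "trans"
    # Assumption: the boundary case (exactly window_bp) is treated as trans
    distance = abs(snp_pos - gene_pos)
    return "cis" if distance < window_bp else "trans"


print(classify_eqtl("chr6", 31_000_000, "chr6", 32_500_000))  # cis
print(classify_eqtl("chr6", 31_000_000, "chr6", 40_000_000))  # trans
```

Other window sizes are passed through `window_bp`, which keeps the threshold an explicit analysis choice rather than a fixed property of the variant.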
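The Mendelian randomization entry describes the logic of the method but not a specific estimator. One simple and widely used option, shown here purely as an illustration, is the Wald ratio for a single genetic variant: the variant's effect on the outcome divided by its effect on the exposure. The function name `wald_ratio`, its parameters, and the first-order standard error approximation are assumptions of this sketch, and the numbers in the usage line are made up.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant Mendelian randomization estimate (Wald ratio).

    beta_exposure: estimated effect of the variant on the exposure
    beta_outcome:  estimated effect of the variant on the outcome
    se_outcome:    standard error of beta_outcome

    Returns (causal_estimate, approximate_standard_error). The standard
    error uses a first-order approximation that ignores uncertainty in
    beta_exposure.
    """
    if beta_exposure == 0:
        raise ValueError("variant must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


# Hypothetical summary statistics, for illustration only
estimate, se = wald_ratio(beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.02)
print(round(estimate, 3), round(se, 3))  # 0.2 0.04
```

The estimate is interpretable as a causal effect only if the variant affects the outcome solely through the exposure, which is the central assumption of the method.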