Abstract
Genetic and genomic investigations are a starting point for the study of human disease, seeking to discover causative variants relevant to disease pathophysiology. Over the past 5 years, massively parallel, high-throughput, next-generation sequencing techniques have revolutionized genetics and genomics, identifying the causes of many Mendelian diseases. The application of whole-genome sequencing and whole-exome sequencing to large populations has produced several publicly available sequence datasets that have revealed the scope of human genetic variation and have contributed to important methodological advances in the study of both common and rare genetic variants in genetically complex diseases. The importance of noncoding genetic variation has been highlighted by the Encyclopedia of DNA Elements (ENCODE) project and the National Institutes of Health (NIH) Roadmap Epigenomics Program. Integrated analyses of these datasets, together with disease-specific datasets, will provide an important and powerful tool for determining the mechanisms through which disease-associated, noncoding variation influences disease risk.
In pediatric rheumatology, as in most of medicine, the causes of nearly every disease that we treat are either unknown or incompletely understood. Therefore, studies seeking to determine the pathogenic mechanisms of pediatric rheumatic diseases are of critical importance if we hope to identify and test new therapeutic agents. Genetic and genomic investigations are an integral component of the biomedical research endeavor, identifying variants that cause or influence the risk of developing a disease. In a field that is driven by technological advances, the emergence of massively parallel, high-throughput sequencing techniques, also known as next-generation sequencing (NGS), has truly revolutionized clinical genomics, with the largest immediate impact on monogenic diseases, which are caused by mutations of a single gene. The wide adoption of NGS has produced large, publicly available sequence datasets that have facilitated expanded use of single nucleotide polymorphism (SNP) imputation and ultimately improved the sensitivity of genome-wide association studies (GWAS) in the investigation of genetically complex diseases. Finally, efforts like the Encyclopedia of DNA Elements (ENCODE) project and the National Institutes of Health (NIH) Roadmap Epigenomics Program have brought to the forefront the importance of noncoding variation. Moreover, the incorporation of these datasets into public repositories and genome browsers is now enabling the integrated interrogation of disease-specific genomic data and ENCODE data to determine the mechanisms through which disease-associated, noncoding variation influences disease risk. This is particularly important, considering that over 90% of GWAS-identified disease- and trait-associated variation falls within noncoding regions of the genome.
In this chapter, we will seek to demonstrate how advances in genetic and genomic technology may be applied to the investigation of pediatric rheumatic diseases. Given the paucity of examples in which these techniques have been applied to pediatric rheumatic diseases to date, we will discuss these innovative approaches in the context of a range of rheumatic and immunologic diseases.
Investigating Mendelian and monogenic pediatric rheumatic diseases
Methodologies
Prior to the advent of NGS technologies, investigations of diseases of Mendelian inheritance typically employed linkage analyses of large families with multiple affected individuals to map the phenotypes of interest to specific genomic regions. By typing family members for a panel of markers evenly spaced across the entire genome, one may identify the linked genomic regions where marker haplotypes segregate with disease status. From the linkage interval(s), candidate genes can be selected and sequenced by Sanger’s method until the disease-causing mutation is identified. In Sanger sequencing, the genomic regions of interest are amplified by the polymerase chain reaction (PCR) and each strand of the double-stranded PCR product is further subjected to dye-terminator sequencing reactions. After separation and detection by a capillary sequencer, a chromatogram demonstrating the nucleotide sequence for each DNA fragment is examined. The evaluation of linkage intervals by conventional sequencing can be an expensive and time-consuming process, particularly when linkage regions are large or contain many large, complex genes.
Compared to sequencing by Sanger’s method, NGS approaches offer many important improvements and benefits. NGS approaches can simultaneously interrogate tens to hundreds of billions of base pair positions in a single reaction, generating datasets whose breadth and depth of sequence coverage dwarf those of conventional sequencing methods. Additionally, NGS approaches are often less expensive and less time consuming than conventional sequencing. For all of these reasons, NGS has quickly replaced traditional sequencing in many applications. There are a variety of NGS platforms and techniques from which investigators may choose, but they all share a general workflow that includes the preparation of a library of DNA templates, the sequencing and imaging of the template library, and the post-sequencing data analysis. More specific information about NGS platforms and their methodologies may be found in published reviews of the topic.
DNA libraries for NGS may be generated through several approaches that are specific to the experimental design of the study. In whole-genome sequencing (WGS), one utilizes a library of DNA templates created from whole genomic DNA. By contrast, for targeted deep resequencing one extracts from the whole DNA only the regions of interest, producing a sequencing library from only those targeted regions, a process called target enrichment. Target enrichment may be accomplished through high-fidelity PCR amplification of the genomic regions of interest, in which case the sequencing libraries are created from the pooled PCR products. Alternatively, one may use hybrid capture-based methods for target enrichment, wherein a cocktail of custom-designed, synthetic oligonucleotide probes with sequences complementary to the genomic regions of interest is used to “capture” those desired regions. After additional processing, only the genomic regions of interest remain as the input material for the sequencing library. The most widely adopted target enrichment strategy is whole-exome sequencing (WES), in which one seeks to capture and sequence the entirety of the coding regions of the genome, the so-called “exome,” which accounts for only about 1% of the entire genome. Because WES demands substantially less computational effort and financial cost than WGS, and because a majority of Mendelian diseases are expected to be caused by protein-coding variants, WES is an attractive investigative strategy.
WES studies produce a very large set of sequence reads of various lengths and configurations, depending on the experimental design. These sequence reads are assembled and aligned to a known reference sequence and variant alleles are identified and cataloged. The resulting list of variants may then be filtered on the basis of various characteristics, including metrics of variant quality (depth of coverage and quality scores), indicators of potential importance of a variant (allele frequency, degree of evolutionary conservation, predicted effect on protein structure), and proper segregation within the experimental samples. The approaches and tools for the analysis and filtering of NGS datasets are rapidly evolving.
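The filtering step described above can be illustrated with a minimal Python sketch. It is not the pipeline of any particular study: the `Variant` fields, the 0 to 1 conservation score, the gene names, and every threshold are hypothetical placeholders chosen only to show how quality and importance criteria combine.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    depth: int               # number of reads covering the position
    quality: float           # Phred-scaled call quality
    allele_freq: float       # frequency in a reference population (0.0 if unseen)
    conservation: float      # hypothetical 0-1 evolutionary conservation score
    protein_altering: bool   # predicted to change the encoded protein


def passes_filters(v, min_depth=10, min_quality=30.0,
                   max_allele_freq=0.01, min_conservation=0.5):
    """Return True if the variant meets every quality and importance criterion.

    All default thresholds are illustrative assumptions, not recommendations.
    """
    return (v.depth >= min_depth
            and v.quality >= min_quality
            and v.allele_freq <= max_allele_freq
            and v.conservation >= min_conservation
            and v.protein_altering)


def filter_variants(variants, **thresholds):
    """Keep only the variants that pass every filter."""
    return [v for v in variants if passes_filters(v, **thresholds)]


# Toy data with placeholder gene names.
variants = [
    Variant("GENE_A", 42, 60.0, 0.0, 0.9, True),
    Variant("GENE_B", 6, 55.0, 0.0, 0.8, True),       # coverage too shallow
    Variant("GENE_C", 80, 70.0, 0.12, 0.9, True),     # too common in the population
    Variant("GENE_D", 35, 48.0, 0.001, 0.7, False),   # does not alter the protein
]
kept_genes = [v.gene for v in filter_variants(variants)]
print(kept_genes)
```

Segregation within the family, the last criterion named above, depends on genotypes from several individuals and is therefore left out of this per-variant filter.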
The first NGS discovery
The first instance of an NGS study identifying a novel cause of a human disease was in 2010, when WES (and later WGS) was independently applied to the investigation of postaxial acrofacial dysostosis (PAD), a rare, autosomal recessive malformation syndrome marked by features that include cleft lip and palate, posterior limb abnormalities, and ocular anomalies. In their first study, the authors performed WES of four PAD patients from three families, generating a list of all variants within the protein-coding region of these four individuals. By filtering the variant lists to include only genes harboring two novel, protein-altering mutations in all four individuals, the authors reduced their list to a single candidate gene, DHODH, encoding dihydroorotate dehydrogenase, which coincidentally is familiar to pediatric rheumatologists as the target of the drug leflunomide. Using conventional sequencing of DHODH, the authors identified compound heterozygous mutations in four additional affected individuals. The second study of PAD, which used WGS to examine a family quartet that included two affected children and two unaffected parents, also identified mutations of DHODH as its cause. It was immediately predicted that NGS would accelerate the discovery of genetic causes of many rare Mendelian diseases, and within its first year of adoption, WES was responsible for the identification of the genetic causes of 12 Mendelian diseases.
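The recessive filtering logic described above, which retains only genes carrying at least two novel, protein-altering alleles in every affected individual, can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' actual pipeline; the input layout, sample labels, and gene names are hypothetical, and a homozygous variant is assumed to be listed twice so that it counts as two alleles.

```python
from collections import Counter


def candidate_genes(per_individual_genes, min_hits=2):
    """Return genes with at least `min_hits` qualifying alleles in every individual.

    per_individual_genes maps a sample label to a list of gene names, with one
    entry per novel, protein-altering allele observed in that sample.
    """
    shared = None
    for genes in per_individual_genes.values():
        counts = Counter(genes)
        qualifying = {gene for gene, n in counts.items() if n >= min_hits}
        # Intersect across individuals: a candidate must qualify in all of them.
        shared = qualifying if shared is None else shared & qualifying
    return shared or set()


# Toy data: only GENE_X carries two qualifying alleles in all four samples.
samples = {
    "patient_1": ["GENE_X", "GENE_X", "GENE_Y", "GENE_Y", "GENE_Z"],
    "patient_2": ["GENE_X", "GENE_X", "GENE_Y"],
    "patient_3": ["GENE_X", "GENE_X", "GENE_Z", "GENE_Z"],
    "patient_4": ["GENE_W", "GENE_X", "GENE_X"],
}
print(candidate_genes(samples))
```

Each additional affected individual can only shrink the intersection, which is why combining even a handful of unrelated patients narrows the candidate list so quickly.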
Next-generation sequencing discoveries and the pediatric rheumatologist
Deficiency of adenosine deaminase 2
Of the many NGS-driven Mendelian discoveries that have been reported since that time, a subset has identified causes of immune-mediated diseases that are directly relevant to pediatric rheumatologists. Among these are autoinflammatory diseases, which are generally characterized by recurrent episodes of inflammation that occur in the absence of either an identifiable trigger or features typical of autoimmunity, as well as disorders of immune dysregulation with predisposition to autoimmunity, immune deficiency, and atopy. Perhaps the most striking of the newly identified, immune-mediated diseases is the deficiency of adenosine deaminase 2 (DADA2), an autosomal recessive, autoinflammatory syndrome that is characterized by the presence of systemic inflammation with intermittent fevers, the early onset of recurrent lacunar strokes, some with hemorrhagic transformation, and vasculopathy. WES analysis of three unrelated, affected individuals and their unaffected parents identified compound heterozygous mutations of CECR1, which encodes adenosine deaminase 2 (ADA2), in each subject. Importantly, it was the combined analysis of multiple affected individuals that produced this discovery. The separate sequence analyses in patients 1 and 2 produced lists of 17 and 19 candidate genes, respectively, whereas the combined analysis of these two individuals identified CECR1 as the only candidate gene. Compound heterozygous mutations of CECR1 were identified subsequently in three additional affected individuals using conventional sequencing methods, further supporting the conclusion that variants of this gene are causative. Further, all affected subjects had markedly reduced levels of ADA2 and its enzymatic activity in peripheral blood, as compared to healthy individuals.
The authors demonstrated that upon morpholino-based silencing of a CECR1 paralog in zebrafish, the fish embryos developed intracranial hemorrhage and neutropenia, and moreover this phenomenon was reversible with the co-introduction of wild-type, human CECR1 but not with the mutated forms of CECR1. These experiments demonstrate the general value of animal models in establishing a functional link between a gene of interest and a disease. Specifically, this study demonstrates how the unique features of zebrafish, including their very rapid development, their translucent embryos, and the ease with which they may be genetically manipulated, may make them an appealing alternative to other model organisms for the rapid investigation of genetic discoveries. This study went on to identify homozygous mutations of CECR1 in three patients with polyarteritis nodosa (PAN) which, together with a concurrently published study that also identified recessively inherited mutations of CECR1 in six families with PAN, further expanded the scope of ADA2-mediated disease.
Chronic multifocal osteomyelitis with immune dysregulation
A very important benefit of NGS in the investigation of Mendelian disease is its utility, even in cases of single affected individuals and family units too small for linkage analysis. This ability was demonstrated in a study that examined a child with immune dysregulation characterized by vitiligo, autoimmune hemolytic anemia, B lymphopenia, and sterile chronic multifocal osteomyelitis which evolved into disseminated granulomatous disease. Using WES of the affected child and his two unaffected parents, investigators identified 12 genes that contained homozygous, non-synonymous mutations in the affected child that were not present in either parent. Among these was RAG1, which encodes the recombination-activating gene 1 protein and has been associated with immune dysregulation similar to that seen in the affected child. The presence of T and B lymphocytes in the affected child suggested that the missense mutation, R699W, produced reduced, but not absent, function, given that complete deficiency of RAG results in T- and B-cell deficiency.
Deficiency of HOIL-1 protein
Another example is the case of two siblings and a third unrelated child with a unique combination of autoinflammation and life-threatening immune deficiency, the cause of which was identified by a combination of SNP genotyping and WES. Early in life, all three affected children developed recurrent, episodic systemic inflammation as well as antibody deficiency, recurrent pyogenic infections, and the intracellular deposition of amylopectin-like material in the skeletal, smooth, and cardiac myocytes. Interestingly, while bone marrow transplantation brought resolution to the infectious and inflammatory manifestations of one patient, that patient continued to experience progression of the amylopectinosis, ultimately succumbing to complications of heart failure. All three subjects were found to have recessively inherited, loss-of-function mutations in RBCK1, which encodes the HOIL-1 protein, a component of the linear ubiquitin chain assembly complex (LUBAC). Through the addition of linear chains of ubiquitin, the LUBAC is critically involved in the targeting of proteins to specific cellular destinations, such as the targeting of nuclear factor kappa B (NF-κB) essential modulator to the interleukin (IL)-1 receptor (IL-1R), tumor necrosis factor receptor (TNFR), and toll-like receptor (TLR) signaling complexes. The authors went on to demonstrate that in fibroblasts and Epstein–Barr-virus-transformed B cells from these patients, the loss of HOIL-1 impaired the LUBAC assembly, leading to a disruption in NF-κB activation.
Immune dysregulation mediated by mutations in PLCG2
NGS was also involved in the discoveries of two diseases caused by mutations of PLCG2, highlighting both the power and the limitations of NGS approaches. The cause of PLCG2-associated immune deficiency, antibody deficiency, and immune dysregulation (PLAID), an autosomal dominant syndrome characterized by cold urticaria and susceptibility to atopy, autoimmunity, and infection, was not identified by an NGS approach but was identified using SNP-based linkage analysis coupled with Sanger sequencing. Through this approach, three distinct exon-containing deletions of PLCG2 were identified as the cause of this syndrome. Importantly, WGS was undertaken in the first proband; however, it failed to identify the causative mutation because the software to detect deletions had not been tested and tuned to detect hemizygous deletions of this size. While the reasons for them may vary, such failures are not uncommon, and in fact a recent report from a large, academic medical center found that among the first 250 consecutive patients referred for clinical WES to evaluate possible genetic conditions, WES failed to identify the causative variant in 75% of cases. From a positive perspective, this reflects the successful identification of causal genetic variants in one out of four cases studied by WES, which is an extraordinary rate of success!
In contrast to PLAID, the cause of autoinflammatory PLAID (APLAID), which is characterized by antibody deficiency with recurrent infections, together with inflammatory manifestations of the skin, eyes, and gastrointestinal tract, was successfully identified using WES of a family trio that included an affected father and daughter and an unaffected mother. Using a variant-filtering strategy that identified only high-quality sequence variants with >10× coverage that were non-synonymous, novel, evolutionarily conserved, and predicted to be damaging, a list of eight candidate variants was generated. Conventional sequencing of these eight variants in the unaffected grandparents found that one of these mutations, the S707Y variant of PLCG2, was a de novo mutation in the affected father.
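The segregation logic behind this kind of family study can be sketched with simple set operations in Python. The sketch assumes each person's filtered variants are available as a set of identifiers; the identifiers (`var1`, `var2`, ...) and the toy genotypes are hypothetical and are not data from the study.

```python
def segregating_variants(affected, unaffected):
    """Variants carried by every affected person and by no unaffected person.

    `affected` must be a non-empty list of sets; `unaffected` may be empty.
    """
    shared = set.intersection(*affected)
    seen_in_unaffected = set().union(*unaffected)
    return shared - seen_in_unaffected


def de_novo_variants(child, parent_a, parent_b):
    """Variants present in the child but in neither parent."""
    return child - parent_a - parent_b


# Toy trio: affected father and daughter, unaffected mother.
father = {"var1", "var2", "var3"}
daughter = {"var1", "var2", "var4"}
mother = {"var2", "var4"}
candidates = segregating_variants([father, daughter], [mother])

# Follow-up genotyping of the father's own (unaffected) parents.
grandfather = {"var2", "var3"}
grandmother = {"var3"}
new_in_father = de_novo_variants(father, grandfather, grandmother)

print(candidates, new_in_father)
```

A candidate that both segregates with disease in the trio and is absent from the unaffected grandparents is consistent with a de novo, dominantly acting mutation, which is the pattern the follow-up sequencing was designed to reveal.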
Detecting somatic mutations that cause monogenic diseases
Somatic mutations, or mutations arising de novo in somatic cells, may be an important and under-recognized cause of disease. When a new mutation arises in an individual’s gametes, that mutation may be transmitted, de novo, to their offspring, in whom the mutation would be present in the germ line and could be further transmitted. By contrast, when a new mutation occurs in cells other than gametes, it produces somatic mosaicism with two genetically distinct populations of cells. Because conventional sequencing cannot sensitively identify somatic mosaicism, there have been relatively few studies exploring the spectrum of nonmalignant diseases caused by somatic mutations. Given the enormous depth of coverage that NGS approaches can generate at each base position, it may now be possible to identify disease-causing somatic mutations, even if mutant cells represent a very low percentage of the larger cellular population. Although WES has been successfully used to identify causative somatic mutations in a number of diseases, including the cryopyrinopathies, the error rate of the whole genome amplification step of the most widely adopted sequencing platform is relatively high at 1%, which may confound the search for somatic mutations occurring at frequencies <10%. Studies of cancer genomics have demonstrated that this problem may be partially overcome by evaluating paired affected and unaffected tissues and this approach has been widely applied, identifying an enormous range of cancer-associated somatic mutations. By applying a similar strategy to the investigation of Proteus syndrome, a disease marked by segmental overgrowth and tissue hyperplasia that is thought to be the “Elephant Man” disease, investigators identified the cause to be mosaic, activating mutations of AKT1.
By performing WES of genomic DNA from seven affected and four unaffected tissue biopsy specimens acquired from six Proteus syndrome patients, together with genomic DNA from six of their unaffected first-degree relatives, investigators applied a filtering strategy designed to identify novel, protein-altering variants present in the affected, but not the unaffected, tissues. After identifying the first mutation of AKT1 in a single patient, manual inspection and follow-up studies ultimately revealed that 26 of 29 Proteus syndrome patients had AKT1 mutations.
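To see why an error rate near 1% complicates the detection of low-frequency mosaic variants, it helps to compare how often errors alone would produce a given number of variant reads with how often a true mosaic variant would. The sketch below uses a simple binomial model and is only a rough illustration: it assumes reads are independent, treats every error at a site as yielding the same alternate base (which overstates the problem), and the depth of 100 reads and the cutoff of 5 reads are arbitrary choices for this example.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing k or more variant reads."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))


depth = 100        # reads covering the position (illustrative)
min_alt_reads = 5  # variant reads required to make a call (illustrative)

# Chance that sequencing errors alone reach the threshold at a 1% error rate.
false_call = prob_at_least(min_alt_reads, depth, 0.01)

# Chance that a true variant present on 5% of reads reaches the same threshold.
true_call = prob_at_least(min_alt_reads, depth, 0.05)

print(f"errors only: {false_call:.4f}  true 5% variant: {true_call:.4f}")
```

Under this model, a threshold strict enough to suppress error-driven calls also misses a large share of true variants near 5% allele fraction, which is one way to understand why comparison against unaffected tissue from the same patient is so useful.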
In addition to employing study designs that examine paired affected and unaffected tissues, there are several other methods that promise to improve the fidelity of NGS studies. By allowing for the identification and exclusion of variants created by PCR or sequencing errors from subsequent analyses, these methods greatly increase the sensitivity of NGS to detect ultra-rare variants. In duplex sequencing, this is accomplished by incorporating oligonucleotide tags onto the 5′ ends of each strand of duplex DNA templates. Upon analysis of sequencing reads, if a variant is detected on tagged sequence reads from both strands, then one can conclude that the variant was present in the original sample. In contrast, mutations created by PCR or sequencing errors are introduced on a single DNA strand and therefore variants identified on reads from only one DNA strand are excluded. In circle sequencing, DNA templates are circularized, making it possible to use a rolling circle polymerase to generate multiple copies of the circularized template DNA in tandem. As a result, genetic variants identified in each of the multiple tandem copies may be confidently interpreted as true positive variants, while variants not detected in tandem may be attributed to experimentally introduced errors.
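The both-strands rule of duplex sequencing can be expressed as a short Python sketch. This is a deliberately simplified illustration: published duplex pipelines first build a consensus for each strand family before comparing strands, and the read representation used here (a tag, a strand label, and a variant identifier) is a hypothetical stand-in for real alignment data.

```python
from collections import defaultdict


def duplex_calls(reads):
    """Return (tag, variant) pairs supported by reads from both strands.

    Each read is a tuple (tag, strand, variant), where strand is "+" or "-"
    and variant is a hashable identifier, or None for a reference-matching read.
    """
    strands_seen = defaultdict(set)
    for tag, strand, variant in reads:
        if variant is not None:
            strands_seen[(tag, variant)].add(strand)
    # Keep a variant only if the same tagged molecule shows it on both strands.
    return {key for key, strands in strands_seen.items() if strands == {"+", "-"}}


reads = [
    ("AAT", "+", "v1"),
    ("AAT", "-", "v1"),   # v1 confirmed on both strands of molecule AAT
    ("CGG", "+", "v2"),
    ("CGG", "+", "v2"),   # v2 seen twice but only on one strand: excluded
    ("TTA", "-", None),   # reference read, carries no variant
]
print(duplex_calls(reads))
```

The second molecule shows why read count alone is not enough: PCR duplicates copy an early error many times over, but only onto descendants of the strand on which the error arose.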
Investigating pediatric rheumatic diseases with genetically complex inheritance
In pediatric rheumatology, many of our most commonly encountered diseases, including the various forms of juvenile idiopathic arthritis (JIA), juvenile systemic lupus erythematosus (JSLE), juvenile inflammatory myositis, and the chronic regional pain syndromes (i.e., juvenile fibromyalgia), are genetically complex diseases. Instead of resulting from mutations of a single gene, as is the case in diseases of Mendelian inheritance, genetically complex diseases result from the combined effects of one or more genetic risk factors and one or more environmental risk factors. The risk of genetically complex diseases may be influenced by common genetic variants, which typically have relatively small effect sizes on disease risk. Rare genetic variants also contribute to the risk of genetically complex diseases, as has been demonstrated in type-1 diabetes (T1D), rheumatoid arthritis (RA), and Behçet’s disease (BD), among others. Rare variants often have higher penetrance and confer larger effects on an individual’s risk of disease than do common variants. However, because they are selected against, they remain rare and therefore make relatively small contributions to disease risk at the population level.
The gold-standard approach to investigating genetically complex diseases is the association study, in which one compares the frequency of genetic variants between ancestrally similar groups of disease-affected individuals and healthy control subjects. In such a study, if the frequency of a genetic variant differs significantly between the two groups, then the variant is said to be disease associated, and further studies are warranted to determine whether the disease-associated variant is the disease-causal variant or whether its association with the disease simply reflects its strong linkage disequilibrium (LD) with the actual causal variant. Historically, association studies took the form of candidate gene studies, which examined variants in genes chosen for their suspected relevance to the disease. Although some candidate gene studies were informed by intervals defined by linkage mapping, many relied upon best guesses and luck.
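The comparison of variant frequencies between cases and controls can be made concrete with a minimal sketch of one common choice, an allelic chi-square test on a 2×2 table of allele counts. The counts are invented for illustration, and the function name is hypothetical. Real studies typically add covariates and corrections for ancestry and multiple testing, which are omitted here.

```python
import math
from typing import Tuple


def allelic_association(
    case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int
) -> Tuple[float, float, float]:
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi-square statistic, p-value, odds ratio).
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    denominator = (
        (case_alt + case_ref)      # alleles in cases
        * (ctrl_alt + ctrl_ref)    # alleles in controls
        * (case_alt + ctrl_alt)    # alternate alleles overall
        * (case_ref + ctrl_ref)    # reference alleles overall
    )
    chi2 = n * cross ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Invented counts: 500 cases and 500 controls, two alleles per person.
chi2, p_value, odds_ratio = allelic_association(300, 700, 200, 800)
print(f"chi2={chi2:.2f}, p={p_value:.2e}, OR={odds_ratio:.2f}")
```

A significant result from such a test indicates association only. As noted above, the tested variant may simply be in LD with the causal variant.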
With the emergence of commercially available methods to interrogate the entire genome for associations in a high-throughput manner, the GWAS quickly supplanted the candidate gene study and became the preferred method of identifying susceptibility loci in genetically complex diseases. A GWAS is a hypothesis-neutral, unbiased study design that examines common genetic markers spanning the entire genome to identify disease susceptibility loci. Using arrays of SNPs that are designed to query most of the common, independently inherited LD blocks, GWAS can identify LD blocks harboring common, disease-associated variants. GWAS do not often identify the functionally important risk variant(s) within the LD block, though, and moreover, rare disease-associated variants may also exist within GWAS-implicated genes. For both of these reasons, GWAS are often coupled with candidate gene studies to more intensively interrogate the GWAS-implicated susceptibility loci to identify a functionally relevant variant. A potential pitfall of these follow-up studies is the tendency to concentrate on the gene nearest to the variant, while ignoring the possibility of more long-range effects.
Methodological advances in the investigation of genetically complex diseases
There have been a number of important advances in the performance of GWAS and the analysis of genome-wide datasets that warrant further discussion. First, the development and maturation of SNP imputation as a statistical genetic tool has made it a common element of contemporary GWAS. SNP imputation is a statistical method that can, through repeated comparisons of an experimental SNP dataset with a reference panel of dense SNP genotypes, accurately infer additional SNP genotypes in the experimental population, in silico. The maturation of SNP imputation was enabled both by the availability of large, sequencing-based reference populations, such as the 1000 Genomes Project, and by methodological improvements to accommodate the substantially larger size of sequencing-based reference populations. Ultimately, multi-reference imputation was pioneered, improving the performance of imputation across experimental populations. The potential power and value of using imputation in GWAS is demonstrated by two recent studies of a single BD case-control collection. In the initial GWAS of this population, which examined 311,459 SNPs, significant associations were identified between BD and IL10, HLA-B, and IL23R-IL12RB2. In the second study of the same case-control collection, SNP imputation was used to expand the dataset to include 779,465 SNPs, ultimately identifying independent associations between BD and CCR1, STAT4, KLRC4, and ERAP1 that were not detected in the first GWAS.
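The core idea of imputation, borrowing untyped alleles from a matching reference haplotype, can be shown with a deliberately simplified toy. This sketch uses nearest-haplotype matching, which is an assumption made for brevity. Production imputation software relies on probabilistic haplotype models and is considerably more sophisticated. The function name and the 0/1 allele encoding are hypothetical.

```python
from typing import List, Optional, Sequence


def impute_haplotype(
    observed: Sequence[Optional[int]],
    reference_panel: Sequence[Sequence[int]],
) -> List[int]:
    """Fill untyped sites (None) by copying from the best-matching reference haplotype.

    Alleles are encoded as 0 (reference) or 1 (alternate).
    """
    def mismatches(reference: Sequence[int]) -> int:
        # Count disagreements at the sites that were actually genotyped.
        return sum(
            1 for obs, ref in zip(observed, reference)
            if obs is not None and obs != ref
        )

    best_match = min(reference_panel, key=mismatches)
    # Keep genotyped alleles; borrow the rest from the best-matching haplotype.
    return [ref if obs is None else obs for obs, ref in zip(observed, best_match)]


panel = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
typed = [1, None, 0, None, 1]  # sites 2 and 4 were not on the genotyping array
print(impute_haplotype(typed, panel))
```

Here the second panel haplotype matches all three typed sites, so its alleles fill the two untyped positions. A denser, more diverse reference panel increases the chance that such a match exists, which is why large sequencing-based reference populations improved imputation.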
In addition to expanding the density of SNP datasets, imputation-based approaches may also be used to infer classical major histocompatibility complex (MHC) alleles. Many GWAS, particularly those of autoimmune and immune-related diseases, have identified disease associations among the MHC class-I or class-II gene clusters. Recently, a technique has evolved that allows one to statistically determine the MHC class-I and class-II types, as well as the subordinate amino acid identities, from SNP genotypes derived from the MHC region of nearly any commercially available GWAS array. By applying this approach to RA, which has a strong association with human leukocyte antigen (HLA)-DRB1 alleles, the risk of RA was mapped to three amino acid positions of HLA-DRB1. This work further refined our understanding of the shared epitope of RA, while also identifying associations between RA and single residues of both HLA-B and HLA-DPB1, neither of which was previously known to influence RA risk. In addition to RA, this method was recently used to investigate the role of MHC class-I molecules in BD, where it identified a group of 7 amino acid residues of HLA-B and HLA-A that independently influence disease susceptibility. Moreover, the locations of the BD-associated residues implicate both peptide binding by MHC class-I molecules and the regulation of cytotoxic cells through two distinct families of cytotoxicity receptors in the pathogenesis of BD. Given that numerous pediatric rheumatic diseases have associations with specific MHC molecules, this technique may be especially informative in the investigation of these diseases.
Particularly relevant to pediatric rheumatology, genome-wide meta-analysis has been developed and adopted to bolster the power of GWAS in some of the most heavily studied diseases, such as RA, where the latest genome-wide meta-analysis includes approximately 10 million SNPs examined in 29,880 patients and 73,758 healthy controls of European and Asian ancestry. Building on the 59 known RA susceptibility loci identified by previous studies, Okada et al. identified 42 novel, significant RA susceptibility loci. Genome-wide meta-analysis may also be utilized in the investigation of rare, genetically complex diseases, such as systemic JIA, enthesitis-related JIA, and psoriatic JIA, where genome-wide meta-analysis may be required to even assemble a case collection of adequate size to perform the GWAS.
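To illustrate how combining studies increases power, the following minimal sketch applies one widely used approach, fixed-effect inverse-variance weighting, to per-study effect estimates for a single SNP. The effect sizes and standard errors are invented, the function name is hypothetical, and no claim is made that this is the exact procedure used in the studies cited above.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(
    estimates: Sequence[Tuple[float, float]],
) -> Tuple[float, float, float, float]:
    """Combine per-study (log odds ratio, standard error) pairs for one SNP.

    Returns (pooled log odds ratio, pooled standard error, z-score, two-sided p-value).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, z, p_value


# Invented per-study estimates for a single SNP: (log odds ratio, standard error).
studies = [(0.20, 0.10), (0.10, 0.10)]
beta, se, z, p = fixed_effect_meta(studies)
print(f"pooled log OR={beta:.3f}, SE={se:.4f}, z={z:.2f}, p={p:.4f}")
```

The pooled standard error is smaller than that of either contributing study, which is the statistical basis for the gain in power when case collections are combined.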
Low-frequency and rare genetic variants in genetically complex diseases
Low-frequency variants (with minor allele frequencies between 0.01 and 0.05) and rare genetic variants (with minor allele frequencies below 0.01) are another important source of risk in genetically complex diseases. In fact, because rare variants were not included in commercially available GWAS SNP arrays, it was not until the release of WGS and WES data from reference populations like the 1000 Genomes Project and the National Heart, Lung, and Blood Institute’s Exome Sequencing Project that the landscape of rare genetic variation was truly appreciated. Today, rare variants may be interrogated in three general ways: using NGS methods, including targeted deep resequencing, WES, and WGS; using special genotyping arrays designed to examine rare variants, such as the ImmunoChip, an SNP array that densely genotypes SNPs within a subset of immunologically important genes; and using SNP imputation with 1000 Genomes reference data to infer the identities of rare variants. Importantly, the statistical methods for evaluating rare variants differ from those used to analyze common variants. Single point analyses, like those used to examine common variants, are underpowered because rare alleles occur so infrequently that a prohibitively large study population would be required. Instead, analyses that consider the distribution of rare variants in an entire gene, referred to as gene-based tests or burden tests, are widely utilized to test for disease associations with rare variants. These tests may be used to compare the distribution or “burden” of rare variants between two groups. For further information, see the topical review by Panoutsopoulou et al.
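One simple member of the burden-test family can be sketched as follows. This is a collapsing test under stated assumptions: each individual is classified as a carrier if they hold at least one rare allele anywhere in the gene, and carrier counts in cases and controls are compared with a one-sided Fisher's exact test for enrichment in cases. The function name and genotype layout are hypothetical, and the data are invented. Published burden tests often add variant weighting and other refinements not shown here.

```python
from math import comb
from typing import Sequence, Tuple


def collapsing_burden_test(
    case_genotypes: Sequence[Sequence[int]],
    control_genotypes: Sequence[Sequence[int]],
) -> Tuple[int, int, float]:
    """Gene-level test for enrichment of rare-variant carriers among cases.

    Each inner sequence holds one person's rare-allele counts across the gene's sites.
    Returns (carrier cases, carrier controls, one-sided Fisher exact p-value).
    """
    # Collapse all sites: a person is a carrier if any site has a rare allele.
    case_carriers = sum(1 for person in case_genotypes if any(person))
    ctrl_carriers = sum(1 for person in control_genotypes if any(person))

    n_cases = len(case_genotypes)
    total = n_cases + len(control_genotypes)
    carriers = case_carriers + ctrl_carriers

    # Hypergeometric tail: probability of at least this many carriers among cases.
    tail = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(case_carriers, min(carriers, n_cases) + 1)
    )
    p_value = tail / comb(total, n_cases)
    return case_carriers, ctrl_carriers, p_value


# Invented genotypes for a gene with three rare-variant sites.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
controls = [[0, 0, 0]] * 5
print(collapsing_burden_test(cases, controls))
```

No single site in this toy dataset is carried by more than two people, yet pooling the sites shows four carriers among cases and none among controls. This is why gene-based tests can succeed where single point analyses are underpowered.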
Targeted deep resequencing studies are the NGS version of the candidate gene study, allowing multiplex interrogation of all candidate genes or regions of interest. In one of the pioneering targeted resequencing studies, Nejentsev et al. examined 10 candidate loci in 480 T1D patients and 480 healthy control subjects, identifying rare variants of IFIH1, encoding melanoma differentiation-associated protein 5, which was located in a region previously implicated by GWAS of T1D. This study demonstrated that resequencing could clarify the source of risk in genomic regions initially identified by GWAS. Using a similar approach, Kirino et al. examined 10 candidate genes implicated by GWAS and 11 genes of the innate immune system by deep resequencing in a collection of 2461 BD cases and 2458 healthy control subjects. Using three different tests of rare variant association, this study identified significant enrichment of rare variants of TLR4, IL23R, and NOD2 in BD, as well as an association between BD and the common M694V variant of MEFV, implicating innate immune mechanisms in the pathogenesis of BD. Targeted deep resequencing studies of GWAS-implicated risk loci have also identified rare variant associations in many other diseases, including RA and inflammatory bowel disease.
Looking forward to genetically complex pediatric rheumatic diseases
As these examples demonstrate, the coupling of GWAS and targeted deep resequencing of known susceptibility loci is a strong and intuitive approach to harnessing the capabilities of NGS, while limiting the financial cost and computational time that may be incurred through WGS or WES of the GWAS and replication populations. It is also foreseeable that in the near future, the investigation of genetically complex diseases will transition away from SNP genotyping arrays, instead employing WGS or WES to create population-based datasets. This choice would produce complete datasets appropriate for the thorough investigation of the full range of genetic variation from a single experimental procedure.
With respect to pediatric rheumatic diseases, the approaches discussed above hold great promise for advancing our understanding of diseases like JIA and JSLE, with the obvious challenge of sample recruitment. Although GWAS of common variation generally require sample sizes of at least 1000 affected subjects to detect effects of 10–20% increase in disease risk, studies in diseases like RA have continued to identify novel susceptibility loci as the size of study populations has grown, albeit most with smaller contributions to risk than those previously identified. Because most pediatric rheumatic diseases are rare diseases, assembling adequately sized patient collections to undertake association studies will remain a challenge. However, this challenge may be met by the formation of local, regional, national, and international collaborations within the pediatric rheumatology community. The sharing and consolidation of patient sample collections through such networks can produce large, international investigations that will continue to provide new insights into the causes of pediatric rheumatic diseases.


