32 Molecular Imaging In Situ Hybridization
Molecular genetics is the study of gene structure and function at its most basic level; it includes determination of the coding sequence itself, the transmission of this genetic information from deoxyribonucleic acid (DNA) to ribonucleic acid (RNA) to polypeptide, and the regulation of these processes. This introduction to molecular genetics provides a synopsis of these mechanisms. The chapter describes a number of techniques commonly used to study genes, including sequencing, molecular cloning, and hybridization, and introduces their applications in the study of genes and their functions. In doing so, it gives an overview of the paradigm shift in recent years from traditional Sanger sequencing to next-generation sequencing (NGS). Molecular cloning, a method used to amplify a defined DNA fragment to obtain multiple identical copies and commonly used to study gene function, will be discussed. Finally, examples of the many uses of nucleic acid hybridization, such as fluorescence in situ hybridization, tissue in situ hybridization, and array comparative genomic hybridization, are outlined.
32.1 The Genetic Code
A gene may be defined as a heritable unit occupying a specific locus within the genome; it contains the DNA sequence that directs the formation of a protein or RNA of functional importance. A gene includes regulatory sequences, such as the promoter and enhancer regions, that drive gene expression in response to appropriate stimuli. Within the transcribed region of the gene, introns are interspersed with exons. Introns are those regions that are removed by splicing following transcription in the formation of the mature RNA. An exon, in contrast, refers to any nucleotide sequence remaining in the mature RNA; in the case of messenger RNA (mRNA), it is this product that will act as the template for translation into protein. However, not all of the exonic sequence will be translated into protein because there is typically a 5′ (upstream) untranslated region and a 3′ (downstream) untranslated region in the first and last exons.
The DNA polymer forms the foundation of the genetic code and is constructed by the covalent linking of nucleotides. A nucleotide consists of a five-carbon sugar, a phosphate group, and a nitrogenous base. There are four nitrogenous bases: two purines, adenine and guanine, and two pyrimidines, cytosine and thymine. The sequence of these nucleotides encodes the genetic information.
The double helix DNA molecule exists as two strands held together by the hydrogen bonds that form between a complementary purine and pyrimidine: adenine pairs with thymine and guanine pairs with cytosine (Fig. 32.1). The sense or coding strand, termed the Crick strand, provides the sequence of the RNA to be transcribed, whereas its complementary antisense or noncoding strand, termed the Watson strand, provides the template on which RNA is constructed.
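The pairing rules and the relationship between the coding strand, the template strand, and the transcript can be illustrated with a minimal Python sketch. The function names and the six-base example sequence are hypothetical and chosen only for illustration; the sketch assumes unambiguous uppercase bases and uses the fact that RNA carries uracil in place of thymine.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand))


def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence, which matches the coding strand with U for T."""
    return coding_strand.replace("T", "U")


coding = "ATGGCC"                      # hypothetical coding (sense) strand, 5'->3'
template = reverse_complement(coding)  # template (antisense) strand, 5'->3'
rna = transcribe(coding)               # transcript read off the template
```

Taking the reverse complement twice returns the original strand, which is a quick check that the pairing table is symmetric.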
The structure of RNA is very similar to that of DNA, with a linear backbone of repeating phosphate and five-carbon sugar units, each unit attached to a nitrogenous base. However, most RNA is single stranded. In addition, the ribose sugar residue in RNA possesses a hydroxyl group at the 2′ carbon, compared to the hydrogen atom in DNA. RNA shares three of its four bases with DNA; thymine is replaced with uracil. Mature mRNA exits the nucleus and moves to the ribosome, where its genetic code is translated. The code is read as triplets of nucleotides, termed codons, commencing from the initiation codon AUG, which encodes methionine. Translation will usually start at the first initiation codon in the sequence; however, the surrounding sequence must conform to the Kozak consensus sequence (in eukaryotes) in order to be correctly identified as the translational start site by the ribosome. There are 64 codons, each of which codes for either one of the 20 standard amino acids or a stop signal that terminates translation. This process forms the polypeptide, which may then be required to undergo a series of posttranslational modifications in order to correctly perform its function.
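Reading the code in triplets from the initiation codon to a stop codon can be sketched as follows. This is a simplified illustration, not a model of the ribosome: it starts at the first AUG and ignores the Kozak context, and the function name and example mRNA are hypothetical. The table is the standard genetic code in one-letter amino acid symbols, with `*` marking the three stop codons.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order at each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiation codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical transcript: 5' UTR "GG", then AUG GCC UUU UAA, then 3' UTR "GG".
peptide = translate("GGAUGGCCUUUUAAGG")
```

The two untranslated bases at each end of the example mirror the point that not all exonic sequence is translated.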
32.2 Genetic Variation
Genetic variation within the human genome underlies a significant portion of the phenotypic variation among individuals. A genetic variant can be considered neutral if it has no phenotypic consequence or functional if it imparts an altered phenotype. A functional genetic variant that causes a disease or increases susceptibility to it is considered pathogenic. In single-gene (monogenic) disorders it is easier to identify overtly pathogenic variants, whereas in complex polygenic disorders the influence of genetic variation is more subtle.
Genetic variants may be grouped into different subclasses; these include single nucleotide polymorphisms (SNPs), deletions, insertions, translocations, and inversions. The effect of any of these variants depends to a significant extent on their location within the genome and the downstream effects on protein expression, structure, and function.
SNPs are the most common source of genetic variation within the human genome, with an SNP predicted to occur approximately once in every 300 nucleotides. An SNP is a nucleotide position within the genome at which one of two or occasionally three bases may be found. These variants may be common or rare and may be functional or neutral. SNPs may influence gene function through effects on regulatory sequences, intronic sequences with effects on splicing, or exonic sequences. An SNP within the exonic protein-coding region of a gene alters the DNA sequence but may not affect the amino acid sequence of the protein; such variants are termed synonymous. Synonymous variants may still be pathogenic if they create anomalous splice sites. SNPs that alter both the DNA and the amino acid sequence are termed nonsynonymous. Nonsynonymous variants may be missense, when the amino acid is changed to another amino acid, or nonsense, when a premature stop codon is introduced. Easily accessible databases catalog the known SNPs within the human genome. Large population studies termed genome-wide association studies commonly use microarrays with up to 3 million known SNPs to test for association between these variations and human traits (e.g., height) or complex polygenic disease.
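The distinction between synonymous, missense, and nonsense variants follows directly from the codon table, as the short Python sketch below illustrates. The function name and the example codons are hypothetical, and the sketch assumes the reference codon itself encodes an amino acid; effects on splicing or regulation, which the text notes can make even a synonymous variant pathogenic, are outside its scope.

```python
from itertools import product

# Standard genetic code for DNA codons (T, C, A, G order); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change within a codon by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # DNA changes, amino acid does not
    if alt_aa == "*":
        return "nonsense"    # premature stop codon introduced
    return "missense"        # one amino acid replaced by another


silent = classify_coding_snp("GAA", "GAG")     # Glu -> Glu
changed = classify_coding_snp("GAG", "GTG")    # Glu -> Val
truncated = classify_coding_snp("TGG", "TGA")  # Trp -> stop
```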
Insertions and deletions result in either the gain or loss of nucleotides within the genome. These sequence changes vary in size. Deletions or insertions of single nucleotides within coding regions will cause a frameshift in codon sequence most commonly resulting in a premature stop codon. Large deletions or insertions can result in the loss or duplication of whole genes or groups of genes.
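The effect of a single-nucleotide deletion on the reading frame can be seen by splitting a sequence into codons before and after the deletion. The following is a minimal sketch with a hypothetical 15-base coding sequence; the helper names are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq: str) -> list:
    """Split a coding sequence into complete triplets, starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def first_stop(seq: str):
    """Return the index of the first in-frame stop codon, or None if there is none."""
    for index, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return index
    return None


def delete_base(seq: str, position: int) -> str:
    """Return the sequence with the base at the given 0-based position removed."""
    return seq[:position] + seq[position + 1:]


reference = "ATGGTAACCGGCTAA"        # ATG GTA ACC GGC TAA: stop at codon index 4
mutant = delete_base(reference, 3)   # ATG TAA CCG GCT: frame shifted after ATG
```

In this example the deletion brings a TAA into frame immediately after the start codon, so the stop moves from the fifth codon to the second.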
32.3 Sequencing
Gene sequencing has undergone a paradigm shift in recent years. Frederick Sanger first described his chain termination method for sequencing DNA in 1977, and for many years afterward it was virtually the sole technique used in genetic sequencing. It was Sanger sequencing that was used to sequence the entire human genome. However, in the last decade, NGS methods have been developed; these methods are also termed massively parallel sequencing because millions of sequencing reactions occur simultaneously. This advance allows much greater capacity and speed of sequencing while decreasing costs. The developments in sequencing technology have been complemented by advances in bioinformatics needed to analyze the large datasets obtained.
32.3.1 Sanger Sequencing
Sanger sequencing requires a single-stranded DNA template upon which DNA polymerase is used to synthesize the complementary strand in vitro. Four reactions are set up simultaneously, each containing the four deoxynucleotides (dATP, dCTP, dGTP, dTTP) required for DNA synthesis and one of four chain-terminating dideoxynucleotides. Dideoxynucleotides are analogous to deoxynucleotides except that they lack the hydroxyl group on the 3′ carbon required for polymerization and therefore terminate strand elongation when incorporated. Because a dideoxynucleotide is incorporated stochastically in place of its corresponding deoxynucleotide, each reaction yields a population of fragments terminated at the positions where that base occurs. Ordering the terminated fragments from all four reactions by size identifies the nucleotide present at each sequential position.
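The logic of reading a sequence from chain-terminated fragments can be sketched in Python. This is an idealized illustration: it assumes every possible termination position is represented in each reaction and that fragments are resolved to single-base precision, and the function names and example strand are hypothetical.

```python
def termination_lengths(new_strand: str, dideoxy_base: str) -> list:
    """Fragment lengths produced when the given ddNTP terminates synthesis."""
    return [i + 1 for i, base in enumerate(new_strand) if base == dideoxy_base]


def read_ladder(reactions: dict) -> str:
    """Order fragments from all four reactions by size and read off the bases."""
    ladder = sorted(
        (length, base)
        for base, lengths in reactions.items()
        for length in lengths
    )
    return "".join(base for _, base in ladder)


new_strand = "GATTACA"  # hypothetical strand synthesized on the template
reactions = {base: termination_lengths(new_strand, base) for base in "ACGT"}
sequence = read_ladder(reactions)
```

Each reaction on its own reveals only where one base occurs; it is the combined size ordering across the four reactions that recovers the full sequence.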
32.3.2 Next-Generation Sequencing
There are multiple different NGS platforms available, utilizing different technologies, and the field continues to advance rapidly. Despite this, there are commonalities between many of the NGS systems. The majority of the systems require library preparation by the random fragmentation of sample DNA to generate sequences of the appropriate length for the platform. These sequences are then end repaired and short sequences, termed adaptor sequences, are ligated to the ends of the strands. The library is then amplified on a fixed surface; this amplification step spatially separates library fragments to create clusters of amplified template from an individual DNA fragment, each on a separate bead or at a single locus on a glass slide. This spatial separation allows localization of individual fragment reads, whereas the amplification enables the output of nucleotide incorporation to be of sufficient magnitude to be detected.
The majority of NGS systems operate through sequencing by synthesis (i.e., each incorporation of a nucleotide into the newly synthesized chain is recorded). This is more efficient than traditional techniques, which separate the synthesis reaction from the sequence determination. Individual nucleotides are sequentially applied to the platform, and, where one is incorporated into an elongating polymer, this is detected and recorded at each discrete locus. This produces reads (strings of bases) that can be aligned to a reference sequence to determine their origin. In addition, multiplexing allows a number of DNA libraries to be sequenced simultaneously through addition of a distinct “barcode” sequence to each library. This barcode sequence will be appended to each individual sequence read, so that reads can be accurately ascribed to the different libraries. Several of the NGS technologies currently in use are summarized below; the list is not exhaustive, and the field remains dynamic.
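Assigning multiplexed reads to their libraries and locating them on a reference can be sketched as follows. The sketch makes simplifying assumptions that are not taken from the text: the barcode is the first few bases of each read, barcodes must match exactly, and reads are placed by exact string search rather than by a real alignment algorithm. The sample names, barcodes, reads, and reference are all hypothetical.

```python
def demultiplex(reads: list, barcodes: dict) -> tuple:
    """Split reads by leading barcode; return (library -> inserts, unassigned reads)."""
    libraries = {name: [] for name in barcodes.values()}
    unassigned = []
    for read in reads:
        for barcode, name in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode and keep only the sample-derived sequence.
                libraries[name].append(read[len(barcode):])
                break
        else:
            unassigned.append(read)  # no known barcode at the start of the read
    return libraries, unassigned


def map_read(read: str, reference: str) -> int:
    """Return the 0-based position of an exact match, or -1 if the read is absent."""
    return reference.find(read)


barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGCCA", "TGCATTAGG", "ACGTCCATT", "GGGGAAAA"]
reference = "ATGGCCATTAGGC"

libraries, unassigned = demultiplex(reads, barcodes)
positions = {
    name: [map_read(insert, reference) for insert in inserts]
    for name, inserts in libraries.items()
}
```

Production pipelines typically tolerate sequencing errors in both the barcode and the read, which exact matching does not.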
Sequencing by Synthesis
Pyrosequencing relies on the release of a pyrophosphate group from the elongating chain during nucleotide incorporation. This pyrophosphate is converted to adenosine triphosphate (ATP) by the enzyme ATP sulphurylase in the presence of adenosine 5′-phosphosulphate; the ATP then drives the luciferase reaction, which emits light.
Ion semiconductor sequencing is similar to pyrosequencing except it utilizes the release of a hydrogen ion during DNA polymer elongation. The platform uses a semiconductor chip to detect alterations in pH upon hydrogen ion release.
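In both pyrosequencing and ion semiconductor sequencing, one nucleotide species is applied at a time and the signal reflects how many bases were incorporated in that flow. The following Python sketch simulates this under idealized assumptions: the signal is taken to be exactly proportional to the number of bases incorporated, and the flow order `TACG`, the function names, and the example strand are hypothetical.

```python
def simulate_flows(new_strand: str, flow_order: str = "TACG") -> list:
    """Return (base, bases incorporated) for each nucleotide flow, cycling the order."""
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    signals = []
    position = 0
    flow = 0
    while position < len(new_strand):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A run of identical bases is incorporated within a single flow.
        while position < len(new_strand) and new_strand[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


def decode_flows(signals: list) -> str:
    """Rebuild the synthesized sequence from per-flow incorporation counts."""
    return "".join(base * run for base, run in signals)


signals = simulate_flows("TTAGGG")
decoded = decode_flows(signals)
```

A flow with a count of zero means the applied nucleotide did not match the next template position, so no pyrophosphate or hydrogen ion was released.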
Sequencing by reversible terminator chemistry is utilized by the Illumina HiSeq (San Diego, CA); this is one of the most commonly used platforms due to its speed and efficiency. Its flowcell technology utilizes a bridge amplification technique to form clusters of the individual library fragments. The sequencing by synthesis chemistry operates through the addition of all four nucleotides simultaneously; each carries a different fluorophore, attached to the base through a cleavable linker, and a reversible block at the 3′ hydroxyl group. The block ensures that only one base is incorporated per cycle, so that the fluorophore is detected before any subsequent base incorporation. After imaging, the fluorophore and the 3′ block are chemically cleaved, so the fluorophore does not cause background signal or steric inhibition of further elongation.
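Because the 3′ block limits incorporation to one base per cycle, each imaging cycle yields one base call per cluster. A minimal sketch of that step, assuming one fluorescence channel per base and calling the brightest channel, is shown below. The intensity values are invented for illustration, and real base callers also correct for effects such as channel cross-talk that are ignored here.

```python
CHANNELS = "ACGT"  # assumed order of the four fluorescence channels


def call_bases(cycles: list) -> str:
    """Call one base per cycle by choosing the channel with the highest intensity."""
    read = []
    for intensities in cycles:
        brightest = max(range(len(CHANNELS)), key=lambda k: intensities[k])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities (A, C, G, T) for one cluster over three cycles.
cluster_cycles = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.1, 0.8),
    (0.0, 0.1, 0.9, 0.1),
]
read = call_bases(cluster_cycles)
```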
Single molecule real-time (SMRT) sequencing does not use an amplification step but records the real-time incorporation of individual fluorescently labeled nucleotides into a single template molecule complexed with DNA polymerase on a SMRT cell. The SMRT cell contains multiple zero-mode waveguides on which a camera focuses with resolution sufficient to detect single nucleotide incorporation. Here the fluorophore is attached to the terminal phosphate and is cleaved off by the polymerase on incorporation, which prevents steric inhibition of elongation.
Sequencing by Ligation
Sequencing by ligation uses fluorescently labeled complementary probes and DNA ligase as opposed to DNA polymerase to determine the fragment sequence.
Other Sequencing Technology
Nanopore sequencing, which is under development, passes the DNA fragment through a nanopore while an electric current is simultaneously passed through the pore. Each nucleotide blocks the nanopore to a different degree depending on its shape, which alters the electrical current passing through the nanopore. These alterations can in turn be used to determine the nucleotide sequence.
Applications
The ability to sequence vast quantities of DNA rapidly and comparatively inexpensively has created a wealth of opportunities, and numerous applications of NGS have been described. Whole exome sequencing is now commonly undertaken in patients with rare monogenic disorders in whom causative mutations have not been identified. The first major success of this approach was the identification of mutations in the gene encoding the enzyme dihydroorotate dehydrogenase as a potential cause of Miller syndrome in a cohort of patients.
NGS has been used to examine the modulation of whole transcriptomes by a given treatment or genetic alteration. It has allowed genome-wide study of the binding of regulatory proteins to DNA and interrogation of epigenetic changes such as histone modification or methylation on a global scale across the genome. However, with such large datasets, the work to analyze, validate, and interpret the data is now often more time-consuming than the sequencing itself.
As sequencing technologies and the associated bioinformatics analysis advance, knowledge of genetic influences on disease risk, disease progression, and response to therapy will increase. In the future, this progress in sequencing and the information it brings will enable an individual genetically tailored approach to medical practice.