Genome Biology Notes for Genetic Epidemiologists

Mehmet Tevfik DORAK



The ENCODE project (Bernstein, 2012) provided a wealth of genomic landmarks that were systematically integrated to segment the genome into seven major classes: transcription start site and predicted promoter region (TSS) (1.2%), predicted promoter flanking region (PF) (0.7%), predicted enhancer (E) (1.8%), predicted weak enhancer (WE) (2.5%), CTCF-enriched element (0.1%), predicted transcribed region (T) (19.3%), and predicted repressed or low-activity region (R) (69.6%).


DHS (DNase I hypersensitive sites): Hypersensitive sites are regions of DNA that are not protected by histones and are therefore accessible to cleavage by DNase I. ENCODE identified 200,000 such sites per cell type. It also identified three million places where transcription factors bind to these 200,000 open sites.


DNaseI footprinting detects DNA sequences that are protected from cleavage by DNaseI because they are bound by regulatory factors.


eQTLs correlate with gene expression levels. A substantial proportion of causative eQTLs (approximately 40%) are in regions of active chromatin (open chromatin, DNase I footprints and ChIP-seq peaks), and active binding sites for immune-related transcription factors (Jun-D, NFkB) are among the most highly eQTL-enriched regions (Gaffney, 2012). As many as 55% of eQTLs are also ‘DNase I sensitivity quantitative trait loci’ (dsQTLs).


Enhancers are one of the seven major genomic landmarks involved in gene regulatory activities as described by the ENCODE project. They are distal cis-regulatory elements that carry sequence information for transcription factor binding, regulate gene expression regardless of location and orientation (including trans effects), and control tissue-specific gene expression. Enhancers can be recognized by their DNase I sensitivity, methylation status and unique histone modifications. They are usually located in intergenic regions, but may also be in exons. In mice, global chromatin connectivity maps reveal approximately 40,000 long-range promoter-enhancer interactions (Zhang, 2013). Most enhancers associate with promoters located beyond their nearest genes (less than 10% of enhancers interact with the promoter of the nearest gene (Gorkin, 2014)); thus, linear juxtaposition is not the only guiding principle driving enhancer target selection (Zhang, 2013). Promoter-enhancer interactions exhibit high cell-type specificity (Zhang, 2013). Methylation levels also influence enhancer activity. As an example, breast cancer GWAS hits are enriched for enhancer methylation sites correlated with intertumor expression variation (Aran D & Hellman A, 2013). Methylation at many enhancer sites predicts gene expression levels better than promoter methylation.


Tissue specificity of enhancers and nearby gene expression increase with enhancer length; neighborhoods containing stretch enhancers are enriched for important cell type-specific genes; and GWAS variants associated with traits relevant to a particular cell type are more enriched in stretch enhancers compared with short enhancers (Parker, 2013).


Transcription factor binding to enhancer elements is critical for gene expression regulation. Enhancers are often found in noncoding sequences in close proximity to the gene that they regulate, but they may also lie far away, even on another chromosome. In mice, almost 7% of enhancers overlap coding exons. Thus, exonic sequences may also function in the regulation of nearby genes. An implication of this finding is that phenotypes seen in genetic knockout animals may result not only from the lack of expression of the deleted gene, but also from altered expression of genes that are regulated by enhancers in the deleted exons, which may even be on a different chromosome (Birnbaum, 2012).


Functional elements predicted from chromatin-state annotations tend to span even larger regions (e.g., the median length of enhancer states is 600 bp), although the driver nucleotides can be similarly few.


According to the "multiple enhancer variant" hypothesis for GWAS traits, several correlated variants impact multiple enhancers and cooperatively affect gene expression (Corradin, 2014).


Protein-coding genes: Genes that code for peptides. The current human protein-coding gene number is around 20,000. See the latest statistics in GENCODE, HGNC, VEGA and NCBI. As of 2016, there are more non-coding RNA genes than protein-coding genes in the human genome.


Only 1% of the mammalian genome carries protein-coding potential, yet 70 to 90% is transcribed at some point during development, mostly to produce a large transcriptome of long noncoding RNA (lncRNA, defined as RNA > 100 nucleotides in length) (Willingham, 2005; ENCODE Consortium, 2004; Carninci, 2005; Bertone, 2004). Some estimate the total number of lncRNAs to exceed 200,000, whereas others suggest fewer than 10,000. The ENCODE project has revealed an enormous complexity, with ~10 isoforms overlapping any previously annotated gene, thereby challenging the traditional definition of a gene. Although there is now little doubt that pervasive transcription occurs, whether this activity is universally functional is unknown. These transcripts are often poorly conserved, unstable, and/or present in few copies (Lee, 2012).


ENCODE Consortium studies showed that for most GWAS associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the associated (lead) SNP rather than the associated (lead) SNP itself (Schaub, 2012).
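
The linkage disequilibrium between a lead SNP and a candidate functional SNP is commonly summarized as r². The following is a minimal sketch of that calculation from phased haplotypes; the function name `ld_r2` and the haplotype counts are hypothetical and serve only as illustration.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic SNPs.

    `haplotypes` is a list of (lead_allele, functional_allele) pairs,
    each allele coded 0 or 1, one pair per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # allele 1 frequency, lead SNP
    p_b = sum(b for _, b in haplotypes) / n          # allele 1 frequency, functional SNP
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical sample of 10 phased chromosomes.
haps = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)] * 1
print(f"r^2 between lead and functional SNP: {ld_r2(haps):.3f}")
```

An r² close to 1 means the two SNPs are nearly interchangeable in an association test, which is why the association signal alone cannot tell them apart.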


SNPs reported in the GWAS catalog have, on average, a 5% chance of being the causal SNP (Farh, 2014).


GWAS hits are typically distant (median 14 kb) from causal SNPs, and many are not in tight LD with them (Farh, 2014). The genes nearest to GWAS SNPs may frequently be misleading candidates for the target gene (Kirsten, 2015).


As few as 7% of disease-associated SNPs map to protein-coding regions, while most (70-93%) are located in non-coding regulatory regions (Hindorff, 2009; Maurano, 2012; Ernst, 2011). Consistent with this, GWAS risk loci are enriched in genomic regions with many expression quantitative trait loci (eQTLs) and in open chromatin (DHS regions).


Trait-associated SNPs are more likely to be eQTLs (Dixon, 2007; Emilsson, 2008; Dimas, 2009; Nica, 2010; Nicolae, 2010; Montgomery, 2011).


Expression quantitative trait locus (eQTL) mapping offers a powerful approach to elucidate the genetic component underlying altered gene expression. Genetic variation can also influence gene expression through alterations in splicing, noncoding RNA expression and RNA stability.
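
In its simplest form, an eQTL test regresses the expression level of a gene on the genotype dosage (0, 1 or 2 copies of an allele) at a SNP. The sketch below shows only that core step, with made-up data and a hypothetical function name `eqtl_fit`; real analyses also adjust for covariates and correct for multiple testing.

```python
def eqtl_fit(dosages, expression):
    """Least-squares fit of expression on allele dosage.

    Returns (slope, intercept, r_squared). The slope is the change in
    expression per additional copy of the allele; r_squared is the
    fraction of expression variance explained by the SNP.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    s_gg = sum((g - mean_g) ** 2 for g in dosages)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    s_ge = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = s_ge / s_gg
    intercept = mean_e - slope * mean_g
    r_squared = s_ge ** 2 / (s_gg * s_ee)
    return slope, intercept, r_squared


# Hypothetical data: six individuals, genotype dosage and normalized expression.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]

slope, intercept, r_squared = eqtl_fit(dosages, expression)
print(f"effect per allele: {slope:.2f}, variance explained: {r_squared:.2f}")
```

The explained-variance figure is the same kind of quantity that eQTL studies report as an effect size.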


eQTLs can also influence the expression levels of non-coding RNA genes. 75% of the SNPs affecting lincRNA expression (lincRNA cis-eQTLs) are specific to lincRNA alone and do not affect the expression of neighboring protein-coding genes. The specific genotype-lincRNA expression correlation is tissue-dependent.


~20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus (Washietl, 2014).


A prominent role for loci of non-coding RNAs (ncRNAs) and loci of pseudogenes in the regulation of expression of coding genes in humans has been shown in a large study of human peripheral mononuclear cells (Kirsten, 2015).


GWAS signals are enriched with expression quantitative trait loci (eQTLs) in a tissue-specific manner: 50% to 90% of eQTLs are tissue dependent.


44% of genes have cis-eQTLs (Westra, 2013). In peripheral mononuclear cells, there are eQTLs for about 85% of analysed genes, and 18% of genes are trans-regulated. Local eSNPs are enriched up to a distance of 5 Mb from the transcript, which challenges the ranges typically used to define cis-regulation, and the genes nearest to GWAS SNPs may frequently be misleading candidates for the target gene. Dissection of co-localized functional elements indicated a prominent role of SNPs in loci of pseudogenes and non-coding RNAs for the regulation of coding genes (Kirsten, 2015).


17.6% of all genes expressed in mononuclear blood cells were associated with a trans-eSNP and 83.2% of all genes were associated with a cis-eSNP. Conversely, 779 042 (29.7%) of all SNPs were associated with a gene in cis, whereas 38 034 (1.4%) were associated with a gene in trans. After pruning, these observations corresponded to 81 148 (28.4%) cis- and 3800 (1.3%) trans-acting SNPs, respectively. Note that the smallest identified effect sizes with study-wide significance are different for cis- and trans-eQTLs (0.4 and 1.3% explained variance of gene-expression levels, respectively) (Kirsten, 2015).
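
As a quick check on the counts quoted above, the ratio of cis- to trans-acting SNPs and the approximate number of SNPs tested can be recovered with simple arithmetic. The variable names are hypothetical, and the implied total is only approximate because the quoted percentages are rounded.

```python
# Counts and percentages as quoted from Kirsten (2015).
cis_snps, trans_snps = 779_042, 38_034       # before pruning
cis_pruned, trans_pruned = 81_148, 3_800     # after pruning

# How many cis-acting SNPs there are for every trans-acting SNP.
ratio_all = cis_snps / trans_snps
ratio_pruned = cis_pruned / trans_pruned

# Approximate number of SNPs tested, implied by "779 042 (29.7%)".
implied_total = cis_snps / 0.297

print(f"cis:trans ratio before pruning: {ratio_all:.1f}")
print(f"cis:trans ratio after pruning:  {ratio_pruned:.1f}")
print(f"implied number of SNPs tested:  {implied_total:,.0f}")
```

The ratio stays close to 20:1 before and after pruning, so the excess of cis over trans associations is not a consequence of correlated SNPs alone.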


Conservation provides surprisingly little information for predicting eQTL location (Gaffney, 2012).


Only 27% of the distal regulatory elements have an interaction with the nearest promoter (Edwards, 2013) suggesting that the nearest gene is not often the target gene of a GWAS hit. Target genes of a GWAS hit may even be on a different chromosome.
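
The nearest-gene heuristic that this finding calls into question can be written in a few lines. The sketch below uses invented gene names, coordinates and an invented interaction target, purely to show how the nearest gene and an experimentally supported target can disagree.

```python
def nearest_gene(snp_pos, tss_by_gene):
    """Return the gene whose transcription start site is closest to the SNP."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - snp_pos))


# Hypothetical TSS coordinates (bp) on a single chromosome.
tss_by_gene = {"GENE_A": 100_000, "GENE_B": 180_000, "GENE_C": 900_000}
snp_pos = 150_000

# Hypothetical target gene supported by chromatin-interaction or eQTL data.
interaction_target = "GENE_C"

candidate = nearest_gene(snp_pos, tss_by_gene)
print(f"nearest gene: {candidate}, interaction-supported target: {interaction_target}")
if candidate != interaction_target:
    print("The nearest gene is not the supported target in this example.")
```

In practice the lookup would be replaced or supplemented by chromatin-interaction and eQTL evidence.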


Genetic variants that modify chromatin accessibility and transcription factor binding are a major mechanism through which genetic variation leads to gene expression differences among humans (Degner, 2012): As many as 55% of eQTL SNPs are also ‘DNase I sensitivity quantitative trait loci’ (dsQTLs).


There is a complex landscape of long-range gene-element connectivity across ranges of hundreds of kb to several Mb, including interactions among unrelated genes. The 5C results showed that 50 to 60% of long range interactions occur with a high degree of tissue specificity (ENCODE, Nature 2012). Topologically associated domains (TADs) represent matrices of contacts among regulatory regions, and interactions within TADs are different between cell types and are influenced by cellular differentiation as well as environmental conditions.


Non-coding region mutations are also relevant in somatic mutations in cancer cells. Many recurrent somatic cancer variants occur in noncoding regulatory regions and thus might indicate mutations that drive cancer (Khurana, 2013).


The majority of disease-associated missense variants exhibit wild-type chaperone binding profiles, suggesting they preserve protein folding or stability. While common variants from healthy individuals rarely affect interactions, two-thirds of disease-associated alleles perturb protein-protein interactions, with half corresponding to "edgetic" alleles affecting only a subset of interactions while leaving most other interactions unperturbed (Sahni et al, 2015). With transcription factors, many alleles that leave protein-protein interactions intact affect DNA binding. Different mutations in the same gene may lead to different interaction profiles resulting in distinct disease phenotypes. Thus disease-associated alleles that perturb distinct protein activities rather than grossly affecting folding and stability are relatively widespread (Sahni et al, 2015).


Around 15% of human protein-coding genes have an associated natural antisense transcript (NAT) (Lapidot & Pilpel, 2006). Consequences include transcriptional interference, RNA masking and effects on methylation.


Transcription factors (TF) and transcription factor binding sites (TFBS):

There is experimental evidence that transcription factor binding activity differs between humans; that this is due to genetic variation within TF binding sites; and that it results in gene expression level changes (Kasowski, 2010). Variants that disrupt transcription factor binding motifs ("motif-breakers") show evidence of negative selection (Khurana, 2013).
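
One common way to flag a candidate motif-breaker is to score the reference and alternative alleles against a position weight matrix and compare the two scores. The sketch below assumes a made-up four-position frequency matrix and a uniform background; the matrix, the sequences and the function name `motif_score` are all hypothetical.

```python
import math

# Hypothetical base frequencies at each position of a 4-bp binding motif.
PFM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def motif_score(seq, pfm=PFM):
    """Sum of per-position log2 odds (motif frequency versus background)."""
    if len(seq) != len(pfm):
        raise ValueError("sequence length must match motif length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pfm, seq))


ref_seq = "AGCT"  # reference allele G at the second motif position
alt_seq = "ATCT"  # alternative allele T at the same position

delta = motif_score(alt_seq) - motif_score(ref_seq)
print(f"reference score: {motif_score(ref_seq):.2f}")
print(f"alternative score: {motif_score(alt_seq):.2f}")
print(f"score change caused by the variant: {delta:.2f}")
```

A strongly negative change at a high-information position suggests, but does not prove, that the variant weakens binding.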


Chromatin variability shows genetic inheritance in families; correlates with genetic variation and population divergence; and is associated with disruptions of transcription factor binding motifs (Kasowski, 2013).


Approximately 40% of eQTLs occur in open chromatin, and they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions (Gaffney, 2012). Active binding sites for immune-related TFs are among the most highly eQTL-enriched regions in the genome (Gaffney, 2012). Aside from altering the binding of TFs, eQTLs may also perturb gene regulation in more subtle ways, for example by altering the intrinsic nucleosome preferences of the DNA (Gaffney, 2012). eQTNs (expression quantitative trait nucleotides) may also act epigenetically by altering the pattern of DNA methylation, with resulting effects on gene expression (Gaffney, 2012), by altering miRNA expression levels (miR-eQTLs) (Huan, 2015), or through correlation with histone acetylation levels (haQTLs) (del Rosario, 2015).


Less than 10% of high probability causal SNPs alter a transcription factor binding site motif.


Epigenetic modifications of chromatin and DNA relate to TF binding: H3K4me3, H3K9ac and H3K27ac contribute more in regions near TSSs, whereas H3K4me1 and H3K79me2 dominate in regions far from TSSs. DNA methylation plays a more important role close to TSSs than in other regions. In addition, epigenetic modification models for predicting TF binding affinities are cell line-specific (Liu, 2015).


CpG methylation recruits sequence specific transcription factors essential for tissue specific gene expression (Chatterjee & Vinson, 2012). Direct and selective methylation of certain TFBS that prevents TF binding is restricted to special cases and cannot be considered as a general regulatory mechanism of transcription (Medvedeva, 2014).


Binding differences were frequently associated with single-nucleotide polymorphisms and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of binding variation (Kasowski, 2010).


With transcription factors, many disease-causing missense alleles (that leave protein-protein interactions intact) affect DNA binding (Sahni et al, 2015).


Evolutionarily conserved regions can often provide valuable information on the location of regulatory elements (Gaffney, 2012) although Lee et al reported somewhat different results (Lee, 2014). Conservation is strongly negatively correlated with distance from the transcription start site (TSS).


A single miRNA can potentially regulate hundreds of different genes, but each only on a small scale of 2-fold or less.


miRNA effects may depend on cell type and biological context (example: miR-155-dependent regulation of SOCS1 in different immune cell subsets) (Lu LF, 2015).


A gene may be involved in disease pathogenesis even if none of its SNPs show any association: an intricate transcriptional connectivity between Crohn disease (CD) susceptibility genes and interferon-γ, a key effector in CD, has been described despite the absence of CD-associated SNPs in the IFNG locus (Kumar, 2015).


The phenotypic consequences of expression quantitative trait loci (eQTLs) are presumably due to their effects on protein expression levels. Yet the impact of genetic variation, including eQTLs, on protein levels remains poorly understood. To address this, Battle et al. mapped genetic variants associated with transcript levels (eQTLs), ribosome occupancy (rQTLs), or protein abundance (pQTLs). They found that most QTLs are associated with transcript expression levels, with consequent effects on ribosome and protein levels. However, eQTLs tend to have significantly reduced effect sizes on protein levels, which suggests that their potential impact on downstream phenotypes is often attenuated or buffered. Additionally, they identified a class of cis QTLs that affect protein abundance with little or no effect on messenger RNA or ribosome levels, which suggests that they may arise from differences in posttranslational regulation (Battle, 2015).


Epigenetics is defined as stable and heritable patterns of gene expression that do not entail any alterations to the original DNA sequence.

Epigenetic mechanisms primarily consist of:

·       DNA methylation

·       post-translational histone modifications

·       small or long non-coding RNA transcripts

·       spatial chromatin interactions


Post-transcriptional regulation is highly versatile and adaptable by controlling RNA availability in cellular time and space. Messenger RNA stability, transport, storage and translation are largely determined by the interaction of mRNA with microRNAs (miRNAs) and RNA-binding proteins (RBPs).


DNA methylation is more prevalent in regions of low CpG density, and correlates with repression of transcription when it affects promoter regions, but may increase transcription if it occurs within the gene. Factors that influence CpG methylation include chromatin accessibility, DNase I footprints, transcription factor levels and CTCF binding. In cancer, most hypermethylated sites are in microRNA coding regions, and never-methylated sites are predominantly in CpG islands (Ghorbani, 2016).
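
CpG density is usually quantified by GC content together with the ratio of observed to expected CpG dinucleotides, where the expected count is (number of C × number of G) / sequence length. The sketch below computes both for toy sequences; the function name `cpg_stats` and the sequences are hypothetical. The often-used island criteria (length of at least 200 bp, GC content above 50%, observed/expected above 0.6) are not part of the notes above and are mentioned only for orientation.

```python
def cpg_stats(seq):
    """Return (gc_content, observed_to_expected_cpg) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count == 0 or g_count == 0:
        return gc_content, 0.0
    # Observed CpG count divided by the expected count (C * G / N).
    obs_exp = cpg_count * n / (c_count * g_count)
    return gc_content, obs_exp


# Toy sequences, far shorter than any real CpG island.
cpg_rich = "CGCGCGCG"
cpg_poor = "CATGCATGCATG"

for name, seq in (("CpG-rich", cpg_rich), ("CpG-poor", cpg_poor)):
    gc, ratio = cpg_stats(seq)
    print(f"{name}: GC content {gc:.2f}, CpG observed/expected {ratio:.2f}")
```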


Methylation QTLs (meQTLs) are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels (Banovich, 2014).


The majority of transgenerational similarity in DNA methylation is attributable to genetic effects of cis meQTLs. Approximately 20% of individual differences in DNA methylation in the population are caused by DNA sequence variation that is not located within CpG sites (McRae, 2014).


Of the ~20,500 protein-coding genes in humans, 7.5% are directly involved in RNA metabolism by binding to and/or processing RNA, or by constituting essential components of RNPs (Gerstberger, Hafner & Tuschl, 2014).


CCCTC-binding factor (CTCF) is a ubiquitously expressed and essential protein, and is, in many ways, an exceptional transcription factor. It was first described as a transcriptional repressor, but was also found to act as a transcriptional activator. Most strikingly, it harbours insulator activity: when positioned between an enhancer and a gene promoter, it can block their communication and prevent transcriptional activation (Holwerda & de Laat, 2015). Systematic chromatin immunoprecipitation experiments combined with high-throughput sequencing (ChIP-seq) have been performed to map CTCF binding events across the genome in many tissues of different species. They show that the genome is covered with a myriad of CTCF binding sites. More than most other transcription factors, CTCF appears to bind to intergenic sequences, often at a distance from the transcriptional start site (TSS). CTCF was one of the first proteins demonstrated to mediate chromatin looping between its binding sites. Further evidence for its role in the organization of genome structure comes from observations that it frequently binds to boundaries between chromosomal regions that occupy distinct locations in the nucleus, to boundaries between regions with different epigenetic signatures and/or different transcriptional activities, and to boundaries between recently identified topological domains, which are spatially defined chromosomal units within which sequences preferentially interact with each other.


CTCF is a highly conserved zinc finger protein implicated in diverse regulatory functions, including transcriptional activation/repression, insulation, imprinting, and X chromosome inactivation. CTCF-mediated intra- and inter-chromosomal contacts are crucial at several developmentally regulated genomic loci. A primary role for CTCF in the global organization of chromatin architecture is well documented. CTCF may be a heritable component of an epigenetic system regulating the interplay between DNA methylation, higher-order chromatin structure, and lineage-specific gene expression (Phillips & Corces, 2009).



Mehmet Tevfik Dorak, M.D., Ph.D.



16 April 2017