Genome Biology Genetics Biostatistics HLA MHC Glossary Homepage

Genome Biology Notes for Genetic Epidemiologists

The vast non-coding landscape of the human genome plays crucial roles in genome biology (Hangauer, 2013; ENCODE, 2012) and, subsequently, in disease development (Edwards, 2013; Visscher, 2017).

The ENCODE project (Bernstein, 2012) provided a wealth of genomic landmarks that were systematically integrated to segment the genome into seven major classes: transcription start site and predicted promoter region (TSS) (1.2%), predicted promoter flanking region (PF) (0.7%), predicted enhancer (E) (1.8%), predicted weak enhancer (WE) (2.5%), CTCF-enriched element (0.1%), predicted transcribed region (T) (19.3%) and predicted repressed or low-activity region (R) (69.6%).
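
As a quick arithmetic check on the percentages quoted above, the following sketch sums the seven classes. The grouping of the first five classes as "regulatory" is an assumption made here for illustration only and is not an ENCODE definition.

```python
# Genome fractions (percent) for the seven segmentation classes, as quoted in the text.
segments = {
    "TSS": 1.2,   # transcription start site / predicted promoter
    "PF": 0.7,    # predicted promoter flanking region
    "E": 1.8,     # predicted enhancer
    "WE": 2.5,    # predicted weak enhancer
    "CTCF": 0.1,  # CTCF-enriched element
    "T": 19.3,    # predicted transcribed region
    "R": 69.6,    # predicted repressed or low-activity region
}

# Total genome fraction covered by the seven classes.
total = sum(segments.values())

# Illustrative grouping (an assumption): promoter, enhancer and CTCF classes together.
regulatory = sum(segments[k] for k in ("TSS", "PF", "E", "WE", "CTCF"))

# Fraction left unassigned on these figures.
unassigned = 100.0 - total

print(f"total={total:.1f}% regulatory={regulatory:.1f}% unassigned={unassigned:.1f}%")
```

On these figures the seven classes cover about 95% of the genome, and the promoter, enhancer and CTCF classes together account for only about 6%.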

DNase hypersensitivity sites (DHS): Hypersensitivity sites are those areas on the DNA where there is no interference from histone molecules. ENCODE identified 200,000 sites per cell type. It also identified three million places where transcription factors bind to these 200,000 open sites.

DNaseI footprinting detects DNA sequences that are protected from cleavage by DNaseI because they are bound by regulatory factors.

DNase hypersensitivity sites (DHSs) span less than a quarter of GWAS hits, but account for more than three quarters of heritability explained by GWAS (Gusev, 2014). As many as 55% of eQTLs are also ‘DNaseI sensitivity quantitative trait loci’ (dsQTLs) (Degner, 2012). Enrichment increases for enhancer DHSs and cell-type-specific DHSs. 15 to 36% of trait-associated loci map to DHSs independently of other annotations (Trynka, 2015). Variants associated with height are highly enriched in embryonic stem cell DHSs (Trynka, 2015).

eQTLs correlate with gene expression levels. A substantial proportion of causative eQTLs (approximately 40%) are in regions of active chromatin (open chromatin, DNaseI footprints and ChIP-seq peaks), and active binding sites for immune-related transcription factors (Jun-D, NFkB) are among the most highly eQTL-enriched regions (Gaffney, 2012). As many as 55% of eQTLs are also ‘DNaseI sensitivity quantitative trait loci’ (dsQTLs) (Degner, 2012).

Enhancers are one of the seven major genomic landmarks involved in gene regulatory activities as described by the ENCODE project. They are distal cis-regulatory elements that carry sequence information for transcription factor binding, regulate gene expression regardless of location and orientation (including trans effects), and control tissue-specific gene expression. Enhancers can be recognized by their DNaseI sensitivity, methylation status and unique histone modifications. They are usually located in intergenic regions, but may also be in exons. In mice, global chromatin connectivity maps reveal approximately 40,000 long-range promoter-enhancer interactions (Zhang, 2013). In human primary blood cells, there are on average 175,000 high-confidence promoter-enhancer interactions per cell type, with a median of four interactions per promoter fragment per cell type (Javierre, 2016). Most enhancers that associate with promoters are located beyond their nearest genes (less than 10% of enhancers interact with the promoter of the nearest gene) (Sanyal, 2012; Gorkin, 2014; Mifsud, 2015; Schoenfelder, 2015); thus, linear juxtaposition is not the only guiding principle driving enhancer target selection (Zhang, 2013). Similarly, in 17 human hematopoietic primary cell lines, promoter interactions are highly cell type-specific and enriched for links between active promoters and epigenetically marked enhancers (Javierre, 2016). Most target genes are located within 1 Mb of their enhancers (Bickmore & van Steensel, 2013; Dekker & Mirny, 2016). Promoter-enhancer interactions exhibit high cell-type specificity in human hematopoietic cells (Javierre, 2016) and mouse stem cells (Zhang, 2013). Methylation levels also influence enhancer activity. As an example, breast cancer GWAS hits are enriched for enhancer methylation sites correlated with intertumor expression variation (Aran, 2013). Methylation at many enhancer sites predicts gene expression levels better than promoter methylation does.
Promoter-interacting regions are enriched for regulatory chromatin features and eQTLs, which link non-coding GWAS variants with putative target genes (Javierre, 2016).

Gene set enrichment analysis (GSEA) is used to identify the pathways that are enriched among the genes involved in disease pathogenesis (Subramanian, 2005). On average, about 50% of the genes associated with a disease are statistically mapped to pathways (Li & Agarwal, 2009). For a review on the use of pathway/GSEA analysis of GWAS data and an example, see Mooney, 2014 and Farashi, 2020 (also see the i-Gsea4Gwas online tool).
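
The idea of pathway enrichment can be illustrated with a simple over-representation test based on the hypergeometric distribution. This is a deliberately simpler stand-in and not the rank-based GSEA statistic of Subramanian (2005); the function name and all gene counts below are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric.

    N: genes in the background, K: background genes in the pathway,
    n: disease-associated genes, k: disease genes that fall in the pathway.
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the upper tail exactly with integer arithmetic, then divide once.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# 200 disease genes, 5 of which belong to the pathway (1 expected by chance).
p_value = hypergeom_sf(5, 20000, 100, 200)
print(f"over-representation P = {p_value:.3g}")
```

In practice the resulting P values would need correction for the number of pathways tested.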

Tissue-specificity of enhancers and nearby gene expression increase with enhancer length; neighborhoods containing stretch enhancers are enriched for important cell type-specific genes; and GWAS variants associated with traits relevant to a particular cell type are more enriched in stretch enhancers compared with short enhancers (Parker, 2013).

Transcription factor binding to enhancer elements is critical for gene expression regulation. Enhancers are often found in noncoding sequences in close proximity to the gene that they regulate, but may even be on another chromosome. In mice, almost 7% of enhancers overlap coding exons. Thus, exonic sequences may also function in the regulation of nearby genes. An implication of this finding is that phenotypes seen in genetic knockout animals may be the result of not only the lack of expression of the deleted gene, but also alterations in the expression of genes that are regulated by enhancers in the deleted exons, which may even be on a different chromosome (Birnbaum, 2012).

Functional elements predicted from chromatin-state annotations tend to span even larger regions (e.g., the median length of enhancer states is ∼600 bp, although the driver nucleotides can be few).

According to the “multiple enhancer variant” hypothesis for GWAS traits, combinatorial effects of multiple enhancer variants in linkage disequilibrium impacting multiple enhancers cooperatively affect gene expression to mediate susceptibility to common traits (Corradin, 2014).

The chromatin state at promoters and CTCF-binding at insulators is generally invariant across cell types, but enhancers are marked with cell-type-specific histone modification patterns, correlate to cell-type-specific gene expression programs, and are functionally active in a cell-type-specific manner (Heintzman, 2009).

The genes with high regulatory loads (i.e., enhancer load which correlates with transcription factor load) are enriched for disease associations (Galhardo, 2015). The high enhancer load genes reveal an enrichment for disease-associated genes in a cell type-selective manner, carry longer 3' UTRs with higher miRNA binding sites, and are involved in more KEGG pathways than average (Galhardo, 2015).

Epigenetics is defined as stable and heritable patterns of gene expression that do not entail any alterations to the original DNA sequence.

Epigenetic mechanisms primarily consist of

· DNA methylation

· post-translational histone modifications

· small or long non-coding RNA (including microRNA/miR or lncRNA) transcripts

· spatial chromatin interactions

Post-transcriptional regulation is highly versatile and adaptable by controlling RNA availability in cellular time and space. Messenger RNA stability, transport, storage and translation are largely determined by the interaction of mRNA with microRNAs (miRNAs) and RNA-binding proteins (RBPs).

DNA methylation is more prevalent in regions of low CpG density (Bock & Lengauer, 2008) and correlates with repression of transcription when it affects promoter regions, but may increase transcription if it occurs within the gene (Jones PA, 1999). Factors that influence CpG methylation include chromatin accessibility, DNaseI footprinting, transcription factor levels and CTCF binding (Gebhard, 2010; ENCODE, Nature 2012). Methylation sites associated with gene expression levels are enriched in enhancers, gene bodies and CpG island shores. The correlation between DNA methylation and gene expression can be positive or negative, and it is consistent across cell types. The association of DNA methylation with gene expression appears more tissue-specific than the genetic effects on gene expression or DNA methylation (eQTL or meQTL effects) (Gutierrez-Arcelus, 2015). In cancer, most hypermethylated sites are in microRNA coding regions, and never-methylated sites are predominantly in CpG islands (Ghorbani, 2016).

Methylation QTLs (meQTLs) are more likely to reside at regulatory elements and are associated with variation in other properties of gene regulation, including transcription factor binding, chromatin conformation, histone modifications, DNaseI accessibility, RNA splicing, and gene expression levels (Banovich, 2014; Zhang, 2010; Maurano, 2012). meQTLs are frequently associated with methylation levels at multiple CpG sites across regions of up to 3 kb (Banovich, 2014). There is a significant overlap of SNPs that are associated with both methylation and gene expression levels (i.e., between meQTLs and eQTLs) (Bell, 2011; Lemire, 2015; Lin, 2018). At least a quarter, and up to 69%, of all meQTLs are tissue-specific (Smith, 2014; Andrews, 2017; Lin, 2018).

Over 80% of genetic variants at CpG sites (meSNPs) are meQTL loci, and meSNPs account for over two-thirds of the strongest meQTL signals (Zhi, 2013).

The majority of transgenerational similarity in DNA methylation is attributable to genetic effects of cis-meQTLs. Approximately 20% of individual differences in DNA methylation in the population are caused by DNA sequence variation that is not located within CpG sites (McRae, 2014). cis-meQTLs mostly localize to CpG sites outside of genes, promoters and CpG islands (CGIs), while trans-meQTLs are over-represented in promoter CGIs. meQTL SNPs are enriched in CTCF-binding sites (because CTCF binding is CpG methylation-sensitive), DNaseI hypersensitivity regions and histone marks (Shi, 2014). cis-meQTLs are more likely to localize at distant regulatory elements than at promoters (Banovich, 2014; Gutierrez-Arcelus, 2015; Lin, 2018). meQTL-targeted CpG sites tend to be in CpG island shores (Zhang, 2010; Gutierrez-Arcelus, 2015; Lin, 2018).

Examination of the enrichment of 5,654 noncoding disease-associated SNPs from the GWAS Catalog (5,134 unique SNPs) showed 1.6-fold enrichment for meQTLs (Liu, 2014).

Trans-meQTLs make up a small percentage of known meQTLs (2-7%), but are highly polygenic (Hannon, 2016; Gaunt, 2016).

Methylation sites associated with expression levels are enriched in enhancers, gene bodies, and CpG island shores (Gutierrez-Arcelus, 2015). DNA methylation-mediated (epigenetic) effects on gene expression are more tissue-specific than genetic effects on gene expression (eQTL-mediated) or methylation levels (meQTL-mediated) (Gutierrez-Arcelus, 2015). Not only genetic effects (sequence variation) but also DNA methylation correlates with alternative splicing, and both do so in a tissue-specific manner (Gutierrez-Arcelus, 2015).

Histones are proteins that act to package DNA into chromatin. The methylation of lysine 4 on histone H3 (H3K4) is correlated with active gene expression in eukaryotes. In mammals, H3K4 is methylated by the MLL family of histone methyltransferases (HMTs).

Chromatin/histone marks highlighting active gene regulation are phenotypically cell type-specific (Trynka, 2013). Loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies (Trynka, 2013).

Long range chromatin interactions constitute a primary mechanism for regulating transcription in mammalian genomes (Fullwood, 2009; Sanyal, 2012). Almost half of transcription start sites display one or more long-range interactions, with some interacting with as many as 20 distal fragments (Sanyal, 2012). CCCTC-binding factor (CTCF) is a highly conserved zinc finger protein, which was one of the first transcription factors shown to mediate chromatin looping between its binding sites (Splinter, 2006; Handoko, 2011). It frequently binds to intergenic sequences, often at a distance from the transcriptional start sites, and activates gene expression via intra- and inter-chromosomal chromatin looping (Kim, 2007; Phillips & Corces, 2009; Kim, 2015; Holwerda & de Laat, 2013). Estrogen receptor alpha (ESR1) is another transcription factor (whose binding sites are far from TSSs) functioning by extensive chromatin looping to bring (multiple) genes together for coordinated transcriptional regulation (Fullwood, 2009; Sanyal, 2012). Overall, 50 to 60% of all long range chromatin interactions occur in only one of the four cell lines, indicative of a high degree of tissue-specificity for gene-element connectivity (ENCODE, 2012).

microRNAs (miRNAs) originate from long stem-loop containing primary transcripts (pri-miRNAs) that are generally transcribed by RNA polymerase II. pri-miRNAs are substrates of the RNase III enzyme Drosha and its binding partner, the dsRNA-binding protein DGCR8/Pasha, which cleave pri-miRNAs into ∼70 nt precursor hairpins (pre-miRNAs) in the nucleus for export to the cytoplasm. In the cytoplasm, the pre-miRNA is further cleaved by another RNase III enzyme, Dicer, into a mature miRNA and its partner strand, the miRNA* (microRNA star). The mature miRNA is defined as the strand that is loaded into the RNA-Induced Silencing Complex (RISC). The mature miRNA identifies its mRNA targets by binding to partially complementary sites within their 3′ UTRs, resulting in mRNA degradation and translational repression (Filipowicz, 2008; Selbach, 2008; Bartel, 2009; Hendrickson, 2009; Chekulaeva & Filipowicz, 2009; Krol, 2010). Almost 60% of miRNAs derive from recycled, spliced out introns, and the rest map to intergenic areas (Hesselberth, 2013).
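
Target recognition by partial complementarity is often approximated computationally by looking for perfect matches to the miRNA "seed" (nucleotides 2-8). The sketch below assumes that simplification, which is not spelled out in the text above; the function name and both sequences are invented toy examples, not real miRNA or UTR sequences.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna, utr):
    """Return 0-based start positions in `utr` (written 5'->3') of perfect
    matches to the reverse complement of miRNA nucleotides 2-8."""
    seed = mirna[1:8]                        # nucleotides 2-8, 1-based
    site = seed.translate(COMPLEMENT)[::-1]  # reverse complement of the seed
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


# Toy sequences (hypothetical, for illustration only).
toy_mirna = "UACCGGAUUCAGCUAGGCAUGA"
toy_utr = "GGCAUCCGGUAACCAUCCGGUGG"

print(seed_sites(toy_mirna, toy_utr))
```

Real target prediction tools also weigh site context and conservation, so a seed match alone is only a candidate site.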

Up to 60% of genes in the human genome are regulated by miRNAs (Lewis, 2005; Chekulaeva & Filipowicz, 2009; Arora, 2013); this estimate has now gone up to more than 60% (Friedman, 2018). One genome-wide bioinformatics study annotated more than 45,000 conserved miRNA binding sites in the 3′ UTRs of 60% of human genes (Friedman, 2009). A single miRNA can potentially regulate hundreds of different genes, although typically on a small scale of 2-fold or less (Xiao, 2007; Baek, 2008; Selbach, 2008). For example, more than 600 distinct mRNA targets were identified following miR-124 overexpression in the breast cancer cell line MCF-7 (Hendrickson, 2009). miRNA effects may be cell type-specific and dependent on biological context (example: miR-155-dependent regulation of SOCS1 in different immune cell subsets) (Lu LF, 2015).

Long ncRNAs (lncRNAs) do not encode proteins and are >200 nucleotides long (and may be several thousand nucleotides long); they are transcribed by RNA polymerase II, similarly to mRNAs. lncRNAs have little or no protein-coding capacity, but are abundantly expressed in a developmentally regulated and tissue-specific manner (Mercer, 2008). They are involved in a wide range of cellular functions, including epigenetic silencing, transcriptional regulation, and RNA processing/modification (Mercer, 2009). lncRNAs may influence the expression of neighboring genes (in cis) at the transcriptional, post-transcriptional and translational levels. lncRNAs, like miRNAs, have multiple targets. HOTAIR, for example, has around 800 targets (Gupta, 2010; Vance & Ponting, 2014). An average of 10 transcription units, the vast majority of which make lncRNAs, may overlap each traditional coding gene (Lee, 2012). Approximately 20% of human lncRNAs are not expressed in species beyond chimpanzee and are undetectable even in rhesus (Washietl, 2014).

There are at least 69 human complex trait/disease-associated lncRNAs in LCLs. These loci are often associated with cis-regulation of gene expression and tend to be localized at TAD boundaries, suggesting that these lncRNAs may influence chromosomal architecture (Tan, 2017). In terms of histone marks, lncRNAs can be identified by the presence of K4-K36 domains in intergenic regions (Guttman, 2009; Khalil, 2009).

Protein-coding genes are traditional genes that code for peptides. The current human protein-coding gene number is around 20,000. See the latest statistics in GENCODE, HGNC, VEGA and NCBI. As of 2015, there were already more non-coding RNA genes than protein-coding genes in the human genome.

Of the ~20,500 protein-coding genes in humans, 7.5% are directly involved in RNA metabolism by binding to and/or processing RNA, or by constituting essential components of RNPs (Gerstberger, Hafner & Tuschl, 2014).

ENCODE Consortium studies showed that for most GWAS associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the associated (lead) SNP rather than the lead SNP itself (Schaub, 2012). Likewise, in autoimmune disorders, SNPs reported in the GWAS catalog have on average a 5% chance of being the causal SNP (Farh, 2014). GWAS hits are typically distant (median 14 kb) from causal SNPs, and many are not in tight linkage disequilibrium (Farh, 2014).

Only a small percentage (7%) of GWAS hits are located in protein-coding regions, while the remaining 93% are located in gene regulatory regions or in intergenic regions (Hindorff, 2009). In general, subsequent studies also showed that 70-90% of GWAS hits are in non-coding regulatory regions (Maurano, 2012; Ernst, 2011).

The majority of GWAS hits, as well as loci with sub-genome-wide significance (P values between 1 × 10⁻⁵ and 5 × 10⁻⁸), localize to non-coding genomic regions with many gene regulatory signals, suggesting that most trait/disease causal SNPs exert their phenotypic effects by altering gene expression (Maurano, 2012; Schaub, 2012; Croteau-Chonka, 2015). These observations are further supported by GWAS hits being enriched in genomic regions with many expression quantitative trait loci (eQTLs) and open chromatin (Nicolae, 2010; GTEx Consortium, 2015; Finucane, 2015; Gusev, 2014), and in promoter-interacting regions (Javierre, 2016). GWAS hits are enriched with eQTLs in a tissue-specific manner (Dimas, 2009; Nicolae, 2010).
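
The two significance bands mentioned above can be expressed as a small classifier. This is a minimal sketch: the thresholds are the ones quoted in the text, while the function name, the label strings and the handling of the exact boundary values are choices made here for illustration.

```python
GENOME_WIDE = 5e-8  # conventional genome-wide significance threshold
SUGGESTIVE = 1e-5   # upper bound of the sub-genome-wide band quoted in the text


def classify_p(p):
    """Assign a GWAS P value to one of three illustrative bands."""
    if p < GENOME_WIDE:
        return "genome-wide significant"
    if p <= SUGGESTIVE:
        return "sub-genome-wide"
    return "not significant"


# Hypothetical P values for three SNPs.
for p in (3e-12, 4e-7, 0.02):
    print(p, classify_p(p))
```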

Most GWAS hits for complex disorders yield allelic (additive model) ORs between 1.1 and 1.5 (Lohmueller, 2003; Ioannidis, 2003).
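
For orientation, an allelic odds ratio can be computed directly from allele counts in cases and controls. The sketch below uses the standard cross-product ratio with a Woolf (log-scale) 95% confidence interval; the function name and the allele counts are hypothetical and not taken from any study cited here.

```python
from math import exp, log, sqrt


def allelic_or(case_alt, case_ref, ctrl_alt, ctrl_ref, z=1.96):
    """Return (odds ratio, lower CI bound, upper CI bound) from allele counts."""
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Standard error of log(OR) by Woolf's method.
    se = sqrt(1 / case_alt + 1 / case_ref + 1 / ctrl_alt + 1 / ctrl_ref)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: 2,000 cases and 2,000 controls, i.e. 4,000 alleles each.
or_value, ci_low, ci_high = allelic_or(1320, 2680, 1160, 2840)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

With these made-up counts the odds ratio falls inside the 1.1 to 1.5 range described above.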

A gene may be involved in disease pathogenesis even if none of its SNPs show any association (“an intricate transcriptional connectivity between Crohn disease (CD) susceptibility genes and interferon-γ, a key effector in CD, despite the absence of CD-associated SNPs in the IFNG locus” has been described (Kumar, 2015)). Likewise, the sex differential in autism may not be due to any of the risk loci having sex-specific effects, but their interaction partners may be the reason for the sex effect (Werling, 2016).

In contrast to earlier expectations (Zuk, 2012) and examples in plants (Lachowiec, 2017) and model animals (Gibert, 2017), non-additive (dominance) effects, gene x gene (epistasis) interactions and gene x environment interactions do not appear to contribute to missing heritability for complex human traits (Aschard, 2012; Zhu, 2015; Nolte, 2017; Visscher, 2017), with the exception of HLA-linked (Lenz, 2015; Wei, 2016) or unlinked (Galarza-Munoz, 2017) autoimmune disorder associations. The missing heritability has been thought to be likely due to rare alleles that are not included in common genotyping chips (Zaitlen, 2013), but, in schizophrenia, rare coding variants do not contribute to heritability (Gusev, 2014). At least for height, methylation profiles account for almost no variation (Shah, 2015).

eQTL mapping offers a powerful approach to elucidate the genetic component underlying altered gene expression (Gilad, 2008). Genetic variation can also influence gene expression through alterations in splicing, noncoding RNA expression and RNA stability (Pickrell, 2010a; Pickrell, 2010b; Borel, 2011).
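
At its simplest, an eQTL test regresses a gene's expression level on genotype dosage (0, 1 or 2 copies of the alternative allele). The following sketch fits that regression by ordinary least squares and reports the proportion of expression variance explained. It assumes an additive model with no covariates, and the function name and data are hypothetical; real analyses adjust for covariates and multiple testing.

```python
def eqtl_fit(dosages, expression):
    """Least-squares fit of expression on genotype dosage.

    Returns (slope per allele, intercept, r-squared).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # share of expression variance explained
    return beta, intercept, r_squared


# Hypothetical data: six individuals, two per genotype class.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.6, 1.4, 2.1, 1.9]

beta, intercept, r_squared = eqtl_fit(dosages, expression)
print(f"beta={beta:.2f} intercept={intercept:.2f} r2={r_squared:.2f}")
```

The r-squared value is the same quantity as the "explained variance of gene-expression levels" used to describe eQTL effect sizes.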

Trait-associated SNPs are more likely to be eQTLs (Dixon, 2007; Emilsson, 2008; Dimas, 2009; Nica, 2010; Nicolae, 2010; Montgomery, 2011; Richards, 2012; Stranger, 2013; Dubois, 2010; Zhong, 2010; Croteau-Chonka, 2015). eQTLs drive gene-expression changes through DNaseI hypersensitive sites (DHSs) near transcription start sites, and independently through 3' UTR regulation (Trynka, 2015). A twin study found that a substantial proportion of gene expression heritability is trans to the structural gene (Grundberg, 2012).

eQTLs can also influence the expression levels of non-coding RNA genes. 75% of the SNPs affecting lncRNA expression (lncRNA cis-eQTLs) are specific to lncRNA alone and do not affect the expression of neighboring protein-coding genes. The specific genotype-lincRNA expression correlation is tissue-dependent (Kumar, 2013).

GWAS hits are enriched with eQTLs in a tissue-specific manner (Dimas, 2009; Nicolae, 2010). 50% to 90% of all eQTLs are tissue-dependent (Petretto, 2006; Dimas, 2009; Nica, 2011). In one study, 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. Even among the shared eQTLs, 10 to 20% have significant differences in the magnitude of fold change between genotypic classes across tissues (Nica, 2011). These earlier results are being revised downward in recent studies (Peters, 2016). The majority of eQTLs detectable in one study are shared among multiple cell types (Flutre, 2013). Many eQTLs have been found to affect monocyte gene expression in a stimulus- or time-specific manner (Fairfax, 2014). In a study that used a Mendelian randomization approach, associations of common genetic variants in 57 GWAS and 24 studies of expression quantitative trait loci (eQTLs) from a broad range of tissues were integrated yielding 3,484 instances of gene-trait-associated changes in expression (FDR<0.05). These genes were often not the nearest genes to the genetic variant (Hauberg, 2017).

The expression of 4.4% of genes (eQTL probes) in different tissues correlates with the same SNP but with opposite allelic direction. SNPs that are located in transcriptional regulatory elements are enriched for tissue-dependent regulation, including SNPs at 3′ and 5′ untranslated regions and SNPs that are synonymous-coding. SNPs that are associated with complex traits more often exert a tissue-dependent effect on gene expression (Fu, 2012).

In eQTL studies, around one-third of all correlations are tissue-dependent, and eQTLs that are shared among multiple tissues show variable effect sizes (Nica, 2011; Fu, 2012) and even differences in direction (Fu, 2012). 14.5% of genes (eQTL probes) have different (and independent) eQTLs depending on the tissue (Fu, 2012). These observations are not surprising given that only 6.0% of all genes are ubiquitously expressed in all tissues examined (n=46) and only 3.1% of genes were expressed in only one of the 46 tissues examined (Su, 2002). Less than 10% of eQTLs yield discordant results between blood and a specific tissue if they show correlations in multiple tissues (Su, 2002).

CNVs are more likely to be eQTLs than SNPs (Bryois, 2014).

A prominent role for loci of non-coding RNAs (ncRNAs) and loci of pseudogenes in the regulation of expression of coding genes in humans has been shown in a large study of human peripheral mononuclear cells (Kirsten, 2015).

17.6% of all genes expressed in mononuclear blood cells were associated with a trans-eSNP and 83.2% of all genes were associated with a cis-eSNP. Conversely, 779,042 (29.7%) of all SNPs were associated with a gene in cis, whereas 38,034 (1.4%) were associated with a gene in trans. After pruning, these observations corresponded to 81,148 (28.4%) cis- and 3,800 (1.3%) trans-acting SNPs, respectively. The smallest identified effect sizes with study-wide significance are different for cis- and trans-eQTLs (0.4 and 1.3% explained variance of gene-expression levels, respectively) (Kirsten, 2015).

Because most eQTL studies measure the steady-state level of mRNAs, they cannot distinguish between changes in transcript production and decay rates. In a study of 16,000 genes in 70 Yoruban HapMap lymphoblastoid cell lines, ~10% of genes showed correlations between steady-state expression level and variation in decay rate, and 195 loci, dubbed RNA decay quantitative trait loci (RdQTLs), were identified as being associated with this variation (Pai, 2012).

In a meta-analysis of blood eQTLs, 44% of genes have cis-eQTLs (Westra, 2013). In peripheral mononuclear cells, there are eQTLs for about 85% of analysed genes, and 18% of genes are trans-regulated. Local eSNPs are enriched up to a distance of 5 Mb from the transcript, which challenges the ranges typically used to define cis-regulation, and the nearest genes to GWAS hits might frequently be misleading as functional candidates for the target genes. Dissection of co-localized functional elements indicated a prominent role of SNPs in loci of pseudogenes and non-coding RNAs for the regulation of coding genes (Kirsten, 2015).

The HLA region contains almost half of all known trans-eQTLs in humans (Fehrmann, 2011; Fairfax, 2012). A meta-analysis of all blood eQTL studies concluded that the region is ten-fold enriched for trans-eQTLs (Westra, 2013). However, not all studies have observed trans-eQTL hotspots in the HLA region (Yao, 2017; Joehanes, 2017). There are many trans-eQTL hotspots in the human genome (Schramm, 2014; Brynedal, 2017; Yao, 2017; Joehanes, 2017).

The mechanism of trans-eQTL effect appears to be cis-mediation (Pierce, 2014; Yao, 2017) with miRNAs being one of the mediators (Joehanes, 2017). The expression of around 15% of transcripts is regulated by a significant two-locus interaction (P = 3E-144), but gene-mediated trans-effects are not a major source of epistasis (Becker, 2012).

Evolutionarily conserved regions can often provide valuable information on the location of regulatory elements (Pennacchio, 2006; Prabhakar, 2006) although different results have also been reported (Lee, 2014). Conservation is strongly negatively correlated with distance from the transcription start site (TSS), and functional intronic sites tend to accumulate toward the 5′ end of genes at least in murids (Gaffney & Keightley, 2006). However, conservation provides surprisingly little information for predicting eQTL location (Gaffney, 2012).

Only 27% of the distal regulatory elements have an interaction with the nearest gene’s promoter (Sanyal, 2012) suggesting that the nearest gene is not often the target gene of a GWAS hit. Target genes of a GWAS hit may even be on a different chromosome. Nearest genes are not the most common targets of eQTLs (Wang, 2013).

Genetic variants that modify chromatin accessibility and transcription factor binding are a major mechanism through which genetic variation leads to gene expression differences among humans (Khurana, 2013; Degner, 2012). As many as 55% of eQTLs are also ‘DNaseI sensitivity quantitative trait loci’ (dsQTLs) (Degner, 2012).

The phenotypic consequences of eQTLs are presumably due to their effects on protein expression levels, but the impact of genetic variation, including eQTLs, on protein levels has not been widely examined. A combined analysis of genetic variants that are associated with eQTLs, ribosome occupancy (rQTLs), or protein abundance (pQTLs) revealed that most QTLs are associated with transcript expression levels, with consequent effects on ribosome and protein levels. eQTLs tend to have significantly reduced effect sizes on protein levels. A new class of cis QTLs affect protein abundance with little or no effect on messenger RNA or ribosome levels, which suggests that they may arise from differences in posttranslational regulation (Battle, 2015).

The majority of disease-associated missense variants exhibit wild-type chaperone binding profiles, suggesting they preserve protein folding or stability. While common variants from healthy individuals rarely affect interactions, two-thirds of disease-associated alleles perturb protein-protein interactions, with half corresponding to “edgetic” alleles affecting only a subset of interactions while leaving most other interactions unperturbed (Sahni, 2015). With transcription factors, many alleles that leave protein-protein interactions intact affect DNA binding. Different mutations in the same gene may lead to different interaction profiles resulting in distinct disease phenotypes. Thus, disease-associated alleles that perturb distinct protein activities rather than grossly affecting folding and stability are relatively widespread (Sahni, 2015).

Around 15% of human protein-encoding genes have associated natural antisense transcripts (NATs) (Lapidot & Pilpel, 2006). Consequences include transcriptional interference, RNA masking and effects on methylation.

3' UTRs are involved in regulating gene expression at multiple levels: at the pre-mRNA level, 3' UTRs are involved in mRNA 3' end formation and polyadenylation whereas at the mature mRNA level, 3' UTRs determine such properties as mRNA stability/degradation, nuclear export, subcellular localization and translation efficiency.

Transcription factors (TF) and transcription factor binding sites (TFBS):

There is experimental evidence that transcription factor binding activity differs between humans; this is due to genetic variation within TF binding sites, and results in gene expression level changes (Kasowski, 2010). Variants that are disruptive because of mechanistic effects on transcription-factor binding (i.e. “motif-breakers”) show evidence of being selected against (Khurana, 2013).

Chromatin variability shows genetic inheritance in families; correlates with genetic variation and population divergence; and is associated with disruptions of transcription factor binding motifs (Kasowski, 2013).

Approximately 40% of eQTLs occur in open chromatin, and they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions (Gaffney, 2012). Active binding sites for immune-related TFs are among the most highly eQTL-enriched regions in the genome (Gaffney, 2012). Aside from altering the binding of TFs, eQTLs may also perturb gene regulation in more subtle ways - for example, by altering the intrinsic nucleosome preferences of the DNA (Segal, 2006). eQTLs may also act epigenetically by altering the pattern of DNA methylation, with resulting effects on gene expression (Bell, 2011), altering miRNA expression levels (miR-eQTLs) (Huan, 2015), or correlating with histone acetylation levels (haQTLs) (del Rosario, 2015).

Less than 10% of high probability causal SNPs alter a transcription factor binding site motif.

Epigenetic changes in chromatin architectures or DNA sequences relate to TF binding: H3K4me3, H3K9ac and H3K27ac contribute more in the regions near TSSs, whereas H3K4me1 and H3K79me2 dominate in the regions far from TSSs. DNA methylation plays a relatively more important role close to TSSs than in other regions. In addition, the results show that epigenetic modification models for the prediction of TF binding affinities are cell line-specific (Liu, 2015).

CpG methylation recruits sequence specific transcription factors essential for tissue specific gene expression (Chatterjee & Vinson, 2012). Direct and selective methylation of a certain TFBS that prevents TF binding is restricted to special cases and cannot be considered as a general regulatory mechanism of transcription (Medvedeva, 2014).

Binding differences were frequently associated with SNPs and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of TF binding variation (Kasowski, 2010).

With transcription factors, many disease-causing missense alleles (that leave protein-protein interactions intact) affect DNA binding (Sahni, 2015).

The sequence of events in gene expression regulation, most of which can be influenced by eQTLs, is not well established (Pai, 2015). The initial intuition that histone modifications regulate chromatin state, which in turn determines whether factors can bind to different sites, is now being replaced by the view that TF binding may be the central event that mediates changes in other regulatory mechanisms determining chromatin states, accessibility, and conformations. According to an intermediate model, a small number of particular TFs at a subset of their designated binding sites act as pioneer factors, and it is this pioneer activity that generates concerted changes in chromatin state, which are maintained by histone marks, DNA methylation, and nucleosome positioning. Chromatin areas that are accessible because of pioneer activity are available for binding by secondary factors (Pai, 2015).

Cohesin, CCCTC-binding factor (CTCF), and ZNF143 are key components of three-dimensional chromatin structure and instrumental in the distal chromatin state’s effect on gene transcription (Heidari, 2014). Cohesin forms a large ring capable of encircling two DNA molecules; it is recruited to active promoters, but associated with CTCF in the formation of insulator elements (Lee & Young, 2013). The conserved anchors of CTCF-CTCF loops are frequently mutated in cancer (Ji, 2016).

CTCF is a ubiquitously expressed transcription factor (Holwerda & de Laat, 2013). It was first described as a transcriptional repressor, but later on, it was also found to act as a transcriptional activator. It harbors insulator activity: when positioned in between an enhancer and gene promoter, it blocks their interaction and prevents transcriptional activation (Bell, 1999). The mammalian genome is covered with many CTCF binding sites (Schmidt, 2012). CTCF sites are extremely divergent, allowing a CTCF functional sequence to be present although not detectable by in silico means. More than most other transcription factors, CTCF appears to bind to intergenic sequences, often at a distance from TSSs (Kim, 2007; Phillips & Corces, 2009; Kim, 2015; Holwerda & de Laat, 2013). CTCF mediates chromatin looping between its binding sites (Splinter, 2006; Handoko, 2011). CTCF binding is CpG methylation-sensitive and thus involved in imprinting. Altogether, CTCF mediates intra- and inter-chromosomal contacts at several developmentally regulated genomic loci and, therefore, contributes to the global organization of chromatin architecture (Phillips & Corces, 2009; Zuin, 2014). CTCF ablation affects TAD organization by decreasing intra-domain and increasing inter-domain contacts (Zuin, 2014). CTCF motif orientation and looping show a strong correlation, and looping takes place in more than 90% of the cases in a convergent manner (Rao, 2014; Vietri Rudan, 2015).
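
The notion of convergent looping can be made concrete by classifying each loop according to the strands of the CTCF motifs at its two anchors. In the sketch below, a motif on the forward strand at the upstream anchor paired with a motif on the reverse strand at the downstream anchor is called convergent (the motifs point toward each other). The function name, the labels and the list of loops are hypothetical.

```python
def loop_orientation(upstream_strand, downstream_strand):
    """Classify a loop by the strands ('+' or '-') of its two anchor motifs."""
    pair = (upstream_strand, downstream_strand)
    if pair == ("+", "-"):
        return "convergent"  # motifs face each other across the loop
    if pair == ("-", "+"):
        return "divergent"   # motifs face away from each other
    if pair in (("+", "+"), ("-", "-")):
        return "tandem"      # motifs face the same direction
    raise ValueError(f"strands must be '+' or '-', got {pair!r}")


# Hypothetical loops given as (upstream anchor strand, downstream anchor strand).
toy_loops = [("+", "-"), ("+", "-"), ("+", "+"), ("-", "+"), ("+", "-")]
labels = [loop_orientation(up, down) for up, down in toy_loops]
convergent_fraction = labels.count("convergent") / len(labels)

print(labels, convergent_fraction)
```

Applied to real loop calls with annotated motif strands, the same tally would give the share of convergent loops that the cited studies report as exceeding 90%.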

Mehmet Tevfik Dorak, MD, PhD


11 June 2018