Genome Biology Notes for Genetic Epidemiologists
Protein-coding genes: Genes that code for peptides. The current human protein-coding gene number is around 20,000. See the latest statistics in GENCODE, HGNC, VEGA and NCBI. As of 2016, there are more non-coding RNA genes than protein-coding genes in the human genome.
Only 1% of the mammalian genome carries protein-coding potential, yet 70 to 90% is transcribed at some point during development to produce a large transcriptome of long noncoding RNA (lncRNA, defined as RNA > 100 nucleotides in length). Some estimate total membership to exceed 200,000, whereas others suggest fewer than 10,000 (5–8). The ENCODE project has revealed an enormous complexity, with ~10 isoforms overlapping any previously annotated genes, thereby challenging the traditional definition of a gene (8). Although there is now little doubt that pervasive transcription occurs, whether this activity is universally functional is unknown. These transcripts are often poorly conserved, unstable, and/or present in few copies (Lee, 2012).
ENCODE Consortium studies showed that for most GWAS associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the associated (lead) SNP rather than the associated (lead) SNP itself (Schaub, 2012).
SNPs reported in the GWAS catalog have, on average, a 5% chance of being the causal SNP (Farh, 2014).
GWAS hits are typically distant (median 14 kb) from causal SNPs, and many are not in tight LD (Farh, 2014).
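To make the relation between a lead SNP and a candidate causal SNP concrete, the following is a minimal sketch of how pairwise LD is commonly summarised as r². It assumes unphased genotype dosages (0, 1 or 2 copies of the alternate allele) and uses the squared Pearson correlation of dosages, which only approximates haplotype-based r². The function name and the toy genotypes are illustrative and do not come from the cited studies.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors.

    Assumption: dosages are coded 0/1/2 and both lists cover the same
    individuals in the same order.
    """
    if len(g1) != len(g2) or len(g1) < 2:
        raise ValueError("need two equally long vectors with >= 2 samples")
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("r2 is undefined for a monomorphic SNP")
    return (cov * cov) / (var1 * var2)


# Toy data (hypothetical): a lead SNP and a nearby candidate SNP.
lead_snp = [0, 1, 2, 1, 0, 2, 1, 0]
candidate_snp = [0, 1, 2, 1, 0, 2, 1, 1]
print(round(genotype_r2(lead_snp, candidate_snp), 3))
```

A value near 1 means the two SNPs are nearly interchangeable as association signals, which is why the lead SNP need not be the functional one.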
Only a small percentage (7%) of disease-associated SNPs is located in protein-coding regions, while the remaining 93% are located in gene regulatory regions or in intergenic regions.
Trait-associated SNPs are more likely to be eQTL (Dixon, 2007; Emilsson, 2008; Dimas, 2009; Nica, 2010; Nicolae, 2010; Montgomery, 2011).
Expression quantitative trait locus (eQTL) mapping offers a powerful approach to elucidate the genetic component underlying altered gene expression. Genetic variation can also influence gene expression through alterations in splicing, noncoding RNA expression and RNA stability.
eQTLs can also influence the expression levels of non-coding RNA genes. 75% of the SNPs affecting lincRNA expression (lincRNA cis-eQTLs) are specific to lincRNA alone and do not affect the expression of neighboring protein-coding genes. The specific genotype-lincRNA expression correlation is tissue-dependent.
~20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus (Washietl, 2014).
A prominent role for loci of non-coding RNAs (ncRNAs) and loci of pseudogenes in the regulation of expression of coding genes in humans has been shown in a large study of human peripheral mononuclear cells (Kirsten, 2015).
GWAS signals are enriched with expression quantitative trait loci (eQTLs) in a tissue-specific manner: 50% to 90% of eQTLs are tissue dependent.
44% of genes have cis-eQTLs (Westra, 2013). In peripheral mononuclear cells, there are eQTLs for about 85% of analysed genes, and 18% of genes are trans-regulated. Local eSNPs are enriched up to a distance of 5 Mb from the transcript, challenging typically implemented ranges of cis-regulation, and the nearest genes of GWAS SNPs might frequently be misleading functional candidates for the target genes. Dissection of co-localized functional elements indicated a prominent role of SNPs in loci of pseudogenes and non-coding RNAs for the regulation of coding genes (Kirsten, 2015).
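The cis/trans distinction is usually operational: a SNP-gene pair is called cis when the SNP lies within a fixed window of the gene on the same chromosome. The sketch below assumes a default window of 1 Mb around the transcription start site, which is a common convention rather than a value taken from these notes. The function name and coordinates are hypothetical. Widening the window to 5 Mb, as the enrichment reported by Kirsten (2015) might motivate, reclassifies some pairs.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_tss, cis_window=1_000_000):
    """Label a SNP-gene pair as 'cis' or 'trans'.

    Assumption: 'cis' means same chromosome and within `cis_window` base
    pairs of the gene's transcription start site; everything else is 'trans'.
    """
    if snp_chrom != gene_chrom:
        return "trans"
    if abs(snp_pos - gene_tss) <= cis_window:
        return "cis"
    return "trans"


# Hypothetical pair: SNP 3 Mb downstream of the TSS on the same chromosome.
print(classify_eqtl("6", 34_000_000, "6", 31_000_000))
print(classify_eqtl("6", 34_000_000, "6", 31_000_000, cis_window=5_000_000))
```

Because the label depends on the chosen window, cis and trans proportions from different studies are only comparable when their window definitions match.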
17.6% of all genes expressed in mononuclear blood cells were associated with a trans-eSNP and 83.2% of all genes were associated with a cis-eSNP. Conversely, 779 042 (29.7%) of all SNPs were associated with a gene in cis, whereas 38 034 (1.4%) were associated with a gene in trans. After pruning, these observations corresponded to 81 148 (28.4%) cis- and 3800 (1.3%) trans-acting SNPs, respectively. Note that the smallest identified effect sizes with study-wide significance are different for cis- and trans-eQTLs (0.4 and 1.3% explained variance of gene-expression levels, respectively) (Kirsten, 2015).
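The "explained variance" figures above can be read as the R² of a regression of expression on genotype. The following is a minimal sketch assuming a simple additive model (expression regressed on allele dosage by ordinary least squares, with no covariates). Real eQTL pipelines adjust for covariates and multiple testing, so this is an illustration, not the method of the cited study. All names and values are hypothetical.

```python
def explained_variance(dosages, expression):
    """Return (slope, r_squared) for expression ~ intercept + slope * dosage.

    Assumption: additive genetic model, one SNP, no covariates.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matching vectors with at least 3 samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("genotype has no variation")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    ss_res = sum((e - (intercept + slope * g)) ** 2
                 for g, e in zip(dosages, expression))
    ss_tot = sum((e - mean_e) ** 2 for e in expression)
    if ss_tot == 0:
        raise ValueError("expression has no variation")
    return slope, 1.0 - ss_res / ss_tot


# Toy data (hypothetical): noisy increase in expression per alternate allele.
dosages = [0, 0, 1, 1, 2, 2, 1, 0]
expression = [5.1, 4.8, 5.6, 5.9, 6.4, 6.1, 5.5, 5.0]
slope, r2 = explained_variance(dosages, expression)
print(f"slope={slope:.3f}, variance explained={100 * r2:.1f}%")
```

An effect explaining only about 1% of expression variance corresponds to an R² near 0.01, which requires a large sample to detect at study-wide significance.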
Conservation provides surprisingly little information for predicting eQTL location (Gaffney, 2012).
Only 27% of the distal regulatory elements have an interaction with the nearest promoter (Edwards, 2013), suggesting that the nearest gene is often not the target gene of a GWAS hit. Target genes of a GWAS hit may even be on a different chromosome.
Nearest genes of GWAS SNPs might frequently be misleading functional candidates as the target genes (Kirsten, 2015).
Genetic variants that modify chromatin accessibility and transcription factor binding are a major mechanism through which genetic variation leads to gene expression differences among humans (Degner, 2012): As many as 55% of eQTL SNPs are also ‘DNase I sensitivity quantitative trait loci’ (dsQTLs).
There is a complex landscape of long-range gene-element connectivity across ranges of hundreds of kb to several Mb, including interactions among unrelated genes. The 5C results showed that 50 to 60% of long range interactions occur with a high degree of tissue specificity (ENCODE, Nature 2012). Topologically associated domains (TADs) represent matrices of contacts among regulatory regions, and interactions within TADs are different between cell types and are influenced by cellular differentiation as well as environmental conditions.
Non-coding region mutations are also relevant in somatic mutations in cancer cells. Many recurrent somatic cancer variants occur in noncoding regulatory regions and thus might indicate mutations that drive cancer (Khurana, 2013).
The majority of disease-associated missense variants exhibit wild-type chaperone binding profiles, suggesting they preserve protein folding or stability. While common variants from healthy individuals rarely affect interactions, two-thirds of disease-associated alleles perturb protein-protein interactions, with half corresponding to “edgetic” alleles affecting only a subset of interactions while leaving most other interactions unperturbed (Sahni et al, 2015). With transcription factors, many alleles that leave protein-protein interactions intact affect DNA binding. Different mutations in the same gene may lead to different interaction profiles resulting in distinct disease phenotypes. Thus disease-associated alleles that perturb distinct protein activities rather than grossly affecting folding and stability are relatively widespread (Sahni et al, 2015).
Around 15% of human protein-encoding genes have an associated natural antisense transcript (NAT) (Lapidot & Pilpel, 2006). Consequences include transcriptional interference, RNA masking and effects on methylation.
Transcription factors (TF) and transcription factor binding sites (TFBS):
There is experimental evidence that transcription factor binding activity differs between humans, that this is due to genetic variation within TF binding sites, and that it results in gene expression level changes (Kasowski, 2010). Variants that are disruptive through mechanistic effects on transcription-factor binding (i.e. “motif-breakers”) show evidence of negative selection (Khurana, 2013).
Chromatin variability shows genetic inheritance in families; correlates with genetic variation and population divergence; and is associated with disruptions of transcription factor binding motifs (Kasowski, 2013).
Approximately 40% of eQTLs occur in open chromatin, and they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions (Gaffney, 2012). Active binding sites for immune-related TFs are among the most highly eQTL-enriched regions in the genome (Gaffney, 2012). Aside from altering the binding of TFs, eQTLs may also perturb gene regulation in more subtle ways, for example by altering the intrinsic nucleosome preferences of the DNA (Gaffney, 2012). eQTNs may also act epigenetically by altering the pattern of DNA methylation, with resulting effects on gene expression (Gaffney, 2012), by altering miRNA expression levels (miR-eQTLs) (Huan, 2015), or by correlating with histone acetylation levels (haQTLs) (del Rosario, 2015).
Less than 10% of high probability causal SNPs alter a transcription factor binding site motif.
Epigenetic changes in chromatin architectures or DNA sequences relate to TF binding: H3K4me3, H3K9ac and H3K27ac contribute more in the regions near TSSs, whereas H3K4me1 and H3K79me2 dominate in the regions far from TSSs. DNA methylation plays a relatively more important role close to TSSs than in other regions. In addition, the results show that epigenetic modification models for the predictions of TF binding affinities are cell line-specific (Liu, 2015).
CpG methylation recruits sequence specific transcription factors essential for tissue specific gene expression (Chatterjee & Vinson, 2012). Direct and selective methylation of certain TFBS that prevents TF binding is restricted to special cases and cannot be considered as a general regulatory mechanism of transcription (Medvedeva, 2014).
Binding differences were frequently associated with single-nucleotide polymorphisms and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of binding variation (Kasowski, 2010).
With transcription factors, many disease-causing missense alleles (that leave protein-protein interactions intact) affect DNA binding (Sahni et al, 2015).
Evolutionarily conserved regions can often provide valuable information on the location of regulatory elements (Gaffney, 2012) although Lee et al reported somewhat different results (Lee, 2014). Conservation is strongly negatively correlated with distance from the transcription start site (TSS).
A single miRNA can potentially regulate hundreds of different genes, typically with modest effects of 2-fold or less.
miRNA effects may be cell type- and biological context-dependent (example: miR-155-dependent regulation of SOCS1 in different immune cell subsets) (Lu LF, 2015).
A gene may be involved in disease pathogenesis even if none of its SNPs show any association: an intricate transcriptional connectivity between Crohn disease (CD) susceptibility genes and interferon-γ, a key effector in CD, has been described despite the absence of CD-associated SNPs in the IFNG locus (Kumar, 2015).
The phenotypic consequences of expression quantitative trait loci (eQTLs) are presumably due to their effects on protein expression levels. Yet the impact of genetic variation, including eQTLs, on protein levels remains poorly understood. To address this, we mapped genetic variants that are associated with eQTLs, ribosome occupancy (rQTLs), or protein abundance (pQTLs). We found that most QTLs are associated with transcript expression levels, with consequent effects on ribosome and protein levels. However, eQTLs tend to have significantly reduced effect sizes on protein levels, which suggests that their potential impact on downstream phenotypes is often attenuated or buffered. Additionally, we identified a class of cis QTLs that affect protein abundance with little or no effect on messenger RNA or ribosome levels, which suggests that they may arise from differences in posttranslational regulation (Battle, 2015).
Tissue specificity of enhancers and nearby gene expression increase with enhancer length. Neighborhoods containing stretch enhancers are enriched for important cell type-specific genes, and GWAS variants associated with traits relevant to a particular cell type are more enriched in stretch enhancers compared with short enhancers (Parker, 2013).
Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels (Banovich, 2014).
The majority of transgenerational similarity in DNA methylation is attributable to genetic effects of cis meQTLs. Approximately 20% of individual differences in DNA methylation in the population are caused by DNA sequence variation that is not located within CpG sites (McRae, 2014).
Of the ~20,500 protein-coding genes in humans, 7.5% are directly involved in RNA metabolism by binding to and/or processing RNA, or by constituting essential components of RNPs (Gerstberger, Hafner & Tuschl, 2014).
CTCF is a ubiquitously expressed and essential protein, and is, in many ways, an exceptional transcription factor. It was first described as a transcriptional repressor, but was also found to act as a transcriptional activator. Most strikingly, it harbours insulator activity: when positioned in between an enhancer and gene promoter, it can block their communication and prevent transcriptional activation (Holwerda & de Laat, 2015). Systematic chromatin immunoprecipitation experiments combined with high-throughput sequencing (ChIP-seq) have been performed to map CTCF binding events across the genome in many tissues of different species. They show that the genome is covered with a myriad of CTCF binding sites. More than most other transcription factors, CTCF appears to bind to intergenic sequences, often at a distance from the transcriptional start site (TSS). CTCF was one of the first proteins demonstrated to mediate chromatin looping between its binding sites. Further evidence for its role in the organization of genome structure comes from observations that it frequently binds to boundaries between chromosomal regions that occupy distinct locations in the nucleus, to boundaries between regions with different epigenetic signatures and/or different transcriptional activities, and to boundaries between recently identified topological domains, which are spatially defined chromosomal units within which sequences preferentially interact with each other.
CTCF is a highly conserved zinc finger protein implicated in diverse regulatory functions, including transcriptional activation/repression, insulation, imprinting, and X chromosome inactivation. Here we re-evaluate data supporting these roles in the context of mechanistic insights provided by recent genome-wide studies and highlight evidence for CTCF-mediated intra- and inter-chromosomal contacts at several developmentally regulated genomic loci. These analyses support a primary role for CTCF in the global organization of chromatin architecture and suggest that CTCF may be a heritable component of an epigenetic system regulating the interplay between DNA methylation, higher-order chromatin structure, and lineage-specific gene expression (Phillips & Corces, 2009).
26 March 2016