GENE EXPRESSION
M.Tevfik Dorak, MD, PhD
PowerPoint Presentation on Gene Expression
The Central
Dogma of molecular biology states that RNA is made using DNA
templates and that protein is made using the information in RNA (genome encodes
transcriptome which is translated to proteome. See a List of OMEs).
Although this sounds like stating the obvious, it was an important principle
when discovered. An exception is the reverse transcription, which allows DNA to
be made from RNA (another exception may be the prions). The
flow from DNA to protein involves transcription and translation with possible
modifications after each step. Expressed genes include those that are transcribed
and translated all the way to proteins, and those that are transcribed into RNA
but not translated into protein. Transfer
and ribosomal RNAs, XIST
and H19
genes are transcribed but not translated to a protein product (Joubel,
1996; Milligan,
2002). Thus, gene expression should not be described as conversion of
genomic information to protein sequences. Another classic concept in
genetics is the principle of one gene-one peptide. This also had to be revised
because it is possible that one gene can produce a variety of products.
Summary of gene transcription and
translation
Genes are transcribed in the 5' to
3' direction of the sense/coding strand
by RNA polymerase II. It is actually the antisense
or template strand that is
read (3' to 5'), which gives an RNA strand identical in sequence to the sense strand. The
initial product of transcription is RNA, more precisely heterogeneous nuclear
RNA (hnRNA). The four bases in the RNA polynucleotide chain are A, G, C,
and U (uracil, which is demethylated thymine, replaces thymine). Base pairings
in RNA are usually A-U and G-C, but weaker G-U pairings can be identified
occasionally. The RNA copy of the DNA that is to become a polypeptide is called
messenger RNA (mRNA). The mRNA sequence is identical to
the coding (sense) strand of DNA, except that U replaces T. Transcription starts at
the promoter region of the gene. The promoter is recognized by transcription
factors (DNA-binding proteins) that control the rate and degree of
transcription. The information for promoter function is provided directly by
DNA sequence, and its structure represents the actual signal. Many promoters
have a TATA box located approximately 25-35 base pairs upstream of the
transcription initiation site. The TATA box tends to be surrounded by
GC–rich sequences. The fixed position of the TATA box is critical to the
positioning of the RNA polymerase. Another conserved sequence approximately 70
bases upstream of the transcription initiation site is the CAAT box. Other
upstream elements are known to modify the rate of transcription. From the
promoter region, the RNA polymerase enzyme travels along the template,
synthesising RNA, until shortly after a transcription termination sequence (which is distinct from the stop codon that ends translation). For
a short stretch, as the RNA polymerase moves along the unwinding DNA molecule,
the DNA is in a single-stranded conformation. As soon as the
enzyme has passed by, the DNA duplex reforms. In this fashion, the
integrity of the DNA molecule is maintained during transcription. The nucleic
acid sequence of the gene includes not only the coding sequence (CDS or exons),
which corresponds to the amino acid sequence of the protein, but also
additional intervening (non-coding) sequences (introns). The exons are the
regions of the gene represented in the processed mRNA and introns are regions
that are spliced out of the processed mRNA. This processing of the gene and RNA
occurs in the nucleus. In addition to splicing out of the introns, a series of
up to 200 adenine residues is added to the 3' end of the mRNA molecule
(polyadenylation). The processed mRNA is then transported to the cytoplasm and
translated into protein by the ribosomes. Each ribosome consists of two
subunits, which contain several proteins in association with a long RNA
molecule known as ribosomal RNA. The generation of the polypeptide is mediated
by yet another small RNA species, transfer RNA (tRNA). A tRNA is able to
recognize only the one amino acid to which it is covalently linked. The tRNA
contains a trinucleotide sequence, the anticodon, which is complementary to the
codon that represents the amino acid. The anticodon allows the tRNA to recognise
the codon through complementary base pairing. The ribosome moves along the
mRNA, permitting sequential amino acids to be assembled into protein. The order
of the amino acids in a protein confers its ultimate three-dimensional
conformation and its biological and biochemical activities. Alteration of the
amino acid sequence due to missense nucleotide substitution changes these
characteristics.
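The flow from template strand to mRNA to polypeptide summarised above can be sketched in a few lines of Python. This is a minimal illustration rather than a bioinformatics tool: the function names transcribe and translate, the toy template sequence and the deliberately partial codon table are assumptions made for this example, and RNA processing (capping, splicing, polyadenylation) is ignored.

```python
# DNA template base -> complementary RNA base (U pairs with A in RNA).
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Deliberately partial codon table: only the codons used in the toy example.
CODON_TABLE = {
    "AUG": "Met", "GCU": "Ala", "AAA": "Lys", "GGC": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') copied from a template strand written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)

def translate(mrna):
    """Translate from the first AUG to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "Xaa")  # Xaa: codon not in toy table
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide

# Toy template strand written 3'->5'; the coding strand is 5'-ATGGCTAAAGGCTAA-3'.
template = "TACCGATTTCCGATT"
mrna = transcribe(template)
print(mrna)
print(translate(mrna))
```

The mRNA produced from the template has the same sequence as the coding strand with U in place of T, and translation of the toy message stops at the UAA codon without adding a residue for it.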
Regulation of gene expression involves the following steps:
Transcriptional control: how and when a gene is
transcribed (including epigenetic control)
RNA processing control: how the RNA transcript is
processed (including small nuclear ribonucleoproteins, snRNPs)
RNA transport control: which mRNA transcripts are
exported to the cytoplasm, and where they reach in the cytoplasm
mRNA stability
control: selective degradation of some mRNAs (including degradation guided by
‘small interfering’ RNAs, siRNAs)
Translational control: selecting which mRNAs in the
cytoplasm are translated by ribosomes (including regulation by microRNAs)
Post-translational control: selectively activating, inactivating,
degrading or compartmentalizing specific proteins.
It is
possible that a gene is encoded on the sense strand and another one on the
anti-sense strand in opposite direction (overlapping genes; for an
example, see CYP21A2
and TNXB;
Tee,
1995). Overlapping genes are particularly frequent in viruses,
plasmids and phages, which need to pack a lot of information into a small, compact
genome (HIV is an example). Degeneracy of the genetic code facilitates the
presence of overlapping genes. See also Michael W. Pfaffl’s Page
for an excellent review of gene structure & expression as well as molecular
genetic methods to quantitate gene expression, including microarray methods
supplemented by videos.
Components of genes playing roles in their expression
In
molecular terms, a gene is the entire DNA sequence required for synthesis of a
functional protein or RNA. In addition to the coding and intervening sequences, a gene
also includes a transcription-control region. Untranslated regions, exons and
introns together make up a complete gene.
Enhancers: An
enhancer is a cis-acting regulatory element on either 5’ or 3’
flanking region of a gene that stimulates a specific promoter. It is not
transcribed. Enhancers are usually at some distance even from the distal
promoter region. In some cases, distant enhancers are important for
developmental expression (eg, globins), cell-specific expression (eg,
immunoglobulins), or hormonal regulation (eg, glucocorticoid response
elements). Enhancers are binding sites for transcriptional activators
(activating transcription factors)
such as AP-1, AP-2, Oct-1, GATA-1, P53, NF-kB. Certain genes have silencers that do the
opposite of enhancers.
Promoters: The
region in the close vicinity of the transcription initiation site at the 5' end
of a gene. The promoter region is not transcribed. Promoters are the initial
binding sites for RNA polymerases. Transcription factors bind to the
promoters and allow RNA polymerase to act. The proximal region of the
promoter is generally packed with short (5-15 nucleotides) motifs called transcription factor binding sites
(promoter-proximal elements). The proximal promoter region contains sites for
ubiquitous proteins such as Sp-1, CAAT box/enhancer-binding protein (C/EBP),
cAMP response element-binding protein (CREB), or Activator Protein-1 (AP-1).
Factors involved in cell-specific expression may also bind to these sites.
The
minimal promoter region that is capable of initiating basal transcription is
called the core promoter. The common promoter elements, the TATA and CAAT
boxes, are found about 30 bp and 70 bp, respectively, upstream of the transcription
initiation site. The TATA box is the most conserved functional
signal in eukaryotic promoters. It is present in 30-50% of promoters. Many
highly expressed genes contain a strong TATA box in their promoters (TATA+). TATA-less
(TATA-) promoters usually belong to housekeeping genes, some oncogenes and
growth factors. TATA- genes usually have a downstream promoter element (DPE)
located approximately 30 bp downstream of the transcription initiation site. This
group of promoters is much harder to identify with bioinformatic tools than
the TATA+ group. The region 200-300 bp immediately upstream of the core
promoter is the proximal promoter full of multiple transcription factor binding
sites. Further upstream is the distal promoter region that usually contains enhancers and
some transcription factor binding sites. It is the distal promoter region that
is most difficult to identify through computational methods because of lack of
standard features and high variability (reviews and reports on promoter
prediction: Pedersen, 1999; Werner, 1999; Davuluri, 2001; Ohler & Niemann,
2001; Suzuki, 2001; Hannenhalli & Levy, 2001; Solovyev & Shahmuradov, 2003;
see also Bioinformatics approaches and resources for SNP functional analysis
by Mooney, 2005).
Nucleotide
mutations outside the protein coding regions but in regions that influence gene
expression are called regulatory mutations (Knight,
2005). These are usually promoter region mutations and cause variations in
gene expression levels (regulatory SNP -rSNP- finding tool: rSNP Guide).
Mutations in the flanking sequences of insulin and collagen type I genes, for
example, contribute to susceptibility to type I diabetes mellitus and osteoporosis, respectively. Some
promoters contain neither TATA boxes nor other alternative promoter elements
and transcription is often initiated at multiple sites over a defined
region. These promoters are often
characterized by having a CG-rich stretch of nucleotides (GC box) within 100bp
of the initiation site (see Ioshikhes & Zhang, 2000; Antequera, 2003; and
Htf islands in glossary).
Approximately 50-70% of genes contain GC boxes in their promoters and this
feature can be used to predict promoter sequences via bioinformatics methods
(Ponger & Mouchiroud, 2002; Wang & Leung, 2004). All known regulatory units
(promoters, enhancers, silencers) have been compiled in a single database
called Transcription Regulatory Regions Database (TRRD; Kolchanov, 2002).
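The promoter elements described above (a TATA box about 30 bp upstream, a CAAT box about 70 bp upstream and GC boxes within 100 bp of the initiation site) can be located with a simple pattern search. The following Python sketch is only an illustration: the function name scan_promoter, the simplified motif patterns and the toy upstream sequence are assumptions made for this example, and real promoter prediction tools rely on weight matrices and additional features rather than exact matches on one strand.

```python
import re

# Simplified consensus patterns (assumptions for illustration only).
MOTIFS = {
    "TATA box": "TATA[AT]A",
    "CAAT box": "CCAAT",
    "GC box": "GGGCGG",
}

def scan_promoter(upstream, motifs=MOTIFS):
    """Scan the coding-strand sequence that ends at position -1.

    Returns (motif name, position of first base relative to the
    initiation site, matched sequence), sorted by position.
    """
    hits = []
    n = len(upstream)
    for name, pattern in motifs.items():
        # A lookahead is used so that overlapping matches are also reported.
        for m in re.finditer("(?=(%s))" % pattern, upstream):
            hits.append((name, m.start() - n, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])

def filler(k):
    """Motif-free spacer of length k for the toy sequence."""
    return ("ACGT" * k)[:k]

# Toy 100-bp upstream region with the three elements placed by hand.
upstream = (filler(5) + "GGGCGG" + filler(19) + "CCAAT"
            + filler(35) + "TATAAA" + filler(24))

for hit in scan_promoter(upstream):
    print(hit)
```

In the toy sequence the three elements are reported at -95, -70 and -30. A TATA-less promoter would simply return no TATA hit, which is one reason such promoters are harder to recognise computationally.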
Promoters
function to orient RNA polymerase so that the correct strand of DNA is used as
the template for transcription. In principle, any region of the DNA double
helix could be copied into two different RNA molecules. The promoter of a gene
determines which of the two strands is copied. A promoter is an oriented DNA
sequence that points the RNA polymerase in one direction, and this orientation
determines which DNA strand is copied. The anatomy of
a promoter is usually defined by a combination of gene transfer experiments to
assess the effects of promoter mutants and studies of protein–DNA
interactions such as DNase I footprinting and the electrophoretic gel mobility
shift assay (EMSA) (see below).
Eukaryotic
RNA Polymerases (RNAP): As opposed to a single bacterial RNAP,
eukaryotes have three different RNAPs (I, II and III), each with its
own promoters. They add one nucleotide at a time to the 3' end of the growing
RNA molecule, using ribonucleoside triphosphates (rNTPs) as substrates
(this reaction releases pyrophosphate). RNAP I is dedicated to the synthesis
of only one type of RNA molecule (pre-rRNA; rRNA is encoded by tandem repeats
of genes in the nucleolus). RNAP III produces small RNAs such as tRNAs and 5S rRNA. RNAP II is the
enzyme principally involved in the transcription of genes from DNA sequence
into RNA. Unlike DNA polymerase, however, RNA polymerase does not require a
preformed primer to initiate the synthesis of RNA. Instead, RNAP is recruited
and oriented by promoters. Unlike RNAP I and III, RNAP II recognises many
thousands of promoters. Many of these promoters have TATA and CAAT boxes.
Promoters of genes for housekeeping proteins typically contain multiple copies of a
GC-rich element that includes the sequence 5'-GGGCGG-3' (GC box). Transcription
by RNAP II is also affected by more distant enhancers.
The
immediate product of transcription is heterogeneous nuclear RNA (hnRNA), which
represents the primary transcripts of RNA polymerase II. hnRNA
has a very wide range of sizes (2-40 kb). The RNA copy of the DNA that is to
become protein is the processed product of hnRNA termed messenger RNA (mRNA).
Not all transcribed RNA is destined to arrive in the cytoplasm as mRNA.
Modifications of primary RNA transcripts include splicing, cleavage, base
modification, capping and the addition of poly-A tails. During RNA processing introns are spliced out
but sometimes splicing out also involves exons. This is called alternative
splicing and results in different protein products (isoforms) from the same
DNA sequence (more than one possible exon assembly). Alternative
splicing (along with overlapping genes and other mechanisms) is one reason why fewer
genes were identified in the human genome than had been estimated.
Around 98% of all transcriptional output in humans is
non-coding RNA. RNA-mediated gene regulation is widespread in higher
eukaryotes and complex genetic phenomena like RNA interference,
co-suppression, transgene silencing, imprinting and methylation
involve some form of RNA signalling. It is possible that intronic and other
non-coding RNAs have evolved to comprise a second tier of gene
expression in eukaryotes (Mattick,
2001). Trans-acting RNAs may relay information required for the
coordination and modulation of gene expression via chromatin
remodelling, RNA–DNA, RNA–RNA and RNA–protein
interactions.
A first
step in the activation of transcription is the decondensation of chromatin
surrounding a gene so that transcription factors and RNA polymerase can
gain access to DNA. Conversely, genes
are often turned off by the condensation of chromatin surrounding a particular
gene, and the inhibition of RNA polymerase and transcription factor binding. A
large number of DNA binding proteins, collectively called transcription
factors, regulate cell and tissue specificity of gene expression through
their influence on RNA polymerase II-mediated transcription. The sequences to
which transcription factors bind are called response elements.
There are different classes of transcription factors. The general transcription
factors include proteins that are involved in the assembly of the basal
transcription apparatus. The sequence specific transcription factors include
proteins that bind to DNA regulatory elements in the enhancers or promoters of
genes. There are also coactivators and corepressors that bind to other
transcription factors and regulate transcription by altering chromatin
structure or by making contacts with the basal transcriptional machinery. A
transcription factor database on the Internet (TransFac;
Wingender,
2000) contains a comprehensive list of transcription factors and allows a
promoter region to be searched for recognition sites for sequence-specific
transcription factors. Another related internet resource is TFSearch which helps to find transcription factor binding sites in a sequence.
Most transcription factors share motif structures such
as zinc finger, helix-turn-helix (HTH), helix-loop-helix (HLH), leucine zipper,
homeodomain and POU domain (see also Structural
Motifs in Eukaryotic Transcription Factors).
An
ever-increasing number of diseases are attributed to mutations in transcription
factor genes. In general, germline mutations in transcription factor genes
result in malformation syndromes and somatic mutations involving many of the
same genes contribute to carcinogenesis. Some of the better known transcription
factors are: c-JUN, c-MYC, CBP, CREB, E2F1, STAT, SRY, PIT-1, ETS-1, RUNX-2/AML-3,
GATA-1-6,
PBX2,
RFXAP
and nuclear factor
kappa-B (NFKB). Coactivators and corepressors also take part in
regulation of the activity of transcription factors. Diseases caused by
mutations in transcription factors include Rubinstein-Taybi
syndrome, vitamin
D-resistant rickets type IIA, bare
lymphocyte syndrome type II, acute
lymphoblastic leukemia, acute myeloid
leukemia and Down’s
syndrome-associated acute megakaryoblastic leukemia. Transcription
factors linked to the cell cycle control, RB1
and P53,
play major roles in neoplastic development. They are involved in retinoblastoma,
osteogenic
sarcoma and Li-Fraumeni
syndrome and other familial cancer syndromes (Levine,
1991). Changes in transcription factor activity are
also associated with neoplastic diseases. In many cancers, translocations
combine different transcription factor domains or bring a transcriptional
activation region under the control of a heterologous gene. One example is the t(1;19)(q23;p13) translocation in pre–B-cell acute
lymphoblastic leukaemia, which fuses the PBX1 homeodomain gene to the
transactivation domain of the E2A transcription factor gene.
Transcription factors also
regulate gene expression by their effect on histone acetylation.
Histone acetylation may create a more open chromatin configuration that allows
other transcription factors to gain access to the regulatory regions of a gene
and increases gene transcription. The transcriptional coactivator, CREB-binding
protein (CBP), possesses intrinsic histone acetyltransferase (HAT) activity.
Other transcription factors (often repressors) recruit histone deacetylases.
Transcription initiation (cap)
site: This is where the transcription of DNA to immature
(precursor) pre-mRNA starts. It is the 5' end of the transcribed sequence; the
beginning of the first exon. A 7-methylguanosine (m7G) cap is added to
the 5' end of the transcript (to protect it against the activity of
5'-exonucleases). From here to the translation initiation site, the
sequence corresponds to the 5'-untranslated region (UTR) (also
called ribosome-binding region). The
5'-UTR is not featured in the final protein product. It contains the site (the leader
sequence) at which ribosomes initially bind to mRNA to start translation.
The signal sequence is encoded in the first or second exon and
translated at the N-terminal. It provides the signal for correct cellular
location (endoplasmic reticulum, Golgi apparatus, cell membrane, etc) or
for export outside the cell through the cell membrane, and is finally cleaved off by a
signal peptidase. Mutations interfering with cleavage of the signal peptide
result in human diseases such as factor X
deficiency (due to a substitution of arginine for glycine at position -20,
numbering the alanine at the NH2-terminus of the mature protein as +1) and autosomal
recessive hypoparathyroidism (due to a mutation replacing serine with
proline at position -3 in the signal peptide of the prepro-parathyroid hormone
gene).
The 5’ UTRs of most mRNAs
contain a consensus sequence of 5’-CC(A/G)CCAUG-3’ (the Kozak sequence) involved in the
initiation of protein synthesis. The events that occur to mRNA before it leaves
the nucleus are collectively called RNA
processing or post-transcriptional
modification (capping, cleavage, base modification,
polyadenylation and splicing). See a review on computational detection and
location of transcription start sites by Down & Hubbard, 2002.
Translation
initiation site (ATG / AUG): This sequence represents the beginning (N-terminal)
of translated proteins. (5' of DNA codes for N-terminal of a polypeptide.) It
codes for a methionine but methionine is frequently subject to
post-translational elimination. Thus, each mature mRNA's first codon is for
methionine (AUG) but not all polypeptides start with methionine. Translation
takes place on ribosomes in the cytoplasm. According to the official Human
Gene Nomenclature rules the 'A' nucleotide of the ATG is nucleotide number +1,
and all other sequence variation should be numbered using this nucleotide as
reference. The nucleotide 5' to +1 is numbered -1; there is no base 0 (see Nomenclature
Page). The sequences prior to the translation initiation site comprise
the 5' untranslated region (5' UTR) and the most 5' base within the 5' UTR
constitutes the transcriptional start site (designated as +1 in the separate, transcription-based numbering used for promoter elements). The 5' UTR
varies greatly in length in different genes, and it may contain several exons.
For this reason, it can be challenging to identify the location of the
promoter, even when the coding region of the gene has been found.
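The numbering rule described above (the A of the ATG is +1, the preceding base is -1, and there is no base 0) is easy to get wrong by one, so a small worked sketch may help. The function name coding_position and the toy sequence are assumptions for illustration; the sketch ignores introns and other refinements of the official nomenclature.

```python
def coding_position(index, atg_index):
    """Convert a 0-based sequence index to a position relative to the ATG.

    The A of the ATG is +1 and the base immediately 5' of it is -1,
    so position 0 is never produced.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset

# Toy sequence: six bases of 5' UTR followed by the start of the coding region.
sequence = "GGCACCATGGCT"
atg_index = sequence.find("ATG")

for i, base in enumerate(sequence):
    print(base, coding_position(i, atg_index))
```

The printed positions run from -6 to -1 across the 5' UTR and then from +1 onwards, starting at the A of the ATG.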
Exon-intron boundaries: Each intron starts with
GpT (5' splice site / donor) and ends with ApG (3' splice site / acceptor). The
introns are subject to splicing out during post-transcriptional modification.
The highly conserved intronic 5'GT and 3'AG sequences are essential for correct
splicing. These are called splicing sites and mutations of these nucleotides
cause aberrant splicing. U1 and U2 small nuclear RNAs bind to
the conserved splicing sites in introns to mediate splicing out. A
number of diseases are caused by splicing site mutations, one of which is congenital
adrenal hyperplasia (CAH) caused by an intron 2 splicing site mutation (Higashi,
1988) (many other mutations also cause CAH). Many cases of beta
thalassaemia are caused by splicing site mutations. An autosomal
dominant form of isolated
growth hormone deficiency is caused by mutations in intron 3 of the GH1
gene that cause exon 3 skipping, resulting in truncated products of the GH1
gene that prevent secretion of normal GH. Although the introns are not
represented in the resultant polypeptide, they may contain some regulatory
sequences. An example of this is the intron 35 of the MHC class III gene C4
(complement component 4), which contains promoter activity for the gene lying
next to it, CYP21A2 (21-hydroxylase).
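The GT...AG rule for intron boundaries can be checked mechanically. The sketch below assumes a common display convention (exons in upper case, introns in lower case); the function name splice and the toy sequences are hypothetical. It only flags introns whose boundaries deviate from the canonical dinucleotides and does not attempt to model the aberrant splicing that such a mutation would cause in a cell.

```python
import re

def splice(gene):
    """Remove introns (lower case) from a gene whose exons are in upper case.

    Returns the spliced sequence and a list of warnings for introns that
    lack the canonical GT (donor) or AG (acceptor) dinucleotides.
    """
    warnings = []
    for number, match in enumerate(re.finditer(r"[a-z]+", gene), start=1):
        intron = match.group()
        if not intron.startswith("gt"):
            warnings.append("intron %d: donor site is %s, expected GT"
                            % (number, intron[:2].upper()))
        if not intron.endswith("ag"):
            warnings.append("intron %d: acceptor site is %s, expected AG"
                            % (number, intron[-2:].upper()))
    spliced = re.sub(r"[a-z]+", "", gene)
    return spliced, warnings

normal = "ATGGCG" + "gtaagtccttttcag" + "GTTTAA"
mutant = "ATGGCG" + "ataagtccttttcag" + "GTTTAA"  # G>A at the first intronic base

print(splice(normal))
print(splice(mutant))
```

For the normal toy gene the warning list is empty; for the mutant the donor site is reported as AT instead of GT.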
Stop codon: One of the
three stop codons (UAA, UGA and UAG) provides the termination signal for
translation. (The presence of three different stop codons is an example
of degeneracy of the genetic code.) The triplet before the stop codon codes for the
last amino acid of a polypeptide chain (C-terminal). (3' of DNA codes for
C-terminal of a polypeptide.)
Polyadenylation signal: The
polyA signal (usually 5’-AAUAAA-3’) is after (downstream to) the
stop codon (within the 3’ UTR) and signals for the addition of a poly-A
tail that varies in length as a post-transcriptional modification on mRNA. It
is located 10-30 nucleotides upstream of the 3’ cleavage site. The poly(A) tail is believed to stimulate translation initiation
whereas its shortening triggers entry of mRNA into the decay pathway. Besides
its functional importance in maturation of mRNA as well as its transport,
degradation and translation, the poly-A tail is also exploited for preferential
extraction of mRNA from total RNA. This approach, however, requires attention
because only about 90% of mRNAs have poly-A tails (histone mRNAs, for example, lack them).
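The positional rule for the polyadenylation signal can be written as a small search. In this sketch the function name polya_signals and the toy 3' UTR are assumptions, and the distance is measured from the end of the AAUAAA hexamer to the cleavage site, which is one of several possible ways of counting the 10-30 nucleotides mentioned above.

```python
SIGNAL = "AAUAAA"

def polya_signals(utr, cleavage_site, min_gap=10, max_gap=30):
    """Find AAUAAA hexamers lying min_gap..max_gap nt upstream of the cleavage site.

    utr is an RNA string; cleavage_site is the 0-based index of the first
    nucleotide after the cut. Returns (start index, gap) for each hit.
    """
    hits = []
    start = utr.find(SIGNAL)
    while start != -1:
        gap = cleavage_site - (start + len(SIGNAL))
        if min_gap <= gap <= max_gap:
            hits.append((start, gap))
        start = utr.find(SIGNAL, start + 1)  # step by one to allow overlaps
    return hits

# Toy 3' UTR: the signal is followed by 18 nt before the cleavage site.
utr = "GCUGCU" + SIGNAL + "GCU" * 6
print(polya_signals(utr, cleavage_site=len(utr)))
```

A hexamer that lies too close to or too far from the cleavage site is ignored, which mirrors the positional constraint described in the text.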
Untranslated Regions (UTRs): Transcription
often terminates 0.5 - 2 kb downstream of the poly-A signal, at a position determined by
transcription termination signals. The sequence between the translation termination signal (stop codon) and the 3' end of the
mature transcript is called the 3' UTR. The 5' UTR usually contains gene- or developmental
stage-specific and common regulators of expression (motifs, boxes, response or
binding elements), and 3' UTR is also involved in gene expression although it
does not contain well-known transcription control sites. 3' UTR sequences
(called cytoplasmic polyadenylation elements or adenylation control elements)
can control the nuclear export, polyadenylation status, subcellular targeting,
rates of translation and degradation of mRNA (see Decker
& Parker, 1995 and Pesole, 2001 for
reviews). Many mRNAs with short half-lives contain a
50-nucleotide AU-rich sequence (reiterations of AUUUA) in 3’ UTR (AU-rich
elements; 3'AURE). Removal or alteration of this sequence prolongs the
half-life of mRNA. The presence of this sequence may be a feature of genes that
can rapidly alter their expression level. 3'AURE is found
in the genes encoding cytokines, adhesion molecules, and protooncogenes and may
be a marker of mRNAs that are inducible by environmental stressors
(Asson-Batres,
1994). The involvement of 3' UTR is well
documented in controlling male and female gametogenesis and in early embryonic
development. Myotonic
dystrophy is a disease caused by the expansion of the triplet
repeats in the 3' UTR of a protein kinase gene, DMPK. Genetic variation in the 3'UTR of NF1 affects
expression levels (Cowley,
1998).
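Because AU-rich elements consist of reiterated AUUUA pentamers that can share their terminal A, counting them requires an overlapping search. The sketch below is an illustration under that assumption; the function name count_auuua and the toy 3' UTR are hypothetical, and a pentamer count alone is not a validated predictor of mRNA half-life.

```python
def count_auuua(utr):
    """Count AUUUA pentamers in an RNA string, allowing overlaps."""
    count = 0
    start = utr.find("AUUUA")
    while start != -1:
        count += 1
        start = utr.find("AUUUA", start + 1)
    return count

utr = "GCAUUUAUUUAUUUAGC"
print(count_auuua(utr))    # overlapping count
print(utr.count("AUUUA"))  # str.count skips overlapping copies
```

In the toy sequence the overlapping search finds three pentamers, whereas the built-in str.count reports only two because it does not count overlapping occurrences.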
A difference in phenotype that is
dependent on the position of a gene or a group of genes is called position
effect. This is often due to the presence of heterochromatin nearby. The
change in a gene's location may cause a change in its expression (a problem
that has to be overcome in gene therapy).
In summary, the stages of
protein synthesis are: transcription, RNA processing, translation and post-translational
modifications (such as glycosylation, deamidation, acetylation, hydroxylation,
sulfation, lipidation, methylation or phosphorylation; including removal of the
N-terminal methionine in most proteins). Such posttranslational changes in the
molecule may play a role in disease pathogenesis despite no change in the
genetic code. An example is the replacement of beta-82 lysine by asparagine (N)
or aspartic acid (D) in haemoglobin (Charache,
1977). Several other posttranslational deamidations in the haemoglobin
molecule have been reported (see OMIM 141900).
In Celiac disease,
the deamidation of gamma-gliadin creates an epitope that acts as the
self-antigen in the initiation of this autoimmune disease (Molberg,
1998). Deacetylation of lysine or methylation of lysine and arginine
in histone molecules results in gene silencing similar to methylation of CpG
sequences in the DNA (histone molecules can also undergo ubiquitination or
phosphorylation as post-translational modifications).
An example of gene expression regulation
Cellular iron uptake and storage
are regulated through a feedback control mechanism mediated at the
post-transcriptional level by cytoplasmic factors known as iron-regulatory
proteins 1 and 2. These proteins sense levels of iron in the transit pool. When
iron in this pool is scarce, they bind to stem-loop structures known as
iron-responsive elements in the 5' untranslated region of the ferritin mRNA and
the 3' untranslated region of the transferrin receptor mRNA. Such binding inhibits
translation of ferritin mRNA and stabilizes the mRNA for transferrin receptors.
The opposite scenario develops when iron in the transit pool is plentiful (Ponka,
1998). An interesting example of the regulation of gene expression is
random allelic inactivation. This occurs in autosomal genes (Rhoades,
2000) similar to X-inactivation or parental imprinting. Genes that show
random allelic inactivation include olfactory receptor genes, and the various
genes encoding antigen receptors on lymphocytes (immunoglobulin genes, T cell
receptor genes and NK receptor genes (see NK Cell
Receptors)).
Epigenetic effects on gene expression
Genetic phenomena without any DNA nucleotide change that
affect gene expression include paramutation (paramutation is an allelic
interaction that results in meiotically heritable changes in gene expression in
plants), methylation, genomic imprinting, allelic
exclusion and histone modification / histone code hypothesis
(see glossary). Such epigenetic changes, and especially their
heritability, have been among the most debated topics of recent years (see Hidden Inheritance by G Vines in New
Scientist, 28 Nov 1998, pp.27-30; Epigenetics:
Special Issue of Science, 2001; Rando, 2007). The
National Fragile X Foundation website explains the molecular basis of a
well-known epigenetic disease, fragile X syndrome.
This disease is due to CGG triplet repeat expansion, which triggers methylation
of CpGs in the repeat tract and subsequent silencing of the FMR1
gene promoter (Coffee,
2002). For a long time, epigenetic regulation of gene activity has
been synonymous with DNA methylation. It is now better appreciated that
epigenetic changes do not depend only on DNA methylation but also on a number
of covalent histone modifications. There is even a link between DNA methylation
and histone modification resulting in chromatin remodelling with subsequent
alteration of gene activity. All of these interactions are collectively called
epigenetic crosstalk (for a review, see Weissmann
& Lyko, 2003). DNA methylation indeed has a direct effect on
gene expression by modulating the interaction between transcription factors and
DNA but it also exerts indirect effects on the regulation of gene expression
through chromatin modifications. Transcriptionally inactive heterochromatin
is packed densely whereas active euchromatin is less condensed.
Acetylation, phosphorylation and methylation of histone proteins affect gene
expression via induction of changes in the state of chromatin. For example,
histone deacetylation and methylation of histone H3 at lysine 9 (H3-K9) mark
(inactive) heterochromatin, which hinders transcription factor access to
binding sites in regulatory regions of genes (see histone in glossary;
histone acetylation and
chromatin remodelling).
Common techniques to measure gene
expression and regulation of gene expression are
The following features of
eukaryotic genomes can complicate bioinformatic studies of ‘gene finding’:
1. The genes are not located
end-to-end but instead separated by long stretches of intergenic
‘junk’ DNA (low gene density),
2. Alternative splicing creates
unexpected complexity in the products of the same gene when using mRNA
fragments to predict genes,
3. Genes nested within each other
or overlapping genes (complex gene structure),
4. Pseudogenes.
See Computational Gene
Recognition Programs for bioinformatics resources.
Analysis of Gene Expression
(See mRNA
Transcript Analysis in Cancer
Medicine e5 Online and Comparison
of Different Methods of mRNA Quantitation-Ambion.)
RNA can be isolated from cells in its intact form, free from
DNA (DNase treatment may be necessary to eliminate any contaminating DNA). As in all
RNA-related work, special precautions must be taken in extracting and
manipulating the RNA to prevent RNases from destroying the
molecule (see RNA Isolation: The Basics (Ambion)). See also Northern
Analysis: The Basics (Ambion) and Northern blot movie.
There is a limit to the
sensitivity of
RNase Protection Assay (RPA): Another technique used in the
analysis of mRNA is the nuclease S1 (RNAse) protection assay (RPA). This assay
is more sensitive than
Since the nucleases that digest the annealed probe/mRNA
hybrid are specific for single-stranded nucleotides, any mismatches between
probe and target are susceptible to digestion. A mismatch can be detected if
the nuclease-digested radiolabelled probe is smaller than expected, or when the
probe has been digested into multiple fragments. In fact, by careful
measurement of the length of the digested probe, exactly where the mismatch has
occurred in the target mRNA can be detected. Because conditions for
annealing RNA to DNA are highly selective, S1 nuclease protection analysis is
very sensitive. See also Nuclease Protection
Assays: The Basics (Ambion) and RPA
movie.
DNase I Footprinting Assay: DNase I
footprinting assays are based on the principle that any protein that is bound
to a DNA fragment will protect the DNA from digestion by DNase I. It is used to
determine the sites at which proteins bind to DNA. In experiments of this type,
a DNA fragment is radiolabelled at one end. The labelled DNA is incubated with
the DNA binding protein of interest and then subjected to partial digestion
with DNase I. The DNA regions interacting with the protein can be identified by
comparison of the digestion products of the protein-bound DNA with those
resulting from identical DNase treatment of a parallel sample of DNA that was
not incubated with protein (or incubated with another protein). The region of
DNA that is protected by the DNA binding protein will appear as a gap
(footprint) in the ladder of DNase I digestion products.
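The comparison of the protein-bound and protein-free digestion patterns can
be sketched in Python as follows. This is only an illustration under stated
assumptions: band intensities are taken to be already quantified for each
nucleotide position, and the function name find_footprints, the ratio
threshold and the minimum footprint length are arbitrary, hypothetical
choices rather than established values.

```python
def find_footprints(free, bound, max_ratio=0.3, min_length=3):
    """Return (start, end) index pairs of protected regions (inclusive).

    free  -- band intensity per position for DNA digested without protein
    bound -- band intensity per position for DNA digested with protein
    A position counts as protected when bound/free <= max_ratio.
    """
    if len(free) != len(bound):
        raise ValueError("both lanes must cover the same positions")
    footprints = []
    run_start = None  # start index of the current protected run, if any
    for pos, (f, b) in enumerate(zip(free, bound)):
        protected = f > 0 and (b / f) <= max_ratio
        if protected and run_start is None:
            run_start = pos
        elif not protected and run_start is not None:
            if pos - run_start >= min_length:
                footprints.append((run_start, pos - 1))
            run_start = None
    # Close a run that extends to the last position.
    if run_start is not None and len(free) - run_start >= min_length:
        footprints.append((run_start, len(free) - 1))
    return footprints


# Hypothetical intensities for 20 positions; positions 8-13 are protected.
free_lane = [100] * 20
bound_lane = [95] * 8 + [10] * 6 + [98] * 6

print(find_footprints(free_lane, bound_lane))
```

The minimum length guards against calling a footprint from a single weak
band, which DNase I can produce on naked DNA as well.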
DNA footprinting can also be
performed in vivo: permeabilised cells or intact nuclei are exposed to
DNase I before isolation of DNA. In vitro and in vivo DNA footprinting assays
often give different results. This suggests that the interactions or proteins
involved may differ between the intact cell and naked DNA that is
mixed with nuclear protein extracts.
Footprinting assays are able to
analyse only 150-200 bp DNA segments in each assay. The number and nature of
proteins bound remain unknown unless specific proteins are used in isolation in
the assay. The sensitivity is low; it can only detect interactions with
abundant proteins. Because DNase I can still cleave several base pairs away from
the protein binding site, the resolution of the binding site is relatively low.
The
electrophoretic mobility shift assay (EMSA): EMSA (also called gel-shift or
band-shift assay) is more useful than the footprinting assay for quantitative
analysis of DNA-protein binding reactions. It measures the
ability of purified proteins to bind to radiolabelled DNA fragments (usually
10-50 bp). In this assay, the electrophoretic mobility of a radiolabelled oligonucleotide
is determined in the presence and absence of a sequence-specific DNA-binding
protein. The samples are electrophoresed on a non-denaturing polyacrylamide
gel and the bands are detected by autoradiography.
Protein binding generally reduces the mobility of a DNA fragment, causing a
shift in the position of its band. Free DNA probe that is not
bound to protein migrates the fastest; any probe that is bound to protein is
retarded and moves more slowly through the gel. This technique is very
useful for determining the exact nucleotides necessary for protein binding by
using mutant oligonucleotides. See EMSA
at Molecular
Biology of the Cell.
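A minimal Python sketch of how EMSA band intensities might be turned into a
quantitative measure of binding is given below. It assumes that the free and
shifted bands in each lane have been quantified, and it estimates the protein
concentration giving half-maximal binding by linear interpolation between
lanes. That value approximates the dissociation constant only when the probe
concentration is much lower than the dissociation constant. The function
names and all numbers are hypothetical.

```python
def fraction_bound(free, shifted):
    """Fraction of probe in the shifted (protein-bound) band of one lane."""
    total = free + shifted
    if total <= 0:
        raise ValueError("lane contains no detectable probe")
    return shifted / total


def half_saturation(protein_conc, fractions):
    """Protein concentration at which half of the probe is bound.

    Uses linear interpolation between neighbouring lanes; returns None
    if the titration never crosses 0.5.
    """
    points = list(zip(protein_conc, fractions))
    for (c0, f0), (c1, f1) in zip(points, points[1:]):
        if f0 <= 0.5 <= f1 and f1 > f0:
            return c0 + (0.5 - f0) * (c1 - c0) / (f1 - f0)
    return None


# Hypothetical titration: protein concentration (nM) and band intensities.
conc_nM = [0, 1, 2, 4, 8]
free_band = [1000, 800, 600, 400, 200]
shifted_band = [0, 200, 400, 600, 800]

fractions = [fraction_bound(f, s) for f, s in zip(free_band, shifted_band)]
print(fractions)
print(half_saturation(conc_nM, fractions))
```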
Western Blot Analysis: This is
an electrophoretic blotting technique used to detect and analyse proteins.
Protein is subjected to electrophoresis on polyacrylamide gels that often
contain a detergent to separate the molecules by their molecular weight. The
separated protein is then transferred to a filter using high-voltage
electrophoresis in a similar way to other blotting techniques. Enzyme-linked or
isotopically labelled antibodies that are species-specific are applied to the
filter. The antibodies bind to the proteins they are specific for and the
protein-antibody bands can then be visualised by either autoradiographic or
colorimetric methods.
Serial Analysis of Gene Expression (SAGE): Serial
analysis of gene expression (SAGE) is a method for comprehensive analysis of
gene expression patterns. In SAGE, the investigator sequences a small and unique fragment of each
expressed gene (called a SAGE tag) and quantifies the number of times it
appears (called the SAGE tag number). The SAGE tag numbers, therefore, directly
reflect the abundance of the corresponding transcript. Unlike DNA chip
analysis, SAGE is able to detect and quantify the expression of previously
uncharacterised genes. The results are equivalent to those obtained from
constructing a cDNA library from the tissue of interest and sequencing every
clone.
SAGE methodology can be summarised by the following three principles: (1) a short
sequence tag (10-17 bp) contains sufficient information to uniquely identify a
transcript; (2) sequence tags can be linked together to form long serial
molecules that can be cloned and sequenced; (3) quantitation of the number of
times a particular tag is observed provides the expression level of the
corresponding transcript. Because multiple 10-base-pair SAGE tags can be
packaged in a single plasmid, fewer plasmid preparations and DNA
sequencing reactions are required to analyse a large number of genes: a
single sequencing reaction can provide information on 30 to 35 different SAGE
tags, and therefore 30 to 35 different genes (see also SAGE for Beginners by EMBL;
Applications of SAGE).
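The counting step can be sketched in Python as follows. This is a deliberate
simplification: it assumes that each sequenced insert is simply a run of
fixed-length tags joined end to end, whereas real SAGE concatemers are built
from ditags separated by anchoring-enzyme sites and need more careful
parsing. The function name count_sage_tags and the example sequences are
hypothetical.

```python
from collections import Counter


def count_sage_tags(concatemers, tag_length=10):
    """Count how often each fixed-length tag occurs in sequenced concatemers.

    Assumption: tags are joined end to end with no linker or punctuation
    sequence between them; trailing bases that do not fill a tag are ignored.
    """
    counts = Counter()
    for read in concatemers:
        usable = len(read) - len(read) % tag_length
        for start in range(0, usable, tag_length):
            counts[read[start:start + tag_length]] += 1
    return counts


# Three hypothetical 10-bp tags and two sequenced concatemers.
tag_a = "AAAAACCCCC"
tag_b = "GGGGGTTTTT"
tag_c = "ACGTACGTAC"
reads = [tag_a + tag_b + tag_a, tag_c + tag_a]

tag_counts = count_sage_tags(reads)
for tag, number in tag_counts.most_common():
    print(tag, number)
```

The tag number printed for each tag corresponds to the SAGE tag number
described above; mapping a tag back to its gene requires a separate lookup
against known transcript sequences.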
RNA Differential Display Analysis: This
method is used to compare two cell populations. mRNA
is isolated from both populations. Reverse transcription and PCR are performed
using a poly-T primer, which will anneal to the poly-A tail of mRNA, and a set
of primers with random hexamers, which by chance will anneal to sequences
upstream of the poly-A tail in mRNA. Since the upstream primer will anneal at
random to different mRNA species, the lengths of the PCR products will vary for
nearly every mRNA. The amplification is performed in the presence of
radiolabelled nucleotides so that the products from the two reactions can be
visualised on a high-resolution polyacrylamide gel. Bands that are much darker
in one lane than in the other represent mRNA species that were
overexpressed in one cell population relative to the other. A band may also be
totally absent from one of the populations examined. The cDNA
representing such bands can be recovered from the gel for further analysis and
identification.
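The lane comparison can be illustrated with a small Python sketch. It
assumes that bands have already been quantified and matched between the two
lanes by fragment size; the function name compare_lanes, the three-fold
threshold and the intensities are hypothetical.

```python
def compare_lanes(lane_a, lane_b, min_fold=3.0):
    """Flag bands that differ between two differential display lanes.

    lane_a, lane_b -- dicts mapping fragment size (bp) to band intensity
    Returns a dict mapping fragment size to a short description; bands of
    similar intensity in both lanes are left out.
    """
    calls = {}
    for size in sorted(set(lane_a) | set(lane_b)):
        a = lane_a.get(size, 0)
        b = lane_b.get(size, 0)
        if a > 0 and b == 0:
            calls[size] = "A only"
        elif b > 0 and a == 0:
            calls[size] = "B only"
        elif a > 0 and b > 0:
            if a / b >= min_fold:
                calls[size] = "higher in A"
            elif b / a >= min_fold:
                calls[size] = "higher in B"
    return calls


# Hypothetical band intensities for two cell populations.
population_a = {150: 100, 220: 900, 310: 50}
population_b = {150: 110, 220: 100, 400: 300}

print(compare_lanes(population_a, population_b))
```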
DNA Microarray Analysis: Comparative gene expression
profiling and genome-wide analysis of gene expression can be achieved by DNA
microarrays (DNA chips). Either 25-nucleotide long fragments of known DNA
sequences (oligonucleotide arrays for sequence variation studies) or cDNA
fragments (cDNA arrays for expression profile studies) are immobilised on glass
surfaces on a 1.3cm x 1.3cm microarray in a predetermined order (grid).
Thousands of fragments can be stored on a single chip. The sample of interest
(tumour, tissue, species) to be examined for gene expression profile should be
available in a form that will allow RNA extraction. The RNA is labelled with
a fluorescent dye and hybridised with the fragments on the microarray. Hybridisation
events are captured by scanning the surface of the microarray with a laser
scanning device and measuring the fluorescence intensity at each position in
the microarray. The fluorescence intensity of each spot on the array is
proportional to the level of expression of the gene represented by that spot.
DNA microarrays have been used to understand the cell cycle, haematopoietic
differentiation, interferon gamma treatment and cancer classification. The
ability to monitor the expression levels of thousands of genes simultaneously
offers the opportunity to expand the analysis of cancer genetics beyond
single-candidate-gene approaches. Microarrays are capable of monitoring
the expression levels of genes across the entire human genome using nanograms of total RNA.
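Since the fluorescence intensity of each spot is taken to be proportional to
expression, two samples can be compared spot by spot. The Python sketch below
computes log2 intensity ratios and centres them on their median, which is one
simple way of removing a uniform difference in labelling efficiency between
two hybridisations. The median centring, the intensity floor, the function
name log2_ratios and the gene names are assumptions made for illustration,
not a method described in this text.

```python
import math
from statistics import median


def log2_ratios(sample, reference, floor=1.0):
    """Median-centred log2(sample/reference) for genes present in both.

    sample, reference -- dicts mapping gene name to spot intensity
    floor             -- minimum intensity, avoids taking the log of zero
    """
    genes = sorted(set(sample) & set(reference))
    if not genes:
        return {}
    raw = {
        gene: math.log2(max(sample[gene], floor) / max(reference[gene], floor))
        for gene in genes
    }
    centre = median(raw.values())  # assumes most genes are unchanged
    return {gene: raw[gene] - centre for gene in genes}


# Hypothetical spot intensities; the sample was labelled twice as strongly.
tumour = {"g1": 2000, "g2": 500, "g3": 1000, "g4": 250, "g5": 1000}
normal = {"g1": 250, "g2": 250, "g3": 500, "g4": 125, "g5": 2000}

for gene, value in log2_ratios(tumour, normal).items():
    print(gene, value)
```

A centred value of 2 corresponds to a four-fold higher signal in the sample
than in the reference, and a value of -2 to a four-fold lower signal.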
The challenge is,
however, the interpretation of the microarray data. The key is to develop
methods for recognizing meaningful gene expression patterns and distinguishing
those patterns from noise. Such noise (random gene expression levels) can be
generated by (1) variability among microarrays, (2) variability in RNA
labelling and hybridisation methods, and perhaps most importantly, (3)
biological variability among samples. It has become clear that the successful
elucidation of genetic networks through expression profiling will require the
expertise of a new generation of scientists (computational biologists). (See
also Ambion
Guide for Array Analysis; DNA
Microarray Web site; Software AMIADA (Analyzing
Microarray Data) by X Xia.)
Real-time PCR is another method to
analyse the expression of one or several specific genes quantitatively. This
method requires first the conversion of mRNA to cDNA by reverse transcription.
The rest is based on the same principles as conventional PCR, with important
modifications that allow quantitation. See real-time
PCR and an Ambion
article on real-time and quantitative RT-PCR.
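One widely used way of turning real-time PCR readings into relative
expression values is the comparative threshold cycle (delta-delta Ct)
calculation, sketched below in Python. It is not described in this text and
is shown only as an example of quantitation. It assumes that the target and
reference genes amplify with the same efficiency (a perfect doubling per
cycle by default). The function name relative_expression and the Ct values
are hypothetical.

```python
def relative_expression(ct_target_sample, ct_reference_sample,
                        ct_target_control, ct_reference_control,
                        efficiency=2.0):
    """Fold change of a target gene in a sample relative to a control.

    Each Ct is the cycle at which fluorescence crosses the threshold.
    The target gene is first normalised to a reference gene in each
    condition, then the two conditions are compared.
    """
    delta_sample = ct_target_sample - ct_reference_sample
    delta_control = ct_target_control - ct_reference_control
    delta_delta = delta_sample - delta_control
    return efficiency ** (-delta_delta)


# Hypothetical Ct values: the target crosses the threshold three cycles
# earlier in the sample than in the control, at equal reference Ct.
fold = relative_expression(22.0, 18.0, 25.0, 18.0)
print(fold)
```

A lower Ct means more starting template, so a negative delta-delta Ct gives
a fold change above 1.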
Molecular Biology Links
Human Gene Expression in Human Molecular Genetics Online (Strachan & Read)
Gene Transcription chapter in the Molecular Biology Web Book
Eukaryotic Gene Expression Tutorial in the Biology Project
Molecular Structure of Genes and Chromosomes in Molecular Cell Biology
Regulation of Transcription Initiation in Molecular Cell Biology
Control of Gene Expression in the Medical Biochemistry Page
DNA Learning Center: DNA Interactive - Manipulation - Genome - Applications
Biology Animations, Movies and Tutorials: Transcription & Translation
Gene Quantification Page by MW Pfaffl
Differential Gene Expression Techniques
Animations of Transcription and Translation at Genetics (BA Pierce) Website
WH Freeman Biology Books Companion Sites (Lodish 5e)
Biology 7/e (Raven et al): Online Learning Center: Online Labs: Chapter 15 Animations
Critical Reviews™ in Eukaryotic Gene Expression
Gene Expression: mRNA Transcript Analysis & Protein Analysis in Cancer Medicine e5 Online
NIH WebCasts: Current Topics in Genome Analysis & Genome Analysis
BioInformatics Links
TransFac TFSearch (Transcription Factors and Binding Sites)
TRRD Database of Transcription Regulatory Regions of Eukaryotic Genes
Splice Predictor Online GenSCAN PromoSer FIEv2 Other Gene Prediction Programs
DNALC Bioinformatics in the Classroom Course Notes: Identifying Genes in DNA Sequences
Bibliography on Computational Gene Recognition
Rogic et al. Evaluation of Gene-Finding Programs. Genome Res 2001
Mathe et al. Current methods of gene prediction, their strengths and weaknesses. NAR 2002
Statistical Analysis of Gene Expression by Terry Speed (Berkeley) (PPT)
Bassett et al. Gene Expression Informatics (review). Nat Genet 1999
RT-PCR Primer Bank RT-PCR Primer DataBase RT-PCR Primer Sets
Quantitative PCR Primer Database - QPPD (NCI)
BioInformatics Slide Presentation I & II
BMC BioInformatics BMC Genomics
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (Wiley, 2001)
Bioinformatics and Genome Analysis (Springer-Verlag, 2002)
Handbook of Statistical Genetics (Wiley, 2003), see TOC (includes a chapter on Gene Prediction)
Science Magazine Gene Expression Link
Science Magazine Genes in Action Special Issue (22 Oct 2004)
Address for bookmark: http://www.dorak.info/genetics/notes04.html
M.Tevfik Dorak, MD, PhD
Last updated on 2 March 2007