POSSIBLE MISUNDERSTANDINGS AND
MISCONCEPTIONS IN GENETICS
M.Tevfik Dorak, MD, PhD
Please update your bookmark: http://www.dorak.info/genetics/misund.html
See also Misconceptions about Evolution: Berkeley, TalkOrigins and Wikipedia
*
The OMIM (Online Mendelian Inheritance in Man) database lists
non-Mendelian disorders as well. Likewise, NCBI-dbSNP records all base changes, not just SNPs.
*
The term SNP is often used to mean any sequence variation, but this is incorrect. SNPs may be the most common type of variation but there are
many other kinds, most importantly structural variation (see Feuk, 2006) and copy number variation (see Redon,
2006). Sequence variations were known before the term SNP was
first used, but they went by different names.
*
Mutation involves any change in the hereditary material: from a point mutation
to a chromosomal loss. To have a functional consequence, a mutation does not
have to be in the coding region. An intronic mutation
may well result in a non-functional gene (like the splice site
mutation in CYP21A2).
*
Many authors use the term mutation for any rare allele (<1%) and the term
polymorphism for any common allele (>1%). This is one definition of mutation
and polymorphism. Another is that a ‘mutation’ is any variation
in the gene that causes an obvious change in phenotype, whereas a polymorphism does
not cause any obvious phenotypic change. It is best to be aware of both
definitions while following the recommendations of the Human Genome Variation Society
and using 'sequence variant', 'alteration' or 'allelic variant' for any genomic change regardless of its frequency or
phenotypic effects.
*
Even 1M SNP chips may not be able to cover the whole genome. One reason is
that genes that need to be selectively amplified first (due to
the presence of a duplicated copy or pseudogene) will
not be represented at all. Highly polymorphic regions (such as HLA genes) are
not represented either, because the lack of constant regions flanking the
variants makes primer design difficult.
*
Chromosomes do not normally have the shape in which they are most frequently
illustrated. Those figures depict a metacentric 'replicated'
chromosome.
*
Base-pair (bp) is used to quantitate the length of
nucleic acids but it should really be used for DNA only since RNA is
single-stranded.
*
Expressivity and penetrance are very different
concepts. Expressivity is the variation in the expression of a trait or a
disease (phenotypic heterogeneity). Gaucher disease
and neurofibromatosis are examples of variable expressivity. Penetrance refers to the frequency of expression of a genotype
regardless of the severity of the phenotype. Low-penetrance
genotypes are expressed in only a small proportion of the individuals bearing
them (as in acute intermittent porphyria). When expressed, the
phenotype may still vary in its clinical severity (this is expressivity). See the review
on penetrance and expressivity by Zlotogora J, 2003 in Genetics in Medicine.
*
If different alleles of a locus cause the same disease, this is called 'allelic
heterogeneity' but if they cause different diseases/phenotypes, this is also
called 'allelic heterogeneity'. This term should be used unambiguously.
*
Genetic heterogeneity and locus heterogeneity are used interchangeably in
practice but this requires attention. 'Locus heterogeneity' is only used for
the involvement of different loci in the causation of a disease/phenotype
individually (as in early-onset Alzheimer disease; three different genes may
cause the same phenotype). 'Genetic heterogeneity' may also be used to mean a
combined effect of different loci in the development of a (complex) disease (as
in diabetes; multiple loci are 'simultaneously' involved in the development of
diabetes).
*
There is a difference between a trait being influenced by genes and trait
variation being influenced by genetic variation. See
Terwilliger & Weiss, 2003.
*
Syntenic means two different things in different
contexts. 'Syntenic genes' are described as
"genes thought to reside on the same chromosome" in Dictionary of
Genetics by King & Stansfield (see for example a Lecture
Note on Linkage & Recombination), whereas 'syntenic
maps' are described as "genetic loci that lie in the same order on the same chromosome
in different species" in Dictionary of Biological Terms by E Lawrence (see
for example, NCBI Human-Mouse conserved synteny
maps or Human chromosome 22 and syntenic Mouse chromosome 16 maps). Both are
correct but this dual usage may cause misunderstandings. Lee Silver explains
both in the same entry in the Encyclopedia
of Genetics: synteny describes two or more genes
or loci that have been mapped to the same linkage group. Conserved synteny refers to the situation where two linked loci in
one species have homologs that are also linked in
another species.
*
Genetic distance may also mean two different things: (1) in population genetics,
genetic distance is a measure of the genetic relatedness of any number of species
or populations; (2) in gene mapping, genetic distance is a measure of recombination (the
unit is the centimorgan, cM),
which roughly correlates with physical distance (the number of nucleotides
between two loci) in the human genome.
*
At the genomic level, it is often quoted that there is 99.8% similarity in
certain coding regions between humans and chimpanzee genomes. Remember that:
1. One nucleotide (out of 3
billion) difference may cause lethal diseases [sickle cell disease, hereditary hemochromatosis, etc]
2. Just one gene 'SRY' of the Y
chromosome is responsible for almost all the difference between a male and a
female
3. On the other hand, at the
sequence level any two human subjects differ from each other by 0.1%. This
corresponds to 3 million nucleotide differences. It is not the number of
differences but the nature and location of differences that matter. And most
importantly:
4. Any conclusion drawn only from
linear DNA sequence comparison ignores the effects of epigenetic differences,
posttranscriptional and posttranslational effects.
(See a discussion in Scientific American; Marks J: What It Means to Be 98% Chimpanzee. University of California Press, 2002; Diamond J: The Third Chimpanzee. Perennial, 1994; What Makes us Humans? in Human Molecular Genetics; and Science Magazine Breakthrough of the Year 2005: Evolution in Action.)
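The arithmetic in point 3 can be checked with a short sketch. The genome size (3 billion nucleotides) and the 0.1% figure are taken from the text; the function name is only illustrative.

```python
def differing_nucleotides(genome_size_bp: int, fraction_different: float) -> int:
    """Number of nucleotide positions expected to differ between two genomes."""
    return round(genome_size_bp * fraction_different)


# Figures quoted in the text: 3 billion nucleotides, 0.1% difference.
print(differing_nucleotides(3_000_000_000, 0.001))  # 3000000
```

The count alone says nothing about which differences matter, which is the point made in the list above.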
* Humans
are not descendants of present-day apes. The two share a common ancestor that lived about 5-7 Mya and is now extinct. Similarly, modern
humans are not descendants of Neanderthals; the two shared a common ancestor
that lived about half a million years ago.
*
Different species of humans have been recognized (H. erectus, H. habilis etc). This does not mean these
‘species’ lived contemporarily and could not interbreed. This
operational classification is based on structural differences and not on
genetic isolation.
* Mendel's
experiments were on characters determined by single genes. Multiple genes and
environmental factors that interact with one another determine most characters.
Quantitative genetics deals with such characters or complex diseases. See also
‘Some apparent
exceptions to Mendelian rules’.
* A quantitative
locus is involved in the expression of a continuous character like weight or
height but not a countable one.
*
Multivariable analysis and multiple comparisons have nothing in common.
Performing multivariable adjustment (adjusting the odds ratio/P value for potential confounders) does
not adjust for multiple comparisons. Also, the result of a multivariable analysis can go
either way, so it is not fair to state that “even after adjustment in a
multivariable model, the association remained significant.” By the way,
multivariable and multivariate are not the same thing;
they have totally different meanings (see Biostatistics Glossary).
*
Dominance and recessiveness are the features of
phenotypes but not genes. It is more common to call genes dominant or recessive
but strictly speaking, this is wrong.
*
Dominance models in genetic epidemiology refer to associations with heterozygosity, and ‘dominance’ here is very
different from ‘dominance’ as used in classic
genetics.
* DNA and
RNA are traditionally called nucleic acids. The so-called 'nucleic' acids
include extra-nuclear (cytoplasmic) DNA and tRNA/rRNA too. In other words, mitochondrial DNA is also a
nucleic acid but it is not in the nucleus. Similarly, viral DNA or RNA are
nucleic acids but not enclosed in a nucleus.
* DNA is
deoxyribonucleic acid composed of
four nitrogenous bases (A, T, G, C), deoxyribose and acidic phosphate groups. It is the
phosphate group that makes it acidic.
* Start codons AUG/GUG code for methionine
and valine. This does not necessarily mean that each
polypeptide starts with Met/Val because most of the time the initial residue is removed by
post-translational modifications.
* Genes
are said to be transcribed 5' to 3'. This is the direction on the coding strand
but it is actually the non-coding or template strand which is transcribed and
this happens 3' to 5'. The resulting mRNA is made 5' to 3'.
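The strand relationships described here can be made concrete with a small sketch. The sequence and the function names are made up for illustration. The point is only that reading the template strand 3' to 5' yields an mRNA that matches the coding strand 5' to 3', with U in place of T.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_5_to_3: str) -> str:
    """Template strand written 3'->5', aligned base by base under the coding strand."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5_to_3)


def transcribe(coding_5_to_3: str) -> str:
    """mRNA written 5'->3', built by pairing against the template read 3'->5'."""
    template_3_to_5 = template_strand(coding_5_to_3)
    return "".join(RNA_COMPLEMENT[base] for base in template_3_to_5)


coding = "ATGGTGCAC"            # hypothetical coding strand, 5'->3'
print(template_strand(coding))  # TACCACGTG (3'->5')
print(transcribe(coding))       # AUGGUGCAC (5'->3')
```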
*
Antisense treatment is targeted against the mRNA (which is always a sense
strand) but not against the sense (coding) strand of DNA.
*
Beadle’s one gene-one enzyme/protein concept is essentially correct but
no longer strictly valid. Alternative splicing, overlapping genes,
posttranslational modifications and other mechanisms create more than one
protein product from a given sequence in the genome. Similarly,
the same gene may cause more than one distinct clinical syndrome (due to
different mutations) (for more on this, see Clinical
Genetics). One reason the estimated number of genes in the human genome
was initially so high (around 100 thousand) was a strict interpretation of the one
gene-one protein concept.
* The
number 30-35 thousand for functional genes in the human genome is only for genes
identified structurally by conventional criteria. It does not include the
non-conventional genes such as small RNAs,
transcribed/processed pseudogenes, alternatively
spliced versions and some of the overlapping genes. The total number of proteins
encoded by the human genome is many times the number of structurally recognized
genes.
*
Confusion may arise when two different numbers are quoted for the number of
genes in a genome. One has to be specific about what is meant. The total number
of genes (loci) and the number of protein-expressing genes in a
genome are different. Also, the number of polymorphic markers (millions) far exceeds
the number of genes (30-35 thousand in the human genome).
* Pseudogenes can be transcribed (examples are CYP21A1P
and DRB4-null).
Although not always translated into a protein product, a pseudogene
can be transcribed into an RNA product, which can be involved in the regulation of gene expression
(for an example, see Hirotsune
et al, 2003). This is similar to the transcribed but untranslated
parts of a conventional gene. Pseudogenes and processed
pseudogenes should also be distinguished (for more, see http://pseudogene.org).
* DNA is not the blueprint for life. It can be
said that the DNA contains the biochemical instructions a living organism will
need. Think about sex determination in reptiles or the penetrance
issue for a cancer susceptibility gene. Epigenetic variation creates
differences between individuals bearing the same DNA sequence (see the
chimpanzee vs. human sequence similarity discussion above).
* Leader
sequence and signal sequence are used interchangeably as if they were the same
thing. Although this is common practice, strictly speaking, leader sequence is
only transcribed but not translated and it leads the mRNA to the ribosomes; signal sequence is translated and helps the
protein to reach its final destination (this may be outside the cell) where it
is cleaved off.
* Another
unfortunate common practice is using allele/gene/antigen/phenotype/marker
frequency interchangeably. For a brief discussion of their exact definitions,
see: Statistical Analysis of HLA and Disease
Associations. One thing that is more than just careless usage is the use of
‘carrier / carriage frequency’ instead of marker frequency. Carrier
frequency has a specific meaning in genetics (as in carrier frequency for thalassaemia trait) and should not be used to describe the
proportion of tested subjects positive for a marker (heterozygous and
homozygous combined) when marker frequency is the more appropriate term (for
reference, see Svejgaard & Ryder, 1994).
* A huge
and unforgivable mistake is to compare allele (gene) frequencies (corresponding
to multiplicative model) with marker (allele positivity) frequencies
(corresponding to dominant model) in association studies (and usually to find a
very significant association!). This is not as rare as one might think (or
hope).
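A minimal sketch of why these two frequencies are not comparable. The genotype counts are invented for illustration and the function names are not from any cited source. Allele frequency counts chromosomes, whereas marker (allele positivity) frequency counts subjects.

```python
def allele_frequency(n_homozygous: int, n_heterozygous: int, n_subjects: int) -> float:
    """Copies of the allele divided by the number of chromosomes (2 per subject)."""
    return (2 * n_homozygous + n_heterozygous) / (2 * n_subjects)


def marker_frequency(n_homozygous: int, n_heterozygous: int, n_subjects: int) -> float:
    """Proportion of subjects positive for the allele (homozygous or heterozygous)."""
    return (n_homozygous + n_heterozygous) / n_subjects


# Hypothetical sample: 100 subjects, 4 homozygous and 32 heterozygous for the allele.
print(allele_frequency(4, 32, 100))  # 0.2
print(marker_frequency(4, 32, 100))  # 0.36
```

In this made-up sample the same data give 0.20 on one scale and 0.36 on the other, so setting an allele frequency in one group against a marker frequency in another can produce a "difference" where none exists.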
* Polymorphism has several
definitions. A polymorphic locus was originally defined as a locus in which the
least common allele occurs with a frequency of at least 1%. A more appropriate
definition has been suggested by Elston: a locus in
which the most common allele occurs with a frequency of at most 99% (Elston, 2000). The original definition would fail to
accommodate the HLA loci, which have >100 alleles (although these are the
cumulative product of multiple polymorphic sites within each gene), because with more than 100 alleles at least one must have a frequency below 1%. Elston's definition allows for more than 100 alleles.
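The two definitions can be compared on hypothetical allele frequencies. The function names and the example loci are illustrative assumptions, not taken from Elston (2000).

```python
def polymorphic_original(allele_freqs, minimum=0.01):
    """Original definition: the least common allele has a frequency of at least 1%."""
    return min(allele_freqs) >= minimum


def polymorphic_elston(allele_freqs, maximum=0.99):
    """Elston's definition: the most common allele has a frequency of at most 99%."""
    return max(allele_freqs) <= maximum


biallelic = [0.95, 0.05]        # hypothetical two-allele locus
hla_like = [1 / 200] * 200      # hypothetical locus with 200 equally frequent alleles

print(polymorphic_original(biallelic), polymorphic_elston(biallelic))  # True True
print(polymorphic_original(hla_like), polymorphic_elston(hla_like))    # False True
```

With more than 100 alleles the frequencies cannot all reach 1%, so only the second definition classifies such a locus as polymorphic.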
* A functional polymorphism may
either increase or decrease gene expression. This means not all polymorphisms
decrease gene expression/function. Similarly, not all
polymorphisms are deleterious or associated with risk;
they can be markers for either risk or protection.
* A genetic association may be a
chance finding (due to sampling error), and the opposite is also true: the lack
of an association may be a chance finding (i.e., an association may be obscured by
sampling error). However, no reviewer will tell you that failure to replicate
an association might have been due to chance! Population
stratification or any bias may also work either way but is usually suspected,
or has to be ruled out, only when an association is found.
* In PCR, dNTPs are used but in the final product, the nucleotides
are dNMPs. The two phosphates are cleaved off to
obtain energy to drive the reaction (DNA replication).
* Meiosis
is said to create four daughter cells out of one. This does not apply to an oocyte, which gives rise to a single daughter cell (ovum)
and two polar bodies. Moreover, an ovum's chromosome content is halved
only after fertilization.
* It is
quite common to name the growing human offspring as embryo or fetus regardless
of the period of intrauterine life. The correct terminology for the offspring:
[fertilization] zygote - conceptus - embryo (after
implantation) - fetus (after organogenesis is complete) [birth]. (Plural for
conceptus is either concepti
(Latin) or conceptuses (English)).
* 'Murine' refers to the rodent family Muridae,
which includes both rats and mice. By common practice, however, the term is
used almost exclusively for mice. If you mean mice, say mice.
*
Evolution does not prearrange what action to take. It proceeds by natural
selection in which individuals with the most adaptive characteristics in a
given environment are selected (favored). Over many generations, the number of individuals bearing the adaptive characteristics
increases until all individuals have them.
* Genetic
fitness is the overall ability to leave surviving offspring who themselves will
be able to reproduce successfully. It is unrelated to physical fitness.
Also, reduced fitness does not necessarily involve death; there are many
long-lived but infertile people.
*
Heritability may be high in a given population for a given character/disease
but this does not mean environment plays no role in the expression of
that phenotype. If the same estimate is attempted in a different population, it
may not be that high.
* Ethical
issues aside, eugenics is a fallacy. Even if all individuals expressing a
recessive disease are eliminated from the population (most of whom are not
fertile anyway), there will still be about 100 times as many asymptomatic carriers
of the gene for the same disease. Unless based on a very comprehensive genetic
screening program, a eugenics program will never achieve its aim.
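The "about 100 times" figure can be illustrated with a sketch that assumes Hardy-Weinberg proportions and a fully recessive disease. Both are assumptions of this example, not statements from the text. With disease allele frequency q, affected individuals occur at q^2 and carriers at 2pq, so the ratio is 2p/q.

```python
def carrier_to_affected_ratio(q: float) -> float:
    """Ratio of heterozygous carriers (2pq) to affected homozygotes (q^2) under HWE."""
    p = 1.0 - q
    return (2.0 * p * q) / (q * q)


# Hypothetical disease allele frequency of 0.02 (about 1 affected per 2500 births).
print(round(carrier_to_affected_ratio(0.02), 1))  # 98.0
```

The ratio grows as the allele gets rarer, so for rare recessive diseases nearly all copies of the allele sit in unaffected carriers.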
* It is now
possible to type thousands of polymorphisms of the genome in a single assay
using microarray technology. Every time a new susceptibility marker is
announced, there is a lot of talk about the possibility of an insurance
company’s use of it. If we are all screened for all of our genes, all of
us will have a few recessive lethal genes and a lot of susceptibility markers
for complex diseases. No one will ever be clear of all susceptibility genes.
The fallacy is that having the nucleotide sequence for a ‘bad’ gene
does not mean it will do any harm.
*
Susceptibility and predisposition are usually used interchangeably but most
genetic epidemiologists have begun to mean different things by these two words.
Predisposing genes are those high-penetrance genes
that are necessary and sufficient to cause a disease and susceptibility genes
are low-penetrance genes that are neither necessary
nor sufficient to cause a disease. Susceptibility genes contribute to disease
development in a multifactorial setting but the
disease can occur in their absence (Greenberg, 1993; Greenberg & Doneshka, 1996).
* Nick
translation is not actually a translation event in the classical sense. It is the
replication of DNA by a polymerase.
* Gene
expression is the process that converts a gene's coded information into the structures
operating in the cell. Expressed genes include those that are transcribed and
translated all the way to proteins, and those that are transcribed into RNA but
not translated into protein (e.g., transfer and ribosomal RNAs).
The XIST and H19 genes are transcribed but not translated into a protein product (Joubel, 1996; Milligan,
2002). Thus, gene expression should not be described as the conversion of
genomic information into protein sequences.
*
Nucleotide level changes are only one of many phenomena that affect gene
expression. Cis- and trans-acting modifiers, epistatic interactions, gene-environment interactions,
parent-of-origin, sex-specific imprinting and other epigenetic effects,
post-translational modifications and many others are involved in gene
expression. These are some of the reasons for the lack of strict
genotype-phenotype correlations for many genes.
* Gene
conversion is a misleading and confusing name for what it describes. It means conversion
(transformation) of one gene into another because of exchange of genetic
material with another gene.
*
Homology, analogy and paralogy are related but
different concepts and cannot be used interchangeably. Homology and analogy
concern different species (former with a common ancestor) and paralogy concerns the same species (see glossary). Most importantly, most of the time
there is no such thing as 'sequence homology' but what is meant by this is
'sequence similarity'
(see Similarity, Homology, Divergence and Convergence in the
NCBI Online Book: Sequence - Evolution - Function).
* Mitochondrial Eve theory does not
say that at some point in history there was only one female member of our
species. The difference between gene and individual genealogies is important in
the interpretation of the findings that led to this theory. All genes eventually
coalesce into a single common ancestor. The mitochondrial Eve was not
necessarily the only female alive at that time, or the first female to have
that particular type. Many other mtDNA lineages may
have gone extinct before or after she lived.
* Linkage,
association and linkage disequilibrium (LD) are all different concepts. The
HLA-A locus and hereditary hemochromatosis show
linkage in families. The HLA-A3 allele shows an association at the population
level because of LD between the C282Y mutation of HFE and HLA-A3. Linkage does
not imply LD and vice versa. The delta (D) value
may be zero for linked loci (linkage only occurring in families and no LD at
the population level which would occur if the disease mutation occurred on multiple chromosomes independently),
and delta value may be different from zero for unlinked loci. Linkage stems
from no recombination within the family (limited number of meiosis events) and
LD is a reflection of proportion of recombination events over many generations
in the population.
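The delta (D) value mentioned here is commonly computed as the observed haplotype frequency minus the frequency expected if the two alleles were independent. A minimal sketch, with hypothetical frequencies and an illustrative function name:

```python
def ld_delta(p_ab: float, p_a: float, p_b: float) -> float:
    """Delta (D) = observed haplotype frequency - product of the allele frequencies."""
    return p_ab - p_a * p_b


# Linkage equilibrium: the haplotype occurs as often as expected by chance.
print(ld_delta(0.25, 0.5, 0.5))            # 0.0
# Linkage disequilibrium: the haplotype is more frequent than expected.
print(round(ld_delta(0.40, 0.5, 0.5), 3))  # 0.15
```

Nothing in this calculation refers to map position, which is why D can be zero for linked loci and non-zero for unlinked ones.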
* A “risk marker” does
not have to increase the risk; it can be protective. An association is either a
risk association or a protective one.
* Unless otherwise stated, a genetic
polymorphism association is with the variant (minor, rare) allele of the polymorphism.
* The variant allele of a
polymorphism is not necessarily deleterious; it may cause a beneficial change
in gene function.
* A protective odds ratio (<1.0)
may be said to decrease the risk by (1 - OR) x 100 percent
but a risk OR cannot be translated to a percentage in the same way, although this is frequently done. This is
because protective odds ratios lie between 0 and 1 but risk odds ratios lie between 1 and infinity.
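As a sketch of the calculation described here, the following computes an odds ratio from a hypothetical 2x2 table and applies the (1 - OR) x 100 conversion only when the OR is protective. The counts and function names are invented. Refusing the conversion for OR >= 1 simply mirrors the point made in the text.

```python
def odds_ratio(cases_pos: int, cases_neg: int, controls_pos: int, controls_neg: int) -> float:
    """Cross-product odds ratio for marker positivity in cases vs. controls."""
    return (cases_pos * controls_neg) / (cases_neg * controls_pos)


def protective_percent(or_value: float) -> float:
    """(1 - OR) x 100 for a protective OR; not defined here for OR >= 1."""
    if not 0.0 <= or_value < 1.0:
        raise ValueError("conversion applied only to protective odds ratios (0 <= OR < 1)")
    return (1.0 - or_value) * 100.0


# Hypothetical study: 20/100 cases and 40/100 controls carry the marker.
or_value = odds_ratio(20, 80, 40, 60)
print(or_value)                      # 0.375
print(protective_percent(or_value))  # 62.5
```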
* Model-based linkage analysis is based on a likelihood
ratio, the logarithm of which is called a LOD score. This is not the logarithm of the odds for
linkage but the logarithm of the likelihood ratio for a particular value of the
recombination fraction vs. free recombination, i.e., θ (theta) = 0.5 (Elston, 1998; Olson,
1999).
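A minimal sketch of a LOD score as a log10 likelihood ratio, under the simplifying assumption of phase-known, fully informative meioses (so that the likelihood is binomial in the number of recombinants). The counts and the function name are illustrative only.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5] for this sketch")
    non_recombinants = meioses - recombinants
    likelihood_theta = theta ** recombinants * (1.0 - theta) ** non_recombinants
    likelihood_free = 0.5 ** meioses  # free recombination, theta = 0.5
    return math.log10(likelihood_theta / likelihood_free)


# Hypothetical family data: 1 recombinant in 10 informative meioses.
print(round(lod_score(1, 10, 0.1), 2))  # about 1.6
print(lod_score(1, 10, 0.5))            # 0.0, the reference value
```

The score is evaluated at a chosen value of theta and compared with theta = 0.5, which is the sense in which it is a likelihood ratio rather than the odds for linkage.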
* Linkage disequilibrium (LD) and
Hardy-Weinberg equilibrium (HWE) are also different things and do not have
much in common. LD is the absence of linkage equilibrium; it is quantitated by a delta value, and an associated P value shows the significance of the disequilibrium. In HWE
tests, a significant P value also means disequilibrium,
which is worrying when the population sample is supposed to be in HWE
(like the control group of an association study).
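A sketch of a simple HWE check for a biallelic marker, using the one-degree-of-freedom chi-square goodness-of-fit test. The genotype counts are hypothetical, the function names are illustrative, and the sketch assumes both alleles are present in the sample. Exact tests are often preferred for small samples.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing observed genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_1df(statistic: float) -> float:
    """Upper-tail P value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


print(hwe_chi_square(25, 50, 25))  # 0.0, genotypes exactly in HWE
stat = hwe_chi_square(40, 20, 40)  # heterozygote deficit
print(stat)                        # 36.0
print(chi_square_p_value_1df(stat) < 0.05)  # True
```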
* A major
assumption of HWE is random mating, yet there is no random mating in any human
population. How many tall women do you know who are married to short men? The
popular program Haploview uses P = 0.001 as
the threshold for Hardy-Weinberg equilibrium violation, probably in recognition
of the unrealistic assumptions of HWE.
* The
definition of haplotype does not require having two
markers on the same chromosome. For the HLA system, for example, HLA-B44-DR4 is
a haplotype flanked by the genes encoding B44 and DR4
but it is also possible to talk about DR4 haplotypes.
This would mean an undefined length of chromosomal segment carrying DR4 gene
extending either way.
* Cancer
is a genetic disease but the genetic changes concern somatic cells. Cancer due
to germline DNA mutations (inherited cancer) is less
than 10% of all cancers. Genes are usually the target, not the origin, of the
cancer process. Most mutations found in cancer cells (somatic) cannot be
detected in surrounding healthy cells (germ-line).
* Germline DNA mutations that cause inherited cancer are both
necessary and sufficient for cancer development and have high-penetrance. The more common low-penetrance,
germline polymorphisms that modify cancer risk are
part of the complex genetic susceptibility and do not
cause cancer on their own.
* Oncogenes
should be called proto-oncogenes. Anti-oncogenes are not genes that antagonize the effects of the oncogenes. They are the genes with anti-oncogenesis
effects. Similarly, carcinogens (or cancerogens) are
not necessarily cancer-causing genes; they are more
frequently chemicals or other environmental factors (e.g., viruses) which
promote cancer development.
* The original Knudson's two-hit
hypothesis (Knudson,
1975), which proposed that loss of heterozygosity or
homozygous deletion provides the two hits required for the loss of tumour suppressor gene activity, has now been extended to
include transcriptional silencing by DNA methylation
of promoters, which can also disable tumour suppressor genes (reviews by Yu & Shen,
2002; Balmain et al, 2003 and Paige,
2003).
* X-linked diseases (eg, hemophilia, Wiskott-Aldrich
syndrome) occur almost exclusively in males but these are recessive ones.
X-linked dominant diseases are not seen in males because affected males are lost during
fetal development (see Clinical Genetics).
* Some X chromosome genes escape
inactivation (Shapiro,
1979). This creates a situation that is the opposite of what is achieved by Lyonisation (gene dosage compensation): X chromosome genes
that escape inactivation are represented twice as much in females (eg, MIC2X/CD99; STS; XG).
Such genes are either in the pseudoautosomal region of
the X and Y chromosomes or they have a homologous gene on the Y chromosome (eg, ZFX/ZFY).
* Not all genes on X-chromosome
behave like sex-chromosome-linked genes. Those within the pseudoautosomal
regions (PAR) exist in two copies in both sexes (see pseudoautosomal
region in Genes and Chromosomes).
* Does anyone have an idea about the following
(or would like to do research on these)?
1. Why is
it that primary sex ratio at fertilization may be as high as 165:100 (see for example: Tricomi, 1960; Shettles, 1964; Serr & Ismajovich, 1963; Lee
& Takano, 1970; McMillen, 1979;
Kellokumpu-Lehtinen & Pelliniemi,
1984; Vatten, 2004; C3 Newsletter 13/2)
but it falls down to 106:100 at birth in humans (and similarly in most
mammals) but nobody thinks about the reasons/implications of this? A
continuation of this process (elimination of excess males) is the increased
morbidity and mortality of male infants and children (well-known male
disadvantage (Stevenson,
2000) or fragile male (Kraemer, 2000),
which has evolutionary explanations (Trivers & Willard, 1973; Wells, 2000; Dorak,
2002)). Could the sex-chromosomes be involved? [One reason must be the
male-specific lethality associated with X-linked dominant diseases but this is
not frequent enough to explain the huge loss.]
2. Why is
everybody acknowledging that the carrier frequency for HFE-C282Y (causing hereditary hemochromatosis)
and CYP21A2
mutations (causing congenital adrenal hyperplasia) is so high (more than 1 in
50 people carry them in some populations) but not much is done about their
implications in public health?
[There was another question here:
* Why is it that everybody knows
HLA-identical sibling frequency for a leukemic child is >25% instead of the Mendelian expectation of 25% but nobody wonders about the
genetic basis for this?
The answer is as follows: This is
due to increased parental HLA sharing in leukemic families (Werner-Favre,
1979; MacSween, 1980; Nordlander, 1983; von
Vliedner, 1983; Carpentier, 1987). The 25% expected frequency applies to situations
where parents are heterozygous and do not share any alleles.]
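The effect of parental HLA sharing on the chance that two siblings are HLA-identical can be sketched by enumeration. This example makes several simplifying assumptions that are not in the text: each HLA haplotype is treated as a single allele, there is no recombination, and "identical" means carrying the same pair of haplotypes. The haplotype labels and the function name are invented.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def identical_sib_probability(father, mother):
    """Probability that two siblings carry the same unordered pair of haplotypes."""
    offspring = Counter(tuple(sorted(pair)) for pair in product(father, mother))
    total = sum(offspring.values())
    # Two siblings match when both draw the same genotype independently.
    return sum(Fraction(count, total) ** 2 for count in offspring.values())


# Parents heterozygous and sharing nothing: the Mendelian 25%.
print(identical_sib_probability(("a", "b"), ("c", "d")))  # 1/4
# Parents sharing both haplotypes.
print(identical_sib_probability(("a", "b"), ("a", "b")))  # 3/8
# One parent homozygous.
print(identical_sib_probability(("a", "a"), ("c", "d")))  # 1/2
```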
M.Tevfik Dorak, MD, PhD
Last updated on 9 January 2010