BASIC POPULATION GENETICS
M.Tevfik Dorak, M.D., Ph.D.
G.H. Hardy (the English mathematician) and W. Weinberg (the German
physician) independently worked out the mathematical basis of population
genetics in 1908 (Hardy, 1908). Their formula predicts
the expected genotype frequencies using the allele frequencies in a diploid Mendelian population. They were concerned with questions
like "what happens to the frequencies of alleles in a population over
time?" and "would you expect to see alleles disappear or become more
frequent over time?"
Hardy and Weinberg showed in the following manner that if the population
is very large and random mating is taking place, allele frequencies remain
unchanged (or in equilibrium) over time unless some other factors intervene. If
the frequencies of alleles A and a (of a biallelic locus) are p and q, then p + q = 1. This means
$(p + q)^2 = 1$ too. It is also correct that
$(p + q)^2 = p^2 + 2pq + q^2 = 1$. In this formula, $p^2$ corresponds to the frequency of the homozygous
genotype AA, $q^2$ to aa, and $2pq$ to Aa. Since AA, Aa and aa
are the three possible genotypes for a biallelic locus, the sum of their frequencies should be 1.
In summary, the Hardy-Weinberg formula shows that:
$$\underbrace{p^2}_{AA} + \underbrace{2pq}_{Aa} + \underbrace{q^2}_{aa} = 1$$
If the observed frequencies do not show a significant difference from
these expected frequencies, the population is said to be in Hardy-Weinberg
equilibrium (HWE). If not, at least one of the following assumptions of
the formula is violated, and the population is not in HWE.
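The comparison of observed and expected genotype counts can be sketched as a chi-square goodness-of-fit test. This is a minimal illustration, not a prescribed procedure: the function name and the genotype counts in the usage note are hypothetical, one degree of freedom is assumed (three genotype classes, one estimated allele frequency), and both alleles are assumed to be present in the sample.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions.

    Assumes a biallelic locus with both alleles observed (1 degree of freedom).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A by gene counting
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value
```

For example, `hwe_chi_square(400, 400, 200)` gives p = 0.6 and expected counts of about 360, 480 and 160, so the observed heterozygote deficit produces a large chi-square value and a very small p-value.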
The assumptions of
HWE
1. Population size is effectively infinite,
2. Mating is random in the population (the most common deviation results
from inbreeding),
3. Males and females have similar allele frequencies,
4. There are no mutations and migrations affecting the allele
frequencies in the population,
5. The genotypes have equal fitness, i.e., there is no selection (in
viability and fitness).
The Hardy-Weinberg law suggests that as long as the assumptions are
valid, allele and genotype frequencies will not change in a population in
successive generations. Thus, any deviation from HWE may indicate the
following biological processes:
1. Small population size results in random sampling errors and unpredictable
genotype frequencies (a real population's size is always finite and the
frequency of an allele may fluctuate from generation to generation due to
chance events),
2. Assortative mating which may be positive (increases
homozygosity; self-fertilization is an extreme example) or negative (increases
heterozygosity), or inbreeding which increases homozygosity in the whole genome
without changing the allele frequencies. Rare-male mating advantage also tends
to increase the frequency of the rare allele and heterozygosity for it (in
reality, random mating does not occur all the time). Cryptic population
stratification is another reason for departure from HWE.
3. A very high mutation rate in the population (typical mutation rates
are $< 10^{-5}$ per generation) or massive migration from a genotypically different population interfering with the
allele frequencies,
4. Selection of one or a combination of genotypes (selection may be
negative or positive). Selective elimination of homozygotes,
as in some autosomal dominant diseases where homozygotes for the mutation may die in utero,
is an example (in a very large sample, this could violate HWE). Like this kind of
selection, sampling error (selection bias) may also cause departure from HWE, for example if the bias
concerns ethnicity.
5. Unequal transmission ratio (transmission ratio distortion or
segregation distortion) of alternative alleles from parents to offspring (as in
mouse t-haplotypes),
6. Differential gene frequency among males and females.
In most population genetic estimations (like linkage disequilibrium
calculations), HWE is assumed. This means that genotype probabilities are
determined by allele frequencies and nothing interferes with that, i.e., there
is no transmission ratio distortion, selection against a genotype (lethality)
etc. If this assumption is not met, the estimations will not be accurate. If
HWE is violated, statistical methods using allele frequencies may not be valid
and methods that use genotype frequencies should be preferred (Xu,
2002) (discussed below). See HWE
simulation in EvoTutor for the effects of selection, mutation, migration, genetic
drift and assortative mating.
The implications of
the HWE
1. The allele frequencies remain constant from generation to generation.
This means that hereditary mechanism itself does not change allele frequencies.
It is possible for one or more assumptions of the equilibrium to be violated
and still not produce deviations from the expected frequencies that are large
enough to be detected by the goodness of fit test,
2. When an allele is rare, there are many more heterozygotes
than homozygotes for it. Thus, rare alleles will be
impossible to eliminate even if there is selection against homozygosity for
them,
3. For populations in HWE, the proportion of heterozygotes
is maximal when allele frequencies are equal (p = q = 0.50), and when this
happens the heterozygote frequency will be 0.50 (2 × 0.50 × 0.50). Unless HWE is violated (as in selective loss of homozygotes),
heterozygosity can never be more than 0.50 at any biallelic
locus (see Figure in Prof Whitlock's Review).
4. An application of HWE is that when the frequency of an autosomal recessive disease (e.g., sickle cell disease,
hereditary hemochromatosis, congenital adrenal hyperplasia) is known in a
population and unless there is reason to believe HWE does not hold in that
population, the gene frequency of the disease gene can be calculated. See also Clinical Genetics.
Likewise, the carrier rate may be calculated for autosomal
recessive disorders if the disease gene frequency is known. For example,
phenylketonuria (PKU) occurs in 1/11,000 ($q^2$), which gives a
heterozygote carrier frequency of approximately 1/50 [$2q(1-q)$]. If the diseased
individuals ($q^2$) are deducted from the whole population, the
carrier rate in normal individuals is $2q(1-q)/(1-q^2) = 2q/(1+q)$.
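The PKU arithmetic above can be checked with a few lines of code. This is a sketch under the assumption that HWE holds; the function name is illustrative and only the incidence of 1/11,000 comes from the text.

```python
import math

def carrier_rates(incidence):
    """Carrier frequencies for an autosomal recessive disease, assuming HWE.

    incidence is the frequency of affected homozygotes (q^2).
    """
    q = math.sqrt(incidence)                    # disease allele frequency
    p = 1 - q
    carriers = 2 * p * q                        # heterozygotes in the whole population
    carriers_among_unaffected = 2 * q / (1 + q) # heterozygotes among non-affected individuals
    return q, carriers, carriers_among_unaffected

q, carriers, carriers_unaffected = carrier_rates(1 / 11000)
print(f"q = {q:.5f}, carriers = 1 in {1 / carriers:.0f}")
```

With an incidence of 1/11,000 the disease allele frequency is a little under 0.01 and the carrier frequency comes out at roughly 1 in 53, consistent with the "approximately 1/50" quoted above.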
It has to be remembered that when HWE is tested, mathematical thinking
is necessary. When the population is found in equilibrium, it does not
necessarily mean that all assumptions are valid since there may be
counterbalancing forces. Similarly, a significant deviance may be due to
sampling errors (including Wahlund effect,
see below and Glossary),
misclassification of genotypes, measuring two or more systems as a single
system, population substructure, failure to detect rare alleles and the
inclusion of non-existent alleles. The Hardy-Weinberg law rarely holds true in
nature (otherwise evolution would not occur). Organisms are subject to
mutations, selective forces and they move about, or the allele frequencies may
be different in males and females. The gene frequencies are constantly changing
in a population, but the effects of these processes can be assessed by using
the Hardy-Weinberg law as the starting point.
The direction of departure of observed from expected frequency cannot be
used to infer the type of selection acting on the locus even if it is known
that selection is acting. If selection is operating, the frequency of each
genotype in the next generation will be determined by its relative fitness (W).
Relative fitness is a measure of the relative contribution that a genotype
makes to the next generation. It can be measured in terms of the intensity of
selection (s), where $W = 1 - s$ [$0 \le s \le 1$]. The frequencies of the
genotypes after selection will be proportional to $p^2 W_{AA}$,
$2pq\,W_{Aa}$, and $q^2 W_{aa}$ (each divided by their sum so that the
frequencies add up to 1). The highest fitness is always 1 and the
others are estimated proportional to this. For example, in the case of
heterozygote advantage (or overdominance), the
fitness of the heterozygous genotype (Aa)
is 1, and the fitnesses of the negatively selected homozygous genotypes
are $W_{AA} = 1 - s_{AA}$
and $W_{aa} = 1 - s_{aa}$.
It can be shown mathematically that only in this case a stable polymorphism is
possible. Other selection forms, underdominance and
directional selection, result in unstable polymorphisms. The weighted average
of the fitnesses of all genotypes is the mean
fitness. It is important that genetic fitness is determined by both fertility
and viability. This means that diseases that are fatal to the bearer but do not
reduce the number of progeny are not genetic lethals
and do not have reduced fitness (like the adult onset genetic diseases:
Huntington's chorea, hereditary hemochromatosis). The detection of selection is
not easy because the impact on changes in allele frequency occurs very slowly
and selective forces are not static (may even vary in one generation as in
antagonistic pleiotropy).
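The effect of relative fitness on allele frequency can be illustrated by iterating one generation of selection at a time. This is a deterministic sketch assuming random mating, an effectively infinite population and constant fitnesses. The function names and selection coefficients are illustrative, and the equilibrium value quoted in the usage note, $s_{aa}/(s_{AA}+s_{aa})$ for allele A, is the standard textbook result for overdominance rather than something derived in this text.

```python
def select_one_generation(p, w_AA, w_Aa, w_aa):
    """Return the frequency of allele A after one generation of selection."""
    q = 1 - p
    # Mean fitness: weighted average of the genotype fitnesses.
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    f_AA = p * p * w_AA / mean_w       # genotype frequencies after selection
    f_Aa = 2 * p * q * w_Aa / mean_w
    return f_AA + f_Aa / 2             # allele A frequency among survivors

def iterate_selection(p, w_AA, w_Aa, w_aa, generations):
    """Apply selection repeatedly and return the final frequency of allele A."""
    for _ in range(generations):
        p = select_one_generation(p, w_AA, w_Aa, w_aa)
    return p

# Heterozygote advantage with s_AA = 0.1 and s_aa = 0.3 (hypothetical values).
print(iterate_selection(0.10, 0.9, 1.0, 0.7, 1000))
print(iterate_selection(0.95, 0.9, 1.0, 0.7, 1000))
```

Both starting points converge on the same intermediate frequency (0.75 here), which is what a stable polymorphism means. Replacing the fitnesses with a directional scheme such as 1.0, 0.9, 0.8 drives allele A towards fixation instead.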
All discussions presented so far concern a simple biallelic
locus. In real life, however, there are many loci which are multiallelic,
and interacting with each other as well as with the environmental factors. The
Hardy-Weinberg principle is equally applicable to multiallelic
loci but the mathematics is slightly more complicated. For multigenic
and multifactorial traits, which are mathematically
continuous as opposed to discrete, more complex techniques of quantitative
genetics are required.
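For a multiallelic locus, the expected genotype frequencies follow from expanding the square of the sum of all allele frequencies: $p_i^2$ for each homozygote and $2p_ip_j$ for each heterozygote. A minimal sketch, with hypothetical allele names and frequencies:

```python
from itertools import combinations_with_replacement

def hw_expected_genotypes(allele_freqs):
    """Expected Hardy-Weinberg genotype frequencies for any number of alleles.

    allele_freqs maps allele name -> frequency (frequencies should sum to 1).
    """
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        if a == b:
            expected[(a, b)] = allele_freqs[a] ** 2                  # homozygote
        else:
            expected[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote
    return expected

for genotype, freq in hw_expected_genotypes({"A1": 0.5, "A2": 0.3, "A3": 0.2}).items():
    print(genotype, round(freq, 3))
```

With k alleles there are k(k+1)/2 genotype classes, so a three-allele locus has six expected frequencies that still sum to 1.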
In a final note on the practical use of HWE, it has to be emphasized
that its violation in daily life is most
frequently due to genotyping errors (Gomes, 1999;
Lewis, 2002).
Allelic misassignments, as frequently happen when the
PCR-SSP method is used, sometimes due to allelic dropout (increased homozygosity), are the most
frequent causes of Hardy-Weinberg disequilibrium. When this is observed, the
genotyping protocol should be reviewed. Another common reason is the presence
of an unknown allele which is not considered in the genotyping scheme (null
allele). This happens when a variant is considered to be bi-allelic while it is
actually multiallelic. In a case-control association
study, it is of paramount importance that the control group is in HWE to rule
out any technical errors and to avoid false-positive associations (Tiret, 1995; Schaid & Jacobsen, 1999). The violation
of HWE in the case group, however, may be due to a real association (Nielsen, 1998;
Lee, 2003; Czika & Weir, 2004). When
Hardy-Weinberg disequilibrium is not due to a technical error, the statistical
evaluation of the data should involve statistical tests using genotype
frequencies rather than allele frequencies (Xu,
2002; Lewis, 2002). One such test is the trend test (to
estimate common odds ratio) and can be run online for SNP data (Online HWE
& Association Testing).
Departure from HW proportions can bias
estimates of haplotype frequencies from population data.
Departure from HWE may be a substantial source of error
in Expectation-Maximization haplotype frequency estimation, simply
because the algorithm relies on HWE in its 'expectation'
step.
See Testing for Consistency with HW Proportions
at GENESTAT
Statistical Genetics Pages; PEDSTATS HWE Tutorial; HWE Slide Show.
Some concepts
relevant to HWE
Wahlund
effect: Reduction in observed heterozygosity (increased
homozygosity) because of pooling discrete subpopulations with different allele
frequencies that do not interbreed as a single randomly mating unit. When all
subpopulations have the same gene frequencies, no variance among subpopulations
exists, and no Wahlund effect occurs (FST =
0).
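The Wahlund effect can be made concrete with a small calculation. The sketch below assumes that each subpopulation is itself in HWE and uses the standard definition FST = (HT - HS) / HT, which is not spelled out in the text; the function name and allele frequencies are hypothetical.

```python
def wahlund(subpop_freqs, weights=None):
    """Heterozygote deficit from pooling subpopulations with different allele frequencies.

    subpop_freqs: frequency of one allele in each subpopulation.
    weights: relative sizes of the subpopulations (equal if omitted).
    """
    k = len(subpop_freqs)
    if weights is None:
        weights = [1 / k] * k
    p_bar = sum(w * p for w, p in zip(weights, subpop_freqs))
    # Heterozygosity actually present in the pooled sample (average within subpopulations).
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, subpop_freqs))
    # Heterozygosity expected if the pooled sample were one random-mating unit.
    h_t = 2 * p_bar * (1 - p_bar)
    f_st = (h_t - h_s) / h_t if h_t > 0 else 0.0
    return p_bar, h_s, h_t, f_st

print(wahlund([0.2, 0.8]))   # different frequencies: heterozygote deficit
print(wahlund([0.5, 0.5]))   # identical frequencies: no Wahlund effect
```

Pooling two equally sized subpopulations with allele frequencies 0.2 and 0.8 gives an observed heterozygosity of 0.32 against an expectation of 0.50, whereas identical frequencies give no deficit and FST = 0.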
F statistics:
The F statistics in population genetics have nothing to do with the F statistic
used for evaluating differences in variances. Here F stands for fixation index, fixation
being increased homozygosity resulting from inbreeding. Population subdivision
results in the loss of genetic variation (measured by heterozygosity) within
subpopulations due to their being small populations and genetic drift acting
within each one of them. This means that population subdivision would result in
decreased heterozygosity relative to that expected heterozygosity under random
mating as if the whole population was a single breeding unit. Wright developed
three fixation indices to evaluate population subdivision: FIS
(interindividual), FST
(subpopulations), FIT (total population).
FIS
is a measure of the deviation of genotypic frequencies from panmictic
frequencies in terms of heterozygous deficiency or excess. It is what is known
as
FIT
is rarely used. It is the overall inbreeding coefficient (F) of
an individual relative to the total population (Individual within the Total
population).
See also a Lecture Note and the Genetic Epidemiology Glossary. Analysis of Molecular
Variance (AMOVA) is a method of estimating population differentiation directly
from molecular data.
Detecting
Selection Using DNA Polymorphism Data
Several methods have been designed to use DNA polymorphism
data (sequences and allele frequencies) to obtain information on past selection
events. Most commonly, the ratio of non-synonymous (replacement) to
synonymous (silent) substitutions (dN/dS ratio;
see below) is used as evidence for overdominant selection (balancing selection),
of which one form is heterozygote advantage. A classic example of this is the mammalian MHC
system genes and other compatibility
systems in other organisms: the self-incompatibility
system of plants, fungal mating types and invertebrate allorecognition
systems. In all these genes, a very high number of alleles is
also noted. This can be interpreted as an indicator of some form of balancing
(diversifying) selection. In the case of neutral polymorphism, one common
allele and a few rare alleles are expected. The frequency distribution of
alleles is also informative. A large number of alleles showing a relatively even
distribution is against neutrality expectations and suggestive of diversifying
selection.
Most
tests detect selection by rejecting the neutrality assumption (the observed data
deviate significantly from what is expected under neutrality). This deviation,
however, may also be due to other factors such as changes in population size or
genetic drift (see Tripathy & Reddy,
2007 (Table 1) for a review in the context
of G6PD deficiency). The original neutrality test was the Ewens-Watterson homozygosity test of neutrality (see Glossary),
based on the comparison of observed homozygosity and the predicted value calculated
by Ewens's sampling formula, which uses the
number of alleles and sample size. This test is not very powerful.
Other
commonly used statistical tests of neutrality are Tajima's D (which compares two
estimates of theta) and Fu & Li's D, D*, F and F*. Tajima's test (Tajima, 1989) is based on the fact
that under the neutral model estimates of the number of segregating/polymorphic
sites and of the average number of nucleotide differences are correlated. If
the value of D is too large or too small, the neutral 'null' hypothesis is rejected.
DnaSP calculates D and its
confidence limits (two-tailed test). Tajima did not base this test on the
coalescent, but Fu and Li's tests (Fu & Li, 1993) are directly based
on it. The test statistics D and F require data from intraspecific polymorphism and from an outgroup (a sequence from a related species), whereas D*
and F* only require intraspecific data. DnaSP uses the critical values
obtained by Fu & Li, 1993 to determine the
statistical significance of the D, F, D* and F* test statistics. DnaSP can also compute the Fs
test statistic (Fu, 1997). The results of this group
of tests (Tajima's D and Fu & Li's tests), based on allelic variation and/or
level of variability, may not clearly distinguish between selection and
demographic alternatives (bottleneck, population subdivision), but this problem
only applies to the analysis of a single locus: demographic changes affect all
loci whereas selection is expected to be locus-specific, so the two are
distinguishable if multiple loci are analyzed. Tests for multiple loci include
the HKA test described by Hudson et al (1987). This test is based on the idea
that in the absence of selection, the expected number of polymorphic
(segregating) sites within species and the expected number of 'fixed'
differences between species (divergence) are both proportional to the mutation
rate, and the ratio of them should be the same for all loci. Variation in the
ratio of divergence to polymorphism among loci suggests selection.
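As an illustration of how such a statistic is computed, Tajima's D can be obtained from three summary values: the number of sequences, the number of segregating sites, and the average number of pairwise nucleotide differences. The sketch below follows the constants published by Tajima (1989); the function name and the input values in the usage note are illustrative, and a dedicated package such as DnaSP should be used for real analyses and significance testing.

```python
import math

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S and mean pairwise differences pi."""
    if n < 4 or seg_sites < 1:
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimates of theta, scaled by its standard deviation.
    theta_w = seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_w) / math.sqrt(variance)

print(round(tajimas_d(10, 16, 3.888889), 3))
```

For 10 sequences with 16 segregating sites and a mean of about 3.89 pairwise differences, D is negative (close to -1.45), reflecting an excess of rare variants relative to the neutral expectation.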
A
different group of neutrality tests that are not sensitive to demographic
changes includes the McDonald-Kreitman test (McDonald & Kreitman, 1991) and the dN/dS ratio
test. The McDonald-Kreitman test compares the ratio of the number of nonsynonymous to synonymous 'polymorphisms' within species
with the ratio of the number of nonsynonymous to
synonymous 'fixed' differences between species in a 2x2 table. The most direct
method of showing the presence of positive selection is to compare the number
of nonsynonymous (dN) to the number of
synonymous (dS) substitutions in a locus. A high (>1)
dN/dS ratio suggests fixation of nonsynonymous mutations with a higher probability than
neutral (synonymous) ones. Statistical properties of this test are given by Goldman & Yang, 1994 and by Muse & Gaut, 1994.
The dN/dS ratio tests take into account transition/transversion rate bias and codon
usage bias.
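The 2x2 comparison in the McDonald-Kreitman test can be sketched as follows. The choice of Fisher's exact test for the table is an assumption made here for illustration (the text does not name a test), the function names are hypothetical, and the counts in the usage note are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)

def mcdonald_kreitman(poly_nonsyn, poly_syn, fixed_nonsyn, fixed_syn):
    """Compare nonsynonymous:synonymous ratios within and between species."""
    ratio_within = poly_nonsyn / poly_syn       # polymorphisms within species
    ratio_between = fixed_nonsyn / fixed_syn    # fixed differences between species
    p_value = fisher_exact_two_sided(poly_nonsyn, poly_syn, fixed_nonsyn, fixed_syn)
    return ratio_within, ratio_between, p_value

print(mcdonald_kreitman(2, 40, 8, 16))   # invented counts
```

In this invented example the nonsynonymous to synonymous ratio is ten times higher among fixed differences than among polymorphisms, the pattern expected when nonsynonymous changes have been fixed by positive selection.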
For
other tests and software to perform these statistics, see DNA Sequence Polymorphism, DnaSP.
See also: Statistical
Tests of Neutrality (Lecture Note by P Beerli);
Statistical Tests of Neutrality of Mutations against Excess
of Recent Mutations (Rare Alleles); Statistical Tests of Neutrality of Mutations against an
Excess of Old Mutations or a Reduction of Young Mutations;
Estimation
of theta; Innan & Tajima, 2002; Properties of
statistical tests of neutrality for DNA polymorphism data, Simonsen, 1995. Review of statistical tests of selective neutrality on genomic
data, Nielsen, 2001, Luikart, 2003, Harris & Meyer, 2006 and a Lecture
Note by Gil McVean. See
also a review by SP Otto (Detecting
the Form of Selection from DNA Sequence Data. TIG 2000).
Linkage disequilibrium (LD)
The tendency for two
'alleles' to be present on the same chromosome (positive LD), or not to
segregate together (negative LD). As a result, specific alleles at two different loci
are found together more or less often than expected by chance. LD is the nonindependence, at a population level, of the alleles
carried at different positions in the genome. When there is no LD, the expected
frequency of a two-locus haplotype can be calculated as the probability of the
joint occurrence of two independent events simply by multiplying their
gene frequencies. The same situation may exist for more than two alleles. The
magnitude of LD is expressed as the delta (D) value and corresponds to the difference
between the observed and the expected haplotype frequency. If there is no LD, D will be zero (or not
significantly different from zero); if there is positive LD, it will be a
positive value. It can also be negative if the two alleles tend not to occur
together. The statistical significance of LD, which depends on the sample size,
and the magnitude of LD are separate issues. The statistical significance is
usually determined by Fisher's exact test and the magnitude by either the D value or alternative
measures. The magnitude can be normalized (for allele frequencies) to have the
same range of values for any frequency.
Ideally, the haplotype frequencies should be
calculated from family typing data. Obviously, this gives the most accurate
results. In practice, however, when family data are not available, D
and two-locus haplotype frequencies are calculated from a sample of the
population data by constructing 2x2 contingency tables for each allele pair. A
contingency table for this purpose contains the individual (observed) values
cross-classified by levels in two different attributes. A common 2x2 table
constructed in genetic studies is as follows:
|          |               | allele i    |            |            |
|          |               | Present (+) | Absent (-) | Row totals |
| allele j | Present (+)   | a (+/+)     | b (+/-)    | a+b        |
|          | Absent (-)    | c (-/+)     | d (-/-)    | c+d        |
|          | Column totals | a+c         | b+d        | N=a+b+c+d  |
Counts for each combination of levels (presence or absence)
of the two factors (alleles) are placed in each cell. The corresponding $D_{ij}$ is estimated by the formula (usually in HLA
studies):
$$D_{ij} = \sqrt{\frac{d}{N}} - \sqrt{\frac{b+d}{N}\cdot\frac{c+d}{N}}$$
(originally described by Bodmer & Bodmer, 1970;
see also Schipper, 1998)
The haplotype frequency ($HF_{ij}$)
equals $GF_i \times GF_j + D_{ij}$, where GF is the gene
frequency (the proportion of the chromosomes carrying a particular allele). The
haplotype frequency calculated with this formula from the population data
compares reasonably well with the estimates obtained directly from counting haplotypes
constructed from family segregation data. This method generates a reliable
estimate of a haplotype frequency with the exception of very small haplotype
frequencies. Also for other parts of the genome, it has been reported that
there is little or no advantage to constructing haplotypes from family data
rather than unrelated individuals. The major point is that when using
population data, genotyping errors become an issue. When
genotyping a large number of markers, an error rate of only 1% will produce a
large number of inaccurate haplotypes. Genotyping errors are not the
only possible sources of accuracy problems. Other factors include sample size,
allele frequency distributions and departures from HWE.
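The 2x2 table calculation described above can be written out as a short function. This is a sketch, not a reference implementation: it assumes HWE at both loci, and it estimates each gene frequency as one minus the square root of the proportion of individuals lacking the allele, which is an assumption added here (the text does not say how GF is obtained). The function name and cell counts are hypothetical.

```python
import math

def two_locus_from_table(a, b, c, d):
    """Estimate D and the haplotype frequency from presence/absence counts.

    a: both alleles present, b: only allele j present,
    c: only allele i present, d: both alleles absent.
    """
    n = a + b + c + d
    gf_i = 1 - math.sqrt((b + d) / n)    # gene frequency of allele i
    gf_j = 1 - math.sqrt((c + d) / n)    # gene frequency of allele j
    d_ij = math.sqrt(d / n) - math.sqrt(((b + d) / n) * ((c + d) / n))
    hf_ij = gf_i * gf_j + d_ij           # haplotype frequency
    return gf_i, gf_j, d_ij, hf_ij

print(two_locus_from_table(23, 28, 13, 36))
```

With these invented counts the gene frequencies are 0.2 and 0.3, D is 0.04, and the haplotype frequency is 0.10 instead of the 0.06 expected without LD.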
There are other measures of LD. Because the value
of D
depends on allele frequencies, a normalization of D is needed. This is
achieved by taking into account the allele frequencies: normalized delta value
$D' = D_{AB} / D_{max}$. $D_{max}$ is the lesser of $p_A p_b$ or $p_a p_B$ if D is positive, or the lesser of $p_A p_B$ or $p_a p_b$ if D is negative. Because the
sign is arbitrary, |D'| is often used rather than D'. Therefore, D' (normalized LD) is scaled to remove allele frequency effects. In a
large enough sample, |D'| = 1 indicates complete LD and D' = 0 corresponds to no
LD. |D'| is directly related to recombination fraction and its generalization
to more than two loci is the only measure of LD not sensitive to allele
frequencies. HAPLOVIEW
and MIDAS are some of the software that calculate D' values.
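The normalization can be sketched in a few lines, taking D as the observed haplotype frequency minus the product of the allele frequencies. The function name and the frequencies in the usage note are hypothetical.

```python
def ld_d_prime(p_AB, p_A, p_B):
    """Return D and D' for haplotype AB given its frequency and the allele frequencies."""
    p_a, p_b = 1 - p_A, 1 - p_B
    d = p_AB - p_A * p_B                      # observed minus expected haplotype frequency
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)     # largest possible positive D
    else:
        d_max = min(p_A * p_B, p_a * p_b)     # largest possible magnitude of negative D
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime

print(ld_d_prime(0.10, 0.2, 0.3))   # partial LD
print(ld_d_prime(0.20, 0.2, 0.3))   # allele A always occurs with B
```

The second call shows that D' reaches 1 whenever one of the four haplotypes is absent, even though the allele frequencies at the two loci differ.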
Another statistic
of linkage disequilibrium is the square of the correlation coefficient ($r^2$) between the alleles at
locus A and B: $r^2 = D^2 / (p_A\, p_a\, p_B\, p_b)$,
which can also be expressed as $r^2 = D^2 / (p_A (1-p_A)\, p_B (1-p_B))$
(for two loci with two alleles
each). The
measure $r^2$
has several properties that make it more useful (Pritchard, 2001; Weiss, 2002; Carlson, 2004).
In brief, for low allele frequencies $r^2$ has more reliable sample
properties than |D'|. The allelic
association metric ρ (rho) has the strongest
population theory basis, is least sensitive to marker allele frequencies (Morton,
2001), and has several statistically optimal properties
such as consistency, asymptotic unbiasedness and
asymptotic efficiency (Shete, 2003). This parameter can be
estimated on MIDAS
for SNP data (along with other measures of LD). (See
CIGMR; GENESTAT LD Measures; CGIL Summer Course Notes and Measures of LD by Hedrick,
1987; Lewontin, 1988;
Devlin
& Risch, 1995; Devlin,
1996; Morton,
2001; Wall & Pritchard, 2003; Shete, 2003; Jorde, 2003;
Mueller, 2004; Zaykin, 2004; Wang, 2005; GOLD-Disequilibrium Statistics and a Lecture Note at UCL for further details.)
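The $r^2$ formula translates directly into code. In this sketch the function name and frequencies are hypothetical, and both loci are assumed to be polymorphic so that the denominator is not zero.

```python
def ld_r_squared(p_AB, p_A, p_B):
    """Squared correlation between alleles at two biallelic loci."""
    d = p_AB - p_A * p_B
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(round(ld_r_squared(0.10, 0.2, 0.3), 4))   # partial LD
print(round(ld_r_squared(0.20, 0.2, 0.3), 4))   # |D'| = 1 but r^2 < 1
print(round(ld_r_squared(0.30, 0.3, 0.3), 4))   # identical allele frequencies, r^2 = 1
```

The second case illustrates the difference between the two measures: when allele A always occurs with B but the allele frequencies differ, |D'| equals 1 while $r^2$ stays well below 1.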
Statistical
Methods for LD Estimation: While the Mattiuz formula can be used to calculate two-locus haplotype
frequencies manually, an alternative method, maximum likelihood estimation
(achieved by the EM 'expectation-maximization' algorithm), can be used if
computing facilities are available. This method was originally described by
Yasuda and Tsuji (1975), and compared with other methods by Schipper et al (1998) and Excoffier
et al. (1995). It can also be used for
multiple-locus haplotypes (Long, 1995). One of the most
sophisticated population genetic data analysis packages, ARLEQUIN, as well as EMLD use
the EM algorithm to calculate multilocus LD. Some of the
other software to perform LD analysis are: MIDAS, HAPLOVIEW, Genetic Data Analysis (GDA), EH,
MLD,
DISEQ,
SHEsis, GOLD and PopGene. The proprietary LD Software HelixTree (manual) computes multilocus
haplotype probabilities using the composite haplotype method (CHM) and the
Expectation Maximization (EM) algorithm for SNP data. LD
analysis can also be performed online (Online LD Analysis; Genotype2LDBlock; VG2). All of the above (pairwise) LD
measures can be estimated from unphased SNP data on
STATA using David Clayton's program pwld within genassoc.pkg. For HLA data, HWE and LD (as well as Ewens-Watterson
test) can be assessed using PyPop.
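The idea behind the EM approach can be sketched for the simplest case of two biallelic SNPs typed in unrelated individuals. Only double heterozygotes have ambiguous phase, so the 'expectation' step splits them between the two possible haplotype pairs in proportion to the current haplotype frequencies (which is where HWE is assumed), and the 'maximization' step recounts haplotypes. This is an illustrative sketch, not the algorithm as implemented in ARLEQUIN or any other package; the function name, the input layout and the counts are assumptions.

```python
def em_haplotype_freqs(g, tol=1e-10, max_iter=1000):
    """EM estimates of two-SNP haplotype frequencies from unphased genotype counts.

    g[i][j] is the number of individuals carrying i copies of allele A
    at the first SNP and j copies of allele B at the second SNP.
    """
    n = sum(sum(row) for row in g)
    # Haplotype counts from genotypes whose phase is unambiguous.
    base = {
        "AB": 2 * g[2][2] + g[2][1] + g[1][2],
        "Ab": 2 * g[2][0] + g[2][1] + g[1][0],
        "aB": 2 * g[0][2] + g[1][2] + g[0][1],
        "ab": 2 * g[0][0] + g[1][0] + g[0][1],
    }
    double_het = g[1][1]
    freq = {h: 0.25 for h in base}           # starting values
    for _ in range(max_iter):
        # E step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M step: recount haplotypes using the expected phase assignments.
        new = {
            "AB": (base["AB"] + double_het * share) / (2 * n),
            "ab": (base["ab"] + double_het * share) / (2 * n),
            "Ab": (base["Ab"] + double_het * (1 - share)) / (2 * n),
            "aB": (base["aB"] + double_het * (1 - share)) / (2 * n),
        }
        change = max(abs(new[h] - freq[h]) for h in new)
        freq = new
        if change < tol:
            break
    return freq

counts = [[9, 6, 1], [6, 32, 10], [1, 10, 25]]   # invented genotype counts, N = 100
print(em_haplotype_freqs(counts))
```

The invented counts were constructed from haplotype frequencies of 0.5 (AB), 0.1 (Ab), 0.1 (aB) and 0.3 (ab) under HWE, and the iteration recovers those values. With real data the estimates depend on the HWE assumption and can be sensitive to the starting values.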
Interpretation of LD Data: The patterns of LD
observed in natural populations are the result of a complex interplay between
genetic factors and the population's demographic history (Pritchard, 2001). LD is usually a
function of distance between the two loci. This is mainly because recombination
acts to break down LD in successive generations (Hill, 1966). When a mutation first
occurs it is in complete LD with the nearest marker (D' = 1.0). Given enough
time and as a function of the distance between the mutation and the marker, LD
tends to decay and reaches D' = 0 at complete equilibrium. Thus, it
decreases at every generation of random mating unless some process is opposing
the approach to linkage 'equilibrium'. However, physical distance could
account for less than 50% of the observed variation in LD. One genetic
phenomenon that affects LD is gene conversion. Gene conversion is an important
mechanism in the breaking down of allelic associations over short distances,
i.e., decay of LD. Other factors that influence LD include changes in
population demographics (such as population growth, bottlenecks, geographical
subdivision, admixture and migration) and selective forces. Admixture
(intermixture of populations) would cause LD if the mixing populations have
different allele frequencies. LD will also be erased faster in large
populations than in small ones (chance effects in small populations maintain LD).
Permanent LD may result from natural selection if some gametic combinations confer
higher fitness than other combinations. An extraordinary example of the effect
of recombination rates on LD is the discrepancy between genetic distance and
physical distance between HFE and HLA-A, which generates strong LD despite 5Mb
distance (Malfroy, 1997).
Regional LD may also be variable according to haplotype. An example has been
presented for HLA haplotypes. Haplotype-specific patterns of LD (Ahmad, 2003) may reflect
haplotype-specific recombination hotspots as has been shown for mouse MHC.
Note that LD has nothing to do with HWE and
should not be confused with it (see Possible Misunderstandings
in Genetics).
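The decay of LD by recombination can be illustrated with the standard deterministic model, in which D is multiplied by (1 - r) in every generation of random mating, r being the recombination fraction between the two loci. The formula is the textbook result rather than something stated in this text, it ignores drift, selection, admixture and gene conversion, and the function names and values are illustrative.

```python
import math

def ld_after(d0, r, generations):
    """D after a number of generations of random mating with recombination fraction r."""
    return d0 * (1 - r) ** generations

def generations_to_fraction(r, fraction):
    """Generations needed for D to fall to the given fraction of its starting value."""
    return math.log(fraction) / math.log(1 - r)

print(ld_after(0.2, 0.5, 1))                 # unlinked loci: D halves each generation
print(generations_to_fraction(0.01, 0.5))    # r = 0.01 (1 cM): about 69 generations
```

Even unlinked loci (r = 0.5) do not reach equilibrium in one generation, and for tightly linked loci the decay takes many generations, which is why LD is usually, but not always, a function of distance.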
Genetic distance (GD)
Genetic
distance is a measurement of genetic relatedness of samples of populations
(whereas genetic diversity represents diversity within a population). The
estimate is based on the number of allelic substitutions per locus that have
occurred during the separate evolution of two populations. (See lecture notes
on Genetic Distances, Estimating
Genetic Distance; and GeneDist: Online
Calculator of Genetic Distance.) The software Arlequin v3.01, PHYLIP, GDA, PopGene,
Populations and SGS are suitable
to calculate population-to-population genetic distances from allele frequencies
(Microsat is a microsatellite
distance program).
Genetic Distance can be computed on freeware
PHYLIP. Most components of PHYLIP are available on
the web (or on Pasteur Webserver).
One component of the package, GENDIST, estimates genetic distance from
allele frequencies using one of three methods: Nei's,
Cavalli-Sforza's or Reynolds'
(see papers by Cavalli-Sforza & Edwards, 1967, Nei, 1983, Nei, 1996 and lecture note (1) and (2) for more information on these methods).
GENDIST can be run online using default options (Nei's genetic distance) to obtain genetic distance
matrix data. The PHYLIP program CONTML estimates phylogenies from gene frequency data by
maximum likelihood under a model in which all divergence is due to genetic
drift in the absence of new mutations (Cavalli-Sforza's
method) and draws a tree. The program is also available on the web and runs with default options. If
new mutations are contributing to allele frequency changes, Nei's
method should be selected on GENDIST to estimate genetic distances first. Then
a tree can be obtained using one of the following components of PHYLIP. NEIGHBOR draws a phylogenetic tree using
the genetic distance matrix data (from GENDIST). It uses either Saitou and Nei's (1987) "Neighbor Joining (NJ) Method," or the UPGMA (unweighted pair
group method with arithmetic mean; average linkage clustering) method (Sneath & Sokal, 1973).
Neighbor Joining is a distance matrix method
producing an unrooted tree without the assumption of
a clock (the evolutionary rate does not have to be the same in all lineages).
The major assumption of UPGMA is an equal rate of evolution along
all branches (which is frequently unrealistic). NEIGHBOR can be run online. Other components of PHYLIP that draw
phylogenetic trees from genetic distance matrix data are FITCH / online (Fitch-Margoliash method with no assumption of equal evolutionary rate) and KITSCH / online (employs Fitch-Margoliash
and Least Squares methods with the assumption that all tip species are
contemporaneous, and that there is an evolutionary clock -in effect, a
molecular clock) (Mathematical Formulae of Various Genetic Distance Measures;
Genetic Distance Equations). Another
freeware PopGene calculates Nei's
genetic distance and creates a tree using UPGMA method from genotypes. For
genetic distance calculation on Excel, try freeware GenAlEx by Peakall
& Smouse.
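Two of the allele-frequency distances mentioned above can be sketched directly. The formulas used here are the commonly cited ones, Nei's standard distance D = -ln(I), where I is the normalized genetic identity, and Nei's DA, one minus the per-locus average of the summed square roots of the products of allele frequencies; they are not quoted in this text. The function names, the input layout and the frequencies are assumptions, and packages such as PHYLIP may apply additional sample-size corrections.

```python
import math

def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance, D = -ln(I).

    pop_x and pop_y are lists with one dict per locus mapping allele -> frequency.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        for allele in set(fx) | set(fy):
            x, y = fx.get(allele, 0.0), fy.get(allele, 0.0)
            jx += x * x
            jy += y * y
            jxy += x * y
    identity = jxy / math.sqrt(jx * jy)
    return math.inf if identity == 0 else -math.log(identity)

def nei_da_distance(pop_x, pop_y):
    """Nei's DA distance (Nei, 1983), averaged over loci."""
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        total += sum(math.sqrt(fx.get(a, 0.0) * fy.get(a, 0.0)) for a in set(fx) | set(fy))
    return 1 - total / len(pop_x)

x = [{"A": 0.5, "a": 0.5}]    # one locus, invented frequencies
y = [{"A": 1.0}]
print(nei_standard_distance(x, y), nei_da_distance(x, y))
```

Both distances are zero for populations with identical allele frequencies. DA is bounded by 1 when no alleles are shared, whereas the standard distance is unbounded in that case.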
Because of the different assumptions they are based on, the NJ and UPGMA
methods may construct dendrograms with totally
different topologies. For an example of this and a review of the main differences
between the two methods, see Nei
& Roychoudhury, 1993 (PDF). Both methods use distance
matrices (the Fitch-Margoliash and Minimum Evolution
methods are also distance methods). The principal difference between NJ and UPGMA is
that NJ does not assume an equal evolutionary rate for each lineage. Since the
constant rate of evolution does not hold for human populations, NJ seems to be
the better method. For the genetic loci subject to natural selection, the
evolutionary rate is not the same for each population and therefore UPGMA
should be avoided for the analysis of such loci (including the HLA genes). The
leading group in HLA-based genetic distance analysis led by Arnaiz-Villena
proposes that the most appropriate genetic distance measure for the HLA system
is the DA value first described by Nei, 1983. Unlike UPGMA, NJ produces an unrooted
tree. To find the root of the tree, one can add an outgroup. The point in the tree where the edge to
the outgroup joins is the best possible estimate for
the root position. One persistent problem with tree construction is the lack of
statistical assessment of the phylogenetic tree presented. This is best done
with widely available bootstrap analysis originally described in Felsenstein J: Evolution 1985;39:783-791
(available through JSTOR if
you have access) and Efron, 1996; and reviewed in (Nei,
1996). For a discussion of statistical tests of molecular
phylogenies, see Li & Gouy, 1990
and Nei,
1996. For the topology to be statistically significant
the bootstrap value for each cluster should reach at least 70% whereas 50%
overestimates accuracy of the tree. Bootstrap tests should be done with at
least 1000 (preferably more) replications.
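To make the equal-rate assumption of UPGMA concrete, the clustering step can be sketched in a few lines: the two closest clusters are merged, the merged node is placed at half their distance, and distances to the new cluster are size-weighted averages. This is a teaching sketch with an invented distance matrix and a hypothetical function name. It does not reproduce the PHYLIP NEIGHBOR program, and it does not include the bootstrap resampling discussed above.

```python
def upgma(labels, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples of the form (left, right, node_height).
    """
    trees = {i: label for i, label in enumerate(labels)}
    sizes = {i: 1 for i in trees}
    dist = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(labels)
    while len(trees) > 1:
        # Find the closest pair of clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        merged = (trees[i], trees[j], d_ij / 2)   # equal rates: node at half the distance
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Average distance, weighted by the number of tips in each merged cluster.
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        dist = {key: v for key, v in dist.items() if i not in key and j not in key}
        trees[next_id] = merged
        sizes[next_id] = sizes[i] + sizes[j]
        for old in (i, j):
            del trees[old]
            del sizes[old]
        next_id += 1
    return next(iter(trees.values()))

matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(["A", "B", "C", "D"], matrix))
```

Because every node is placed at half the distance between the clusters it joins, all tips end up equidistant from the root, which is exactly the clock assumption that NJ avoids.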
Nei noted that some genes are
more suitable than others for phylogenetic inference and that most tree-building
methods tend to produce the same topology whether the topology is correct or
not (Nei, 1996).
He also added that adding one more species/population sometimes changes
the whole tree for unknown reasons. An example of this has been provided in a
study of human populations with genetic distances (Nei
& Roychoudhury, 1993). The properties of the most
popular genetic distance measures have been reviewed (Kalinowski, 2002). Whichever measure is used, large
sample sizes are required when populations are relatively genetically similar,
and loci with more alleles produce better estimates of genetic distance.
In a simulation study, Nei et al concluded
that more than 30 loci should be used for making phylogenetic trees (Nei, 1983). There seems to be a consensus that estimated
trees are nearly always erroneous (i.e., the topological arrangement will be
wrong) if the number of loci is less than 30 (Nei,
1996; Jorde LB. Human genetic distance studies. Ann Rev Anthropol 1985;14:343-73;
available through JSTOR if you have access). If populations are closely
related, even 100 loci may be necessary for an accurate estimation of the
relationships by genetic distance methods. Cavalli-Sforza
et al have noted important correlations between genetic trees and
linguistic evolutionary trees, with the exceptions of New Guinea, Australia
and South America (Cavalli-Sforza, 1994).
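The remark above that large samples are needed when populations are genetically similar can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a published procedure: it draws two samples from the same hypothetical population (so the true distance is zero) at a single locus with four alleles, and reports the average DA value produced by sampling error alone. The function names and the allele frequencies are invented for the example.

```python
import math
import random

def sample_freqs(true_freqs, n_copies, rng):
    """Allele frequencies estimated from n_copies randomly drawn gene copies."""
    draws = rng.choices(range(len(true_freqs)), weights=true_freqs, k=n_copies)
    return [draws.count(a) / n_copies for a in range(len(true_freqs))]

def nei_da_locus(x, y):
    """Single-locus D_A: 1 - sum over alleles of sqrt(x * y)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(x, y))

def mean_spurious_distance(true_freqs, n_copies, replicates=500, seed=1):
    """Average D_A between two samples taken from the SAME population."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        x = sample_freqs(true_freqs, n_copies, rng)
        y = sample_freqs(true_freqs, n_copies, rng)
        total += nei_da_locus(x, y)
    return total / replicates

true_freqs = [0.4, 0.3, 0.2, 0.1]          # hypothetical allele frequencies
small = mean_spurious_distance(true_freqs, n_copies=20)    # 10 diploid individuals
large = mean_spurious_distance(true_freqs, n_copies=400)   # 200 diploid individuals
print(small, large)
```

The average distance obtained from the small samples is noticeably larger than that from the large samples, although the true distance is zero in both cases. When the real distance between two populations is of the same order as this sampling noise, the estimated tree topology can easily be wrong.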
Especially for the HLA genes,
phylogenetic trees can be constructed by using Nei's
DA genetic distance values and the NJ method with bootstrap tests in DISPAN. Correspondence analysis, a supplementary analysis
to genetic distances and dendrograms, displays a
global view of the relationships among populations (Greenacre, 1984; Greenacre & Blasius, 1994; Blasius & Greenacre, 1998).
This type of analysis tends to give results similar to those of dendrograms as expected from theory (Cavalli-Sforza & Piazza, 1975), and is more
informative and accurate than dendrograms especially
when there is considerable genetic exchange between close geographic neighbors
(Cavalli-Sforza, 1994). In their enormous effort to
work out the genetic relationships among human populations, Cavalli-Sforza
et al concluded that two-dimensional
scatter plots obtained by correspondence analysis frequently resemble
geographic maps of the populations with some distortions (Cavalli-Sforza, 1994). Using the same allele
frequencies that are used in phylogenetic tree construction, correspondence analysis
can be performed in ViSta (v7.0), VST or SAS, but most
conveniently in the Multi-Variate Statistical Package (MVSP). See also the
Tutorial on Correspondence Analysis.
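For readers without access to the packages above, the core computation of a simple correspondence analysis can be sketched with NumPy. This is a minimal sketch under stated assumptions, not a substitute for MVSP or SAS output: the table of allele frequencies is hypothetical, the function name is invented, and every population (row) and allele (column) is assumed to have a non-zero total. The method takes the singular value decomposition of the standardized residuals from the independence model and scales the row vectors to obtain principal coordinates for the populations.

```python
import numpy as np

def correspondence_analysis(table, n_axes=2):
    """Row (population) coordinates and share of inertia for the first n_axes axes."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()              # correspondence matrix
    r = P.sum(axis=1)                      # row masses (populations)
    c = P.sum(axis=0)                      # column masses (alleles)
    expected = np.outer(r, c)              # independence model
    S = (P - expected) / np.sqrt(expected) # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]   # principal coordinates of rows
    inertia = sv ** 2
    return row_coords[:, :n_axes], inertia[:n_axes] / inertia.sum()

# Hypothetical frequencies: rows are populations A-D, columns are the alleles
# of two biallelic loci (locus 1 allele 1, allele 2, locus 2 allele 1, allele 2).
freqs = [
    [0.9, 0.1, 0.7, 0.3],   # A
    [0.8, 0.2, 0.8, 0.2],   # B
    [0.2, 0.8, 0.3, 0.7],   # C
    [0.1, 0.9, 0.2, 0.8],   # D
]
coords, explained = correspondence_analysis(freqs)
for name, (x, y) in zip("ABCD", coords):
    print(name, round(float(x), 3), round(float(y), 3))
print("share of inertia:", explained)
```

Plotting the two columns of `coords` against each other gives the two-dimensional scatter plot referred to above. In this toy table, populations A and B fall on one side of the first axis and C and D on the other, which is the same grouping a tree built from these frequencies would show.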
Spreadsheet Exercises in Ecology & Evolution & Spreadsheet Exercises in Conservation Biology & Landscape Ecology (including HWE and LD)
History of Population Genetics and Evolution in A History of Genetics by AH Sturtevant
Introduction
to Population Genetics @ NBII
ASHI 2001 Biostatistics and Population Genetics Workshop
Notes
Microsatellites and
Genetic Distance (Primer on Genetic Distance)
HWE in Kimball's Biology Pages Online
Biology Book: Genetics - HWE
Online HWE Test
Online GD Calculation Genetic Power Calculator
Simulations: Population Biology Population Genetics Evolutionary Biology
Human Genetics for the Social Sciences Interactive Learning
Exercises
Lectures on Population Genetics (1) & (2) & (3)
& (4)
& (5) & (6) Population Genetics Glossary
Population Genetics Course by Knud
Christensen (PDF of whole course)
Population
Genetics Course by Kent E Holsinger
Evolutionary Quantitative Genetics Notes by Bruce
Walsh
Human Genome Epidemiology Online Book Medical
Applications of Population Genetics
Genetic Epidemiology Notes Genetic Epidemiology Glossary
Software for Population Genetic Analyses
Freeware Population Genetic Data Analysis
Software (List of Features):
Arlequin v3.1 (2005) PopGene GDA Genetix GenePop GeneStrut SGS
EMLD LDA MIDAS PyPop (HLA & HWE/LD)
Gene[VA] Tools for Genetic Data
Analysis (AGP, Universite de Geneve, Switzerland):
HLA
Data Analysis (HWE, LD, Association) (EFI2006 Teaching Session)
GOLD-Disequilibrium Statistics
Extended
Haplotype Homozygosity (EHH) Web-Tool (ref) Haploplotter (ref)
HAPLOTYPER Genotype2LDBlock
DISEQ PHASE v2.0
UNPHASED HAPLOVIEW
(Tutorial)
HWE
PHYLIP (BioPortal) PHYLIP (Pasteur)
PHYLIP 3.62 (Download) DISPAN ViSta GenAlEx CLUMP TDT
Genetic Calculation Applets by Knud Christensen
Software at Dyer
Laboratory for Population Genetics
MSA (for microsatellite
data) POPULATIONS
WINPOP v2.0
QUANTO
Computational Genetics Software including EHAP
GSF: Genetic Software Forum Partition for Online Bayesian Analysis
Comprehensive
List of Genetic Analysis Software (1) (2) (3)
Review of common
Population Genetics software by Labate JA, 2000
Topali v2 for multiple sequence alignments by BioSS
Population Genetics Software Course-ECL290 at UCDAVIS Genomic
Variation Laboratory
Linkage
Disequilibrium Analysis Bibliography
Address for bookmark: http://www.dorak.info/genetics/popgen.html
M.Tevfik Dorak, MD, PhD
Last updated 28 June 2009