GENETIC EPIDEMIOLOGY
M.Tevfik DORAK
Genetic Epidemiology PowerPoint Presentation (PPT)
Statistical Analysis of Genetic Associations
Classical epidemiology deals with disease patterns and
factors associated with causation of diseases with the ultimate aim of
preventing the disease. Molecular epidemiologic studies measure exposure to
specific substances (DNA adducts) and early biological response (somatic
mutations), evaluate host characteristics (genotype and phenotype) mediating
response to external agents, and use markers of a specific effect (like gene
expression) to refine disease categories (such as heterogeneity, etiology and
prognosis). Genetic epidemiology
overlaps with molecular epidemiology. It is the epidemiological evaluation of
the role of inherited causes of disease in families and in populations; it aims
to detect the inheritance pattern of a particular disease, localize the gene
and find a marker associated with disease susceptibility. Gene-gene and
gene-environment interactions should also be studied in genetic epidemiology of
a disease. The most widely accepted definition of genetic epidemiology is by
Morton: ‘a science which deals with the etiology, distribution, and control of
disease in groups of relatives and with inherited causes of disease in
populations’ (Morton
NE, 1982).
Genetic
epidemiology was born in the 1960s as the merger of
population/statistical/mathematical genetics and classical/molecular epidemiology.
The pioneers include Newton
Morton, Douglas
Falconer, Robert
C Elston, Elizabeth A Thompson and Neil Risch.
Genetic epidemiologic research follows these steps:
1. Establishing that there is a genetic
component to the disorder.
2. Establishing the relative size of that
genetic effect in relation to other sources of variation in disease risk
(environmental effects such as intrauterine environment, physical and chemical
effects as well as behavioral and social aspects).
3. Identifying the gene(s) responsible for
the genetic component.
All of these can be achieved either in family studies (segregation, linkage,
association) or in population
studies (association).
General methods employed in genetic epidemiology:
* Genetic risk
studies: What is the contribution of genetics as opposed to environment to
the trait? Requires family-based, twin/adoption or migrant
studies.
* Segregation analyses: What does
the genetic component look like (oligogenic 'few
genes each with a moderate effect', polygenic 'many genes each with a small
effect', etc)? What is the model of transmission of the genetic trait?
Segregation analysis requires multigeneration family
trees preferably with more than one affected member.
* Linkage studies: What is the location of the disease gene(s)? Linkage studies screen the whole genome and use either parametric methods or nonparametric methods such as allele-sharing methods (e.g. the affected sibling-pair method), which make no assumptions about the mode of inheritance, penetrance or disease allele frequency (the parameters). The underlying principle of linkage studies is the cosegregation of two loci (one of which is the disease locus).
* Association studies: What is the allele associated with disease susceptibility? The principle is the co-occurrence of the same marker allele on the same chromosome in affected individuals (due to linkage disequilibrium). Association studies may be family-based (transmission/disequilibrium test, TDT; also called transmission distortion test) or population-based. Alleles, haplotypes or evolutionary-based haplotype groups may be used in association studies (Clark, 2004; Tzeng, 2005). Most recently, genome-wide association studies (GWAS) have become the norm for obtaining the most robust results (Clark, 2005; Wang, 2005; Pearson, 2008; McCarthy, 2008; see also the WTCCC GWAS (PDF), a list of recent GWAS in OEGE, the NHGRI (NIH) catalog of GWAS, GWAS on HuGENet, and the Glossary).
The samples needed for these studies may be nuclear families (index case and parents), affected relative pairs (sibs, cousins, any two members of the family), extended pedigrees, twins (monozygotic and dizygotic) or unrelated population samples.
Genetic epidemiologic approach
When the question is whether a disease has a genetic component, the detection and estimation of familial aggregation (e.g., higher occurrence rates in siblings or offspring) is the first step in the approach. This may already be known from descriptive epidemiology studies. Results of observational studies on siblings, parent-offspring concordance, twins, adoptees and even migrants may suggest a genetic component in the etiology of a disease or trait. Familial aggregation of a trait is a necessary but not sufficient condition to infer the importance of genetic susceptibility, because a shared environment and cultural influences can also aggregate in families, leading to family clustering and excess familial risk. With rising divorce rates, the study of recurrence risk in half-siblings is another powerful method to test for parent-specific events. In an application of this method, multiple sclerosis appeared to have a genetic basis transmitted more from mothers than from fathers (Ebers, 2004).
Familial aggregation for a disease is measured by the relative recurrence risk (RRR) or familial risk ratios (FRRs). These are quantities denoted by λR, where R denotes a relationship (S = sib, O = offspring, DZ = dizygotic twin, C = cousin, etc.), and whose values are the risks of relatives of type R of affected individuals being themselves affected, divided by the population prevalence. Examination of relative recurrence risk values for various classes of relatives can potentially suggest a polygenic background and epistasis (Risch, 1990a; 2001). The estimation of λS is prone to ascertainment bias and should be performed with great care (Chakraborty, 1987; Guo, 1997). Stratification
of risk by degree of relatedness (e.g. in siblings versus cousins) and
comparisons with the risk to spouses living in the same household can help distinguish
between genetic and non-genetic contributions to familial effects. A higher concordance rate in brother pairs than in father-son pairs suggests an X-linked recessive genetic background (as has been observed in X-linked prostate cancer (Monroe, 1995)). Same-sex pairs of affected relatives suggest X-linked recessive (males) or X-linked dominant (females; due to intrauterine loss of males) inheritance. The genes within the pseudoautosomal regions (PAR) of the X and Y chromosomes have a unique segregation pattern such that the disease can occur in either sex but affected sibs tend to be of the same sex. The mutant allele would consistently segregate, during male meiosis, with sexual phenotype. A man could possess the mutant allele on either his X or Y chromosome. If it is on the X chromosome, then only his daughters would inherit the allele, whereas if it resided on the Y chromosome, then only his sons would inherit the allele (Crow, 1994).
Unequal frequencies of sex-discordant and sex-concordant affected
sib pairs may be used as the basis of a test for
linkage to the pseudoautosomal region
(Horwitz & Wiernik, 1999).
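As a rough illustration of the relative recurrence risk described above, the following Python sketch divides the observed risk in relatives of type R of affected individuals by the population prevalence. The function name and the sibling counts are hypothetical and chosen only for illustration; the sketch ignores the ascertainment corrections that real estimates require.

```python
def recurrence_risk_ratio(affected_relatives: int,
                          total_relatives: int,
                          population_prevalence: float) -> float:
    """Relative recurrence risk: risk to relatives / population prevalence."""
    if total_relatives <= 0:
        raise ValueError("total_relatives must be positive")
    if not 0 < population_prevalence <= 1:
        raise ValueError("population_prevalence must be in (0, 1]")
    # Risk that a relative of type R of an affected individual is affected.
    risk_to_relatives = affected_relatives / total_relatives
    return risk_to_relatives / population_prevalence


# Hypothetical data: 30 affected among 1000 siblings of cases,
# and a population prevalence of 0.2%.
lambda_sib = recurrence_risk_ratio(30, 1000, 0.002)
print(f"Sibling recurrence risk ratio: {lambda_sib:.1f}")
```

A ratio well above 1 indicates familial aggregation, which, as noted above, may reflect shared environment as well as shared genes.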
Other traditional designs for distinguishing
non-genetic shared family effects from genetic effects have been studies of twins and adoptees. Twin studies
provide somewhat more specific information than recurrence risk ratios (see Australian Twin Study; Swedish Twin Registry; Netherlands Twin
Register and St Thomas Hospital
Twin Research & Genetic Epidemiology websites).
Twin studies have traditionally been used to estimate the genetic contribution to a trait through the comparison of monozygotic (MZ) pairs (who share all their genes) with dizygotic (DZ) pairs (who share on average half of their segregating genes). The greater similarity of MZ twins than DZ twins is considered evidence of genetic factors. A standard measure of similarity used in twin studies is the concordance rate. This can be pairwise (Pr) or probandwise (Cc) (McGue, 1992). The pairwise concordance is a descriptive statistic: it is the proportion of twin pairs with both twins affected among all ascertained twin pairs with at least one affected twin, Pr = C/(C+D), where C is the number of concordant pairs and D is the number of discordant pairs. The probandwise concordance is the proportion of affected individuals among the co-twins of previously ascertained index cases. It allows for double counting of doubly ascertained twin pairs and has the advantage of being interpretable as the recurrence risk in a co-twin of an affected individual. It is estimated as Cc = 2C/(2C+D). Of the two measures, the probandwise concordance rate is preferable.
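The two concordance formulas given above can be sketched in a few lines of Python. The function names and the pair counts are hypothetical; the formulas are the ones stated in the text, with C concordant and D discordant pairs.

```python
def pairwise_concordance(concordant_pairs: int, discordant_pairs: int) -> float:
    """Pr = C / (C + D)."""
    total = concordant_pairs + discordant_pairs
    if total == 0:
        raise ValueError("at least one ascertained pair is required")
    return concordant_pairs / total


def probandwise_concordance(concordant_pairs: int, discordant_pairs: int) -> float:
    """Cc = 2C / (2C + D); concordant pairs are counted twice."""
    denominator = 2 * concordant_pairs + discordant_pairs
    if denominator == 0:
        raise ValueError("at least one ascertained pair is required")
    return 2 * concordant_pairs / denominator


# Hypothetical sample: 20 concordant and 60 discordant twin pairs.
print(pairwise_concordance(20, 60))     # 20 / 80
print(probandwise_concordance(20, 60))  # 40 / 100
```

With the same data the probandwise rate is never lower than the pairwise rate, because each concordant pair contributes two affected co-twins.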
In theory, complete genetic determination of
a disease would equate to MZ twins having 100% concordance and DZ twins having
50% concordance (MacGregor,
2000). Naturally, this finding
should be considered in relation to other evidence and always with a critical
assessment of the assumptions behind the twin approach. Among the most central
assumptions is that the two types of twins share to an equal extent the
environmental experiences that are relevant for the development of the trait.
For behavioral and psychiatric conditions, this equal-environments assumption has been reported to hold (Kendler, 1993), but it was challenged by Guo, who showed that even in the complete absence of any genetic factor and bias, the greater environmental similarity of MZ twins alone can result in a higher concordance rate in MZ twins than in DZ twins (Guo, 2001). MZ
twin concordance rates of less than 100% emphasize the importance of
environmental factors. In lung cancer, concordance rates are very similar in MZ
and DZ twins suggesting the role of shared environment (smoking) rather than
genetics. For examples of twin studies on environmental and genetic causes of
cancer, see Lichtenstein,
2000 and Risch, 2001.
Because twins share intrauterine circulation, 100% concordance rates in twins for leukemia (which may initiate prenatally) do not necessarily mean that leukemia is strongly genetically determined; rather, they suggest transfer of a leukemia clone from one twin to the other.
The usual assumptions of a classic twin
study are random mating, no interactions between genes and
environment, and equivalent environments for MZ and DZ twins. In the
study of a complex trait, phenotypic variance is divided into a component due
to inherited genetic factors (heritability), a component due to
environmental factors common to both members of the pair of twins (the shared
environmental component), and a component due to environmental factors
unique to each twin (the nonshared
environmental component). Twin studies have been used to establish the
presence of a genetic component in the etiology in many diseases (MacGregor,
2000) including celiac disease (Greco, 2002),
Alzheimer disease (Raiha, 1996), schizophrenia (Sullivan,
2003) and cancer (Risch, 2001).
Alternatively, twins discordant for disease
have been used to examine possible environmental causes. Adoption studies also permit the separation of childhood rearing
effects from genetic effects by studying the similarity of adopted children
with their biological and foster parents. The assumptions are that the resemblance between an adopted child and
biological parent is due only to genetic effects, while that between the
adopted child and the adoptive parent is only environmental in origin (see Magnusson,
1999 for an example in cervical cancer; and the Colorado
Adoption Project website for adoptee studies). However, sometimes the representativeness of adoption studies can be
questioned due to special circumstances surrounding adoption (adoption bias).
In contrast, twins are born into all classes of society (Hublin & Kaprio, 2003). Migration studies also provide clues for genetic vs environmental causes (Parkin & Khlat, 1996; see also Ecological Studies). If the incidence in migrants shifts toward the host population's incidence, this suggests stronger environmental factors in pathogenesis (as in diabetes (Drash, 1990), hypertension (He, 1991) and cancer (McCredie, 1998; Sasco, 2003)).
Once a genetic basis is established, the
next step is to define the mode of inheritance of the trait/disease or to map
the specific gene(s) contributing to the trait. The former is achieved by segregation analysis in families.
Recurrence risk ratios within families allow an informal evaluation of the
possible segregation modes. However, it is possible to use maximum likelihood techniques to test
hypotheses representing different sources of genetic influence. For
instance, is there a single major gene or are there many genes of small effect
that influence the trait? Could there be two major genes that are interacting
to cause variation in the trait? Segregation analysis can be used to answer
these questions (see a Lecture Note by SA Monks).
Segregation analysis estimates the genetic
model of a trait by looking at multigenerational family data. The components of
a genetic model are (1) transmission probabilities (the probability that a
parental genotype transmits a particular allele to an offspring); (2) penetrance for each genotype; and (3) allele frequencies in
the population (to determine prior probabilities of genotypes when inferring
genotypes from phenotypes) (Thompson,
1986; Olson,
1999). In model-free methods, the frequencies and penetrance
of disease genotypes need not be known in advance (Goldgar, 2001).
Segregation
analysis is a prerequisite
for linkage analyses. Segregation analysis reveals Mendelian
inheritance patterns (autosomal or sex-linked and
recessive or dominant); nonclassical inheritance
(mitochondrial diseases, genomic imprinting, parent of origin effect, genetic
anticipation etc); or non-Mendelian inheritance (no
pattern) (see Clinical Genetics for
details). Factors interfering with genotype-phenotype correlation such as
incomplete penetrance, variable expressivity,
confounding by other genes (allelic or locus heterogeneity, multigene
inheritance, epistasis, modifier genes, sex
influence, parental effect) or environmental factors, and nonclassic
genetic phenomena (imprinting - parent of origin effect, mitochondrial
inheritance) complicate the segregation analysis of complex diseases for which
no inheritance pattern is obvious despite familial aggregation. Complex
segregation analyses are based on more elaborate mathematical methods of
genetic transmission and liability (Morton,
1971; Elston, 1981; Lalouel, 1983 (PDF); Elston, 1992; Jarvik, 1998). Families with large pedigrees and many
affected individuals are particularly informative both for establishing that
genes matter and for identifying specific genes (Terwilliger & Goring, 2000).
Recurrence risk ratio for relative pairs, twin concordance rates, and heritability coefficients are all functions not only of genetic effects and gene frequency, but also of environmental effects, the distribution of environmental factors in the population, and of gene-environment interactions (for a review of GEI, see North & Martin, 2008). Thus, establishing the genetic effects on disease occurrence should not rely on purely these kinds of family-based measures (Guo, 2000a).
The genes that make up the genetic component of a disease etiology can be localized by linkage (cosegregation) studies; subsequent association studies identify the disease gene and its allele contributing to disease risk. If an unknown locus underlying a phenotype and the marker being studied are linked, then pairs of relatives (usually sib pairs, but also uncle-nephew, grandparent-grandchild, half-sib or first-cousin pairs) concordant for the phenotype will tend to be similar with respect to the marker genotype (Risch, 1990b). For larger recombination fraction (θ) values, grandparent-grandchild pairs are best; for small relative recurrence risk values, sibs are best. Although it sounds intuitively appealing, the use of affected-unaffected pairs generally represents a poor strategy (Risch, 1990b). In the absence of linkage, there is no reason why the phenotypic similarity of relatives should correlate with their similarity for the marker genotype. Linkage analysis tests for such a correlation between phenotype and genetic marker similarity.
Linkage studies aim to obtain a crude chromosomal location of the gene or genes associated with a phenotype of interest, e.g. a genetic disease or an important quantitative trait. Linkage strategies include traditional ones (linkage analysis on pedigrees; allele-sharing methods: candidate genes, genome screen; animal models: identifying candidate genes) and newer ones (focus on special populations (Wright, 1999; Peltonen, 2000; Arcos-Burgos, 2002) such as Finland, Iceland, Newfoundland, Sardinia, Amish and Hutterites, haplotype-sharing; congenic/consomic lines in mice). For a general review, see genetic linkage in Kimball’s Biology.
Recombination fraction (denoted as θ, theta) between a known genetic locus (marker) and an unknown disease locus (gene) lies at the heart of genetic linkage analysis (tutorial by F Clerget-Darpoux). The recombination fraction is the proportion of gametes a person transmits in which a recombination event has occurred between the two loci. If the two loci are far apart, segregation of one locus will be independent of the other (cosegregation and no cosegregation are equally likely). At θ = 1/2 (0.50), the four types of gametes (two recombinant and two nonrecombinant) are equally likely to be produced. This value (θ = 1/2) is the baseline in linkage studies: when two loci are far apart on the same chromosome or, in the extreme example, on different chromosomes, they segregate independently according to the law of independent assortment and θ equals 50% (Forabosco, 2005). Linked genes are, however, transmitted to the same gamete more than 50% of the time, so that fewer than 50% of gametes differ from the parental chromosomes. Thus, when 0 ≤ θ < ½, the 'parental-type' gametes are more frequent than the 'recombinant-type' gametes.
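To make the relation between the recombination fraction and gamete types concrete, here is a minimal Python sketch for a double heterozygote. The haplotype labels AB/ab and the function name are hypothetical; the sketch assumes the parent carries haplotypes AB and ab, so that Ab and aB are the recombinant gametes.

```python
def gamete_frequencies(theta: float) -> dict:
    """Expected gamete frequencies for an AB/ab parent.

    theta is the recombination fraction between the two loci (0 to 0.5).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    parental = (1.0 - theta) / 2.0   # each of the two parental types
    recombinant = theta / 2.0        # each of the two recombinant types
    return {"AB": parental, "ab": parental, "Ab": recombinant, "aB": recombinant}


# Unlinked loci: all four gamete types are equally likely.
print(gamete_frequencies(0.5))
# Linked loci: parental-type gametes predominate.
print(gamete_frequencies(0.1))
```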
Two approaches to genetic linkage (and
association) analysis have evolved for traits showing Mendelian
and non-Mendelian segregation patterns:
(1) those that
require prior specification of a genetic model (mode of inheritance) for the
trait under study (model-based / parametric methods),
(2) those that do not assume a specific trait
inheritance (model-free / nonparametric methods) (Risch, 1996; Elston, 1998; Olson,
1999; Goldgar, 2001). [Some model-free methods may be
parametric.]
Model-based linkage analysis is based on a likelihood ratio, the logarithm of which is called a lod score. This is not the logarithm of the odds for linkage but the logarithm of the likelihood ratio for a particular value of the recombination fraction vs. free recombination (θ = 0.50; θ = 0 for two loci that are completely linked and 0.50 for unlinked loci) (Risch, 1992; Elston, 1998; Borecki, 2001). In model-based linkage analysis, all aspects of the statistical model other than the recombination fraction (allele frequencies and penetrance) are assumed to be known. Then, the likelihood at θ, for 0 ≤ θ < 1/2, is divided by the likelihood at θ = 1/2 to yield a likelihood ratio. The logarithm to base 10 of this likelihood ratio is the lod score. Lod scores can be obtained for a range of θ values. The value of θ that maximizes the lod score is the estimate of the recombination fraction between the marker and the disease locus, and a maximum lod score > +3 is taken as evidence for linkage (Elston, 2000). Traditionally, a lod score > +3 is considered to be significant (and a score < -2 may be used as an exclusion criterion). A lod score of 3 corresponds to a P value of about 10⁻⁴ (one-sided) (see also a presentation on LOD Score). A number of software packages are available to analyze linkage in pedigree data; the most commonly used ones are Linkage, Genehunter, Mendel, Merlin and Allegro. For a complete list, see Genetic Analysis Software List (and the end of this page).
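The lod score calculation can be sketched for the simplest textbook case. The Python example below assumes phase-known, fully informative meioses, so that the likelihood is binomial in the number of recombinants; this is a simplification for illustration and not how the pedigree packages named above compute likelihoods. Function names and counts are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    nonrecombinants = meioses - recombinants
    if theta == 0.0:
        # Complete linkage is impossible once a recombinant has been seen.
        return float("-inf") if recombinants > 0 else meioses * math.log10(2)
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(recombinants: int, meioses: int, steps: int = 500):
    """Grid search over theta in [0, 0.5] for the maximum lod score."""
    grid = [i * 0.5 / steps for i in range(steps + 1)]
    best_theta = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best_theta, lod_score(recombinants, meioses, best_theta)


# Hypothetical family data: 2 recombinants in 20 informative meioses.
theta_hat, lod = max_lod(2, 20)
print(f"theta = {theta_hat:.3f}, maximum lod = {lod:.2f}")
```

In this hypothetical example the lod score peaks near θ = 0.1 with a value slightly above 3, which would meet the traditional threshold described above.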
The
simplest (and nonparametric) linkage analysis is to test whether the proportion
of alleles the sibs share at a marker locus is greater than 1/2 (expected
sharing in sibs). This 'test of cosegregation' is
called the 'mean' test. In absolute linkage, 100% of concordant sibs will share
the marker as opposed to 50% (see Gulcher,
2001 for a review). In the past, when the disease gene was not known and
could not be analyzed directly, a linked genomic marker (usually a polymorphic
gene) was used for prenatal risk estimation in a given family. The assumption
would be that whichever linked allele the affected member of the family has,
the fetus would have the same if carrying the disease gene. Examples include
HLA-A linkage with hereditary haemochromatosis
(usually HLA-A3) and HLA-B linkage with congenital adrenal hyperplasia (usually
HLA-B47). Microsatellite loci in absolute linkage with then unknown disease
genes were used in prenatal diagnosis of single-gene diseases in the past (eg, congenital adrenal hyperplasia, Wiskott-Aldrich
syndrome etc). The methodology (maximum lod
score, MLS) for affected-sib-pair linkage analysis was first described by Risch (1990) and reviewed by Holmans (1998).
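The 'mean' test described above can be sketched as a one-sided z-test on the proportion of marker alleles shared by affected sib pairs. The Python example below assumes that identity-by-descent sharing (0, 1 or 2 alleles) is known without ambiguity for every pair, which is rarely true in practice; under no linkage the per-pair sharing proportion has mean 1/2 and variance 1/8. Function name and counts are hypothetical.

```python
import math


def mean_test(pairs_sharing_0: int, pairs_sharing_1: int, pairs_sharing_2: int):
    """One-sided test of whether affected sib pairs share more than 1/2 of alleles."""
    n_pairs = pairs_sharing_0 + pairs_sharing_1 + pairs_sharing_2
    if n_pairs == 0:
        raise ValueError("at least one sib pair is required")
    shared_alleles = pairs_sharing_1 + 2 * pairs_sharing_2
    proportion = shared_alleles / (2 * n_pairs)
    # Variance of the mean sharing proportion under the null is 1 / (8 n).
    z = (proportion - 0.5) * math.sqrt(8 * n_pairs)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return proportion, z, p_one_sided


# Hypothetical sample of 100 affected sib pairs.
proportion, z, p = mean_test(15, 50, 35)
print(f"sharing = {proportion:.2f}, z = {z:.2f}, one-sided p = {p:.4f}")
```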
Linkage and association studies are occasionally mixed up. They aim to address different questions and provide different answers. Linkage is a phenomenon of cosegregating loci, not alleles, within families. Linkage studies are used for coarse mapping as they have a limited genetic resolution of about 1 cM. If two markers are close, there will not be much recombination between them and they will cosegregate. This leads to finding linkage in a pedigree-based analysis. Association studies at the population level are the next step for fine mapping. Association may result from direct involvement of the gene or linkage disequilibrium (LD) with the disease gene at the population level. Linkage always leads to an association but this is usually intrafamilial with no association at the population level (linkage of genotype for a genetic marker to disease may be unique to the particular family). In other words, linkage does not necessarily mean a consistent association with a particular allele. Allelic association, on the other hand, may or may not be due to linkage (except when LD exists between the associated marker and the unknown disease gene, association is not due to linkage). Not all associations are due to a direct genetic mechanism, i.e. being close to a disease gene. This is an important point because the value of family-based TDT test depends on co-presence of linkage and association (Spielman, 1993; Thomson, 1995; Elston, 2000). If more than one allele or haplotype shows an association with a disease, the association might reflect linkage. Different HLA-AB haplotype associations in hereditary haemochromatosis, for example, reflected linkage of HLA-A and -B loci to the HFE gene.
While recombination fraction is what linkage studies rely on, linkage disequilibrium (LD) is the foundation of association studies. The assumption is that the genetic marker studied is close enough to the actual disease gene and this will result in an allelic association at the population level (Jorde, 2000; Weiss, 2002; Carlson, 2004; Morton, 2005). Another critical assumption of both association and LD mapping is that there is little allelic heterogeneity within loci. The magnitude of LD is affected by many factors but if everything else is assumed to be equal, the most important factor is the physical/genetic distance between the disease and marker alleles: the closer they are the lower the recombination frequency and the stronger the magnitude of LD. This implies that close linkage between the marker and disease loci would result in longer periods of LD within the population. For LD to be detectable, linkage need not be present; allelic or gametic association is a better term to describe the general phenomenon of LD. See Basic Population Genetics for more on LD.
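The statement that closer loci retain LD for longer can be illustrated with the standard population-genetic decay relation, in which the disequilibrium coefficient D shrinks by a factor of (1 - θ) per generation. The relation is not stated in the text above; it is brought in here as a textbook approximation that assumes random mating, a large population and no selection, mutation or migration. The function name and numbers are hypothetical.

```python
def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected LD coefficient after t generations: D_t = D_0 * (1 - theta)**t."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - theta) ** generations


# Same starting disequilibrium, tightly linked vs unlinked loci, 50 generations.
tight = ld_after_generations(0.25, 0.01, 50)
loose = ld_after_generations(0.25, 0.5, 50)
print(f"theta = 0.01: D = {tight:.4f}")
print(f"theta = 0.50: D = {loose:.2e}")
```

Under these assumptions most of the initial disequilibrium survives 50 generations when θ = 0.01, whereas it has effectively vanished for unlinked loci.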
Association studies focus on population frequencies, whereas linkage studies focus on concordant inheritance. One may be able to detect linkage without association when there are many independent trait-causing chromosomes in a population (i.e., no LD of the disease causing allele to a specific marker nearby); or association without linkage when an allele explains only a minor proportion of the variance for a trait, so that the allele may occur more often in affected individuals but does a poor job of predicting disease status within a pedigree (Lander & Schork, 1994). Association is usually with a 'susceptibility' locus, which increases the probability of contracting the disease but is not 'necessary' or 'sufficient' for disease expression. In this case, the marker will not show linkage in families. If an association is, however, with a marker in LD with a 'necessary' locus for disease development, then there will be evidence for linkage in family data (Greenberg, 1993; Greenberg & Doneshka, 1996). Linkage analysis is not useful for finding loci that are neither necessary nor sufficient for disease expression (so-called susceptibility loci).
Association studies have several practical advantages over linkage studies. As opposed to linkage studies, families with multiple affected individuals are not required and no assumptions are made about the mode of inheritance of the disease. In addition, association studies have considerable statistical power to detect genes of weak effects unlike linkage studies in families (Risch, 1996; Morton, 1998; Risch, 2000). Most significant factors independently associated with increased success in linkage studies are (a) an increase in the number of individuals studied and (b) study of a sample drawn from only one ethnic group (Altmuller, 2001). For association studies, large datasets, small P values and independent replication of results are important for reproducible results (Editorial, Nat Genet 1999; Dahlman, 2002). Use of ancestral haplotype groups in association studies (evolutionary-based association study design) is another way to increase power (Templeton, 1987; 1995; 2000; Schork, 1998; Seltman, 2003; Fejerman, 2004; Tzeng, 2005).
Successful examples
of the use of linkage and association studies to locate and find disease
susceptibility genes in complex diseases include rheumatoid arthritis (PTPN22
gene; R620W variant) (Gregersen, 2005) and Crohn
disease (Ogura,
2001; Hugot, 2001; Todd,
2005).
Family-based vs population-based (case-control or prospective cohort) association studies
Family-based association studies (Thomson, 1995; Gauderman, 1999) include:
* Genotype Haplotype Relative Risk (GHRR) Method (Terwilliger & Ott, 1992)
* Haplotype Relative Risk (HRR) Method (Falk & Rubinstein, 1987; Knapp, 1993)
* Affected Family-Based Controls (AFBAC) Approach (Thomson, 1995)
* Transmission Disequilibrium/Distortion Test (TDT) (Spielman, 1993 & 1994; Ewens & Spielman, 1995; Clayton & Jones, 1999). See also sib-TDT (S-TDT) and extended-TDT (E-TDT).
TDT is robust to population stratification and can be performed in families used for linkage analysis. It is less powerful than case-control analyses because it requires 1/3 more individuals to observe the same effect (Risch & Teng, 1998; Long & Langley, 1999). TDT may be impractical for diseases with late onset, because parents are often unavailable for genotyping, but TDT-derived approaches have been developed, such as sib-TDT, which uses unaffected siblings as controls for affected individuals. An unaffected sibling does not have to exist for family-based designs, as parental chromosomes that have not been transmitted to the affected child can be used as pseudocontrols (HRR method). TDT-based studies also benefit from the use of the evolutionary-based haplotype analysis (ET-TDT) approach (Seltman, 2001). The real value of TDT is felt when population stratification cannot be controlled (Thomas, 2002; Wacholder, 2002). Because cases and controls are basically the same individuals (or their chromosomes), they originate from the same ethnicity. The TDT design tests the joint null hypothesis that there is no linkage and no allelic association (Spielman, 1993; Thomson, 1995; Elston, 2000). A significant result, chance findings aside, is due to the presence of both linkage and association. Just as population stratification does in population-based association studies, meiotic drive or transmission ratio distortion can cause spurious associations in TDT. For family-based association tests, see FBAT software & manual.
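The basic biallelic TDT statistic can be sketched as a McNemar-type chi-square on transmissions from heterozygous parents to affected offspring. In the Python example below, the function name and counts are hypothetical, and the P value uses the closed form for a chi-square distribution with 1 degree of freedom. For real analyses a dedicated package such as the FBAT software mentioned above would be used.

```python
import math


def tdt_statistic(transmitted: int, not_transmitted: int):
    """TDT chi-square (1 df) from heterozygous-parent transmission counts.

    transmitted: times the candidate allele was passed to an affected child.
    not_transmitted: times the other allele was passed instead.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi_square = (transmitted - not_transmitted) ** 2 / total
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical data: allele transmitted 60 times, not transmitted 40 times.
chi_square, p_value = tdt_statistic(60, 40)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents are informative, because homozygous parents always transmit the same allele.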
Possible
ascertainment problems in case-control studies in genetic epidemiology
* The sample should be representative of all
cases. Inclusion of those identified at a hospital clinic may or may not be
appropriate. They should be unrelated, incident (as opposed to prevalent) and
consecutively diagnosed ones. If the prevalence of the disease is known, this
would give an idea for the completeness of ascertainment (for a rare disease).
* If the disease requires medical attention
only in some cases, recruitment from a hospital will be selective (usually for
severe cases).
* If there is a survival effect of the
disease (as in Alzheimer disease and ApoE), and if
the associated allele also modifies the risk of death from competing causes,
the age-dependent frequencies will be different. In this case, age-matching of
cases and controls becomes particularly important.
* The controls
should be comparable to cases except for having the disease. Local, contemporary
controls should be selected via the same routes as the cases. For relevant
diseases, age- and sex-matching (by frequency or one-to-one) may be important.
The controls that have been self-selected like volunteer marrow or blood donors
are not ideal controls. A more acceptable control group is a truly
population-based one.
The pitfalls of
conventional epidemiologic studies equally apply to molecular epidemiological
research: selection bias, information bias and confounding (Vineis
& McMichael, 1998; Campbell,
2002; Boffetta, 2003) with additional problems unique to
genetic studies (Olson,
2000; Cordell,
2000; Elbaz & Alperovitch, 2002;
Lee
& Ho, 2003; Morimoto, 2003;
Potter,
2003). In genetic association studies, missing data
may be distributed differentially between cases and controls and may generate
spurious associations (Clayton, 2005).
This bias may be due to having subsets of DNA samples extracted using different
chemistries that influence the performance of the assay differentially. See Pitfalls in
Genetic Association Studies.
One potential
problem is that estimates of genetic effect are subject to confounding when
cases and controls differ in their ethnic backgrounds (population
stratification bias or confounding by ethnicity). This can occur when both
disease risk and genetic mutation frequencies vary among ethnic groups (Thomas, 2002;
Wacholder, 2002; Cardon, 2003). To avoid the problem of population
stratification bias, matching cases to controls on ethnic background, stratification,
family-based association studies or genomic controls (Devlin,
1999; Pritchard,
1999) can be used.
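The genomic control idea mentioned above can be sketched in simplified form: association statistics at markers presumed unrelated to the disease are used to estimate an inflation factor, and the test statistic at the candidate marker is divided by it. The Python example below uses a median-based estimator and made-up chi-square values; it is an illustration of the principle under these assumptions, not a reproduction of the published procedure, and all names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(null_marker_chi2) -> float:
    """Inflation factor: observed median statistic / expected median under the null."""
    return statistics.median(null_marker_chi2) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2: float, inflation: float) -> float:
    """Deflate a 1-df chi-square statistic; factors below 1 are treated as 1."""
    return chi2 / max(inflation, 1.0)


# Made-up statistics from seven markers assumed to be unlinked to the disease.
null_stats = [0.1, 0.3, 0.6, 0.7, 0.9, 1.4, 2.0]
inflation = genomic_inflation_factor(null_stats)
print(f"inflation factor = {inflation:.3f}")
print(f"adjusted statistic = {genomic_control_adjust(10.0, inflation):.2f}")
```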
For more details on
genetic association studies, see Statistical Analysis
of Genetic Association Studies.
Genetic epidemiology of complex diseases
The term complex trait/disease refers to any phenotype that does not exhibit classic Mendelian inheritance attributable to a single gene, although it may exhibit familial tendencies (familial clustering, concordance among relatives). The contrast between Mendelian diseases and complex diseases involves more than just a clear or unclear mode of inheritance. In Mendelian diseases, the risk to relatives decreases by a factor of ½ with each degree of relationship (from first to second to third degree) but in complex diseases the risk decreases more rapidly (Risch, 1990a). Other hallmarks of complex diseases include known or suspected environmental risk factors; seasonal, birth order, and cohort effects; late or variable age of onset; and variable disease progression. Many complex diseases are hard to diagnose accurately; even quantitative traits such as hypertension often involve sizable measurement errors (Guo, 2000b). Ultimate analysis of complex traits requires sophisticated statistical designs incorporating all genetic and nongenetic variables, their interactions, and familial correlations. In general, linkage is harder to show in a complex disease than in a Mendelian disorder (Risch, 1992). A complex disease can be modeled in two different ways: (1) an additive model, which closely approximates genetic heterogeneity and is characterized by no interlocus interaction, and (2) a multiplicative model, representing epistasis (interaction) among loci (Risch, 1990a). It should be recognized that simple genetic traits are also complex when examined closely (Estivill, 1996) (see also Rannala, 2001 for a comprehensive review of complex disease genetics and the Journal of Clinical Investigation: Reviews on Complex Genetic Disorders (2005)).
Common susceptibility alleles in rare complex diseases?
One popular hypothesis proposes that the genetic factors underlying common diseases will be alleles that are themselves quite common in the population at large (Lander, 1996; Chakravarti, 1999 (PDF); Pritchard, 2001 (PDF)). When several different loci contribute to a phenotype (such as a complex disease), it is likely that the alleles at loci responsible for such interactions have high frequencies in populations (Carlson, 2004). If, for example, six genes contribute equally to a disease with an incidence of 1.5%, each susceptibility allele must have a population frequency of around 50% (if all six are required and act independently, 0.5^6 is about 1.6%). Thus, modest-risk gene variants involved in polygenic diseases are often likely to be normal alleles from unsuspected loci that have relatively high frequencies (Reich, 2001). The identification of normal polymorphisms is of great importance for medical genetics (Cavalli-Sforza, 1998). However, rare coding-region alleles are commonly deleterious, and their contribution to the development of complex diseases is obvious (Kryukov et al, 2007; see also Ropers, 2007).
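The six-gene arithmetic can be checked with a short calculation. This sketch rests on a simplifying assumption that is not spelled out in the text: disease occurs only when the susceptibility variant is present at every one of the loci, and the loci are independent and equally frequent. The function name is hypothetical.

```python
def required_frequency(incidence, n_loci):
    """Per-locus frequency f such that f ** n_loci equals the incidence.

    Assumption: all loci are independent, equally frequent, and all are
    required for disease.
    """
    return incidence ** (1.0 / n_loci)


freq = required_frequency(0.015, 6)
print(round(freq, 3))        # close to 0.5
print(round(0.5 ** 6, 4))    # incidence implied by a frequency of exactly 50%
```

With fewer contributing loci the required frequency drops quickly (for two loci it is about 12%), so the "around 50%" figure is specific to the six-locus example.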
Genetic Models
* Single major locus: dominant, recessive
* Multifactorial / polygenic (Falconer, 1965 & 1967), in which dominant, recessive, additive, or multiplicative genetic models may be possible for each locus. The dominance model is different from these and refers to an association with the heterozygous genotype (which does not get stronger in homozygotes; see also heterozygote advantage in the Glossary).
* Mixed model: a single major locus with a polygenic background (Morton & MacLean, 1974; Elston & Rao, 1978). For more on genetic models, see Introductory Statistical Genetics (PPT); Case-Control Association Studies by CM Lewis and Lewis, 2002.
The polygenic model has its origins in Fisher's work, which concluded that 'many small, equal and additive loci' would result in a Gaussian distribution for a phenotype (Fisher RA. Transactions of the Royal Society of Edinburgh 1918;52:399-433). Falconer was the first to introduce the idea of a normally distributed, quantitative trait as the 'liability' for a genetically determined disorder (Falconer, 1965). When environmental factors are known to influence the phenotype, the model is called multifactorial (polygenes and environment). The presence or absence of the phenotype is determined by a threshold, T (Falconer's multifactorial liability threshold model). This model applies to most complex diseases (like diabetes, hypertension, schizophrenia, cancer) where both multiple genes and environmental factors play a role in the development of the disease (for a review of gene-environment interaction (GEI), see North & Martin, 2008). These disorders are presumed to result from additive effects of multiple genes with low penetrance. Individual mutations may not produce any particular phenotype, but when they act in concert and in the presence of the necessary environmental conditions, they may produce a disease phenotype. Under a model of multiple interacting loci, no single locus could account for more than a five-fold increase in the risk of first-degree relatives. The disease shows increased incidence in families but with no recognizable inheritance pattern.
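The liability threshold idea can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a fitted model: liability is built from many equal, additive biallelic loci plus a normally distributed environmental term, then compared with a threshold T chosen so that a standard normal liability exceeds it with probability equal to the prevalence. The number of loci, allele frequency, heritability, prevalence and function name are all hypothetical.

```python
import random
from statistics import NormalDist


def simulate_liabilities(n_people, n_loci, allele_freq, h2, rng):
    """Liability = sqrt(h2) * standardized polygenic score + environment.

    Assumptions: equal additive effects at every locus, Hardy-Weinberg
    proportions, independent loci, and no shared environment.
    """
    mean_g = 2 * n_loci * allele_freq
    sd_g = (2 * n_loci * allele_freq * (1 - allele_freq)) ** 0.5
    liabilities = []
    for _ in range(n_people):
        # Count risk alleles over the 2 * n_loci allele copies of one person.
        count = sum(rng.random() < allele_freq for _ in range(2 * n_loci))
        genetic = (count - mean_g) / sd_g
        environment = rng.gauss(0.0, 1.0)
        liabilities.append(h2 ** 0.5 * genetic + (1 - h2) ** 0.5 * environment)
    return liabilities


prevalence = 0.05                                   # hypothetical value
threshold = NormalDist().inv_cdf(1 - prevalence)    # threshold T on the liability scale

rng = random.Random(1)
liabilities = simulate_liabilities(4000, 50, 0.3, 0.6, rng)
affected = sum(x > threshold for x in liabilities) / len(liabilities)

print(round(threshold, 3))
print(round(affected, 3))   # should be near the chosen prevalence
```

Because the summed locus effects are approximately Gaussian, the simulated fraction above T stays close to the chosen prevalence even though no single locus has a large effect.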
The features of multifactorial inheritance are:
- The more severe the condition, the greater the risk to sibs,
- Carter effect: the sibs or offspring of a patient of the less commonly affected sex have higher susceptibility to the disease,
- The rarer the disease in the population, the greater the relative increase in its frequency among relatives,
- If more than one individual in a family is affected, the recurrence risk is higher,
- The risk falls rapidly as one passes from 1st to 2nd degree relatives.
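Several of these features follow directly from a liability threshold calculation, sketched below. The assumptions are: liability in a proband and a relative is bivariate normal; their correlation equals heritability times the coefficient of relationship (additive effects only, no shared environment); and sex differences are modeled as different thresholds on the same liability scale. The heritability, prevalences and function name are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def recurrence_risk(k_proband, k_relative, h2, relatedness, steps=4000):
    """P(relative affected | proband affected) under a bivariate normal liability.

    k_proband and k_relative are the population prevalences that define the
    thresholds applying to the proband and to the relative.
    """
    t_p = STD.inv_cdf(1 - k_proband)
    t_r = STD.inv_cdf(1 - k_relative)
    r = h2 * relatedness                     # liability correlation (assumption)
    sd_cond = (1 - r * r) ** 0.5
    width = 10.0 / steps                     # integrate proband liability over [t_p, t_p + 10]
    joint = 0.0
    for i in range(steps):
        x = t_p + (i + 0.5) * width          # midpoint rule
        joint += STD.pdf(x) * (1 - STD.cdf((t_r - r * x) / sd_cond)) * width
    return joint / k_proband


h2 = 0.8

# Risk falls with each degree of relationship (1/2, 1/4, 1/8 relatedness).
risks = [recurrence_risk(0.01, 0.01, h2, c) for c in (0.5, 0.25, 0.125)]
print([round(x, 4) for x in risks])

# Rarer disease: larger risk relative to the population prevalence.
ratio_common = recurrence_risk(0.01, 0.01, h2, 0.5) / 0.01
ratio_rare = recurrence_risk(0.001, 0.001, h2, 0.5) / 0.001
print(round(ratio_common, 1), round(ratio_rare, 1))

# Carter effect: brothers of affected females (rarer sex) versus affected males.
k_male, k_female = 0.005, 0.001
brothers_of_females = recurrence_risk(k_female, k_male, h2, 0.5)
brothers_of_males = recurrence_risk(k_male, k_male, h2, 0.5)
print(round(brothers_of_females, 4), round(brothers_of_males, 4))
```

In this sketch an affected member of the less commonly affected sex must have crossed a higher threshold, so on average carries more liability, which is why their relatives show the higher risk.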
It was later appreciated that many diseases have at least two thresholds, differing by sex or corresponding to different degrees of severity (Reich, 1972). Examples include pyloric stenosis (sex dimorphism for liability) (Chakraborty, 1986) and orofacial cleft syndrome / cleft lip and palate (two thresholds for fetal mortality and disease) (Dronamraju, 1982 & 1983). The latter model proposes a lower threshold level of liability resulting in cleft formation and a higher level causing fetal death (preferentially in males). The multifactorial threshold model is not very popular currently because it has been replaced by the concept of genetic heterogeneity.
Each model has its own assumptions and parameters. Multifactorial and mixed models are analyzed in the same way as statistical models of a polytomous outcome as a function of continuous variables. In both models, liability is assumed to be continuous (representing the sum of a large number of independent genetic and environmental factors) and normally distributed within the population. Another assumption is that all correlations between relatives are due to shared genes and not to shared environment. Multifactorial modeling may fail for a variety of reasons. These include the presence of a major gene, invalidity of the assumptions of normality and shared genes, and the presence of dominance and epistatic interaction (multiplicative as opposed to additive effects). Multifactorial complex traits that have multiple genetic determinants in one population may show simpler inheritance patterns in another due to differences in allele distribution. The genetic basis of a multifactorial disease is that a genetically susceptible individual may or may not develop the disease depending on the interaction of a number of risk factors, both genetic and environmental (Risk Estimation for Multifactorial Diseases. Ann ICRP, 1999; Ottman, 1996; Cooper, 2003).
Like all model-fitting applications, genetic modeling is an imperfect science. For this reason, and also because of the increasing knowledge of the human genome, the trend is shifting toward skipping the initial steps of exploring genetic predisposition and proceeding directly to specific candidate genes (Candidate-Gene Association Studies in Complex Genetic Traits) or whole-genome association studies (see also Clark, 2005).
Links
Online Encyclopedia of Genetic Epidemiology
International Genetic Epidemiology Society
Courses
Wellcome Trust Genome Campus Advanced Courses: Genetic Analysis of Multifactorial Diseases
Online Resources
Wellcome Trust Centre for Human Genetics: Course on the Design and Analysis of Disease-Marker Association Studies
Wellcome Trust Genome Sequence & Variation Course Manual (2003) (Other Course Manuals)
Genetic Epidemiology Course Lectures by David Clayton
EFG: Statistical Genetic Analysis of Complex Phenotypes Course Notes (2005) incl. Case-Control Association Studies by CM Lewis
Introduction to Genetic Linkage and Association Course Notes by Dave Curtis
Genetic Epidemiology SuperLecture by Kevin Kip
Introduction to Genetic Epidemiology Lecture by Hermine Maes
Human Molecular Genetics (Strachan & Read; 1999): Genetic Mapping of Complex Characters & Complex Disease Genetics
Centre for Integrated Genomic Medical Research (CIGMR): Statistical Genetic Analysis
GENESTAT: Genetic Association Studies Portal
Statistics for Genome-wide Association Studies by Laurent Briollais @ Bioinformatics.ca: PDF | PPT
CDC Genomics and Disease Prevention Center: Research Methods Publications
Basic Molecular Genetics for Epidemiologists
Genetic Epidemiology Studies on Twins by Nick Martin
Quantitative Genetics Lecture Notes (slide presentation)
Human Genetics Interactive Learning Exercises
Genetic Calculation Applets by Knud Christensen (including heritability and variance components)
NIH National Human Genome Research Institute Programs (HapMap; ENCODE; Genetic Variation)
SNP@Ethnos: a database of ethnically variant SNPs (Park, 2007)
HuGE Navigator / Genopedia / Phenopedia
Online Encyclopedia for Genetic Epidemiology Studies
Glossary on Genetic Epidemiology: Basic & Advanced
STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) & Checklists
Suggested Books
* Armitage P & Colton T. Encyclopedia of Biostatistics. Volumes 1-8. John Wiley & Sons, 2005
* Bishop T & Sham P: Analysis of Multifactorial Diseases. Academic Press, 2000
* Clayton D & Hills M. Statistical Models in Epidemiology. OUP, 1993
* Elston RC, Olson J, Palmer L (Eds) Biostatistical Genetics and Genetic Epidemiology. John Wiley, 2002
* Falconer DS & Mackay TF. Introduction to Quantitative Genetics. 4th Ed. Essex, UK: Longman, 1996
* Hedrick PW. Genetics of Populations. 3rd Ed. Boston, MA: Jones and Bartlett, 2004
* Khoury MJ, Beaty TH, Cohen BH (Eds) Fundamentals of Genetic Epidemiology. New York: Oxford University Press, 1993
* Khoury MJ. Fundamentals of Genetic Epidemiology (Book Chapter) 1997
* Lange K. Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag New York Inc, 2002
* Morton NE. Outline of Genetic Epidemiology. Basel: Karger, 1982
* Neale B, Ferreira M, Medland S, Posthuma D. Statistical Genetics: Gene Mapping Through Linkage and Association. Oxford: Taylor & Francis, 2007 (Amazon)
* Rao DC & Province MA. Genetic Dissection of Complex Traits (Advances in Genetics, Vol. 42). Academic Press, 2000
* Sham P. Statistics in Human Genetics. Hodder Arnold, 1997
* Thomas DC. Statistical Methods in Genetic Epidemiology. Oxford University Press, 2004
* Weir B. Genetic Data Analysis III. Sinauer Associates Inc, 2009
* Yang MC. Introduction to Statistical Methods in Modern Genetics. CRC, 2000
* Ziegler A & Koenig IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2006
* Costa LG & Eaton DL. Gene-Environment Interactions. Wiley Online, 2006
* Human Genome Epidemiology Online Book (CDC, 2004)
* Genetics and Public Health in the 21st Century. Oxford University Press, 2000
* Handbook of Statistical Genetics 2nd Edition, 2004; Online (Balding, Bishop, Cannings; John Wiley & Sons)
Articles
* Collected Papers of R.A. Fisher (including Statistical Methods in Genetics, Heredity 1952)
* Complex Disease Trait Mapping & Linkage Analysis Journal Watch
* Nature Reviews Genetics Web Focus: Statistical Analysis: Editorial, (I), (II), (III), (IV)
* The Lancet Septet on Genetic Epidemiology: Editorial, (I), (II), (III), (IV), (V), (VI), (VII)
* Journal of Clinical Investigation Reviews on Complex Genetic Disorders (2005)
* Manolio et al: HapMap and genetics of complex disease. JCI 2008;118(5):1590-1605
* Thompson EA. Genetic epidemiology: a review of the statistical basis. Stat Med 1986;5:291-302
* Thompson EA. Likelihood and linkage: from Fisher to the future. Ann Stat 1996;24:449-65 (JSTOR-UK)
* Morton NE. Genetic epidemiology (review). Ann Hum Genet 1997;61:1-13
* Balding DJ: Tutorial on Statistical Analysis of Population Association Studies. Nat Rev Genet 2006;7:781-91
* Elston RC. Introduction and overview. Statistical methods in genetic epidemiology. Stat Methods Med Res 2000;9:527-41
* Elston RC. Segregation analysis. Adv Hum Genet 1981;11:63-120
* Elston RC. Linkage and association. Genet Epi 1998;15:565-76
* Kaprio J. Genetic Epidemiology. BMJ 2000;320:1257-9
* Little J et al. Reporting, appraising, and integrating data on genotype prevalence and gene-disease associations. Am J Epidemiol 2002;156:300-310
* North KE & Martin LJ: The importance of gene environment interaction: implications for social scientists. Sociological Methods & Research 2008;37(2):164-200
* Rannala B. Finding genes influencing susceptibility to complex diseases in the post-genome era. AJPG 2001;1:203-221
* Risch N. Searching for genetic determinants in the new millennium. Nature 2000;405:847-56
* Schork NJ. Genetics of complex disease. Am J Respir Crit Care Med 1997;156:S103–9
* Sellers TA & Yates JR. Review of proteomics with applications to genetic epidemiology. Genet Epidemiol 2003;24:83-98
* Tabor HK, Risch NJ, Myers RM. Perspective on candidate-gene approaches for studying complex genetic traits. Nature Rev Genet 2002;3:1-7
* Carlson CS et al. Mapping complex disease loci in whole-genome association studies. Nature 2004;429:446-52
* Clayton D & McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001;358:1356-60
* Burton PR et al. Key concepts in genetic epidemiology. Lancet 2005;366:941-51
* Houlston RS & Peto J. The search for low-penetrance cancer susceptibility alleles. Oncogene 2004;23(38):6471-6
* Editorial: Freely Associating. Nat Genet 1999;22:1-2
* Ellsworth DL & Manolio TA. Importance of Genetics in Epidemiologic Research Part I - Part II - Part III. Ann Epidemiol 1999
* Whittemore AS & Nelson LM: Study Design in Genetic Epidemiology. Journal of the National Cancer Institute Monographs 1999;26:61-9
* Yu W et al: Phenopedia and Genopedia. Bioinformatics 2010;26(1):145-6
* Advances in Genetics Vol. 42, 2001:
- Rao DC. Genetic dissection of complex traits: an overview. Advances in Genetics 2001;42:13-34
- Rice TK, Borecki IB. Familial resemblance and heritability. Advances in Genetics 2001;42:35-44
- Borecki IB, Suarez BK. Linkage and association: Basic concepts. Advances in Genetics 2001;42:45-68
- Province M. Linkage and association with structural relationships. Advances in Genetics 2001;42:183-190
- Schork NJ, Fallin D, Thiel B, Xu X, Broeckel U, Jacob HJ, Cohen D. The future of genetic case-control studies. Advances in Genetics 2001;42:191-212
- Goldgar DE. Major strengths and weaknesses of model-free methods. Advances in Genetics 2001;42:241-54
- Morton NE. Complex inheritance: the 21st century. Advances in Genetics 2001;42:535-43
* Int J Epidemiol Oct 2004 Issue: Special Theme - Genetic Epidemiology
* Annals of the ICRP 1999;29(3-4): Risk Estimation for Multifactorial Diseases
* Statistical Methods in Medical Research, Vol.9, No.6, 2000 (Genetic Epidemiology Issue)
* Phil. Trans. R. Soc. B: Genetic Variation and Human Health Issue (2005) (open access)
* JAMA: Genetics Issue (2008)
* Human Molecular Genetics: Complex Diseases (April 2004)
* Human Molecular Genetics: Genome-wide Association Studies Issue (October 2008)
* Human Heredity: Gene-Gene and Gene-Environment Interaction in Complex Trait Genome Wide Association (February 2007)
Journals
American Journal of Human Genetics
Annual Review of Genomics & Human Genetics
European Journal of Human Genetics
Twin Research & Human Genetics
Multimedia
UAB Statistical Genetics Short Course Lectures on Video
Genetic Epidemiology on CD-ROM
Applied Genetic Epidemiology
Genetic Epidemiology of Cancer
Genetic Epidemiology of Prostate Cancer
Genetic Epidemiology of Infectious Diseases
Genetic Epidemiology of Coronary Heart Disease
Genetic Epidemiology of Coronary Atherosclerosis
Genetic Epidemiology of Essential Hypertension
Genetic Epidemiology of Type 1 Diabetes
Genetic Epidemiology of Inflammatory Bowel Disease
Genetic Epidemiology of Multiple Sclerosis (PubMed); see also Willer, 2003 & Ebers, 2004
Genetic Epidemiology of Psoriasis & Autoimmunity
Genetic Epidemiology of Osteoarthritis
Genetic Epidemiology of Orthopedic Conditions
Genetic Epidemiology of Alzheimer Disease
Genetic Epidemiology of Neurodegenerative Disease
Genetic Epidemiology of Myopia
Genetic Epidemiology of Glaucoma
Genetic Epidemiology of Bipolar Disorder
Genetic Epidemiology of Rheumatoid Arthritis
See also: The North American Rheumatoid Arthritis Consortium (NARAC)
Seven Examples of Applied Genetic Epidemiology of Complex Diseases in Human Molecular Genetics by Strachan & Read
Research at Melbourne University Centre for Genetic Epidemiology
Research at Erasmus MC Genetic Epidemiology Program (Netherlands)
MSc in Genetic Epidemiology
MSc in Bioinformatics
MSc in Genetic Epidemiology and Bioinformatics
Software
Genetic Epidemiology Programs for Stata
Comprehensive List of Genetic Analysis Software
NCBI: Software for Genetic Linkage Analysis
Statistical Analysis in Genetic Epidemiology (S.A.G.E.) Software Package
Software written by London Statistical Genetics Group
Software by UCLA Human Genetics Department (incl Mendel)
Computational Genetics Laboratory (Gene-gene Interaction and Epistasis Analysis Software) (NH,
Statistical Genetics Programs by JH ZHAO (MRC Epidemiology Unit,
Statistical Genetics Programs by D CURTIS (Barts,
Statistical Genetics Programs by D CLAYTON (STATA)
Statistical Genetics Programs by F DUDBRIDGE
Statistical Genetics Programs by M STEPHENS
Statistical Genetics Programs by J PRITCHARD
Statistical Genetics Programs by G ABECASIS
Statistical Genetics Programs by NJ COX
Statistical Genetics Programs by A THOMAS
Statistical Genetics Programs by S PURCELL
Statistical Genetics -SAS- Programs by M KNAPP
Statistical Genetics Programs by JR GONZALES (CREAL, Barcelona)
Statistical Genetics Programs by CHAN M-S
Statistical Genetics Programs by J MARCHINI
Mathematical Genetics Programs by G McVEAN
Quantitative Genetic Epidemiology Programs
Tools for Genome Mapping of Disease (Kwiatkowski Laboratory,
Wellcome Trust Centre for Human Genetics
UAB SOPH Section on Statistical Genetics
UAB SOPH Section on Statistical Genetics: Association & TDT Programs
Carnegie Mellon University Bioinformatics & Statistical Genetics Group Software (Pittsburgh, USA)
University of Pittsburgh Division of Statistical Genetics Software (Pittsburgh, USA)
University of Pittsburgh Medical Center Computational Genetics Laboratory (Pittsburgh, USA)
North Carolina State University Statistical Genetics Software (NC, USA)
University of Michigan Center for Statistical Genetics Software (MI, USA)
Computational Genetics Laboratory at the Norris-Cotton Cancer Center and Dartmouth Medical School (NH, USA)
University of Southampton Genetic Epidemiology Software (Southampton, U.K.)
Mayo Clinic Statistical and Genetic Epidemiology (Rochester, MN, USA)
Gene[VA] Tools for Genetic Data Analysis (AGP, Universite de Geneve, Switzerland)
Mathematical Genetics Software (Oxford, UK)
Linkage Disequilibrium Software
Statistical Genetics and Marker Assessment Programs
Genome-wide Association Study Software
Edinburgh University Medical Genetics Software
Pedigree Analysis for Genetics (PANGAEA)
Variation Discovery Resource Software (UW-FHCRC)
Software and Databases at the Institute of Cancer Research
MRC Rosalind Franklin Centre for Genomic Research -RFCGR- Software Compilation
Software at Dyer Laboratory for Population Genetics (including AMOVA)
Software at Erasmus MC Genetic Epidemiology Program (Netherlands)
Genetic Calculation Applets by Knud Christensen (including heritability and variance components)
Online Encyclopedia of Genetic Epidemiology Studies Software
S.A.G.E. MERLIN MENDEL ARLEQUIN AMOVA LOTUS (manual; Nickolov & Milanov, 2007) MDR (tutorial; Hahn, 2003) UNPHASED (manual; ref) PHASE CLUMP CLUMPHAP BLADE (Bayesian Haplotype LD Mapping) BLADE (ONLINE) PDT SimWalk2 (alternative) GenTools SNPTagger TAG 'n' TELL WinBUGS GENECOUNTING (ONLINE) EHPLUS PHYLIP STRUCTURE & STRAT L-POP ADMIXMAP ANCESTRYMAP AFBAC GOLD HAPLOVIEW (tutorial) MIDAS SCORE (R) Genotype2LDBlock TFPGA GENEPOP GDA DISEQ SHEsis (ref) PyPop ONLINE HWE and ASSOCIATION TESTING (SNP) SNPator SNPLINK SNP Control SNPedia MedRefSNP PRESTO OEGE HWE CALCULATOR TDT & S-TDT E-TDT ET-TDT FBAT software (manual) DMLE-LD Mapping ALLASS LDMAP TRIMHAP WHAP
GWAS-related software: PLINK & BEAGLE WGAviewer (Ge, 2008; Workshop Notes) EIGENSOFT (PCA) GenABEL (tutorial) SNPassoc (Ref) MAVEN (Ref) INTERSNP (Ref)
GWAS databases: List of recent GWAS in OEGE & HuGENet; NHGRI (NIH): catalog of GWAS (PPT); Japan: GWAS Database; Open Access Database of GWAS Results: File 1 & File 2 (Johnson & O’Donnell, 2009)
Address for bookmark:
http://www.dorak.info/epi/genetepi.html
M.Tevfik Dorak, MD, PhD
Last updated on 11 January 2010