GENETIC EPIDEMIOLOGY GLOSSARY
M.Tevfik DORAK
Accompanying Genetic Epidemiology Lecture Note & Presentation
ACCE project (analytic validity, clinical validity, clinical utility, ELSI): A CDC-sponsored project for
evaluating data on emerging genetic tests. It takes its name from the four
components of evaluation: analytic validity, clinical validity, clinical
utility and associated ethical, legal and social implications (ELSI). For
details, see ACCE
Project website. See also Grosse
& Khoury, 2006 for the clinical utility of
genetic testing, Offit, 2008 for issues surrounding genomic disease
profiling, and Pharoah, 2008 for the possible utility of genomic
profiling in breast cancer risk assessment.
Additive genetic model: In a disease association study, if the
risk conferred by an allele is increased r-fold for heterozygotes
and 2r-fold for homozygotes, this corresponds to an
additive model (Lewis, 2002).
These data are best analyzed using the Armitage trend
test for genotype frequencies or by logistic regression in which the genotypes
are represented as (-1), 0, (+1). This genotype-based association test does not
require the locus to be in Hardy-Weinberg
equilibrium. In the case of an association with heterozygosity, the
additive model test may be statistically non-significant despite the presence of
an association. Thus, a non-significant additive model test does not rule out
an association. It has been pointed out that “genes do not generally act in a
simple additive manner but through complex networks involving gene-gene and
gene-environment interactions” (Colhoun, 2003). The effect that cannot be explained by
an additive (or heterogeneity /
non-interactive) model in complex disease genetics is due to the dominance (epistatic / interactive) model. See also multiplicative genetic model. See MODEL-online tool for genetic association analysis for different models.
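The trend test mentioned in this entry can be sketched in a few lines. The function below is an illustrative implementation of the Cochran-Armitage trend statistic for a 2 x 3 genotype table; the function name, the argument layout and the example counts are assumptions made for this sketch, not part of any cited software.

```python
import math

def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases / controls: counts for genotypes carrying 0, 1 and 2 copies
    of the allele of interest. weights: genotype scores (assumed additive).
    Returns the 1-df chi-square statistic and its P value.
    """
    R = sum(cases)
    S = sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    T = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    var_T = (R * S / N) * (
        sum(w * w * n * (N - n) for w, n in zip(weights, totals))
        - 2 * sum(weights[i] * weights[j] * totals[i] * totals[j]
                  for i in range(3) for j in range(i + 1, 3))
    )
    chi2 = T * T / var_T
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Hypothetical counts in which only heterozygotes are over-represented in cases:
print(armitage_trend_test((10, 40, 10), (20, 20, 20)))
```

With the hypothetical heterozygote-only table the statistic is zero, which illustrates why a non-significant additive test does not rule out an association. Coding the genotypes as (-1, 0, +1) instead of (0, 1, 2) gives the same statistic, because the test depends only on the spacing of the scores.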
Additive variance: The component of genetic variance due to
the additive effects of alleles segregating in the population. In evolutionary
genetics, additive genetic variance is a measure for the potential amount of evolutionary
change caused by natural selection. See Genetic
Calculation Applets: Additive
Variance Calculator and Genetic
Variance from a Single Locus.
Admixture mapping (mapping by admixture linkage disequilibrium-MALD): An association-based
approach to localizing
disease-causing variants that differ in frequency between two
historically separated populations by a whole-genome scan. Fundamental to the use of admixture mapping
is the knowledge that the disease of interest exhibits frequency differences
across the two populations because of genetic differences. See Collins-Schramm,
2002; Smith,
2004; McKeigue, 2005; Seldin,
2007.
Adoption studies: A design for assessing the proportion of variance due to genetic and
environmental sources. The assumptions are that the resemblance between an
adopted child and biological parent is due only to genetic effects, while that
between the adopted child and the adoptive parent is only environmental in
origin. The important issues in the interpretation of adoption studies are that adoptees are a highly selected group of children, that age at
adoption varies widely, and that contact may have been maintained between adoptees and their biological parents. The Colorado Adoption Project is a rare
and successful example of a full adoption study.
Affected family-based controls (AFBAC) method: One of several family-based association study designs (Thomson, 1995). This design uses both affected members of a family (when there are two) and takes the allele or alleles not transmitted to the affected case(s) as controls. See also HRR and TDT (AFBAC Software; FBAT software & manual).
Affected pedigree member (APM) method: Like the ASP method, a nonparametric and model-free method for testing linkage. This method compares observed allele sharing patterns among affected pedigree members (not necessarily siblings) against those expected under random assortment. An affected relative pair statistic is given by a weighted sum of the frequencies of observed identical-by-descent configurations.
Affected sibpair (ASP) method: A linkage
study design that tests for excess sharing of marker alleles identical by
descent in affected-affected sibpairs. This method is
often described as a nonparametric and model-free alternative to the parametric
lod score method.
Allele sharing method (affected sibling pairs
method): A non-parametric test of linkage that does not require assumptions about the
model of inheritance, penetrance or disease allele frequency. Two siblings have
a 50% chance of inheriting an allele identical by descent (IBD). If the disease
is genetically determined, two affected siblings are more likely to have
inherited a common disease allele from one or both parents at the disease
locus. So if affected siblings inherit IBD alleles at a given locus more often
than expected by chance, the shared alleles are likely to be
responsible for the disease, or in linkage with the trait allele (see Engelmark, 2004).
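The idea of excess IBD sharing can be illustrated with a simple "mean test". The sketch below assumes that the IBD status of every affected sib pair (0, 1 or 2 alleles shared) is known without ambiguity, which real marker data rarely guarantee; the function name and the counts are hypothetical.

```python
import math

def asp_mean_test(n_ibd0, n_ibd1, n_ibd2):
    """Mean IBD-sharing test for affected sib pairs.

    Under no linkage, pairs share 0, 1 or 2 alleles IBD with
    probabilities 1/4, 1/2 and 1/4, so the expected proportion of
    alleles shared is 0.5.
    """
    n = n_ibd0 + n_ibd1 + n_ibd2
    # Observed proportion of alleles shared IBD across all pairs.
    share = (n_ibd1 + 2 * n_ibd2) / (2 * n)
    # Null variance of the per-pair proportion (values 0, 0.5, 1) is 0.125.
    se = math.sqrt(0.125 / n)
    z = (share - 0.5) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # excess sharing only
    return share, z, p_one_sided

# Hypothetical data: 100 affected sib pairs.
print(asp_mean_test(10, 50, 40))
```

A one-sided test is used because only excess sharing is evidence for linkage.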
Alternative explanation: Chance effect (random error), bias
(systematic error) and confounding are always alternative explanations for any
observed association. See also Bias and
Confounding Lecture Note and Presentation.
Analysis of molecular variance (AMOVA): A statistical (analysis of variance)
method for analysis of molecular genetic data. It is used for partitioning
diversity within and among populations using nucleotide sequence or other
molecular data. AMOVA produces estimates of variance components and F-statistic
analogs (designated as phi-statistics). The significance of the variance
components and phi-statistics is tested using a permutational
approach, eliminating the normality assumption that is inappropriate for
molecular data (Excoffier, 1992). AMOVA can be performed with Arlequin
or AMOVA. For examples, see Roewer, 1996; Sawkins, 2001; Stead, 2003;
Watkins, 2003; see also AMOVA Lecture Note (EEB348); AMOVA
& Population Differentiation; SMART Module on AMOVA.
Ancestry informative markers (AIMs):
Genetic markers that show large differences in frequency across population
groups. These loci are useful in ancestry determination and can be used in
admixture mapping or admixture-matching in
case-control studies. For sets of AIMs used in such
studies, see Shriver,
2003; Choudhry, 2006; Tsai,
2006; Seldin, 2006; Tian, 2006 & 2007.
Association: A statistically significant correlation
between an environmental exposure, a trait or a biochemical/genetic marker and
a disease or condition. An association may be an artifact (random error-chance,
bias, confounding) or a real one. Genotyping errors also cause spurious
associations. In population genetics, an association may be due to confounders
including population stratification (confounding by ethnicity), linkage disequilibrium, reverse
causation or direct causation. Association studies may prove useful in
identifying a genetic factor in a disease. A significant association should be
presented together with a measure of the strength of association (odds ratio, relative risk or relative hazard and its 95% confidence interval)
and when appropriate, with a measure of potential impact (attributable risk,
prevented fraction, attributable fraction/etiologic fraction).
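As an illustration of reporting the strength of an association, the following sketch computes an odds ratio and its 95% confidence interval from a 2x2 table using the log-based (Woolf) approximation. The function name and the counts are assumptions for this example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical table: marker-positive vs marker-negative cases and controls.
print(odds_ratio_ci(20, 80, 10, 90))
```

The approximation assumes that no cell is zero and that counts are not very small; exact methods are preferable for sparse tables.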
Balancing selection: Selection involving opposing forces in which selective
advantages and disadvantages cancel each other out. Heterozygote
advantage (or overdominant selection) is an example in which an allele
selected against in the homozygous state is retained because of the superiority
of heterozygotes. Other balanced states may occur
including when: an allele is favored at one developmental stage and is selected
against at another (antagonistic pleiotropy); an
allele is favored in one sex and selected against in another (sexual
antagonism); an allele is favored when it is rare and selected against when it
is common (negative frequency dependent selection).
Beavis effect: Upward bias in significant quantitative trait locus
(QTL) effects in a genome scan. This overestimation of QTL effects is a statistical artefact named
after William D Beavis following his simulation studies (see Xu,
2003). See also Rockman & Kruglyak, 2006
and Gibson, 2002.
Bias: An estimator for a parameter is unbiased if its expected value is the
true value of the parameter. Otherwise, the estimator is biased. Bias is the
quantity E(q-hat) - q. If the expected value of the estimate of q is the same as the actual but
unknown q, the estimator is unbiased (as in estimating the mean of normal,
binomial and Poisson distributions). If bias tends to zero as n gets
larger, this is called asymptotic unbiasedness. One
of the more recent biases reported in genetic epidemiology is
differential bias in genotype scoring (half-call rate) between case and control
DNA samples in large studies using automated genotype scoring algorithms (Clayton, 2005). See reviews by Sackett, 1979; Choi & Noseworthy, 1992; Grimes
& Schulz, 2002; Campbell, 2002; Potter, 2003; Delgado-Rodriguez
& Llorca, 2004 (Bias Glossary); Bias & Confounding in
Molecular Epidemiology; and Bias of Ascertainment in Complex Disease Genetics. See also Bias and
Confounding Lecture Note & Presentation.
Carrier: A healthy person who is a heterozygote for a recessive trait. Also includes
persons with balanced chromosomal translocations. The unfortunate use of
‘carrier’ to describe individuals positive for a
genetic marker is wrong, and the use of ‘carrier frequency’ in that context
should be replaced by ‘marker frequency’.
Carter effect: Higher incidence of a genetically
determined condition in relatives when the index case is the less commonly
affected sex. This phenomenon was first demonstrated in Dr Cedric
Carter's study of pyloric stenosis, where
the incidence is highest in the sons of affected women and lowest in daughters
of affected men.
Case-control study: A design preferred over cohort studies for relatively rare
diseases in which cases with a disease or exposure are compared with controls
randomly selected from the same study base. This design yields the odds ratio
(as opposed to relative risk from cohort
studies) as the measure of the
strength of association. See Case-control
Studies Chapter in Epidemiology
for the Uninitiated, Epidemiologic
Study Designs (PPT), and Case-Control
Genetic Association Studies in EFG
Summer School.
Case-only design: A study design that
is used to assess deviations from purely multiplicative interactions (Begg & Zhang, 1994; Khoury & Flanders, 1996; Botto & Khoury, 2004).
The case-only design has been shown to be more efficient for detecting gene and
environment interaction than case–control studies (Piegorsch, 1994; Goldstein
& Andrieu, 1999). It estimates
departure from multiplicative risk ratios (if genotype and environmental
exposure are not associated in the population) as opposed to odds or rate ratio
(Schmidt
& Schaid, 1999). The method cannot be
used as a substitute for traditional case-control studies since it is limited
to the detection of interactions only. It has higher power than traditional
designs in detection of gene-gene and gene-environment interactions (Yang,
1997 & 1999;
Gauderman, 2002).
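A minimal numerical sketch of the case-only idea follows. It assumes, as the entry states, that genotype and exposure are not associated in the source population; under that assumption the genotype-by-exposure odds ratio among cases alone estimates the multiplicative interaction that a case-control analysis would obtain as a ratio of two odds ratios. Function names and counts are hypothetical.

```python
def table_or(g1e1, g1e0, g0e1, g0e0):
    """Odds ratio between genotype (g1/g0) and exposure (e1/e0) in one group."""
    return (g1e1 * g0e0) / (g1e0 * g0e1)

def case_only_interaction(cases):
    """Case-only estimate of multiplicative gene-environment interaction."""
    return table_or(*cases)

def case_control_interaction(cases, controls):
    """Interaction estimate that also uses controls (ratio of odds ratios)."""
    return table_or(*cases) / table_or(*controls)

# Hypothetical counts ordered as (G+E+, G+E-, G-E+, G-E-).
cases = (60, 40, 90, 110)
controls = (20, 80, 30, 120)   # genotype and exposure independent: OR = 1
print(case_only_interaction(cases), case_control_interaction(cases, controls))
```

When the independence assumption fails in the population, the two estimates diverge and the case-only estimate is biased.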
Causal relationship: No matter how small it is, a P
value does not signify causality. To establish a causal relationship, the
following non-statistical evidence is required: consistency (reproducibility),
biological plausibility, dose-response, temporality (when applicable) and
strength of the relationship (as measured by odds ratio/relative risk/hazard
ratio). See Hill's
Criteria of Causality & Seven
Common Errors in Statistics.
Coalescence time: Number of generations to the most recent
common ancestor carrying a mutation or DNA variant currently present in a given
population. See Gil McVean’s lecture on the
coalescent. See a lecture note on Introduction to
Coalescent Theory.
Codominance: Equal effect on the phenotype of two alleles
of the same locus (as opposed to recessive and dominant).
Cohort effect: The tendency for persons born in certain
years to carry a relatively higher or lower risk of a given disease. This may
have to be taken into account in case-control studies. For example the
penetrance of BRCA1 is greater for women born after 1930 than for those born
earlier (Narod, 1993), thus, the risk of breast cancer has
increased among BRCA1 mutation carriers.
Cohort study: A longitudinal follow-up study which
begins with a group of people who do not have the trait of interest at the
outset but a proportion of whom will develop it during the follow-up. The outcome
is modeled for the explanatory variables to obtain the relative risk. Cohort studies may be historical or prospective. See
Cohort
Studies Chapter in Epidemiology
for the Uninitiated; Epidemiologic
Study Designs (PPT).
Common-disease-common-variant (CDCV) hypothesis: This hypothesis predicts that the genetic risk for common diseases will often be due to disease-predisposing alleles with relatively high frequencies; there will be one or a few predominating disease alleles at each of the major underlying disease loci (Lander, 1996; Chakravarti, 1999; Weiss & Clark, 2002; Becker, 2004). The hypothesis speculates that the gene variation underlying susceptibility to common heritable diseases existed within the founding population of contemporary humans. Whether the CDCV hypothesis is true for most diseases is not yet known but there are a few prototypical examples: the APOE ε4 allele in Alzheimer disease, Factor V Leiden in deep venous thrombosis and PPARγ Pro12Ala in type II diabetes. Recent studies have also shown the importance of rare variants in complex disease genetics (Liu, 2005; Kryukov, 2007).
Complementation: The production of a wildtype phenotype in
spite of recessive mutations in two different genes because of the presence of
normal copies of those genes on homologous chromosomes. If recessive mutations represent alleles of the same gene, this would
be compound heterozygosity and would not complement each other to produce a
wildtype phenotype because they both represent loss-of-function of the same
gene. Deafness in humans can be caused by a recessive mutation at a number of
genes, so it is not uncommon for two deaf parents to have children who hear.
Complex disease: The term complex trait/disease refers to
any phenotype that does not exhibit classic Mendelian
inheritance attributable to a single gene, although such traits may exhibit familial tendencies
(familial clustering, concordance among relatives). The contrast between Mendelian diseases and complex diseases involves more than
just a clear or unclear mode of inheritance. Other hallmarks of complex
diseases include known or suspected environmental risk factors; seasonal, birth
order, and cohort effects; late or variable age of onset; and variable disease
progression. See Genetic Epidemiology for Complex Disorders: Principles and Practice
(NHLBI
Webcast) and Rannala,
2001 for a comprehensive review of complex disease genetics.
Compound heterozygote: An individual who is affected with an autosomal
recessive disorder having two different mutations in the same gene on
homologous chromosomes; that is, an individual in whom each of the two
alleles of the same locus carries a different mutation (for a recessive
disorder). The C282Y and H63D mutations of HFE frequently occur as compound
heterozygosity.
Confounding: The distortion of a measure of association
because of a non-intermediate factor that is correlated with the variable of
interest and independently associated with the outcome. An analysis done on observations that all
have the same value of the confounder will not be confounded. This can be
achieved by stratification for the confounder or by matching. See Taioli, 2002; Potter, 2003; Bias & Confounding in
Molecular Epidemiology; Bias &
Confounding (PPT).
Confounding variable: A variable that
is associated with both the outcome and the exposure variable. A classic example is the relationship
between heavy drinking or gambling and lung cancer.
Here, the data should be controlled for smoking as it is related to both
drinking/gambling and lung cancer. A positive confounder is related to exposure
and response variables in the same direction (as in smoking); a negative
confounder shows an opposite relationship to these two variables (age in a
study of association between oral contraceptive use and myocardial infarction
is a negative confounder). The data should be stratified before analysis if
confounding is suspected. The Mantel-Haenszel test is designed to analyze stratified data to
control for a confounding variable. Alternatively, a multivariable regression
model can be used to adjust for the effects of known confounders. The best
strategy to avoid confounding is randomization. See Bias and Confounding (PPT).
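The effect of stratification can be shown with the pooled Mantel-Haenszel odds ratio (the summary estimate that accompanies the test, not the chi-square statistic itself). The sketch below uses made-up counts in which the exposure has no effect within either stratum of the confounder, yet the crude table suggests an association; function names and data are assumptions for illustration.

```python
def mantel_haenszel_or(strata):
    """Pooled Mantel-Haenszel odds ratio.

    strata: list of (a, b, c, d) tuples, one 2x2 table per level of the
    confounder, with a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

def crude_or(strata):
    """Odds ratio from the collapsed table, ignoring the confounder."""
    a, b, c, d = (sum(table[i] for table in strata) for i in range(4))
    return (a * d) / (b * c)

# Hypothetical strata, for example smokers and non-smokers.
strata = [(80, 40, 20, 10), (10, 20, 40, 80)]
print(crude_or(strata), mantel_haenszel_or(strata))
```

In this constructed example the crude odds ratio is 2.25 while the stratum-adjusted estimate is 1.0, so the apparent association is entirely due to the stratifying variable.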
Copy number variation (CNV): Gains and losses of genomic segments
resulting in variation in the number of copies of a genomic region or gene per
diploid genome. Most genes show this variation and study of disease associations
with CNV is becoming common. The reference gene in CNV studies is commonly RNase P (RPPH1), which invariably exists in two copies in
the human diploid genome. See Redon, 2006; Estivill & Armengol, 2007; Sanger Institute CNV
Project; Database of Genomic
Variants; ABI
TaqMan® Gene Copy Number Assays.
Cramer’s V: This measure of the strength of association for contingency tables of any
size is a transformation of the Chi-squared value that adjusts for
sample size. It provides a value between 0 and 1 for relative comparison of the
strength of associations. For a 2x2 table, Cramer's V is equal to the absolute value of the Phi coefficient. Cramer’s V is most
useful for large contingency tables and it can be used as a global linkage
disequilibrium value for multiallelic loci (global D’
is another measure for multiallelic loci; both can
be calculated on UNPHASED (manual; ref)). See GOLD-Disequilibrium
Statistics; Online Cramer’s V calculation.
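The transformation can be written out directly. The following sketch computes the Pearson chi-square for an r x c table and rescales it by the sample size and the smaller table dimension; the function name and the example tables are assumptions, and all expected counts are assumed to be non-zero.

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))  # smaller of the two dimensions
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical 3 x 3 table of allele pairings at two multiallelic loci.
print(cramers_v([[30, 5, 5], [5, 30, 5], [5, 5, 30]]))
```

Dividing by n and by (k - 1) is what keeps the value between 0 and 1 regardless of sample size and table dimensions.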
Crossing-over (recombination): The exchange of genetic material
between non-sister chromatids of homologous
chromosomes (i.e., between maternal and paternal chromosomes) during meiosis.
This results in a new and unique combination of genes on the daughter
chromosome, which will be passed on to the offspring (if that particular gamete
is involved in fertilization). See a Demonstration
of Crossing-Over (JAVA Applet) and Genetic
Linkage Tutorial by F Clerget-Darpoux.
Dominance: In classic genetics, dominance is the property possessed
by some alleles of determining the phenotype by masking the effects of the
other allele (when heterozygous). Thus, homozygosity or heterozygosity for
the dominant allele results in the same phenotype in complete dominance (if red
is dominant over white, the petals of a flower heterozygous for red and white
would be red). Incomplete dominance appears as a blend of the phenotypes
corresponding to the two alleles (like pink petals as opposed to red or white).
In co-dominance, both alleles equally contribute to the phenotype (red and
white petals occur together). See also recessive.
Dominance variance: The component of genetic variance due to non-additive effects of alleles at the same locus (Cockerham, 1954). This component represents all genetic effects other than the additive effects and includes intra-locus allelic interactions. This component is commonly ignored in analysis of genetic associations but can be calculated without much trouble. Dominance variance modeling should not be mixed up with dominant models. See an applet on Genetic Variance from a Single Locus at HGSS.
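For a single biallelic locus, the additive and dominance components can be computed from the allele frequency and the genotypic values. The sketch below uses the standard quantitative-genetics parameterization (genotypic values +a, d and -a) and assumes Hardy-Weinberg proportions; the function name and the numbers are illustrative assumptions and are not taken from the applet mentioned above.

```python
def single_locus_variance(p, a, d):
    """Additive and dominance variance at one biallelic locus under HWE.

    p: frequency of allele A1 (q = 1 - p is the frequency of A2)
    a: half the difference between the two homozygotes
    d: dominance deviation of the heterozygote from the midpoint
    Genotypic values: A1A1 = +a, A1A2 = d, A2A2 = -a.
    """
    q = 1 - p
    alpha = a + d * (q - p)                 # average effect of allele substitution
    v_additive = 2 * p * q * alpha ** 2
    v_dominance = (2 * p * q * d) ** 2
    return v_additive, v_dominance

# Hypothetical locus with partial dominance.
va, vd = single_locus_variance(p=0.3, a=1.0, d=0.5)
print(va, vd, va + vd)
```

The two components sum to the total genotypic variance at the locus, and the dominance component vanishes when d = 0, that is, when the heterozygote lies exactly midway between the homozygotes.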
Dominant allele: An allele that masks an
alternative allele when both are present (in heterozygous form). Homozygous
dominant and heterozygous genotypes contribute the same to the phenotype. Most
common autosomal dominant diseases are due to
mutations in transcription factor genes (Jimenez-Sanchez,
2001). See Clinical Genetics.
Dominant model: A genetic association analysis model that examines association with a dominant allele. The comparison groups are wild-type homozygous genotypes vs allele positivity (combining heterozygotes and homozygotes for the variant). See MODEL-online tool for genetic association analysis for different models. See also Lewis, 2002.
Dominant-negative mutation: A (heterozygous) dominant mutation on one allele blocking the activity of wild-type protein still encoded by the normal allele (often by dimerizing with it) causing a loss-of-function phenotype. The phenotype is indistinguishable from that of homozygous dominant mutation. P53 mutations may act as dominant-negative (see also haploinsufficiency). See Clinical Genetics.
Dosage compensation: The phenomenon in women, who have two copies of genes on the X chromosome, of having the same level of the products of those genes as males (who have a single X chromosome). This is due to the process of random inactivation of one of the X chromosomes in females (Lyonization).
Effect modification: The situation in which a measure of effect changes over values of another variable (the association estimates are different in different subsets of the sample). The relative risk or odds ratio associated with exposure will be different depending on the value of the effect modifier. For example if in a disease association study, the odds ratios are different in different age groups or in different sexes, age or sex are effect modifiers. Effect modification is highly related to statistical interaction in regression models. Where an exposure decreases the risk for one value of the effect modifier and increases the risk for another value of effect modifier, this is called crossover (Thompson, 1991). See also Bias and Confounding Lecture Note and Presentation.
EM algorithm: A method for calculating maximum
likelihood estimates with incomplete data. The E (expectation) step computes the
expected values for missing data and the M (maximization) step computes the maximum
likelihood estimates assuming complete data; the two steps are repeated until the estimates converge. It was first used in genetics (Ceppellini R et al, 1955) to estimate allele
frequencies from phenotype data when genotypes are not fully observable (this requires
the assumption of HWE and calculation of expected genotypes from phenotype
frequencies). See ARC
CIGMR: EM Algorithm.
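A textbook illustration of this use is the estimation of ABO allele frequencies from blood group (phenotype) counts, where genotypes AA and AO, or BB and BO, cannot be told apart. The sketch below is a minimal version under the HWE assumption; the choice of the ABO system, the function name, the starting values and the counts are assumptions made for illustration.

```python
def abo_em(n_a, n_b, n_ab, n_o, tol=1e-12, max_iter=1000):
    """EM estimates of ABO allele frequencies (p: A, q: B, r: O)
    from phenotype counts, assuming Hardy-Weinberg equilibrium."""
    n = n_a + n_b + n_ab + n_o
    p, q, r = 1 / 3, 1 / 3, 1 / 3          # arbitrary starting values
    for _ in range(max_iter):
        # E-step: split phenotypes A and B into expected genotype counts.
        n_aa = n_a * p * p / (p * p + 2 * p * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q * q / (q * q + 2 * q * r)
        n_bo = n_b - n_bb
        # M-step: count alleles as if the genotypes had been observed.
        p_new = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q_new = (2 * n_bb + n_bo + n_ab) / (2 * n)
        r_new = 1 - p_new - q_new
        converged = abs(p_new - p) + abs(q_new - q) < tol
        p, q, r = p_new, q_new, r_new
        if converged:
            break
    return p, q, r

# Hypothetical phenotype counts for groups A, B, AB and O.
print(abo_em(4500, 1300, 600, 3600))
```

The example counts were constructed to match HWE expectations for allele frequencies of 0.3, 0.1 and 0.6, so the iterations should settle close to those values.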
Environment: Almost anything that is not genetic.
Environmental factors include diet (food, preservatives, coloring, composition
of diet and amount); air (clean air, smog, pollution, tobacco, workplace
chemical fumes, dust, humidity, temperature); radiation (sunlight, tanning lights,
X rays, microwaves, radio waves); infectious agents (bacteria, viruses, fungi,
parasites), hormonal exposures and in utero environment.
Epigenetics: The study of heritable changes in gene
expression that occur without a change in DNA sequence. Epigenetic phenomena
such as imprinting
and paramutation
violate Mendelian principles of heredity. Epigenetic
studies link genotype to phenotype working out the chain of processes. See Epigenetics:
Special Issue of Science, 2001; a review by Petronis,
2001; a lecture by Shuk-mei Ho.
Epistasis: The original
meaning was related to the genetic interaction of two or more genes that encode
enzymes catalyzing steps in a common pathway. It has come to be synonymous with
almost any type of gene interaction. The formal definition is 'genetic variance due
to non-additive effects of alleles at distinct loci'; thus,
it is included in the dominance variance component. The most extreme form of epistasis (interaction) results in a multiplicative model
in which the total risk is the product of the individual risks at each locus
(or allele). See a Review on Epistasis by Cordell; Commentary
on Epistasis by JH Moore.
Epistatic interaction: In genetic epidemiology, an epistatic effect is the modification of the risk conferred
by one marker by the presence of a marker from an unrelated gene (unlinked
gene-gene interaction). For examples, see Kajiwara, 1994 (retinitis pigmentosa); Olson,
2002; Pastor,
2003; Robson,
2004 (Alzheimer Disease); and Martin,
2002 (KIR3DL
in HIV-AIDS); a review on epistatic interaction (Cordell, 2002);
Epistasis Blog and Software
at Computational Genetics Laboratory.
Evolutionary-based haplotype association: An association
study design which uses haplotypes grouped together based on their evolutionary
(cladistic) relationships. Use of ancestral haplotype groups in association studies is an efficient
way to increase power (Templeton, 1987; 1995; 2000; Schork,
1998; Seltman, 2003; Fejerman,
2004; Tzeng, 2005).
Ewens-Watterson neutrality test: Also called E-W homozygosity statistics. Described by
Ewens (1972)
and Watterson (1978). A widely used test in population genetics to detect selection
acting on a locus. It compares the observed homozygosity of a given locus (Fo; the sum of the squared frequencies of its alleles)
with the expected Fe value based on the number of alleles in
the locus of interest, neutrality expectations and random mating assumption. The
comparison yields a normalized deviate of homozygosity (Fnd), which is the difference between Fo and Fe scaled by the standard deviation of Fe.
Values close to zero mean that the locus is evolving under neutrality
(genetic drift only) and there is no selection. Values of Fnd significantly different from
zero suggest selection. When Fo
> Fe, the locus is undergoing purifying selection, and
when Fe > Fo,
the locus is under balancing selection (very common for HLA loci) (see Nielsen, 2001,
Luikart, 2003, Harris
& Meyer, 2006 for reviews). Alternative tests for
neutrality include Tajima's D (Tajima, 1989)
and Slatkin's exact test for neutrality (Slatkin, 1996; Slatkin & Muirhead, 2000).
See also Basic Population
Genetics.
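The observed homozygosity used in the Ewens-Watterson test is easy to compute from allele counts. The sketch below calculates Fo as the sum of squared allele frequencies and a normalized deviate, (Fo - Fe) / SD(Fe). The expected value Fe and its standard deviation must come from neutral-model simulations or published tables; the values used here are placeholders, and the function names are assumptions.

```python
def observed_homozygosity(allele_counts):
    """Fo: sum of squared allele frequencies at one locus."""
    total = sum(allele_counts)
    return sum((count / total) ** 2 for count in allele_counts)

def normalized_deviate(f_obs, f_exp, sd_exp):
    """(Fo - Fe) / SD(Fe); Fe and its SD are supplied by the user."""
    return (f_obs - f_exp) / sd_exp

# Hypothetical locus with four equally frequent alleles.
f_o = observed_homozygosity([25, 25, 25, 25])
# Placeholder neutral expectation and SD (NOT derived from real simulations).
print(f_o, normalized_deviate(f_o, f_exp=0.40, sd_exp=0.10))
```

A negative deviate (Fo below Fe, alleles more even than expected) points toward balancing selection, and a positive deviate toward purifying selection.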
Expressivity: The range of
phenotypes resulting from a given genotype (cystic fibrosis, for example, may
have a variable degree of severity). This is different from penetrance and from pleiotropy, which refers to a variety of different phenotypes
resulting from the same genotype.
Extended haplotype homozygosity (EHH) test: The frequency of an allele corresponds to
its age, which in turn, correlates with decay of LD with alleles of adjacent
loci (an old allele has high frequency and is expected to show low LD with
adjacent loci). The EHH test compares the age of an allele based on its
frequency with its age based on its extended haplotype recombination. High
frequency alleles in the middle of a high LD region (haplotype block) represent
positive selection as opposed to neutral alleles that take a long time to reach
high frequency accompanied by low LD with adjacent loci. For discussion and
examples of EHH test, see Mueller
& Andreoli, 2004, Miretti, 2005 and Wang, 2005.
See also EHH web-tool.
External validity: The extent to which a study’s findings
apply to populations other than the one that was being investigated. See also internal validity.
F1: First filial (son or daughter) hybrids arising from
a first cross. Subsequent generations are denoted by F2, F3
etc. In animal quantitative trait locus (QTL) mapping studies, two animals with extremes of the phenotype (like
lowest and highest blood pressure) are mated to generate F1 and then
F1 x F1 matings produce an F2
generation with a wide spectrum of the phenotype, which is then used for
mapping studies.
Falconer's multifactorial liability threshold model: Originally described and modeled
in an analysis of polydactyly in guinea pigs (Wright
S, 1934) and applied to human genetics by Douglas Falconer (Falconer
DS. The inheritance of liability to certain diseases, estimated from the
incidence among relatives. Ann
Hum Genet 1965;29:51-76; see
also Falconer,
1967; Fraser
FC 1976 & 1980).
Nicely explained in Falconer's
polygenic threshold model for dichotomous nonmendelian
characters in
Human
Molecular Genetics. See also a Lecture Note
by Dr R Tissot; Genetic
Calculation Applets: Calculator
for Heritability in Threshold Traits; Understanding
the Threshold Model and an example by Wanstrat & Wakeland.
Founder effect: Coalescence of a mutation or DNA variant
in a given population to one of the original population founders or his/her
descendant.
Genotype-environment (GxE) interaction: This term refers to both the modification of genetic risk factors by
environmental risk and protective factors, and the role of specific genetic
risk factors in determining individual differences in vulnerability to
environmental risk factors. When GxE interaction is
present, a specific environmental change influences the outcome in different
ways depending on the genotype. This requires inclusion of a multiplicative
interaction term into the statistical model. For reviews, see Ottman,
1996; Heath
& Nelson, 2002; Cooper, 2003; Hemminki, 2006a (PDF)
& 2006b;
Understanding
Gene-Environment Interactions; Environment,
Genes, and Cancer; Online
Book (Costa & Eaton): Gene-Environment Interactions. For an
example, see Carbone, 2007.
Genetic epidemiology: Genetic epidemiology is the
epidemiological evaluation of the role of inherited causes of disease in
families and in populations; it aims to detect the inheritance pattern of a particular
disease, localize the gene and find a
marker associated with disease susceptibility. Gene-gene and gene-environment
interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology
includes family studies, molecular epidemiologic studies with genetic
components, and more traditional cohort and case-control studies with family
history components. See Genetic Epidemiology
Lecture Note and Presentation.
Genetic heterogeneity: Distinct alleles at the same or different loci that
give rise independently to the same genetic disease. In clinical
settings genetic heterogeneity refers to the presence of a variety of genetic
defects which cause the same disease, which may be mutations at different
positions in the same gene, a finding common to many human diseases (including
Alzheimer disease, cystic fibrosis, lipoprotein lipase deficiency and polycystic kidney
disease).
Genome-wide association study (GWAS): Simultaneous investigation of up to one
million genetic variants covering the whole genome in complex genetic diseases
(Clark, 2005; Wang, 2005; Pearson, 2008;
McCarthy,
2008). See NIH guide to GWAS; Presentation by G McVean; Lecture
Note by D Clayton; Presentation
by S Chanock; the WTCCC
GWAS (PDF);
a list of recent GWAS in OEGE and catalog of GWAS in NHGRI (NIH). PLINK, BEAGLE, GenABEL and SNPassoc are commonly used statistical analysis
packages for GWAS. See also the Max-rank approach (Li, 2008) for
ranking associations, and WGAviewer (Ge,
2008; Workshop
Notes) for annotating, visualizing, and interpreting the full set of
P values emerging from a GWAS. EIGENSOFT can be
used to detect population stratification by PCA algorithm in GWAS (Price, 2006).
See Potential Criteria for Standardized
Reporting of GWAS Results in Johnson & O’Donnell,
2009; GWAS
Database & Open Access Database of GWAS Results: File
1 & File
2 (Johnson &
O’Donnell, 2009); Statistics
for GWAS Laurent Briollais @ Bioinformatics.ca: PDF
| PPT.
Genomic control: One method to adjust for population stratification bias in case-control association studies is to use a 'genomic control markers' panel (Reich & Goldstein, 2001). The panel consists of 20-50 polymorphic markers unlinked to the loci of interest. The information obtained from unlinked markers may be used in a variety of ways (genomic control, structured association, latent-class approach). The adjustment requires some statistical manipulation (Pritchard, 1999 & 2000; Bacanu, 2000; Devlin, 2001; Reich & Goldstein, 2001; Ardlie, 2002; Devlin, 2004; Purcell, 2004; Shi, 2004; Shmulewitz, 2004; Hao, 2004; Fu, 2005), which can be handled using a variety of statistical approaches (UPMC genomic control software; STRUCTURE & STRAT; ADMIXMAP; L-POP).
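One widely used form of the adjustment rescales the association statistic by an inflation factor estimated from the unlinked markers. The sketch below assumes the median-based variant (the median of the null 1-df chi-square statistics divided by the theoretical median, about 0.4549); other variants, such as a mean-based factor, exist. The function name and the numbers are illustrative assumptions and do not reproduce any of the software packages listed above.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df

def genomic_control(null_chi2, candidate_chi2):
    """Deflate a candidate 1-df chi-square by the inflation factor
    (lambda) estimated from unlinked null markers."""
    lam = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # never inflate the candidate statistic
    return lam, candidate_chi2 / lam

# Hypothetical chi-square values from a (very small) panel of null markers.
null_panel = [0.2, 0.6, 0.9098, 1.4, 2.0]
print(genomic_control(null_panel, candidate_chi2=8.0))
```

A real panel would contain at least the 20-50 markers mentioned above; five values are used here only to keep the example short.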
Genotype: The two alleles inherited at a specific locus. If the alleles are the same,
the genotype is homozygous, if different, heterozygous. In genetic association
studies, genotypes can be used for analysis as well as alleles or haplotypes.
Genotype relative risk (GRR): The risk of disease for one genotype at a
locus versus another. It is usually assessed as having one copy of the allele
of interest (Aa) vs having none (AA), which is GRR1; and having two copies
of the allele (aa) vs
having none, which is GRR2. In simple statistical analysis this is achieved by
using dummy variables for each genotype, selecting the genotype AA as referent
and obtaining odds ratios for other genotypes Aa and aa. Most of the
time, what is presented is actually genotype odds ratio. See Schaid & Sommer, 1993; Risch & Merikangas, 1996;
Camp,
1997.
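The dummy-variable approach described above reduces, in the simplest unadjusted case, to comparing the case/control odds of each genotype with the odds for the referent genotype AA. The sketch below illustrates this with hypothetical counts; the function name and the data are assumptions for illustration only.

```python
def genotype_odds_ratios(cases, controls):
    """Return odds ratios for Aa and aa relative to the referent genotype AA.

    cases, controls: dicts of genotype counts keyed by 'AA', 'Aa', 'aa'.
    """
    ref_odds = cases["AA"] / controls["AA"]
    return {g: (cases[g] / controls[g]) / ref_odds for g in ("Aa", "aa")}

# Hypothetical genotype counts.
cases = {"AA": 100, "Aa": 120, "aa": 45}
controls = {"AA": 200, "Aa": 160, "aa": 40}

ors = genotype_odds_ratios(cases, controls)
print(ors)  # genotype odds ratios corresponding to GRR1 (Aa) and GRR2 (aa)
```

With these counts the odds ratio for aa equals the square of the odds ratio for Aa, which is the pattern expected under a multiplicative model.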
Haplotype: Linear arrangements of alleles on the same chromosome that have been inherited as a unit. A person has two haplotypes for any such series of loci, one inherited maternally and the other paternally. A haplotype may be characterized by a single allele unless a discrete chromosomal segment flanked by two alleles is meant. See a discussion of the use of haplotypes as opposed to individual SNPs: Clark, 2004.
Haplotype blocks: A chromosomal region with high linkage disequilibrium and low haplotype diversity. Probably flanked by recombinational hotspots, haplotype blocks are shorter in African populations (average 11kb) than in other populations (average 22kb) (Gabriel, 2002). Haplotype block lengths correlate with recombinational rate (Greenwood, 2004) but most haplotype-block boundaries do not occur at hotspots (Wall, 2003). All pairs of polymorphisms within a block are expected to show high linkage disequilibrium. Haplotype blocks are useful in association studies and a representative set of haplotype tagging SNPs can be used instead of the whole set of polymorphisms within a block (Zhang, 2004). Haploview is the most popular software for haplotype block analysis (Barrett, 2005) (see documentation and tutorial). HapBlock (Zhang, 2005), HaploBlock Finder and SNPTagger can also be used for haplotype block partitioning. For a review, see Cardon, 2003.
Haplotype
relative risk (HRR) method:
This method uses non-inherited parental haplotypes of affected persons as the
control group and thus eliminates the potential problems of using unrelated
individuals as controls in case-control association studies. Haplotyping is not necessary to use this method; it can be
used for allelic associations. See Falk
& Rubinstein, 1987; Knapp,
1993; Terwilliger & Ott, 1992.
HapMap (International
Haplotype Mapping Project): A major international effort designed to obtain
a map of haplotype blocks, the specific SNPs
that identify the haplotypes (htSNPs) and linkage
disequilibrium patterns in European, African and Asian populations. See HapMap website, HapMap
Genome Browser (v.B36), HapMap
description, HapMap User Guide, HapMap Webcast and a comprehensive review on HapMap
& Genetics of Common Disease in JCI by Manolio et
al (2008).
Hardy-Weinberg
equilibrium (HWE): In
an infinitely large population, gene and genotype frequencies remain stable as
long as there is no selection, mutation, or migration. For a bi-allelic locus
where the gene frequencies are p and q: p² + 2pq + q² = 1.
HWE should be assessed in controls in a case-control study and any deviation
from HWE should alert for genotyping errors (Gomes,
1999; Lewis, 2002)
but see also Zou & Donner, 2006.
Relying only on HWE tests to detect genotyping errors is not recommended as
this is a low power test (Leal,
2005). (Online HWE Analysis; HWE and Association Testing for SNPs in Case-Control Studies; HWE
Tutorial in Life,
7th Ed;
Basic Population Genetics).
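A simple way to assess HWE in controls is a goodness-of-fit chi-square test comparing observed genotype counts with those expected from the estimated allele frequencies (p², 2pq, q²). The sketch below assumes a bi-allelic locus with both alleles present; the function name and counts are hypothetical, and the resulting statistic would be compared with a chi-square distribution with 1 degree of freedom (critical value about 3.84 at the 0.05 level).

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Goodness-of-fit chi-square for HWE at a bi-allelic locus.

    Assumes both alleles are present (otherwise an expected count is zero).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, chi2

# Hypothetical control genotype counts with a deficit of heterozygotes.
p, q, chi2 = hwe_chi_square(400, 400, 200)
print(round(p, 2), round(q, 2), round(chi2, 2))
```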
Haseman-Elston regression: A sib-pair test for linkage between a
quantitative trait and a marker locus (Haseman & Elston, 1972).
It is a classical regression method using the squared sib-pair trait difference
as a dependent variable and the proportion of shared alleles identical by
descent by the sib pair as an independent variable, where a statistically
significant negative regression coefficient suggests linkage. Since then it has
been extended to multiple quantitative loci (Tiwari, 1997; Stoesz, 1997); revisited to incorporate information
from full sibs and other pairs of relatives (Elston, 2000); applied to X-linked traits (Wiener,
2003); and further modified to increase its power (Wang,
2004).
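The original regression can be sketched as an ordinary least-squares fit of the squared sib-pair trait difference on the proportion of alleles shared identical by descent. The example below computes only the slope; the trait values and IBD proportions are hypothetical, and the significance test on the slope is not shown.

```python
def haseman_elston_slope(trait_pairs, ibd_sharing):
    """Least-squares slope of squared sib-pair difference on IBD proportion."""
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = list(ibd_sharing)
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical trait values for six sib pairs and their IBD sharing (0, 0.5 or 1).
pairs = [(10, 14), (12, 9), (11, 13), (15, 14), (10, 10.5), (13, 13)]
ibd = [0, 0, 0.5, 0.5, 1, 1]

slope = haseman_elston_slope(pairs, ibd)
print(slope)  # a negative slope is in the direction expected under linkage
```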
Heritability: The proportion of the phenotypic
variability due to genetic variance [(narrow-sense) h² = (additive)
genetic variance / total phenotypic variance]. Can be locus-specific or for all
loci combined. A high h² does not mean that the trait cannot be
influenced by environment. In a different environment the same h²
may not be that high. Heritability does not indicate the degree to which a
trait is genetic; it measures the proportion of the phenotypic variance that is
the result of genetic factors (see Introduction
to Quantitative Genetics, Human
Genetics Interactive Learning Exercises, Effect of
Heritability on Response to Selection, Polygenic Inheritance;
Genetic
Calculation Applets: Calculator
for Heritability in Threshold Traits).
Heterozygosity: Presence of two different alleles at a
locus in a diploid organism (see homozygosity). It is the result of
inheritance of different alleles from parents. For relevance of heterozygosity
in disease states, see Beckman,
1990; Vockley, 2000; Vladutiu,
2001. Rarely, only the heterozygous genotype, but neither homozygous genotype, causes a
disease. For a review, see van Heyningen, 2004.
Heterozygote advantage: Also called overdominance (a form of balancing selection) and
opposite of underdominance (homozygote advantage).
For an example and a list of all known examples, see Gemmell & Slate, 2006 and Supplemental
Table 1. Genome-wide heterozygosity has been reported to confer
advantage for common diseases (Campbell, 2007)
and in particular, in cancer (Assie, 2008). See also balancing selection.
High-throughput genotyping: Simultaneous genotyping of large numbers
of samples. Most machines can run 4x96 (384) samples simultaneously (SNP
typing, real-time PCR, sequencing) with a queuing system that would allow
automatic continuation of the typing. A number of companies perform SNP
high-throughput genotyping (including K-Bioscience
and GeneService in
Homozygosity: Presence of two identical alleles at a locus in a
diploid organism (see heterozygosity).
It is the result of inheritance of identical alleles from both parents.
Homozygosity mapping: Recessive diseases require two copies of
an allele for expression. Because of linkage disequilibrium, loci surrounding
the disease locus will tend to be homozygous in affected individuals. Searching for homozygous segments in diseased individuals helps to
locate the disease gene. This is called homozygosity mapping (Lander
& Botstein, 1987).
htSNP: Haplotype-tagging SNP. See also SNP Haplotype Tag Calculator and SNPTagger.
Identity by descent (IBD): Alleles that trace back to a shared
ancestor. For sibs, refers to inheritance of the same allele from a given
parent.
Interaction: If the effect of one factor depends on the level of
another factor, these two factors are said to interact. Factors A and B
interact if the effect of factor A is not independent of the level of factor B.
For example, when there are two main effects on a response variable, if their
combined effect is higher than the sum of their main effects, they have an
interaction (meaning a simple additive model is not sufficient to account for
the observed data and a multiplicative term must be added). Also, there would
be an interaction between the factors sex and treatment if the effect of
treatment is not the same for males and females in a drug trial. Interaction is
closely linked with effect modification in epidemiology. See Wikipedia: Statistical
Interaction.
Internal validity: Internal validity is determined by the
presence or absence of systematic error (bias) that causes the study findings
to differ from the true values. A study that suffers from non-causal reasons
for an association between an exposure and outcome (bias, confounding and
serious random error) lacks internal validity. See also external validity.
Kin-cohort study: A study design for estimation of
penetrance of a disease mutation. Individuals with and without family histories
are included in the study sample and the family histories of the mutation
carriers are compared with the family histories of the non-carriers. This
design works only when the carrier frequency is more than 1% and when a founder
effect is present (i.e., no genetic heterogeneity). Described by Wacholder, 1998.
Linkage: The tendency of 'genes' on the same chromosome to segregate together.
This means that linked genes are transmitted to the same gamete more than 50%
of the time. Genetic linkage reflects a lack of meiotic crossovers between two
genes, one of which is usually a latent/unknown disease locus. A number of
software packages are available to analyze linkage in pedigree data; the most commonly used
ones are Linkage, Genehunter and Allegro (Genetic Analysis Software
list and A Survey
of Current (2003) Software for Linkage Analysis by F Dudbridge). See exercises on Gametes
under Linkage and Linkage
Pedigrees. For a general review, see genetic
linkage in Kimball’s
Biology and Tutorial
by F Clerget-Darpoux. See also quasi-linkage.
Linkage disequilibrium (LD): Two alleles at different loci that occur
together on the same chromosome (or gamete) more often than would be predicted
by random chance. It is a measure of co-segregation of alleles in a population.
It is also called 'gametic association' and is said to be nonzero if
multilocus gamete frequencies are different from the
product of allele frequencies at each locus. For details, see Basic
Population Genetics; for software, see Genetic Epidemiology.
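The definition above can be made concrete for two bi-allelic loci: the disequilibrium coefficient D is the difference between the observed gamete (haplotype) frequency and the product of the allele frequencies. The sketch below also reports the standardized measures D' and r², which are standard summaries not defined in this entry; the function name and frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for alleles A and B at two bi-allelic loci.

    p_ab: frequency of the gamete carrying both A and B
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    # The maximum attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies: the A-B gamete is more common than expected (0.25).
d, d_prime, r2 = ld_measures(0.40, 0.5, 0.5)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```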
Linkage
disequilibrium (LD) mapping of disease genes: Marker loci near a disease gene are often observed to be in LD with
the disease; that is, the relative frequencies of marker alleles in affected
individuals differ from those in the general population. LD occurs because each
new disease-predisposing mutation originally appears on a single chromosome.
Individuals who inherit a disease mutation are likely to also inherit the
alleles of the original chromosome, at neighboring marker loci. Because
recombination with the disease gene happens less often for nearby marker loci,
markers in the immediate vicinity of the gene should remain in greater
disequilibrium than more distant marker loci and this is the basis of
associations with the disease. See Terwilliger & Weiss, 1998; Lazzeroni,
1998 & 2001;
Pritchard,
2001; Tishkoff, 2002; Jorde,
2003 and GENESTAT: LD
Mapping.
LOD score: The LOD score method for testing linkage was first proposed by Morton
in 1955 (Morton,
1955). LOD stands for 'logarithm of odds'; however, it is not the logarithm of the odds for linkage but the logarithm of the
likelihood ratio for a particular value of the recombination fraction vs. free
recombination (θ = 0.5) (Elston, 1998; Borecki, 2001). Thus, the LOD score serves as a test of
the null hypothesis of free recombination versus the alternative hypothesis of
linkage. It is a statistical measure of the likelihood that two genetic markers
occur together on the same chromosome and are inherited as a single unit of
DNA. Determination of LOD scores requires pedigree analysis and a score of
>+3 is traditionally taken as evidence for linkage (and -2 may mean the
opposite). Linkage is between two genetic loci but not alleles. An example is
the linkage between the hemochromatosis gene (HFE) and HLA-A. This means that within the same family all affected subjects
will have the same HLA-A
allele, i.e., there will be no recombination between HFE and HLA-A. LOD score has nothing to do with linkage disequilibrium.
See also Significance
of LOD Scores by Dave
Curtis and a presentation on LOD Score.
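For the simplest case of phase-known, fully informative meioses (an assumption made here for illustration; real pedigrees need dedicated linkage software), the likelihood ratio described above can be written out directly. The function name and the counts are hypothetical.

```python
from math import log10

def lod_score(recombinants, meioses, theta):
    """log10 likelihood ratio for a given theta versus free recombination (0.5).

    Assumes phase-known, fully informative meioses.
    """
    non_recombinants = meioses - recombinants
    likelihood_theta = theta ** recombinants * (1 - theta) ** non_recombinants
    likelihood_null = 0.5 ** meioses
    return log10(likelihood_theta / likelihood_null)

# Hypothetical data: 1 recombinant among 10 informative meioses.
lod = lod_score(1, 10, 0.1)
print(round(lod, 2))
```

In practice the LOD score is evaluated over a range of recombination fractions and the maximum is reported; evaluating the function at 0.5 returns 0 by construction.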
Major
gene: A gene whose
variant(s) confer a high lifetime risk of a disease. The penetrance of a major
gene might be conditional on the presence of the relevant variant of a modifier
gene. All high-penetrance cancer predisposition genes (BRCA1, BRCA2, TP53, APC, MSH2, MLH1, PTEN, CDKN2A etc) are major
genes. For an example, see Narod, 2002.
Manifesting heterozygotes: A heterozygote for a recessive autosomal gene mutation or a female heterozygote for a
recessive sex-linked gene mutation who has the same
phenotype as homozygotes for the same mutation (or as
a hemizygote male in the case of sex-linked
mutation). Manifesting heterozygotes usually have a
milder form of the phenotype and may only have biochemical signs without
clinical phenotype. This situation is an exception rather than a rule but
occurs in a proportion of heterozygotes for most
major autosomal recessive disease genes: CYP21A2
(Witchel, 1997), HFE (Bulaj, 1996; Burt, 1998),
CFTR (Super, 1999),
ATM (Fearon, 1997; Scott, 2002) and McArdle disease (Manfredi, 1993) are among the examples. See Medline,
OMIM
and Google
searches for manifesting heterozygotes; see also Clinical Genetics.
Marker frequency: In model-free analysis of an association
study, the use of allele frequencies is not favored and it is recommended that
marker frequencies (frequency per number of subjects) be used in comparisons (Svejgaard, 1994; Sasieni, 1997). The use of allele frequencies is appropriate
when a multiplicative model is hypothesized (and when the locus is in
Hardy-Weinberg equilibrium) and the use of marker frequencies implies a
dominant model is being tested. See also Lewis, 2002.
Markov Chain Monte Carlo (MCMC) strategy: A randomized computational approach for
identifying the most likely among many possible models. For MCMC applications in biostatistics, see
Gelman, 1996. MCMC algorithms have been used in
segregation and linkage analysis (Heath,
1997), analysis of association with polymorphic loci (Sham,
1995; CLUMP),
LD estimation (Ayres,
2001), haplotype construction (Stephens,
2001; PHASE),
and multilocus association analysis (Kilpikari, 2003; BAMA). WinBUGS is freely available software for MCMC
applications.
Mendelian
gene: A gene with a strong
effect on phenotype, giving rise to a (near) one-to-one correspondence between
genotype and phenotype. Phenotypes caused by such a gene are
called Mendelian traits or Mendelian
(single-gene) conditions.
Mendelian
randomization: A
natural randomization process that occurs at conception to determine a person's
genotype. It is possible to use 'Mendelian
randomization' to derive an estimate of the association that is free of the
confounding and reverse causation typical of classical epidemiology. According to the second law of Mendel
(random assignment of genes), the inheritance of one trait is
independent of the inheritance of other traits. The distribution of
genetic polymorphisms is largely unrelated to the confounders (socioeconomic or
behavioral) that distort interpretations of observational
epidemiological studies. The basis of Mendelian
randomization is best seen in parent–offspring designs that study
the way phenotype and alleles co-segregate during transmission from
parents to offspring. This study design is closely
analogous to that of randomized clinical trials as by Mendelian
principles there should be an equal probability of either allele being
randomly transmitted to the offspring. Due to Mendelian
randomization, genetic association studies are less prone to confounding than
conventional risk-factor epidemiology (pleiotropy and
linkage disequilibrium can still produce confounding; see Lee
& Ho, 2003). The Mendelian randomization concept can be used as a tool for epidemiological inference on
environmental risk factors by examining the genetic counterpart of a suspected
environmental exposure association free of confounding by conventional
confounders (Davey-Smith & Ebrahim, 2003;
Khoury, 2004; Davey Smith, 2005). See also a commentary
on Mendelian Randomization by F Cambien.
Meta-analysis: A systematic approach
yielding an overall answer by analyzing a set of studies that address a related
question. This approach is best suited to questions that remain unanswered
after a series of studies. Meta-analysis provides a weighted average of the
measure of effect (such as odds ratio). The rationale is to increase the power
by analyzing the sets of data. The selection of studies to include in a
meta-analysis study is the main problem with this approach. Funnel Plot is an informal method
to assess the effect of publication bias in this context. See also Introduction to
Meta-Analysis by the Cochrane Collaboration; Meta-Analysis
by Genstat; Meta-Analysis
in Epidemiology by Stroup et al (2000); Methods
for Meta-Analysis in Medical Research by AJ Sutton; Introduction
to Meta-Analysis by Borenstein et al
(2009), and Online
Meta-Analysis Tests. See also a comparative study of meta-analysis and consortium
studies in genetic associations (Janssens, 2009).
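The weighted average mentioned above can be sketched with the fixed-effect inverse-variance method, in which each study's log odds ratio is weighted by the reciprocal of its variance. This is only one of several weighting schemes and is an assumption here; the function name and the study results are hypothetical.

```python
from math import exp, log, sqrt

def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance pooling of odds ratios.

    studies: list of (odds_ratio, standard_error_of_log_odds_ratio) tuples.
    Returns the pooled odds ratio and its 95% confidence interval.
    """
    weights = [1 / se ** 2 for _, se in studies]
    log_ors = [log(odds) for odds, _ in studies]
    pooled_log = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    ci = (exp(pooled_log - 1.96 * pooled_se), exp(pooled_log + 1.96 * pooled_se))
    return exp(pooled_log), ci

# Hypothetical results from three association studies.
studies = [(1.5, 0.2), (1.2, 0.1), (2.0, 0.4)]
pooled, ci = pooled_odds_ratio(studies)
print(round(pooled, 2), tuple(round(x, 2) for x in ci))
```

The most precise study (smallest standard error) dominates the pooled estimate, which is why the result sits close to 1.2 in this example.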
Microsatellite: A DNA variant due to tandem repetition of
a short DNA sequence (usually two to four nucleotides). Also called short
tandem repeat (STR). As multiallelic markers, they
provide higher polymorphism
information content (PIC) than SNPs (see Schaid, 2004 for a comparative study). Average length
of LD with microsatellites is 100kb which is
considerably higher than for SNPs (Bahram & Inoko, 2007). It
is therefore possible to do a whole genome association study using 30K microsatellites (Tamiya, 2005).
Migrant studies: Studies on migrants based on the assumption
that in migrants genetic components remain the same but environment has
changed. If the rates of disease among migrants change in the new environment,
this is taken as evidence for environmental influence. Considerations in the
interpretation of migrant studies include the following: migrants are a highly
selected group (usually younger, healthier and of higher socioeconomic
status), age at migration varies
(exposure to relevant environmental factor may have already occurred) and most
migrants may retain their lifestyle (environmental) factors. Successful
examples of migrant studies are the increased colon cancer incidence in the
Japanese migrants to USA (Cancers
in Asian-Americans & Pacific Islanders: Migrant Studies; see also Parkin & Khlat, 1996; Kolonel, 2004) and decreased risk of multiple sclerosis
in migrants from high to low altitude countries in the first two decades of
their lives (Gale
& Martyn, 1995).
Misclassification: Errors in the classification of
individuals by phenotype, exposures or genotype that can lead to errors in
results. The probability of misclassification can be the same across all groups
in a study (nondifferential) or vary among groups
(differential). One group of major biases. See also Bias and Confounding Lecture Note
and Presentation.
Mode
of inheritance: The manner
in which a particular genetic trait or disorder is passed from one generation
to the next. Autosomal dominant or recessive,
X-linked dominant or recessive, multifactorial and
mitochondrial inheritance are examples. Complex traits encompass modes of
inheritance involving more than a single genetic factor, reduced penetrance and
variation due to environmental factors.
Modifier genes: Not all genes that influence the
appearance of a trait contribute equally to the phenotype: major genes have a large influence, while modifier genes have a more
subtle, secondary effect. Modifier genes alter the phenotypes produced by the
alleles of other genes. There is no formal distinction between major and
modifier genes; there is a continuum between the two and the cut-off is
arbitrary. Modifier genes may affect the action of a major gene or the trait
independently. See Narod, 2002 for modifier genes in BRCA1/BRCA2 carriers; Dipple,
2000 for modifier genes in simple Mendelian
disorders.
Multifactor dimensionality reduction (MDR): Algorithms and software for the detection
and characterization of epistasis (gene-gene
interactions) and plastic reaction norms (gene-environment interactions) in
genetic and epidemiological studies of common human diseases developed at the Computational Genetics Laboratory, Dartmouth
Medical School, Lebanon, NH, USA (MDR website; MDR software).
See also Ritchie,
2001; Hahn, 2003;
Moore,
2004. Alternative algorithms to select most interesting subsets of
polymorphisms include LOTUS (manual; Nickolov & Milanov, 2007)
and PIA (Mechanic, 2008).
Multifactorial inheritance
with a threshold: Quite
often certain characters have a discontinuous binary distribution, meaning that
they are present or not in an individual (cleft palate, pyloric stenosis, diabetes, leukemia, schizophrenia) but they are
inherited as if they were multifactorial characters;
this is due to a threshold effect that makes them appear as discontinuous. This
is called multifactorial inheritance with a
threshold. See Understanding
the Threshold Model. See also Falconer's
multifactorial liability threshold model.
Multiplicative genetic model: In a disease association study, if the
risk conferred by an allele is increased r-fold for heterozygotes
and r²-fold for homozygotes, this
corresponds to multiplicative model (Lewis, 2002).
These data should be analyzed using the allele frequencies (Sasieni, 1997).
See also Additive genetic model. See MODEL-online tool for genetic association analysis for different models.
Multivariable analysis: As opposed to univariable
analysis, statistical analysis performed in the presence of more than one
explanatory variable to determine the relative contributions of each is (or
should be) called multivariable analysis (in practice, however, it is called
univariate and multivariate analysis more frequently). It is a method to
simultaneously assess contributions of multiple variables or adjust for the
effects of known confounders. Multiple linear regression, multiple logistic
regression, proportional hazards analysis are examples of multivariable
analysis, which has no similarity whatsoever to multivariate analysis.
See a review on Multivariable Methods by
MH Katz (and a book on Multivariable
Analysis by MH Katz).
Multivariate analysis: Methods to deal with more than one related
'outcome/dependent variable' (like two outcome measures from the same
individual) simultaneously with adjustment for multiple variables (covariates).
When there is more than one dependent variable, it is inappropriate to do a
series of univariate tests. Hotelling's T²
test is used when there are two groups (like cases and controls) with multiple
dependent measures, and multivariate analysis of
variance (MANOVA) is used for more than two groups. Unfortunately, the word
'multivariate' is most frequently used instead of 'multivariable' analysis
(which means multiple independent/explanatory variables but one
outcome/dependent variable). (see multivariate analysis book,
notes,
lecture notes,
slide
presentation, glossary and MultiVariate
Statistical Package-MVSP).
Multivariate analysis of variance (MANOVA): An extension of Hotelling's
T² test to more than two groups with related multiple outcome
measures. Groups are compared on all variables simultaneously (rather than
one-by-one as ANOVA does).
Mutation: Any heritable change (not only point
mutation) brought about by an alteration in the genetic material. Includes gene conversion, deletion, duplication, insertion and so
forth. Mutation is preferred to polymorphism to describe a disease
causing gene variation regardless of its frequency. Link to Human Gene
Mutation Database (
Nomenclature (reports): Any report of a human genetic study should
conform to the requirements of HUGO
Gene Nomenclature Committee - Guidelines
and HGVS - Nomenclature for the Description of
Sequence Variations (see also den
Dunnen & Antonarakis,
2000; Wildeman, 2008). The current NCBI policy on disease
names is not to use (‘s) in them: see OMIM entries for Alzheimer
Disease, Down
Syndrome, Crohn disease and Hodgkin
Lymphoma, for example.
Non-mendelian gene: A gene with some but not a strong effect
on phenotype, giving rise to significant overlap of genotype distributions and
lack of one-to-one correspondence between genotype and phenotype.
Odds ratio (OR): Also known as relative odds and
approximate relative risk. It is the ratio of the odds of the risk factor in a
diseased group and in a non-diseased (control) group (the ratio of the
frequency of presence / absence of the marker in cases to the frequency of
presence / absence of the marker in controls). The interpretation of the OR is
that the risk factor increases the odds of the disease ‘OR’ times. OR is used
in retrospective case-control studies (relative
risk (RR) is the ratio of proportions in two groups which can be estimated
in a prospective (cohort) study). These two and relative hazard (or hazard
ratio) are measures of the strength/magnitude of an association. As opposed to
the P value, these do not change with
the sample size. OR and RR are considered interchangeable when certain
assumptions are met, especially for large samples and rare diseases. Odds ratio
is calculated as ad/bc where a,b,c,d are the entries in a 2x2 contingency table
(hence the alternative definition as the cross-product ratio). In logistic
regression, the coefficient b corresponds to the natural logarithm (ln) of the odds
ratio. There are statistical methods to test the homogeneity of odds ratios (Online Odds-Ratio Calculation (with 95% CI);
Odds Ratio-Relative Risk
Calculation).
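The cross-product calculation can be sketched as follows, with a = marker present in cases, b = marker absent in cases, c = marker present in controls and d = marker absent in controls. The 95% confidence interval uses the standard error of the log odds ratio (Woolf's method), which is a standard choice but an assumption here; the function name and counts are hypothetical.

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc from a 2x2 table, with a 95% confidence interval.

    a, b: marker present / absent in cases
    c, d: marker present / absent in controls
    All four cells must be non-zero.
    """
    estimate = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - 1.96 * se_log)
    upper = exp(log(estimate) + 1.96 * se_log)
    return estimate, (lower, upper)

# Hypothetical counts: 40/100 cases and 20/100 controls carry the marker.
estimate, (lower, upper) = odds_ratio(40, 60, 20, 80)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```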
Overlapping
Genes: Genes that are encoded
on the sense and anti-sense strands of the same chromosome region in opposite
direction (for an example, see CYP21A2
and TNXB;
Tee,
1995). Overlapping genes are frequent in viruses, plasmids and
phages, which need to pack a lot of information into a small, compact genome (HIV is
an example). Degeneracy of the genetic code facilitates the presence of
overlapping genes. It has been suggested that overlapping gene groups are more
likely to be disease-associated (Karlin, 2002).
Overmatching bias: When cases and controls are matched by a non-confounding variable that is associated with the exposure but not with the disease, this is called overmatching. Overmatching can lead to underestimation of an association. For a numerical example, see slides 41-49 in the Case-Control Studies presentation by Chen. See also Bland & Altman, 1994 and Sorensen & Gillman, 1995. Matching should only be considered for confounding variables, but such known confounding can also be controlled at the analysis phase in an unmatched design.
Penetrance: The proportion of individuals with a given genotype
(heterozygotes for a dominant gene) who express an
expected trait, even if mildly. If a disease gene is not causing the disease in
all its carriers, its penetrance is low [not to be mixed with variable expression].
BRCA1 mutations show both
age-dependent penetrance and overall reduced penetrance, the lifetime risk for
a female mutation carrier being estimated at around 70%. Breast cancer is also
an example of an autosomal condition where penetrance
is sex-dependent. While male mutation carriers can develop breast cancer
(particularly with BRCA2 mutations),
females are at much greater risk. HFE has a very
low penetrance, which is age and sex-dependent.
Permutation Test: A statistical approach to examine
statistical significance of associations based on
PHASE: Haplotype construction from multilocus population data software that employs a Markov
Chain Monte Carlo algorithm based on the coalescent model (Stephens,
2001). The newer version fastPHASE is for faster
haplotype reconstruction and estimation of missing genotypes from population
data (Scheet & Stephens, 2006). See PHASE or fastPHASE Download and PHASE
Documentation. See also UNPHASED
(manual;
ref).
Phenocopy: A non-genetic condition resembling a genetically determined one. Such conditions confound the interpretation of pedigrees and therefore genetic counseling. Some teratogens may cause congenital anomalies mimicking genetically caused anomalies (thalidomide syndrome vs phocomelia). Deafness is another example: it may be genetic (autosomal or sex-linked) or non-genetic (rubella embryopathy).
Phenotype: The visible or measurable (i.e., expressed) characteristics of an organism.
Pleiotropy: The potential for genotypes to have more
than one specific phenotypic effect.
Polymorphism: The existence of two or more
variants at a locus. Conventionally, the prevalence in the
population should be above 1% to be referred to as a polymorphism;
if prevalence is below this, variants are referred to as mutations
(especially if they are disease-causing ones). Because of the confusion between
polymorphism and mutation, the Human
Genome Variation Society recommends the use of 'sequence variant',
'alteration' or 'allelic variant' for any genomic change regardless of their
frequency or phenotypic effects. Polymorphism at a genetic locus is due to
either balanced polymorphism (heterozygous advantage, frequency-dependent
selection) or non-equilibrium states (temporary
polymorphism) as occurs during frequency-dependent selection and genetic drift
(alleles becoming fixed or extinct).
Polymorphism
information content (PIC): An index of informativeness
of a genetic marker which takes into account the number of alleles and their
frequencies (Botstein,
1980; Guo & Elston, 1999). For
details, see a Lecture
Note.
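A minimal sketch of the PIC calculation, using the formula of Botstein et al. (1980): one minus the sum of squared allele frequencies, minus the sum over all allele pairs of 2·pᵢ²·pⱼ². The function name and the allele frequencies are hypothetical.

```python
def pic(freqs):
    """Polymorphism information content from a list of allele frequencies."""
    sum_squares = sum(p * p for p in freqs)
    cross_term = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1 - sum_squares - cross_term

# Hypothetical markers: a SNP with two equally frequent alleles and a
# microsatellite with four equally frequent alleles.
pic_snp = pic([0.5, 0.5])
pic_microsatellite = pic([0.25, 0.25, 0.25, 0.25])
print(pic_snp, pic_microsatellite)
```

The multiallelic marker yields the higher value, in line with the comparison of microsatellites and SNPs made elsewhere in this glossary.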
Population genetics: The branch of genetics that deals with
frequencies of alleles and genotypes in breeding populations. It also deals
with selective influences on the genetic composition of the population (links
to freeware population genetic data analysis software: Arlequin 2000, PopGene, GDA, Genetix, Tools for Population Genetic Analysis, GenePop, GeneStrut, SGS, GenAlEx; WinPop,
Quanto,
features of data analysis software; lectures on population genetics). See also Basic Population Genetics.
Population
stratification: An example of 'confounding by
ethnicity' in which the co-existence of different disease rates and allele
frequencies within population sub-sections leads to a spurious association at
the population level. Differing allele frequencies in ethnically different
strata in a single population may lead to a spurious association or 'mask' an
association by artificially modifying allele frequencies in cases and controls
when there is no real association (for this to happen, the subpopulations
should differ not only in allele frequencies but also in baseline risk to the
disease being studied) (Mark,
1996; Altshuler, 1998). Confounding, cryptic relatedness
(which increases overdispersion of the test
statistics and leads to inflation of significance levels overall) and selection
bias are potential consequences of population stratification (Thomas,
2005). It is notable that the consequences of population structure on
association outcomes increase with sample size, i.e., larger sample size is
not a remedy for this issue and may make it worse (Marchini, 2004). Case-control association studies can
still be conducted by using genomic controls (Devlin, 1999; Pritchard, 1999) even when
population stratification is present. The software STRUCTURE and STRAT, ADMIXMAP or L-POP can be used
to analyze case-control data with genomic controls. See Cardon & Palmer, 2003 for an example
of spurious association due to population stratification; a presentation by
David Clayton on Confounding
by Stratification and Admixture. See presentations on Genetic Epidemiology and Pitfalls
in Genetic Association Studies.
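The mechanism can be illustrated numerically with two hypothetical subpopulations that differ in both carrier frequency and disease risk. Within each subpopulation the marker is unrelated to disease (odds ratio of 1), yet pooling them produces a strong spurious association. All counts and names below are invented for illustration.

```python
def crude_odds_ratio(a, b, c, d):
    """Odds ratio ad/bc: a, b = carriers / non-carriers among cases;
    c, d = carriers / non-carriers among controls."""
    return (a * d) / (b * c)

# Hypothetical strata. Subpopulation 1 has a high carrier frequency (80%) and
# contributes mostly cases; subpopulation 2 has a low carrier frequency (20%)
# and contributes mostly controls.
strata = {
    "subpopulation 1": (640, 160, 160, 40),
    "subpopulation 2": (40, 160, 160, 640),
}

stratum_ors = {name: crude_odds_ratio(*cells) for name, cells in strata.items()}

# Pool the strata by summing each cell of the 2x2 table.
pooled_cells = [sum(column) for column in zip(*strata.values())]
pooled_or = crude_odds_ratio(*pooled_cells)

print(stratum_ors, pooled_or)
```

The stratum-specific odds ratios are both 1, while the pooled odds ratio is well above 4, which is the kind of spurious association that genomic control and structured association methods are designed to detect.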
Predisposition gene: A gene that is necessary and
sufficient to cause a disease. This is different from a 'susceptibility gene' (neither necessary nor sufficient for disease
development).
Principal component analysis (PCA): In genetic epidemiology, PCA is
used to detect population stratification in genome-wide association studies (Price, 2006).
It is implemented in a software program called EIGENSOFT.
Proteomics: Proteomics is the study of proteins in
aggregate. It applies to the translation from the mRNA to the primary protein
products, and their maturation and modification to yield active proteins as
components of a cell, tissue or organism. The collection of proteins in a given
cell at a given stage of differentiation is called proteome. See the websites
for the Human Proteome Organization (HUPO)
and Human Proteome Project (HPP).
See also a 'Review of Proteomics with Applications to Genetic Epidemiology' (Sellers
& Yates, 2003) and Ahsan & Rundle, 2003.
Pseudo-SNP: Ectopic sequence
variants (ESVs) and paralogous
sequence variants (PSVs) (Estivill, 2002; Cheung, 2003). Pseudo-SNPs are one reason for genotyping errors and the main
non-biological reason for violation of HWE (Leal,
2005).
Publication bias: Editors and authors tend to publish articles containing positive findings in preference to papers reporting negative results. This creates a belief that there is a consistent association when this may not be the case. Plots of relative risks by study may be used to check for publication bias in meta-analyses. If publication bias is operating, one would expect that, among published studies, the larger ones report the smaller effects, as small positive trials are more likely to be published than negative ones. This can be examined using a funnel plot, in which the effect size is plotted against sample size (Sterne & Egger, 2001). In the absence of publication bias, the plot resembles a symmetrical inverted funnel, with the results of the smaller studies being more widely scattered than those of the larger studies; an asymmetrical plot suggests that publication bias may be present. One consequence of publication bias is that the first report of a given association may suffer from an inflated effect size (Ioannidis, 2001). See Publication Bias in Cochrane Collaboration.
Quantile-Quantile plot (Q-Q plot): In a GWAS, the Q-Q
plot is used to assess the number and magnitude of observed associations
compared with the expectations under no association. The nature of the deviations
from the identity line provides clues as to whether the observed associations are true
associations or may be due to population stratification, cryptic relatedness
or something else (see WTCCC GWAS (PDF); Pearson, 2008 (Figure 1); McCarthy, 2008).
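The coordinates of such a plot can be sketched as follows: observed p-values are sorted and paired with the p-values expected at the same rank under no association, both on the -log10 scale. The function name and the p-values are hypothetical; the expected value rank/(n+1) is one common plotting convention, assumed here for illustration.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    n = len(p_values)
    observed = sorted(p_values)
    points = []
    for rank, p in enumerate(observed, start=1):
        # Expected rank-th smallest p-value among n tests under the null.
        expected_p = rank / (n + 1)
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Hypothetical p-values from five markers.
p_values = [0.5, 0.0001, 0.25, 0.8, 0.03]
points = qq_points(p_values)
```

Points lying on the identity line match the null expectation; points rising above it only at the extreme end suggest a few true associations, whereas an early, uniform lift is more typical of stratification.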
Quantitative character: A character displaying a 'continuous'
phenotypic range rather than discrete classes; characters measured rather than
counted such as metabolic activity, height, length, width, arm span, body fat
content, growth rate, milk production, blood pressure. The genetic variation
underlying a continuous character distribution may be the result of segregation
at a single genetic locus or more frequently, at numerous interacting loci
which produce a cumulative effect on the phenotype (with contributions from the
environment). A gene affecting a quantitative character is a quantitative trait
locus, or QTL (should be seen as a
continuous trait locus). See also Introduction to Genetic
Epidemiology.
Quantitative genetics: The statistical study of the genetics
of quantitative characters (biometrical genetics) as opposed to Mendelian (discrete) characters. Quantitative genetic
characters are those that do not assort in a simple way in crosses. Examples
include physiological activity, behavior, size and height. A major task of
quantitative genetics is to determine the ways in which genes (QTL) interact with the environment to
contribute to the formation of a given quantitative trait distribution (and the
estimation of genetic and environmental variance). See also Quantitative
Genetics Resources; Quantitative
Genetics in Modern
Genetic Analysis; and Introduction to Genetic Epidemiology; Theory
and Practice in Quantitative Genetics (Posthuma, 2003).
Quasi-dominance: Direct transmission, generation to
generation, of a recessive trait giving the impression of dominance. It happens
if the recessive gene is frequent or inbreeding is intense.
Quasi-linkage: The non-random segregation of non-homologous
chromosomes, which can be a confounding factor in linkage studies of complex
traits. This phenomenon results in significant linkage findings between unlinked
markers. See a review by Sivagnanasundaram, 2004.
Random sampling: A method of selecting a sample from a target population or
study base using simple or systematic random methods. In random sampling, each
subject in the target population has an equal chance of being selected for the
sample. Sampling is a crucially important point in the selection of controls for a
case-control study. By randomization, systematic effects are turned into error
(term), and there is an expected balancing out effect: known and unknown
factors that might influence the outcome are assigned equally to the comparison
groups. One disadvantage of randomization is generation of a potentially large
error term. This can be avoided by using a block
design. See Basic Concepts
of Sampling and
Wikipedia: Random Sampling.
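The two sampling methods named above can be sketched in a few lines. The function names, the seed and the subject IDs are hypothetical, and the systematic version assumes the population size is a multiple of the sample size so that every subject has the same chance of selection.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct subjects, each subset of size k being equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_random_sample(population, k, seed=None):
    """Pick a random start, then every step-th subject from an ordered list."""
    rng = random.Random(seed)
    step = len(population) // k      # sampling interval
    start = rng.randrange(step)      # random start within the first interval
    return [population[start + i * step] for i in range(k)]

# Hypothetical study base of 100 subject IDs.
ids = list(range(1, 101))
simple = simple_random_sample(ids, 10, seed=1)
systematic = systematic_random_sample(ids, 10, seed=1)
```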
Randomization: Randomization of the study population in groups, differing
only for the factor of interest leads to a random distribution of known and
unknown confounders in the different groups, therefore removing potential bias
that might result in a spurious finding. See also Bias and Confounding Lecture Note
and Presentation.
Recall bias: Bias in results due to systematic
differences in the accuracy or completeness of recall of past exposures or
family history. Recall bias is one of the major groups of bias. See also Bias and Confounding Lecture Note
and Presentation.
Recessive: A trait that is not expressed in heterozygotes
(i.e., that can only be expressed in the homozygotes).
Most common recessive disease genes are those encoding metabolic enzymes (Jimenez-Sanchez,
2001). See Clinical Genetics.
Recessive model: A genetic association analysis model that examines
association with a recessive allele. The comparison groups are variant
homozygous genotypes vs the rest (combining heterozygotes for the variant and homozygotes
for the wild-type allele). See MODEL-online tool for genetic association
analysis for different models. See also Lewis, 2002.
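The grouping described above can be illustrated by collapsing genotype counts into a 2x2 table and computing an odds ratio. The genotype counts and function names are hypothetical, and the odds ratio is used only as one convenient summary; this is not the method of the MODEL tool.

```python
def recessive_table(cases, controls):
    """Collapse genotype counts given as (wild-type hom, het, variant hom)."""
    a = cases[2]                     # cases, variant homozygous
    b = cases[0] + cases[1]          # cases, all other genotypes
    c = controls[2]                  # controls, variant homozygous
    d = controls[0] + controls[1]    # controls, all other genotypes
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Odds of variant homozygosity in cases relative to controls."""
    return (a * d) / (b * c)

# Hypothetical genotype counts: (wild-type hom, het, variant hom).
cases = (40, 40, 20)
controls = (50, 40, 10)
table = recessive_table(cases, controls)
or_recessive = odds_ratio(*table)
```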
Relative recurrence risk (RRR): A measure of familial aggregation
for a disease. This is the probability
that a particular type of relative (sibling, cousin, etc.) of a proband is affected, divided by the prevalence of the
disease in the general population. These quantities are denoted by λ_R, where R denotes a relationship (S = sib,
O = offspring, DZ = dizygotic twin, etc.), and their
values are the risks of relatives of type R of affected individuals being
themselves affected, divided by the population prevalence. In general, the risk
of recurrence in first-degree relatives equals the square root of the incidence
of the disease in the general population (P^(1/2), where P = incidence in the
general population). For second- and third-degree relatives, the corresponding
figures are P^(3/4) and P^(7/8), respectively. See Genetic Epidemiology Lecture Note and Presentation.
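A small worked example of these quantities, using a hypothetical population incidence of 1%, is sketched below. The function names are illustrative, and the power-of-P figures are only the rule of thumb stated in this entry, not a general result.

```python
def relative_recurrence_risk(risk_in_relatives, population_prevalence):
    """Risk to relatives of type R divided by the population prevalence."""
    return risk_in_relatives / population_prevalence

def approx_recurrence_risk(p, degree):
    """Rule-of-thumb recurrence risk for 1st-, 2nd- or 3rd-degree relatives."""
    exponents = {1: 1 / 2, 2: 3 / 4, 3: 7 / 8}
    return p ** exponents[degree]

# Hypothetical population incidence of 1%.
p = 0.01
risk_first_degree = approx_recurrence_risk(p, 1)    # about 0.10
risk_second_degree = approx_recurrence_risk(p, 2)   # about 0.032
lambda_first = relative_recurrence_risk(risk_first_degree, p)
```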
Relative risk (RR): The ratio of the risk of the phenotype
among individuals with a particular exposure, genotype or haplotype to the risk
among those without that exposure, genotype or haplotype. Also
known as risk ratio.
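For illustration, the ratio can be computed from cohort counts as in the sketch below. The counts and the function name are hypothetical.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk among the exposed (or carriers) divided by risk among the rest."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical cohort: 30 of 100 carriers and 10 of 100 non-carriers affected.
rr = relative_risk(30, 100, 10, 100)
```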
Residual confounding: Confounding within stratum. If
stratification is used to control confounding but the strata are broad (like a
broad age range), there may still be residual confounding within stratum.
Residual confounding is also used to describe confounding from factors that are
not controlled at all or controlled but inaccurately measured.
Reverse causation: The possibility that an observed
association may actually reflect the relationship in the opposite direction.
Childhood infections are believed to reduce the risk of asthma, but 'reverse
causation', meaning that asthma may increase the risk of infections and thereby
produce the observed association, is a distinct possibility (see Pekkanen,
2004). Increased cancer risk associated with low lipid levels (Davey-Smith & Ebrahim, 2003)
and the relationship between sleeping less and obesity may be examples of
reverse causation. For a discussion of reverse causation, see Dowd & Town: Does X Really Cause Y.
R project for statistical computing: R is a language
and environment for statistical computing and graphics which can be seen as a
different implementation of the S language. R and a comprehensive set of
programs written for a variety of statistical analyses are all available as
Free Software. See the R Project Website & List of Contributed R Packages (including gap, genetics, popgen, qgen, GenABEL, SNPassoc).
Sampling: In genetic epidemiologic research, sampling is an important design
consideration. The sampling unit, the sampling method, and the sample size are
all critical. For example, sampling larger sibships
yields more power per sampled subject than sampling independent sibpairs (Todorov, 1997). Use of extremely discordant (ED) and/or
extremely discordant and concordant (EDAC) sibpairs
increases the power (Gu, 1996; Gu, 1997).
Selection bias: A bias in results due to systematic differences
between those who are selected for study and those who are not selected. See Bias and Confounding Lecture Note
and Presentation.
Short tandem repeat (STR):
See microsatellite.
Sibling recurrence risk (sibling risk ratio): The disease risk for a sibling of an
affected individual compared to the disease risk in the general population. See
relative recurrence risk.
Signal-to-noise ratio: In an association study of a complex disease,
detection of a significant signal from a single locus is diminished due to
genetic heterogeneity. This is a major problem in outbred
populations. Isolated populations such as Finland,
Iceland
and Newfoundland
with relative genetic and environmental homogeneity offer better opportunities
to detect modest signals because of the lack of too much noise.
Single nucleotide polymorphism (SNP): A
single nucleotide variation in the DNA code. It is the most common type of
stable genetic variation and usually bi-allelic. SNPs
may be silent -no change in phenotype- (sSNP), may
cause a change in phenotype (cSNP) or may be in a
regulatory region (rSNP) with potential to change
phenotype. Thus the effects of SNPs, if any, are
generally on gene expression or protein structure (Williams, 2007).
Functional changes that may be caused by SNPs are
gene transcription changes (promoter and intronic
enhancer SNPs), truncated protein (nonsense coding
region SNPs), structural changes (coding region SNPs), alternative splicing (intronic
splice site SNPs), and mRNA stability changes (3’UTR SNPs). Anonymous SNPs are the
most common ones. These are in non-coding regions and are used as genetic markers.
On average, each 1 kb of human genome contains 2-10 SNPs,
i.e., one in every 100-500 nucleotides is polymorphic; most frequently a C to T
(C>T) substitution (links to an Overview, SNP
Consortium Website, dbSNP, SNP500 Cancer, SNPator,
SNPedia, MedRefSNP, SNP Control, HapMap-B36 (Mar 2008),
Ensembl (tutorial),
GeneSNPs, MIT SNP DBase, Seattle SNPs,
Regulatory-rSNP Guide). See also Bioinformatics
Tools.
SNP@Ethnos: A database of ethnically variant SNPs (Park,
2007).
Software: Software development for genetic epidemiology studies has gained
momentum in recent years. For an up-to-date list of software, see Genetic Epidemiology.
Stata: A powerful statistical package
particularly useful for epidemiologic and longitudinal data management and
analysis. It is mainly a command driven program produced by Stata Corporation.
See the list of Stata capabilities, Stata starter kit
with learning modules
by UCLA; tutorial by
University of Essex; tutorial by Princeton
University; tutorial
by Carolina Population Center; Stata
Highlights by Notre Dame University; genetic
data analysis on Stata and Stata programs for genetic
epidemiologists.
Statistical power: The probability that
a test will produce a significant difference at a given significance level is
called the power of the test. This is equal to the probability of rejecting the
null hypothesis when it is untrue, i.e., making the correct decision. It is 1
minus the probability of a type II error. The true differences between the
populations compared (effect size), the sample size
and the significance level chosen affect the power of a statistical test.
Ideally, power should be at least 0.80 to detect a reasonable departure from
the null hypothesis. See a discussion of statistical power;
and online calculators: General Statistical Calculators Including a Power Calculator
(UCLA); Statistical Power Calculator for
Frequencies; Retrospective
Power Calculation; Genetic Power Calculator;
Wise Project Applets: Power Applet;
Downloadable calculators: CaTS (Skol, 2006), Quanto
(sample size or power
calculation for association studies of genes, gene-environment or gene-gene
interactions);
Calculation of Power for Genetic
Association Studies 'AssocPow' (Ambrosius, 2004), PS:
Power and Sample Size Calculation; and Power
& Sample Size Calculations on STATA.
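How effect size, sample size and significance level together determine power can be sketched with a normal approximation for comparing two proportions (for example, carrier frequencies in cases and controls). This particular test, the function name and the example frequencies are assumptions made for illustration; the calculators listed above use their own, more specific models.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)          # critical value for alpha
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se                     # standardized effect size
    # Probability that the test statistic falls in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical carrier frequencies: 30% in cases, 20% in controls.
power_small = power_two_proportions(0.30, 0.20, 100, 100)
power_large = power_two_proportions(0.30, 0.20, 500, 500)
```

With no true difference (p1 equal to p2), the function returns the significance level itself, which is the type I error rate rather than power in the useful sense.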
STROBE (STrengthening the Reporting of OBservational
studies in Epidemiology):
An international collaboration integrating epidemiology, statistics and
other relevant disciplines to strengthen the reporting of observational
studies in epidemiology. See the checklists for
different types of epidemiologic studies.
Susceptibility gene: A gene that is neither necessary nor
sufficient to cause a disease but increases the risk of its development. These
low-penetrance genes would be detected by association studies but would show no
evidence for linkage with the disease. See Greenberg,
1993; Greenberg
& Doneshka, 1996. (Weakly penetrant predisposing
genes may act as a susceptibility gene.)
T2 test for genome association: Instead of examining the association of a
single marker in a population-based case-control study of a complex disease,
this test measures the strength of cumulative association of multiple markers. First
described by Xiong et al (2002)
and then extended to haplotype blocks by Fan & Knapp (2003).
Transmission disequilibrium test (TDT):
A family-based test that compares the proportion of alleles transmitted (or inherited)
from a heterozygous parent to a disease-affected child. Any significant
deviation from 0.50 in the transmission ratio implies an association (Spielman, 1993 & 1994; Lewis, 2002).
See also FBAT software (manual) for family-based association tests.
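A minimal sketch of the test in its usual McNemar-type chi-square form is given below, assuming a bi-allelic marker and counts of transmissions from heterozygous parents. The counts and the function name are hypothetical; this is not the FBAT implementation.

```python
from math import erfc, sqrt

def tdt(transmitted, not_transmitted):
    """TDT for one allele, counting heterozygous parents only.

    transmitted:     times the allele was passed to an affected child
    not_transmitted: times the other allele was passed instead
    """
    total = transmitted + not_transmitted
    ratio = transmitted / total
    chi2 = (transmitted - not_transmitted) ** 2 / total   # 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return ratio, chi2, p_value

# Hypothetical counts: allele transmitted 60 times, not transmitted 40 times.
ratio, chi2, p_value = tdt(60, 40)
```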
Variant: Because of the ambiguity in the definitions of mutation and polymorphism,
any genetic change is called a sequence variation and such alleles are called
variant (see Nomenclature for
the Description of Sequence Variations and Cotton,
2001). See also polymorphism.
Whole genome amplification (WGA): Representational amplification of
total genomic DNA to increase the quantity and quality for further studies. WGA
improves amplification success with degraded DNA (Holbrook, 2005; Ballantyne,
2006). Reliability, robustness and accuracy of WGA methods in general have
been shown in genotyping of highly polymorphic loci such as HLA (Gillespie,
2000; Shao, 2004) and SNP genotyping and sequencing (Dean, 2002; Lovmar, 2003; Hosono,
2003; Alsmadi, 2003; Tranah, 2003; Shao, 2004; Yan, 2004; Bannai,
2004; Paez, 2004; Barker, 2004; Holbrook, 2005; Thompson,
2005), and WGA is particularly useful in molecular (childhood cancer) epidemiology
studies (Zheng, 2001; Yan, 2004). STR genotyping may require a little more
attention (Dickson,
2005; Ballantyne, 2006). WGA may also be used with Illumina Golden-Gate assays (Cunningham,
2008). As long as a minimum of 10 nanograms of
genomic DNA is used in WGA, SNP genotyping can be accurately performed on whole
genome amplified DNA (Lovmar, 2003; Bergen, 2005a; Bergen,
2005b; Holbrook,
2005) with the possible exception of loci near the ends of chromosomes (Tzvetkov, 2005). Commercially available WGA kits
include GenomePlex (OmniPlex
PCR-based WGA), REPLI-g
(multiple displacement amplification) and GenomiPhi (multiple displacement amplification). See WGA portal.
Genetic Epidemiology: Basic
& Advanced
Glossaries of Genome / Human Genetics
Terms
Genetics Glossary Biostatistics Glossary
Statistical Analysis
of Genetic Associations
Please
update your bookmark: http://www.dorak.info/epi/glosge.html
M.Tevfik Dorak, MD, PhD
Last updated on 4 July 2009