Understanding of basic issues and terminology in the design, conduct, analysis and interpretation of population-based genetic association studies, including twin studies, linkage and association studies

The DFPH syllabus includes a unit on genetics, and this website provides notes on that section here. Readers unfamiliar with the terminology of genetics are advised to read those notes first.

The DFPH syllabus requires that candidates have an ‘understanding of basic issues and terminology’ around genetic association studies. This page seeks to provide that. Readers seeking more detailed information are directed to the series of articles on genetic epidemiology published by The Lancet in 2005. (These can be freely downloaded at the bottom of this external page). Much of the information below is drawn from these papers.

Genetic epidemiology and association analysis

Genetic epidemiology is closely allied to traditional epidemiology, focussing on familial, and in particular genetic, determinants of disease and the joint effects of genetic and non-genetic determinants. It takes into account the biology that underlies the action of genes and the known mechanisms of inheritance to investigate the health consequences of genetic variants.

Extensive information about the human genome is now available for use in genetic epidemiology studies. Once it is known which two versions of a potentially causative gene an individual possesses, looking for an association between variants in that gene and the disease of interest is fundamentally no different from an exploration of a disease-exposure association in traditional epidemiology.¹

Traditional epidemiology often seeks to show that, across a study population, environmental exposure X is consistently associated with observed disease Y. Association analyses in genetic epidemiology ask the same question of genetic exposures, and many of the analytical approaches used in epidemiology and medical statistics can be applied directly in genetic epidemiology.

Twin studies

Twin studies were one of the earliest forms of genetic study, first carried out in the 19th century by Francis Galton, considered by many to be the father of medical genetics. He investigated the extent to which the similarity of twins changes over the course of development.

Twin studies involve comparing monozygotic (identical) twins with dizygotic (non-identical) twins to estimate the relative contributions of genes and the environment to specific traits. Monozygotic twins share essentially all of their genetic material, whereas dizygotic twins, like other siblings, have on average only 50% of their genes in common. If identical twin pairs are more often alike (concordant) for an outcome of interest than non-identical twin pairs, it suggests that genes contribute to the outcome.

Monozygotic twins serve as excellent subjects for controlled experiments because they share prenatal environments and those reared together also share common family, social, and cultural environments. However, twin studies have the potential to over- or underestimate the role of genetics, because of the challenges of quantifying environmental influences.
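The comparison between twin types is often summarised as a concordance rate. The sketch below is illustrative only: it uses pairwise concordance (one of several definitions in use), and the function name and counts are hypothetical rather than drawn from any study.

```python
def pairwise_concordance(both_affected: int, one_affected: int) -> float:
    """Proportion of twin pairs with at least one affected twin
    in which both twins are affected."""
    pairs = both_affected + one_affected
    if pairs == 0:
        raise ValueError("no twin pairs with an affected member")
    return both_affected / pairs


# Hypothetical counts, for illustration only
mz = pairwise_concordance(both_affected=30, one_affected=70)
dz = pairwise_concordance(both_affected=10, one_affected=90)
print(f"MZ concordance: {mz:.2f}, DZ concordance: {dz:.2f}")
```

A higher concordance in monozygotic than in dizygotic pairs is consistent with a genetic contribution, subject to the caveats about shared environment noted above.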

Linkage studies²

Linkage analysis is often the first stage in the genetic investigation of a trait, since it can be used to identify broad genomic regions that might contain a disease gene, even in the absence of previous biologically driven hypotheses.

Genetic linkage analysis can be used to identify regions of the genome that contain genes that predispose to disease. It involves two key concepts:

Linkage – two genetic loci are linked if they are transmitted together from parent to offspring more often than would be expected under independent inheritance. Recombination (the rearrangement of genetic material produced by the crossing over and rejoining of chromosomal segments) at random points of the chromosome can result in two gene loci being separated. The closer two genes are, the less likely it is that a recombination event will occur between them, i.e. they are more tightly linked than two genes that are further apart on the chromosome.
Linkage disequilibrium – two genetic loci are in linkage disequilibrium if, across the population as a whole, particular alleles at the two loci are found together on the same haplotype (the set of alleles inherited together from a single parent) more often than expected by chance.

In general, two loci in linkage disequilibrium will also be linked, but the reverse is not necessarily true. Every time recombination occurs between two loci in the population, the linkage disequilibrium between them is weakened, and is maintained only if the two loci are very close together.
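The weakening of linkage disequilibrium by recombination can be illustrated numerically. The sketch below assumes the standard coefficient D (the observed haplotype frequency minus the product of the two allele frequencies) and an idealised population (random mating, no selection, mutation or drift), in which D shrinks by a factor of (1 − recombination fraction) each generation. The function names and frequencies are hypothetical.

```python
def ld_coefficient(p_ab: float, p_a: float, p_b: float) -> float:
    """D = observed haplotype frequency minus the frequency expected
    if the two alleles were independent."""
    return p_ab - p_a * p_b


def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected D after a number of generations, given recombination
    fraction theta between the loci (idealised population assumed)."""
    return d0 * (1.0 - theta) ** generations


# Hypothetical frequencies: alleles A and B each at 0.5, haplotype AB at 0.4
d0 = ld_coefficient(p_ab=0.4, p_a=0.5, p_b=0.5)

# Closely spaced loci keep most of their disequilibrium after 100 generations
tight = ld_after_generations(d0, theta=0.001, generations=100)
# Loosely linked loci lose almost all of it
loose = ld_after_generations(d0, theta=0.1, generations=100)
print(f"D0={d0:.3f}, tight={tight:.4f}, loose={loose:.2e}")
```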

There are two major types of linkage analysis, described in more detail by Teare and colleagues:²

Parametric linkage analysis – the analysis of how genetic loci co-segregate in pedigrees or family units. Loci that are close together on the same chromosome segregate together more often than loci on different chromosomes. The further apart two loci are on the same chromosome, the more likely it is that a recombination event at meiosis will break up the co-segregation. The main quantity of interest is the recombination fraction (the probability of recombination between two loci at meiosis). By genotyping genetic markers and studying their segregation through pedigrees, it is possible to infer their position relative to each other on the genome. This can then be used to map genetic markers or disease loci.
Model-free (non-parametric) linkage analysis – this is used for multifactorial diseases, where several genes (and environmental factors) might contribute to disease risk and there is no disease model available. The rationale is that, between affected relatives, excess sharing of haplotypes that are identical by descent (IBD) in the region of a disease-causing gene would be expected, irrespective of the mode of inheritance. Various methods test whether IBD sharing at a locus is greater than expected under the null hypothesis of no linkage.

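Linkage is usually reported using a LOD (logarithm of the odds) score: the base-10 logarithm of the ratio of the likelihood of the observed data at a given recombination fraction to the likelihood under no linkage (a recombination fraction of 0.5). Large positive LOD scores are evidence for linkage and negative scores are evidence against. By convention, a score of 3 or more is taken as evidence for linkage and a score of −2 or less as evidence against.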
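As a worked illustration, the sketch below computes a two-point LOD score for the simplest case, in which every meiosis can be classified unambiguously as recombinant or non-recombinant (phase known). Real pedigree analyses are more involved and use dedicated software; the function name and counts here are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5), where
    L(theta) = theta^recombinants * (1 - theta)^non_recombinants."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical data: 1 recombinant observed in 10 informative meioses
lod = lod_score(recombinants=1, meioses=10, theta=0.1)
print(f"LOD at theta=0.1: {lod:.2f}")  # approximately 1.60
```

In practice the score is evaluated over a range of recombination fractions and the maximum is reported.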

Genetic association studies³

Genetic association studies aim to detect associations between one or more genetic polymorphisms and a trait, for example a disease. Association differs from linkage in that the same allele (or alleles) is associated with the trait in a similar manner across the whole population, while linkage allows different alleles to be associated with the trait in different families.

Genetic associations only arise because humans share common ancestry, and it has been argued that association studies are really just a special form of linkage study in which the extended family is the wider population. However, this type of research has more in common with classical epidemiology than with the family studies described above, because it examines associations at a population level. Association studies can be done, for example, as case-control studies, in which disease cases and controls are compared for the proportion of each group carrying the polymorphism under investigation. In genome-wide association studies (GWAS), a large number (>300,000) of genetic variants are compared simultaneously between groups, using a hypothesis-free approach, to look for significant associations.⁴

Cordell and colleagues outline three reasons why there might be an association between a polymorphism and a trait in a population:

Direct association – the polymorphism has a causal role
Indirect association – the polymorphism has no causal role but is associated with a nearby causal variant
Confounded association – the association is due to some underlying stratification or admixture of the population, requiring further investigation

Familiar epidemiological study designs such as case-control or cohort designs are often used for genetic association studies and the data are analysed much the same way. Risk factors or exposures such as smoking are replaced by the presence or absence of a particular genetic polymorphism.
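For a case-control design, the analysis can be sketched exactly as for any binary exposure, with carrier status taking the place of the exposure. The example below computes an odds ratio with an approximate 95% confidence interval (the usual log-based method) from a 2×2 table. The function name and counts are hypothetical.

```python
import math


def odds_ratio_with_ci(case_carriers: int, case_noncarriers: int,
                       control_carriers: int, control_noncarriers: int,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Odds ratio for carrying the polymorphism (cases vs controls),
    with an approximate confidence interval on the log scale."""
    a, b, c, d = (case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers)
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 60 of 100 cases and 40 of 100 controls carry the variant
or_est, ci_low, ci_high = odds_ratio_with_ci(60, 40, 40, 60)
print(f"OR = {or_est:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```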

Mendelian randomisation studies

If a genetic variant has an effect on a modifiable risk factor, which itself alters disease risk, then it follows that the genetic variant should also be related to disease risk. For example, if a given genetic variant is associated with cholesterol levels, and blood cholesterol levels are associated with the development of heart disease, then the genetic variant itself should also be associated with heart disease. Where this is the case, the genetic variant can be used as a proxy for the risk factor, so the risk factor itself need not be measured. Natural experiments can thus be conducted by using an individual’s genetics to assign them to risk groups (e.g. those with or without a specific genetic variant) and measuring the resulting outcome. This is known as Mendelian randomisation. The advantage of this approach is that, unlike modifiable risk factors, an individual’s genetics are fixed at conception and are not affected by potential confounders, so gene variant-disease associations are more likely to be causative. More information on this form of study design can be found here.
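One simple way to turn this reasoning into an estimate is the Wald ratio: the variant-outcome association divided by the variant-risk factor association. This estimator is not described in the text above and is offered only as a sketch. It assumes the variant affects the outcome solely through the risk factor and is unrelated to confounders. The function name and effect sizes are hypothetical.

```python
def wald_ratio(beta_variant_outcome: float,
               beta_variant_risk_factor: float) -> float:
    """Estimated effect of the risk factor on the outcome, per unit of
    the risk factor, using a single genetic variant as a proxy."""
    if beta_variant_risk_factor == 0:
        raise ValueError("variant must be associated with the risk factor")
    return beta_variant_outcome / beta_variant_risk_factor


# Hypothetical per-allele effects, for illustration only:
# +0.5 mmol/L cholesterol, and +0.2 in the log odds of heart disease
effect = wald_ratio(beta_variant_outcome=0.2, beta_variant_risk_factor=0.5)
print(f"Estimated change in log odds per mmol/L: {effect:.2f}")
```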

Appraising association studies

Hattersley and colleagues⁵ provide guidance on assessing the quality of association studies. They propose asking a series of questions, similar to those that would be addressed in the appraisal of a standard epidemiology paper:

What are we hoping for from an association study? Many association studies have only limited power to detect true susceptibility effects and even less power to exclude the involvement of a gene in causing a trait.
How good a candidate is the gene in question? There are up to 30,000 genes in the human genome, and it is unlikely that more than a few hundred make a meaningful contribution to any single trait. The probability that a gene selected at random will influence a given trait is therefore very low.
How strong is the case for the variants that have been typed? Detecting every possible disease-associated variant would require unfeasibly large samples, which is usually ruled out on cost grounds.
How appropriate are the samples typed? Although prospective cohort studies are often regarded as the gold standard, they are usually not efficient for the initial stages of gene discovery. Unless the disease is very common, the study samples generated will have far fewer individuals with disease than without. Furthermore, the unselected nature of the cases could compromise power, especially when compared with samples that are deliberately enriched for genetic aetiology and disease homogeneity. The case-control study remains the mainstay of genetic association studies, and the most important issues relate to choice of the two study groups.
Is the study size large enough? Sample size is a key determinant of quality in an association study.
How good is the genotyping? Most association studies assume implicitly that the genotypes are accurate. However, even with the best methods, some assays will be unreliable, and the accuracy of earlier genotyping methods may have been even worse.
How appropriate is the analysis?
How appropriate is the interpretation? There is still much discussion about the level of evidence needed before a genetic association can be regarded as proven.

Papers by Frayling⁶ and Ioannidis⁷ both provide further guidance on interpreting association studies, for those seeking more information.

Problems with genetic association studies

Genetic association studies are central to efforts to identify and characterise genomic variants underlying susceptibility to multifactorial disease. However, their role in the characterisation of genes contributing to common traits remains controversial. Bird and colleagues⁸ identify several potential pitfalls with studies of this kind:

Accuracy of diagnostic criteria for the disorder to be studied. Investigators should provide evidence that all the subjects have the same disease.
Selection of appropriate control subjects, especially regarding age, sex, and ethnic background.
Choice of study strategy, for example using a population-based, case-control study vs. a family approach.
The problem of multiple comparisons: the large number of comparisons made in a study gives a high likelihood of false-positive results occurring by chance.
Choice of statistical analysis and threshold for significance.
The tendency of both investigators and journals to report only studies with positive rather than negative results. As a result, the literature becomes heavily weighted toward unconfirmed associations.
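The scale of the multiple comparisons problem can be shown with simple arithmetic. The sketch below uses the Bonferroni correction, one common and conservative approach; it is not necessarily the method recommended by the authors cited above. The function names are hypothetical, and the 300,000 tests correspond to the GWAS figure quoted earlier.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of tests significant by chance alone
    if no variant is truly associated."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the overall chance of
    any false positive at or below alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_variants = 300_000
false_hits = expected_false_positives(alpha=0.05, n_tests=n_variants)
threshold = bonferroni_threshold(alpha=0.05, n_tests=n_variants)
print(f"Expected chance findings at p<0.05: {false_hits:.0f}")
print(f"Bonferroni per-test threshold: {threshold:.2e}")
```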

References

1. Burton P, Tobin M, Hopper J. Key concepts in genetic epidemiology. Lancet 2005; 366: 941–51.
2. Teare MD, Barrett J. Genetic linkage studies. Lancet 2005; 366: 1036–44.
3. Cordell H, Clayton D. Genetic association studies. Lancet 2005; 366: 1121–31.
4. Soler Artigas M, Wain L, Tobin M. Genome-wide association studies in lung disease. Thorax 2012; 67: 271–73.
5. Hattersley A, McCarthy M. What makes a good genetic association study? Lancet 2005; 366: 1315–23.
6. Frayling T. Genetic association studies see light at the end of the tunnel. International Journal of Epidemiology 2008; 37: 133–35.
7. Ioannidis J et al. Assessment of cumulative evidence on genetic associations: interim guidelines. International Journal of Epidemiology 2008; 37: 120–32.
8. Bird T, Jarvik G, Wood N. Genetic association studies: genes in search of diseases. Neurology 2001; 57: 1153–54.

© Helen Barratt 2009, Saran Shantikumar 2018