Alternatively, a measure of composite genotypic disequilibrium can be computed directly from two-locus genotypic data [6]; under the assumption of random mating, it corresponds to the aforementioned allelic LD measure. A number of other common LD coefficients and their properties have been studied both analytically and via simulations [7, 8].
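As an illustration, the classical pairwise measures D, D' and r² can be computed directly from two-locus haplotype and allele frequencies. The following is a minimal sketch using the textbook definitions; the frequencies are purely illustrative.

```python
def ld_coefficients(p_ab, p_a, p_b):
    """Pairwise LD measures for alleles A and B, given the frequency
    p_ab of the A-B haplotype and the allele frequencies p_a, p_b."""
    d = p_ab - p_a * p_b                      # raw disequilibrium D
    if d >= 0:                                # normalise by the maximum attainable |D|
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

d, d_prime, r2 = ld_coefficients(p_ab=0.4, p_a=0.5, p_b=0.6)
```

Here D measures raw covariance between alleles, D' rescales it to [-1, 1], and r² is the squared correlation used in power calculations.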

Once created by any of a number of evolutionary forces, LD is eroded by recombination events taking place between loci, which cause it to decay over time. The essential idea is that a marker in strong LD with a disease locus is expected to be located nearby. But how is LD used to map a gene? After all, when one of the loci of interest is the gene being mapped, we have no information about its allele or genotype frequencies.

Association mapping techniques attempt to detect LD indirectly, by measuring the association between a candidate marker and the phenotype of interest, provided there is a rich pattern of LD between some of the typed markers and the real, unobserved causal variant. How dense this map of markers should be, and what the distribution of LD looks like in modern human populations, are crucial issues being extensively explored [1, 9].

Notably, the HapMap project is enabling the characterisation of genome-wide patterns of LD in several populations [10]. In what regions of the genome do we look for disease-bearing genes? In candidate-gene approaches, it is assumed that prior biological hypotheses about plausible locations of the candidate gene have been previously obtained, and therefore the search is localised to those regions of interest.


Genome-wide studies, on the other hand, screen the entire genome, thus enabling a more comprehensive search for genetic risk factors. These studies will soon be less expensive, and therefore more routinely employed.

From a statistical and computational standpoint, genome-wide explorations introduce non-trivial challenges due, among other causes, to the very large number of markers to be included in the analysis compared to the usually smaller sample sizes; these issues will be considered further later. Another question generating much discussion, and fuelling the development of new analytical methods, is whether complex diseases are caused by a single common variant or by many variants of small effect. So far, evidence in favour of the single-common-variant hypothesis has been limited.

It is plausible that common diseases are controlled by more complex genetic mechanisms characterised by the joint action of several genes, each having only a small marginal effect, perhaps because natural selection has removed the genes with larger effects. In this scenario, groups of markers should be tested jointly for association, which can be done in two main ways: by grouping markers together into multi-locus genotypes, so that the basic unit of statistical analysis is still the individual, or via haplotypes, thus effectively doubling the sample size.

We next review a number of selected techniques, starting from the simplest case of single-marker analysis. Suppose we are investigating the effects of biallelic markers, e.g. SNPs, on disease. In a case-control setting, counts of either the two alleles or the three genotypes at a locus are compared between the two groups, cases and controls.

If there is a difference in frequencies between the two samples, there is evidence that the marker is in LD with a gene affecting disease susceptibility. A simple test of independence is Pearson's chi-squared test, applicable at both the allelic and the genotypic level. However, it has been noted that this test is not robust to departures from Hardy–Weinberg equilibrium (HWE) in control subjects [15]. The utility of single-marker LD testing that uses a case-control study design with either diallelic or multiallelic markers is discussed in [16].
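A sketch of the genotypic version of this comparison, assuming SciPy is available; the counts below are purely illustrative, not taken from any real study.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2 x 3 contingency table of genotype counts (AA, Aa, aa);
# illustrative numbers only.
table = np.array([
    [120, 240, 140],   # cases
    [180, 250, 100],   # controls
])

chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2_stat:.2f}, df = {dof}, p = {p_value:.3g}")
```

The allelic version would instead use a 2 x 2 table of allele counts, with each individual contributing two alleles.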

In this work, under the alternative hypothesis of unequal marker allele frequencies between cases and controls, the asymptotic distribution of the chi-squared test is expressed as a function of a genetic distance measure G², which depends on the population history. Using a simple deterministic population-genetic model that accounts for a single mutation and ignores genetic drift, the value of G² can be computed and the power of the test obtained under various disease models and population histories.

A more robust test statistic is the Cochran–Armitage (CA) trend test, a method of directing chi-squared tests towards narrow alternatives [17]. This test should be used at the genotypic level when HWE fails to hold for both cases and controls [15, 18]. Fine localisation of a disease-susceptibility locus can be accomplished by investigating deviations from HWE among affected individuals alone [19, 20].
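A minimal sketch of the CA trend test, exploiting the identity that the trend statistic equals N times the squared correlation between disease status and genotype score; the counts are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def ca_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test from genotype counts (AA, Aa, aa).

    Uses the identity trend statistic = N * r^2, where r is the
    correlation between disease status (0/1) and the genotype score
    (copies of the minor allele)."""
    y = np.repeat([1, 0], [sum(cases), sum(controls)])          # disease status
    g = np.concatenate([np.repeat(weights, cases),              # genotype scores
                        np.repeat(weights, controls)])
    n = len(y)
    r = np.corrcoef(y, g)[0, 1]
    stat = n * r ** 2
    return stat, chi2.sf(stat, df=1)                            # 1 df

stat, p = ca_trend_test(cases=[120, 240, 140], controls=[180, 250, 100])
```

Because the test has a single degree of freedom directed at the trend alternative, it is typically more powerful than the general 2-df genotypic test when risk is monotone in allele count.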

Hybrid tests have also been suggested, for instance a test statistic obtained as a weighted average of the CA trend test statistic and the difference between HWE-based test statistics computed in cases and in controls [21, 22]. Departures from HWE can also serve as a quality check on the data, as experience suggests that gross deviations from HWE often indicate genotyping errors.
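Such a quality check might be sketched as a one-degree-of-freedom goodness-of-fit test of HWE from genotype counts; the counts are illustrative, and an exact test would be preferable for small samples.

```python
import numpy as np
from scipy.stats import chi2

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared goodness-of-fit test of HWE from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                  # estimated frequency of allele A
    expected = np.array([n * p ** 2,                 # AA
                         2 * n * p * (1 - p),        # Aa
                         n * (1 - p) ** 2])          # aa
    observed = np.array([n_aa, n_ab, n_bb])
    stat = ((observed - expected) ** 2 / expected).sum()
    return stat, chi2.sf(stat, df=1)                 # 3 classes - 1 - 1 estimated param

stat_hwe, pval = hwe_chi2(180, 250, 100)
```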

Alternatively, one can explicitly model the penetrance of the disease, that is, the conditional probability that a randomly selected individual in the population is affected by the disease, given his or her genotype.


In the logistic regression (LR) formulation, the logit transform of the penetrance parameter is modelled as a linear combination of the marker data. Then, by asymptotic results for maximum likelihood estimators, inferences can be based on standard Wald, likelihood ratio and score methods [14]. In particular, the score statistic in this case corresponds to the CA trend test.

An obvious advantage of this formulation is that covariates can easily be added to the model. Recent developments that allow efficient approximate conditional inference for LR models include Monte Carlo methods and saddle-point approximations [23]. Several statistical methods for association mapping, including LR as well as other generalised linear models, require the specification of a genetic model of inheritance. For instance, in a CA trend test, or in score statistics from logistic regression, an additive model can be imposed by assigning genotypes the weights 0, 1 and 2, according to the number of copies of the minor allele.
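The additive coding just mentioned can be written out explicitly; the dominant and recessive rows below are the standard alternatives, added here for illustration rather than taken from the text.

```python
# Genotype codings implied by three common inheritance models,
# for minor-allele copy number g in {0, 1, 2}.
CODINGS = {
    "additive":  {0: 0, 1: 1, 2: 2},   # each copy contributes equally
    "dominant":  {0: 0, 1: 1, 2: 1},   # one copy is enough
    "recessive": {0: 0, 1: 0, 2: 1},   # two copies are needed
}

def code_genotypes(genotypes, model="additive"):
    """Map minor-allele counts to the weights used, e.g., as the single
    covariate of a logistic regression or as CA trend-test scores."""
    weights = CODINGS[model]
    return [weights[g] for g in genotypes]
```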

Forcing a specific genetic model provides a powerful means of detecting association when the hypothesised model is close to the true underlying genetic mechanism, but may also lead to very low power when the true model is different [ 18 , 24 ].

Methods that do not require the specification of a genetic model are usually recommended [25]. The approaches presented so far rely on two fundamental assumptions: first, the population under study must be genetically homogeneous, i.e. free of hidden substructure; second, the sampled individuals must be unrelated. Tests of association that do not protect against departures from these two assumptions may have inflated type I error rates. If the target population does consist of several subpopulations, spurious associations at a candidate marker may occur if both the disease prevalence and the marker allele frequencies differ between subpopulations, i.e. when population stratification acts as a confounder.

When the population is indeed heterogeneous, family-based association studies are generally more powerful than case-control studies, and tests that rely on the transmission of alleles from parents to offspring are usually adopted [26].

However, these study designs present other drawbacks, most notably the difficulty of collecting DNA from relatives of affected individuals, especially for late-onset diseases, thus militating against the recruitment of large samples. Research in the area of case-control studies has actively addressed these issues, and several alternatives are available.

If population structure or cryptic relatedness is present in the sample, the variability and magnitude of the test statistics at null markers are inflated; the genomic control approach estimates this inflation from a set of null markers and adjusts the tests computed at candidate loci accordingly. A different remedy, often referred to as structured association, prescribes using loci unlinked to the candidate genes under study to infer subpopulation membership and then conducting tests of association within subpopulations.
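The genomic control adjustment can be sketched as follows: the inflation factor lambda is estimated as the ratio of the median observed chi-squared statistic at null markers to the theoretical null median, and candidate statistics are deflated by it. All values below are simulated for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
# Simulated 1-df statistics at null markers, inflated by structure.
null_stats = 1.3 * rng.chisquare(df=1, size=5000)

# Inflation factor: median of observed nulls over the chi2_1 median (~0.455).
lam = np.median(null_stats) / chi2.ppf(0.5, df=1)

candidate_stat = 12.0                      # hypothetical statistic at a candidate locus
adjusted = candidate_stat / max(lam, 1.0)  # never adjust upwards
p_adjusted = chi2.sf(adjusted, df=1)
```

Using the median makes the estimate of lambda robust to the few genuinely associated markers that may be hiding among the presumed nulls.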

The idea is that, conditional on subpopulation, there is neither bias nor excess variance due to population substructure. The method can be implemented as a two-step procedure, in which subpopulation proportions are estimated first and then incorporated into a test statistic [28, 29] (for instance, as covariates in an LR model), or as a unified analysis which may account for estimation uncertainty [30, 31]. Either way, the task of estimating subpopulation memberships from genotype data is essentially a clustering or unsupervised learning application, often addressed by using finite mixture distributions [32].

Both Bayesian and likelihood-based inferential procedures can then be employed [33–35]. In a variation of the structured association idea, the disease status is included in the clustering algorithm used for inferring the hidden population structure, leading to a supervised clustering approach [36]. The related question of which markers are particularly informative for ancestry estimation is also important and has been investigated from an information-theoretic perspective [37].

Recently, it has also been suggested that the use of LR alone, which dispenses entirely with the notion of subpopulation and is computationally faster, may be a better alternative [ 38 , 39 ]. Unlike genomic control, structured association alone does not protect against cryptic relatedness, and more specialised solutions are needed.

However, a recently developed theory to predict the amount of cryptic relatedness expected in random mating populations suggests that confounding effects in this situation are particularly serious only in special cases [40]. On the other hand, even moderate levels of population stratification may lead to an increased number of false positives [41], especially in large case-control studies [42], and even in well-designed studies [43] or when the population under study is believed to be homogeneous [44].


In a single-marker analysis, a test statistic is computed at each of the M candidate markers, giving rise to a multiple testing problem. One way to deal with this situation is to control the family-wise error rate (FWER), i.e. the probability of making at least one type I error; the simplest approach is the Bonferroni correction, which divides the nominal significance level by M. Since this correction is too conservative when M is large, leading to power loss, a number of step-wise procedures have been developed. An alternative multiple hypothesis testing error measure is the false discovery rate (FDR) [46], which is loosely defined as the expected proportion of false positives among all significant hypotheses. The FDR is especially appropriate for exploratory analyses in which one is interested in finding several significant results among many tests.
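The contrast between the two error measures can be sketched with the Benjamini-Hochberg step-up procedure; the p-values are illustrative, and the procedure as written assumes independent (or positively dependent) tests.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of the
    hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m        # q * i / m for sorted p-values
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])            # largest i with p_(i) <= q*i/m
        rejected[order[:k + 1]] = True
    return rejected

pvals = [0.001, 0.009, 0.012, 0.041, 0.27, 0.60]
print("Bonferroni rejects:", [p < 0.05 / len(pvals) for p in pvals])
print("BH rejects:        ", benjamini_hochberg(pvals, q=0.05).tolist())
```

On these six p-values the Bonferroni threshold (0.05/6) rejects only the first hypothesis, while BH rejects the first three, illustrating the power gained by controlling the FDR instead of the FWER.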

As a special case, when all the null hypotheses are true, this error rate equals the FWER. It must be noted that LD between genetic markers induces substantial correlations among the test statistic values along the map. Generally, it is difficult to formally account for this serial correlation; for instance, distributions for products of p-values are only known when the tests are independent. Monte Carlo procedures are commonly used, for instance by randomly permuting phenotypic labels [52] or by using permutation sampling for fitting extreme value distributions [53].
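The label-permutation idea can be sketched as follows: permuting phenotypes leaves the LD among markers intact, so the permutation null of the maximum statistic automatically accounts for the correlation between tests. All data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n individuals, m correlated markers (minor-allele counts),
# and a binary phenotype unrelated to the markers.
n, m = 200, 10
base = rng.integers(0, 3, size=n)                    # shared component induces LD
genotypes = np.column_stack(
    [np.clip(base + rng.integers(-1, 2, size=n), 0, 2) for _ in range(m)]
)
phenotype = rng.integers(0, 2, size=n)

def max_trend_stat(y, G):
    """Largest correlation-based trend statistic (N * r^2) over markers."""
    yc = y - y.mean()
    Gc = G - G.mean(axis=0)
    r = (yc @ Gc) / (np.sqrt((yc ** 2).sum()) * np.sqrt((Gc ** 2).sum(axis=0)))
    return len(y) * (r ** 2).max()

observed = max_trend_stat(phenotype, genotypes)

# Permutation null: shuffling labels preserves the marker correlations.
perm = np.array([
    max_trend_stat(rng.permutation(phenotype), genotypes)
    for _ in range(999)
])
p_value = (1 + (perm >= observed).sum()) / (1 + len(perm))
```

With large studies this resampling loop is exactly the computational burden the text refers to: every permutation repeats the full genome scan.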

However, many simulation-based methods become extremely time consuming when applied to large studies. Other approaches deal with this problem by sequentially decorrelating the tests, either by application of a single transformation derived from the correlation matrix [47] or by successive greedy transformations [54]. Under simple models of evolution, ignoring population-specific history and structure, the probability of recombination is a monotonic function of genetic distance, and the degree of LD across a chromosome is expected to follow a unimodal curve with a peak at the true location of the disease mutation.

Under this assumption, a strategy to combine information from single markers in a region is to fit a smooth curve to the LD coefficients computed at all markers and then look for its mode. In practice, however, the pattern of observed LD may fluctuate substantially, even erratically, across contiguous genomic regions [ 1 ]. The gene-mapping problem then becomes one of pattern recognition: the task is to look for regions with a consistent overall pattern of LD supporting the existence of a disease-associated marker.
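A toy version of this curve-fitting strategy: smooth the noisy per-marker LD coefficients along the region and take the mode of the fitted curve as the estimated location. The positions and coefficients are simulated, and a simple moving average stands in for the spline smoothers used in practice.

```python
import numpy as np

positions = np.linspace(0.0, 1.0, 25)                 # marker positions (arbitrary units)
signal = np.exp(-30 * (positions - 0.6) ** 2)         # unimodal "true" LD profile, peak at 0.6
rng = np.random.default_rng(1)
ld_obs = signal + rng.normal(0.0, 0.1, signal.size)   # noisy observed coefficients

kernel = np.ones(5) / 5                               # 5-point moving average
smoothed = np.convolve(ld_obs, kernel, mode="same")
estimated_location = positions[np.argmax(smoothed)]
```

Even this crude smoother recovers a mode near the true peak; the erratic LD patterns mentioned above are precisely what makes more flexible, knot-adaptive smoothers attractive.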

Non-parametric curve-fitting methods embracing this idea have been initially developed for fine-mapping, thus assuming that the region under study does contain a true peak [55, 56]. However, in large exploratory scans, the possibility that the data may have a varying number of true signals, or even no signal at all, has to be taken into account. In this respect, a curve-fitting method based on Bayesian adaptive regression splines with a variable number of knots has been applied with some success [57, 58].

A different proposal consists of fitting a semi-Bayesian hierarchical model, where a pair-wise LD measure is first estimated for each locus using a first-stage model, and then spatially smoothed along the candidate region using a second-stage model that can include information on genetic or physical distances as well as haplotype structure [59].

An alternative solution for combining information from neighbouring markers consists of forming sums of single-marker test statistics and then testing the null hypothesis that none of the selected markers in each sum is associated with the disease, in what has been called the set association approach [60–62]. Combining genetic main effects in this way may facilitate the detection of susceptibility genes while avoiding the need to characterise detailed interaction patterns among markers.


When compared with Bonferroni or FDR procedures, the sum statistics seem to show greater power [54]. Rather than considering each marker individually, one can analyse specific combinations of allelic variants at a series of tightly linked markers on the same chromosome, i.e. haplotypes. By incorporating information from multiple adjacent markers, haplotypes preserve the joint LD structure and more directly reflect the true polymorphisms.

Therefore, in a generalisation of the single-marker analyses presented in the preceding text, haplotype frequencies in cases and controls can be compared instead of allelic and genotypic frequencies [63]. The simplest way to test whether there is an association between a haplotype and disease status is to regard each haplotype as a distinct category, possibly lumping all rare haplotypes together into an additional class. To deal with the inflated variance of the test statistic due to haplotype estimation, the null distribution of the test can be obtained by randomly shuffling the disease status and then re-estimating haplotype frequencies [64].
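The category-based test with rare-haplotype lumping might be sketched as follows; the haplotype counts are illustrative (and assumed already estimated from phased or statistically reconstructed data), and the threshold of 10 is an arbitrary choice for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def haplotype_test(case_counts, control_counts, min_total=10):
    """Chi-squared test of haplotype-disease association, collapsing
    haplotypes with combined count below min_total into one 'rare' class."""
    cases = np.asarray(case_counts)
    controls = np.asarray(control_counts)
    common = (cases + controls) >= min_total
    table = np.array([
        np.append(cases[common], cases[~common].sum()),      # cases row
        np.append(controls[common], controls[~common].sum()) # controls row
    ])
    return chi2_contingency(table, correction=False)

# Counts over five haplotypes (2n chromosomes per group); the last two
# haplotypes are rare and get lumped together.
chi2_stat, p, dof, _ = haplotype_test([230, 150, 95, 4, 3],
                                      [260, 120, 110, 6, 2])
```

Note that these are counts of chromosomes rather than individuals, which is the "doubling of the sample size" mentioned earlier; a permutation null, as described above, would correct for the extra variance introduced by haplotype estimation.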

Although this approach assesses overall association between haplotypes and disease, it does not provide inference on the effects of specific haplotypes or haplotype features. To address these issues, a number of tests of specific haplotype effects are based on a prospective likelihood of disease [65, 66], where the disease status is treated as an outcome, and haplotypes enter a regression model as covariates.