Guidelines


Guidelines for GWAS

Quality Control (QC)

High-throughput genomic technologies are error prone, so QC in GWAS is compulsory. Software packages differ in their built-in QC facilities. Nevertheless, we recommend some data cleaning steps prior to QC:
1.       Remove completely missing SNPs and individuals with no genotypes.
2.       Remove monomorphic SNPs.
3.       Remove sex-linked SNPs if the pseudoautosomal region is unknown.
Retain SNPs and/or individuals that fulfil the following recommended criteria (filters and thresholds within filters):
1.       Compulsory QC criteria
a.       Call rate (CR) for SNPs and individuals ≥0.95 (that is, ≤5% missing data).
b.      Minor allele frequency (MAF) ≥0.05.
c.       P(HWE)>0.01 (p-value for the Hardy-Weinberg Equilibrium test, e.g. a χ² test).
d.      Identity-by-state (IBS) ≤0.95 (rejecting repeated genotypes).
2.       Discretional QC criteria
a.       Sex-linked marker errors (odds ratio <1000 with 1% error allowed (ref.)).
b.      No outliers due to population structure, e.g. multi-dimensional scaling (r.s. R).
c.       No pedigree errors (r.s. pedstat).
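
The compulsory call rate, MAF and HWE filters above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular GWAS package: the function name `qc_filter`, the 0/1/2 genotype coding with NaN for missing calls, the choice to filter individuals before SNPs, and the use of a χ² goodness-of-fit test for HWE are all assumptions made here. The IBS filter is omitted for brevity.

```python
import numpy as np
from scipy.stats import chi2


def qc_filter(geno, min_call_rate=0.95, min_maf=0.05, min_hwe_p=0.01):
    """Return boolean masks (keep_individuals, keep_snps).

    geno: (n_individuals, n_snps) array coded 0/1/2 (copies of one allele),
          with np.nan for missing genotypes (hypothetical coding).
    """
    geno = np.asarray(geno, dtype=float)
    called = ~np.isnan(geno)

    # Individual call rate is applied first (an ordering assumption).
    keep_ind = called.mean(axis=1) >= min_call_rate
    g, called = geno[keep_ind], called[keep_ind]

    with np.errstate(divide="ignore", invalid="ignore"):
        n = called.sum(axis=0).astype(float)   # called genotypes per SNP
        call_rate = n / g.shape[0]

        # Genotype counts and allele frequency per SNP.
        n_aa = (g == 0).sum(axis=0)
        n_ab = (g == 1).sum(axis=0)
        n_bb = (g == 2).sum(axis=0)
        p = (2 * n_bb + n_ab) / (2 * n)
        maf = np.minimum(p, 1 - p)

        # HWE chi-square goodness-of-fit test with 1 df.
        expected = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2]) * n
        observed = np.stack([n_aa, n_ab, n_bb])
        stat = np.nansum((observed - expected) ** 2 / expected, axis=0)
        hwe_p = chi2.sf(stat, df=1)

        keep_snp = (
            (call_rate >= min_call_rate)
            & (maf >= min_maf)
            & (hwe_p > min_hwe_p)
        )
    return keep_ind, keep_snp
```

The returned masks can be applied as `geno[keep_ind][:, keep_snp]`. SNPs with no called genotypes yield NaN frequencies and are dropped, because comparisons with NaN evaluate to False.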

 Thresholds

Deterministic

In the human literature, p<10⁻⁸ is considered a significant result (ref.).
Lander and Kruglyak (1995) developed the following formula:

m(T) = (C + 2rGT)·a(T)

where m(T) is the mean number of significant results for a genome scan assuming a threshold T, C is the number of chromosomes in the species (27 in sheep, 23 in humans), r is the recombination rate per map distance (~1 recombination per Morgan in humans), G is the map length (33.5 Morgans in sheep and humans) and a(T) is the point-wise probability of obtaining a value equal to or greater than T (the point-wise p-value).

The number of significant results above a threshold is expected to be low and can therefore be modelled as a Poisson distribution with mean m(T). The probability of obtaining 1 or more significant tests is then 1-exp(-m(T)). Suggestive genome-wide thresholds are those expected to render 1 significant result per scan; significant genome-wide thresholds are those expected to render 0.05 significant results per scan. The numbers 1 and 0.05 are arbitrary. In practice, C, r and G are fixed, m(T) is set to 1 or 0.05, and the above equation is solved for T and hence for a(T). For example, in domestic mammals, a significant result could be when p<0.00005 and a suggestive one when p<0.0019.
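
As a rough illustration of how the equation can be solved numerically, the sketch below finds T for a target m(T) with a root finder and then reports a(T). It assumes that the test statistic is a chi-square with 1 df, so that a(T) is its upper-tail probability; this distributional choice, the function names and the search bracket are assumptions made here, so the resulting values need not reproduce the figures quoted above exactly.

```python
import math

from scipy.optimize import brentq
from scipy.stats import chi2


def expected_hits(T, C, r, G):
    """m(T) = (C + 2 r G T) * a(T), with a(T) = P(chi-square_1 >= T) (assumed)."""
    return (C + 2.0 * r * G * T) * chi2.sf(T, df=1)


def pointwise_threshold(m, C, r, G):
    """Solve m(T) = m for T and return (T, a(T))."""
    # Bracket [1, 100] on the chi-square scale is an assumption that covers
    # the usual targets (m = 1 and m = 0.05) for mammalian genomes.
    T = brentq(lambda t: expected_hits(t, C, r, G) - m, 1.0, 100.0)
    return T, chi2.sf(T, df=1)


# Sheep-like parameters taken from the text: C = 27, r = 1, G = 33.5 Morgans.
T_sig, p_sig = pointwise_threshold(0.05, C=27, r=1.0, G=33.5)   # significant
T_sug, p_sug = pointwise_threshold(1.0, C=27, r=1.0, G=33.5)    # suggestive

# Poisson probability of at least one significant result when m(T) = 0.05.
prob_at_least_one = 1.0 - math.exp(-0.05)

print(f"significant: T = {T_sig:.2f}, point-wise p = {p_sig:.2e}")
print(f"suggestive:  T = {T_sug:.2f}, point-wise p = {p_sug:.2e}")
```

Changing C, r or G to the values of another species gives the corresponding thresholds under the same assumptions.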

Permutations

Permutations consist of fixing genotypes and randomly sampling phenotypes without replacement. This breaks SNP-phenotype associations but not the linkage disequilibrium (LD) structure among SNPs. The latter is particularly useful when testing haplotypes. Given sufficient permutations (>10³), empirical distributions of the test statistics under H0 can be obtained. Thus, there is no need to appeal to a theoretical distribution for the statistics. However, empirical thresholds are only appropriate for unrelated samples or, if families are sampled, for residuals of models that take into account all nuisance parameters, including average relationships, but not SNP genotypes. The reason is that random permutations across families destroy the heritability of the trait.
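
A minimal sketch of a permutation-based genome-wide threshold follows. It assumes unrelated individuals (or pre-adjusted residuals as phenotypes), complete genotypes with no monomorphic SNPs, and a simple score-type statistic n·r² per SNP. The function name, the statistic and the use of the maximum statistic across SNPs to define the genome-wide threshold are illustrative choices, not prescriptions of this guideline.

```python
import numpy as np


def permutation_threshold(geno, pheno, n_perm=1000, alpha=0.05, seed=0):
    """Empirical genome-wide threshold from permuted phenotypes.

    geno:  (n_individuals, n_snps) complete genotype scores, no monomorphic SNPs.
    pheno: (n_individuals,) phenotypes or residuals.
    Returns (threshold, max_stats), where threshold is the (1 - alpha)
    quantile of the per-permutation maximum statistic.
    """
    rng = np.random.default_rng(seed)
    geno = np.asarray(geno, dtype=float)
    pheno = np.asarray(pheno, dtype=float)
    n = geno.shape[0]

    # Genotypes stay fixed, so they are standardised once; LD is preserved.
    gz = (geno - geno.mean(axis=0)) / geno.std(axis=0)

    max_stats = np.empty(n_perm)
    for i in range(n_perm):
        y = rng.permutation(pheno)            # sampling without replacement
        yz = (y - y.mean()) / y.std()
        r = gz.T @ yz / n                     # SNP-phenotype correlations
        max_stats[i] = np.max(n * r ** 2)     # largest statistic in this scan

    return np.quantile(max_stats, 1.0 - alpha), max_stats
```

An observed statistic computed in the same way on the unpermuted data would be compared against the returned threshold.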

False Discovery Rate

Genomic control (GC)

If the empirical distribution of a test statistic has heavier tails than its expected theoretical distribution, then false positive errors increase. GC is a penalty applied to empirical distributions with heavy tails. Assume the model Y = XB + e, where Y are phenotypes, X is the design matrix (with a column of ones and another with genotype scores for a SNP), B' = [b0, b1] (b0 = population mean, b1 = additive SNP effect) and e are the errors. The test statistic T² = b1²/V(b1) is asymptotically distributed as a chi-square with 1 df. In GWAS, m SNPs are tested. The GC parameter, λ, is the empirical median of the m T² statistics divided by the median of the theoretical chi-square distribution with 1 df (0.455), so that T² is approximately distributed as λ·chi-square. If λ=1, the empirical and theoretical distributions are the same. If λ>1, the empirical distribution has heavier tails than the theoretical distribution, and the penalty T²/λ is applied. The case λ<1 is rare and generally no action is taken. It is common to report GC-corrected p-values as well as uncorrected ones.
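
The GC correction described above can be sketched as follows, assuming the T² statistics for all SNPs are already available as an array. The function name is hypothetical; the theoretical median is taken from the chi-square (1 df) quantile function rather than hard-coded.

```python
import numpy as np
from scipy.stats import chi2


def genomic_control(t2):
    """Return (lambda, uncorrected p-values, GC-corrected p-values).

    t2: array of chi-square (1 df) test statistics, one per SNP.
    """
    t2 = np.asarray(t2, dtype=float)

    # Empirical median over the theoretical median (about 0.455).
    lam = np.median(t2) / chi2.ppf(0.5, df=1)

    # The penalty is applied only when lambda > 1; no action otherwise.
    t2_gc = t2 / max(lam, 1.0)

    p_raw = chi2.sf(t2, df=1)
    p_gc = chi2.sf(t2_gc, df=1)
    return lam, p_raw, p_gc
```

Returning both sets of p-values follows the common practice of reporting GC-corrected and uncorrected results side by side.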