May 26, 2011

Forcing the reference allele in PLINK

Aim: my data showed a region of nearly complete homozygosity in one population (let's say, pop1) and I wanted to see whether the same region (or a shorter part of it) was also nearly fixed in other three populations and also if it was the same allele that was fixed in the populations.

Method: my first try was to use PLINK to calculate the allelic frequencies (--freq) in that region (--chr --from-mb --to-mb) independently in the three populations using the minor allele in pop1 as the reference allele for the other breeds. This can be done in PLINK by adding --reference-allele, which specifies a file with two fields: the SNP identifier and the allele to be used as reference. The most obvious way to me to generate this file was from the previously generated file pop1.frq (--frq), which amongst other fields includes the SNP identifier and the minor and the major alleles in pop1 (denoted as A1 and A2, respectively).

Problem: in the *.frq file, SNPs that are fixed (i.e. MAF = 0), since there are not counts of the minor allele, PLINK codes the A1 as zero and one will end up with a list of reference alleles informativeless for some markers. Weird things may happen then if this list is used to calculate the derived allelic frequencies in the other populations. I do not fully understand why PLINK does not simply extracts the actual minor allele from the *.bim file (which is included by default in any PLINK analysis) instead of just using zero coding.

May 24, 2011

R/fGWAS: Functional GWAS package for R (Version 1.0)

The R/fGWAS is a bayesian analysis for GWAS implemented in R. For CRAG members, additional information is available and contributions are wellcome at CRAGBioWiki. The link for the package website will be permanently posted at the 'Used software' section.

May 23, 2011

Accuracy in phenotype measurement in GWAS

This article was recently presented at our department's journal club and has been added to the 'Further reading' section. Basically, it assesses how differences in independent measurements of the same trait affect results in GWAS such as significance, position of the associated region and estimated allelic effects.

May 16, 2011

ibs()

Please, can anybody help to interpret the formula used in GenABEL's ibs() function:

f_{i,j} = Σ_k \frac{(x_{i,k} - p_k) * (x_{j,k} - p_k)}{(p_k * (1 - p_k))}


I am especially troubled to understand how to read the "\" and "frac{..." ?


Thanks!

May 12, 2011

Converting PLINK to GenABEL files

There are two R functions to convert from PLINK to GenABEL files: convert.snp.ped() (which inputs PLINK's *.ped and *.map files) and convert.snp.tped() (same for *.tped and *.tfam). An error happened when I tried to use convert.snp.ped() and R was forced to close. I had no problems with convert.snp.tped(). I know of one person who experienced the same issue although I do not know whether the same error message was prompt (and the cause of the problem).

On the other hand, be aware that any of the two functions (although both use as input files containing the phenotypic information) will only produce one of the input files (*.raw) for the GenABEL function that loads data: load.gwaa.data(). The second input file (*.dat) will have to be produced for instance from the *.tfam file. This means that (1) *.tfam does not have header and and *.dat must have (at least 'id' and 'sex' fields are required) and (2) PLINK uses 1/2 coding for sex by default whereas load.gwaa.data() uses 1/0. On the other hand, one would have to create anyway the *.dat file anyway if using a alternative phenotype file in PLINK (to allow for >1 phenotype or covariates). It is not the end of the world but it makes the conversion a bit less direct than expected...

Maybe somebody can add his/her experience?

May 11, 2011

A fixed significance threshold for GWAS in humans

Pe'er et al., 2008 aimed to define the testing burden (tb) as the factor by which significance is exagerated. Phased chromosomes from the Human Haplotype Map (HapMap) ENCODE regions (representing a fraction g = 1/600 of the genome) were used to generate randomly 1,000 cases and 1,000 controls (no association expected). Association statistics and p-values were calculated and the process was simulated N = 10e7 times.

As I understand it, for a given p, a nominal p-value computed from the theoretical distribution, n(p) was calculated as the number of simulations out of N simulations at which the best simulated p-value region-wide (i.e. in the region of size g) exceeded p. H(p) = n(p)/g·N, was defined as the number of expected regions in the genome that have a SNP exceeding p (i.e. expected significant hits in the genome by chance). tb is defined as H(p)/p and by consesus they define H(p) = 1 so that tb = 1/p. So far so good.

What it is not so clear to me is why they set p to be the gN th element of the list of the top single-hits in each of the simulations sort from the smallest (most significant) to the largest. Less clear is how they extrapolate to estimate the number of independent tests (1 million for all ENCODE SNPs in the CEU HapMap population), which is used to set a fixed threshold of genome-wide significance of P = 0.05 / 1 million = ~10e-8.

May 10, 2011

Motivation

Genome wide association studies (GWAS) are becoming standard genetic analyses. However, there is still a lack of consensus about the best-practices during the typical pipeline of a GWAS: (1) quality control (QC, which types of filters, and thresholds within filters, to apply); (2) association testing (genetic model, genetic stratification, multiple-testing correction and significance thresholds); (3) functional analysis (from associated SNPs to annotated genes, pathway analysis, candidate genes). Agreeing upon certain guidelines will greatly facilitate future GWAS and this is the aim of this forum.