Some tips for quality control (QC):
Despite controlling for HWE, MAF and Call Rate, some funny SNPs can scape. I recently analysed some SNPs which pass all QC criteria but had only 1 individual with genotype 22 and an extreme value. This individual screwed up my analysis. Solutions:
1. Eliminate phenotypic outliers.
2. Add another QC criterion: minimum genotype frequency (MGF) 5%.
Moreover, if you know the structure of your population, e.g. breeds, HWE should be tested having such structure into account, e.g. within breeds.
Everyday we know a bit more than yesterday and a bit less than the morrow
Jules
Jul 13, 2011
Jun 15, 2011
#----------------------------------------------------------------
# genetic stratification
#----------------------------------------------------------------
# this script calculates:
# (1) fraction of genetic variance explained by each multidimennsional scaling (MDS) dimension
# (2) MDS plot
# files required
# *.mds file (from PLINK)
# a file with phenotypic information (e.g. breed, affection status, sex, etc.), with FID and IID fields
# -------- edit this ---------------------
# specify *.mds and file path
mds.path <- "D:/jqo/Rscripts/disease.mds"
pheno.path <- "D:/jqo/Rscripts/gwas.pheno"
# ----------------------------------------
#----------------------------------------------------------------
# prepare files
#----------------------------------------------------------------
# import
mds.file <- read.delim(mds.path, sep = "")
pheno.file <- read.delim(pheno.path, sep = "")
# merge files
mds <- merge(pheno.file, mds.file, by = c("FID", "IID"))
# number of samples and columns where MDS dimensions start
samples <- length(mds$IID)
# pheno.file adds n colums + 1 from the "SOL" field in the mds.file
start.dimensions <- ncol(pheno.file) + 2
# variance for each dimension
var.mds <- data.frame(apply(mds[, start.dimensions:(start.dimensions + samples - 1)], 2,
function(x)
{c(variance = var(x, na.rm = TRUE))}))
names(var.mds) <- "varC"
# total variance
all.var.mds <- sum(var.mds[, 1])
# relative variance
var.mds <- transform(var.mds, rel.varC = varC / all.var.mds)
#----------------------------------------------------------------
# plots
#----------------------------------------------------------------
# relative genetic fraction
windows()
plot(var.mds[, 2] * 100,
type = "b",
ylab = "Relative genetic fraction (%)",
xlab = "MDS dimension (Ci)")
# MDS plot for C1 and C2
rel.var.C1 <- paste("C1 (", round(var.mds[1, 2], digit = 3) * 100 , "%)")
rel.var.C2 <- paste("C2 (", round(var.mds[2, 2], digit = 3) * 100 , "%)")
windows()
plot(mds$C1, mds$C2,
col = mds$disease,
pch = 16,
xlab = rel.var.C1,
ylab = rel.var.C2)
# genetic stratification
#----------------------------------------------------------------
# this script calculates:
# (1) fraction of genetic variance explained by each multidimennsional scaling (MDS) dimension
# (2) MDS plot
# files required
# *.mds file (from PLINK)
# a file with phenotypic information (e.g. breed, affection status, sex, etc.), with FID and IID fields
# -------- edit this ---------------------
# specify *.mds and file path
mds.path <- "D:/jqo/Rscripts/disease.mds"
pheno.path <- "D:/jqo/Rscripts/gwas.pheno"
# ----------------------------------------
#----------------------------------------------------------------
# prepare files
#----------------------------------------------------------------
# import
mds.file <- read.delim(mds.path, sep = "")
pheno.file <- read.delim(pheno.path, sep = "")
# merge files
mds <- merge(pheno.file, mds.file, by = c("FID", "IID"))
# number of samples and columns where MDS dimensions start
samples <- length(mds$IID)
# pheno.file adds n colums + 1 from the "SOL" field in the mds.file
start.dimensions <- ncol(pheno.file) + 2
# variance for each dimension
var.mds <- data.frame(apply(mds[, start.dimensions:(start.dimensions + samples - 1)], 2,
function(x)
{c(variance = var(x, na.rm = TRUE))}))
names(var.mds) <- "varC"
# total variance
all.var.mds <- sum(var.mds[, 1])
# relative variance
var.mds <- transform(var.mds, rel.varC = varC / all.var.mds)
#----------------------------------------------------------------
# plots
#----------------------------------------------------------------
# relative genetic fraction
windows()
plot(var.mds[, 2] * 100,
type = "b",
ylab = "Relative genetic fraction (%)",
xlab = "MDS dimension (Ci)")
# MDS plot for C1 and C2
rel.var.C1 <- paste("C1 (", round(var.mds[1, 2], digit = 3) * 100 , "%)")
rel.var.C2 <- paste("C2 (", round(var.mds[2, 2], digit = 3) * 100 , "%)")
windows()
plot(mds$C1, mds$C2,
col = mds$disease,
pch = 16,
xlab = rel.var.C1,
ylab = rel.var.C2)
May 26, 2011
Forcing the reference allele in PLINK
Aim: my data showed a region of nearly complete homozygosity in one population (let's say, pop1) and I wanted to see whether the same region (or a shorter part of it) was also nearly fixed in other three populations and also if it was the same allele that was fixed in the populations.
Method: my first try was to use PLINK to calculate the allelic frequencies (--freq) in that region (--chr --from-mb --to-mb) independently in the three populations using the minor allele in pop1 as the reference allele for the other breeds. This can be done in PLINK by adding --reference-allele, which specifies a file with two fields: the SNP identifier and the allele to be used as reference. The most obvious way to me to generate this file was from the previously generated file pop1.frq (--frq), which amongst other fields includes the SNP identifier and the minor and the major alleles in pop1 (denoted as A1 and A2, respectively).
Problem: in the *.frq file, SNPs that are fixed (i.e. MAF = 0), since there are not counts of the minor allele, PLINK codes the A1 as zero and one will end up with a list of reference alleles informativeless for some markers. Weird things may happen then if this list is used to calculate the derived allelic frequencies in the other populations. I do not fully understand why PLINK does not simply extracts the actual minor allele from the *.bim file (which is included by default in any PLINK analysis) instead of just using zero coding.
Method: my first try was to use PLINK to calculate the allelic frequencies (--freq) in that region (--chr --from-mb --to-mb) independently in the three populations using the minor allele in pop1 as the reference allele for the other breeds. This can be done in PLINK by adding --reference-allele, which specifies a file with two fields: the SNP identifier and the allele to be used as reference. The most obvious way to me to generate this file was from the previously generated file pop1.frq (--frq), which amongst other fields includes the SNP identifier and the minor and the major alleles in pop1 (denoted as A1 and A2, respectively).
Problem: in the *.frq file, SNPs that are fixed (i.e. MAF = 0), since there are not counts of the minor allele, PLINK codes the A1 as zero and one will end up with a list of reference alleles informativeless for some markers. Weird things may happen then if this list is used to calculate the derived allelic frequencies in the other populations. I do not fully understand why PLINK does not simply extracts the actual minor allele from the *.bim file (which is included by default in any PLINK analysis) instead of just using zero coding.
May 24, 2011
R/fGWAS: Functional GWAS package for R (Version 1.0)
The R/fGWAS is a bayesian analysis for GWAS implemented in R. For CRAG members, additional information is available and contributions are wellcome at CRAGBioWiki. The link for the package website will be permanently posted at the 'Used software' section.
May 23, 2011
Accuracy in phenotype measurement in GWAS
This article was recently presented at our department's journal club and has been added to the 'Further reading' section. Basically, it assesses how differences in independent measurements of the same trait affect results in GWAS such as significance, position of the associated region and estimated allelic effects.
May 16, 2011
May 12, 2011
Converting PLINK to GenABEL files
There are two R functions to convert from PLINK to GenABEL files: convert.snp.ped() (which inputs PLINK's *.ped and *.map files) and convert.snp.tped() (same for *.tped and *.tfam). An error happened when I tried to use convert.snp.ped() and R was forced to close. I had no problems with convert.snp.tped(). I know of one person who experienced the same issue although I do not know whether the same error message was prompt (and the cause of the problem).
On the other hand, be aware that any of the two functions (although both use as input files containing the phenotypic information) will only produce one of the input files (*.raw) for the GenABEL function that loads data: load.gwaa.data(). The second input file (*.dat) will have to be produced for instance from the *.tfam file. This means that (1) *.tfam does not have header and and *.dat must have (at least 'id' and 'sex' fields are required) and (2) PLINK uses 1/2 coding for sex by default whereas load.gwaa.data() uses 1/0. On the other hand, one would have to create anyway the *.dat file anyway if using a alternative phenotype file in PLINK (to allow for >1 phenotype or covariates). It is not the end of the world but it makes the conversion a bit less direct than expected...
Maybe somebody can add his/her experience?
May 11, 2011
A fixed significance threshold for GWAS in humans
Pe'er et al., 2008 aimed to define the testing burden (tb) as the factor by which significance is exagerated. Phased chromosomes from the Human Haplotype Map (HapMap) ENCODE regions (representing a fraction g = 1/600 of the genome) were used to generate randomly 1,000 cases and 1,000 controls (no association expected). Association statistics and p-values were calculated and the process was simulated N = 10e7 times.
As I understand it, for a given p, a nominal p-value computed from the theoretical distribution, n(p) was calculated as the number of simulations out of N simulations at which the best simulated p-value region-wide (i.e. in the region of size g) exceeded p. H(p) = n(p)/g·N, was defined as the number of expected regions in the genome that have a SNP exceeding p (i.e. expected significant hits in the genome by chance). tb is defined as H(p)/p and by consesus they define H(p) = 1 so that tb = 1/p. So far so good.
What it is not so clear to me is why they set p to be the gN th element of the list of the top single-hits in each of the simulations sort from the smallest (most significant) to the largest. Less clear is how they extrapolate to estimate the number of independent tests (1 million for all ENCODE SNPs in the CEU HapMap population), which is used to set a fixed threshold of genome-wide significance of P = 0.05 / 1 million = ~10e-8.
May 10, 2011
Motivation
Genome wide association studies (GWAS) are becoming standard genetic analyses. However, there is still a lack of consensus about the best-practices during the typical pipeline of a GWAS: (1) quality control (QC, which types of filters, and thresholds within filters, to apply); (2) association testing (genetic model, genetic stratification, multiple-testing correction and significance thresholds); (3) functional analysis (from associated SNPs to annotated genes, pathway analysis, candidate genes). Agreeing upon certain guidelines will greatly facilitate future GWAS and this is the aim of this forum.
Subscribe to:
Posts (Atom)