Some tips for quality control (QC):
Despite controlling for HWE, MAF and Call Rate, some funny SNPs can scape. I recently analysed some SNPs which pass all QC criteria but had only 1 individual with genotype 22 and an extreme value. This individual screwed up my analysis. Solutions:
1. Eliminate phenotypic outliers.
2. Add another QC criterion: minimum genotype frequency (MGF) 5%.
Moreover, if you know the structure of your population, e.g. breeds, HWE should be tested having such structure into account, e.g. within breeds.
Everyday we know a bit more than yesterday and a bit less than the morrow
Jules
GWAS Forum
A forum for researchers working on genome wide association studies...
Jul 13, 2011
Jun 15, 2011
#----------------------------------------------------------------
# genetic stratification
#----------------------------------------------------------------
# this script calculates:
# (1) fraction of genetic variance explained by each multidimennsional scaling (MDS) dimension
# (2) MDS plot
# files required
# *.mds file (from PLINK)
# a file with phenotypic information (e.g. breed, affection status, sex, etc.), with FID and IID fields
# -------- edit this ---------------------
# specify *.mds and file path
mds.path <- "D:/jqo/Rscripts/disease.mds"
pheno.path <- "D:/jqo/Rscripts/gwas.pheno"
# ----------------------------------------
#----------------------------------------------------------------
# prepare files
#----------------------------------------------------------------
# import
mds.file <- read.delim(mds.path, sep = "")
pheno.file <- read.delim(pheno.path, sep = "")
# merge files
mds <- merge(pheno.file, mds.file, by = c("FID", "IID"))
# number of samples and columns where MDS dimensions start
samples <- length(mds$IID)
# pheno.file adds n colums + 1 from the "SOL" field in the mds.file
start.dimensions <- ncol(pheno.file) + 2
# variance for each dimension
var.mds <- data.frame(apply(mds[, start.dimensions:(start.dimensions + samples - 1)], 2,
function(x)
{c(variance = var(x, na.rm = TRUE))}))
names(var.mds) <- "varC"
# total variance
all.var.mds <- sum(var.mds[, 1])
# relative variance
var.mds <- transform(var.mds, rel.varC = varC / all.var.mds)
#----------------------------------------------------------------
# plots
#----------------------------------------------------------------
# relative genetic fraction
windows()
plot(var.mds[, 2] * 100,
type = "b",
ylab = "Relative genetic fraction (%)",
xlab = "MDS dimension (Ci)")
# MDS plot for C1 and C2
rel.var.C1 <- paste("C1 (", round(var.mds[1, 2], digit = 3) * 100 , "%)")
rel.var.C2 <- paste("C2 (", round(var.mds[2, 2], digit = 3) * 100 , "%)")
windows()
plot(mds$C1, mds$C2,
col = mds$disease,
pch = 16,
xlab = rel.var.C1,
ylab = rel.var.C2)
# genetic stratification
#----------------------------------------------------------------
# this script calculates:
# (1) fraction of genetic variance explained by each multidimennsional scaling (MDS) dimension
# (2) MDS plot
# files required
# *.mds file (from PLINK)
# a file with phenotypic information (e.g. breed, affection status, sex, etc.), with FID and IID fields
# -------- edit this ---------------------
# specify *.mds and file path
mds.path <- "D:/jqo/Rscripts/disease.mds"
pheno.path <- "D:/jqo/Rscripts/gwas.pheno"
# ----------------------------------------
#----------------------------------------------------------------
# prepare files
#----------------------------------------------------------------
# import
mds.file <- read.delim(mds.path, sep = "")
pheno.file <- read.delim(pheno.path, sep = "")
# merge files
mds <- merge(pheno.file, mds.file, by = c("FID", "IID"))
# number of samples and columns where MDS dimensions start
samples <- length(mds$IID)
# pheno.file adds n colums + 1 from the "SOL" field in the mds.file
start.dimensions <- ncol(pheno.file) + 2
# variance for each dimension
var.mds <- data.frame(apply(mds[, start.dimensions:(start.dimensions + samples - 1)], 2,
function(x)
{c(variance = var(x, na.rm = TRUE))}))
names(var.mds) <- "varC"
# total variance
all.var.mds <- sum(var.mds[, 1])
# relative variance
var.mds <- transform(var.mds, rel.varC = varC / all.var.mds)
#----------------------------------------------------------------
# plots
#----------------------------------------------------------------
# relative genetic fraction
windows()
plot(var.mds[, 2] * 100,
type = "b",
ylab = "Relative genetic fraction (%)",
xlab = "MDS dimension (Ci)")
# MDS plot for C1 and C2
rel.var.C1 <- paste("C1 (", round(var.mds[1, 2], digit = 3) * 100 , "%)")
rel.var.C2 <- paste("C2 (", round(var.mds[2, 2], digit = 3) * 100 , "%)")
windows()
plot(mds$C1, mds$C2,
col = mds$disease,
pch = 16,
xlab = rel.var.C1,
ylab = rel.var.C2)
May 26, 2011
Forcing the reference allele in PLINK
Aim: my data showed a region of nearly complete homozygosity in one population (let's say, pop1) and I wanted to see whether the same region (or a shorter part of it) was also nearly fixed in other three populations and also if it was the same allele that was fixed in the populations.
Method: my first try was to use PLINK to calculate the allelic frequencies (--freq) in that region (--chr --from-mb --to-mb) independently in the three populations using the minor allele in pop1 as the reference allele for the other breeds. This can be done in PLINK by adding --reference-allele, which specifies a file with two fields: the SNP identifier and the allele to be used as reference. The most obvious way to me to generate this file was from the previously generated file pop1.frq (--frq), which amongst other fields includes the SNP identifier and the minor and the major alleles in pop1 (denoted as A1 and A2, respectively).
Problem: in the *.frq file, SNPs that are fixed (i.e. MAF = 0), since there are not counts of the minor allele, PLINK codes the A1 as zero and one will end up with a list of reference alleles informativeless for some markers. Weird things may happen then if this list is used to calculate the derived allelic frequencies in the other populations. I do not fully understand why PLINK does not simply extracts the actual minor allele from the *.bim file (which is included by default in any PLINK analysis) instead of just using zero coding.
Method: my first try was to use PLINK to calculate the allelic frequencies (--freq) in that region (--chr --from-mb --to-mb) independently in the three populations using the minor allele in pop1 as the reference allele for the other breeds. This can be done in PLINK by adding --reference-allele, which specifies a file with two fields: the SNP identifier and the allele to be used as reference. The most obvious way to me to generate this file was from the previously generated file pop1.frq (--frq), which amongst other fields includes the SNP identifier and the minor and the major alleles in pop1 (denoted as A1 and A2, respectively).
Problem: in the *.frq file, SNPs that are fixed (i.e. MAF = 0), since there are not counts of the minor allele, PLINK codes the A1 as zero and one will end up with a list of reference alleles informativeless for some markers. Weird things may happen then if this list is used to calculate the derived allelic frequencies in the other populations. I do not fully understand why PLINK does not simply extracts the actual minor allele from the *.bim file (which is included by default in any PLINK analysis) instead of just using zero coding.
May 24, 2011
R/fGWAS: Functional GWAS package for R (Version 1.0)
The R/fGWAS is a bayesian analysis for GWAS implemented in R. For CRAG members, additional information is available and contributions are wellcome at CRAGBioWiki. The link for the package website will be permanently posted at the 'Used software' section.
May 23, 2011
Accuracy in phenotype measurement in GWAS
This article was recently presented at our department's journal club and has been added to the 'Further reading' section. Basically, it assesses how differences in independent measurements of the same trait affect results in GWAS such as significance, position of the associated region and estimated allelic effects.
May 16, 2011
May 12, 2011
Converting PLINK to GenABEL files
There are two R functions to convert from PLINK to GenABEL files: convert.snp.ped() (which inputs PLINK's *.ped and *.map files) and convert.snp.tped() (same for *.tped and *.tfam). An error happened when I tried to use convert.snp.ped() and R was forced to close. I had no problems with convert.snp.tped(). I know of one person who experienced the same issue although I do not know whether the same error message was prompt (and the cause of the problem).
On the other hand, be aware that any of the two functions (although both use as input files containing the phenotypic information) will only produce one of the input files (*.raw) for the GenABEL function that loads data: load.gwaa.data(). The second input file (*.dat) will have to be produced for instance from the *.tfam file. This means that (1) *.tfam does not have header and and *.dat must have (at least 'id' and 'sex' fields are required) and (2) PLINK uses 1/2 coding for sex by default whereas load.gwaa.data() uses 1/0. On the other hand, one would have to create anyway the *.dat file anyway if using a alternative phenotype file in PLINK (to allow for >1 phenotype or covariates). It is not the end of the world but it makes the conversion a bit less direct than expected...
Maybe somebody can add his/her experience?
Subscribe to:
Posts (Atom)