Hi all,
I generated PCA plots for my RNASeq data (3 biological replicates per sample) using rld and vsd normalized count data (images below).
I then removed data for three samples that appeared like outliers on the PCA plots (marked in red circle) and also I know that those three samples have mutations in certain genes and other samples should be wild-type like. I regenerated the PCA plot with all samples except the three outliers and get very different PCA plots using rld and vsd normalized counts.
I am therefore wondering what could contribute to the differences? Also the blue dots and the red dots in the PCA represent samples processed as different experiments (batch effects). So, can I use design = ~ batch + condition to account for the batch effect. How should I re-plot PCA for the normalized counts after removing the batch effect ?
Regards,
Priyanka
Thanks for your help Micheal. Based on the comments on Bioconductor DESeq2 threads, I understood that I can remove batch effects by simply accounting for it in the design.
I tried the following command: ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~ batch + group).
I get the following error : Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified.One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.
Here is my experimental design (link for the table below): I have 50 strains in triplicates (150 samples). The 50 strains are from three genetic groups (group 1,2 and 3). Majority of the strains in each group are wild-type, however some strains have mutations in specific genes (status column). From the PCA plots (shown in the above thread), I see clustering due to batch effect - strains done as different experiments (batch - A and B) cluster together. So here are my two questions:
1. I want to compare strains based on genetic groups and account for the genetic status (wild type or not) and the batch effects. Is the following design appropriate:
design = ~batch+status+group
Finally perform wald tests for different contrasts: group 1 versus group 2; group 2 versus group 3 and group 3 versus group 1
2. How do I resolve the error : checkFullRank(modelMatrix) ?
Here is the link for Coldata:
https://drive.google.com/open?id=0B3ZNauEYziEHZjRuN1pyOVpJVUU
"I can remove batch effects by simply accounting for it in the design"
You can control for them in differential expression, but including them in the design will not remove their contribution to distances in the PCA plot. For that you need to use limma's removeBatchEffect() on the transformed data, providing the batches. See my comment on the other Answer below.
Batch and group are not confounded in that table you sent. Are you sure this was the design that gave the error?
Hi Micheal,
Sorry, I reran the design and didn't get the error anymore. Must be a mistake on my end. I used the following design:
ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~batch+status+group)
vsd_b <- varianceStabilizingTransformation(dds, blind=FALSE)
vstMat_b <- assay(vsd, blind=FALSE)
library(limma)
mat2<-removeBatchEffect(vstMat_b)
After removing batch effects, when I make the PCA plot using the following code, I get the 2 errors:
library(ggplot2)
data <- plotPCA(vsd, intgroup=c("status","group"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=group, shape=status)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(color="black",size=1.5, aes(label=strain), check_overlap = TRUE)+
coord_fixed()
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’
Error in if (is.waive(data) || empty(data)) return(cbind(data, PANEL = integer(0))) :
missing value where TRUE/FALSE needed
Here is the link for the sessionInfo
https://drive.google.com/open?id=0B3ZNauEYziEHWnRCVXBrcTJreUk
hi,
Two things. First, to remove batch effects, you need to provide the batch variable(s) to the removeBatchEffect() function, as additional arguments. See the help file for that function. In the code above you just gave the function the data matrix, but you didn't provide the batch variable(s), so I think nothing will be done to the matrix.
Also my suggestion was:
"running it on the assay() of a transformed dataset and reassigning to assay()".
In R code, reassigment to the same slot looks like:
Thanks, once again. I realized that I had not added the batch variable, but I had already posted my response by then (Oops). Okay, so I used the following command and replotted PCA:
assay(vsd)<-removeBatchEffect(assay(vsd), vsd$batch)
The PCA plots looks like this now:
I see less defined clustering by batch in the second PCA. What do you think?