Differences in PCA plot - rld versus vsd normalized data (DESeq2)
2
0
Entering edit mode
pkachroo ▴ 10
@pkachroo-11576
Last seen 4.0 years ago

Hi all,

I generated PCA plots for my RNASeq data (3 biological replicates per sample) using rld and vsd normalized count data (images below). 

I then removed data for three samples that appeared like outliers on the PCA plots (marked in red circle) and also I know that those three samples have mutations in certain genes and other samples should be wild-type like. I regenerated the PCA plot with all samples except the three outliers and get very different PCA plots using rld and vsd normalized counts. 

I am therefore wondering what could contribute to the differences? Also the blue dots and the red dots in the PCA represent samples processed as different experiments (batch effects). So, can I use  design = ~ batch + condition to account for the batch effect. How should I re-plot PCA for the normalized counts after removing the batch effect ?

Regards,

Priyanka 

 

 

 

 

 

deseq2 PCA batch effect • 7.3k views
ADD COMMENT
0
Entering edit mode

Thanks for your help Micheal. Based on the comments on Bioconductor DESeq2 threads, I understood that I can remove batch effects by simply accounting for it in the design.

I tried the following command: ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~ batch + group).

I get the following error : Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified.One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.

Here is my experimental design (link for the table below): I have 50 strains in triplicates (150 samples). The 50 strains are from three genetic groups (group 1,2 and 3). Majority of the strains in each group are wild-type, however some strains have mutations in specific genes (status column). From the PCA plots (shown in the above thread), I see clustering due to batch effect - strains done as different experiments  (batch - A and B) cluster together. So here are my two questions:

1. I want to compare strains based on genetic groups and account for the genetic status (wild type or not) and the batch effects. Is the following design appropriate:

design = ~batch+status+group

Finally perform wald tests for different contrasts: group 1 versus group 2; group 2 versus group 3 and group 3 versus group 1

2. How do I resolve the error : checkFullRank(modelMatrix)  ?

Here is the link for Coldata: 

https://drive.google.com/open?id=0B3ZNauEYziEHZjRuN1pyOVpJVUU

 

 

ADD REPLY
0
Entering edit mode

"I can remove batch effects by simply accounting for it in the design"

You can control for them in differential expression, but including them in the design will not remove their contribution to distances in the PCA plot. For that you need to use limma's removeBatchEffect() on the transformed data, providing the batches. See my comment on the other Answer below.

Batch and group are not confounded in that table you sent. Are you sure this was the design that gave the error?

ADD REPLY
0
Entering edit mode

Hi Micheal,

Sorry, I reran the design and didn't get the error anymore. Must be a mistake on my end. I used the following design: 

ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~batch+status+group)

vsd_b <- varianceStabilizingTransformation(dds, blind=FALSE)
vstMat_b <- assay(vsd, blind=FALSE)

library(limma)

mat2<-removeBatchEffect(vstMat_b)

After removing batch effects, when I make the PCA plot using the following code, I get the 2 errors: 

library(ggplot2)
data <- plotPCA(vsd, intgroup=c("status","group"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=group, shape=status)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  geom_text(color="black",size=1.5, aes(label=strain), check_overlap = TRUE)+
  coord_fixed()

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

Error in if (is.waive(data) || empty(data)) return(cbind(data, PANEL = integer(0))) : 
  missing value where TRUE/FALSE needed

Here is the link for the sessionInfo

https://drive.google.com/open?id=0B3ZNauEYziEHWnRCVXBrcTJreUk

 

  

ADD REPLY
0
Entering edit mode

hi,

Two things. First, to remove batch effects, you need to provide the batch variable(s) to the removeBatchEffect() function, as additional arguments. See the help file for that function. In the code above you just gave the function the data matrix, but you didn't provide the batch variable(s), so I think nothing will be done to the matrix.

Also my suggestion was: 

"running it on the assay() of a transformed dataset and reassigning to assay()".

In R code, reassigment to the same slot looks like:

assay(vsd) <- ...
ADD REPLY
0
Entering edit mode

Thanks, once again. I realized that I had not added the batch variable, but I had already posted my response by then (Oops). Okay, so I used the following command and replotted PCA: 

assay(vsd)<-removeBatchEffect(assay(vsd), vsd$batch) 

The PCA plots looks like this now:

I see less defined clustering by batch in the second PCA. What do you think?

ADD REPLY
0
Entering edit mode
drikaul • 0
@drikaul-13606
Last seen 7.3 years ago
La Jolla, CA

You could try removing the batch effects, limma has a function you could use for that (http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/removeBatchEffect.html). But from the plots, it seems like there's a distinction in the data already.

Moreover, I would recommend using plotMDS instead of PCA for DGE data. Not sure why there is such a difference in the two normalization methods, but the vsd one seems more like it! 

ADD COMMENT
0
Entering edit mode
Yes you can use removeBatchEffect on the transformed counts by running it on the assay() of a transformed dataset and reassigning to assay(). Note that MDS vs PCA is not causing a difference. MDA with Euclidean distance produces the same plot as PCA.
ADD REPLY
0
Entering edit mode
@mikelove
Last seen 4 hours ago
United States
I'd recommend blind=FALSE (discussed in vignette) and the VST when there are many samples as here. There is a similar thread to this one that was posted a few days ago. You can find it by clicking the DESeq2 tag on this post.
ADD COMMENT

Login before adding your answer.

Traffic: 723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6