Question

Differences in PCA plot - rld versus vsd normalized data (DESeq2)

0

Entering edit mode

pkachroo ▴ 10

@pkachroo-11576

Last seen 4.5 years ago

Hi all,

I generated PCA plots for my RNASeq data (3 biological replicates per sample) using rld and vsd normalized count data (images below).

I then removed data for three samples that appeared like outliers on the PCA plots (marked in red circle) and also I know that those three samples have mutations in certain genes and other samples should be wild-type like. I regenerated the PCA plot with all samples except the three outliers and get very different PCA plots using rld and vsd normalized counts.

I am therefore wondering what could contribute to the differences? Also the blue dots and the red dots in the PCA represent samples processed as different experiments (batch effects). So, can I use design = ~ batch + condition to account for the batch effect. How should I re-plot PCA for the normalized counts after removing the batch effect ?

Regards,

Priyanka

deseq2 PCA batch effect • 7.8k views

ADD COMMENT • link 7.7 years ago pkachroo ▴ 10

0

Entering edit mode

Thanks for your help Micheal. Based on the comments on Bioconductor DESeq2 threads, I understood that I can remove batch effects by simply accounting for it in the design.

I tried the following command: ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~ batch + group).

I get the following error : Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified.One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.

Here is my experimental design (link for the table below): I have 50 strains in triplicates (150 samples). The 50 strains are from three genetic groups (group 1,2 and 3). Majority of the strains in each group are wild-type, however some strains have mutations in specific genes (status column). From the PCA plots (shown in the above thread), I see clustering due to batch effect - strains done as different experiments (batch - A and B) cluster together. So here are my two questions:

1. I want to compare strains based on genetic groups and account for the genetic status (wild type or not) and the batch effects. Is the following design appropriate:

design = ~batch+status+group

Finally perform wald tests for different contrasts: group 1 versus group 2; group 2 versus group 3 and group 3 versus group 1

2. How do I resolve the error : checkFullRank(modelMatrix) ?

Here is the link for Coldata:

https://drive.google.com/open?id=0B3ZNauEYziEHZjRuN1pyOVpJVUU

ADD REPLY • link 7.7 years ago pkachroo ▴ 10

0

Entering edit mode

"I can remove batch effects by simply accounting for it in the design"

You can control for them in differential expression, but including them in the design will not remove their contribution to distances in the PCA plot. For that you need to use limma's removeBatchEffect() on the transformed data, providing the batches. See my comment on the other Answer below.

Batch and group are not confounded in that table you sent. Are you sure this was the design that gave the error?

ADD REPLY • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Hi Micheal,

Sorry, I reran the design and didn't get the error anymore. Must be a mistake on my end. I used the following design:

ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~batch+status+group)

vsd_b <- varianceStabilizingTransformation(dds, blind=FALSE)
vstMat_b <- assay(vsd, blind=FALSE)

library(limma)

mat2<-removeBatchEffect(vstMat_b)

After removing batch effects, when I make the PCA plot using the following code, I get the 2 errors:

library(ggplot2)
data <- plotPCA(vsd, intgroup=c("status","group"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=group, shape=status)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(color="black",size=1.5, aes(label=strain), check_overlap = TRUE)+
coord_fixed()

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

Error in if (is.waive(data) || empty(data)) return(cbind(data, PANEL = integer(0))) :
missing value where TRUE/FALSE needed

Here is the link for the sessionInfo

https://drive.google.com/open?id=0B3ZNauEYziEHWnRCVXBrcTJreUk

ADD REPLY • link 7.7 years ago pkachroo ▴ 10

0

Entering edit mode

hi,

Two things. First, to remove batch effects, you need to provide the batch variable(s) to the removeBatchEffect() function, as additional arguments. See the help file for that function. In the code above you just gave the function the data matrix, but you didn't provide the batch variable(s), so I think nothing will be done to the matrix.

Also my suggestion was:

"running it on the assay() of a transformed dataset and reassigning to assay()".

In R code, reassigment to the same slot looks like:

assay(vsd) <- ...

ADD REPLY • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Thanks, once again. I realized that I had not added the batch variable, but I had already posted my response by then (Oops). Okay, so I used the following command and replotted PCA:

assay(vsd)<-removeBatchEffect(assay(vsd), vsd$batch)

The PCA plots looks like this now:

I see less defined clustering by batch in the second PCA. What do you think?

ADD REPLY • link 7.7 years ago pkachroo ▴ 10

Wolfgang Huber · Answer 1 · 2017-07-26

0

Entering edit mode

drikaul • 0

@drikaul-13606

Last seen 7.7 years ago

La Jolla, CA

You could try removing the batch effects, limma has a function you could use for that (http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/removeBatchEffect.html). But from the plots, it seems like there's a distinction in the data already.

Moreover, I would recommend using plotMDS instead of PCA for DGE data. Not sure why there is such a difference in the two normalization methods, but the vsd one seems more like it!

ADD COMMENT • link 7.7 years ago drikaul • 0

0

Entering edit mode

Yes you can use removeBatchEffect on the transformed counts by running it on the assay() of a transformed dataset and reassigning to assay(). Note that MDS vs PCA is not causing a difference. MDA with Euclidean distance produces the same plot as PCA.

ADD REPLY • link updated 7.7 years ago by Wolfgang Huber ★ 13k • written 7.7 years ago by Michael Love 43k

Wolfgang Huber · Answer 2 · 2017-07-27

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 8 hours ago

United States

I'd recommend blind=FALSE (discussed in vignette) and the VST when there are many samples as here. There is a similar thread to this one that was posted a few days ago. You can find it by clicking the DESeq2 tag on this post.

ADD COMMENT • link updated 7.7 years ago by Wolfgang Huber ★ 13k • written 7.7 years ago by Michael Love 43k