I have a bulk RNA-seq dataset of over 50 samples that was sequenced in 2 batches. I eventually want to run DESeq2, and I am exploring the data for potential outliers and covariates to include in the model. For initial exploratory purposes I run the following code:
library(DESeq2)

# build the dataset; ~ batch serves as a placeholder design for exploration
dds <- DESeqDataSetFromMatrix(countData = counts, colData = design, design = ~ batch)
dds <- estimateSizeFactors(dds)

# two variance-stabilizing transformations (blind to the design by default)
vsd <- vst(dds)
rld <- rlog(dds)

pcaData <- plotPCA(vsd, intgroup = "batch", returnData = TRUE)
and plot using ggplot. I have a couple of outliers, which I remove, and then I repeat the procedure above with the remaining data. I am puzzled because the two transformations give quite different pictures, showing smaller or larger batch effects (samples coloured blue/green by batch) depending on which transformation is used.
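For reference, the ggplot step was essentially the standard DESeq2 vignette pattern; a minimal sketch, assuming ggplot2 is loaded and pcaData comes from the plotPCA call above:

library(ggplot2)

# plotPCA attaches the percent of variance explained as an attribute
percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(PC1, PC2, color = batch)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))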
I then use ComBat-seq to correct for batch effects, run the above code on the corrected counts, and plot again. The vst-transformed data clearly shows that the batch effects were removed; the rlog-transformed data, however, still points to persistent batch effects.
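The correction step looked roughly like this; a sketch, assuming design$batch holds the batch labels (the actual column name isn't given in the thread):

library(sva)

# adjust the raw counts for batch; the biological condition could be passed
# via the `group` argument so ComBat-seq preserves it (not shown here)
corrected <- ComBat_seq(as.matrix(counts), batch = design$batch)

# rebuild the DESeq2 object from the corrected counts and re-run the PCA code
dds2 <- DESeqDataSetFromMatrix(countData = corrected, colData = design, design = ~ batch)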
Which plot should I believe, and how should I proceed when constructing the DESeq2 model: assume that there are no batch effects, or include batch as a covariate in the GLM?
Thank you.
Thanks a lot for the swift reply.
Does modeling batch in DESeq2 have any limitations? I.e., if the batch effects look large on the PCA plot (as in the rlog plot), should one rather go for a more aggressive batch-correction method like ComBat-seq, and only model batch in the design if the effects are mild?
Since I would like to use my data for downstream (machine learning) applications besides DEG analysis (and for those I need the batch effects removed), would you suggest trusting the output of vst over rlog regarding the success of the batch correction, or do I need to perform additional analysis/visualization using other transformations?
No downside I know of to modeling in the design formula.
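A minimal sketch of that approach; `condition` is a hypothetical name for the biological variable of interest, which isn't specified in the thread:

# batch enters the GLM as a covariate; condition is the variable of interest
dds <- DESeqDataSetFromMatrix(countData = counts, colData = design,
                              design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds)  # condition effect, adjusted for batch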
See the DESeq2 vignette for code to remove batch-associated variance from VST data for downstream applications like PCA.
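The vignette's approach uses limma::removeBatchEffect on the transformed values; a minimal sketch, again assuming a hypothetical `condition` column in colData:

library(limma)

mat <- assay(vsd)
# keep the condition signal in the model so it isn't removed along with batch
mm <- model.matrix(~ condition, colData(vsd))
mat <- removeBatchEffect(mat, batch = vsd$batch, design = mm)
assay(vsd) <- mat
plotPCA(vsd, intgroup = "batch")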