Hello,
I am analyzing a large cohort (1000+ samples) for differential expression analysis.
I am using Deseq2 to process raw counts from samples correcting by two confounding factors (namely sex and histology origin).
As output we obtain a shortlist of interesting candidates with strong adjusted p-value (e.g. p < e-14) and would like to plot the result so that I can visually show such a good result.
Searching around in the net (e.g. PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different? , or DESeq2 - Acquiring batch-corrected values for PCA and hierarchical clustering) I resolved to try to do this using limma batchEffectCorrection() on vst(dds).
Once I plot the output as a boxplot, it does not look as distribution close to a p.value < 1e-14 ( running a t.test on them results in 0.002)
I am pretty sure there is something I am missing in my processing.
Can you suggest a way to improve what I am doing so that what I see resembles the p.value obtained by Deseq analysis?
Here it's what I have done:
library(limma) library(DESeq2) library(ggplot2) genes=read.table("raw_counts.csv",h=T,row.names=1) info=read.table("matched_metadata.csv",sep='\t',h=T,row.names=1) ddsMat = DESeqDataSetFromMatrix(countData=genes, colData=info, design=~donor_sex + histology + STATUS) ddsMat <- estimateSizeFactors(ddsMat) ddsMat <- estimateSizeFactors(ddsMat) idx <- rowSums( counts(ddsMat, normalized=TRUE) >= 5 ) >= 8 ddsMat <- ddsMat[idx,] #now run deseq2 dds <- DESeq(ddsMat, parallel=T,BPPARAM = param) myres2 = results(dds_results, parallel=T,BPPARAM = param,lfcThreshold = 1) #now VST+ batch removal of the 2 confounding factors dds_vst= vst(dds,blind =F ) dds_batch_removed = removeBatchEffect(assay(dds_vst),batch= info$histology, batch2=info$donor_sex , design=model.matrix(~info$STATUS)) #now plot this a =as.data.frame(t(dds_batch_removed)) a$STATUS= info$STATUS ggplot(a,aes(x=STATUS , y= ENSG00000171819.4)) + geom_boxplot()
--
My result for the top gene is like this in the image below, I think I'm doing something wrong in the process
https://drive.google.com/open?id=1bhfEzQlHj5c_ElhyrQQaSwSsrk83zpGk
Thanks a lot for your help
Mattia
Can you post the plotCounts() for this gene straight from DESeq2? This may help us see the differences better. Also, what is the mean of VST values in the two groups (after you've used removeBatchEffect)?
Hello,
thanks a lot for the prompt reply.
Here the image of the plot counts :
plotCounts(dds=dds,gene="ENSG00000171819.4", intgroup= 'STATUS')
https://drive.google.com/open?id=1i16bFQgbVxC9775Xya5eMtpwhyc8zK8d
And here the results of the mean VST values in the two groups:
Finally, these are the results from the DESeq2 analysis for this gene just in case it may help.
Thanks again,
Mattia