Deseq2 problem visualizing gene expression with batch corrected values
1
0
Entering edit mode
Mbosio85 • 0
@mbosio85-16227
Last seen 6.4 years ago
Barcelona Supercomputing Center

Hello,


I am analyzing a large cohort (1000+ samples) for differential expression analysis.
I am using Deseq2 to process raw counts from  samples correcting by two confounding factors (namely sex and histology origin).

As output we obtain a shortlist of interesting candidates with strong adjusted p-value (e.g. p < e-14) and would like to plot the result so that I can visually show such a good result.

Searching around in the net (e.g. PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different? , or DESeq2 - Acquiring batch-corrected values for PCA and hierarchical clustering) I resolved to try to do this using  limma batchEffectCorrection() on vst(dds).

Once I plot the output as a boxplot, it does not look as distribution close to a p.value < 1e-14 ( running a t.test on them results in 0.002)

I am pretty sure there is something I am missing in my processing.
Can you suggest a way to improve what I am doing so that what I see resembles the p.value obtained by Deseq analysis?

Here it's what I have done:

library(limma)
library(DESeq2)
library(ggplot2)
genes=read.table("raw_counts.csv",h=T,row.names=1)
info=read.table("matched_metadata.csv",sep='\t',h=T,row.names=1)
ddsMat = DESeqDataSetFromMatrix(countData=genes,
                                colData=info, 
                                design=~donor_sex + histology + STATUS)
ddsMat <- estimateSizeFactors(ddsMat)
ddsMat <- estimateSizeFactors(ddsMat)
idx <- rowSums( counts(ddsMat, normalized=TRUE) >= 5 ) >= 8
ddsMat <- ddsMat[idx,]

#now run deseq2
dds <- DESeq(ddsMat, parallel=T,BPPARAM = param)
myres2 = results(dds_results, parallel=T,BPPARAM = param,lfcThreshold = 1)

#now VST+ batch removal of the 2 confounding factors 
dds_vst= vst(dds,blind =F )
dds_batch_removed = removeBatchEffect(assay(dds_vst),batch= info$histology,
                                      batch2=info$donor_sex ,
                                      design=model.matrix(~info$STATUS))
#now plot this
a =as.data.frame(t(dds_batch_removed))
a$STATUS= info$STATUS
ggplot(a,aes(x=STATUS , y= ENSG00000171819.4)) + geom_boxplot()

--

My result for the top gene is like this in the image below, I think I'm doing something wrong in the process 

https://drive.google.com/open?id=1bhfEzQlHj5c_ElhyrQQaSwSsrk83zpGk

Thanks a lot for your help 

Mattia

deseq2 limma differential expression visualization • 1.6k views
ADD COMMENT
0
Entering edit mode

Can you post the plotCounts() for this gene straight from DESeq2? This may help us see the differences better. Also, what is the mean of VST values in the two groups (after you've used removeBatchEffect)?

ADD REPLY
0
Entering edit mode

Hello,
thanks a lot for the prompt reply.

Here the image of the plot counts :
plotCounts(dds=dds,gene="ENSG00000171819.4", intgroup= 'STATUS')
 https://drive.google.com/open?id=1i16bFQgbVxC9775Xya5eMtpwhyc8zK8d 

And here the results of the mean VST values in the two groups:

> batch_removed_t = as.data.frame(t(dds_batch_removed))
> mean(dds_batch_removed_t[info$STATUS=='Cases',]$ENSG00000171819.4)
[1] 4.519137
> mean(dds_batch_removed_t[info$STATUS=='Control',]$ENSG00000171819.4)
[1] 4.066109

Finally, these are the results from the DESeq2 analysis for this gene just in case it may help.

> myres2[rownames(myres2)=='ENSG00000171819.4',]
log2 fold change (MAP): STATUS Cases vs Control 
Wald test p-value: STATUS Cases vs Control 
DataFrame with 1 row and 6 columns
                   baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                  <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000171819.4  75.91621       2.722376 0.1932445  8.912937 4.969837e-19 2.029135e-14

Thanks again,

Mattia

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

hi Mattia,

I think it's clear the null is false here, so a test should reject the null. I guess the issue is more than DESeq2 is giving a p-value that feels "too small". My guess as to why the p-value is so small, is that DESeq2 is sensitive to the additional high counts in the cases. I'd recommend two ways to go, if you feel like you want the statistical procedure to be less sensitive to genes like this one (while it is likely not null, you just want this gene to not get so low a p-value). You can feed the batch-corrected variance stabilized data into a rank test, such as Wilcoxon (here you have thousands of samples I'm not worried about informing the statistical model about the few degrees of freedom used in removing shifts associated with sex and histology code). The siggenes package has a function rowWilcoxon() for example, and then you can use method="BH" adjustment using the p.adjust() function. Or another alternative would be to use limma-voom (here you would supply limma-voom with the original counts, so not using VST or removeBatchEffect). I believe it will give this gene a higher p-value although it will still easily reject the null. limma-voom is less sensitive to a minority of high count samples, because it is computing differences on data that is the log of scaled counts, rather than on the original counts.

ADD COMMENT
0
Entering edit mode

Thanks a lot Mike for your answer. I will give these alternative a try, to see if they can find some genes both rejecting the null hypothesis, and showing a more easily interpretable behavior : ) 

 

ADD REPLY

Login before adding your answer.

Traffic: 764 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6