Hi! I am not sure whether this question has already been answered, but I haven't found it. Sorry if it is a repeated question.
I am currently analysing a bulk RNA-seq dataset. It contains 18 samples from 3 different patients (6 conditions per patient: 3 cell types, each with and without treatment). During data exploration, I can see on PC1 and PC2 that 2 of those conditions are much more similar to each other than to any of the other conditions. As these 2 conditions are the ones we are most interested in, I performed differential gene expression analysis with both edgeR and DESeq2, either including all 18 samples or only the 6 of interest. I got different results between the two approaches (as expected), but I was surprised to see very few differentially expressed genes between the two conditions when all the samples were included to calculate the variance, especially with edgeR. I would imagine that this is due to an increase in the BCV when the more variable samples are included. Is this correct?
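As a rough illustration of the intuition in the question above (and not of how edgeR or DESeq2 actually estimate dispersion), the sketch below uses an ordinary linear-model analogy: when a single variance is shared across all groups, the pooled within-group variance is a degrees-of-freedom-weighted average, so noisier groups inflate it and shrink the test statistic for a contrast between two tight groups. The log2 values are made up for one hypothetical gene, the function names are illustrative only, and patient pairing is ignored for simplicity.

```python
import numpy as np

def pooled_variance(groups):
    """Return the pooled within-group variance and its residual degrees of freedom."""
    ss = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return ss / df, df

def contrast_t(group_a, group_b, s2):
    """t-like statistic for group_b minus group_a, given a shared variance s2."""
    se = np.sqrt(s2 * (1 / len(group_a) + 1 / len(group_b)))
    return (group_b.mean() - group_a.mean()) / se

# Made-up log2 expression for one gene, 3 samples per condition.
cond_a = np.array([5.0, 5.1, 4.9])   # condition of interest 1 (tight)
cond_b = np.array([5.5, 5.6, 5.4])   # condition of interest 2 (tight)
others = [                           # the four remaining, noisier conditions
    np.array([3.0, 4.0, 5.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.5, 4.0]),
    np.array([10.0, 11.0, 12.0]),
]

s2_subset, df_subset = pooled_variance([cond_a, cond_b])
s2_all, df_all = pooled_variance([cond_a, cond_b] + others)

t_subset = contrast_t(cond_a, cond_b, s2_subset)
t_all = contrast_t(cond_a, cond_b, s2_all)

print(f"subset: s2={s2_subset:.4f}, df={df_subset}, t={t_subset:.2f}")
print(f"all:    s2={s2_all:.4f}, df={df_all}, t={t_all:.2f}")
```

In this toy setting, including all groups buys more residual degrees of freedom but at the cost of a larger shared variance. Whether the same trade-off explains the edgeR and DESeq2 results here would need to be checked on the real data.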
My question is: would it make sense to do the analysis using only the samples of interest for the BCV calculation? And what if I were interested in comparing how 2 cell types change differently before and after treatment? Could I compute the ratio of the counts (or the subtraction of the log2 counts) manually and then calculate the BCV using those values?
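To make the second question concrete, here is a minimal sketch, on made-up log2 values for one hypothetical gene, showing that the manual "difference of log2 differences" is the same quantity as the cell type by treatment interaction coefficient in a linear model with a patient blocking term (in a balanced design like this one). edgeR and DESeq2 fit count-based generalized linear models, so this is only an analogy on the log2 scale. In those packages the comparison would usually be expressed as an interaction term in the design rather than by transforming counts manually.

```python
import numpy as np

# Made-up log2 expression for one gene.
# Rows: 3 patients. Columns: A untreated, A treated, B untreated, B treated.
log2_expr = np.array([
    [4.0, 5.0, 6.0, 8.5],
    [4.5, 5.4, 6.2, 8.9],
    [3.8, 4.9, 5.9, 8.2],
])

# Manual approach: treatment effect in B minus treatment effect in A,
# computed within each patient and then averaged.
effect_a = log2_expr[:, 1] - log2_expr[:, 0]
effect_b = log2_expr[:, 3] - log2_expr[:, 2]
manual = (effect_b - effect_a).mean()

# Linear-model approach: y ~ patient + cell_type + treatment + cell_type:treatment
rows, y = [], []
for patient in range(3):
    for cell_b in (0, 1):
        for treated in (0, 1):
            rows.append([
                1,                   # intercept
                int(patient == 1),   # patient 2 vs patient 1
                int(patient == 2),   # patient 3 vs patient 1
                cell_b,              # cell type B vs A
                treated,             # treated vs untreated
                cell_b * treated,    # interaction term
            ])
            y.append(log2_expr[patient, 2 * cell_b + treated])

X = np.array(rows, dtype=float)
coef, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)
interaction = coef[-1]

print(f"manual difference of differences: {manual:.3f}")
print(f"interaction coefficient:          {interaction:.3f}")
```

The two numbers agree, which suggests that nothing is gained by forming the ratios manually, while fitting the model on all the relevant samples keeps the per-sample information available for estimating variability.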
Thanks for any help!
I would recommend picking one pipeline for your analysis instead of analyzing with two different methods.
Hi. Yes, I am aware of that. From my understanding, including all the samples in the DESeqDataSet (or the edgeR equivalent) is better and provides more power (and that is what I have seen before when working with microarray data in limma). However, in this case I got basically no differentially expressed genes with this approach, and because of the PCA results I decided to try including only the samples of interest in the DESeqDataSet. Doing this, the number of differentially expressed genes was larger (about 150 in edgeR and 200 in DESeq2).
This is where my question comes from. I feel a bit concerned about dropping samples before including them in the DESeqDataSet, but because PC1 and PC2 suggest that 2 conditions are much more similar to each other than to any other sample in the dataset (even those of the same condition), I feel that the TCV/BCV calculation may be misleading. Would that make sense?