Question

Very variable samples vs similar samples in the same dataset

0

eduardogccm • 0

@eduardogccm-22161

Last seen 2.9 years ago

United Kingdom

Hi! I am not sure whether the following question has already been answered, but I haven't found it. Sorry if it is a repeated question.

I am currently analysing a bulk RNA-seq dataset. It contains 18 samples from 3 different patients (6 conditions per patient: 3 cell types, each with and without treatment). While exploring the data, I can see in PC1 and PC2 that 2 of those conditions are much more similar to each other than to any of the other conditions. As these 2 conditions are the ones we are most interested in, I performed differential gene expression analysis with both edgeR and DESeq2, either including all 18 samples or only the 6 of interest. I got different results from the two approaches (as expected), but I was surprised to see very few differentially expressed genes between the two conditions when including all the samples to calculate the variance, especially with edgeR. I would imagine that this is due to an increase in the BCV when including the more variable samples; is this correct?

My question is: would it make sense to do the analysis using only the samples of interest for the BCV calculation? And what if I were interested in comparing how 2 cell types change differently before and after treatment? Could I compute the ratio of the counts (or the subtraction of the log2 counts) manually and then calculate the BCV using those values?

Thanks for any help!

edger deseq2 • 684 views

ADD COMMENT • link updated 5.3 years ago by Michael Love 43k • written 5.3 years ago by eduardogccm • 0

0

I would recommend picking one pipeline for your analysis instead of analyzing with two different methods.

ADD REPLY • link 5.3 years ago Michael Love 43k

0

Hi. Yes, I am aware of that. From my understanding, including all the samples in the DESeqDataSet (or the edgeR equivalent) is better and provides more power (and that is what I have seen before when working with microarray data in limma). However, in this case I got basically no differentially expressed genes with this approach, and because of the PCA results I decided to try including only the samples of interest in the DESeqDataSet. Doing this, the number of differentially expressed genes was larger (about 150 in edgeR and 200 in DESeq2).

This is where my question comes from. I feel a bit concerned about dropping samples before building the DESeqDataSet, but because PC1 and PC2 suggest that 2 conditions are much more similar to each other than any other samples in the dataset (even those of the same condition), I feel that the TCV/BCV calculation may be misleading. Would that make sense?

ADD REPLY • link 5.3 years ago eduardogccm • 0

score 2 · Accepted Answer · 2019-10-24

2

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

So, in our FAQ we discuss exactly this point; you can take a look there first. We recommend looking at the PCA and then, if you are interested in a pairwise comparison, just going ahead and using only the samples from those two groups to build the dataset.
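To illustrate the intuition behind this recommendation, here is a minimal sketch in plain Python. It is only a Gaussian analogy on made-up log2 values for a single hypothetical gene, not the negative binomial dispersion estimation that edgeR or DESeq2 actually perform, and the helper names `pooled_variance` and `t_statistic` are invented for this example. It shows how a single variance pooled across all groups can be dominated by a few highly variable groups, which shrinks the test statistic for a comparison between two tight groups.

```python
from statistics import mean


def pooled_variance(groups):
    """Pooled within-group variance: residual sum of squares / (N - k)."""
    rss = sum((x - mean(g)) ** 2 for g in groups for x in g)
    dof = sum(len(g) for g in groups) - len(groups)
    return rss / dof


def t_statistic(a, b, variance):
    """Difference in group means divided by its standard error."""
    se = (variance * (1 / len(a) + 1 / len(b))) ** 0.5
    return (mean(b) - mean(a)) / se


# Hypothetical log2 expression values for one gene (3 patients per condition).
cond_a = [5.0, 5.1, 4.9]   # condition of interest, low variability
cond_b = [5.5, 5.6, 5.4]   # condition of interest, low variability
noisy = [                  # the four other conditions, high variability
    [3.0, 6.0, 9.0],
    [2.0, 7.0, 12.0],
    [1.0, 5.0, 9.0],
    [4.0, 8.0, 12.0],
]

# Variance estimated from the two groups of interest only.
var_subset = pooled_variance([cond_a, cond_b])
# Variance estimated from all six groups together.
var_all = pooled_variance([cond_a, cond_b] + noisy)

t_subset = t_statistic(cond_a, cond_b, var_subset)
t_all = t_statistic(cond_a, cond_b, var_all)

print(f"variance: subset={var_subset:.3f}, all groups={var_all:.3f}")
print(f"t statistic: subset={t_subset:.2f}, all groups={t_all:.2f}")
```

With these invented numbers, the same difference in means yields a much smaller statistic when the variance is pooled over all six groups. The trade-off is that the subset estimate rests on fewer residual degrees of freedom, so it is itself noisier.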

ADD COMMENT • link 5.3 years ago Michael Love 43k

0

Ok, I didn't notice it in the FAQ, found it now!

Thanks for pointing me back there and for taking the time to answer.

ADD REPLY • link 5.3 years ago eduardogccm • 0