Question

Mean Expression Calculation scRNA-seq (All cells or only Expressed Cells)



chitsazanalex ▴ 10

@chitsazanalex-11765

Last seen 7.0 years ago

I'm running a differential expression test in monocle. Its documentation shows that differentialGeneTest() reports the genes that differ under your model, but it doesn't tell you which specific genes go up in particular groups. Per their documentation, they state "We could also simply compute summary statistics such as mean or median expression level on a per-CellType basis to see this, which might be handy if we are looking at more than a handful of genes."

This makes sense, and I have a normalized expression matrix. My main question is: does one normally use all single cells to calculate the mean expression, including cells with no detectable expression, or only the expressing cells? For example, suppose condition 1 has 400 total cells, of which 300 express geneA, and condition 2 has 200 total cells, of which only 50 express geneA. If I'm calculating a fold change (FC) for geneA, do I compare

meanexpression(400 TOTAL cells)/meanexpression(200 TOTAL cells) or

meanexpression(300 EXPRESSING cells)/meanexpression(50 EXPRESSING cells).

I can see how either choice would introduce bias, so I wonder which is used in the field?
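
To make the two options concrete, here is a minimal sketch of the scenario above. The expression values are hypothetical (every expressing cell is given the same made-up normalized value of 2.0), and the helper names `mean_all` and `mean_expressing` are illustrative, not monocle functions.

```python
from statistics import mean

# Hypothetical normalized expression of geneA (not real data):
# every expressing cell gets the same value, 2.0; the rest are 0.
cond1 = [2.0] * 300 + [0.0] * 100   # 400 cells, 300 expressing
cond2 = [2.0] * 50 + [0.0] * 150    # 200 cells, 50 expressing


def mean_all(values):
    """Mean over all cells, zeros included."""
    return mean(values)


def mean_expressing(values):
    """Mean over cells with detectable expression only."""
    expressed = [v for v in values if v > 0]
    return mean(expressed) if expressed else 0.0


fc_all = mean_all(cond1) / mean_all(cond2)
fc_expressing = mean_expressing(cond1) / mean_expressing(cond2)

print(fc_all, fc_expressing)  # prints: 3.0 1.0
```

With identical per-cell levels among expressing cells, the all-cells fold change (3.0) reflects only the difference in the fraction of expressing cells (75% vs 25%), while the expressing-only fold change (1.0) ignores that difference entirely.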

scrnaseq monocle • 3.0k views

updated 7.0 years ago by davide risso ▴ 980 • written 7.0 years ago by chitsazanalex ▴ 10

score 1 · Answer 1 · 2018-05-08

Hi,

I'm not too familiar with the monocle differential expression model, but I'll try to answer your question, which seems more general than monocle.

There is no consensus yet in the field on the best way to compare mean expression across conditions. However, there is a very thorough review of differential expression methods that indirectly answers your question (e.g., a t-test compares the means without treating 0's in any special way, and it seems to work well): https://www.nature.com/articles/nmeth.4612
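
As a rough illustration of what "comparing the means without treating 0's in any special way" looks like, here is a sketch of a Welch t-statistic computed over all cells, zeros included. The data are hypothetical (the questioner's cell counts with a made-up expression value of 2.0), and `welch_t` is an illustrative helper, not a function from the review or from any package.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical normalized expression of geneA (not real data).
cond1 = [2.0] * 300 + [0.0] * 100   # 400 cells, 300 expressing
cond2 = [2.0] * 50 + [0.0] * 150    # 200 cells, 50 expressing


def welch_t(x, y):
    """Welch t-statistic; every cell, including zeros, counts equally."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


t_stat = welch_t(cond1, cond2)
print(round(t_stat, 2))
```

In practice you would use an established implementation on your real normalized matrix; the point here is only that the zeros enter both the means and the variances.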

In our work, we have used a zero-inflated model to "downweight" the 0's that are in excess compared to a negative binomial distribution. This seems to help boost the performance of DE methods developed for bulk RNA-seq and might be a good strategy in your case: i.e., instead of either removing or keeping the 0's in your mean computation, you can downweight them so that they do not influence the mean so much (think about it as a middle ground between your two solutions).
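
The "middle ground" idea can be sketched as a weighted mean. This is only an illustration: it assumes a single fixed weight for every zero (the hypothetical `zero_weight` parameter), whereas a zero-inflated model would estimate a separate weight for each observation from the data. The expression values are made up.

```python
# Hypothetical normalized expression of geneA (not real data).
cond1 = [2.0] * 300 + [0.0] * 100   # 400 cells, 300 expressing
cond2 = [2.0] * 50 + [0.0] * 150    # 200 cells, 50 expressing


def weighted_mean(values, zero_weight):
    """Mean in which each zero counts with weight `zero_weight` (0 to 1).

    zero_weight is an assumed, fixed stand-in for model-derived weights:
    1.0 gives the all-cells mean, 0.0 gives the expressing-cells mean.
    """
    weights = [zero_weight if v == 0 else 1.0 for v in values]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * v for w, v in zip(weights, values)) / total_weight


for zero_weight in (1.0, 0.5, 0.0):
    fold_change = (weighted_mean(cond1, zero_weight)
                   / weighted_mean(cond2, zero_weight))
    print(zero_weight, round(fold_change, 3))
```

Setting `zero_weight` to 1.0 or 0.0 recovers the two options from the question, and intermediate values give fold changes in between.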

More details on our approach can be found in the paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4 and the method is implemented in the zinbwave Bioconductor package.