Question

Recommendation for large.n in filterByExpr

duez.tolga • 0

Hi all,

I have a single-cell dataset with two groups of cell types, each comprising 900 cells. I am unsure what to set for large.n in the filterByExpr function, and I couldn't understand the documentation. Could you help me here?

Best, Tolga

scRNAseq edgeR SingleCellExperiment

score 2 · Answer 1 · 2023-08-09

I am the author of the filterByExpr() function, but I don't generally use it for scRNA-seq. My experience is mainly with 10X scRNA-seq, and I usually specify the minimum number of cells that I want a gene to be detected in. For example, I might decide that I want a gene to be detected in at least 100 cells to be kept in the analysis, so I would apply a filter like:

keep <- rowSums(exp_matrix > 0) >= 100

You could perhaps use filterByExpr() to get similar results. In any case, the large.n parameter is not very important in the single-cell context. With such large group sizes it doesn't matter what large.n is set to, as long as it is very small compared to the number of cells in the smallest group. You can leave it as is, or reduce it to 0 or 1. Either way, it will make little difference. To understand this in more detail, suppose that the smallest group in your data contains 900 cells. If you set min.prop=0.5 and large.n=0 then you are requiring a gene to be expressed in at least min.prop * 900 = 450 cells. If you set min.prop=0.5 and large.n=10 then you are requiring a gene to be expressed in at least 10 + 0.5*(900-10) = 455 cells. It is hard to see how that could be a dramatic difference.
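The arithmetic above can be checked with a small R sketch. This is only an illustration of the calculation described in this answer, not edgeR's own code: the function name min_cells_required and the argument min_group_size are made up for the example, the defaults are the documented filterByExpr() defaults (large.n=10, min.prop=0.7), and the behaviour for groups no larger than large.n (the group size is used unchanged) is an assumption.

```r
# Sketch of the cutoff arithmetic described above (not edgeR's own code).
# min_group_size: number of cells in the smallest group (hypothetical name).
min_cells_required <- function(min_group_size, large.n = 10, min.prop = 0.7) {
  if (min_group_size > large.n) {
    # The first large.n cells count in full; the rest are scaled by min.prop.
    large.n + (min_group_size - large.n) * min.prop
  } else {
    # Assumption: small groups are used as they are.
    min_group_size
  }
}

min_cells_required(900, large.n = 0,  min.prop = 0.5)  # 450
min_cells_required(900, large.n = 10, min.prop = 0.5)  # 455
```

Trying other values of large.n (say 1 or 20) with a group size of 900 shifts the cutoff by only a handful of cells, which is why the parameter matters so little here.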

Your use of min.prop=0.5 seems higher than most people would use for single-cell data. If your group sizes are all 900 or more, then that requires a gene to be expressed in at least 450 cells, which seems a lot. You might reduce min.prop to, say, 0.25, which would then be similar to the Nature Methods paper that ATpoint cites.

score 0 · Answer 2 · 2023-08-09

Previous work (https://www.nature.com/articles/nmeth.4612) indeed showed that differential expression at the single-cell level can be improved by reasonable prefiltering. What the authors did in that paper was to retain only genes with more than 1 TPM in > 25% of cells in at least one of the two contrasted groups. What you could do is calculate CPMs (rather than TPMs), for example with scuttle::calculateCPM(), and then explore the impact of different thresholds, rather than using filterByExpr, which has reasonable defaults, but for bulk rather than scRNA-seq.
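A minimal base-R sketch of that kind of filter might look like the following. Everything here is illustrative: the count matrix is simulated, CPMs are computed by hand rather than with a package function, and filter_by_group_cpm with its min_cpm and min_prop arguments is a hypothetical helper, not something from edgeR or the cited paper.

```r
set.seed(1)
# Toy count matrix: 200 genes x 60 cells, two groups of 30 cells (simulated).
counts <- matrix(rnbinom(200 * 60, mu = 2, size = 0.5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:60)))
group <- factor(rep(c("A", "B"), each = 30))

# Counts per million, computed by hand from each cell's library size.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Hypothetical helper: keep genes above `min_cpm` in more than `min_prop`
# of the cells of at least one group.
filter_by_group_cpm <- function(cpm, group, min_cpm = 1, min_prop = 0.25) {
  # One column per group: fraction of that group's cells above the cutoff.
  prop_by_group <- sapply(levels(group), function(g) {
    rowMeans(cpm[, group == g, drop = FALSE] > min_cpm)
  })
  rowSums(prop_by_group > min_prop) > 0
}

keep <- filter_by_group_cpm(cpm, group)
table(keep)
filtered <- counts[keep, , drop = FALSE]
```

Rerunning the last three lines with different min_cpm and min_prop values is one way to explore how sensitive the retained gene set is to the thresholds. Note that with shallow per-cell libraries a single count can already exceed 1 CPM, in which case the CPM cutoff behaves much like a simple detection filter.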