Recommendation for large.n in filterByExpr
duez.tolga • 0
@dba43dea
Germany

Hi all,

I have a single-cell dataset with two groups of cell types, where each group comprises 900 cells. I am unsure what value to set for large.n in the filterByExpr function, and I couldn't understand the documentation. Could you help me here?


Best, Tolga

scRNAseq edgeR SingleCellExperiment • 1.3k views
ADD COMMENT
@gordon-smyth
WEHI, Melbourne, Australia

I am the author of the filterByExpr() function, but I don't generally use it for scRNA-seq. My experience is mainly with 10X scRNA-seq, and I usually specify the minimum number of cells that I want a gene to be detected in. For example, I might decide that I want a gene to be detected in at least 100 cells to be kept in the analysis, so I would apply a filter like:

keep <- rowSums(exp_matrix > 0) >= 100
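As a minimal sketch of how that filter behaves, here it is applied to a toy count matrix. The matrix values, the gene and cell names, and the `min_cells` threshold of 3 are all invented for illustration; a real 10X dataset would use a threshold closer to the 100 cells mentioned above.

```r
# Toy count matrix: 4 genes x 6 cells (values invented for illustration).
exp_matrix <- matrix(
  c(0, 3, 0, 1, 2, 0,   # geneA: detected in 3 cells
    0, 0, 0, 0, 1, 0,   # geneB: detected in 1 cell
    5, 2, 7, 1, 4, 9,   # geneC: detected in 6 cells
    0, 0, 0, 0, 0, 0),  # geneD: never detected
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("cell", 1:6))
)

# Hypothetical threshold for this toy example only.
min_cells <- 3

# A gene counts as detected in a cell if its count there is above zero.
n_detected <- rowSums(exp_matrix > 0)
keep <- n_detected >= min_cells

# Subset the matrix to the genes that pass the filter.
filtered <- exp_matrix[keep, , drop = FALSE]
print(rownames(filtered))
```

Only geneA and geneC survive here, because they are the only genes with a nonzero count in at least 3 cells.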

You could perhaps use filterByExpr() to get similar results. In any case, the large.n parameter is not very important in the single-cell context. With such large group sizes, it doesn't matter what large.n is set to, as long as it is very small compared to the number of cells in the smallest group. You can leave it as is, or reduce it to 0 or 1. Either way, it will make little difference.

To understand this in more detail, suppose that the smallest group in your data contains 900 cells. If you set min.prop=0.5 and large.n=0, then you are requiring a gene to be expressed in at least min.prop * 900 = 450 cells. If you set min.prop=0.5 and large.n=10, then you are requiring a gene to be expressed in at least 10 + 0.5*(900-10) = 455 cells. It is hard to see how that could be a dramatic difference.
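The arithmetic above can be reproduced with a small helper. This is only a sketch: `min_cells_required` is a hypothetical name, not an edgeR function, and the branch for groups no larger than large.n is an assumption about how small groups are handled rather than something stated in this thread.

```r
# Hypothetical helper (not part of edgeR) that mirrors the arithmetic
# described above for the smallest group in the data.
min_cells_required <- function(smallest_group, min.prop, large.n) {
  if (smallest_group > large.n) {
    # Large group: large.n cells plus a proportion of the remainder.
    large.n + (smallest_group - large.n) * min.prop
  } else {
    # Assumption: for small groups, every cell in the group is required.
    smallest_group
  }
}

# Effect of large.n with min.prop fixed at 0.5 and 900 cells per group.
n0  <- min_cells_required(900, min.prop = 0.5, large.n = 0)   # 450
n10 <- min_cells_required(900, min.prop = 0.5, large.n = 10)  # 455

# Effect of min.prop with large.n fixed at 10.
n_low_prop <- min_cells_required(900, min.prop = 0.25, large.n = 10)  # 232.5

print(c(large.n_0 = n0, large.n_10 = n10, min.prop_0.25 = n_low_prop))
```

Changing large.n from 0 to 10 moves the requirement by 5 cells, whereas halving min.prop roughly halves it, which is why min.prop is the parameter worth tuning here.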

Your use of min.prop=0.5 seems higher than most people would use for single-cell data. If your group sizes are all 900 or more, then that requires a gene to be expressed in more than 450 cells, which seems a lot. You might reduce min.prop to, say, 0.25, which would then be similar to the Nature Methods paper that ATpoint cites.

ADD COMMENT

Are you sure that it will make little difference? The number of false positives increases tremendously after reducing large.n to 1 and min.prop to 0.25.

ADD REPLY

I said that reducing large.n would make little difference, and that is mathematically true because your group sizes are so much larger than large.n.

I did not say that reducing min.prop would make no difference. Quite the opposite, I suggested the change because it would make a difference.

How exactly have you determined that the number of false discoveries has increased? When analysing real data, we generally do not know for certain what is a true or false discovery.

ADD REPLY

Hi, many thanks! Now the bigger picture is clear to me.

I have labels because I am working with a simulated dataset.

ADD REPLY
ATpoint ★ 4.6k
@atpoint-13662
Germany

Previous work (https://www.nature.com/articles/nmeth.4612) indeed showed that differential expression at the single-cell level can be improved by reasonable prefiltering. What the authors did in that paper was to retain only genes with more than 1 TPM in > 25% of cells in at least one of the two contrasted groups. What you could do is calculate CPMs (rather than TPMs), for example with scuttle::calculateCPM, and then explore the impact of different thresholds, rather than using filterByExpr, which has reasonable defaults, but for bulk rather than scRNA-seq.
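A rough sketch of that kind of per-group CPM filter, in base R so it runs without Bioconductor installed. The toy counts, the group labels, and the variable names `cpm_cutoff` and `min_frac` are assumptions made for illustration; CPMs are computed by hand here, and the cutoffs are meant to be varied rather than taken as recommendations.

```r
# Toy counts: 3 genes x 8 cells, two groups of 4 cells (values invented).
counts <- matrix(
  c(4, 2, 0, 0,  0, 0, 0, 0,   # gene1: 2 of 4 cells in group A, none in B
    1, 0, 0, 0,  0, 0, 0, 3,   # gene2: 1 of 4 cells in each group
    9, 8, 7, 9,  6, 8, 9, 7),  # gene3: expressed in every cell
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:8))
)
group <- factor(rep(c("A", "B"), each = 4))

# Counts per million: divide each cell (column) by its library size.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Hypothetical thresholds to explore.
cpm_cutoff <- 1
min_frac <- 0.25

# Fraction of cells in each group where the gene exceeds the CPM cutoff.
frac_by_group <- sapply(levels(group), function(g) {
  rowMeans(cpm[, group == g, drop = FALSE] > cpm_cutoff)
})

# Keep a gene if it passes in at least one of the groups.
keep <- rowSums(frac_by_group > min_frac) > 0
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

In this toy case gene2 is dropped because it is expressed in exactly 25% of the cells in each group, which does not exceed the 25% threshold. Rerunning with different values of `cpm_cutoff` and `min_frac` shows how many genes each choice retains.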

ADD COMMENT
