Hi everyone,
I would like to know whether excluding genes with extremely high expression is appropriate for scran normalization in single-cell RNA-seq analysis. This type of exclusion is used in the library-size normalization of the SPRING tool, where only genes making up <5% of total counts in every cell are used for normalization.
This type of exclusion might help with datasets where some cell types have a few dominant genes making up most of the counts, like secretory cells such as pancreatic beta cells or Paneth cells of the small intestinal epithelium.
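To make the exclusion rule concrete, here is a minimal sketch of how I understand it, written in plain Python. This is not SPRING's actual code: the function name, the `max_frac` parameter and the toy counts are my own assumptions, and the toy example uses a 50% threshold only because it has four genes.

```python
def size_factors_excluding_dominant(counts, max_frac=0.05):
    """Library-size factors computed after dropping dominant genes.

    counts: list of cells, each a list of per-gene counts (cells x genes).
    Assumes every cell has a non-zero total count.
    """
    n_genes = len(counts[0])
    totals = [sum(cell) for cell in counts]

    # A gene is kept only if it stays below max_frac of the total in EVERY cell.
    keep = [
        all(cell[g] < max_frac * total for cell, total in zip(counts, totals))
        for g in range(n_genes)
    ]

    # Library sizes are recomputed from the retained genes only.
    sizes = [sum(c for c, k in zip(cell, keep) if k) for cell in counts]

    # Scale so the factors average to 1 across cells.
    mean_size = sum(sizes) / len(sizes)
    factors = [s / mean_size for s in sizes]
    return keep, factors


# Hypothetical toy data: gene 0 dominates the second (secretory-like) cell.
toy_counts = [
    [10, 20, 30, 40],
    [900, 20, 30, 50],
]
keep, factors = size_factors_excluding_dominant(toy_counts, max_frac=0.5)
```

Here gene 0 is dropped because it makes up 90% of the second cell, so the two cells end up with similar size factors instead of differing tenfold.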
An example is the Haber et al. 2017 study of the small intestinal epithelium (doi:10.1038/nature24489). In the full-length atlas dataset (count matrix downloaded from ebi.ac.uk/gxa/sc), over 50% of counts in Paneth cells come from just 10 genes (judging by the scater QC function "plotHighestExprs"). In this dataset, the CPM values for most genes encoding early secretory pathway components (e.g. Golgi and secretory vesicle proteins) are lower in Paneth cells than in intestinal stem cells. This seems unrealistic, as the Paneth cell is a large secretory cell with a larger Golgi apparatus and many more secretory vesicles than a stem cell. If the 10 most highly expressed genes in Paneth cells are simply removed from the count matrix before computing CPM, most genes encoding early secretory pathway components have higher CPM values in Paneth cells than in stem cells.
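The effect described above can be reproduced with simple arithmetic. The sketch below uses invented counts (not taken from the Haber et al. data) and a hypothetical helper `cpm`. It only illustrates how one dominant gene can flip the direction of a CPM comparison.

```python
def cpm(cell, exclude=()):
    """Counts per million; the total ignores gene indices listed in `exclude`."""
    total = sum(c for i, c in enumerate(cell) if i not in exclude)
    return [1e6 * c / total for c in cell]


# Invented counts. Gene order:
# [dominant secreted product, secretory pathway gene, housekeeping gene]
stem = [0, 50, 950]
paneth = [9000, 100, 900]

# Totals include the dominant gene: the pathway gene looks LOWER in the Paneth cell.
naive_stem, naive_paneth = cpm(stem)[1], cpm(paneth)[1]

# Totals exclude the dominant gene (index 0): the pathway gene looks HIGHER.
adj_stem, adj_paneth = cpm(stem, exclude={0})[1], cpm(paneth, exclude={0})[1]
```

In this toy case the Paneth-like cell has twice the raw counts for the pathway gene, yet its naive CPM is five times lower, because the dominant gene inflates the denominator.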
Could the exclusion of genes during normalization be performed by providing a whitelist of genes with the "subset.row" argument for the function "computeSumFactors"?
Best regards, Daniel