ERROR - Size factors should be positive real numbers (using normalize() function)
kushshah

I have a SingleCellExperiment object, and no matter what I do, when I run normalize(filtered.sce), I get the error: size factors should be positive real numbers.

It is my understanding that even though computeSumFactors() coerces size factors to be positive by default if necessary, that doesn't guarantee normalize() will run without error.

I have done a fair amount of QC on my pancreas dataset (Segerstolpe et al., 2016), starting from the 1308 high-quality cells specified in the metadata. Nothing seems to be working:

  • libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
  • feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
  • spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")
    • Together, these three methods removed 62, 73, and 143 cells, respectively, from the original 1308. This seems to be a lot.
  • After defining ave.raw.counts <- calcAverage(sce, use_size_factors=FALSE), I've reduced the sce object down to the genes with ave.raw.counts >= 1, which is about 14000 out of the original 25000 genes (the filters are combined as in the sketch below)
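
For reference, a minimal sketch of how these filters combine (using the sce object and the drop vectors defined above):

    # Combine the three outlier calls and drop the flagged cells.
    discard <- libsize.drop | feature.drop | spike.drop
    filtered.sce <- sce[, !discard]

    # Keep genes with an average raw count of at least 1.
    # Note: filtering rows like this can also throw away the ERCC
    # spike-in rows, which becomes relevant further down this thread.
    ave.raw.counts <- calcAverage(filtered.sce, use_size_factors=FALSE)
    filtered.sce <- filtered.sce[ave.raw.counts >= 1, ]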

When running filtered.sce <- computeSumFactors(filtered.sce), it runs WITHOUT any warning of encountering negative size factor estimates.

However, when running the following two commands, I get a warning and then an error:

  • filtered.sce <- computeSpikeFactors(filtered.sce, type="ERCC", general.use=FALSE)
    • Warning message: zero spike-in counts during spike-in normalization
  • filtered.sce <- normalize(filtered.sce)
    • Error in .local(object,...): size factors should be positive real numbers

I even tried filtering by keep <- ave.raw.counts >= 50 just to see if there was any way I could get it to work, but my final error during normalization was still size factors should be positive real numbers.

I would appreciate any help as to why this may be happening. I can also provide any more information that is required. Thank you so much.

scater scran singlecellexperiment normalize qc
Aaron Lun

First, calm down.

Secondly, let's have a look at the warning:

zero spike-in counts during spike-in normalization

Sounds pretty straightforward. If you don't have any spike-in counts for a cell, you can't compute a meaningful spike-in size factor for that cell. (Technically, the spike-in size factor is reported as zero, which is meaningless; hence the warning.) This then leads to the error in normalize, which would otherwise end up dividing the counts for that cell by zero.
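
A quick way to check whether this is the case, using the sizeFactors() accessor that appears later in this thread:

    # Zero spike-in size factors correspond to cells with no ERCC counts at all.
    ercc.sf <- sizeFactors(filtered.sce, "ERCC")
    summary(ercc.sf)
    sum(ercc.sf == 0)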

So, depending on what you aim to do, you can either:

  1. If you must have the spike-ins for a downstream analysis step, remove the cells with zero spike-in size factors.
  2. Otherwise, remove the spike-ins and proceed onward with all cells.

Of course, you can do both of these steps, e.g., do 1 to estimate the technical mean-variance trend for feature selection, and then do 2 to use all cells for downstream analysis (possibly with the subset of features selected from 1). This is, in fact, exactly what I did with this same data set here.
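
A rough sketch of both options, reusing the object names from the question (the NULL assignment used to drop the spike-in size factors is an assumption about the SingleCellExperiment setter):

    # Option 1: keep the spike-ins, dropping cells with zero spike-in size factors.
    for.hvg <- filtered.sce[, sizeFactors(filtered.sce, "ERCC") > 0]
    for.hvg <- normalize(for.hvg)

    # Option 2: drop the spike-in rows (and their size factors), keeping all cells.
    all.cells <- filtered.sce[!isSpike(filtered.sce), ]
    sizeFactors(all.cells, "ERCC") <- NULL  # assumed to remove the named size factor set
    all.cells <- normalize(all.cells)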

P.S.

Together, these three methods removed 62, 73, and 143 cells, respectively, from the original 1308. This seems to be a lot.

I lose about 10% of cells in routine experiments, so what you're seeing is not so bad. Keep in mind that the three methods will overlap, so the total number of removed cells is unlikely to be the sum of 62, 73 and 143. Of course, what they consider to be "not-low-quality" may or may not be your definition of "high quality". It's all pretty arbitrary and there's a lot of wiggle room during quality control - I mean, what cell isn't damaged by getting dunked in a foreign buffer and shot through microfluidics? They're all going to be a bit screwed up, but the hope is that there's still something useful in there.
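
The overlap is easy to check directly:

    # Cells flagged by more than one filter are only removed once.
    sum(libsize.drop) + sum(feature.drop) + sum(spike.drop)   # upper bound
    sum(libsize.drop | feature.drop | spike.drop)             # cells actually removed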

Another factor is that there are strong patient-to-patient differences in sample processing (e.g., in the spike-in percentages if nothing else), which suggests that batch= should be used in isOutlier. Perhaps I should have done so in my code, but frankly, I was so tired from wrangling their "count" matrix into shape that I just moved on ASAP.
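
For example (the donor column name here is hypothetical; use whatever your colData actually contains):

    # Compute outlier thresholds separately within each donor.
    libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower",
                              log=TRUE, batch=sce$Donor)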

kushshah

This is extremely helpful, thank you so much. I've also batched isOutlier() by individual now.

Had a quick question - does "remove cells with zero spike-in size factors" mean "remove cells whose read count for every spike-in is zero"?

If so, I was looking at the code you linked to. Your for.hvg <- sce.emtab[,sizeFactors(sce.emtab, "ERCC") > 0 & sce.emtab$Donor!="AZ"] line seems to be accomplishing this?

Doing the same with my sce object (specifically, filtered.sce.spike <- filtered.sce[,sizeFactors(filtered.sce,"ERCC") > 0]) results in filtered.sce.spike having zero columns (zero cells). I had defined 72 spike-ins earlier. Am I missing something simple here? Perhaps there is a way I need to denote spike-ins that I have not done properly?
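
For reference, a quick way to inspect how the spike-ins are currently flagged (assuming they were annotated with isSpike() and share an "ERCC" name prefix):

    # How many rows are still flagged as spike-ins, and under what name?
    spikeNames(filtered.sce)
    sum(isSpike(filtered.sce))

    # The usual way of flagging them in the first place:
    isSpike(filtered.sce, "ERCC") <- grepl("^ERCC", rownames(filtered.sce))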

Aaron Lun

Had a quick question - does "remove cells with zero spike-in size factors" mean "remove cells whose read count for every spike-in is zero"?

Yes.

Perhaps there is a way I need to denote spike-ins that I have not done properly?

You probably filtered them out in your calcAverage filtering step. I would suggest not filtering explicitly, but rather using subset.row to filter within each function as needed. See comments here.
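
A sketch of that approach (assuming ave.raw.counts is computed on the unfiltered object; subset.row restricts the calculation without removing any rows, so the spike-ins survive):

    # Keep every row in the object, but base the size factor estimates
    # only on the abundant endogenous genes.
    ave.raw.counts <- calcAverage(filtered.sce, use_size_factors=FALSE)
    filtered.sce <- computeSumFactors(filtered.sce,
        subset.row=ave.raw.counts >= 1 & !isSpike(filtered.sce))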


Hi Aaron,

I am having a similar issue with CITE-seq data. I have one control ("Ig") and I am trying to perform control-based normalization as suggested in the OSCA book. When I do the following:

    controls <- grep("Ig", rownames(altExp(sce)))
    sf.control <- librarySizeFactors(altExp(sce), subset_row=controls)
    sce <- logNormCounts(sce, use.altexps=TRUE)

I get the error, since ~2000 cells have zero counts for the control antibody:

    summary(sf.control)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #  0.0000  0.4953  0.9906  1.0000  1.4859  6.4389

I saw in the OSCA book that the control size factors are also zero for some cells. My question is: how should I use the control-based normalization then? I calculated the median size factors and made a scatter plot against the control factors, and they don't correlate at all.
