Question

batchelor batch correction runs into errors

firestar ▴ 20


I am trying to apply batch correction with the batchelor package to two bulk RNA-Seq datasets, but I run into errors.

> batchelor::fastMNN(as.matrix(dc),as.matrix(de))
more singular values/vectors requested than available
'k' capped at the number of observations

> batchelor::mnnCorrect(as.matrix(dc),as.matrix(de))
'k' capped at the number of observations

> batchelor::rescaleBatches(as.matrix(dc),as.matrix(de))
Error in (function (..., log.base = 2, pseudo.count = 1, subset.row = NULL, : matrix should be double

Is it ok to use these correction methods for bulk RNA-seq data?
Is there some requirement for min number of samples?

Here are the dimensions of the datasets.

> dim(dc)
[1] 140213     28
> dim(de)
[1] 140213      4

And here is head.

> head(dc)
                        T1 T2 T3 T4 A1_L A2_L A3_L A4_L A1_S A2_S
chr2L_2_170              0  0  0  0    0    0    0    0    0    0
chr2L_1368_1544          0  0  0  0    0    0    0    0    0    0
chr2L_172691_172724      1  0  0  0    0    0    0    0    0    0
chr2L_1573892_1573953    0  0  0  0    0    1    0    0    0    0
chr2L_14712715_14712750  0  0  0  0    0    0    0    0    0    0
chr2L_14713015_14713036  0  0  0  0    0    0    0    0    0    0
                        A3_S A4_S B1_L B2_L B3_L B4_L B1_S B2_S B3_S
chr2L_2_170                0    0    0    0    0    0    0    0    0
chr2L_1368_1544            0    0    0    0    0    0    0    0    0
chr2L_172691_172724        0    0    0    0    0    0    0    0    0
chr2L_1573892_1573953      0    0    0    0    0    0    0    0    0
chr2L_14712715_14712750    0    0    0    0    0    0    0    0    0
chr2L_14713015_14713036    0    0    0    0    0    0    0    0    0
                        B4_S C1_L C2_L C3_L C4_L C1_S C2_S C3_S C4_S
chr2L_2_170                0    0    0    0    0    0    0    0    0
chr2L_1368_1544            0    0    0    0    0    0    0    0    0
chr2L_172691_172724        0    0    0    0    0    0    0    0    0
chr2L_1573892_1573953      0    0    0    0    2    0    0    0    1
chr2L_14712715_14712750    0    0    0    0    0    0    0    0    0
chr2L_14713015_14713036    0    0    0    0    0    0    0    0    0
> head(de)
                        GFP_T1 GFP_T2 RRP6_T1 RRP6_T2
chr2L_2_170                  4      3       2       6
chr2L_1368_1544              5      0       4       5
chr2L_172691_172724          1      0       0       0
chr2L_1573892_1573953        0      0       1       1
chr2L_14712715_14712750      0      0       2       0
chr2L_14713015_14713036      1      0       1       3

These two datasets run fine with limma::removeBatchEffect(). Are these errors fixable or is my data not good enough?

batchelor batch-correction mnncorrect fastmnn rna-seq

updated 5.0 years ago by Aaron Lun ★ 28k • written 5.7 years ago by firestar ▴ 20

score 1 · Answer 1 · 2020-04-15

Don't know how I missed this question, but better late than never.

> batchelor::fastMNN(as.matrix(dc),as.matrix(de))
more singular values/vectors requested than available
'k' capped at the number of observations

These are warnings, not errors. By default, fastMNN tries to compute 50 PCs (d=50), but you have fewer samples than that, so it does its best and emits a warning. The same goes for the number of nearest neighbors: k defaults to 20, and you have fewer samples than that in one of your batches.
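As a rough sketch of how both warnings might be avoided, d and k can be set explicitly to values that fit the data. The matrices below are simulated stand-ins (not the poster's data), assumed to hold log-expression values, and the particular choices d = 3 and k = 3 are illustrative assumptions rather than recommendations:

```r
library(batchelor)

set.seed(100)
# Simulated log-expression values standing in for the two batches
# (200 features; 28 and 4 samples, mirroring the question's dimensions).
batch1 <- matrix(rnorm(200 * 28, mean = 5), nrow = 200)
batch2 <- matrix(rnorm(200 * 4, mean = 6), nrow = 200)

# d: number of PCs, kept below the total number of samples (32).
# k: number of nearest neighbours, kept at or below the smaller batch size (4).
out <- fastMNN(batch1, batch2, d = 3, k = 3)

# Corrected low-dimensional coordinates, one row per sample.
dim(reducedDim(out, "corrected"))
```

Whether such small values give a sensible correction depends on what the samples are; this only shows where the two defaults are overridden.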

> batchelor::mnnCorrect(as.matrix(dc),as.matrix(de))
'k' capped at the number of observations

Another warning, raised for the same reason. It can be ignored if you're willingly (ab)using the MNN methods to work on bulk data.

> batchelor::rescaleBatches(as.matrix(dc),as.matrix(de))
Error in (function (..., log.base = 2, pseudo.count = 1, subset.row = NULL, : matrix should be double

Now, this is an actual error message. But I am not convinced that rescaleBatches is doing anything wrong here. You should check that as.matrix(dc) and as.matrix(de) are, in fact, numeric matrices. To demonstrate the correctness of rescaleBatches, this works fine for me:

A <- matrix(rpois(100, 10), 10, 10)
B <- matrix(rpois(100, 10), 10, 10)
storage.mode(A) <- storage.mode(B) <- "integer" # or double, doesn't matter.

library(batchelor)
rescaleBatches(A, B) # no problems
## class: SingleCellExperiment 
## dim: 10 20 
## metadata(0):
## assays(1): corrected
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(1): batch
## reducedDimNames(0):
## altExpNames(0):
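The check on the input matrices suggested above could look something like the following. The data frame here is a made-up example (its column names are hypothetical), chosen to show one common way that as.matrix() ends up returning a non-numeric matrix; it may or may not be what happened with dc and de:

```r
# Hypothetical data frame with a stray identifier column next to the counts.
df <- data.frame(
    id = c("g1", "g2", "g3"),
    s1 = c(0L, 1L, 4L),
    s2 = c(2L, 0L, 3L)
)

# A single non-numeric column turns the whole matrix into character.
m_bad <- as.matrix(df)
storage.mode(m_bad)
is.numeric(m_bad)

# Keeping only the count columns gives a numeric (here integer) matrix.
m_ok <- as.matrix(df[, c("s1", "s2")])
storage.mode(m_ok)

# Coerce explicitly to double if a function insists on that storage mode.
storage.mode(m_ok) <- "double"
is.numeric(m_ok)
```

Running storage.mode() and is.numeric() on as.matrix(dc) and as.matrix(de) in the same way should show quickly whether the inputs are the problem.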

Is it ok to use these correction methods for bulk RNA-seq data?

In principle, yes, the assumptions that they make are not exclusive to single-cell technologies. Do take the time to understand what these assumptions are, though; see the book for a primer.

Is there some requirement for min number of samples?

With the default settings, yes. For example, k=20 doesn't make sense with so few samples, because then everyone is in an MNN pair with everyone else! That defeats the purpose of using MNNs to identify matching cell types/populations/samples across different batches. Given you only have 4 samples in one batch, and you don't mention anything about what the samples actually are, I would set k=1 and hope for the best. The other methods don't have a minimum sample requirement but they have other limitations; read the book.
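To make the point about k concrete, here is a small sketch that counts MNN pairs between a 28-sample and a 4-sample batch using batchelor's findMutualNN(). The data are random numbers, assumed only for illustration, and note that this function expects samples in rows rather than columns:

```r
library(batchelor)

set.seed(42)
# Simulated coordinates: 28 and 4 samples (rows) in 5 dimensions (columns).
s1 <- matrix(rnorm(28 * 5), nrow = 28)
s2 <- matrix(rnorm(4 * 5), nrow = 4)

# k = 20 exceeds the size of the second batch, so it gets capped
# (the warning about this is suppressed here on purpose).
pairs_k20 <- suppressWarnings(findMutualNN(s1, s2, k1 = 20))

# k = 1 only pairs samples that are each other's single nearest neighbour.
pairs_k1 <- findMutualNN(s1, s2, k1 = 1)

n_k20 <- length(pairs_k20$first)
n_k1 <- length(pairs_k1$first)
c(k20 = n_k20, k1 = n_k1)
```

With the large k, every sample in the small batch is paired with most of the large batch, so the pairs carry little information about which samples actually match; with k = 1 only a handful of pairs remain.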