Does fastMNN-based correction for a large multi-run experiment require a similar population set in all the individual datasets?
p.joshi ▴ 40
@pjoshi-22718
Last seen 2.5 years ago
Germany

Hi,

I am trying to use fastMNN to integrate snRNA-seq data from different samples of a tissue at progressive stages of development. Some of the samples are biological replicates. Two samples are run together in one sequencing run, but they are not necessarily from the same time point or the same individual. While reading the mnnCorrect paper, I noticed that the authors state that each dataset to be integrated must contain a shared population. Does this mean that some population of cell types should be present in all the datasets? For example, cell population X should be present in sample A (t = day 1), sample B (t = day 2), sample C (t = day 3) and sample D (t = day 4).

Or is it enough that pairs of datasets share the same cell types? For example, samples A and B share cell type X, samples B and C share cell type Y, and samples C and D share cell type Z.

If scenario 1 is required, then we can't use fastMNN to integrate these tissues unless we spike in a well-defined cell population at the library prep step. And since that cell population is artificial, we would have to somehow discard it after integrating the datasets, before performing DEG analysis.

I hope the author of the method could respond to this.

Thanks!

scrna-seq integration batch correction
Aaron Lun ★ 28k
@alun
Last seen 5 hours ago
The city by the bay

Some understanding of how fastMNN works would be valuable here. Say we have four batches, 1-4. If we merge them with default arguments, 1 is first merged with 2 to create a 1+2 combination; this 1+2 combination is then merged with 3, and then the 1+2+3 combination is merged with 4. (You can also do hierarchical merges; see the documentation in ?fastMNN for more details.)
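As a rough sketch of that default behaviour (the object names sce1 to sce4 are hypothetical, and I'm assuming they have already been normalized together, e.g. with multiBatchNorm()):

library(batchelor)

# With default arguments, batches are merged progressively from left to right:
# sce1 + sce2, then (1+2) + sce3, then (1+2+3) + sce4.
out <- fastMNN(sce1, sce2, sce3, sce4)

# Corrected low-dimensional coordinates for downstream clustering/visualization:
corrected <- reducedDim(out, "corrected")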

Now, the important part is that the shared population assumption only needs to hold at each merge step. For example, we assume that the 1+2 combination shares at least one subpopulation with batch 3. Provided you arrange your merge order appropriately, there is no requirement for all batches to share the same subpopulation. In your case, the merge order naturally follows your time course; in fact, I believe this is the exact approach used in last year's mouse gastrulation paper.
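For instance (the batch labels here are made up), the merge order can be set explicitly with the merge.order argument so that it follows the time course:

# 'sce' is assumed to be a single SingleCellExperiment combining all samples,
# with sce$batch giving the sample of origin (day1 ... day4 are hypothetical labels).
out <- fastMNN(sce, batch = sce$batch,
    merge.order = c("day1", "day2", "day3", "day4"))

# Each merge step then only needs to share populations with the batches
# already merged, not with every batch in the experiment.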

Conversely, it would probably be unwise to merge, say, the first timepoint directly with the last timepoint. This is likely to unnecessarily discard some biology; see the comments here.

At some point, I had also entertained the use of "spike-in cell controls" for batch correction. IIRC, we tried throwing HeLa cells into all of our batches to provide a reference to use for correction. I must admit that this was pretty disappointing; you may be less than surprised to hear that the batch effect for HeLa cells is not representative of that for your cells of interest. Technically speaking, we could solve this by using a reference assembled from cells more similar to our test sample, as has been suggested for mass cytometry studies. However, this only addresses the technical effects of sequencing and not the biological batch effects (e.g., donor variation, fluctuations in experimental conditions), so it didn't seem like it was worth the expense.

Thanks Aaron! This answers my concern. I am using the Seurat wrapper for fastMNN. Am I right to assume that the order in which I create the merged Seurat object determines the default merge order of the datasets?

I wouldn't know. That's a question for the maintainer of the wrapper function.
