Question

Using MAST with DropSeq data

1

jeremycfd ▴ 10

@jeremycfd-14955

Last seen 7.2 years ago

Hello,

I've been using MAST for analysis of single-cell qPCR data, and I'm familiar with its use for "traditional" single-cell RNAseq data, where reads from the full length of transcripts are converted to digital gene expression (via counts). I was wondering whether anyone had considered potential issues with using MAST on single-cell data from platforms like 10X or DropSeq, where counts are estimated using UMIs but only from either the 3' or 5' end of transcripts (never from elsewhere in a transcript). With DropSeq approaches you get a raw UMI count, and the recommendation is to first filter unexpressed genes, then normalize the gene-specific UMI counts by the median number of UMIs obtained from each cell, and finally log-transform the gene/cell matrix (this all seems very similar to what we would do with RSEM or edgeR).

From my perspective I can't see any obvious issue here, but I wanted to know if anyone else had any thoughts on whether this sort of data might for some reason (perhaps related to the UMI approach, the 5'/3' specific sequencing, or this particular normalization approach) violate assumptions underlying the MAST framework.

Thanks for reading!

MAST mast dropseq • 2.6k views

ADD COMMENT • link updated 7.2 years ago by Andrew_McDavid ▴ 280 • written 7.2 years ago by jeremycfd ▴ 10

score 1 · Answer 1 · 2018-02-06

1

Andrew_McDavid ▴ 280

@andrew_mcdavid-11488

Last seen 6 months ago

United States

You are right that the native distribution of the UMIs (counts) before any normalization is rather distinct from that of qPCR. After some types of normalization, it's not so different. We've had good luck calculating counts per million (or per ten thousand, as seems to be popular with 10X data) and then applying a log2(CPM + 1) transformation. The normalization question (in my mind) remains somewhat unresolved, but it increasingly seems that something a bit more sophisticated than global scaling may be warranted. Vallejos (2017) and Bacher (2017) help shed some light.
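As a rough illustration of the scaling-plus-log step described above, here is a minimal sketch in plain Python. The function name `log2_scaled`, the `scale` parameter, and the toy matrix are assumptions made for illustration; they are not part of MAST or of any 10X tooling, and the transformed matrix would still need to be loaded into MAST separately.

```python
import math

def log2_scaled(counts, scale=1e4):
    """Scale each cell's UMI counts to `scale` total, then take log2(x + 1).

    counts: list of per-cell lists, counts[cell][gene] = raw UMI count.
    scale:  1e6 gives counts per million; 1e4 gives counts per ten thousand.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no UMIs cannot be scaled; keep it as all zeros.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log2(c / total * scale + 1) for c in cell])
    return normalized

# Toy matrix: 3 cells x 4 genes (made-up numbers, for illustration only).
umi = [
    [0, 3, 1, 6],
    [2, 0, 0, 8],
    [0, 0, 5, 5],
]
expr = log2_scaled(umi, scale=1e4)
```

Zero counts stay exactly zero after the transform, so the zero versus non-zero split of the data is preserved.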

ADD COMMENT • link 7.2 years ago Andrew_McDavid ▴ 280

0

Hello Andrew,

As a follow-up question: is it technically okay to apply MAST to log2(CPM+1) data? And how do I determine which normalization method to use in general?

Thanks!

ADD REPLY • link 7.0 years ago liw • 0

0

Technically, the issue is how well the normality assumption holds in the continuous portion of the model. In my experience the non-zero component of log2(1+CPM) has appeared pretty symmetric for droplet technologies, but you could evaluate this yourself informally with plots or formally with tests for symmetry. As the number of cells increases (typical with droplet technologies), the normality assumption matters less because of the central limit theorem. In independent evaluations, MAST has been shown to maintain its advertised level in a range of scenarios, for instance Soneson and Robinson 2018 (https://www.nature.com/articles/nmeth.4612/).
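One informal way to check the symmetry of the non-zero component is to compute its sample skewness per gene. The sketch below is plain Python; the helper name `nonzero_skewness` and the example values are hypothetical, and this is a quick diagnostic rather than a formal test of symmetry.

```python
import math
import statistics

def nonzero_skewness(values):
    """Sample skewness (third standardized moment) of the non-zero entries.

    Values near 0 suggest a roughly symmetric distribution; clearly positive
    or negative values suggest a right or left tail, respectively.
    """
    nz = [v for v in values if v > 0]
    if len(nz) < 3:
        # Too few non-zero cells to say anything useful.
        return float("nan")
    mean = statistics.mean(nz)
    sd = statistics.pstdev(nz)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in nz) / len(nz)

# Made-up log2(1 + CPM) values for one gene across seven cells.
gene = [0.0, 0.0, 4.1, 5.0, 5.9, 0.0, 5.0]
skew = nonzero_skewness(gene)
```

Genes whose skewness is far from zero could then be inspected with a histogram before deciding whether the normality assumption is a concern.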

ADD REPLY • link 7.0 years ago Andrew_McDavid ▴ 280