Hello,
I would like to continue a topic that was first started on this Biostars post. In trying to help the OP of that topic, I raised the question of downstream data analysis for full-length RNA-seq protocols such as Smart-Seq2 when one uses Salmon (in quasi-mapping mode) and tximport.
The main point that some of us were discussing is: what is the best protocol for using tximport with tools such as Seurat under these analytical conditions? (Perhaps the tximport vignette could benefit from a small section on this, as it has for DESeq2, edgeR, etc.)
A few points I brought up were:
- When I searched for tximport and single-cell/Seurat etc., alevin usually comes up. However, it's important to realize that much of the 10x Genomics tech is 3'-tagged RNA-seq and thus does not have the length biases that would be present in a Smart-Seq protocol (so passing the raw txi$counts as raw counts in the data import for Seurat makes perfect sense). Of course, with Smart-Seq data we wouldn't even use alevin, but instead just salmon, in the same way as it is done with bulk RNA-seq.
- Thus, my understanding is that the correct steps for a Smart-Seq/full-length protocol would be:
  1) Import the data with the tximport setting countsFromAbundance="lengthScaledTPM", which would then result in counts which were normalized for sequencing depth and length, stored in txi$counts.
  2) Pass txi$counts on to Seurat's CreateSeuratObject as counts. NOTE: I originally had in mind that one would likely want to do this with txOut=FALSE to have gene-level data, as I am not sure single-cell algorithms are sensitive enough for transcript-level analysis/DE etc. But perhaps this would be a good place to get this confirmed.
  3) In Seurat, if one imports txi$counts generated with countsFromAbundance="lengthScaledTPM", then one should likely follow the advice given by the Seurat team for starting with TPMs (this info is from their GitHub issue #668; I don't think the last answer is from a Seurat team member, but satijalab approved it with a reaction): skip the Seurat::NormalizeData() step, but transform the data to log scale (which is stored in object@metadata) prior to ScaleData. Also note that log scale in Seurat is natural log.
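To make step 1 concrete, here is a minimal sketch of the arithmetic that, as far as I understand it, countsFromAbundance="lengthScaledTPM" performs: abundance is multiplied by each gene's average length across cells, and each column is then rescaled to the cell's original total of estimated counts. It is written in plain Python rather than R so that it runs with no packages. The function names (tpm_from_counts, length_scaled_tpm, col_sums) and the toy numbers are made up for illustration and are not tximport code.

```python
def col_sums(mat):
    """Column sums of a genes-by-cells matrix stored as a list of rows."""
    return [sum(row[c] for row in mat) for c in range(len(mat[0]))]


def tpm_from_counts(counts, lengths):
    """Toy TPM: per-cell read rate (count / length), scaled to sum to 1e6."""
    rates = [[cnt / ln for cnt, ln in zip(c_row, l_row)]
             for c_row, l_row in zip(counts, lengths)]
    totals = col_sums(rates)
    return [[r / totals[c] * 1e6 for c, r in enumerate(row)] for row in rates]


def length_scaled_tpm(tpm, lengths, counts):
    """Sketch of lengthScaledTPM-style counts (an assumption, see text)."""
    n_cells = len(tpm[0])
    # Average length of each gene across cells.
    mean_len = [sum(row) / n_cells for row in lengths]
    # Scale abundance by the gene's average length.
    scaled = [[t * mean_len[g] for t in row] for g, row in enumerate(tpm)]
    # Rescale every column to that cell's original total of estimated counts.
    new_sums = col_sums(scaled)
    count_sums = col_sums(counts)
    return [[v * count_sums[c] / new_sums[c] for c, v in enumerate(row)]
            for row in scaled]


# Toy data: 3 genes (rows) x 2 cells (columns); all values are invented.
counts = [[100.0, 300.0], [400.0, 200.0], [50.0, 500.0]]
lengths = [[1000.0, 1200.0], [2000.0, 1800.0], [500.0, 500.0]]

tpm = tpm_from_counts(counts, lengths)
ls_counts = length_scaled_tpm(tpm, lengths, counts)

print(col_sums(counts))     # per-cell totals of the original estimated counts
print(col_sums(ls_counts))  # per-cell totals after length scaling
```

Comparing the two printed lines shows whether the per-cell totals change under this scaling, which bears on whether a depth-normalization step is still needed afterwards.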
I believe that captures the main point from the follow-up discussion. Thanks for any advice (special thanks to Michael Love who suggested we post a question here).
Hi Michael, makes sense, thanks! On the last point in particular, I appreciate the clarification (I got confused with the TPM input), but it makes sense when I think about how we use it in bulk RNA-seq with DESeq2 (e.g., even though we import with DESeqDataSetFromTximport, DESeq2 still performs library sequencing-depth normalization). Thanks again!
Sure!
You’ll notice the column sum of the counts matrices is always the same: the number of mapped reads.
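Because each column still sums to that cell's number of mapped reads, cells sequenced to different depths remain on different scales. A minimal sketch of the depth normalization that would then still apply is below. It follows my understanding of Seurat's default "LogNormalize" arithmetic (divide by the cell total, multiply by a scale factor of 10,000, take natural log1p), written in plain Python with a hypothetical log_normalize helper and invented toy numbers; it is not Seurat code.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Per-cell depth normalization followed by natural log1p.

    counts is a genes-by-cells matrix stored as a list of rows.
    """
    n_cells = len(counts[0])
    # Total counts per cell (column sums), i.e. the sequencing depth.
    totals = [sum(row[c] for row in counts) for c in range(n_cells)]
    return [[math.log1p(v / totals[c] * scale_factor)
             for c, v in enumerate(row)]
            for row in counts]


# Toy counts: 3 genes (rows) x 2 cells (columns); all values are invented.
counts = [[100.0, 300.0], [400.0, 200.0], [50.0, 500.0]]

depths = [sum(row[c] for row in counts) for c in range(2)]
normalized = log_normalize(counts)

print(depths)      # the two cells have different totals before normalization
print(normalized)  # natural-log values on a common per-cell scale
```

Undoing the log with math.expm1 and summing each column gives the scale factor for every cell, which is a quick way to check that depth differences have been removed.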