Question

Import Salmon data with tximport and pool technical replicates

Vitor • 0

@851bd8c4

United States

Hello,

I'm following the vignette on how to import expression estimates from Salmon with tximport and create an offset matrix. I also want to pool technical replicates with sumTechReps(). The technical replicates were analyzed separately with Salmon and will therefore have different offsets.

I believe edgeR doesn't modify the raw counts and instead uses the offsets in the GLM. Inspecting the sumTechReps() code, it seems that the function sums the counts and averages the normalization factors if you pass it a DGEList object. Is that the correct way of doing this?

EDIT: When using sumTechReps(ID = sample_ids), the column names of y$counts and the row names of y$samples are converted to the corresponding sample_ids, but the column names of y$offset still refer to the original replicate IDs (although the matrix is reduced to the same dimensions as the pooled count matrix). Can that be a problem?

Thank you!

tximport edgeR • 1.4k views

score 1 · Accepted Answer · 2023-08-24

Michael Love 43k

@mikelove

United States

Sorry I missed this because it wasn't tagged with tximport.

I think summing the raw counts and averaging the normalization factors is appropriate. But I don't know about the mechanics of sumTechReps.

score 1 · Accepted Answer · 2023-08-25

I would prefer that you combine the technical replicates before running Salmon, i.e., by merging the FASTQ files. The EM algorithm by which Salmon assigns reads probabilistically to transcripts will work better if it has all the reads for each sample at once. That would help to resolve read assignment ambiguities, reduce the variability of the TPM estimates that tximport takes from Salmon, and hence improve the reliability of the edgeR offset matrix.

Salmon's probabilistic algorithm means that merging FASTQ files at the beginning is better than (not the same as) summing the counts from technical replicates downstream.
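As a rough illustration of merging before quantification, here is a minimal sketch that assumes the replicates are gzipped FASTQ files. The helper name merge_fastq and all file names are hypothetical placeholders, not anything from the original post. It relies on the fact that concatenated gzip streams form a valid gzip file, so the files can be joined without decompressing them.

```shell
# Hypothetical helper: concatenate the FASTQ files of all technical
# replicates of one sample into a single file.
# Usage: merge_fastq OUTPUT INPUT1 [INPUT2 ...]
merge_fastq() {
  out="$1"
  shift
  # Gzip streams can be concatenated directly; the result decompresses
  # to the reads of INPUT1 followed by the reads of INPUT2, and so on.
  cat "$@" > "$out"
}

# Example calls (placeholder file names):
#   merge_fastq sampleA_R1.fastq.gz sampleA_run1_R1.fastq.gz sampleA_run2_R1.fastq.gz
#   merge_fastq sampleA_R2.fastq.gz sampleA_run1_R2.fastq.gz sampleA_run2_R2.fastq.gz
```

For paired-end data, the replicates should be listed in the same order for the R1 and R2 files so that mates stay in sync. Salmon would then be run once per sample on the merged files, and tximport would see one set of estimates per sample.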

edgeR's sumTechReps() function is not currently designed to work optimally on a DGEList object that has an offset matrix set. It will just use the offsets of the first technical replicate for each sample instead of combining them. I will look into it and improve the function for this use case. However, if you have true technical replicates (repeat sequencing runs of the same RNA sample), it will always be better to combine them before running Salmon.