Hi,
I ran the nf-core/rnaseq pipeline with the default star_salmon aligner on a strand-specific dataset. I have a question about the gene counts obtained from the Salmon quantification. I am only interested in gene-level counts for downstream analysis, not isoforms. It seems that the nf-core/rnaseq pipeline is designed to import "counts_gene_length_scaled" (reference 1, reference 2) via tximport > DESeq2 > size_factors > vst. The pipeline generates a number of files, and I would like to know which of the files listed below is best to use in an edgeR DGEList. Probably "salmon.merged.gene_counts.rds"?
Before using this pipeline, I used to start from the raw gene counts from featureCounts and then use them in edgeR.
salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts.rds
salmon.merged.gene_counts_scaled.rds
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv
salmon_tx2gene.tsv
Thank you,
Toufiq
Thank you Gordon Smyth
My collaborator asked me to test this pipeline with the edgeR package. We are interested in gene-level analysis only. It seems like salmon.merged.gene_counts.tsv could be a starting point in edgeR.
Does it output Salmon files (directories with quant.sf in them)? That would be the easiest. The files you have above are processed and not ideal. The whole point of tximport is to take Salmon output files and prepare count matrices with effective gene length offsets. The gene length offsets account for changes in transcript length as well as biases such as sample-specific variation based on amplification or fragmentation.
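To make the idea of a sample-specific effective gene length concrete, here is a small sketch of the arithmetic. It is plain Python, not the tximport API (in practice tximport does this in R), and it assumes that the gene-level length in each sample is the abundance-weighted average of the effective lengths of the gene's transcripts. The function name, transcript lengths, and TPM values are invented for illustration.

```python
# Toy illustration: why a gene's effective length can differ between samples.
# All numbers are invented and the function name is hypothetical.

def gene_effective_length(tx_lengths, tx_tpm):
    """Abundance-weighted average of transcript effective lengths."""
    total = sum(tx_tpm)
    if total == 0:
        # A choice made for this sketch only; real tools handle this case differently.
        raise ValueError("gene has zero abundance in this sample")
    return sum(length * tpm for length, tpm in zip(tx_lengths, tx_tpm)) / total

# One gene with a long and a short isoform (effective lengths in bases).
tx_lengths = [3000.0, 1000.0]

# Sample A expresses mostly the long isoform, sample B mostly the short one.
tpm_a = [90.0, 10.0]
tpm_b = [10.0, 90.0]

len_a = gene_effective_length(tx_lengths, tpm_a)
len_b = gene_effective_length(tx_lengths, tpm_b)
print(len_a, len_b)
```

With these made-up values the same gene has an effective length of 2800 in sample A and 1200 in sample B, so an identical read count would mean different expression levels in the two samples. That per-sample difference is what the length information from the quant.sf files lets the model account for.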
It does output the Salmon files, and it is documented here:
https://nf-co.re/rnaseq/output#pseudo-alignment-and-quantification
The first bullet point is the easiest, and is a commonly used route for getting Salmon quantification into R/Bioconductor for use with downstream count-based tools.
Alternatively, if you don't have access to the quant.sf files, you would load salmon.merged.gene_counts_length_scaled.tsv and use that as the count matrix input to edgeR.
Michael Love thank you. Yes, the pipeline generates quant.sf files too; however, those were deleted and only the files listed above were provided. As a workaround, I will use the salmon.merged.gene_counts_length_scaled.tsv file as the input count matrix in R.
Hi,
Would it be OK to use the same file, "salmon.merged.gene_counts_length_scaled.tsv", as input for DESeq2 and treat it as if it were normal featureCounts-like output, by rounding the length-scaled counts to integers?
I understand it is advisable to use tximport on the quant.sf files, but wouldn't that arrive at exactly the same type of input as "salmon.merged.gene_counts_length_scaled.tsv"?
Thank you very much in advance for the answer.
Yes you can do that.
The default in tximport is to use the original counts plus an offset, but the scaled counts are an alternative solution.
Thank you very much for your quick reply.
From here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Salmon
Also from here: https://github.com/COMBINE-lab/salmon/issues/581
And here: https://nf-co.re/rnaseq/3.14.0/docs/output#alignment-and-quantification
I understood that the offset is used to correct for length bias. Wouldn't the gene_counts_length_scaled.tsv file generated by the nf-core be exactly that? And result in identical inputs for DESeq()?
If not, what is the difference?
"gene_counts_length_scaled.tsv file generated by the nf-core be exactly that?"
No, that file contains the counts scaled by the length. Counts + offset is an alternative approach to scaling the counts _before_ supplying them to the model.
From the 2015 tximport paper:
We propose two alternative ways of integrating transcript abundance estimates into the DGE pipeline: to define an artificial count matrix, or to calculate offsets that can be used in the statistical modeling of the observed gene counts...
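The difference between the two approaches can be sketched with toy numbers. This is plain Python rather than R, and it is not the tximport code itself. It assumes the "lengthScaledTPM" style of scaling: TPM multiplied by the gene's average length across samples, then rescaled so that each sample's column total matches its original count total. The offset shown is simplified to the log of the effective length; a real workflow would also fold library size into it. All values are invented.

```python
import math

# Toy data: rows are genes, columns are samples (all values invented).
counts = [[100.0, 100.0], [200.0, 200.0]]        # estimated counts
lengths = [[2000.0, 1000.0], [1500.0, 1500.0]]   # effective gene lengths
n_genes, n_samples = len(counts), len(counts[0])


def column(matrix, j):
    """Return column j of a list-of-rows matrix."""
    return [row[j] for row in matrix]


# TPM per sample: counts / length, normalised to sum to one million.
tpm = [[0.0] * n_samples for _ in range(n_genes)]
for j in range(n_samples):
    rates = [counts[i][j] / lengths[i][j] for i in range(n_genes)]
    total = sum(rates)
    for i in range(n_genes):
        tpm[i][j] = rates[i] / total * 1e6

# Approach 1: an artificial count matrix (length-scaled counts).
# Multiply TPM by the gene's mean length over samples, then rescale each
# column so it sums to the original count total for that sample.
mean_len = [sum(row) / n_samples for row in lengths]
scaled = [[tpm[i][j] * mean_len[i] for j in range(n_samples)]
          for i in range(n_genes)]
for j in range(n_samples):
    factor = sum(column(counts, j)) / sum(column(scaled, j))
    for i in range(n_genes):
        scaled[i][j] *= factor

# Approach 2: leave the counts untouched and hand the model a per-gene,
# per-sample offset instead (simplified here to log effective length).
offsets = [[math.log(length) for length in row] for row in lengths]

print([[round(x, 1) for x in row] for row in scaled])
print(counts)
```

With these numbers, gene 1 has the same estimated count (100) in both samples, but its scaled count is lower in sample 1 (about 82) than in sample 2 (about 129) because its effective length is longer there. In the offset approach the counts stay at 100 and the length difference is carried by the offset, so the two approaches address the same bias but do not hand identical inputs to the model.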
Understood, thank you very much for the reply, very kind.