Which input file is used for DGEList in edgeR?
@mohammedtoufiq91-17679
United States

Hi,

I ran the nf-core/rnaseq pipeline with the default star_salmon aligner on a strand-specific dataset. I have a question about the gene counts obtained from the Salmon quantification. I am interested in gene-level counts for downstream analysis only, rather than isoforms. It seems the nf-core rnaseq pipeline is designed to import "counts_gene_length_scaled" (reference 1, reference 2) via tximport > DESeq2 > size factors > vst. The pipeline generates a number of files, and I would like to know which of the files shown below is best to use in an edgeR DGEList. Probably this file, "salmon.merged.gene_counts.rds"?

Before using this pipeline, I used to start from the raw gene counts produced by featureCounts and then use them in edgeR.

salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts.rds
salmon.merged.gene_counts_scaled.rds
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv
salmon_tx2gene.tsv

Thank you,

Toufiq

salmon edgeR tximport nf-core gene_counts
@gordon-smyth
WEHI, Melbourne, Australia

I use and recommend featureCounts. Despite all that has been written on this topic, I still think that direct gene counting is faster and more accurate than getting gene counts from transcript-level estimates. If you want to use the above pipeline though, you could follow the advice from Mike Love given in your Reference 1 link.
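If it helps, a minimal sketch of that route (the BAM file names, annotation file and strandedness setting below are placeholders, not taken from your data):

    library(Rsubread)
    library(edgeR)
    # Placeholder inputs: aligned BAM files and a GTF annotation for your genome.
    bams <- c("sample1.bam", "sample2.bam")
    fc <- featureCounts(files = bams,
                        annot.ext = "genes.gtf",
                        isGTFAnnotationFile = TRUE,
                        isPairedEnd = TRUE,
                        strandSpecific = 2)        # set to match your library protocol
    y <- DGEList(counts = fc$counts)
    y <- calcNormFactors(y)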


Thank you Gordon Smyth

My collaborator asked me to test this pipeline with the edgeR package. We are interested in gene-level analysis only. It seems like salmon.merged.gene_counts.tsv could be a starting point for edgeR.


Does it output Salmon files (directories with quant.sf in them)?

That would be the easiest. These files you have above are processed and not ideal. The whole point of tximport is to take Salmon output files and prepare count matrices with effective gene length offsets. The gene length offsets account for changes in transcript length as well as biases such as sample-specific variation based on amplification or fragmentation.
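For example, a minimal sketch of that route, following the edgeR section of the tximport vignette (the sample names and the star_salmon directory layout are assumptions):

    library(tximport)
    library(edgeR)
    samples <- c("sample1", "sample2")                           # placeholder sample names
    files <- setNames(file.path("star_salmon", samples, "quant.sf"), samples)
    tx2gene <- read.delim("salmon_tx2gene.tsv", header = FALSE)[, 1:2]
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
    # Original counts plus a gene-length offset, as in the tximport vignette:
    cts <- txi$counts
    normMat <- txi$length
    normMat <- normMat / exp(rowMeans(log(normMat)))             # per-observation length factors
    eff.lib <- calcNormFactors(cts / normMat) * colSums(cts / normMat)
    normMat <- log(sweep(normMat, 2, eff.lib, "*"))              # offsets for a log-link GLM
    y <- scaleOffset(DGEList(cts), normMat)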


It does output the Salmon files, and it is documented here:

https://nf-co.re/rnaseq/output#pseudo-alignment-and-quantification

The first bullet point is the easiest, and it is a commonly used pipeline for getting Salmon quantification into R/Bioconductor for use with downstream count-based tools.

Alternatively, if you don't have access to the quant.sf files, you would load salmon.merged.gene_counts_length_scaled.tsv and use that as the count matrix input to edgeR.
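For example, a minimal sketch of that fallback (assuming the file has a gene_name annotation column after the gene IDs and before the sample columns):

    library(edgeR)
    cts <- read.delim("salmon.merged.gene_counts_length_scaled.tsv",
                      row.names = 1, check.names = FALSE)
    cts <- as.matrix(cts[, -1, drop = FALSE])    # drop gene_name (assumed to be column 2)
    y <- DGEList(counts = cts)
    y <- calcNormFactors(y)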


Michael Love, thank you. Yes, the pipeline generates quant.sf files too; however, those were deleted and only the files listed above were provided. As a workaround, I will use the salmon.merged.gene_counts_length_scaled.tsv file as the count matrix input in R.


Hi,

Would it be OK to use the same file, "salmon.merged.gene_counts_length_scaled.tsv", as input for DESeq2 and treat it as if it were a normal featureCounts-like output by rounding the length-scaled counts to integers?

I understood it is advisable to use tximport on the quant.sf files, but wouldn't that arrive at exactly the same type of input in the end as the "salmon.merged.gene_counts_length_scaled.tsv" file?

Thank you very much in advance for the answer


Yes, you can do that.

The default in tximport is to use the original counts plus an offset, but the scaled counts are an alternative solution.
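A minimal sketch of that route, if useful (the colData and design below are placeholders):

    library(DESeq2)
    cts <- read.delim("salmon.merged.gene_counts_length_scaled.tsv",
                      row.names = 1, check.names = FALSE)
    cts <- round(as.matrix(cts[, -1, drop = FALSE]))   # drop gene_name, round to integers
    coldata <- data.frame(condition = factor(c("A", "A", "B", "B")),
                          row.names = colnames(cts))   # placeholder sample metadata
    dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                                  design = ~ condition)
    dds <- DESeq(dds)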


Thank you very much for your quick reply.

From here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Salmon

Also from here: https://github.com/COMBINE-lab/salmon/issues/581

And here: https://nf-co.re/rnaseq/3.14.0/docs/output#alignment-and-quantification

I understood that the offset is used to correct for length bias. Wouldn't the gene_counts_length_scaled.tsv file generated by the nf-core pipeline be exactly that, and result in identical inputs for DESeq()?

If not, what is the difference?


"gene_counts_length_scaled.tsv file generated by the nf-core be exactly that?"

No, that is the counts scaled by length. Counts + offset is an alternative to scaling the counts _before_ supplying them to the model.

From the 2015 tximport paper:

We propose two alternative ways of integrating transcript abundance estimates into the DGE pipeline: to define an artificial count matrix, or to calculate offsets that can be used in the statistical modeling of the observed gene counts...
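To make the two routes concrete, a minimal sketch (files, tx2gene and coldata are placeholders; the default call gives counts plus an offset, while countsFromAbundance = "lengthScaledTPM" gives the kind of pre-scaled matrix found in the *_length_scaled.tsv file):

    library(tximport)
    library(DESeq2)
    samples <- c("sample1", "sample2", "sample3", "sample4")     # placeholder names
    files <- setNames(file.path("star_salmon", samples, "quant.sf"), samples)
    tx2gene <- read.delim("salmon_tx2gene.tsv", header = FALSE)[, 1:2]
    coldata <- data.frame(condition = factor(c("A", "A", "B", "B")), row.names = samples)

    # Route 1 (default): original counts, with average transcript length carried
    # along and used as an offset inside DESeq2.
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
    dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)

    # Route 2: counts pre-scaled by length, used as an ordinary count matrix.
    txi.ls <- tximport(files, type = "salmon", tx2gene = tx2gene,
                       countsFromAbundance = "lengthScaledTPM")
    dds.ls <- DESeqDataSetFromMatrix(countData = round(txi.ls$counts),
                                     colData = coldata, design = ~ condition)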


Understood. Thank you very much for the reply, very kind.
