I have some relatively recent strand specific RNA-seq data which comes as 'second-strand', that means all reads are reversed with respect to the genomic sequence of the transcript and annotation. With featureCounts I would have to use -s 2 option, for example, to get a 70+% rate of assignments to transcripts. But now I am very concerned that the counting went terribly wrong. Does easyRNAseq have an option to set the strand for read assignment, or does it determine those automatically?
I have counted those with the following settings:
bamParam <- BamParam(paired = FALSE, stranded = TRUE, yieldSize = 1000000L) param <- RnaSeqParam(annotParam = annotParam, bamParam = bamParam, countBy = c("transcripts"), precision = "read") ret <- simpleRNASeq(bamFiles = getBamFileList(args), param = param, nnodes = 120, verbose = TRUE, override = TRUE)
Hi Hervé, thank you for your reply, I have now re-counted all our files using featureCounts (stand-alone executable) and a few also using HTseq-count. The result is reproducibly overall: easyRNAseq with stranded=TRUE and featureCount -s1 deliver ~5% reads assigned to gene models, counting reverse strand in HTseq count and featureCounts: 70% - 85% assigned. I have tested about 80 bam files, both paired and single-end. Also, see here: https://www.biostars.org/p/257358
5% might be the typical strand leakage of the protocol. I have also found a few post mentioning low rates of assigned reads.
It seems like all recent (2-3 years ago) strand specific illumina data is reverse stranded. I have counted 22 publications citing easyRNASeq on Pubmed, and it might be that many of those could be miscounted. Therefore I think that this package should contain a clear warning that it is not compatible with recent illumina stranded protocols. Or document how to set the parameters to achieve correct counts.
Dear Michael,
Thank you for reporting on this. On one hand, luckily, my package is not main stream, so among the publications listed in PubMed, only one may be at risk. I will nonetheless contact the authors of all publication asap, as it is actually very hard to figure out the protocol that was used and I want to be very sure. On the other hand, were my package more frequently used, this would have been figured out earlier. Anyway, I will take immediate corrective measures. Many thanks once again for this report.
Best,
Nicolas Delhomme
P.S. there is so much traffic on the Bioconductor website and mailing list that I had overseen your email. Hervé was the one (thanks) that brought it to my attention. So if I may suggest so, it is a good practice for such critical issues to get in direct contact with the package maintainer, which email address you can find in the package description.
Hi Nicolas, Michael,
Thanks Nico for stopping by! Per your request this morning I added the warning about unsupported "reversely stranded" data to easyRNASeq. The 2 updated versions of the package (2.12.1 in release and 2.13.1 in devel) should become available via
biocLite()
in 24 hours or so.Cheers,
H.