Question

Warning: missing reverse strand option in easyRNASeq.

0

Entering edit mode

Michael Dondrup ▴ 10

@michael-dondrup-8328

Last seen 3.7 years ago

Bergen, Norway

I have some relatively recent strand specific RNA-seq data which comes as 'second-strand', that means all reads are reversed with respect to the genomic sequence of the transcript and annotation. With featureCounts I would have to use -s 2 option, for example, to get a 70+% rate of assignments to transcripts. But now I am very concerned that the counting went terribly wrong. Does easyRNAseq have an option to set the strand for read assignment, or does it determine those automatically?

I have counted those with the following settings:

bamParam <- BamParam(paired = FALSE, stranded = TRUE, yieldSize = 1000000L)
param <- RnaSeqParam(annotParam = annotParam, bamParam = bamParam, countBy = c("transcripts"), precision = "read")
ret <- simpleRNASeq(bamFiles = getBamFileList(args), param = param, nnodes = 120, verbose = TRUE, override = TRUE)

easyrnaseq rnaseq strand • 1.7k views

ADD COMMENT • link updated 7.5 years ago by Hervé Pagès 16k • written 7.5 years ago by Michael Dondrup ▴ 10

0

Entering edit mode

Michael Dondrup ▴ 10

@michael-dondrup-8328

Last seen 3.7 years ago

Bergen, Norway

I have done a comparison using the featureCounts program, seemingly the worst-case scenario happened. The number of assigned reads in featureCounts using -s 2 is on average 10x-15x higher per sample. So it seems that all our counts on recent stranded Illumina data could be wrong! This is possibly a big bummer, everyone should check their counts if they were achieved using the following parameters:

recent Illumina strand-specific protocol (reverse sense output)
counted using easyRNAseq with option stranded = TRUE

ADD COMMENT • link 7.5 years ago Michael Dondrup ▴ 10

score 2 · Accepted Answer · 2017-06-15

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 6 days ago

Seattle, WA, United States

Hi Michael,

FWIW, you could double-confirm your results by counting again with summarizeOverlaps() from the GenomicAlignments package. If your data is "reversely stranded", then invert the read strand with invertStrand(reads) before passing them to summarizeOverlaps(). Alternatively, if the reads are paired-end, your can specify strandMode=2 when loading them with readGAlignmentPairs().

It might not be as efficient as featureCounts() so maybe you want to try this on a subset of your data only, just to confirm the "reversely stranded" thing.

Also there is a simple test in ?junctions (in GenomicAlignments) to find out whether RNAseq data is stranded or not. It doesn't use any annotation at all, it only looks at the read junctions.

Cheers,

H.

ADD COMMENT • link 7.5 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé, thank you for your reply, I have now re-counted all our files using featureCounts (stand-alone executable) and a few also using HTseq-count. The result is reproducibly overall: easyRNAseq with stranded=TRUE and featureCount -s1 deliver ~5% reads assigned to gene models, counting reverse strand in HTseq count and featureCounts: 70% - 85% assigned. I have tested about 80 bam files, both paired and single-end. Also, see here: https://www.biostars.org/p/257358

5% might be the typical strand leakage of the protocol. I have also found a few post mentioning low rates of assigned reads.

It seems like all recent (2-3 years ago) strand specific illumina data is reverse stranded. I have counted 22 publications citing easyRNASeq on Pubmed, and it might be that many of those could be miscounted. Therefore I think that this package should contain a clear warning that it is not compatible with recent illumina stranded protocols. Or document how to set the parameters to achieve correct counts.

ADD REPLY • link 7.5 years ago Michael Dondrup ▴ 10

1

Entering edit mode

Dear Michael,

Thank you for reporting on this. On one hand, luckily, my package is not main stream, so among the publications listed in PubMed, only one may be at risk. I will nonetheless contact the authors of all publication asap, as it is actually very hard to figure out the protocol that was used and I want to be very sure. On the other hand, were my package more frequently used, this would have been figured out earlier. Anyway, I will take immediate corrective measures. Many thanks once again for this report.

Best,

Nicolas Delhomme

P.S. there is so much traffic on the Bioconductor website and mailing list that I had overseen your email. Hervé was the one (thanks) that brought it to my attention. So if I may suggest so, it is a good practice for such critical issues to get in direct contact with the package maintainer, which email address you can find in the package description.

ADD REPLY • link 7.5 years ago Nicolas Delhomme ▴ 320

1

Entering edit mode

Hi Nicolas, Michael,

Thanks Nico for stopping by! Per your request this morning I added the warning about unsupported "reversely stranded" data to easyRNASeq. The 2 updated versions of the package (2.12.1 in release and 2.13.1 in devel) should become available via biocLite() in 24 hours or so.

Cheers,

H.

ADD REPLY • link 7.5 years ago Hervé Pagès 16k