Question

Rsubread featureCounts with strand information

0

Entering edit mode

rubi ▴ 110

@rubi-6462

Last seen 6.7 years ago

Hi,

I have RNAseq data with strand information and I'm using Rsubread's featureCounts function to sum reads onto the GTF annotation.

This is the command I'm using:

featureCounts(files=my.bam,annot.ext=my.gtf,isGTFAnnotationFile=T,useMetaFeatures=T,isPairedEnd=T)

my.gtf is gencode.v25.primary_assembly.annotation.gtf BTW.

When I specify strandSpecific=1 I'm getting a weird behavior where, for an example gene, it's reporting many more reads mapping to an antisense gene encoded on the - strand than to an overlapping protein coding gene encoded on the + strand (260 vs. 12 reads). This is neither supported when I count the number of reads per strand for that genomic interval using samtools view nor when I view this region with IGV.

If I don't specify strandSpecific I am getting nearly the reverse outcome: most of the reads are assigned to the protein coding gene (2 vs. 878).

Any idea what is this?

> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

locale: [1] C

attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base

other attached packages: [1] Rsubread_1.24.0 snpEnrichment_1.7.0 doParallel_1.0.10 [4] iterators_1.0.8 foreach_1.4.3

loaded via a namespace (and not attached): [1] Rcpp_0.12.7 lattice_0.20-34 codetools_0.2-15 [4] assertthat_0.1 grid_3.3.1 plyr_1.8.4 [7] gtable_0.2.0 scales_0.4.1 ggplot2_2.2.0 [10] zlibbioc_1.20.0 lazyeval_0.2.0 Matrix_1.2-7.1 [13] snpStats_1.24.0 splines_3.3.1 tools_3.3.1 [16] munsell_0.4.3 survival_2.40-1 BiocGenerics_0.20.0 [19] colorspace_1.2-7 tibble_1.2

Rsubread bam rnaseq • 5.2k views

ADD COMMENT • link updated 8.3 years ago by Aaron Lun ★ 28k • written 8.3 years ago by rubi ▴ 110

score 4 · Accepted Answer · 2016-12-28

I'll stick my neck out and guess that your data is reversely stranded. If you're using Illumina's TruSeq stranded protocol, the first read of each pair should contain the sequence of the template strand, i.e., it is in the opposite direction of the coding sequence. Thus, you should set strandSpecific=2 in featureCounts.

Incidentally, this would explain your observations for the "weird gene". Let's say that the antisense gene is not expressed much, while the protein-coding gene is highly expressed, which explains the results when you don't use strand-specific counting. (This difference is exacerbated by the fact that, even if the antisense gene were expressed, read pairs couldn't get uniquely assigned to it due to the presence of the protein-coding gene.) However, when you set (probably incorrectly) strandSpecific=1, read pairs derived from the protein coding gene get assigned to the antisense gene, and vice versa, resulting in a switch in the count sizes.

When I'm not sure about the strandedness of the data, I examine the BAM files on a genome browser to check whether the first read in each pair in the same or opposite direction as the coding strand for a handful of genes. This will tell you quickly what's going on.