Rsubread featureCounts with strand information
1
0
Entering edit mode
rubi ▴ 110
@rubi-6462
Last seen 6.5 years ago

Hi,

I have RNAseq data with strand information and I'm using Rsubread's featureCounts function to sum reads onto the GTF annotation.

 

This is the command I'm using:

featureCounts(files=my.bam,annot.ext=my.gtf,isGTFAnnotationFile=T,useMetaFeatures=T,isPairedEnd=T)

 

my.gtf is gencode.v25.primary_assembly.annotation.gtf BTW.

 

When I specify strandSpecific=1 I'm getting a weird behavior where, for an example gene, it's reporting many more reads mapping to an antisense gene encoded on the - strand than to an overlapping protein coding gene encoded on the + strand (260 vs. 12 reads). This is neither supported when I count the number of reads per strand for that genomic interval using samtools view nor when I view this region with IGV.

If I don't specify strandSpecific I am getting nearly the reverse outcome: most of the reads are assigned to the protein coding gene (2 vs. 878).

Any idea what is this?

 

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsubread_1.24.0     snpEnrichment_1.7.0 doParallel_1.0.10  
[4] iterators_1.0.8     foreach_1.4.3      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7         lattice_0.20-34     codetools_0.2-15   
 [4] assertthat_0.1      grid_3.3.1          plyr_1.8.4         
 [7] gtable_0.2.0        scales_0.4.1        ggplot2_2.2.0      
[10] zlibbioc_1.20.0     lazyeval_0.2.0      Matrix_1.2-7.1     
[13] snpStats_1.24.0     splines_3.3.1       tools_3.3.1        
[16] munsell_0.4.3       survival_2.40-1     BiocGenerics_0.20.0
[19] colorspace_1.2-7    tibble_1.2         

 

 

Rsubread bam rnaseq • 5.0k views
ADD COMMENT
4
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 4 minutes ago
The city by the bay

I'll stick my neck out and guess that your data is reversely stranded. If you're using Illumina's TruSeq stranded protocol, the first read of each pair should contain the sequence of the template strand, i.e., it is in the opposite direction of the coding sequence. Thus, you should set strandSpecific=2 in featureCounts

Incidentally, this would explain your observations for the "weird gene". Let's say that the antisense gene is not expressed much, while the protein-coding gene is highly expressed, which explains the results when you don't use strand-specific counting. (This difference is exacerbated by the fact that, even if the antisense gene were expressed, read pairs couldn't get uniquely assigned to it due to the presence of the protein-coding gene.) However, when you set (probably incorrectly) strandSpecific=1, read pairs derived from the protein coding gene get assigned to the antisense gene, and vice versa, resulting in a switch in the count sizes.

When I'm not sure about the strandedness of the data, I examine the BAM files on a genome browser to check whether the first read in each pair in the same or opposite direction as the coding strand for a handful of genes. This will tell you quickly what's going on.

ADD COMMENT

Login before adding your answer.

Traffic: 550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6