RSubread output significantly different from bash Subread output
2
0
Entering edit mode
Aimee • 0
@3d966bb9
Last seen 4 weeks ago
United States

Hello, I would like some assistance in figuring out why Rsubread and the bash subread are giving different feature count outputs. I came across this issue because I wrote a wrapper script for a pipeline used by a colleague, and during the validation process we have been trying to resolve why the wrapper feature count table differs from hers. The differences drastically affect the statistical analysis by DeSeq2. Through much testing, I found that our bam file outputs from STAR were exactly the same, and the issue arises when we run the files through featureCounts. The primary distinction is that her method uses Rsubread and the wrapper uses subread in bash. If I run her bam files through bash subread, our feature count tables are now identical, confirming that the divergence is occurring at the featureCounts step.

Neither of us was using the most updated versions of Rsubread and subread (Rsubread 2.18.0 and subread 2.0.1), and I recognized that this alone could cause an issue if those versions were not aligned with each other. So I downloaded Rsubread 2.20.0 and Subread 2.0.6 (unfortunately, I could not download 2.0.7 because it would require admin privileges, and 2.0.6 was the latest available on bioconda). I then ran the same 14 bam files (forward paired reads only) through Rsubread and bash Subread using the following scripts for each (same gtf file as well):

Subread 2.0.6 script:

featureCounts -t exon -g 'gene_id' -s 2 -o FeatureCounts_original.txt -a ${ref_dir}/${gtf_file} outputs/STAR/*/*.bam

Rsubread 2.20.0 script:

featureCounts(files=bamfiles, annot.ext=gtf, isGTFAnnotationFile=TRUE, isPairedEnd=FALSE, strandSpecific=2)

As far as I can tell by the manual, the parameters used between these two are the same, and both of the run logs indicate the same settings in the Settings box.

Of possible interest: the Rsubread output from version 2.18.0 was identical to that from 2.20.0, and the bash Subread output from version 2.0.1 was identical to that from 2.0.6.

In comparing the resulting feature count tables from the R version and bash version, there are major difference in the number of alignments assigned per sample and in total. Rsubread successfully assigned 205,861,649 more alignments, with an average of 14,704,404 more per sample. Here is a breakdown comparing the two outputs (information compiled from the run logs):

Comparison of successfully assigned reads Rsubread vs. Subread

My team is very confused about which method is the truth, as we would draw vastly different conclusions about the populations of genes that are significantly differentially regulated from each method. Do you see anything that I am doing incorrectly or have suggestions for why the R and bash results are not in agreement? I am happy to provide more information as well. Thank you.

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /apps/x86_64/R/4.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/x86_64/R/4.4.0/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.4.0
Subread Rsubread • 731 views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 19 hours ago
WEHI, Melbourne, Australia

The difference (as ATpoint has correctly pointed out) is that Rsubread::featureCounts counts multi-mapping reads by default whereas the command-line featureCounts does not. This would have been reported to you in the output settings box when the commands were run. Rsubread::featureCounts would have given you a settings box saying:

||      Multimapping reads : counted                                          ||

whereas the command-line version would have given you a settings box saying:

||      Multimapping reads : not counted                                      ||

so the settings should have been quite explicit.

We recommend that you use countMultiMappingReads=TRUE and allowMultiOverlap=FALSE for gene-level RNA-seq read counting, and that is why the defaults are set that way in Rsubread. With these settings, featureCounts will count a multi-mapping read only if it overlaps exactly one gene. The default settings mean that it counts multi-mapping reads only if they would be counted in exactly the same way by an EM algorithm. This not only makes logical sense, it has also proved to give good results in many thousands of RNA-seq analyses.

The different defaults in the two program versions are for historical reasons. Originally the default was countMultiMappingReads=FALSE in both versions. When experience showed that countMultiMappingReads=TRUE was better, it was easy to change the R code default, but changing the command-line default requires changes to the C code, which is more work. Remember that this is free software that is maintained long-term without any ongoing funding to do so.

I do not understand why you say you have "paired reads" but are setting isPairedEnd=FALSE. Using paired-end reads would cut down substantially on the proportion of multi-mapping reads. If you have paired-end reads, then both ends will originate from the same strand. I can't think of any reason for counting one end only from each pair.

ADD COMMENT
0
Entering edit mode

Hi Gordon, Thank you so much for this response. I went back and checked, and you are absolutely correct that it was shown in the settings box, and I missed it. My eyes must have jumped oddly going back and forth between my terminals.

I greatly appreciate your commentary on the methods used as well, so that I can bring these to the attention of my team. I am currently a trainee, and my role in this has been to take my colleague's pipeline and wrap it to allow automation, but I kept the same parameters they previously used, so I am unsure of the reasoning behind the choices. However, I believe we are going to take a hard look at the best practices for this and make sure we are following them, and your advice will be very helpful.

ADD REPLY
2
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 3 hours ago
Germany

The R version counts multimappers (countMultiMappingReads is TRUE by default) while the command line version does not.

ADD COMMENT
0
Entering edit mode

Thank you! I will try rerunning with that parameter and see if it fixes the discrepancy! I'll report back.

Update: This completely resolved the issue! The feature count tables are identical now. Thank you again SO MUCH! I agree--It makes no sense that (1) the defaults are different between the two and (2) that this setting in particular would be set to TRUE by default.

ADD REPLY

Login before adding your answer.

Traffic: 648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6