Question

Variance between runs of Rsubread

0

Entering edit mode

Isaac Virshup • 0

@isaac-virshup-11641

Last seen 8.2 years ago

University of Melbourne

Hello,

I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.

Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?

This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.

The command to align follows the following convention:

align(index=indexname,
    readfile1=readfile1,
    readfile2=readfile2,
    input_format="FASTQ",
    output_format="BAM",
    output_file=outfile,
    unique=TRUE,
    nBestLocations=1,
    maxMismatches=5,
    indels=5,
    nthreads=6)

Here is the command used to generate count tables:

featureCounts(inputfiles, 
    annot.ext=gtf, 
    isGTFAnnotationFile=T, 
    useMetaFeatures=T, 
    isPairedEnd=ispaired, 
    allowMultiOverlap=T, 
    reportReads=T, 
    minMQS=30, 
    nthreads=ncpu, 
    ignoreDup=T)

Though md5 sums of the bam files output by align change between runs. In all test cases R was built with gcc v4.4.7.

Thanks,

–Isaac

UPDATE

The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.

Here's a gist containing the contents of the $stats field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.

rsubread alignment reproducibility • 3.1k views

ADD COMMENT • link 8.6 years ago Isaac Virshup • 0

0

Entering edit mode

There could be slight differences in counting results between different Rsubread versions. But you need to provide more details, including your commands, versions and counting summaries, so that the differences you observed can be better understood.

ADD REPLY • link 8.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

I've updated the question to include commands and versions used. I'm not sure what you mean by counting summaries, would you mind clarifying?

ADD REPLY • link 8.6 years ago Isaac Virshup • 0

score 0 · Answer 1 · 2016-10-11

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 5 days ago

Australia/Melbourne

The difference you observed is likely to be caused by the changes in the align() function. We made improvement on the reporting of multi-mapping reads (this is why you saw differences for genes in repetitive regions) and also on the reporting of mapping quality scores.

The 'stat' component included in the object returned by featureCounts includes counting summary, which gives the breakdown of counting results for different groups of reads (multimapping reads, reads passing MQS cutoff etc). This summary is helpful for you to understand what caused the counting differences between your different runs.

ADD COMMENT • link 8.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Sorry, but I think there has a been a misunderstanding about my question. I'm seeing differences between alignments generated from a single install of RSubread. I'll add an update to my question to clarify this.

ADD REPLY • link 8.6 years ago Isaac Virshup • 0

0

Entering edit mode

Could you run sessionInfo() command and provide its output?

ADD REPLY • link 8.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Session info associated with the linked gist above:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

loaded via a namespace (and not attached):
[1] tools_3.3.1

ADD REPLY • link 8.6 years ago Isaac Virshup • 0

0

Entering edit mode

Hi Isaac, could you provide the bam file so we can take a close look? Could you also let us know which genome you mapped your reads to?

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

The alignment was against GRCh37.69. How should I send you a link to the files?

ADD REPLY • link 8.5 years ago Isaac Virshup • 0

0

Entering edit mode

You may send us the files via dropbox.

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Not sure if there's quite enough space in my dropbox at the moment... I've uploaded them to CloudStor (should be similar) here: https://cloudstor.aarnet.edu.au/plus/index.php/s/gh2AdmTmhr6Vdox

ADD REPLY • link 8.5 years ago Isaac Virshup • 0

0

Entering edit mode

Wei, have you had a chance to retrieve those files? I'd like to take those files down from my cloudstor drive.

Best,

–Isaac

ADD REPLY • link 8.5 years ago Isaac Virshup • 0

0

Entering edit mode

Yes we have downloaded the files. Thanks.

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Wei,

On the recommendation of a co-worker I tried running subread using a single thread to see if that produced deterministic output. I've run it a few times now and have been getting the same output – while multi-threaded output still varies by run. I ran this using conda installed R and Rsubread using the human genome from ensembl release 76. Hope this helps resolve this issue!

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.23.0

ADD REPLY • link 8.3 years ago Isaac Virshup • 0

0

Entering edit mode

Hi Isaac, this will be fixed in the next version which should be released soon.

ADD REPLY • link 8.3 years ago Wei Shi ★ 3.6k

0

Entering edit mode

That's great! Thanks for looking into this.

Will the new version be available with the next release of bioconductor or will it be out separately?