Hello,
I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.
Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?
This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.
The command to align follows the following convention:
align(index=indexname,
readfile1=readfile1,
readfile2=readfile2,
input_format="FASTQ",
output_format="BAM",
output_file=outfile,
unique=TRUE,
nBestLocations=1,
maxMismatches=5,
indels=5,
nthreads=6)
Here is the command used to generate count tables:
featureCounts(inputfiles,
annot.ext=gtf,
isGTFAnnotationFile=T,
useMetaFeatures=T,
isPairedEnd=ispaired,
allowMultiOverlap=T,
reportReads=T,
minMQS=30,
nthreads=ncpu,
ignoreDup=T)
Though md5 sums of the bam files output by align
change between runs. In all test cases R was built with gcc v4.4.7.
Thanks,
–Isaac
UPDATE
The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.
Here's a gist containing the contents of the $stats
field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.
There could be slight differences in counting results between different Rsubread versions. But you need to provide more details, including your commands, versions and counting summaries, so that the differences you observed can be better understood.
I've updated the question to include commands and versions used. I'm not sure what you mean by counting summaries, would you mind clarifying?