Summary
It seems featureCounts under certain circumstances ignores the isPairedEnd = TRUE
argument.
Background
I have been using featureCounts successfully for many years but recently ran into a problem in files where I have a mixture of paired and unpaired strand specific data (Produced by Trimmomatic mapped by Hisat2 (v 2.0.1-beta)). To illustrate the problem consider the two bam files (and their indexes) here. Both sam files contain a subset of chr1 consisting of the first 5 million basepairs (chr1-1-5.000.000) in mm10. The only difference between the files is that one were mapped using only the paired data (onlypaired.bam) and the other also includes the unpaired data (pairedand_unpaired.bam).
Analysis goal Quantify the sense/anti-sense ratio of reads mapping to genes. For the example files we are specifically interested in the ENSMUSG00000025902.13 gene located at chr1:4490931-4497354.
Samtools approach: As far as I can tell the files are identical with regards to mapping and paired flags and the flags works fine.
samtools view -c -f 1 only_paired.bam chr1:4490931-4497354
2573 paired reads (and a total of 2657 reads in pairedandunpaired.bam)
When considering ENSMUSG00000025902.13 the the strand of the first-mate seems specific in both files:
samtools view -f 64 only_paired.bam chr1:4490931-4497354 | sam2bed | cut -f 6 | sort | uniq -c
11 -
1272 +
samtools view -f 64 paired_and_unpaired.bam chr1:4490931-4497354 | sam2bed | cut -f 6 | sort | uniq -c
11 -
1272 +
So far so good. Now lets try with featureCount from the R package Rsubread
Rsubread::featureCount approach:
library(Rsubread)
### Create regions to quantify
# sense
x <- capture.output( # prevent the subread printout
senseQuantPaired <- featureCounts(
files = bamFiles,
annot.ext=senseAnnot,
isPairedEnd = TRUE,
strandSpecific = 1,
)
)
# anit-sense paired
x <- capture.output( # prevent the subread printout
antisenseQuantPaired <- featureCounts(
files = bamFiles,
annot.ext=antisenseAnnot,
isPairedEnd = TRUE,
strandSpecific = 1,
)
)
# anit-sense unpaired
x <- capture.output( # prevent the subread printout
antisenseQuantUnpaired <- featureCounts(
files = bamFiles,
annot.ext=antisenseAnnot,
isPairedEnd = FALSE,
strandSpecific = 1,
)
)
x <- capture.output( # prevent the subread printout
antisenseQuantUnpaired <- featureCounts(
files = bamFiles,
annot.ext=antisenseAnnot,
isPairedEnd = FALSE,
nthreads=1,
strandSpecific = 1,
)
)
### Massage output into df
data.frame(
libType = c('only_paired','paired_and_unpaired'),
sense = senseQuant$counts[1,],
antisensePaired = antisenseQuantPaired$counts[1,],
antisenseUnpaired = antisenseQuantUnpaired$counts[1,],
row.names = NULL
)
libType sense antisensePaired antisenseUnpaired
1 only_paired 1283 11 1283
2 paired_and_unpaired 1341 1309 1309
From the "sense" column we see that both methods agree on the number of reads mapping to the '+' strand - results supported by the samtools approach above. The problem is the number of reads mapping to the '-' strand. When considering the file only containing paired reads very few reads map to the '-' strand (again just like the samtools approach) but for the file also containing unpaired reads featureCounts finds a lot of reads. Actually the same number of reads as if we quantified the "pairedandunpaired.bam" file as unpaired data. The read-count can also be rescued by filtering the pairedandunpaired file with the "-f 1" flag via samtools (not shown).
Summary
So it seems featureCounts under certain circumstances ignores the isPairedEnd = TRUE
argument...? And it seems to have something to do with the paired/unpaired reads (composition) in the first 5e6 since if I just extract the ENSMUSG00000025902.13 gene (aka ignore the first ~4490000 nt) the minus-strand quantification is correct.
Hope this get fixed soon.
Cheers Kristoffer
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.32.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 magrittr_1.5 usethis_1.4.0 devtools_2.0.1 pkgload_1.0.2 R6_2.4.0
[7] rlang_0.3.1 tools_3.5.1 pkgbuild_1.0.2 sessioninfo_1.1.1 cli_1.0.1 withr_2.1.2
[13] remotes_2.0.2 yaml_2.2.0 assertthat_0.2.0 digest_0.6.18 rprojroot_1.3-2 crayon_1.3.4
[19] processx_3.2.0 callr_3.0.0 base64enc_0.1-3 fs_1.2.6 ps_1.2.1 testthat_2.0.1
[25] memoise_1.1.0 glue_1.3.0 compiler_3.5.1 desc_1.2.0 backports_1.1.2 prettyunits_1.0.2
Hi Kristoffer,
I'm not sure what is in the
paired_and_unpaired.bam
file. Does it contain both single-end and paired-end reads? You can figure it out by running$ samtools view paired_and_unpaired.bam | awk 'and($2, 1) == 1' | wc -l
and
$ samtools view paired_and_unpaired.bam | awk 'and($2, 1) == 0' | wc -l
If both commands give non-zero numbers then your BAM file is a mixture of single- and paired-end reads. FeatureCounts assumes that a BAM file can contain only one type of reads, either single-end or paired-end, but never the mixture of them. It then checks the type of the reads in an input file by looking at the first read in it. If it sees a single-end read (ie without the
0x01
FLAG) at the beginning of the input file, it treats all the reads in it as single-end.You can also allow screen output from featureCounts and check the type of reads in the screen output.
If you see
S paired_and_unpaired.bam
in the screen output, it means that featureCounts treats all reads as single-end.If you see
P paired_and_unpaired.bam
in the screen output, it means that featureCounts treats all reads as paired-end.Cheers, Yang
Hi Yang
Thanks for clarifying this - and yes you are right it contain a mixture of paired and unpaired reads. Why would featureCounts not support mixed paired or unpaired reads? Means that if I have paired end data that needs trimming I will throw away a lot of data - often 5-15%...
I think I was mislead by the "isPairedEnd" argument. Maybe adding a note explaining this to the documentation this would be helpfull for future users...
One reason for featureCounts not allowing the mixture of single-end and paired-end reads in one SAM/BAM file is that the counts of single reads shouldn't be added to the counts of fragments because they are different things.
A normal read aligner reports unpaired read mapping results as paired-end reads. Namely, although only one end in the read-pair is mapped, it still has a 0x01 FLAG in the SAM/BAM file to indicate that the input is paired-end.
I think you might have mis-understood me due to me not being clear - sorry about that :-). What I am talking about is not mixing single-end and paire-end libraries. I am talking about the case where a paired end library where trimming of low quality reads etc (via e.g. Trimmomatic) causes the removal of one of the read mates but not the other so you end up with data that is a mix of paired and unpaired reads from the same paired experiment (that is what the example files used above is). In such a case the unpaired reads are equivalent of cases where only one mate mapped (as you mention in the last comment). Why would that not be supported by featureCounts?
featureCounts can correctly process any bam files generated from the mapping of reads generated from a paired end library including those bam files that contain read pairs that have only one end mapped (the other end is not required to be included in the same file), as long as all the reads are marked as being generated from a read pair in their FLAG field. However your bam files including unpaired reads do not seem to have this FLAG set and featureCounts will treat them as single end reads. featureCounts identifies single end and paired end reads according to the SAM/BAM specification.
Note that you will get unintended counting results if you count paired end reads as single end - the second read in a pair will be incorrectly counted to the strand opposite to the strand the first read is counted to. When featureCounts identifies the reads as paired end reads it will flip the strand of the second read to make sure the two reads from the same pair are counted to the same strand.
Could you please also provide your full commands for trimming and aligning your raw reads (reads from FASTQ files) so we can see why paired end reads were marked as single end?
For trimming I am using Trimmomatic and it does not specifically matter which filtering you are doing but when you trim paired data the result is 4 files (see "Paired End Mode" section of the manual) two which contains the ordered paired reads and two files containing respectively the left and right reads which lost their mate due to the trimming.
For mapping I am using HISAT2 where the trimmed paired files are mapped via the "-1" and "-2" argument and the one without a mate after trimming are mapped with the "-U" ( see HISTAT options here)
So in summary if a mates is lost due to trimming they will not be marked as paired when you map them. Sorry if it is a naive question but could such read not be supported by counting them just like reads where only one mate was mapped?
The problem isn't in the mapping, the problem is that featureCounts knows to expect read 2 to be in the opposite orientation as read 1. But if it doesn't know it's looking at read 2, and thinks its looking at read 1, the orientation will not what the software is expecting
I suppose what you could do is align the read2 singletons separately, and manually change the flags in the bam to indicate that those reads are read2 where read1 failed to map. (Flag 137 I believe). Then merged that bam with the paired one, and FeatureCounts will probably understand that.
But that seems like a lot of work for what have to be a very small % of your reads.
An easier and more complete solution would be to simply not use Trimmomatic, as suggested by Wei Shi in his answer.