Entering edit mode
Hi,
I was running QC for my whole exome data (paired-end sequencing) and
got this error:
> readpairs <- reads2pairs(reads) Error in reads2pairs(reads) : read pair IDs do not seem to be unique sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TEQC_3.10.0 hwriter_1.3.2 Rsamtools_1.22.0 [4] Biostrings_2.38.4 XVector_0.10.0 GenomicRanges_1.22.4 [7] GenomeInfoDb_1.6.3 IRanges_2.4.8 S4Vectors_0.8.11 [10] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] bitops_1.0-6 futile.options_1.0.0 zlibbioc_1.16.0 [4] futile.logger_1.4.3 BiocParallel_1.4.3 lambda.r_1.1.9 [7] tools_3.2.3 Biobase_2.30.0
Any idea why is that? for paired-end bam, each ID should replicate 2 times for the mates? right?
Thanks,
TOmmy
I actually used bwa mem to map to get the bam files, so bwa mem is keeping the multi-mapped reads I guess.
You mentioned that `get.reads` only read in the primary alignments? but why it is still there?
Maybe I need to remove those reads in bam before feeding into TEQC?
Thank you!
Tommy
That only primary alignments are kept in 'get.reads' was introduced in TEQC version 3.13.1.
If the error still appears for recent TEQC versions and/or after removing secondary alignments from the bam, I should have a closer look.
Manuela
I was using 3.1.0. A side question is how TEQC determine the not uniquely mapped reads for BWA-MEM aligned bam files.
I read here:
https://www.biostars.org/p/76893/
https://www.biostars.org/p/167892/
BWA-mem added XA tag to a read if it is mapped to multiple places. some people also use
samtools view -q 1
but q=0 is only a convention for multi-mapped reads by BWA-mem not by other aligners.
Tommy
TEQC reads bam files with the 'scanBam' function in the 'Rsamtools' package. This function uses 'scanBamFlag' to determine which reads to keep based on the FLAG field in the bam file (second column).
Secondary alignments have flag = 256, and filtering those out with 'scanBam' should be equivalent to:
(Additionally TEQC specifies isUnmappedQuery=FALSE in order to remove reads without match to the reference.)
I don't know how this refers to the specifications that BWA-mem adds to other fields in the bam...
samtools view -F 256 will remove some reads, but not all not uniquely mapped reads. Have you tried TEQC on any BWA-MEM aligned bam files?
from Heng Li:
https://www.biostars.org/p/59281/#59303
Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate if the read is mapped uniquely, repetitively or unmapped. Since then, we are used to talking about `mapping uniqueness' and extend the concept to generic alignments without asking ourselves what is the exact definiation of uniqueness and if such a concept is useful in practice at all.
For Eland, mapping uniqness is clearly defined. A read is said to be aligned uniquely if the best two alignments have identical number of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.
However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?
In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.
Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness as we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have identical number of differences (mismatches plus gaps). The definition is clear, but not useful. We know on Illumina reads indel errors occur rarely. A hit containing one mismatch is definitely preferred over a hit with one gap.
Things get even worse when we clip reads as what we do for capillary reads. We can only define a read being uniquely mapped when the top two alignments have identical alignment score. However, this is almost practically useless at all. For long reads, frequently we get alignments with similar scores, but we seldom get two with identical scores.
Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my maq paper. Mapping quality is phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality is related to a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend all aligners to report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years later. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.