How does the readGappedAlignmentPairs from the GenomicRanges library
handle reads that map to several places in the genome?
Sometimes it can happen that one pair of the read is flagged as
properly paired even if the other read maps to several locations, how
is this handled?
Thank you in advance!
-- output of sessionInfo():
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
[4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0
--
Sent via the guest posting facility at bioconductor.org.
On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote:
>
> How does the readGappedAlignmentPairs from the GenomicRanges library
handle reads that map to several places in the genome?
>
> Sometimes it can happen that one pair of the read is flagged as
properly paired even if the other read maps to several locations, how
is this handled?
The full details will eventually be documented on
?readBamGappedAlignmentPairs and ?makeGappedAlignmentPairs; the author
is currently taking a few days off. My understanding is that each
alignment is a separate record, and that in the SAM specification one
mate tells the location of the second mate (and vice versa) so that
pairs can be identified unambiguously (except when the mate alignments
differ only in the pattern of indels within an end).
More details will be forthcoming.
Martin
> Thank you in advance!
>
> -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0 tools_2.15.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
On Sat, Jun 2, 2012 at 11:48 AM, Martin Morgan <mtmorgan@fhcrc.org>
wrote:
> On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote:
>
>>
>> How does the readGappedAlignmentPairs from the GenomicRanges
library
>> handle reads that map to several places in the genome?
>>
>> Sometimes it can happen that one pair of the read is flagged as
properly
>> paired even if the other read maps to several locations, how is
this
>> handled?
>>
>
> The full details will eventually be documented on
> ?readBamGappedAlignmentPairs and ?makeGappedAlignmentPairs; the
author is
> currently taking a few days off. My understanding is that each
alignment is
> a separate record, and that in the SAM specification one mate tells
the
> location of the second mate (and vice versa) so that pairs can be
> identified unambiguously (except when the mate alignments differ
only in
> the pattern of indels within an end).
>
> More details will be forthcoming.
>
>
I think I can add a few details, since I've been playing with this
lately.
I think the intent at least is to drop the read pairs that are not
one-to-one. I'm a little skeptical that it's actually working
correctly
though, because it seems to always drop exactly one read pair from my
BAMs,
even though there must be more than one that is ambiguous.
Take this with a grain of salt though.
Michael
> Martin
>
>
> Thank you in advance!
>>
>> -- output of sessionInfo():
>>
>> R version 2.15.0 (2012-03-30)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
base
>>
>> other attached packages:
>> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
>> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
>>
>> loaded via a namespace (and not attached):
>> [1] stats4_2.15.0 tools_2.15.0
>>
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor="">
>>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>
>
> ______________________________**_________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor="">
>
[[alternative HTML version deleted]]
Hi Vedran,
On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote:
>
> How does the readGappedAlignmentPairs from the GenomicRanges library
handle reads that map to several places in the genome?
Good question and I wish there was a simple answer...
readGappedAlignmentPairs() delegates to findMateAlignment() for doing
the pairing. findMateAlignment() does not have a full man page yet
but will have one soon. The man page will explain the algorithm used
for doing the pairing of records loaded from a BAM file.
Here is roughly how it works.
First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0,
and flag bit 0x8 set to 0 are candidates for pairing (see the SAM
Spec for a description of flag bits and fields). Any other record
is discarded. That is, records that correspond to single end reads,
and records that correspond to paired end reads where one or both
ends are unmapped, are discarded.
Then the algorithm looks at the following fields and flag bits:
(A) QNAME
(B) RNAME, RNEXT
(C) POS, PNEXT
(D) Flag bits Ox10 and 0x20
(E) Flag bits 0x40 and 0x80
2 records rec(i) and rec(j) are considered mates iff all the following
conditions are satisfied:
(A) They have the same QNAME
(B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i)
(C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i)
(D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and
Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i)
(E) rec(i) corresponds to the fi
rst segment in the template and
rec(j) corresponds to the last segment in the template
OR
rec(j) corresponds to the fi
rst segment in the template and
rec(i) corresponds to the last segment in the template
This algorithm will find almost all pairs unambiguously, even when
the same pair of reads maps to several places in the genome. Note
that when a given pair maps to a single place in the genome, looking
at (A) is enough to pair the 2 corresponding records. The additional
conditions (B), (C), (D) and (E) are only here to help in the
situation
where more than 2 records share the same QNAME. And that works most
of the times but there are still situations where this is not enough
to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GappedAlignments object)
that cannot be paired with the above algorithm:
** Showing the 4 records as a GappedAlignments object of length 4:
GappedAlignments with 4 alignments and 2 elementMetadata cols:
seqnames strand cigar qwidth start
end
<rle> <rle> <character> <integer> <integer>
<integer>
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R - 13M372N24M 37 6983858
6984266
SRR031714.2658602 chr2R - 13M378N24M 37 6983858
6984272
width ngap | mrnm mpos
<integer> <integer> | <factor> <integer>
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 409 1 | chr2R 6983850
SRR031714.2658602 415 1 | chr2R 6983850
Note that the BAM fields show up in the following columns:
- QNAME: the names of the GappedAlignments object (unnamed col)
- RNAME: the seqnames col
- POS: the start col
- RNEXT: the mrnm col
- PNEXT: the mpos col
As you can see, the aligner has aligned the same pair to the same
location twice! The only difference between the 2 aligned pairs is in
the cigar i.e. one end of the pair is aligned twice to the same
location
with exactly the same cigar while the other end of the pair is aligned
twice to the same location but with slightly different cigars.
** Now showing the corresponding flag bits:
isPaired isProperPair isUnmappedQuery hasUnmappedMate
isMinusStrand
[1,] 1 1 0 0
0
[2,] 1 1 0 0
0
[3,] 1 1 0 0
1
[4,] 1 1 0 0
1
isMateMinusStrand isFirstMateRead isSecondMateRead
isNotPrimaryRead
[1,] 1 0 1
0
[2,] 1 0 1
0
[3,] 0 1 0
0
[4,] 0 1 0
0
isNotPassingQualityControls isDuplicate
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 0 0
As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4)
are both first mates. But looking at (A), (B), (C), (D) and (E), the
pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could
be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to
disambiguate! Also note that everything is tagged as proper pair (flag
bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so
using this information would not help disambiguate here.
I'm wondering if there is some other place we should look at in the
BAM file in order to disambiguate (e.g. a tag?), or if those
ambiguous pairings are just part of the life with the SAM Spec.
Not sure whether this is a weakness of the Spec? Or A feature? Any
input on this would be appreciated.
In the meantime, findMateAlignment() is just ignoring those ambiguous
pairings (with a warning) i.e. records that cannot be paired
unambiguously are not paired at all. Concretely that means
that readGappedAlignmentPairs() is guaranteed to return a
GappedAlignmentPairs object where every pair could be formed in an
non-ambiguous way. Note that AFAICS in practice this approach
doesn't seem to leave aside a lot of records because ambiguous
pairing events seem pretty rare.
Cheers,
H.
>
> Sometimes it can happen that one pair of the read is flagged as
properly paired even if the other read maps to several locations, how
is this handled?
>
> Thank you in advance!
>
> -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0 tools_2.15.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
There was a small formatting problem with my previous email (due to
some copy/paste from the SAM Spec opened in Acroread to my mailer),
so I'm sending this again. Sorry for the noise...
On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote:
>
> How does the readGappedAlignmentPairs from the GenomicRanges library
handle reads that map to several places in the genome?
Good question and I wish there was a simple answer...
readGappedAlignmentPairs() delegates to findMateAlignment() for doing
the pairing. findMateAlignment() does not have a full man page yet
but will have one soon. The man page will explain the algorithm used
for doing the pairing of records loaded from a BAM file.
Here is roughly how it works.
First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0,
and flag bit 0x8 set to 0 are candidates for pairing (see the SAM
Spec for a description of flag bits and fields). Any other record
is discarded. That is, records that correspond to single end reads,
and records that correspond to paired end reads where one or both
ends are unmapped, are discarded.
Then the algorithm looks at the following fields and flag bits:
(A) QNAME
(B) RNAME, RNEXT
(C) POS, PNEXT
(D) Flag bits Ox10 and 0x20
(E) Flag bits 0x40 and 0x80
2 records rec(i) and rec(j) are considered mates iff all the following
conditions are satisfied:
(A) They have the same QNAME
(B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i)
(C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i)
(D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and
Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i)
(E) rec(i) corresponds to the first segment in the template and
rec(j) corresponds to the last segment in the template
OR
rec(j) corresponds to the first segment in the template and
rec(i) corresponds to the last segment in the template
This algorithm will find almost all pairs unambiguously, even when
the same pair of reads maps to several places in the genome. Note
that when a given pair maps to a single place in the genome, looking
at (A) is enough to pair the 2 corresponding records. The additional
conditions (B), (C), (D) and (E) are only here to help in the
situation
where more than 2 records share the same QNAME. And that works most
of the times but there are still situations where this is not enough
to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GappedAlignments object)
that cannot be paired with the above algorithm:
** Showing the 4 records as a GappedAlignments object of length 4:
GappedAlignments with 4 alignments and 2 elementMetadata cols:
seqnames strand cigar qwidth start
end
<rle> <rle> <character> <integer> <integer>
<integer>
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R - 13M372N24M 37 6983858
6984266
SRR031714.2658602 chr2R - 13M378N24M 37 6983858
6984272
width ngap | mrnm mpos
<integer> <integer> | <factor> <integer>
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 409 1 | chr2R 6983850
SRR031714.2658602 415 1 | chr2R 6983850
Note that the BAM fields show up in the following columns:
- QNAME: the names of the GappedAlignments object (unnamed col)
- RNAME: the seqnames col
- POS: the start col
- RNEXT: the mrnm col
- PNEXT: the mpos col
As you can see, the aligner has aligned the same pair to the same
location twice! The only difference between the 2 aligned pairs is in
the cigar i.e. one end of the pair is aligned twice to the same
location
with exactly the same cigar while the other end of the pair is aligned
twice to the same location but with slightly different cigars.
** Now showing the corresponding flag bits:
isPaired isProperPair isUnmappedQuery hasUnmappedMate
isMinusStrand
[1,] 1 1 0 0
0
[2,] 1 1 0 0
0
[3,] 1 1 0 0
1
[4,] 1 1 0 0
1
isMateMinusStrand isFirstMateRead isSecondMateRead
isNotPrimaryRead
[1,] 1 0 1
0
[2,] 1 0 1
0
[3,] 0 1 0
0
[4,] 0 1 0
0
isNotPassingQualityControls isDuplicate
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 0 0
As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4)
are both first mates. But looking at (A), (B), (C), (D) and (E), the
pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could
be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to
disambiguate! Also note that everything is tagged as proper pair (flag
bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so
using this information would not help disambiguate here.
I'm wondering if there is some other place we should look at in the
BAM file in order to disambiguate (e.g. a tag?), or if those
ambiguous pairings are just part of the life with the SAM Spec.
Not sure whether this is a weakness of the Spec? Or A feature? Any
input on this would be appreciated.
In the meantime, findMateAlignment() is just ignoring those ambiguous
pairings (with a warning) i.e. records that cannot be paired
unambiguously are not paired at all. Concretely that means
that readGappedAlignmentPairs() is guaranteed to return a
GappedAlignmentPairs object where every pair was formed in an
non-ambiguous way. Note that AFAICS in practice this approach
doesn't seem to leave aside a lot of records because ambiguous
pairing events seem pretty rare.
Cheers,
H.
>
> Sometimes it can happen that one pair of the read is flagged as
properly paired even if the other read maps to several locations, how
is this handled?
>
> Thank you in advance!
>
> -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0 tools_2.15.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319