GappedAlignmentPairs requests

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 3 hours ago

Seattle, WA, United States

Hi Michael, On 06/29/2012 05:06 AM, Michael Lawrence wrote: > Hi (Herve), > > I have been playing around with GappedAlignmentPairs and found it pretty > useful. Some things that would be nice: > > 1) A method for getting the observed fragment length (would need to be NA > when pairs are on different chromosomes). This should at least have an > option (if not always) to exclude the N cigar runs from the calculation. Or > it could just take it from the TLEN (isize) column in the BAM file (which > at least in the case of gsnap excludes the Ns). From the SAM Spec: TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. [...] It is set as 0 for single-segment template or when the information is unavailable. 5 notes: (1) It seems that most of the time this information is not available in the BAM files produced by the aligners (at least not in those I've seen so far). (2) I wonder if the above definition really makes sense for a paired-end read with the 2 ends aligned to the *same* strand. It's actually not clear to me how to interpret such a situation and why the aligner is producing this kind of paired alignments. Seems to reveal a template coming from a transcript that mixes exons from both strands. (3) Taking "the number of bases from the leftmost mapped base to the rightmost mapped base" is ok most of the times but is questionable in some situations e.g. when the first segment is not mapped upstream of the last segment. I'd rather set TLEN to NA in that case too, like when the 2 segments are not mapped to the same reference, or when they are mapped to the same strand. (4) The above is just a definition so it is what it is (and is not inherently correct or incorrect) but it seems to me that excluding the N gaps would be closer to the intuitive notion of "observed template length". (5) The above definition ignores soft clipping. So here is a proposal that ignores the N gaps but takes into account soft clipping. 'galp' is GappedAlignmentPairs object. Note that we don't need to deal with the situation where the 2 ends are not mapped to the same reference, or are mapped to the same strand, because those pairs are discarded (with a warning) when the GappedAlignmentPairs object is made (usually with readGappedAlignmentPairs()): otlen <- function(galp) { lgal <- left(galp) rgal <- right(galp) lend <- end(lgal) rstart <- start(rgal) ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L ## Set to NA when the first and last ends are not in an ## upstream/downstream configuration: not_upstream_downstream <- rstart <= lend ans[not_upstream_downstream] <- NA_integer_ ans } If we want to be a little less conservative about the first and last ends being in an upstream/downstream configuration, we can replace the computation of 'not_upstream_downstream' with: lstart <- start(lgal) rend <- end(rgal) not_upstream_downstream <- rstart < lstart | rend < lend > > 2) A method called "introns" or something that returns the N regions as a > GRangesList, as a complement to grglist(). The special gap between the 2 ends is a little bit problematic. How should we deal with it given that (1) it has a different meaning than the N gaps (i.e. it does not in general correspond to an intron, only by chance), and (2) it's not always here (e.g. when the first and last ends are not in an upstream/downstream configuration)? Of course (2) could always be handled by removing the problematic alignment pairs. > Such a method would also be nice > on GappedAlignments. This one is easier ('gal' being a GappedAlignments object): psetdiff(granges(gal), grglist(gal)) Also I think you actually sent us a "gaps" method for GRangesList recently that we are about to add to the GenomicRanges package. Once we have it, the user will be able to do: gaps(grglist(gal)) as an equivalent way to do the above. Thanks, H. > > Thanks, > Michael > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Alignment Cancer GenomicRanges Alignment Cancer GenomicRanges • 1.3k views

ADD COMMENT • link updated 12.7 years ago by Michael Lawrence ★ 11k • written 12.8 years ago by Hervé Pagès 16k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Michael, > > > On 06/29/2012 05:06 AM, Michael Lawrence wrote: > >> Hi (Herve), >> >> I have been playing around with GappedAlignmentPairs and found it pretty >> useful. Some things that would be nice: >> >> 1) A method for getting the observed fragment length (would need to be NA >> when pairs are on different chromosomes). This should at least have an >> option (if not always) to exclude the N cigar runs from the calculation. >> Or >> it could just take it from the TLEN (isize) column in the BAM file (which >> at least in the case of gsnap excludes the Ns). >> > > From the SAM Spec: > > TLEN: signed observed Template LENgth. If all segments are mapped to > the same reference, the unsigned observed template length equals the > number of bases from the leftmost mapped base to the rightmost mapped > base. [...] It is set as 0 for single-segment template or when the > information is unavailable. > > 5 notes: > > (1) It seems that most of the time this information is not available > in the BAM files produced by the aligners (at least not in those > I've seen so far). > > (2) I wonder if the above definition really makes sense for a > paired-end read with the 2 ends aligned to the *same* strand. > It's actually not clear to me how to interpret such a situation > and why the aligner is producing this kind of paired alignments. > Seems to reveal a template coming from a transcript that mixes > exons from both strands. > > Yes, for non-concordant pairs, like ends that do not map to opposite strands, or that map to different chromosomes or very far away on the same chromosome, this would not be easy to calculate. Presumably, there is some sort of genomic rearrangement or trans-splicing, and the precise break-point would need to be determined. > (3) Taking "the number of bases from the leftmost mapped base > to the rightmost mapped base" is ok most of the times but is > questionable in some situations e.g. when the first segment > is not mapped upstream of the last segment. I'd rather set > TLEN to NA in that case too, like when the 2 segments are not > mapped to the same reference, or when they are mapped to the > same strand. > > This makes sense to me. > (4) The above is just a definition so it is what it is (and is not > inherently correct or incorrect) but it seems to me that > excluding the N gaps would be closer to the intuitive notion > of "observed template length". > > (5) The above definition ignores soft clipping. > > Right. The definition is not that helpful, and Tom has already decided to violate it with gsnap, which counts the soft-clipped regions and discounts the N regions when populating the TLEN field. So we're currently happy just taking TLEN from the BAM file. > So here is a proposal that ignores the N gaps but takes into account > soft clipping. 'galp' is GappedAlignmentPairs object. Note that we > don't need to deal with the situation where the 2 ends are not > mapped to the same reference, or are mapped to the same strand, because > those pairs are discarded (with a warning) when the GappedAlignmentPairs > object is made (usually with readGappedAlignmentPairs()): > > otlen <- function(galp) > { > lgal <- left(galp) > rgal <- right(galp) > lend <- end(lgal) > rstart <- start(rgal) > ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L > > ## Set to NA when the first and last ends are not in an > ## upstream/downstream configuration: > not_upstream_downstream <- rstart <= lend > ans[not_upstream_downstream] <- NA_integer_ > > ans > } > > If we want to be a little less conservative about the first and > last ends being in an upstream/downstream configuration, we > can replace the computation of 'not_upstream_downstream' with: > > lstart <- start(lgal) > rend <- end(rgal) > not_upstream_downstream <- rstart < lstart | rend < lend > > > >> 2) A method called "introns" or something that returns the N regions as a >> GRangesList, as a complement to grglist(). >> > > The special gap between the 2 ends is a little bit problematic. > How should we deal with it given that (1) it has a different > meaning than the N gaps (i.e. it does not in general correspond to an > intron, only by chance), and (2) it's not always here (e.g. when > the first and last ends are not in an upstream/downstream > configuration)? Of course (2) could always be handled by removing > the problematic alignment pairs. > > Well, my request was to discard the inter-read gap, so that I would just have the splices as called by the aligner. It would be nice to have another function that would return the inter-read gap. I actually have functions that do all these things, but they're not very elegant or robust to all of these situations. > > Such a method would also be nice >> on GappedAlignments. >> > > This one is easier ('gal' being a GappedAlignments object): > > psetdiff(granges(gal), grglist(gal)) > > Also I think you actually sent us a "gaps" method for GRangesList > recently that we are about to add to the GenomicRanges package. > Once we have it, the user will be able to do: > > gaps(grglist(gal)) > > as an equivalent way to do the above. > > Right, but as I said, we want to ignore the inter-read gap, which takes a bit more work. Right now, I do the above and then setdiff off the inter-read gap. > Thanks, > > H. > > >> Thanks, >> Michael >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > [[alternative HTML version deleted]]

ADD COMMENT • link 12.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, On 07/10/2012 03:45 AM, Michael Lawrence wrote: > > > On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Michael, > > > On 06/29/2012 05:06 AM, Michael Lawrence wrote: > > Hi (Herve), > > I have been playing around with GappedAlignmentPairs and found > it pretty > useful. Some things that would be nice: > > 1) A method for getting the observed fragment length (would need > to be NA > when pairs are on different chromosomes). This should at least > have an > option (if not always) to exclude the N cigar runs from the > calculation. Or > it could just take it from the TLEN (isize) column in the BAM > file (which > at least in the case of gsnap excludes the Ns). > > > From the SAM Spec: > > TLEN: signed observed Template LENgth. If all segments are mapped to > the same reference, the unsigned observed template length equals the > number of bases from the leftmost mapped base to the rightmost mapped > base. [...] It is set as 0 for single-segment template or when the > information is unavailable. > > 5 notes: > > (1) It seems that most of the time this information is not available > in the BAM files produced by the aligners (at least not in those > I've seen so far). > > (2) I wonder if the above definition really makes sense for a > paired-end read with the 2 ends aligned to the *same* strand. > It's actually not clear to me how to interpret such a situation > and why the aligner is producing this kind of paired alignments. > Seems to reveal a template coming from a transcript that mixes > exons from both strands. > > > > Yes, for non-concordant pairs, like ends that do not map to opposite > strands, or that map to different chromosomes or very far away on the > same chromosome, this would not be easy to calculate. Presumably, there > is some sort of genomic rearrangement or trans-splicing, and the precise > break-point would need to be determined. > > (3) Taking "the number of bases from the leftmost mapped base > to the rightmost mapped base" is ok most of the times but is > questionable in some situations e.g. when the first segment > is not mapped upstream of the last segment. I'd rather set > TLEN to NA in that case too, like when the 2 segments are not > mapped to the same reference, or when they are mapped to the > same strand. > > > This makes sense to me. > > (4) The above is just a definition so it is what it is (and is not > inherently correct or incorrect) but it seems to me that > excluding the N gaps would be closer to the intuitive notion > of "observed template length". > > (5) The above definition ignores soft clipping. > > > Right. The definition is not that helpful, and Tom has already decided > to violate it with gsnap, which counts the soft-clipped regions and > discounts the N regions when populating the TLEN field. So we're > currently happy just taking TLEN from the BAM file. > > So here is a proposal that ignores the N gaps but takes into account > soft clipping. 'galp' is GappedAlignmentPairs object. Note that we > don't need to deal with the situation where the 2 ends are not > mapped to the same reference, or are mapped to the same strand, because > those pairs are discarded (with a warning) when the GappedAlignmentPairs > object is made (usually with readGappedAlignmentPairs()): > > otlen <- function(galp) > { > lgal <- left(galp) > rgal <- right(galp) > lend <- end(lgal) > rstart <- start(rgal) > ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L > > ## Set to NA when the first and last ends are not in an > ## upstream/downstream configuration: > not_upstream_downstream <- rstart <= lend > ans[not_upstream_downstream] <- NA_integer_ > > ans > } > > If we want to be a little less conservative about the first and > last ends being in an upstream/downstream configuration, we > can replace the computation of 'not_upstream_downstream' with: > > lstart <- start(lgal) > rend <- end(rgal) > not_upstream_downstream <- rstart < lstart | rend < lend > > > > 2) A method called "introns" or something that returns the N > regions as a > GRangesList, as a complement to grglist(). > > > The special gap between the 2 ends is a little bit problematic. > How should we deal with it given that (1) it has a different > meaning than the N gaps (i.e. it does not in general correspond to an > intron, only by chance), and (2) it's not always here (e.g. when > the first and last ends are not in an upstream/downstream > configuration)? Of course (2) could always be handled by removing > the problematic alignment pairs. > > > Well, my request was to discard the inter-read gap, so that I would just > have the splices as called by the aligner. I see. Then this can be done with something like this: ## First we need an optimized mendoapply(c, x, y) for CompressedList ## objects (probably worth adding to IRanges, could also be called ## elementCombine() and take an arbitrary number of objects). pcombine <- function(x, y) { if (!is(x, "CompressedList") || class(x) != class(y) || length(x) != length(y)) stop("'x' and 'y' must be CompressedList objects ", "of the same class and length") tmp <- c(x, y)[GenomicRanges:::.makePickupIndex(length(x))] ans <- GenomicRanges:::.shrinkByHalf(tmp) elementMetadata(ans) <- NULL ans } ## Then the introns() function itself for GappedAlignmentPairs. introns <- function(x) { if (!is(x, "GappedAlignmentPairs")) stop("'x' must be a GappedAlignmentPairs object") ## Extract introns (i.e. N gaps) from the first ends. Fgal <- first(x) Fgrl <- grglist(Fgal, order.as.in.query=TRUE) Fintrons <- psetdiff(granges(Fgal), Fgrl) ## Extract introns (i.e. N gaps) from the last ends. Lgal <- last(x, invert.strand=TRUE) Lgrl <- grglist(Lgal, order.as.in.query=TRUE) Lintrons <- psetdiff(granges(Lgal), Lgrl) ## Combine the introns from first and last ends into a ## single GRangesList object. ans <- pcombine(Fintrons, Lintrons) names(ans) <- names(x) ans } Try it: my_introns <- introns(galp) stopifnot(identical(elementLengths(my_introns), ngap(first(galp)) + ngap(last(galp)))) > It would be nice to have > another function that would return the inter-read gap. I actually have > functions that do all these things, but they're not very elegant or > robust to all of these situations. > > > Such a method would also be nice > on GappedAlignments. ^^^^^^^^^^^^^^^^ > > > This one is easier ('gal' being a GappedAlignments object): ^^^^^^^^^^^^^^^^ > > psetdiff(granges(gal), grglist(gal)) > > Also I think you actually sent us a "gaps" method for GRangesList > recently that we are about to add to the GenomicRanges package. > Once we have it, the user will be able to do: > > gaps(grglist(gal)) > > as an equivalent way to do the above. > > > Right, but as I said, we want to ignore the inter-read gap, which takes > a bit more work. Right now, I do the above and then setdiff off the > inter-read gap. The above is for a GappedAlignments object so there is no inter-read gap. HTH, H. > > Thanks, > > H. > > > Thanks, > Michael > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.7 years ago Hervé Pagès 16k

0

Entering edit mode

Looks great. Will it land in GenomicRanges? Also, a function for getting the inter-read gap would be nice, even if trivial. Thanks, Michael On Tue, Jul 10, 2012 at 4:22 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Michael, > > > On 07/10/2012 03:45 AM, Michael Lawrence wrote: > >> >> >> On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> Hi Michael, >> >> >> On 06/29/2012 05:06 AM, Michael Lawrence wrote: >> >> Hi (Herve), >> >> I have been playing around with GappedAlignmentPairs and found >> it pretty >> useful. Some things that would be nice: >> >> 1) A method for getting the observed fragment length (would need >> to be NA >> when pairs are on different chromosomes). This should at least >> have an >> option (if not always) to exclude the N cigar runs from the >> calculation. Or >> it could just take it from the TLEN (isize) column in the BAM >> file (which >> at least in the case of gsnap excludes the Ns). >> >> >> From the SAM Spec: >> >> TLEN: signed observed Template LENgth. If all segments are mapped to >> the same reference, the unsigned observed template length equals the >> number of bases from the leftmost mapped base to the rightmost mapped >> base. [...] It is set as 0 for single-segment template or when the >> information is unavailable. >> >> 5 notes: >> >> (1) It seems that most of the time this information is not available >> in the BAM files produced by the aligners (at least not in those >> I've seen so far). >> >> (2) I wonder if the above definition really makes sense for a >> paired-end read with the 2 ends aligned to the *same* strand. >> It's actually not clear to me how to interpret such a situation >> and why the aligner is producing this kind of paired alignments. >> Seems to reveal a template coming from a transcript that mixes >> exons from both strands. >> >> >> >> Yes, for non-concordant pairs, like ends that do not map to opposite >> strands, or that map to different chromosomes or very far away on the >> same chromosome, this would not be easy to calculate. Presumably, there >> is some sort of genomic rearrangement or trans-splicing, and the precise >> break-point would need to be determined. >> >> (3) Taking "the number of bases from the leftmost mapped base >> to the rightmost mapped base" is ok most of the times but is >> questionable in some situations e.g. when the first segment >> is not mapped upstream of the last segment. I'd rather set >> TLEN to NA in that case too, like when the 2 segments are not >> mapped to the same reference, or when they are mapped to the >> same strand. >> >> >> This makes sense to me. >> >> (4) The above is just a definition so it is what it is (and is not >> inherently correct or incorrect) but it seems to me that >> excluding the N gaps would be closer to the intuitive notion >> of "observed template length". >> >> (5) The above definition ignores soft clipping. >> >> >> Right. The definition is not that helpful, and Tom has already decided >> to violate it with gsnap, which counts the soft-clipped regions and >> discounts the N regions when populating the TLEN field. So we're >> currently happy just taking TLEN from the BAM file. >> >> So here is a proposal that ignores the N gaps but takes into account >> soft clipping. 'galp' is GappedAlignmentPairs object. Note that we >> don't need to deal with the situation where the 2 ends are not >> mapped to the same reference, or are mapped to the same strand, >> because >> those pairs are discarded (with a warning) when the >> GappedAlignmentPairs >> object is made (usually with readGappedAlignmentPairs()): >> >> otlen <- function(galp) >> { >> lgal <- left(galp) >> rgal <- right(galp) >> lend <- end(lgal) >> rstart <- start(rgal) >> ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L >> >> ## Set to NA when the first and last ends are not in an >> ## upstream/downstream configuration: >> not_upstream_downstream <- rstart <= lend >> ans[not_upstream_downstream] <- NA_integer_ >> >> ans >> } >> >> If we want to be a little less conservative about the first and >> last ends being in an upstream/downstream configuration, we >> can replace the computation of 'not_upstream_downstream' with: >> >> lstart <- start(lgal) >> rend <- end(rgal) >> not_upstream_downstream <- rstart < lstart | rend < lend >> >> >> >> 2) A method called "introns" or something that returns the N >> regions as a >> GRangesList, as a complement to grglist(). >> >> >> The special gap between the 2 ends is a little bit problematic. >> How should we deal with it given that (1) it has a different >> meaning than the N gaps (i.e. it does not in general correspond to an >> intron, only by chance), and (2) it's not always here (e.g. when >> the first and last ends are not in an upstream/downstream >> configuration)? Of course (2) could always be handled by removing >> the problematic alignment pairs. >> >> >> Well, my request was to discard the inter-read gap, so that I would just >> have the splices as called by the aligner. >> > > I see. Then this can be done with something like this: > > ## First we need an optimized mendoapply(c, x, y) for CompressedList > ## objects (probably worth adding to IRanges, could also be called > ## elementCombine() and take an arbitrary number of objects). > pcombine <- function(x, y) > { > if (!is(x, "CompressedList") || > class(x) != class(y) || > length(x) != length(y)) > stop("'x' and 'y' must be CompressedList objects ", > "of the same class and length") > tmp <- c(x, y)[GenomicRanges:::.**makePickupIndex(length(x))] > ans <- GenomicRanges:::.shrinkByHalf(**tmp) > elementMetadata(ans) <- NULL > ans > } > > ## Then the introns() function itself for GappedAlignmentPairs. > introns <- function(x) > { > if (!is(x, "GappedAlignmentPairs")) > stop("'x' must be a GappedAlignmentPairs object") > > ## Extract introns (i.e. N gaps) from the first ends. > Fgal <- first(x) > Fgrl <- grglist(Fgal, order.as.in.query=TRUE) > Fintrons <- psetdiff(granges(Fgal), Fgrl) > > ## Extract introns (i.e. N gaps) from the last ends. > Lgal <- last(x, invert.strand=TRUE) > Lgrl <- grglist(Lgal, order.as.in.query=TRUE) > Lintrons <- psetdiff(granges(Lgal), Lgrl) > > ## Combine the introns from first and last ends into a > ## single GRangesList object. > ans <- pcombine(Fintrons, Lintrons) > names(ans) <- names(x) > ans > } > > Try it: > > my_introns <- introns(galp) > stopifnot(identical(**elementLengths(my_introns), ngap(first(galp)) + > ngap(last(galp)))) > > > It would be nice to have >> another function that would return the inter-read gap. I actually have >> functions that do all these things, but they're not very elegant or >> robust to all of these situations. >> >> >> Such a method would also be nice >> on GappedAlignments. >> > ^^^^^^^^^^^^^^^^ > > >> >> This one is easier ('gal' being a GappedAlignments object): >> > ^^^^^^^^^^^^^^^^ > > >> psetdiff(granges(gal), grglist(gal)) >> >> Also I think you actually sent us a "gaps" method for GRangesList >> recently that we are about to add to the GenomicRanges package. >> Once we have it, the user will be able to do: >> >> gaps(grglist(gal)) >> >> as an equivalent way to do the above. >> >> >> Right, but as I said, we want to ignore the inter-read gap, which takes >> a bit more work. Right now, I do the above and then setdiff off the >> inter-read gap. >> > > The above is for a GappedAlignments object so there is no inter-read > gap. > > HTH, > H. > > >> Thanks, >> >> H. >> >> >> Thanks, >> Michael >> >> [[alternative HTML version deleted]] >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<http s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, On 07/10/2012 09:01 PM, Michael Lawrence wrote: > Looks great. Will it land in GenomicRanges? Also, a function for getting > the inter-read gap would be nice, even if trivial. Should not be hard. It will have to use 0-width ranges in the output when the first/last ends are not in a upstream/downstream configuration though. BTW, is "inter-read gap" the commonly used term for that gap? What would be a good name for this function? I've just added introns() for GappedAlignments and GappedAlignmentPairs to GenomicRanges 1.9.31. For my otlen() proposal (observed template length), I want to do some comparison with what Tom is putting in the TLEN field of the BAM files produced by gsnap. I have some big gsnap-generated BAM files with incredibly complicated cigars (lots of gaps and soft clipping), they'll be perfect for testing. If everything looks good, I'll add otlen() to GenomicRanges (will be renamed tlen()). H. > > Thanks, > Michael > > On Tue, Jul 10, 2012 at 4:22 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Michael, > > > On 07/10/2012 03:45 AM, Michael Lawrence wrote: > > > > On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: > > Hi Michael, > > > On 06/29/2012 05:06 AM, Michael Lawrence wrote: > > Hi (Herve), > > I have been playing around with GappedAlignmentPairs > and found > it pretty > useful. Some things that would be nice: > > 1) A method for getting the observed fragment length > (would need > to be NA > when pairs are on different chromosomes). This should > at least > have an > option (if not always) to exclude the N cigar runs from the > calculation. Or > it could just take it from the TLEN (isize) column in > the BAM > file (which > at least in the case of gsnap excludes the Ns). > > > From the SAM Spec: > > TLEN: signed observed Template LENgth. If all segments are > mapped to > the same reference, the unsigned observed template length > equals the > number of bases from the leftmost mapped base to the > rightmost mapped > base. [...] It is set as 0 for single-segment template or > when the > information is unavailable. > > 5 notes: > > (1) It seems that most of the time this information is not > available > in the BAM files produced by the aligners (at least > not in those > I've seen so far). > > (2) I wonder if the above definition really makes sense for a > paired-end read with the 2 ends aligned to the *same* > strand. > It's actually not clear to me how to interpret such a > situation > and why the aligner is producing this kind of paired > alignments. > Seems to reveal a template coming from a transcript > that mixes > exons from both strands. > > > > Yes, for non-concordant pairs, like ends that do not map to opposite > strands, or that map to different chromosomes or very far away > on the > same chromosome, this would not be easy to calculate. > Presumably, there > is some sort of genomic rearrangement or trans-splicing, and the > precise > break-point would need to be determined. > > (3) Taking "the number of bases from the leftmost mapped base > to the rightmost mapped base" is ok most of the times > but is > questionable in some situations e.g. when the first > segment > is not mapped upstream of the last segment. I'd rather set > TLEN to NA in that case too, like when the 2 segments > are not > mapped to the same reference, or when they are mapped > to the > same strand. > > > This makes sense to me. > > (4) The above is just a definition so it is what it is (and > is not > inherently correct or incorrect) but it seems to me that > excluding the N gaps would be closer to the intuitive > notion > of "observed template length". > > (5) The above definition ignores soft clipping. > > > Right. The definition is not that helpful, and Tom has already > decided > to violate it with gsnap, which counts the soft-clipped regions and > discounts the N regions when populating the TLEN field. So we're > currently happy just taking TLEN from the BAM file. > > So here is a proposal that ignores the N gaps but takes > into account > soft clipping. 'galp' is GappedAlignmentPairs object. Note > that we > don't need to deal with the situation where the 2 ends are not > mapped to the same reference, or are mapped to the same > strand, because > those pairs are discarded (with a warning) when the > GappedAlignmentPairs > object is made (usually with readGappedAlignmentPairs()): > > otlen <- function(galp) > { > lgal <- left(galp) > rgal <- right(galp) > lend <- end(lgal) > rstart <- start(rgal) > ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L > > ## Set to NA when the first and last ends are not in an > ## upstream/downstream configuration: > not_upstream_downstream <- rstart <= lend > ans[not_upstream_downstream] <- NA_integer_ > > ans > } > > If we want to be a little less conservative about the first and > last ends being in an upstream/downstream configuration, we > can replace the computation of 'not_upstream_downstream' with: > > lstart <- start(lgal) > rend <- end(rgal) > not_upstream_downstream <- rstart < lstart | rend < lend > > > > 2) A method called "introns" or something that returns > the N > regions as a > GRangesList, as a complement to grglist(). > > > The special gap between the 2 ends is a little bit problematic. > How should we deal with it given that (1) it has a different > meaning than the N gaps (i.e. it does not in general > correspond to an > intron, only by chance), and (2) it's not always here (e.g. > when > the first and last ends are not in an upstream/downstream > configuration)? Of course (2) could always be handled by > removing > the problematic alignment pairs. > > > Well, my request was to discard the inter-read gap, so that I > would just > have the splices as called by the aligner. > > > I see. Then this can be done with something like this: > > ## First we need an optimized mendoapply(c, x, y) for CompressedList > ## objects (probably worth adding to IRanges, could also be called > ## elementCombine() and take an arbitrary number of objects). > pcombine <- function(x, y) > { > if (!is(x, "CompressedList") || > class(x) != class(y) || > length(x) != length(y)) > stop("'x' and 'y' must be CompressedList objects ", > "of the same class and length") > tmp <- c(x, y)[GenomicRanges:::.__makePickupIndex(length(x))] > ans <- GenomicRanges:::.shrinkByHalf(__tmp) > elementMetadata(ans) <- NULL > ans > } > > ## Then the introns() function itself for GappedAlignmentPairs. > introns <- function(x) > { > if (!is(x, "GappedAlignmentPairs")) > stop("'x' must be a GappedAlignmentPairs object") > > ## Extract introns (i.e. N gaps) from the first ends. > Fgal <- first(x) > Fgrl <- grglist(Fgal, order.as.in.query=TRUE) > Fintrons <- psetdiff(granges(Fgal), Fgrl) > > ## Extract introns (i.e. N gaps) from the last ends. > Lgal <- last(x, invert.strand=TRUE) > Lgrl <- grglist(Lgal, order.as.in.query=TRUE) > Lintrons <- psetdiff(granges(Lgal), Lgrl) > > ## Combine the introns from first and last ends into a > ## single GRangesList object. > ans <- pcombine(Fintrons, Lintrons) > names(ans) <- names(x) > ans > } > > Try it: > > my_introns <- introns(galp) > stopifnot(identical(__elementLengths(my_introns), > ngap(first(galp)) + ngap(last(galp)))) > > > It would be nice to have > another function that would return the inter-read gap. I > actually have > functions that do all these things, but they're not very elegant or > robust to all of these situations. > > > Such a method would also be nice > on GappedAlignments. > > ^^^^^^^^^^^^^^^^ > > > > This one is easier ('gal' being a GappedAlignments object): > > ^^^^^^^^^^^^^^^^ > > > psetdiff(granges(gal), grglist(gal)) > > Also I think you actually sent us a "gaps" method for > GRangesList > recently that we are about to add to the GenomicRanges package. > Once we have it, the user will be able to do: > > gaps(grglist(gal)) > > as an equivalent way to do the above. > > > Right, but as I said, we want to ignore the inter-read gap, > which takes > a bit more work. Right now, I do the above and then setdiff off the > inter-read gap. > > > The above is for a GappedAlignments object so there is no inter- read > gap. > > HTH, > H. > > > Thanks, > > H. > > > Thanks, > Michael > > [[alternative HTML version deleted]] > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.7 years ago Hervé Pagès 16k

0

Entering edit mode

On Tue, Jul 10, 2012 at 11:57 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Michael, > > > On 07/10/2012 09:01 PM, Michael Lawrence wrote: > >> Looks great. Will it land in GenomicRanges? Also, a function for getting >> the inter-read gap would be nice, even if trivial. >> > > Should not be hard. It will have to use 0-width ranges in the output > when the first/last ends are not in a upstream/downstream configuration > though. That's OK, I've done the same. > BTW, is "inter-read gap" the commonly used term for that gap? > What would be a good name for this function? > > I am not aware of a better name but perhaps someone else is. > I've just added introns() for GappedAlignments and GappedAlignmentPairs > to GenomicRanges 1.9.31. > > For my otlen() proposal (observed template length), I want to do some > comparison with what Tom is putting in the TLEN field of the BAM files > produced by gsnap. I have some big gsnap-generated BAM files with > incredibly complicated cigars (lots of gaps and soft clipping), they'll > be perfect for testing. If everything looks good, I'll add otlen() > to GenomicRanges (will be renamed tlen()). > > Sounds good. I'm sure you'll find some inconsistencies in those BAMs if you dig deep. I've also got a function that takes a many-to-one matching from reads to transcripts and calculates the tlen, including removing annotated introns that fall into the inter-read gap. It's useful for me. Might be useful in general. We can add that later, perhaps as part of the overlap functionality Val is working on. Thanks a lot for adding this functionality, Michael H. > > >> Thanks, >> Michael >> >> On Tue, Jul 10, 2012 at 4:22 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> Hi Michael, >> >> >> On 07/10/2012 03:45 AM, Michael Lawrence wrote: >> >> >> >> On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> wrote: >> >> Hi Michael, >> >> >> On 06/29/2012 05:06 AM, Michael Lawrence wrote: >> >> Hi (Herve), >> >> I have been playing around with GappedAlignmentPairs >> and found >> it pretty >> useful. Some things that would be nice: >> >> 1) A method for getting the observed fragment length >> (would need >> to be NA >> when pairs are on different chromosomes). This should >> at least >> have an >> option (if not always) to exclude the N cigar runs from >> the >> calculation. Or >> it could just take it from the TLEN (isize) column in >> the BAM >> file (which >> at least in the case of gsnap excludes the Ns). >> >> >> From the SAM Spec: >> >> TLEN: signed observed Template LENgth. If all segments are >> mapped to >> the same reference, the unsigned observed template length >> equals the >> number of bases from the leftmost mapped base to the >> rightmost mapped >> base. [...] It is set as 0 for single-segment template or >> when the >> information is unavailable. >> >> 5 notes: >> >> (1) It seems that most of the time this information is not >> available >> in the BAM files produced by the aligners (at least >> not in those >> I've seen so far). >> >> (2) I wonder if the above definition really makes sense for a >> paired-end read with the 2 ends aligned to the *same* >> strand. >> It's actually not clear to me how to interpret such a >> situation >> and why the aligner is producing this kind of paired >> alignments. >> Seems to reveal a template coming from a transcript >> that mixes >> exons from both strands. >> >> >> >> Yes, for non-concordant pairs, like ends that do not map to >> opposite >> strands, or that map to different chromosomes or very far away >> on the >> same chromosome, this would not be easy to calculate. >> Presumably, there >> is some sort of genomic rearrangement or trans-splicing, and the >> precise >> break-point would need to be determined. >> >> (3) Taking "the number of bases from the leftmost mapped base >> to the rightmost mapped base" is ok most of the times >> but is >> questionable in some situations e.g. when the first >> segment >> is not mapped upstream of the last segment. I'd rather >> set >> TLEN to NA in that case too, like when the 2 segments >> are not >> mapped to the same reference, or when they are mapped >> to the >> same strand. >> >> >> This makes sense to me. >> >> (4) The above is just a definition so it is what it is (and >> is not >> inherently correct or incorrect) but it seems to me that >> excluding the N gaps would be closer to the intuitive >> notion >> of "observed template length". >> >> (5) The above definition ignores soft clipping. >> >> >> Right. The definition is not that helpful, and Tom has already >> decided >> to violate it with gsnap, which counts the soft-clipped regions >> and >> discounts the N regions when populating the TLEN field. So we're >> currently happy just taking TLEN from the BAM file. >> >> So here is a proposal that ignores the N gaps but takes >> into account >> soft clipping. 'galp' is GappedAlignmentPairs object. Note >> that we >> don't need to deal with the situation where the 2 ends are >> not >> mapped to the same reference, or are mapped to the same >> strand, because >> those pairs are discarded (with a warning) when the >> GappedAlignmentPairs >> object is made (usually with readGappedAlignmentPairs()): >> >> otlen <- function(galp) >> { >> lgal <- left(galp) >> rgal <- right(galp) >> lend <- end(lgal) >> rstart <- start(rgal) >> ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L >> >> ## Set to NA when the first and last ends are not in an >> ## upstream/downstream configuration: >> not_upstream_downstream <- rstart <= lend >> ans[not_upstream_downstream] <- NA_integer_ >> >> ans >> } >> >> If we want to be a little less conservative about the first >> and >> last ends being in an upstream/downstream configuration, we >> can replace the computation of 'not_upstream_downstream' >> with: >> >> lstart <- start(lgal) >> rend <- end(rgal) >> not_upstream_downstream <- rstart < lstart | rend < lend >> >> >> >> 2) A method called "introns" or something that returns >> the N >> regions as a >> GRangesList, as a complement to grglist(). >> >> >> The special gap between the 2 ends is a little bit >> problematic. >> How should we deal with it given that (1) it has a different >> meaning than the N gaps (i.e. it does not in general >> correspond to an >> intron, only by chance), and (2) it's not always here (e.g. >> when >> the first and last ends are not in an upstream/downstream >> configuration)? Of course (2) could always be handled by >> removing >> the problematic alignment pairs. >> >> >> Well, my request was to discard the inter-read gap, so that I >> would just >> have the splices as called by the aligner. >> >> >> I see. Then this can be done with something like this: >> >> ## First we need an optimized mendoapply(c, x, y) for CompressedList >> ## objects (probably worth adding to IRanges, could also be called >> ## elementCombine() and take an arbitrary number of objects). >> pcombine <- function(x, y) >> { >> if (!is(x, "CompressedList") || >> class(x) != class(y) || >> length(x) != length(y)) >> stop("'x' and 'y' must be CompressedList objects ", >> "of the same class and length") >> tmp <- c(x, y)[GenomicRanges:::.__**makePickupIndex(length(x))] >> ans <- GenomicRanges:::.shrinkByHalf(**__tmp) >> >> elementMetadata(ans) <- NULL >> ans >> } >> >> ## Then the introns() function itself for GappedAlignmentPairs. >> introns <- function(x) >> { >> if (!is(x, "GappedAlignmentPairs")) >> stop("'x' must be a GappedAlignmentPairs object") >> >> ## Extract introns (i.e. N gaps) from the first ends. >> Fgal <- first(x) >> Fgrl <- grglist(Fgal, order.as.in.query=TRUE) >> Fintrons <- psetdiff(granges(Fgal), Fgrl) >> >> ## Extract introns (i.e. N gaps) from the last ends. >> Lgal <- last(x, invert.strand=TRUE) >> Lgrl <- grglist(Lgal, order.as.in.query=TRUE) >> Lintrons <- psetdiff(granges(Lgal), Lgrl) >> >> ## Combine the introns from first and last ends into a >> ## single GRangesList object. >> ans <- pcombine(Fintrons, Lintrons) >> names(ans) <- names(x) >> ans >> } >> >> Try it: >> >> my_introns <- introns(galp) >> stopifnot(identical(__**elementLengths(my_introns), >> >> ngap(first(galp)) + ngap(last(galp)))) >> >> >> It would be nice to have >> another function that would return the inter-read gap. I >> actually have >> functions that do all these things, but they're not very elegant >> or >> robust to all of these situations. >> >> >> Such a method would also be nice >> on GappedAlignments. >> >> ^^^^^^^^^^^^^^^^ >> >> >> >> This one is easier ('gal' being a GappedAlignments object): >> >> ^^^^^^^^^^^^^^^^ >> >> >> psetdiff(granges(gal), grglist(gal)) >> >> Also I think you actually sent us a "gaps" method for >> GRangesList >> recently that we are about to add to the GenomicRanges >> package. >> Once we have it, the user will be able to do: >> >> gaps(grglist(gal)) >> >> as an equivalent way to do the above. >> >> >> Right, but as I said, we want to ignore the inter-read gap, >> which takes >> a bit more work. Right now, I do the above and then setdiff off >> the >> inter-read gap. >> >> >> The above is for a GappedAlignments object so there is no inter-read >> gap. >> >> HTH, >> H. >> >> >> Thanks, >> >> H. >> >> >> Thanks, >> Michael >> >> [[alternative HTML version deleted]] >> >> ______________________________**_____________________ >> >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org=""> >> >> >> https://stat.ethz.ch/mailman/_**___listinfo/bioconductor<ht tps:="" stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<htt="" ps:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> >> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" biocond="" uctor<https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >> >> Search the archives: >> http://news.gmane.org/gmane.__**__science.biology.informatics.** >> ____conductor<http: news.gmane.org="" gmane.____science.biology.infor="" matics.____conductor=""> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> > >> >> >> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Michael Lawrence ★ 11k

Login before adding your answer.