I'm trying to figure out how to use Bioconductor for the following use case. I want to parse a subset of aligned reads from a BAM file, optionally trim the alignments, and then access the (possibly trimmed) query sequences. The first two operations--reading from the BAM file and trimming the alignments--can be accomplished easily with GAlignments. However, my understanding is that query sequences are discarded when the GAlignments are populated from the BAM file.
My first question is whether you have contemplated adding an optional slot to hold the query sequences within a GAlignments object. I understand that storing these sequences is expensive, but being able to access them directly via GAlignments would be extremely helpful for many studies, including those of genetic variation and allele-specific expression.
If that's not on the horizon, then would you be willing to please suggest a possible workflow to accomplish my goal? The best that I can come up with is to load GAligments including the query names, identify the set of reads of interest, load the query sequences via Rsamtools, and then match the GAlignments to the corresponding query sequence by using the query names. This is cumbersome and probably slow, but seems like it might work.
Thank you for the quick and very helpful response. I apologize for overlooking this feature of readGAlignments; it's exactly what I was looking for.
I noticed that functions like qnarrow() don't affect the sequence when loaded as metadata (for understandable reasons). Is there a recommended programmatic way to modify metadata that corresponds to strings of the same length as the reads (e.g., query sequence, quality scores, etc.), or is it reasonable to just do it by hand by taking substrings and then setting the metadata accordingly?
(To clarify, I am currently using narrow() on the DNAStringSet, which works fine; I'm just wondering if there is a more appropriate method that I should use to manipulate the metadata of a GAlignments object when the metadata is a string of length equal to the width of the read.)
Yes qnarrow() propagates the metadata columns as-is. More generally speaking, vectorized transformations (i.e. transformations that return an object parallel to the input) either propagate the metadata columns as-is to the result, or drop them, or sometimes add new metadata columns. But they generally don't alter the metadata columns that are propagated to the result. So yes, you need to adjust the sequences in the metadata columns of your trimmed GAlignments object by hand. I confirm that qnarrow'ing the alignments translates into narrow'ing the query sequences (SEQ field) or the quality strings (QUAL field). So in your case:
Thank you. This is perfect.
On 10/21/15 11:39 AM, Hervé Pagès [bioc] wrote:
> Activity on a post you are following on support.bioconductor.org
> <https: support.bioconductor.org="">
>
> User Hervé Pagès <https: support.bioconductor.org="" u="" 1542=""/> wrote
> Comment: GAlignments and the query sequence (SEQ) field of BAM files
> <https: support.bioconductor.org="" p="" 73618="" #73649="">:
>
> Yes |qnarrow()| propagates the metadata columns as-is. More generally
> speaking, vectorized transformations (i.e. transformations that return
> an object /parallel/ to the input) either propagate the metadata columns
> as-is to the result, or drop them, or sometimes add new metadata
> columns. But they generally don't alter the metadata columns that are
> propagated to the result. So yes, you need to adjust the sequences in
> the metadata columns of your trimmed GAlignments object by hand. I
> confirm that |qnarrow|'ing the alignments translates into |narrow|'ing
> the query sequences (SEQ field) or the quality strings (QUAL field). So
> in your case:
>
> param <- ScanBamParam(what=c("seq", "qual"))
> gal <- readGAlignments(file, use.names=TRUE, param=param)
> gal2 <- qnarrow(gal, ...)
> mcols(gal2)$seq <- narrow(mcols(gal)$seq, ...)
> mcols(gal2)$qual <- narrow(mcols(gal)$qual, ...)
>
> I'll add an example like that to |?`qnarrow,GAlignments-method`|.
>
> Hope this helps,
>
> H.
>
> ------------------------------------------------------------------------
>
> Post tags: [galignments], bam
>
> You may reply via email or visit
> C: GAlignments and the query sequence (SEQ) field of BAM files
>
Like with scanBam(), you can load any BAM field and/or tag with readGAlignments(). They will end up as metadata columns on the returned GAlignments object. For example to load the query sequences:
Hi Hervé,
Thank you for the quick and very helpful response. I apologize for overlooking this feature of readGAlignments; it's exactly what I was looking for.
I noticed that functions like qnarrow() don't affect the sequence when loaded as metadata (for understandable reasons). Is there a recommended programmatic way to modify metadata that corresponds to strings of the same length as the reads (e.g., query sequence, quality scores, etc.), or is it reasonable to just do it by hand by taking substrings and then setting the metadata accordingly?
Thank you,
Rob
(To clarify, I am currently using narrow() on the DNAStringSet, which works fine; I'm just wondering if there is a more appropriate method that I should use to manipulate the metadata of a GAlignments object when the metadata is a string of length equal to the width of the read.)
Yes
qnarrow()
propagates the metadata columns as-is. More generally speaking, vectorized transformations (i.e. transformations that return an object parallel to the input) either propagate the metadata columns as-is to the result, or drop them, or sometimes add new metadata columns. But they generally don't alter the metadata columns that are propagated to the result. So yes, you need to adjust the sequences in the metadata columns of your trimmed GAlignments object by hand. I confirm thatqnarrow
'ing the alignments translates intonarrow
'ing the query sequences (SEQ field) or the quality strings (QUAL field). So in your case:I'll add an example like that to
?`qnarrow,GAlignments-method`
.Hope this helps,
H.