stackStringsFromBam equivalent for GAlignments with seq
3
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…

hi there,

It's been a while since I posted - hi all! I miss you bioc people here at the Hutch.

I am wondering whether a function exists that works like stackStringsFromBam, but operates on a GAlignments object (that contains a "seq" column) rather than a bam file. I'm asking because I want to do this series of operations (a) read in a bam file as GAlignments (has thousands of reads), e.g. aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq")) (b) do some filtering on those alignments (now have fewer thousands of reads) aln_filt <- aln[myFilters] (c) get an alignment of reads overlapping a region of interest (~10 reads in each region)

If I want to skip the filtering step, I can get the alignment like this:
myAln <- stackStringsFromBam(bamfile, param=myRegionAsGRanges, use.names=TRUE) but that operates on a bam rather than GAlignments object.

It'd be really handy for me there was an analagous function like stackStringsFromGAlignments - would that be easy to implement? (or does it already exist?)

I think I can see a path towards getting what I need using the sequenceLayer function but that path seems more complicated than it should be. I can see other potential paths too that will extract sequence chunks given a reference sequence, but they lose the read names, which I want to use in later processing steps.

Am I missing something? It seems like there's probably a function out there that'll do what I want but I can't figure out what it is.

thanks very much!

Janet

genomicalignments stackStringsFromBam sequenceLayer • 1.7k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi Janet,

Long time no see. Welcome back!

I added stackStringsFromGAlignments() to GenomicAlignments 1.21.4 (devel). See commit here.

GenomicAlignments 1.21.4 should propagate to BioC 3.10 (devel) in the next 24-48h.

Best,

H.

ADD COMMENT
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…

Wow - so quick! thanks, Herve - really appreciate it.

It's been a while since I've used devel (things are so stable these days), so I'll have to remind myself how to run that, but I'll try this out ASAP.

Janet

ADD COMMENT
0
Entering edit mode

(deleted comment - meant to submit it as a reply)

ADD REPLY
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…

Tried stackStringsFromGAlignments - looks great! Thanks again, Herve.

On a quick test it works really well. I did a sanity check with test alignment/region, and I noticed something that I'm sure I can ignore, but wondered if you might care about it for consistency.

I noticed on a test region that the objects returned by stackStringsFromBam and stackStringsFromGAlignments are not fully identical. The parts I care about (seq and names) are identical, which is great, but there's an mcols element that's not. I don't really understand what that mcols object is, but I don't think I need to - seems like an internal thing for DNAStringSets that won't affect me (?).

Anyway, a code example will be the easiest to explain. First, I get an alignment either from bam file or from GAlignments object:

bamfile <- BamFile("myfile.sorted.bam")
aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq"))
myAln <- stackStringsFromBam(bamfile, param=temp[35], use.names=TRUE)
myAln2 <- stackStringsFromGAlignments(aln, region=temp[35])

I'm happy that sequence and names are identical:

   identical(seq(myAln),seq(myAln2)) # TRUE
   identical(names(myAln),names(myAln2)) # TRUE

but the overall DNAStringSet object is not

identical(myAln,myAln2)   # FALSE

and that's because of the mcols:

mcols(myAln)
DataFrame with 12 rows and 0 columns

mcols(myAln2)
DataFrame with 12 rows and 1 column
                                        seq
                             <DNAStringSet>
10/8454466//0_2661  AAAGCTTTTC...TCGATGAGCG

etc

other than mcols, the two alignments are identical:

mcols(myAln) <- mcols(myAln2)
identical(myAln,myAln2) # TRUE

thanks again!

Janet

ADD COMMENT
0
Entering edit mode

Right. This is expected. For myAln you din't explicitly request the seq metadata column. If you explicitly request it, e.g. with:

param <- ScanBamParam(which=temp[35], what="seq")
myAln <- stackStringsFromBam(bamfile, param=param, use.names=TRUE)

then myAln and myAln2 should have the same metadata columns.

An additional note: please keep in mind that identical() is not reliable on objects that use external pointers internally, like XStringSet objects. It can produce some embarrassing false positives:

x <- DNAStringSet(c("AA", "CCC"))
y <- DNAStringSet(c("GG", "TTT"))
identical(x, y)
# [1] TRUE

A more reliable way to check that 2 XStringSet objects contain the same sequences is to do identical(as.character(x), as.character(y)). However note that this does not compare the class or the metadata columns of the 2 objects.

H.

ADD REPLY
0
Entering edit mode

Also I'm not so sure that your test identical(seq(myAln),seq(myAln2)) is doing what you have in mind. seq() is a base function that only works by chance on XStringSet objects but doesn't do anything interesting on them. It seems to be doing something like seq_along() on them.

ADD REPLY
0
Entering edit mode

And I realize now that the show method for XStringSet objects is probably misleading here, by suggesting the existence of a seq() getter for these objects. Hmm... not good. I wonder how many users got misled by this. I need to do something about this.

ADD REPLY
0
Entering edit mode

thanks again. that makes sense (the mcols thing).

I'd better steer clear of identical! that's unpredictable and a little scary.

and yes, I'd made a wrong assumption about 'seq' - not sure where that came into my head from that it might be a function to get the seqs. It could have been the 'show' output, or perhaps some other object type where there's a seq getter (?? I can't think what though). Would 'seqs' be a better column header in the show method, to reduce the chance of misleading users?

ADD REPLY
0
Entering edit mode

Not sure what's the best way to deal with this. I just opened an issue on GitHub about this so we keep track of the problem.

H.

ADD REPLY

Login before adding your answer.

Traffic: 857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6