Hi,
I have an application where I have mapped reads to two different
reference sequences and I have these two mappings as bam files. I now
want to see for each read whether it mapped to one, both or neither of
the two references (and where the mapping is to both see if one was
better than the other). All my reading of the samtools docs suggested
that there is no facility for indexing by read name as opposed to
genomic coordinates, so the first way that occured to me to do this
was
to sort each bam by read name (which is possible using samtools) and
then step through the two bams in parallel checking the alignment
information for each read. There are ~120M reads, so reading the two
bams into memory all in one go is not practical, but from what I can
see
Rsamtools doesn't have functions for reading bams either line by line
or
by chunks of lines. All the ScanBamParam arguments seem to refer to
reading chunks by genomic ranges - this isn't appropriate for my
application as the genomic ranges between the two references are not
necessarily comparable.
So my question is really two-fold:
1) Am I right in my reading of the Rsamtools docs? Are there any
lower level functions for reading bams I am ignoring that could do
what
I want?
2) My thought now is to just make sams from the sorted bams and
step
through using readLines() and my own sam parsing code. Is there an
alternative strategy I have overlooked?
--
Alex Gutteridge
On 07/07/2012 10:10 AM, Alex Gutteridge wrote:
> Hi,
>
> I have an application where I have mapped reads to two different
> reference sequences and I have these two mappings as bam files. I
now
> want to see for each read whether it mapped to one, both or neither
of
> the two references (and where the mapping is to both see if one was
> better than the other). All my reading of the samtools docs
suggested
> that there is no facility for indexing by read name as opposed to
> genomic coordinates, so the first way that occured to me to do this
was
> to sort each bam by read name (which is possible using samtools) and
> then step through the two bams in parallel checking the alignment
> information for each read. There are ~120M reads, so reading the two
> bams into memory all in one go is not practical, but from what I can
see
> Rsamtools doesn't have functions for reading bams either line by
line or
> by chunks of lines. All the ScanBamParam arguments seem to refer to
see the development version and the yieldSize argument to BamFile.
Martin
> reading chunks by genomic ranges - this isn't appropriate for my
> application as the genomic ranges between the two references are not
> necessarily comparable.
>
> So my question is really two-fold:
>
> 1) Am I right in my reading of the Rsamtools docs? Are there any
lower
> level functions for reading bams I am ignoring that could do what I
want?
> 2) My thought now is to just make sams from the sorted bams and step
> through using readLines() and my own sam parsing code. Is there an
> alternative strategy I have overlooked?
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793