Question

Rsamtools: Read a bam line by line

0

Entering edit mode

Alex Gutteridge ▴ 650

@alex-gutteridge-2935

Last seen 10.2 years ago

United States

Hi, I have an application where I have mapped reads to two different reference sequences and I have these two mappings as bam files. I now want to see for each read whether it mapped to one, both or neither of the two references (and where the mapping is to both see if one was better than the other). All my reading of the samtools docs suggested that there is no facility for indexing by read name as opposed to genomic coordinates, so the first way that occured to me to do this was to sort each bam by read name (which is possible using samtools) and then step through the two bams in parallel checking the alignment information for each read. There are ~120M reads, so reading the two bams into memory all in one go is not practical, but from what I can see Rsamtools doesn't have functions for reading bams either line by line or by chunks of lines. All the ScanBamParam arguments seem to refer to reading chunks by genomic ranges - this isn't appropriate for my application as the genomic ranges between the two references are not necessarily comparable. So my question is really two-fold: 1) Am I right in my reading of the Rsamtools docs? Are there any lower level functions for reading bams I am ignoring that could do what I want? 2) My thought now is to just make sams from the sorted bams and step through using readLines() and my own sam parsing code. Is there an alternative strategy I have overlooked? -- Alex Gutteridge

GO Rsamtools GO Rsamtools • 1.2k views

ADD COMMENT • link updated 12.4 years ago by Martin Morgan 25k • written 12.4 years ago by Alex Gutteridge ▴ 650

score 0 · Answer 1 · 2012-07-07

On 07/07/2012 10:10 AM, Alex Gutteridge wrote: > Hi, > > I have an application where I have mapped reads to two different > reference sequences and I have these two mappings as bam files. I now > want to see for each read whether it mapped to one, both or neither of > the two references (and where the mapping is to both see if one was > better than the other). All my reading of the samtools docs suggested > that there is no facility for indexing by read name as opposed to > genomic coordinates, so the first way that occured to me to do this was > to sort each bam by read name (which is possible using samtools) and > then step through the two bams in parallel checking the alignment > information for each read. There are ~120M reads, so reading the two > bams into memory all in one go is not practical, but from what I can see > Rsamtools doesn't have functions for reading bams either line by line or > by chunks of lines. All the ScanBamParam arguments seem to refer to see the development version and the yieldSize argument to BamFile. Martin > reading chunks by genomic ranges - this isn't appropriate for my > application as the genomic ranges between the two references are not > necessarily comparable. > > So my question is really two-fold: > > 1) Am I right in my reading of the Rsamtools docs? Are there any lower > level functions for reading bams I am ignoring that could do what I want? > 2) My thought now is to just make sams from the sorted bams and step > through using readLines() and my own sam parsing code. Is there an > alternative strategy I have overlooked? > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793