Question

Perform a pileup on the last nucleotides of each read with variable length reads

1

Entering edit mode

Sam Buckberry • 0

@sam-buckberry-8071

Last seen 9.4 years ago

Australia

I'm performing an analysis of MRE-Seq (methylation-sensitive restriction enzyme digest + sequencing) data and would like to perform a Pileup count of the first 3 nucleotides (left or 5') of each mapped fastq read. The fastq reads all start with 'CGG', as this is the site where the enzyme(s) cut. By performing a pileup of the first 3 nucleotides, we can get a measure of methylation at this site.

For the reads mapped to the forward strand (+), this is relatively simple as described here C: Comparing read sequence back to reference sequence.

p_param <- PileupParam(cycle_bins=seq(0, 3))

sbp <- ScanBamParam(flag=scanBamFlag(isMinusStrand=FALSE))

res <- pileup(file=fl, pileupParam=p_param, scanBamParam=sbp)

But for reads mapping to the reverse (-) strand, the 'start' of alignment in the bam file is the 3' (right) end of the fastq record (and the reads are of variable length. Therefore I need to perform a pileup of the last 3 nucleotides of each read aligning to the reverse strand, such as:

p_param <- PileupParam(cycle_bins=seq(Inf-3, Inf))

sbp <- ScanBamParam(flag=scanBamFlag(isMinusStrand=TRUE))
res <- pileup(file=fl, pileupParam=p_param, scanBamParam=sbp)

I'm aware that the Inf-3 will not work, but is there a way to pileup at end of the alignment for reads mapping to the reverse strand?

pileup rsamtools bam genomicalignments • 2.8k views

ADD COMMENT • link updated 9.7 years ago by Nathaniel Hayden ▴ 180 • written 9.8 years ago by Sam Buckberry • 0

score 2 · Answer 1 · 2015-06-12

Update 2: I have since updated pileup to allow specifying bins in a strand-sensitive way (`query_bins`) to respect the actual order of cycles as they came off the sequencer (because reads aligned to the minus strand are "reversed") and allow specifying bins in a strand-insensitive way (`left_bins`) to preserve the original behavior if desired. So the behavior requested in the original question is available via the `query_bins` parameter. See updated documentation for details.

Update: since I posted this answer I have added support for specifying bins from the end of reads. See A: Perform a pileup on the last nucleotides of each read with variable length reads

Hi, Sam. I have two approximations to float immediately while I look into feasibility of specifying cycle bins in reverse. I'll get back to you on the latter. Please let me know if the approximations work for you.

Note also the distinction between aligned width and the width generated by the sequencer. If most of your reads' CIGARs are M ops then no worries.

Option 1: filter into separate BAM files based on read length, perform pileup on each (and specify bins as if fixed width reads as over at C: Comparing read sequence back to reference sequence ), then combine the results.This makes sense if you're interested in the value-added parts of pileup; if you're just interested in, say, raw nucleotide counts and quality scores, see other option. If the set of read widths is small and the number of BAM files you're working with is small, this might not drive you crazy; you can also merge BAM files first to start from a smaller n.

Look at ?filterBam. You can pass arbitrary functions to filter at the granularity of each record of a BAM file, so you could make a separate BAM file for each read width and operate on each separately. Definitely check out the examples section of the filterBam man page so you can see it in action. And you'll see the example filters on the seq width!

Option 2: If you're fine with raw nucleotide counts, etc. Use GenomicAlignments::readGAlignments and specify a what argument of, e.g., "seq" in the ScanBamParam you provide for the param arg in readGAlignments. This will add fields of the reads to the metadata columns of the returned GAlignments object. See ?scanBam for the fields that are available. GAlignments is what will be returned from readGAlignments, and the seq field will be a DNAStringSet. Example:

library(GenomicAlignments)
sbp <- ScanBamParam(what="seq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
gal <- readGAlignments(fl, param=sbp)

Let me know if you need any clarification, and please share what works for you if you don't mind.

Nathaniel Hayden · Answer 2 · 2015-06-15

0

Entering edit mode

Sam Buckberry • 0

@sam-buckberry-8071

Last seen 9.4 years ago

Australia

Hi Nathaniel,

Thanks for your response. Option 1 is an interesting take on the problem, but I have 100's of bam files to process, so it may not be practical. Plus I only really need the counts at this stage. However, I have quickly adapted Option 2 and it appears to be sufficient so far.

library(GenomicAlignments)
sbp <- ScanBamParam(what="seq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
gal <- readGAlignments(fl, param=sbp)

# Get the read start site with respect to strand
readStart <- ifelse(test=strand(gal) == "-", yes=end(gal), no=start(gal))
locus <- paste(rname(gal), readStart, sep=":")

# count the number of reads starting at each locus
counts <- data.frame(table(locus))

ADD COMMENT • link updated 9.7 years ago by Nathaniel Hayden ▴ 180 • written 9.7 years ago by Sam Buckberry • 0

1

Entering edit mode

Hi Sam,

In case you're after the coverage of the first 3 nucleotides of your FASTQ reads (i.e. counting nucleotides from the left with respect to the FASTQ file), you could use a combination of qnarrow() and coverage():

## qnarrow() can be used to trim the reads before alignment:
first3_gal <- qnarrow(gal, start=ifelse(strand(gal) == "+", 1, -3),
                           end  =ifelse(strand(gal) == "+", 3, -1))

Note that the difference between qnarrow() and narrow() is that the latter trims the reads after alignment. This could make a difference if, say, some alignments have indels within the first 3 nucleotides (or last 3 if on the minus strand). If your alignments are simple (i.e. your CIGARs contain only M's), then qnarrow() and narrow() will produce exactly the same result.

Then:

coverage(first3_gal)

The result is a named RleList object with 1 coverage vector per chromosome.

H.

ADD REPLY • link 9.7 years ago Hervé Pagès 16k

score 0 · Answer 3 · 2015-06-20

Since my original answer (wherein I suggested two workarounds: A: Perform a pileup on the last nucleotides of each read with variable length reads ) I have added support for reverse bins to pileup. The convention is to use -1 to indicate last nucleotide, -2 second-to-last, etc. I updated the man page and added a couple examples. To target your case of grabbing the last three nucleotides for reads that align to the minus strand:

pp <- PileupParam(cycle_bins=c(-4, -1))
sbp <- ScanBamParam(flag=scanBamFlag(isMinusStrand=FALSE))
res <- pileup(file=fl, pileupParam=pp, scanBamParam=sbp)

Note the -4 because of how cut derives bin boundaries: (-4,-1].

This should be available in devel version of Rsamtools once the nightly builds run. I appreciate feedback.