On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
> Hello,
>
> I would like to manipulate the start and end positions of my reads
before
> calling summarizeOverlaps. One way to do this is to convert my reads
to a
> GRanges and then use flank, narrow, etc. to properly position the
read ends
> where I want them. However, I don't see a method for
summarizeOverlaps that
> accepts a GRanges object or bed file or similar for the reads. Is
there such a
> method, and if not, would it be possible to add it?
>
> The specific application I have in mind is single-end ChIP-Seq
reads, where we
> have a good idea of what the fragment length is and would like to
extend the
> reads to this length. Alternately, it may be preferable to count
only the
> 5-prime ends of the read, and this could be done by narrowing to 1
bp length.
if bam is 'bed file or similar' then... summarizeOverlaps can take an
arbitrary
function as it's 'mode' argument, as documented on ?summarizeOverlaps
## The 'mode' argument can take a custom count function whose
## arguments are the same as those in the current counting modes
## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
## In this example records are filtered by map quality before
counting.
mapq_filter <- function(features, reads, ignore.strand,
inter.feature) {
require(GenomicAlignments) # needed for parallel evaluation
Union(features, reads[mcols(reads)$mapq >= 20],
ignore.strand=ignore.strand,
inter.feature=inter.feature)
}
genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200),
width=100))
param <- ScanBamParam(what="mapq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
se <- summarizeOverlaps(genes, fl, mode=mapq_filter,
param=param)
assays(se)$counts
## The count function can be completely custom (i.e., not use
the
## pre-defined count functions at all). Requirements are that
## the input arguments match the pre-defined modes and the
output
## is a vector of counts the same length as 'features'.
my_count <- function(features, reads, ignore.strand,
inter.feature) {
## perform filtering, or subsetting etc.
require(GenomicAlignments) # needed for parallel evaluation
countOverlaps(features, reads)
}
the 'reads' argument is a GAlignments / GAlignmentsList (for single /
paired
reads) so you could do whatever you'd like to them, so long as your
function
returns a vector of counts equal to length(features)
ryans_count = function(features, reads, ...) {
reads = as(reads, "GRanges") ## equivalently (??)
granges(reads)
width(reads) = 147
countOverlaps(features, reads)
}
I think you're also interested in width of overlaps, so you could
implement that
in ryans_count, too, e.g., via
http://stackoverflow.com/questions/14685802/width-of-the-overlapped-
segment-in-genomicranges-package
or
https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html
and other parts of that thread, include the late continuation
https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html
(probably need to think through carefully what these are doing).
summarizeOverlaps will iterate through a bam file in chunks, so
moderating
memory use (your function will be called several times).
If you have several bam files and a linux / Mac, load the BiocParallel
package
and it'll distribute one bam file to each processor. If memory becomes
a problem
use BamFileList(your_files, yieldSize=500000) or similar to reduce the
size of
each chunk (the default yield size is I think 1,000,000).
A newer approach is to build a tool of your own using the bed / wig
reader
functions in rtracklayer with GenomicRanges::tileGenome to create
large-ish
chunks of file to process, followed by, e.g.,
GenomicFiles::reduceByFile to MAP
from the file to count in each tile, and REDUCE to summarize counts
within a
file; the vignette outlines this approach for several case studies
(though not
for counting reads overlapping ranges -- I'm sure a worked example
using, e.g.,
http://bioconductor.org/packages/release/data/experiment/html/RNAseqDa
ta.HNRNPC.bam.chr14.html
would be appreciated).
Martin
>
> -Ryan Thompson
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793