summarizeOverlaps using GRanges or bed file as reads?
2
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 26 days ago
Icahn School of Medicine at Mount Sinai…
Hello, I would like to manipulate the start and end positions of my reads before calling summarizeOverlaps. One way to do this is to convert my reads to a GRanges and then use flank, narrow, etc. to properly position the read ends where I want them. However, I don't see a method for summarizeOverlaps that accepts a GRanges object or bed file or similar for the reads. Is there such a method, and if not, would it be possible to add it? The specific application I have in mind is single-end ChIP-Seq reads, where we have a good idea of what the fragment length is and would like to extend the reads to this length. Alternately, it may be preferable to count only the 5-prime ends of the read, and this could be done by narrowing to 1 bp length. -Ryan Thompson
convert convert • 2.8k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
Hi Ryan, It looks like the "summarizeOverlaps" methods for GAlignments objects can also be used on reads stored in a GRanges object: library(GenomicAlignments) example(summarizeOverlaps) features <- gr reads <- reads Then: > class(features) [1] "GRanges" attr(,"package") [1] "GenomicRanges" > class(reads) [1] "GAlignments" attr(,"package") [1] "GenomicAlignments" > reads <- as(reads, "GRanges") > selectMethod("summarizeOverlaps", c("GRanges", "GAlignments"))(features, reads) class: SummarizedExperiment dim: 11 1 exptData(0): assays(1): counts rownames(11): A B ... H1 H2 rowData metadata column names(0): colnames(1): reads colData names(2): object records Of course, that doesn't mean summarizeOverlaps() shouldn't work out-of-the-box on reads passed in a GRanges object. Note that in its current implementation, the above method works on any object 'reads' for which 'findOverlaps(features, reads)' works. Cheers, H. On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: > Hello, > > I would like to manipulate the start and end positions of my reads > before calling summarizeOverlaps. One way to do this is to convert my > reads to a GRanges and then use flank, narrow, etc. to properly position > the read ends where I want them. However, I don't see a method for > summarizeOverlaps that accepts a GRanges object or bed file or similar > for the reads. Is there such a method, and if not, would it be possible > to add it? > > The specific application I have in mind is single-end ChIP-Seq reads, > where we have a good idea of what the fragment length is and would like > to extend the reads to this length. Alternately, it may be preferable to > count only the 5-prime ends of the read, and this could be done by > narrowing to 1 bp length. > > -Ryan Thompson > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 3 months ago
United States
On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: > Hello, > > I would like to manipulate the start and end positions of my reads before > calling summarizeOverlaps. One way to do this is to convert my reads to a > GRanges and then use flank, narrow, etc. to properly position the read ends > where I want them. However, I don't see a method for summarizeOverlaps that > accepts a GRanges object or bed file or similar for the reads. Is there such a > method, and if not, would it be possible to add it? > > The specific application I have in mind is single-end ChIP-Seq reads, where we > have a good idea of what the fragment length is and would like to extend the > reads to this length. Alternately, it may be preferable to count only the > 5-prime ends of the read, and this could be done by narrowing to 1 bp length. if bam is 'bed file or similar' then... summarizeOverlaps can take an arbitrary function as it's 'mode' argument, as documented on ?summarizeOverlaps ## The 'mode' argument can take a custom count function whose ## arguments are the same as those in the current counting modes ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). ## In this example records are filtered by map quality before counting. mapq_filter <- function(features, reads, ignore.strand, inter.feature) { require(GenomicAlignments) # needed for parallel evaluation Union(features, reads[mcols(reads)$mapq >= 20], ignore.strand=ignore.strand, inter.feature=inter.feature) } genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) param <- ScanBamParam(what="mapq") fl <- system.file("extdata", "ex1.bam", package="Rsamtools") se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) assays(se)$counts ## The count function can be completely custom (i.e., not use the ## pre-defined count functions at all). Requirements are that ## the input arguments match the pre-defined modes and the output ## is a vector of counts the same length as 'features'. my_count <- function(features, reads, ignore.strand, inter.feature) { ## perform filtering, or subsetting etc. require(GenomicAlignments) # needed for parallel evaluation countOverlaps(features, reads) } the 'reads' argument is a GAlignments / GAlignmentsList (for single / paired reads) so you could do whatever you'd like to them, so long as your function returns a vector of counts equal to length(features) ryans_count = function(features, reads, ...) { reads = as(reads, "GRanges") ## equivalently (??) granges(reads) width(reads) = 147 countOverlaps(features, reads) } I think you're also interested in width of overlaps, so you could implement that in ryans_count, too, e.g., via http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package or https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html and other parts of that thread, include the late continuation https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html (probably need to think through carefully what these are doing). summarizeOverlaps will iterate through a bam file in chunks, so moderating memory use (your function will be called several times). If you have several bam files and a linux / Mac, load the BiocParallel package and it'll distribute one bam file to each processor. If memory becomes a problem use BamFileList(your_files, yieldSize=500000) or similar to reduce the size of each chunk (the default yield size is I think 1,000,000). A newer approach is to build a tool of your own using the bed / wig reader functions in rtracklayer with GenomicRanges::tileGenome to create large-ish chunks of file to process, followed by, e.g., GenomicFiles::reduceByFile to MAP from the file to count in each tile, and REDUCE to summarize counts within a file; the vignette outlines this approach for several case studies (though not for counting reads overlapping ranges -- I'm sure a worked example using, e.g., http://bioconductor.org/packages/release/data/experiment/html/RNAseqDa ta.HNRNPC.bam.chr14.html would be appreciated). Martin > > -Ryan Thompson > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to GenomicAlignments 1.1.1. Valerie On 04/14/2014 11:35 PM, Martin Morgan wrote: > On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >> Hello, >> >> I would like to manipulate the start and end positions of my reads before >> calling summarizeOverlaps. One way to do this is to convert my reads to a >> GRanges and then use flank, narrow, etc. to properly position the read >> ends >> where I want them. However, I don't see a method for summarizeOverlaps >> that >> accepts a GRanges object or bed file or similar for the reads. Is >> there such a >> method, and if not, would it be possible to add it? >> >> The specific application I have in mind is single-end ChIP-Seq reads, >> where we >> have a good idea of what the fragment length is and would like to >> extend the >> reads to this length. Alternately, it may be preferable to count only the >> 5-prime ends of the read, and this could be done by narrowing to 1 bp >> length. > > > if bam is 'bed file or similar' then... summarizeOverlaps can take an > arbitrary function as it's 'mode' argument, as documented on > ?summarizeOverlaps > > ## The 'mode' argument can take a custom count function whose > ## arguments are the same as those in the current counting modes > ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). > ## In this example records are filtered by map quality before > counting. > > mapq_filter <- function(features, reads, ignore.strand, > inter.feature) { > require(GenomicAlignments) # needed for parallel evaluation > Union(features, reads[mcols(reads)$mapq >= 20], > ignore.strand=ignore.strand, > inter.feature=inter.feature) > } > > genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) > param <- ScanBamParam(what="mapq") > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) > assays(se)$counts > > ## The count function can be completely custom (i.e., not use the > ## pre-defined count functions at all). Requirements are that > ## the input arguments match the pre-defined modes and the output > ## is a vector of counts the same length as 'features'. > > my_count <- function(features, reads, ignore.strand, > inter.feature) { > ## perform filtering, or subsetting etc. > require(GenomicAlignments) # needed for parallel evaluation > countOverlaps(features, reads) > } > > the 'reads' argument is a GAlignments / GAlignmentsList (for single / > paired reads) so you could do whatever you'd like to them, so long as > your function returns a vector of counts equal to length(features) > > ryans_count = function(features, reads, ...) { > reads = as(reads, "GRanges") ## equivalently (??) granges(reads) > width(reads) = 147 > countOverlaps(features, reads) > } > > I think you're also interested in width of overlaps, so you could > implement that in ryans_count, too, e.g., via > > > http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package > > > or > > https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html > > and other parts of that thread, include the late continuation > > https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html > > (probably need to think through carefully what these are doing). > > summarizeOverlaps will iterate through a bam file in chunks, so > moderating memory use (your function will be called several times). > > If you have several bam files and a linux / Mac, load the BiocParallel > package and it'll distribute one bam file to each processor. If memory > becomes a problem use BamFileList(your_files, yieldSize=500000) or > similar to reduce the size of each chunk (the default yield size is I > think 1,000,000). > > > A newer approach is to build a tool of your own using the bed / wig > reader functions in rtracklayer with GenomicRanges::tileGenome to create > large-ish chunks of file to process, followed by, e.g., > GenomicFiles::reduceByFile to MAP from the file to count in each tile, > and REDUCE to summarize counts within a file; the vignette outlines this > approach for several case studies (though not for counting reads > overlapping ranges -- I'm sure a worked example using, e.g., > > > http://bioconductor.org/packages/release/data/experiment/html/RNAseq Data.HNRNPC.bam.chr14.html > > > would be appreciated). > > > Martin > >> >> -Ryan Thompson >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
Thanks! This is exactly what I wanted. On Tue Apr 15 09:24:27 2014, Valerie Obenchain wrote: > Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to > GenomicAlignments 1.1.1. > > Valerie > > On 04/14/2014 11:35 PM, Martin Morgan wrote: >> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >>> Hello, >>> >>> I would like to manipulate the start and end positions of my reads >>> before >>> calling summarizeOverlaps. One way to do this is to convert my reads >>> to a >>> GRanges and then use flank, narrow, etc. to properly position the read >>> ends >>> where I want them. However, I don't see a method for summarizeOverlaps >>> that >>> accepts a GRanges object or bed file or similar for the reads. Is >>> there such a >>> method, and if not, would it be possible to add it? >>> >>> The specific application I have in mind is single-end ChIP-Seq reads, >>> where we >>> have a good idea of what the fragment length is and would like to >>> extend the >>> reads to this length. Alternately, it may be preferable to count >>> only the >>> 5-prime ends of the read, and this could be done by narrowing to 1 bp >>> length. >> >> >> if bam is 'bed file or similar' then... summarizeOverlaps can take an >> arbitrary function as it's 'mode' argument, as documented on >> ?summarizeOverlaps >> >> ## The 'mode' argument can take a custom count function whose >> ## arguments are the same as those in the current counting modes >> ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). >> ## In this example records are filtered by map quality before >> counting. >> >> mapq_filter <- function(features, reads, ignore.strand, >> inter.feature) { >> require(GenomicAlignments) # needed for parallel evaluation >> Union(features, reads[mcols(reads)$mapq >= 20], >> ignore.strand=ignore.strand, >> inter.feature=inter.feature) >> } >> >> genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) >> param <- ScanBamParam(what="mapq") >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) >> assays(se)$counts >> >> ## The count function can be completely custom (i.e., not use the >> ## pre-defined count functions at all). Requirements are that >> ## the input arguments match the pre-defined modes and the output >> ## is a vector of counts the same length as 'features'. >> >> my_count <- function(features, reads, ignore.strand, >> inter.feature) { >> ## perform filtering, or subsetting etc. >> require(GenomicAlignments) # needed for parallel evaluation >> countOverlaps(features, reads) >> } >> >> the 'reads' argument is a GAlignments / GAlignmentsList (for single / >> paired reads) so you could do whatever you'd like to them, so long as >> your function returns a vector of counts equal to length(features) >> >> ryans_count = function(features, reads, ...) { >> reads = as(reads, "GRanges") ## equivalently (??) >> granges(reads) >> width(reads) = 147 >> countOverlaps(features, reads) >> } >> >> I think you're also interested in width of overlaps, so you could >> implement that in ryans_count, too, e.g., via >> >> >> http://stackoverflow.com/questions/14685802/width-of-the- overlapped-segment-in-genomicranges-package >> >> >> >> or >> >> https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html >> >> and other parts of that thread, include the late continuation >> >> https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html >> >> (probably need to think through carefully what these are doing). >> >> summarizeOverlaps will iterate through a bam file in chunks, so >> moderating memory use (your function will be called several times). >> >> If you have several bam files and a linux / Mac, load the BiocParallel >> package and it'll distribute one bam file to each processor. If memory >> becomes a problem use BamFileList(your_files, yieldSize=500000) or >> similar to reduce the size of each chunk (the default yield size is I >> think 1,000,000). >> >> >> A newer approach is to build a tool of your own using the bed / wig >> reader functions in rtracklayer with GenomicRanges::tileGenome to create >> large-ish chunks of file to process, followed by, e.g., >> GenomicFiles::reduceByFile to MAP from the file to count in each tile, >> and REDUCE to summarize counts within a file; the vignette outlines this >> approach for several case studies (though not for counting reads >> overlapping ranges -- I'm sure a worked example using, e.g., >> >> >> http://bioconductor.org/packages/release/data/experiment/html/RNAse qData.HNRNPC.bam.chr14.html >> >> >> >> would be appreciated). >> >> >> Martin >> >>> >>> -Ryan Thompson >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >
ADD REPLY
0
Entering edit mode
Ok, so using this method, I could write a function that takes the reads, coerces them to GRanges, and then manipulates the start and end points as I like, and then dispatches to the standard counting functions. Then I can call summarizeOverlaps on my bam files directly. That's pretty nice. On 4/14/14, 11:35 PM, Martin Morgan wrote: > On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >> Hello, >> >> I would like to manipulate the start and end positions of my reads >> before >> calling summarizeOverlaps. One way to do this is to convert my reads >> to a >> GRanges and then use flank, narrow, etc. to properly position the >> read ends >> where I want them. However, I don't see a method for >> summarizeOverlaps that >> accepts a GRanges object or bed file or similar for the reads. Is >> there such a >> method, and if not, would it be possible to add it? >> >> The specific application I have in mind is single-end ChIP-Seq reads, >> where we >> have a good idea of what the fragment length is and would like to >> extend the >> reads to this length. Alternately, it may be preferable to count only >> the >> 5-prime ends of the read, and this could be done by narrowing to 1 bp >> length. > > > if bam is 'bed file or similar' then... summarizeOverlaps can take an > arbitrary function as it's 'mode' argument, as documented on > ?summarizeOverlaps > > ## The 'mode' argument can take a custom count function whose > ## arguments are the same as those in the current counting modes > ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). > ## In this example records are filtered by map quality before > counting. > > mapq_filter <- function(features, reads, ignore.strand, > inter.feature) { > require(GenomicAlignments) # needed for parallel evaluation > Union(features, reads[mcols(reads)$mapq >= 20], > ignore.strand=ignore.strand, > inter.feature=inter.feature) > } > > genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) > param <- ScanBamParam(what="mapq") > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) > assays(se)$counts > > ## The count function can be completely custom (i.e., not use the > ## pre-defined count functions at all). Requirements are that > ## the input arguments match the pre-defined modes and the output > ## is a vector of counts the same length as 'features'. > > my_count <- function(features, reads, ignore.strand, > inter.feature) { > ## perform filtering, or subsetting etc. > require(GenomicAlignments) # needed for parallel evaluation > countOverlaps(features, reads) > } > > the 'reads' argument is a GAlignments / GAlignmentsList (for single / > paired reads) so you could do whatever you'd like to them, so long as > your function returns a vector of counts equal to length(features) > > ryans_count = function(features, reads, ...) { > reads = as(reads, "GRanges") ## equivalently (??) > granges(reads) > width(reads) = 147 > countOverlaps(features, reads) > } > > I think you're also interested in width of overlaps, so you could > implement that in ryans_count, too, e.g., via > > > http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package > > > or > > https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html > > and other parts of that thread, include the late continuation > > https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html > > (probably need to think through carefully what these are doing). > > summarizeOverlaps will iterate through a bam file in chunks, so > moderating memory use (your function will be called several times). > > If you have several bam files and a linux / Mac, load the BiocParallel > package and it'll distribute one bam file to each processor. If memory > becomes a problem use BamFileList(your_files, yieldSize=500000) or > similar to reduce the size of each chunk (the default yield size is I > think 1,000,000). > > > A newer approach is to build a tool of your own using the bed / wig > reader functions in rtracklayer with GenomicRanges::tileGenome to > create large-ish chunks of file to process, followed by, e.g., > GenomicFiles::reduceByFile to MAP from the file to count in each tile, > and REDUCE to summarize counts within a file; the vignette outlines > this approach for several case studies (though not for counting reads > overlapping ranges -- I'm sure a worked example using, e.g., > > > http://bioconductor.org/packages/release/data/experiment/html/RNAseq Data.HNRNPC.bam.chr14.html > > > would be appreciated). > > > Martin > >> >> -Ryan Thompson >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY

Login before adding your answer.

Traffic: 650 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6