Coverage on a very large GappedAlignments object
Stefanie ▴ 360
@stefanie-5192
Last seen 10.2 years ago
Dear list,

I have some human data, mapped with bowtie to the transcriptome and then further processed with eXpress. The output of eXpress is a sorted BAM file containing the alignments of the reads to the different transcripts. So, the BAM file contains about 48 x 10^6 alignments, distributed over about 50000 transcripts. I read in the BAM file with 'readGappedAlignments':

    > reads
    GappedAlignments with 46765032 alignments and 0 metadata columns:
                                       seqnames strand       cigar    qwidth
                                          <Rle>  <Rle> <character> <integer>
           [1]  hg19_ensGene_ENST00000237247      +         75M        75
           [2]  hg19_ensGene_ENST00000371036      -         75M        75
           [3]  hg19_ensGene_ENST00000371037      -         75M        75
           [4]  hg19_ensGene_ENST00000471889      -         75M        75
           [5]  hg19_ensGene_ENST00000377479      +         75M        75
           ...                           ...    ...         ...       ...
    [46765028]  hg19_ensGene_ENST00000344867      -         75M        75
    [46765029]  hg19_ensGene_ENST00000400848      -         75M        75
    [46765030]  hg19_ensGene_ENST00000400848      -         75M        75
    [46765031]  hg19_ensGene_ENST00000400848      -         75M        75
    [46765032]  hg19_ensGene_ENST00000400829      -         75M        75
                    start       end     width      ngap
                <integer> <integer> <integer> <integer>
           [1]       1783      1857        75         0
           [2]        690       764        75         0
           [3]       1632      1706        75         0
           [4]       2121      2195        75         0
           [5]        391       465        75         0
           ...        ...       ...       ...       ...
    [46765028]        280       354        75         0
    [46765029]        509       583        75         0
    [46765030]        509       583        75         0
    [46765031]        509       583        75         0
    [46765032]        179       253        75         0
    ---
    seqlengths:
     hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
                             2580 ...                           999

Now, for whatever reason, I would like to have the coverage vector for each transcript. So, basically I would just compute coverage(reads). This works very well for smaller genomes and experiments (e.g. yeast with 10 x 10^6 alignments). But using 'coverage' on this large GappedAlignments object takes really very, very long.

Is there a way to do this better (in R)? Or is this just not feasible in R, and should I rather preprocess the data outside R?

I would appreciate any comments,
best,
Stefanie
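Concretely, the whole computation is just the following (a minimal sketch of the calls described above; the file name "aligned.bam" is a placeholder for the eXpress output):

    library(GenomicRanges)   ## provided readGappedAlignments() at the time
    library(Rsamtools)

    reads <- readGappedAlignments("aligned.bam")   ## placeholder path
    cvg <- coverage(reads)   ## RleList: one coverage Rle per transcript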
Coverage • genomes • 1.8k views
@martin-morgan-1513
Last seen 4 months ago
United States
Hi Stefanie --

One reason might be that you are on the edge of physical memory availability and you are swapping to disk. A solution is to read in chunks, which turns out to be very easy:

    bam = BamFile(, yieldSize = 10000000)   ## file name elided in the original
    open(bam)
    cvg = NULL
    while (length(reads <- readGappedAlignments(bam, <...>))) {
        if (is.null(cvg))                   ## first chunk: initialize
            cvg = coverage(reads)
        else                                ## later chunks: accumulate
            cvg = cvg + coverage(reads)
    }
    close(bam)

This will be relatively efficient; yieldSize can be adjusted, including to a size such that several BAM files can be processed in parallel. Start with a small yieldSize to quickly debug the iteration before processing the whole file. It can be further simplified using the coverage,BamFile method directly, if you're not wanting to do anything else with the gapped alignments.

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
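A minimal sketch of that last simplification. The exact call is inferred from Martin's mention of the coverage,BamFile method in Rsamtools, and the file name is again a placeholder:

    library(Rsamtools)

    ## Same result without materializing the alignments yourself;
    ## the method iterates through the file in yieldSize chunks internally.
    bam <- BamFile("aligned.bam", yieldSize = 10000000)   ## placeholder path
    cvg <- coverage(bam)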
Another approach to what Martin suggests would be to simply read in the reads and process them (do the per-transcript coverage calculation) one chromosome at a time; a sketch follows below.

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech
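A sketch of this per-chromosome approach. It is not from the original post: it assumes the BAM file is indexed (.bai) so that ScanBamParam(which=...) can jump to one transcript at a time, and it builds each coverage Rle from bare IRanges (safe here, since every CIGAR is an ungapped 75M), so that coverage() does not have to loop over all ~50000 seqlevels per call:

    library(GenomicRanges)
    library(Rsamtools)

    bam <- BamFile("aligned.bam")   ## placeholder path; index required
    si  <- seqinfo(bam)

    cvg <- RleList(lapply(seqnames(si), function(tx) {
        ## restrict the query to one transcript
        param <- ScanBamParam(which = GRanges(tx, IRanges(1, seqlengths(si)[[tx]])))
        reads <- readGappedAlignments(bam, param = param)
        ## plain-IRanges coverage; ignores CIGAR gaps (none here, all 75M)
        coverage(IRanges(start(reads), end(reads)),
                 width = seqlengths(si)[[tx]])
    }))
    names(cvg) <- seqnames(si)

With ~50000 transcripts, the per-query overhead of one range per call adds up, so in practice querying batches of transcripts per ScanBamParam may be preferable.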
Dear Martin and Steve,

thanks for the fast response! Streaming the input works fine, meaning that readGappedAlignments(bam) is really fast. But then, computing the coverage on the streamed chunks again takes a really long time. I have about 100 000 seqlevels in my original BAM file; is this the reason why coverage is so slow?

Best,
Stefanie
To be more specific: it takes about 10 minutes to calculate the coverage of 1 x 10^6 reads on my PC (Linux, 4 CPUs, 8 GB RAM). This seems a bit long to me; what do you think?

Stefanie
I have observed big slowdowns with GRanges when the number of seqlevels is large; not sure if this is related. My observation was mostly about findOverlaps.
Hi,

On 05/08/2013 08:50 AM, Kasper Daniel Hansen wrote:
> I have observed big slowdowns with GRanges when the number of seqlevels
> is large; not sure if this is related. My observation was mostly about
> findOverlaps.

Right. It's the same issue here: the "coverage" method for GenomicRanges objects (which is what is called behind the scenes when calling coverage() on a GappedAlignments object) loops over the seqlevels with an lapply. The overhead of that loop is big:

    ## Only 1 seqlevel:
    gr <- GRanges("chr1", IRanges(35:5, width=10))
    seqlengths(gr) <- 100

    > system.time(cvg1 <- coverage(gr))
       user  system elapsed
      0.016   0.000   0.013

    ## Add 50000 artificial seqlevels:
    fakechroms <- Seqinfo(paste0("fakechr", 1:50000),
                          seqlengths=rep(200, 50000))
    seqinfo(gr) <- merge(seqinfo(gr), fakechroms)

    > system.time(cvg2 <- coverage(gr))
        user  system elapsed
    134.744   0.376 135.417

Moving the loop to the C level would probably help. I will look into that and report back here when it's ready.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Fred Hutchinson Cancer Research Center
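Until such a fix lands, one stopgap consistent with this diagnosis (not from the thread, just a sketch) is to split the work across cores, since each transcript's coverage is independent. It assumes 4 cores, as on Stefanie's machine, and that a plain seqlevels<- assignment can drop the levels that have no alignments left in each subset:

    library(parallel)        ## mclapply
    library(GenomicRanges)

    ## 'reads' is the GappedAlignments from the original post.
    ## Split the seqlevels into one group per core; within each group,
    ## drop the other seqlevels so coverage() only loops over that group.
    txs    <- seqlevels(reads)
    groups <- split(txs, cut(seq_along(txs), 4, labels = FALSE))

    cvg <- do.call(c, unname(mclapply(groups, function(g) {
        keep <- as.character(seqnames(reads)) %in% g   ## simple, not memory-lean
        sub  <- reads[keep]
        seqlevels(sub) <- g   ## prune: only levels without alignments are dropped
        coverage(sub)
    }, mc.cores = 4)))

This does not reduce the total loop overhead, it only spreads it over the available cores; the real fix is the C-level loop Hervé describes.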