running time: countOverlaps & summarizedOverlaps vs. HTSeq

0

Entering edit mode

Milica Krunic ▴ 40

@milica-krunic-5169

Last seen 10.6 years ago

Hello! I am working with cat RNA Seq data and after mapping I wanted to get the count tables. So, I tried to do it using countOverlaps and summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per one sorted .bam file (~20*10^6 reads), no matter which previously mentioned method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R methods I used GRangesList object downloaded directly in R from Biomart. In HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width is about 44000 in GRangesList. Does anyone know why getting count tables in R takes so long compared to HTSeq? Thank you! Best, Milica [[alternative HTML version deleted]]

• 2.6k views

ADD COMMENT • link updated 13.1 years ago by Wei Shi ★ 3.6k • written 13.1 years ago by Milica Krunic ▴ 40

0

Entering edit mode

delhomme@embl.de ★ 1.2k

@delhommeemblde-3232

Last seen 10.6 years ago

Hi Milica, I do the exact same thing in my package (easyRNASeq, still in the devel branch of Bioc) and it definitely does not require 20 hours to read "only" 20 million reads. Are you sure you are not getting your machine to swap? I.e. did you monitor the memory usage? It would be interesting (for me, at least) if you could try my package to get your count table. You can either retrieve the annotation from biomaRt or provide a GFF file. See the vignette of the package for the details and maybe these two posts on that mailing list: https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124. html For addressing if from the countOverlaps / summarizedOverlaps point of view, it would help if you could post your code and sessionInfo(). HTH, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 28 Mar 2012, at 13:04, Milica Krunic wrote: > Hello! > > > > I am working with cat RNA Seq data and after mapping I wanted to get the > count tables. So, I tried to do it using countOverlaps and > summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per > one sorted .bam file (~20*10^6 reads), no matter which previously mentioned > method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R > methods I used GRangesList object downloaded directly in R from Biomart. In > HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width > is about 44000 in GRangesList. > Does anyone know why getting count tables in R takes so long compared to > HTSeq? > > > Thank you! > > Best, > Milica > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 13.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Hi Nico, Milica, On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: > Hi Milica, > > I do the exact same thing in my package (easyRNASeq, still in the devel branch of Bioc) and it definitely does not require 20 hours to read "only" 20 million reads. Are you sure you are not getting your machine to swap? I.e. did you monitor the memory usage? > > It would be interesting (for me, at least) if you could try my package to get your count table. You can either retrieve the annotation from biomaRt or provide a GFF file. See the vignette of the package for the details and maybe these two posts on that mailing list: > > https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html > https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/04412 4.html Slightly off-topic but probably worth mentioning is that you don't need to use a BSgenome package in order to fetch the chromosome lengths of an organism. For example in the code you show in the 2 above posts, you could use makeTranscriptDbFromUCSC() (or makeTranscriptDbFromBiomart(), both from the GenomicFeatures package) to make a TranscriptDb object, and then to do seqlengths() on that object (BTW it would be nice if this worked with makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages for hg19 or mm9 already installed, that should be faster than installing them. Or, even faster: > library(rtracklayer) > session <- browserSession() > genome(session) <- "mm9" > seqlengths(session) chr1 chr1_random chr2 chr3 chr3_random chr4 197195432 1231697 181748087 159599783 41899 155630120 chr4_random chr5 chr5_random chr6 chr7 chr7_random 160594 152537259 357350 149517037 152524553 362490 chr8 chr8_random chr9 chr9_random chr10 chr11 131738871 849593 124076172 449403 129993255 121843856 chr12 chr13 chr13_random chr14 chr15 chr16 121257530 120284312 400311 125194864 103494974 98319150 chr16_random chr17 chr17_random chr18 chr19 chrX 3994 95272651 628739 90772031 61342430 166650296 chrX_random chrY chrY_random chrUn_random chrM 1785075 15902555 58682461 5900358 16299 However, those alternative require internet access... Cheers, H. > > For addressing if from the countOverlaps / summarizedOverlaps point of view, it would help if you could post your code and sessionInfo(). > > HTH, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 28 Mar 2012, at 13:04, Milica Krunic wrote: > >> Hello! >> >> >> >> I am working with cat RNA Seq data and after mapping I wanted to get the >> count tables. So, I tried to do it using countOverlaps and >> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per >> one sorted .bam file (~20*10^6 reads), no matter which previously mentioned >> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R >> methods I used GRangesList object downloaded directly in R from Biomart. In >> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width >> is about 44000 in GRangesList. >> Does anyone know why getting count tables in R takes so long compared to >> HTSeq? >> >> >> Thank you! >> >> Best, >> Milica >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hi, 2012/3/28 Hervé Pagès <hpages at="" fhcrc.org="">: > Hi Nico, Milica, > > > On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: >> >> Hi Milica, >> >> I do the exact same thing in my package (easyRNASeq, still in the devel >> branch of Bioc) and it definitely does not require 20 hours to read "only" >> 20 million reads. Are you sure you are not getting your machine to swap? >> I.e. did you monitor the memory usage? >> >> It would be interesting (for me, at least) if you could try my package to >> get your count table. You can either retrieve the annotation from biomaRt or >> provide a GFF file. See the vignette of the package for the details and >> maybe these two posts on that mailing list: >> >> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html >> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/0441 24.html > > > Slightly off-topic but probably worth mentioning is that you don't > need to use a BSgenome package in order to fetch the chromosome > lengths of an organism. For example in the code you show in the > 2 above posts, you could use makeTranscriptDbFromUCSC() (or > makeTranscriptDbFromBiomart(), both from the GenomicFeatures > package) to make a TranscriptDb object, and then to do seqlengths() > on that object (BTW it would be nice if this worked with > makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages > for hg19 or mm9 already installed, that should be faster than > installing them. Also, the BAM file you are using as the source of your reads has the seqlength info in it, eg: R> library(Rsamtools) R> bf <- BamFile('/path/to/your.bam') R> seqlengths(bf) ## hg19 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 chrM 16571 Perhaps that's a helpful piece of trivia to know, too ... -steve > > Or, even faster: > >> library(rtracklayer) >> session <- browserSession() >> genome(session) <- "mm9" >> seqlengths(session) > ? ? ? ?chr1 ?chr1_random ? ? ? ? chr2 ? ? ? ? chr3 ?chr3_random ? chr4 > ? 197195432 ? ? ?1231697 ? ?181748087 ? ?159599783 ? ? ? ?41899 155630120 > ?chr4_random ? ? ? ? chr5 ?chr5_random ? ? ? ? chr6 ? ? ? ? chr7 chr7_random > ? ? ?160594 ? ?152537259 ? ? ? 357350 ? ?149517037 ? ?152524553 362490 > ? ? ? ?chr8 ?chr8_random ? ? ? ? chr9 ?chr9_random ? ? ? ?chr10 ?chr11 > ? 131738871 ? ? ? 849593 ? ?124076172 ? ? ? 449403 ? ?129993255 121843856 > ? ? ? chr12 ? ? ? ?chr13 chr13_random ? ? ? ?chr14 ? ? ? ?chr15 ?chr16 > ? 121257530 ? ?120284312 ? ? ? 400311 ? ?125194864 ? ?103494974 98319150 > chr16_random ? ? ? ?chr17 chr17_random ? ? ? ?chr18 ? ? ? ?chr19 ?chrX > ? ? ? ?3994 ? ? 95272651 ? ? ? 628739 ? ? 90772031 ? ? 61342430 166650296 > ?chrX_random ? ? ? ? chrY ?chrY_random chrUn_random ? ? ? ? chrM > ? ? 1785075 ? ? 15902555 ? ? 58682461 ? ? ?5900358 ? ? ? ?16299 > > However, those alternative require internet access... > > Cheers, > H. > > >> >> For addressing if from the countOverlaps / summarizedOverlaps point of >> view, it would help if you could post your code and sessionInfo(). >> >> HTH, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 28 Mar 2012, at 13:04, Milica Krunic wrote: >> >>> Hello! >>> >>> >>> >>> I am working with cat RNA Seq data and after mapping I wanted to get the >>> count tables. So, I tried to do it using countOverlaps and >>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, >>> per >>> one sorted .bam file (~20*10^6 reads), no matter which previously >>> mentioned >>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R >>> methods I used GRangesList object downloaded directly in R from Biomart. >>> In >>> HTSeq I used GTF file provided on Ensembl homepage. Average ?cat gene >>> width >>> is about 44000 in GRangesList. >>> Does anyone know why getting count tables in R takes so long compared to >>> HTSeq? >>> >>> >>> Thank you! >>> >>> Best, >>> Milica >>> >>> ? ? ? ?[[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: ?(206) 667-5791 > Fax: ? ?(206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 29 Mar 2012, at 03:20, Steve Lianoglou wrote: > Hi, > > 2012/3/28 Hervé Pagès <hpages at="" fhcrc.org="">: >> Hi Nico, Milica, >> >> >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: >>> >>> Hi Milica, >>> >>> I do the exact same thing in my package (easyRNASeq, still in the devel >>> branch of Bioc) and it definitely does not require 20 hours to read "only" >>> 20 million reads. Are you sure you are not getting your machine to swap? >>> I.e. did you monitor the memory usage? >>> >>> It would be interesting (for me, at least) if you could try my package to >>> get your count table. You can either retrieve the annotation from biomaRt or >>> provide a GFF file. See the vignette of the package for the details and >>> maybe these two posts on that mailing list: >>> >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html >>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044 124.html >> >> >> Slightly off-topic but probably worth mentioning is that you don't >> need to use a BSgenome package in order to fetch the chromosome >> lengths of an organism. For example in the code you show in the >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures >> package) to make a TranscriptDb object, and then to do seqlengths() >> on that object (BTW it would be nice if this worked with >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages >> for hg19 or mm9 already installed, that should be faster than >> installing them. > > Also, the BAM file you are using as the source of your reads has the > seqlength info in it, eg: > > R> library(Rsamtools) > R> bf <- BamFile('/path/to/your.bam') > R> seqlengths(bf) ## hg19 > chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 > chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 > 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 > chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 > chrM > 16571 > > Perhaps that's a helpful piece of trivia to know, too ... > > -steve > > >> >> Or, even faster: >> >>> library(rtracklayer) >>> session <- browserSession() >>> genome(session) <- "mm9" >>> seqlengths(session) >> chr1 chr1_random chr2 chr3 chr3_random chr4 >> 197195432 1231697 181748087 159599783 41899 155630120 >> chr4_random chr5 chr5_random chr6 chr7 chr7_random >> 160594 152537259 357350 149517037 152524553 362490 >> chr8 chr8_random chr9 chr9_random chr10 chr11 >> 131738871 849593 124076172 449403 129993255 121843856 >> chr12 chr13 chr13_random chr14 chr15 chr16 >> 121257530 120284312 400311 125194864 103494974 98319150 >> chr16_random chr17 chr17_random chr18 chr19 chrX >> 3994 95272651 628739 90772031 61342430 166650296 >> chrX_random chrY chrY_random chrUn_random chrM >> 1785075 15902555 58682461 5900358 16299 >> >> However, those alternative require internet access... >> >> Cheers, >> H. >> >> >>> >>> For addressing if from the countOverlaps / summarizedOverlaps point of >>> view, it would help if you could post your code and sessionInfo(). >>> >>> HTH, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote: >>> >>>> Hello! >>>> >>>> >>>> >>>> I am working with cat RNA Seq data and after mapping I wanted to get the >>>> count tables. So, I tried to do it using countOverlaps and >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, >>>> per >>>> one sorted .bam file (~20*10^6 reads), no matter which previously >>>> mentioned >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R >>>> methods I used GRangesList object downloaded directly in R from Biomart. >>>> In >>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene >>>> width >>>> is about 44000 in GRangesList. >>>> Does anyone know why getting count tables in R takes so long compared to >>>> HTSeq? >>>> >>>> >>>> Thank you! >>>> >>>> Best, >>>> Milica >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

In devel rtracklayer, SeqinfoForUCSCGenome("hg19") does the equivalent of Herve's code. There is also a SeqinfoForBSGenome(). It returns NULL if the genome package is not found, in which case you can fall back to UCSC. That's what rtracklayer does internally when you pass a genome argument to import. Michael On Thu, Mar 29, 2012 at 5:39 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > Hi Steve, > > Yes that's true, I've been using that too, but I still encounter BAM files > without an header section. Combining both your input will be a way to avoid > BSgenome :-) > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 29 Mar 2012, at 03:20, Steve Lianoglou wrote: > > > Hi, > > > > 2012/3/28 Hervé Pagès <hpages@fhcrc.org>: > >> Hi Nico, Milica, > >> > >> > >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: > >>> > >>> Hi Milica, > >>> > >>> I do the exact same thing in my package (easyRNASeq, still in the devel > >>> branch of Bioc) and it definitely does not require 20 hours to read > "only" > >>> 20 million reads. Are you sure you are not getting your machine to > swap? > >>> I.e. did you monitor the memory usage? > >>> > >>> It would be interesting (for me, at least) if you could try my package > to > >>> get your count table. You can either retrieve the annotation from > biomaRt or > >>> provide a GFF file. See the vignette of the package for the details and > >>> maybe these two posts on that mailing list: > >>> > >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html > >>> > https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/04412 4.html > >> > >> > >> Slightly off-topic but probably worth mentioning is that you don't > >> need to use a BSgenome package in order to fetch the chromosome > >> lengths of an organism. For example in the code you show in the > >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or > >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures > >> package) to make a TranscriptDb object, and then to do seqlengths() > >> on that object (BTW it would be nice if this worked with > >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages > >> for hg19 or mm9 already installed, that should be faster than > >> installing them. > > > > Also, the BAM file you are using as the source of your reads has the > > seqlength info in it, eg: > > > > R> library(Rsamtools) > > R> bf <- BamFile('/path/to/your.bam') > > R> seqlengths(bf) ## hg19 > > chr1 chr2 chr3 chr4 chr5 chr6 chr7 > chr8 > > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 > 146364022 > > chr9 chr10 chr11 chr12 chr13 chr14 chr15 > chr16 > > 141213431 135534747 135006516 133851895 115169878 107349540 102531392 > 90354753 > > chr17 chr18 chr19 chr20 chr21 chr22 chrX > chrY > > 81195210 78077248 59128983 63025520 48129895 51304566 155270560 > 59373566 > > chrM > > 16571 > > > > Perhaps that's a helpful piece of trivia to know, too ... > > > > -steve > > > > > >> > >> Or, even faster: > >> > >>> library(rtracklayer) > >>> session <- browserSession() > >>> genome(session) <- "mm9" > >>> seqlengths(session) > >> chr1 chr1_random chr2 chr3 chr3_random chr4 > >> 197195432 1231697 181748087 159599783 41899 > 155630120 > >> chr4_random chr5 chr5_random chr6 chr7 > chr7_random > >> 160594 152537259 357350 149517037 152524553 362490 > >> chr8 chr8_random chr9 chr9_random chr10 chr11 > >> 131738871 849593 124076172 449403 129993255 > 121843856 > >> chr12 chr13 chr13_random chr14 chr15 chr16 > >> 121257530 120284312 400311 125194864 103494974 98319150 > >> chr16_random chr17 chr17_random chr18 chr19 chrX > >> 3994 95272651 628739 90772031 61342430 > 166650296 > >> chrX_random chrY chrY_random chrUn_random chrM > >> 1785075 15902555 58682461 5900358 16299 > >> > >> However, those alternative require internet access... > >> > >> Cheers, > >> H. > >> > >> > >>> > >>> For addressing if from the countOverlaps / summarizedOverlaps point of > >>> view, it would help if you could post your code and sessionInfo(). > >>> > >>> HTH, > >>> > >>> Nico > >>> > >>> --------------------------------------------------------------- > >>> Nicolas Delhomme > >>> > >>> Genome Biology Computational Support > >>> > >>> European Molecular Biology Laboratory > >>> > >>> Tel: +49 6221 387 8310 > >>> Email: nicolas.delhomme@embl.de > >>> Meyerhofstrasse 1 - Postfach 10.2209 > >>> 69102 Heidelberg, Germany > >>> --------------------------------------------------------------- > >>> > >>> > >>> > >>> > >>> > >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote: > >>> > >>>> Hello! > >>>> > >>>> > >>>> > >>>> I am working with cat RNA Seq data and after mapping I wanted to get > the > >>>> count tables. So, I tried to do it using countOverlaps and > >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using > R, > >>>> per > >>>> one sorted .bam file (~20*10^6 reads), no matter which previously > >>>> mentioned > >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. > For R > >>>> methods I used GRangesList object downloaded directly in R from > Biomart. > >>>> In > >>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene > >>>> width > >>>> is about 44000 in GRangesList. > >>>> Does anyone know why getting count tables in R takes so long compared > to > >>>> HTSeq? > >>>> > >>>> > >>>> Thank you! > >>>> > >>>> Best, > >>>> Milica > >>>> > >>>> [[alternative HTML version deleted]] > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> > >> -- > >> Hervé Pagès > >> > >> Program in Computational Biology > >> Division of Public Health Sciences > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N, M1-B514 > >> P.O. Box 19024 > >> Seattle, WA 98109-1024 > >> > >> E-mail: hpages@fhcrc.org > >> Phone: (206) 667-5791 > >> Fax: (206) 667-1319 > >> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael! --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 29 Mar 2012, at 18:28, Michael Lawrence wrote: > In devel rtracklayer, SeqinfoForUCSCGenome("hg19") does the equivalent of Herve's code. There is also a SeqinfoForBSGenome(). It returns NULL if the genome package is not found, in which case you can fall back to UCSC. That's what rtracklayer does internally when you pass a genome argument to import. > > Michael > > On Thu, Mar 29, 2012 at 5:39 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > Hi Steve, > > Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-) > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 29 Mar 2012, at 03:20, Steve Lianoglou wrote: > > > Hi, > > > > 2012/3/28 Hervé Pagès <hpages at="" fhcrc.org="">: > >> Hi Nico, Milica, > >> > >> > >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: > >>> > >>> Hi Milica, > >>> > >>> I do the exact same thing in my package (easyRNASeq, still in the devel > >>> branch of Bioc) and it definitely does not require 20 hours to read "only" > >>> 20 million reads. Are you sure you are not getting your machine to swap? > >>> I.e. did you monitor the memory usage? > >>> > >>> It would be interesting (for me, at least) if you could try my package to > >>> get your count table. You can either retrieve the annotation from biomaRt or > >>> provide a GFF file. See the vignette of the package for the details and > >>> maybe these two posts on that mailing list: > >>> > >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html > >>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/0 44124.html > >> > >> > >> Slightly off-topic but probably worth mentioning is that you don't > >> need to use a BSgenome package in order to fetch the chromosome > >> lengths of an organism. For example in the code you show in the > >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or > >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures > >> package) to make a TranscriptDb object, and then to do seqlengths() > >> on that object (BTW it would be nice if this worked with > >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages > >> for hg19 or mm9 already installed, that should be faster than > >> installing them. > > > > Also, the BAM file you are using as the source of your reads has the > > seqlength info in it, eg: > > > > R> library(Rsamtools) > > R> bf <- BamFile('/path/to/your.bam') > > R> seqlengths(bf) ## hg19 > > chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 > > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 > > chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 > > 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 > > chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > > 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 > > chrM > > 16571 > > > > Perhaps that's a helpful piece of trivia to know, too ... > > > > -steve > > > > > >> > >> Or, even faster: > >> > >>> library(rtracklayer) > >>> session <- browserSession() > >>> genome(session) <- "mm9" > >>> seqlengths(session) > >> chr1 chr1_random chr2 chr3 chr3_random chr4 > >> 197195432 1231697 181748087 159599783 41899 155630120 > >> chr4_random chr5 chr5_random chr6 chr7 chr7_random > >> 160594 152537259 357350 149517037 152524553 362490 > >> chr8 chr8_random chr9 chr9_random chr10 chr11 > >> 131738871 849593 124076172 449403 129993255 121843856 > >> chr12 chr13 chr13_random chr14 chr15 chr16 > >> 121257530 120284312 400311 125194864 103494974 98319150 > >> chr16_random chr17 chr17_random chr18 chr19 chrX > >> 3994 95272651 628739 90772031 61342430 166650296 > >> chrX_random chrY chrY_random chrUn_random chrM > >> 1785075 15902555 58682461 5900358 16299 > >> > >> However, those alternative require internet access... > >> > >> Cheers, > >> H. > >> > >> > >>> > >>> For addressing if from the countOverlaps / summarizedOverlaps point of > >>> view, it would help if you could post your code and sessionInfo(). > >>> > >>> HTH, > >>> > >>> Nico > >>> > >>> --------------------------------------------------------------- > >>> Nicolas Delhomme > >>> > >>> Genome Biology Computational Support > >>> > >>> European Molecular Biology Laboratory > >>> > >>> Tel: +49 6221 387 8310 > >>> Email: nicolas.delhomme at embl.de > >>> Meyerhofstrasse 1 - Postfach 10.2209 > >>> 69102 Heidelberg, Germany > >>> --------------------------------------------------------------- > >>> > >>> > >>> > >>> > >>> > >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote: > >>> > >>>> Hello! > >>>> > >>>> > >>>> > >>>> I am working with cat RNA Seq data and after mapping I wanted to get the > >>>> count tables. So, I tried to do it using countOverlaps and > >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, > >>>> per > >>>> one sorted .bam file (~20*10^6 reads), no matter which previously > >>>> mentioned > >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R > >>>> methods I used GRangesList object downloaded directly in R from Biomart. > >>>> In > >>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene > >>>> width > >>>> is about 44000 in GRangesList. > >>>> Does anyone know why getting count tables in R takes so long compared to > >>>> HTSeq? > >>>> > >>>> > >>>> Thank you! > >>>> > >>>> Best, > >>>> Milica > >>>> > >>>> [[alternative HTML version deleted]] > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> > >> -- > >> Hervé Pagès > >> > >> Program in Computational Biology > >> Division of Public Health Sciences > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N, M1-B514 > >> P.O. Box 19024 > >> Seattle, WA 98109-1024 > >> > >> E-mail: hpages at fhcrc.org > >> Phone: (206) 667-5791 > >> Fax: (206) 667-1319 > >> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 13.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Dear All, Thank you for useful suggestions! Best, Milica On Thu, Mar 29, 2012 at 6:33 PM, Nicolas Delhomme <delhomme@embl.de> wrote: > Thanks Michael! > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 29 Mar 2012, at 18:28, Michael Lawrence wrote: > > > In devel rtracklayer, SeqinfoForUCSCGenome("hg19") does the equivalent > of Herve's code. There is also a SeqinfoForBSGenome(). It returns NULL if > the genome package is not found, in which case you can fall back to UCSC. > That's what rtracklayer does internally when you pass a genome argument to > import. > > > > Michael > > > > On Thu, Mar 29, 2012 at 5:39 AM, Nicolas Delhomme <delhomme@embl.de> > wrote: > > Hi Steve, > > > > Yes that's true, I've been using that too, but I still encounter BAM > files without an header section. Combining both your input will be a way to > avoid BSgenome :-) > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delhomme@embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 29 Mar 2012, at 03:20, Steve Lianoglou wrote: > > > > > Hi, > > > > > > 2012/3/28 Hervé Pagès <hpages@fhcrc.org>: > > >> Hi Nico, Milica, > > >> > > >> > > >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: > > >>> > > >>> Hi Milica, > > >>> > > >>> I do the exact same thing in my package (easyRNASeq, still in the > devel > > >>> branch of Bioc) and it definitely does not require 20 hours to read > "only" > > >>> 20 million reads. Are you sure you are not getting your machine to > swap? > > >>> I.e. did you monitor the memory usage? > > >>> > > >>> It would be interesting (for me, at least) if you could try my > package to > > >>> get your count table. You can either retrieve the annotation from > biomaRt or > > >>> provide a GFF file. See the vignette of the package for the details > and > > >>> maybe these two posts on that mailing list: > > >>> > > >>> > https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html > > >>> > https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/04412 4.html > > >> > > >> > > >> Slightly off-topic but probably worth mentioning is that you don't > > >> need to use a BSgenome package in order to fetch the chromosome > > >> lengths of an organism. For example in the code you show in the > > >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or > > >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures > > >> package) to make a TranscriptDb object, and then to do seqlengths() > > >> on that object (BTW it would be nice if this worked with > > >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages > > >> for hg19 or mm9 already installed, that should be faster than > > >> installing them. > > > > > > Also, the BAM file you are using as the source of your reads has the > > > seqlength info in it, eg: > > > > > > R> library(Rsamtools) > > > R> bf <- BamFile('/path/to/your.bam') > > > R> seqlengths(bf) ## hg19 > > > chr1 chr2 chr3 chr4 chr5 chr6 chr7 > chr8 > > > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 > 146364022 > > > chr9 chr10 chr11 chr12 chr13 chr14 chr15 > chr16 > > > 141213431 135534747 135006516 133851895 115169878 107349540 102531392 > 90354753 > > > chr17 chr18 chr19 chr20 chr21 chr22 chrX > chrY > > > 81195210 78077248 59128983 63025520 48129895 51304566 155270560 > 59373566 > > > chrM > > > 16571 > > > > > > Perhaps that's a helpful piece of trivia to know, too ... > > > > > > -steve > > > > > > > > >> > > >> Or, even faster: > > >> > > >>> library(rtracklayer) > > >>> session <- browserSession() > > >>> genome(session) <- "mm9" > > >>> seqlengths(session) > > >> chr1 chr1_random chr2 chr3 chr3_random chr4 > > >> 197195432 1231697 181748087 159599783 41899 > 155630120 > > >> chr4_random chr5 chr5_random chr6 chr7 > chr7_random > > >> 160594 152537259 357350 149517037 152524553 362490 > > >> chr8 chr8_random chr9 chr9_random chr10 chr11 > > >> 131738871 849593 124076172 449403 129993255 > 121843856 > > >> chr12 chr13 chr13_random chr14 chr15 chr16 > > >> 121257530 120284312 400311 125194864 103494974 > 98319150 > > >> chr16_random chr17 chr17_random chr18 chr19 chrX > > >> 3994 95272651 628739 90772031 61342430 > 166650296 > > >> chrX_random chrY chrY_random chrUn_random chrM > > >> 1785075 15902555 58682461 5900358 16299 > > >> > > >> However, those alternative require internet access... > > >> > > >> Cheers, > > >> H. > > >> > > >> > > >>> > > >>> For addressing if from the countOverlaps / summarizedOverlaps point > of > > >>> view, it would help if you could post your code and sessionInfo(). > > >>> > > >>> HTH, > > >>> > > >>> Nico > > >>> > > >>> --------------------------------------------------------------- > > >>> Nicolas Delhomme > > >>> > > >>> Genome Biology Computational Support > > >>> > > >>> European Molecular Biology Laboratory > > >>> > > >>> Tel: +49 6221 387 8310 > > >>> Email: nicolas.delhomme@embl.de > > >>> Meyerhofstrasse 1 - Postfach 10.2209 > > >>> 69102 Heidelberg, Germany > > >>> --------------------------------------------------------------- > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote: > > >>> > > >>>> Hello! > > >>>> > > >>>> > > >>>> > > >>>> I am working with cat RNA Seq data and after mapping I wanted to > get the > > >>>> count tables. So, I tried to do it using countOverlaps and > > >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that > using R, > > >>>> per > > >>>> one sorted .bam file (~20*10^6 reads), no matter which previously > > >>>> mentioned > > >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. > For R > > >>>> methods I used GRangesList object downloaded directly in R from > Biomart. > > >>>> In > > >>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat > gene > > >>>> width > > >>>> is about 44000 in GRangesList. > > >>>> Does anyone know why getting count tables in R takes so long > compared to > > >>>> HTSeq? > > >>>> > > >>>> > > >>>> Thank you! > > >>>> > > >>>> Best, > > >>>> Milica > > >>>> > > >>>> [[alternative HTML version deleted]] > > >>>> > > >>>> _______________________________________________ > > >>>> Bioconductor mailing list > > >>>> Bioconductor@r-project.org > > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >>>> Search the archives: > > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > >>> > > >>> > > >>> _______________________________________________ > > >>> Bioconductor mailing list > > >>> Bioconductor@r-project.org > > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >>> Search the archives: > > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > >> > > >> > > >> > > >> -- > > >> Hervé Pagès > > >> > > >> Program in Computational Biology > > >> Division of Public Health Sciences > > >> Fred Hutchinson Cancer Research Center > > >> 1100 Fairview Ave. N, M1-B514 > > >> P.O. Box 19024 > > >> Seattle, WA 98109-1024 > > >> > > >> E-mail: hpages@fhcrc.org > > >> Phone: (206) 667-5791 > > >> Fax: (206) 667-1319 > > >> > > >> > > >> _______________________________________________ > > >> Bioconductor mailing list > > >> Bioconductor@r-project.org > > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >> Search the archives: > > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > -- > > > Steve Lianoglou > > > Graduate Student: Computational Systems Biology > > > | Memorial Sloan-Kettering Cancer Center > > > | Weill Medical College of Cornell University > > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Milica Krunic ▴ 40

0

Entering edit mode

Thanks Herv?! Technically, I do not depend on the BSgenome to retrieve the chromosome sizes, it can be provided as a simple list. BSgenome used to be an easy way to fetch it. Anyway, I'll definitely have a look at your suggestion! Thanks again, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 29 Mar 2012, at 02:57, Hervé Pagès wrote: > Hi Nico, Milica, > > On 03/28/2012 04:21 AM, Nicolas Delhomme wrote: >> Hi Milica, >> >> I do the exact same thing in my package (easyRNASeq, still in the devel branch of Bioc) and it definitely does not require 20 hours to read "only" 20 million reads. Are you sure you are not getting your machine to swap? I.e. did you monitor the memory usage? >> >> It would be interesting (for me, at least) if you could try my package to get your count table. You can either retrieve the annotation from biomaRt or provide a GFF file. See the vignette of the package for the details and maybe these two posts on that mailing list: >> >> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html >> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/0441 24.html > > Slightly off-topic but probably worth mentioning is that you don't > need to use a BSgenome package in order to fetch the chromosome > lengths of an organism. For example in the code you show in the > 2 above posts, you could use makeTranscriptDbFromUCSC() (or > makeTranscriptDbFromBiomart(), both from the GenomicFeatures > package) to make a TranscriptDb object, and then to do seqlengths() > on that object (BTW it would be nice if this worked with > makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages > for hg19 or mm9 already installed, that should be faster than > installing them. > > Or, even faster: > > > library(rtracklayer) > > session <- browserSession() > > genome(session) <- "mm9" > > seqlengths(session) > chr1 chr1_random chr2 chr3 chr3_random chr4 > 197195432 1231697 181748087 159599783 41899 155630120 > chr4_random chr5 chr5_random chr6 chr7 chr7_random > 160594 152537259 357350 149517037 152524553 362490 > chr8 chr8_random chr9 chr9_random chr10 chr11 > 131738871 849593 124076172 449403 129993255 121843856 > chr12 chr13 chr13_random chr14 chr15 chr16 > 121257530 120284312 400311 125194864 103494974 98319150 > chr16_random chr17 chr17_random chr18 chr19 chrX > 3994 95272651 628739 90772031 61342430 166650296 > chrX_random chrY chrY_random chrUn_random chrM > 1785075 15902555 58682461 5900358 16299 > > However, those alternative require internet access... > > Cheers, > H. > >> >> For addressing if from the countOverlaps / summarizedOverlaps point of view, it would help if you could post your code and sessionInfo(). >> >> HTH, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 28 Mar 2012, at 13:04, Milica Krunic wrote: >> >>> Hello! >>> >>> >>> >>> I am working with cat RNA Seq data and after mapping I wanted to get the >>> count tables. So, I tried to do it using countOverlaps and >>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per >>> one sorted .bam file (~20*10^6 reads), no matter which previously mentioned >>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R >>> methods I used GRangesList object downloaded directly in R from Biomart. In >>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width >>> is about 44000 in GRangesList. >>> Does anyone know why getting count tables in R takes so long compared to >>> HTSeq? >>> >>> >>> Thank you! >>> >>> Best, >>> Milica >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

ADD REPLY • link 13.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 2 days ago

Australia/Melbourne

Hi Milica, You may try the featureCounts function in Rsubread package, which only uses ~2 minutes to get read counts for mouse genes with a 10 million read dataset. This function calls a C function to do the read counting and the entire operation is carried out in C space rather than in R space. This function is not only fast, but also has a small memory footprint (it does not need to create large R objects at all). Cheers, Wei On Mar 28, 2012, at 10:04 PM, Milica Krunic wrote: > Hello! > > > > I am working with cat RNA Seq data and after mapping I wanted to get the > count tables. So, I tried to do it using countOverlaps and > summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per > one sorted .bam file (~20*10^6 reads), no matter which previously mentioned > method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R > methods I used GRangesList object downloaded directly in R from Biomart. In > HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width > is about 44000 in GRangesList. > Does anyone know why getting count tables in R takes so long compared to > HTSeq? > > > Thank you! > > Best, > Milica > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD COMMENT • link 13.1 years ago Wei Shi ★ 3.6k

Login before adding your answer.