bumphunter::matchGenes and all=TRUE


Anne Biton ▴ 30

@anne-biton-6062

Last seen 10.6 years ago

Hello,

I am using the function matchGenes of the package bumphunter to associate many regions of interest with transcripts.

I use all=TRUE in order to get several results per region. When I use all=TRUE with a small number of queries, the output does contain the columns 'queries' and 'Tx', which give the query index and the matching transcript. But when I use all=TRUE with many regions (when the function starts to split the data), the output data.frame no longer contains these two columns, and it is not possible to know which region is associated with which transcript.

I think there is a problem in the function when the annotation is split into chunks of 10000 regions each.

Please let me know if it can be solved!

Best regards,
Anne

Annotation bumphunter • 2.2k views

updated 11.7 years ago by Kasper Daniel Hansen ★ 6.5k • written 11.7 years ago by Anne Biton ▴ 30


Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 21 months ago

United States

This sounds weird and it sounds like a bug. We'll have a look.

You can (greatly) speed up the process by providing some example object(s) and code to show the error.

Best,
Kasper

On Thu, Jul 25, 2013 at 8:13 PM, Anne Biton <anne.biton@berkeley.edu> wrote:
> I think there is a problem in the function when the annotation is split
> into chunks of 10000 regions each.

written 11.7 years ago by Kasper Daniel Hansen ★ 6.5k


Anne,

Can you run it with 'verbose=TRUE'? Also, tell us what your mc.cores value is (this will also appear in the verbose output). If it's >1, please try 'mc.cores=1' to see if that makes any difference.

On Jul 26, 2013, at 9:45 AM, Kasper Daniel Hansen wrote:
> This sounds weird and it sounds like a bug. We'll have a look.

written 11.7 years ago by Harris A. Jaffee ▴ 590


Hi Harris,

With a small number of regions:

> re2genestest <- bumphunter::matchGenes(x=regr[1:2], build='hg19', promoterDist=5000, all=TRUE, mc.cores=1, verbose=TRUE)
Matching regions to genes.
nearestgene: loading bumphunter hg19 transcript database
finding nearest transcripts...
AnnotatingDone.
> head(re2genestest)
  queries         Tx   name annotation
        1 uc009vis.3 WASH7P  NR_024540
        1 uc009vit.3 WASH7P  NR_024540
        1 uc009viu.3 WASH7P  NR_024540
        1 uc001aae.4 WASH7P  NR_024540
        1 uc001aah.4 WASH7P  NR_024540
        1 uc009vir.3 WASH7P  NR_024540

With a large number of regions:

> re2genestest <- bumphunter::matchGenes(x=regr[1:10000], build='hg19', promoterDist=5000, all=TRUE, mc.cores=1, verbose=TRUE)
Matching regions to genes.
nearestgene: loading bumphunter hg19 transcript database
finding nearest transcripts...
Splitting the annotation into 3 chunks of 10000 regions each
Annotating..........Done.
Annotating..........Done.
Annotating.Done.
> head(re2genestest)
    name annotation description
1 WASH7P  NR_024540        <NA>
2 WASH7P  NR_024540        <NA>
3 WASH7P  NR_024540        <NA>
4 WASH7P  NR_024540        <NA>
5 WASH7P  NR_024540        <NA>
6 WASH7P  NR_024540        <NA>

I can bypass the bug by splitting my regions into batches of 3000, calling matchGenes on each batch, and then using rbind on the output data.frames.

Best,
Anne

On Fri, Jul 26, 2013 at 8:14 AM, Harris A. Jaffee <hj@jhu.edu> wrote:
> Can you run it with 'verbose=TRUE'? Also, tell us what your mc.cores
> value is (This will also appear in the verbose output.)?
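The batching workaround mentioned in this reply can be sketched roughly as follows. This is only an illustration, not bumphunter code: `annotate_in_batches` and `fake_annotate` are hypothetical names, the batch size of 3000 is the one quoted in the post, and the sketch assumes that the regions object supports `length()` and `[` (as a GRanges does) and that 'queries' holds the position of each region within the object passed in. Under that last assumption the batch-local index has to be mapped back to the full set, otherwise every batch would start again at 1.

```r
# Hypothetical helper (not part of bumphunter): run an annotation function
# on batches of regions and stitch the results back together.
annotate_in_batches <- function(regions, annotate, batch_size = 3000L) {
  n <- length(regions)
  # List of index vectors, one per batch: 1..3000, 3001..6000, ...
  idx_batches <- split(seq_len(n), ceiling(seq_len(n) / batch_size))
  pieces <- lapply(idx_batches, function(idx) {
    res <- annotate(regions[idx])
    # Assumption: 'queries' indexes into the batch, so map it back
    # to the position in the full set of regions.
    res$queries <- idx[res$queries]
    res
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Stand-in for the real annotation call: two fake transcripts per region.
fake_annotate <- function(x) {
  data.frame(queries = rep(seq_along(x), each = 2L),
             Tx = paste0("tx_", rep(x, each = 2L), c("a", "b")),
             stringsAsFactors = FALSE)
}

res <- annotate_in_batches(1:7, fake_annotate, batch_size = 3L)
print(head(res))
```

With the call shown in the post, the second argument would presumably be something like `function(x) bumphunter::matchGenes(x=x, build='hg19', promoterDist=5000, all=TRUE)`; the argument list is taken from the thread, and other bumphunter versions may use a different signature.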

written 11.7 years ago by Anne Biton ▴ 30


Thank you again for the very precise bug report and your data (off list).

The annotation handlers for the 10000-nearest-transcript chunks were not being sent the user's value of 'all', so it defaulted to FALSE. Having fixed that, I found out that these handlers also needed a 'queries' and a 'Tx' argument when all=TRUE, which they now have.

The reason all this seems not very well integrated is that the 'all' argument was an afterthought thrown in by me. You may be the first real user, certainly the first with more than 10000 nearest transcripts. I'd be interested in whether it actually helps your analysis.

Sorry about the mc.cores business. It was irrelevant to the problem.

The fix will appear in bumphunter version 1.1.11, just committed.
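For readers who want to see the shape of this kind of bug, here is a toy reconstruction. It is not the bumphunter source: `annotate_chunk`, `annotate`, and the `forward_all` switch are all hypothetical, and the chunk size is shrunk to 3 so the split path is easy to trigger. It only shows how a chunked driver that does not forward 'all' to its per-chunk handler silently falls back to the handler's default, and how passing the global query index along restores the 'queries' column.

```r
# Hypothetical per-chunk handler: adds 'queries' and 'Tx' only when all = TRUE.
annotate_chunk <- function(x, all = FALSE, queries = seq_along(x)) {
  out <- data.frame(name = paste0("gene_", x), stringsAsFactors = FALSE)
  if (all) {
    extra <- data.frame(queries = queries, Tx = paste0("tx_", x),
                        stringsAsFactors = FALSE)
    out <- cbind(extra, out)
  }
  out
}

# Hypothetical driver: small inputs go straight to the handler,
# large inputs are split into chunks first.
annotate <- function(x, all = FALSE, chunk_size = 3L, forward_all = TRUE) {
  if (length(x) <= chunk_size) return(annotate_chunk(x, all = all))
  idx <- split(seq_along(x), ceiling(seq_along(x) / chunk_size))
  pieces <- lapply(idx, function(i) {
    if (forward_all) {
      annotate_chunk(x[i], all = all, queries = i)  # fixed behaviour
    } else {
      annotate_chunk(x[i])  # bug: 'all' falls back to its default, FALSE
    }
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

small       <- annotate(1:2, all = TRUE, forward_all = FALSE)
large_buggy <- annotate(1:7, all = TRUE, forward_all = FALSE)
large_fixed <- annotate(1:7, all = TRUE, forward_all = TRUE)

print(names(small))        # unsplit path still has 'queries' and 'Tx'
print(names(large_buggy))  # split path without forwarding loses them
print(names(large_fixed))
```

The small input never reaches the split path, which matches the symptom in the report: the missing columns only show up once the input is large enough to be chunked.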

written 11.7 years ago by Harris A. Jaffee ▴ 590


Hi Harris,

Many thanks for the fast correction. I will test the new version and let you know.

Best,
Anne

On Sun, Jul 28, 2013 at 6:51 PM, Harris A. Jaffee <hj@jhu.edu> wrote:
> The fix will appear in bumphunter version 1.1.11, just committed.

written 11.7 years ago by Anne Biton ▴ 30
