How do I parse an HTML table using RCurl?

Ruppert Valentino ▴ 270

@ruppert-valentino-1376

Last seen 10.3 years ago

Hello,

I am trying to write a script that takes a miRNA and returns the predicted target genes for that miRNA. I am trying several tools for this, one of which is TargetScan. The problem is that I don't know how to parse the HTML output table so that I get only the target genes.

For example, I am searching for target genes of the miRNA mmu-miR-1 as follows:

http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1

This generates a table. The script is:

    URL <- "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1"
    dat <- readLines(URL)

But I don't know how to split the table into columns so that I can take the column entitled "Human ortholog of target gene", which holds the target genes.

In the example above the first gene, COL4A3, starts at HTML code: COL4A3

Is there any way to format such a table into columns, then transpose the column entitled "Human ortholog of target gene" and pass that to a variable?

Many thanks,

miRNA miRNA • 1.9k views

ADD COMMENT • link updated 13.8 years ago by Vivek Jayaswal ▴ 10 • written 13.8 years ago by Ruppert Valentino ▴ 270

Dan Tenenbaum ★ 8.2k

@dan-tenenbaum-4256

Last seen 6 months ago

United States

On Mon, Mar 14, 2011 at 1:18 PM, Ruppert Valentino wrote:

> Is there any way to format such a table into columns then transpose the column entitled "Human ortholog of target gene" and pass that to a variable?

Hi,

In general, screen scraping is not the best solution: if the page design changes, your code will break. (If you just need to do this once, you could just copy and paste the table into Excel.)

When faced with this type of situation, check whether the web site in question has a programmatic interface or web service. Looking at it briefly, TargetScan doesn't appear to have one. However, they do make all of their data available in CSV format, along with some Perl scripts to do basic analysis:

http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_50

This may get you closer to what you want to do. Consider downloading the data in CSV format and using R (or the Perl scripts in combination with R) to recreate the table you got with your original query. From there it's a simple matter (in R) to subset the column(s) you're interested in.

If that doesn't work out, Sean's suggestion to use the XML package is a good one.

Dan
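The download-and-subset route described above might look roughly like the following sketch. It assumes the downloaded file is tab-delimited with a header row. The column names (`miR.Family`, `Gene.Symbol`, `Species.ID`) and the inline sample rows are hypothetical placeholders, not the actual TargetScan layout, so check the header of the real file before reusing any of this.

```r
# Hypothetical sample standing in for a downloaded TargetScan data file.
# Column names and rows are invented for illustration only.
sample_txt <- "miR.Family\tGene.Symbol\tSpecies.ID
miR-1/206\tGENE_A\t9606
miR-1/206\tGENE_B\t9606
let-7/98\tGENE_C\t9606
miR-1/206\tGENE_A\t10090"

# For a real file, pass its path as the first argument instead of text = ...
predictions <- read.delim(text = sample_txt, stringsAsFactors = FALSE)

# Keep the human rows (NCBI taxonomy ID 9606) for the miRNA family of interest,
# then pull out the gene column as a character vector without duplicates.
hits <- subset(predictions, miR.Family == "miR-1/206" & Species.ID == 9606)
target_genes <- unique(hits$Gene.Symbol)
print(target_genes)
```

Working from a downloaded file means the script depends only on the file's column layout, which tends to change less often than the HTML of a results page.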

ADD COMMENT • link 13.8 years ago Dan Tenenbaum ★ 8.2k

Sean Davis 21k

@sean-davis-490

Last seen 4 months ago

United States

Hi, Ruppert.

You might want to look at the XML package for doing such things.

Sean

ADD COMMENT • link 13.8 years ago Sean Davis 21k

> You might want to look at the XML package for doing such things.

Specifically, see the following StackOverflow questions for examples of how to do this:

http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
http://stackoverflow.com/questions/2998655/how-to-isolate-a-single-element-from-a-scraped-web-page-in-r/
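As a rough illustration of the XML-package approach, the sketch below parses a small, made-up HTML table and extracts one column by its header. The HTML snippet and the gene `GENE_B` are invented stand-ins for the real TargetScan page, whose markup may well differ. `htmlParse()` and `readHTMLTable()` are functions from the XML package on CRAN.

```r
library(XML)  # CRAN package; install.packages("XML") if it is missing

# Made-up HTML standing in for the results page (the real markup may differ).
html <- '<html><body><table>
<tr><th>Human ortholog of target gene</th><th>Gene name</th></tr>
<tr><td><a href="#" target="new">COL4A3</a></td><td>collagen, type IV, alpha 3</td></tr>
<tr><td><a href="#" target="new">GENE_B</a></td><td>hypothetical second gene</td></tr>
</table></body></html>'

# Parse the HTML text; for a live page, pass the URL to htmlParse() instead.
doc <- htmlParse(html, asText = TRUE)

# Convert every <table> in the document into a data frame.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Find the first table that has a column whose name mentions "ortholog".
# Matching on a keyword avoids depending on how spaces in headers are stored.
has_col <- vapply(
  tables,
  function(tb) !is.null(tb) && any(grepl("ortholog", names(tb), ignore.case = TRUE)),
  logical(1)
)
tb <- tables[[which(has_col)[1]]]
col <- grep("ortholog", names(tb), ignore.case = TRUE)[1]

# The link text inside each cell becomes the cell value.
target_genes <- as.character(tb[[col]])
print(target_genes)
```

Selecting the column by header keyword rather than by position makes the script a little more tolerant of column reordering, though it will still break if the page layout changes substantially.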

ADD REPLY • link 13.8 years ago Geoff Jentry ▴ 50

James F. Reid ▴ 610

@james-f-reid-3148

Last seen 10.3 years ago

Hi Ruppert,

the TargetScan database for human and mouse is already available in Bioconductor as an AnnotationDbi annotation resource (targetscan.Hs.eg.db and targetscan.Mm.eg.db). So is miRBase, but without any target predictions. As others have pointed out on the mailing list, I would not recommend parsing the HTML of a query, as the format is likely to change over time; instead, download the database and re-format it.

If you are interested in providing other miRNA target prediction resources to the community, I would be willing to help.

Best,
J.
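A lookup through the annotation package could be sketched as below. This assumes the targetscan.Hs.eg.db package is installed from Bioconductor and that it provides the `targetscan.Hs.egMIRBASE2FAMILY` and `targetscan.Hs.egTARGETS` maps as described in its documentation. The helper name `targets_for_mirna` is hypothetical, and the result is a vector of Entrez Gene IDs rather than gene symbols.

```r
# Hypothetical helper: predicted targets (Entrez Gene IDs) for one miRBase ID.
targets_for_mirna <- function(mirna) {
  if (!requireNamespace("targetscan.Hs.eg.db", quietly = TRUE)) {
    stop("Please install targetscan.Hs.eg.db from Bioconductor first")
  }
  # Attaching the package also attaches AnnotationDbi, which supplies
  # the mget() and revmap() methods for these annotation maps.
  suppressPackageStartupMessages(library(targetscan.Hs.eg.db))

  # Step 1: miRBase identifier -> TargetScan miRNA family name.
  family <- mget(mirna, targetscan.Hs.egMIRBASE2FAMILY, ifnotfound = NA)[[1]]
  if (all(is.na(family))) {
    return(character(0))
  }

  # Step 2: family name -> Entrez Gene IDs, using the reversed gene-to-family map.
  entrez <- mget(family, revmap(targetscan.Hs.egTARGETS), ifnotfound = NA)
  entrez <- unlist(entrez, use.names = FALSE)
  unique(entrez[!is.na(entrez)])
}
```

With the package installed, a call such as `targets_for_mirna("hsa-miR-1")` would be the starting point. Whether mouse identifiers like mmu-miR-1 resolve in the human package is not something this sketch verifies. Entrez IDs can be translated to symbols with a separate annotation package such as org.Hs.eg.db.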

ADD COMMENT • link 13.8 years ago James F. Reid ▴ 610

Hi James,

Many thanks for telling me that TargetScan is accessible via AnnotationDbi, as this will help me solve the problem in a different way, as the others suggested.

Can you tell me if Bioconductor has a resource to access miRanda (http://www.microrna.org/microrna/) and PicTar (http://pictar.mdc-berlin.de/cgi-bin/PicTar_vertebrate.cgi)? If so, which library can I use?

Many thanks,
Ruppert

ADD REPLY • link 13.8 years ago Ruppert Valentino ▴ 270

Hi Ruppert,

On 03/14/2011 11:35 PM, Ruppert Valentino wrote:

> Can you tell me if bioconductor has resource to access miRanda http://www.microrna.org/microrna/ and pictar http://pictar.mdc-berlin.de/cgi-bin/PicTar_vertebrate.cgi
>
> If so, which library can I use?

No, I'm afraid these two resources are not available within Bioconductor. The miRanda-based predictions at microrna.org are available for download as tab-delimited text. This is not the case for the PicTar resource AFAIK; note that this resource has not been updated since March 2007.

Best,
J.

ADD REPLY • link 13.8 years ago James F. Reid ▴ 610

Vivek Jayaswal ▴ 10

@vivek-jayaswal-4546

Last seen 10.3 years ago

Hi Ruppert,

I've recently developed an R package that, among other things, can generate miRNA-target mRNA lists using the source files downloaded from the miRGen website. Currently, the CADMIM package is available at www.maths.usyd.edu.au/u/vivek

Regards,
Vivek

ADD COMMENT • link 13.8 years ago Vivek Jayaswal ▴ 10
