Genbank to Unigene IDs

Gordon Smyth 52k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

I have a list of GenBank IDs for which I'd like the corresponding Unigene cluster IDs. What is the easiest way to do this using Bioconductor functions? (I've scanned annotate and AnnBuilder help and vignettes, although way too quickly.)

For the sake of being specific, here's a concrete example. What's Unigene for GB="NM_004551"?

Thanks a lot
Gordon

annotate AnnBuilder • 4.5k views

ADD COMMENT • link updated 20.7 years ago by Kane, David ▴ 10 • written 20.8 years ago by Gordon Smyth 52k

A.J. Rossini ▴ 810

@aj-rossini-209

Last seen 10.4 years ago

Here's what I'd do (more of a chip-style analysis than instant WWW-based gratification, which might also be possible):

1. First create a tab-separated two-column file: the first column holds dummy probe IDs (could be real or not), the second column holds the GB IDs. So, you'd have one row in a file called "Dummy.tsv":

    1 NM_004551

2. Have a script similar to:

    library(AnnBuilder)
    myBaseType <- "gb"
    # myDir maps the directory where you want the data package built;
    # obviously this should be changed for the directory structure on the
    # linux box
    myDir <- "C:/DavidsData/Annotation_Folders"
    # myBase maps the file that contains the mapping of Agilent feature
    # numbers to GenBank ID's
    myBase <- "C:/DavidsData/Annotation_Folders/Dummy.tsv"
    # use AnnBuilder internal lists of data sources
    mySrcUrls <- getSrcUrl(src = "ALL", organism = "human")
    # invoke ABPkgBuilder
    ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls,
                 baseMapType = myBaseType, pkgName = "Hum_Agi1A",
                 pkgPath = myDir, organism = "human", version = "1.0",
                 makeXML = TRUE,
                 author = list(author = "dpritch",
                               maintainer = "dpritch@u.washington.edu"),
                 fromWeb = TRUE)

3. Install the package environment.

4. Use it to find the IDs (you can verify the ID mapping with the XML output file, as well).

best,
-tony

rossini@u.washington.edu
http://www.analytics.washington.edu/
Biomedical and Health Informatics, University of Washington
Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
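Step 1 above can be sketched in base R. This is only an illustration: the file location under tempdir() and the column names are assumptions made here, and the layout (no header, probe ID then accession) is taken from the one-row example in the post.

```r
# Sketch only: write a two-column, tab-separated base file.
# Column 1 holds placeholder probe IDs, column 2 the GenBank accessions.
gb_ids <- c("NM_004551")   # the accession from the question
base_map <- data.frame(probe = seq_along(gb_ids), gb = gb_ids,
                       stringsAsFactors = FALSE)

# Hypothetical location; point this wherever the base file should live.
base_file <- file.path(tempdir(), "Dummy.tsv")
write.table(base_map, file = base_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to confirm one row per accession and two columns.
check <- read.table(base_file, sep = "\t", stringsAsFactors = FALSE)
print(check)
```

The path in base_file would then be the value passed as baseName in the script from step 2.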

ADD COMMENT • link 20.8 years ago A.J. Rossini ▴ 810

I tried running this but got an error:

    > library(AnnBuilder)
    > myBaseType <- "gb"
    > myDir <- "C:/Temp"
    > myBase <- "C:/Temp/tempFile.txt"
    > mySrcUrls <- getSrcUrl(src = "ALL",organism = "human")
    > ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
    + myBaseType, pkgName = "Hum_Agi1A", pkgPath = myDir,organism =
    + "human", version = "1.0",
    + makeXML = TRUE, author = list(author = "dpritch", maintainer =
    + "dpritch@u.washington.edu"), fromWeb = TRUE)
    [1] "It may take me a while to process the data. Be patient!"
    Warning message:
    cannot open file `C:/R/rw1090beta/library/AnnBuilder/temp/tempOut31783'
    Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) :
      Failed to get or parse LocusLink data because of:
    Error in file(file, "r") : unable to open connection

I had changed this directory from "Read Only" and checked that I had write permissions from within R:

    > setwd("C:/R/rw1090beta/library/AnnBuilder/temp")
    > dir()
    [1] "file24842Tgo.xml" "README"
    > write("Hello")
    > dir()
    [1] "data" "file24842Tgo.xml" "README"

I get the same error if I run example("ABPkgBuilder").

Any suggestions?

Dave.
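In the session above, "data" shows up in dir() because write() defaults to file = "data" when no file is given. A slightly more explicit way to run the same check is sketched below; the helper name can_write_to is made up for this example, and it only tests whether R can create a file in a directory, not whether AnnBuilder itself will succeed there.

```r
# Sketch only: test whether R can create a file in a given directory.
# can_write_to is a hypothetical helper, not part of AnnBuilder.
can_write_to <- function(dir) {
  if (!dir.exists(dir)) return(FALSE)
  probe <- tempfile(pattern = "writetest", tmpdir = dir)
  ok <- suppressWarnings(file.create(probe))
  if (ok) unlink(probe)   # remove the probe file again
  ok
}

# tempdir() stands in for the AnnBuilder temp directory here.
can_write_to(tempdir())
```

Creating and deleting a probe file avoids leaving a stray "data" file behind in the package directory.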

ADD REPLY • link 20.8 years ago Dave Waddell ▴ 160

Dave Waddell ▴ 160

@dave-waddell-323

Last seen 10.4 years ago

I'm struggling with the same problem and using the command line version of Matchminer right now.

http://discover.nci.nih.gov/matchminer/html/command.jsp

For example, I have a list of Genbank Accession numbers (Matchminer will take a whole slew of inputs and produce almost any output) as follows. Suppose tempFile contains:

    AA936757
    AA683077
    R60193
    AA495846
    AA488391
    AA487582
    AA115076
    N92478
    R43483
    W65461
    R22625
    N64741
    H99588
    AI091770
    N47099
    AA927490
    H93335
    AA460756
    H91651
    R98064
    N92519
    H57309
    AA676254
    R70685
    AA156324
    AA970865
    AA426311
    AI266752

Then running Matchminer (you have to give all of the options on the command line or it will go into interactive mode):

    java -jar MatchMiner.jar -Tlookup -ORhuman -I1accno -IA1genebankaccnumber -OTsymbol -Arefseqnumber -IF1tempFile -OFstdout -HStrue

will produce the output:

    **************************************
    Matchminer Build:115
    Genomics and Bioinformatics Group,NCI NIH
    **************************************
    Input Summary       Value
    Build               115
    Date                Thursday, April 15, 2004
    Operation           Lookup
    Organism            Homo sapiens
    Input Source Name   C:\Temp\tempFile.txt
    Input Type          GenBank Accession Number
    Input Algorithm     GenBank(All inc. RefSeq)
    Output Type         Symbol
    Output Algorithm    RefSeq (DNA, RNA and Protein)

    Lookup Summary:
    22 Items from the input list that has output
    6 Items from the input list with no output
    2 Items from the input list that were not found in the database

    Function       Original  Input GenBank     Output     Mult. Assoc.  GenBank Accession
                   Order     Accession Number  Symbol     in Input      Number Index
    Lookup Output  19        H91651            NM_002041  Y             2357
    Lookup Output  19        H91651            NM_005254  Y             2357
    Lookup Output  19        H91651            NM_016654  Y             2357
    Lookup Output  19        H91651            NM_016655  Y             2357
    Lookup Output  19        H91651            NM_181427  Y             2357
    Lookup Output  19        H91651            NM_017976  Y             2357
    Lookup Output  17        H93335            NM_022465  -             16249
    Lookup Output  13        H99588            NM_002285  -             3642
    Lookup Output  5         AA488391          NM_005875  -             9334
    Lookup Output  11        R22625            NM_001799  -             959
    Lookup Output  10        W65461            NM_004419  -             1694
    Lookup Output  7         AA115076          NM_006079  -             9407
    Lookup Output  6         AA487582          NM_000127  -             1952
    Lookup Output  2         AA683077          NM_002745  -             5222
    Lookup Output  2         AA683077          NM_138957  -             5222
    Lookup Output  25        AA156324          NM_004613  -             6604
    Lookup Output  25        AA156324          NM_198951  -             6604
    Lookup Output  26        AA970865          NM_021167  -             15857
    Lookup Output  3         R60193            NM_014423  -             11919
    Lookup Output  15        N47099            NM_005901  -             3811
    Lookup Output  16        AA927490          NM_005419  Y             6344
    Lookup Output  16        AA927490          NM_001638  Y             6344
    Lookup Output  27        AA426311          NM_004527  -             3937
    Lookup Output  27        AA426311          NM_013999  -             3937
    Lookup Output  19        H91651            NM_017976  Y             14380
    Lookup Output  19        H91651            NM_181427  Y             14380
    Lookup Output  19        H91651            NM_002041  Y             14380
    Lookup Output  19        H91651            NM_005254  Y             14380
    Lookup Output  19        H91651            NM_016654  Y             14380
    Lookup Output  19        H91651            NM_016655  Y             14380
    Lookup Output  9         R43483            NM_000210  -             3408
    Lookup Output  22        H57309            NM_003068  -             6168
    Lookup Output  16        AA927490          NM_001638  Y             302
    Lookup Output  16        AA927490          NM_005419  Y             302
    Lookup Output  18        AA460756          NM_005056  -             5540
    Lookup Output  1         AA936757          NM_005130  -             9059
    Lookup Output  4         AA495846          NM_001453  -             2109
    No Output      21        N92519            -          -             325911
    No Output      20        R98064            -          -             63470
    No Output      28        AI266752          -          -             49752
    No Output      8         N92478            -          -             60114
    No Output      23        AA676254          -          -             603790
    No Output      14        AI091770          -          -             63315
    No GeneIndex   12        N64741            -          -             -
    No GeneIndex   24        R70685            -          -             -

For your example, substitute " -OTunigene -Aunigenenumber" for the output. I have had less success with Unigene IDs and in fact your example should produce:

    Hs.429506

But it gives:

    java -jar MatchMiner.jar -Tlookup -ORhuman -I1accno -IA1genebankaccnumber -OTunigene -Aunigenenumber -IF1C:\Temp\tempFile.txt -OFstdout -HStrue

    **************************************
    Matchminer Build:115
    Genomics and Bioinformatics Group,NCI NIH
    **************************************
    Input Summary       Value
    Build               115
    Date                Thursday, April 15, 2004
    Operation           Lookup
    Organism            Homo sapiens
    Input Source Name   C:\Temp\tempFile.txt
    Input Type          GenBank Accession Number
    Input Algorithm     GenBank(All inc. RefSeq)
    Output Type         UniGene Cluster Id
    Output Algorithm    Active UniGene Cluster Ids

    Lookup Summary:
    0 Items from the input list that has output
    0 Items from the input list with no output
    1 Items from the input list that were not found in the database

    Function       Original  Input GenBank     Output UniGene  Mult. Assoc.  GenBank Accession
                   Order     Accession Number  Cluster Id      in Input      Number Index
    No GeneIndex   1         NM_004551         -               -             -

Dave.

ADD COMMENT • link 20.8 years ago Dave Waddell ▴ 160

Well I take that back, on a second run it correctly produces (I guess I caught it at a bad time ;-):

    java -jar MatchMiner.jar -Tlookup -ORhuman -I1accno -IA1genebankaccnumber -OTunigene -Aunigenenumber -IF1C:\Temp\tempFile.txt -OFstdout -HStrue

    **************************************
    Matchminer Build:115
    Genomics and Bioinformatics Group,NCI NIH
    **************************************
    Input Summary       Value
    Build               115
    Date                Thursday, April 15, 2004
    Operation           Lookup
    Organism            Homo sapiens
    Input Source Name   C:\Temp\tempFile.txt
    Input Type          GenBank Accession Number
    Input Algorithm     GenBank(All inc. RefSeq)
    Output Type         UniGene Cluster Id
    Output Algorithm    Active UniGene Cluster Ids

    Lookup Summary:
    2 Items from the input list that has output
    0 Items from the input list with no output
    0 Items from the input list that were not found in the database

    Function       Original  Input GenBank     Output UniGene  Mult. Assoc.  GenBank Accession
                   Order     Accession Number  Cluster Id      in Input      Number Index
    Lookup Output  1         NM_004551         Hs.429506       Y             20092
    Lookup Output  1         NM_004551         Hs.429506       Y             4407

Dave.
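The lookup above reports the same GenBank-to-UniGene pair twice, once per index entry. A minimal base R sketch of collapsing such rows into one mapping per accession follows; the data frame is typed in by hand from the two result rows, and its column names (gb, ug, index) are chosen here for illustration rather than produced by MatchMiner.

```r
# Sketch only: collapse repeated lookup rows into one mapping per accession.
# The rows are transcribed from the MatchMiner result for NM_004551.
hits <- data.frame(
  gb    = c("NM_004551", "NM_004551"),
  ug    = c("Hs.429506", "Hs.429506"),
  index = c(20092, 4407),
  stringsAsFactors = FALSE
)

# Drop the index column and keep each (accession, cluster) pair once.
pairs <- unique(hits[, c("gb", "ug")])

# One list element per accession; accessions in several clusters keep them all.
gb2ug <- split(pairs$ug, pairs$gb)
gb2ug[["NM_004551"]]
```

Using split() rather than a plain named vector keeps every cluster for accessions such as H91651 that map to more than one output.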

ADD REPLY • link 20.8 years ago Dave Waddell ▴ 160

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

You probably need to update your AnnBuilder. A recent version was using the system temp directory instead of the AnnBuilder temp directory, which didn't work well on Win32. AFAIK, the current devel version of AnnBuilder has been rolled back to use the AnnBuilder temp dir.

As an aside, if all you need is GB -> UG mappings, it is probably overkill to use ABPkgBuilder in this way, which is going to parse LocusLink and KEGG also (which takes some time). There are two alternatives that I can think of (both untested by me). First, use ABPkgBuilder, but only parse UG by changing the srcUrl to:

    mySrcUrl <- getSrcUrl("UG")

Another possibility is to use the UG class directly. See ?UG.

Best,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
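If only the GB -> UG mapping is wanted, the UniGene flat file can also be read without AnnBuilder. The base R sketch below is not the UG class mentioned above. It assumes the Hs.data layout in which a cluster starts with an "ID" line, lists its members on "SEQUENCE" lines carrying an "ACC=" field, and ends with "//". The function name parse_unigene and the three toy lines are made up for illustration, built from the one pair reported in this thread.

```r
# Sketch only: build a GenBank -> UniGene table from Hs.data-style lines.
# Assumed layout: "ID <cluster>", then "SEQUENCE ACC=<accession>; ...", then "//".
parse_unigene <- function(lines) {
  cluster <- NA_character_
  gb <- character(0)
  ug <- character(0)
  for (ln in lines) {
    if (grepl("^ID[[:space:]]", ln)) {
      # Start of a cluster record: remember its ID.
      cluster <- trimws(sub("^ID[[:space:]]+", "", ln))
    } else if (grepl("^SEQUENCE[[:space:]]", ln) && grepl("ACC=", ln)) {
      # Take the accession and drop any ".version" suffix.
      acc <- sub("^.*ACC=([^;.[:space:]]+).*$", "\\1", ln)
      gb <- c(gb, acc)
      ug <- c(ug, cluster)
    } else if (ln == "//") {
      cluster <- NA_character_   # end of record
    }
  }
  data.frame(gb = gb, ug = ug, stringsAsFactors = FALSE)
}

# Toy input, not real file contents.
toy <- c("ID          Hs.429506",
         "SEQUENCE    ACC=NM_004551.1; SEQTYPE=mRNA",
         "//")
map <- parse_unigene(toy)
map$ug[map$gb == "NM_004551"]
```

For a downloaded file, something like parse_unigene(readLines(gzfile("Hs.data.gz"))) would be the starting point, although growing vectors in a loop will be slow on the full file.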

ADD COMMENT • link 20.8 years ago James W. MacDonald 67k

Dave Waddell ▴ 160

@dave-waddell-323

Last seen 10.4 years ago

That is closer, thanks.

    library(AnnBuilder)
    myBaseType <- "gb"
    myDir <- "C:/Temp"
    myBase <- "C:/Temp/tempFile.txt"
    mySrcUrl <- getSrcUrl("UG")
    ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
    myBaseType, pkgName = "Hum_Agi1A", pkgPath = myDir,organism = "human",
    version = "1.0", makeXML = TRUE, author = list(author = "dpritch",
    maintainer ="dpritch@u.washington.edu"), fromWeb = TRUE)

I see that this is now writing to C:/R/rw1090beta/library/AnnBuilder/data/ instead of temp, and I also had to unzip C:/R/rw1090beta/library/AnnBuilder/data/Rdata.zip as it couldn't find AnnInfo:

    In addition: Warning message:
    cannot open file `C:/R/rw1090beta/library/AnnBuilder/data/AnnInfo'

Also the example still fails with:

    Error in file(file, "r") : unable to open connection
    In addition: Warning message:
    cannot open file `C:/R/rw1090beta/library/AnnBuilder/temp/tempOut27202'
    Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) :
      Failed to get or parse LocusLink data because of:
    Error in file(file, "r") : unable to open connection

I also get a few Perl error messages:

    Scalar value @vals[1] better written as $vals[1] at C:\R\rw1090beta\library\AnnBuilder\temp\tempPerl28396.pl line 16.
    Use of uninitialized value in split at C:\R\rw1090beta\library\AnnBuilder\temp\tempPerl28396.pl line 16, <base> line 1.
    Use of uninitialized value in split at C:\R\rw1090beta\library\AnnBuilder\temp\tempPerl23771.pl line 16, <base> line 1.
    Useless use of a variable in void context at C:\R\rw1090beta\library\AnnBuilder\temp\tempPerl7801.pl line 38.

It also downloaded LL_tmpl.gz twice (refGene.txt.gz, refLink.txt.gz, Hs.data.gz, and go_200403-termdb.xml.gz once) and finally failed after 45 minutes with:

    > ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
    myBaseType, pkgName = "Hum_Agi1A", pkgPath = myDir,organism = "human",
    version = "1.0", makeXML = TRUE, author = list(author = "dpritch",
    maintainer ="dpritch@u.washington.edu"), fromWeb = TRUE)
    [1] "It may take me a while to process the data. Be patient!"
    Error in url(paste(srcUrl, exten, sep = ""), "r") :
      unable to open connection

Examining the failed command gives:

    > url(paste(srcUrl, exten, sep = ""), "r")
    Error in paste(srcUrl, exten, sep = "") : Object "exten" not found

Has anyone got this running in Windows?

Dave.

ADD COMMENT • link 20.8 years ago Dave Waddell ▴ 160

Dave Waddell ▴ 160

@dave-waddell-323

Last seen 10.4 years ago

The output from:

    mySrcUrl <- getSrcUrl("UG")

is

    > mySrcUrl
    [1] "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"

This is rejected by ABPkgBuilder:

    Error in toupper(x) : non-character argument to toupper()

When getSrcUrl has the ALL argument it gives:

    > mySrcUrls <- getSrcUrl(src = "ALL",organism = "human")
    > mySrcUrls
    LL    "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz"
    GP    "http://www.genome.ucsc.edu/goldenPath/hg16/database/"
    UG    "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
    GO    "http://www.godatabase.org/dev/database/archive/2004-03-01/go_200403-termdb.xml.gz"
    KEGG  "ftp://ftp.genome.ad.jp/pub/kegg/pathways"
    YG    "http://www.yeastgenome.org/DownloadContents.shtml"
    HG    "ftp://ftp.ncbi.nih.gov/pub/HomoloGene/hmlg.ftp"

So I thought I might cheat and use:

    mySrcUrl <- mySrcUrls[3]
    > mySrcUrls[3]
    UG    "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"

As you can see this gets rejected as well:

    Error in loadFromUrl(srcUrl(object), dist) :
      URL NA is incorrect or the target site is not responding!
    Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) :
      Failed to get or parse UniGene data becaus of:
    Error in loadFromUrl(srcUrl(object), dist) :
      URL NA is incorrect or the target site is not responding!

Is it possible to use annotation that was created on Linux in the Windows environment? If so, does anyone want to donate it?

Thanks,
Dave.
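The two objects tried above differ in one way that plain R can show: getSrcUrl("UG") printed an unnamed string, while mySrcUrls[3] keeps its "UG" name but no longer has the other entries. The sketch below only illustrates how lookup by name behaves on such vectors. Whether ABPkgBuilder looks sources up by name internally is an assumption made here, not something confirmed in this thread.

```r
# Sketch only: name-based lookup on full, subsetted and unnamed vectors.
# URLs are copied from the getSrcUrl output shown in the post.
src <- c(LL = "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz",
         UG = "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz")

ug_only <- src["UG"]       # still named "UG", but the LL entry is gone
plain   <- unname(ug_only) # same string with the name stripped

names(ug_only)             # "UG"
names(plain)               # NULL
ug_only["LL"]              # NA: no element with that name
plain["UG"]                # NA: an unnamed vector has nothing to match
```

A lookup by a missing name returns NA instead of raising an error, which would at least be consistent with the "URL NA" text in the message above.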

ADD COMMENT • link 20.8 years ago Dave Waddell ▴ 160

A.J. Rossini ▴ 810

@aj-rossini-209

Last seen 10.4 years ago

Dave - Sorry to have led you on a wild goose chase. We've been much more successful on Linux builds. One solution was to use pre-downloaded files, but I can't quickly find our mini-script that did that (it removed the download hassle, especially if you are working with genes which probably aren't changing).

"Dave Waddell" <dwaddell@nutecsciences.com> writes:

> The output from:
>
>   mySrcUrl <- getSrcUrl("UG")
>
> is
>
>   > mySrcUrl
>   [1] "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
>
> This is rejected by ABPkgBuilder:
>
>   "Error in toupper(x) : non-character argument to toupper()"
>
> When getSrcUrl is given the ALL argument it returns:
>
>   > mySrcUrls <- getSrcUrl(src = "ALL", organism = "human")
>   > mySrcUrls
>   LL
>   "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz"
>   GP
>   "http://www.genome.ucsc.edu/goldenPath/hg16/database/"
>   UG
>   "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
>   GO
>   "http://www.godatabase.org/dev/database/archive/2004-03-01/go_200403-termdb.xml.gz"
>   KEGG
>   "ftp://ftp.genome.ad.jp/pub/kegg/pathways"
>   YG
>   "http://www.yeastgenome.org/DownloadContents.shtml"
>   HG
>   "ftp://ftp.ncbi.nih.gov/pub/HomoloGene/hmlg.ftp"
>
> So I thought I might cheat and use:
>
>   > mySrcUrl <- mySrcUrls[3]
>   > mySrcUrls[3]
>   UG
>   "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
>
> As you can see, this gets rejected as well:
>
>   Error in loadFromUrl(srcUrl(object), dist) :
>   URL NA is incorrect or the target site is not responding!
>   Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) :
>   Failed to get or parse UniGene data becaus of:
>
>   Error in loadFromUrl(srcUrl(object), dist) :
>   URL NA is incorrect or the target site is not responding!
>
> Is it possible to use annotation that was created on Linux in the Windows
> environment? If so, does anyone want to donate it?
>
> Thanks, Dave.
>
> -----Original Message-----
> From: James MacDonald [mailto:jmacdon@med.umich.edu]
> Sent: Thursday, April 15, 2004 9:52 AM
> To: dwaddell@nutecsciences.com; bioconductor@stat.math.ethz.ch
> Subject: RE: [BioC] Genbank to Unigene IDs
>
> You probably need to update your AnnBuilder. A recent version was using
> the system temp directory instead of the AnnBuilder temp directory,
> which didn't work well on Win32. AFAIK, the current devel version of
> AnnBuilder has been rolled back to use the AnnBuilder temp dir.
>
> As an aside, if all you need is GB -> UG mappings, it is probably
> overkill to use ABPkgBuilder in this way, which is going to parse
> LocusLink and KEGG also (which takes some time). There are two
> alternatives that I can think of (both untested by me). First, use
> ABPkgBuilder, but only parse UG by changing the srcUrl to:
>
>   mySrcUrl <- getSrcUrl("UG")
>
> Another possibility is to use the UG class directly. See ?UG.
>
> Best,
>
> Jim
>
> James W. MacDonald
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
>
> >>> "Dave Waddell" <dwaddell@nutecsciences.com> 04/15/04 10:37AM >>>
> I tried running this but got an error:
>
> > library(AnnBuilder)
> > myBaseType <- "gb"
> > myDir <- "C:/Temp"
> > myBase <- "C:/Temp/tempFile.txt"
> > mySrcUrls <- getSrcUrl(src = "ALL", organism = "human")
> > ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
> + myBaseType, pkgName = "Hum_Agi1A", pkgPath = myDir, organism =
> + "human", version = "1.0",
> + makeXML = TRUE, author = list(author = "dpritch", maintainer =
> + "dpritch@u.washington.edu"), fromWeb = TRUE)
> [1] "It may take me a while to process the data. Be patient!"
> Warning message:
> cannot open file `C:/R/rw1090beta/library/AnnBuilder/temp/tempOut31783'
>
> Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) :
> Failed to get or parse LocusLink data because of:
>
> Error in file(file, "r") : unable to open connection
>
> I had changed this directory from "Read Only" and checked that I had
> write permissions from within R:
>
> > setwd("C:/R/rw1090beta/library/AnnBuilder/temp")
> > dir()
> [1] "file24842Tgo.xml" "README"
> > write("Hello")
> > dir()
> [1] "data" "file24842Tgo.xml" "README"
>
> I get the same error if I run example("ABPkgBuilder").
>
> Any suggestions?
>
> Dave.

ADD COMMENT • link 20.8 years ago A.J. Rossini ▴ 810

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Dave,

If all you want is GB -> UG mappings, then you can use the UG class in AnnBuilder. As usual, you need a two-column text file with some sort of identifier in the first column and the accession numbers in the second. You then have to open the Perl script gbUGparser, which is in rw1090\library\annbuilder\scripts, and change the two instances of LOCUSLINK to ID. Save the edited file under a new name such as gbUGparserTest (remove the .txt extension if WordPad/Notepad adds one).

Back in R, try this:

  tst <- UG(srcUrl=getSrcUrl("UG"), parser="C:/r/rw1090/library/annbuilder/scripts/gbUGparserTest", baseFile="myBase")
  Data <- parseData(tst)

This will only download the Hs.data.gz file, so it should be much quicker than using ABPkgBuilder(). Also, if you look in the code for ABPkgBuilder, there is a much more elegant way to get the path to the Perl script. However, when I am just trying to get stuff to work, I find that ugly and straightforward is the way to go :).

HTH,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

ADD COMMENT • link 20.8 years ago James W. MacDonald 67k

John Zhang ★ 2.9k

@john-zhang-6

Last seen 10.4 years ago

Sorry for this delayed posting (I took a day off yesterday).

I think the most direct way of getting the IDs mapped is to use the sources available at LocusLink (ftp://ftp.ncbi.nih.gov/refseq/LocusLink). If your target file contains GenBank accession numbers (e.g. "AC010642", "AC010642", ...), read ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc using read.table (sep = "\t") and then do a matching. If your target file contains RefSeq IDs (e.g. "NM_130786", "NM_000014", ...), read ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref instead.

An example:

> ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> ids2ll <- as.matrix(read.table("ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
+     header = FALSE, sep = "\t", strip.white = TRUE))
# We only need the second and third column
> ids2ll <- ids2ll[, c(2, 3)]
> colnames(ids2ll) <- c("GB", "LL")
# Drop the version number
> ids2ll[,1] <- gsub("\\..*", "", ids2ll[,1])
> mapped <- ids2ll[is.element(ids2ll[,1], ids),]
> mapped
      GB         LL
1     "AC010642" "-"
4     "AF414429" "15778556"
10671 "X56654"   "30506"
10677 "Y08432"   "-"

Jianhua Zhang
Department of Biostatistics
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084

ADD COMMENT • link 20.7 years ago John Zhang ★ 2.9k

Dear John,

Thanks for your suggestion. I can see the attraction of going through LocusLink because the LocusLink files are relatively small. But the fact that LocusLink covers only a subset of GenBank (as pointed out by Dave Waddell) seems disastrous.

I tried your code on a set of GenBank IDs from a human oligo array based on the Compugen 19k library. The code found LocusLink IDs for only 4587 of the GenBank IDs. Meanwhile, SOURCE found UniGene IDs for 16230 of them. So going through LocusLink found the UniGene ID in less than 30% of the cases in which there was one to find.

Gordon

ADD REPLY • link 20.7 years ago Gordon Smyth 52k

John Zhang ★ 2.9k

@john-zhang-6

Last seen 10.4 years ago

Sorry, the example code should be

> ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> ids2ll <- as.matrix(read.table("ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
+     header = FALSE, sep = "\t"))
# We only need the first and second column
> ids2ll <- ids2ll[, c(1, 2)]
> colnames(ids2ll) <- c("LL", "GB")
# Drop the version number
> ids2ll[,2] <- gsub("\\..*", "", ids2ll[,2])
> mapped <- ids2ll[is.element(ids2ll[,2], ids),]
> mapped
      LL      GB
1     " 1"    "AC010642"
4     " 1"    "AF414429"
10671 " 1828" "X56654"
10677 " 1830" "Y08432"
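The matching step in the corrected example can be tried without the FTP download. What follows is only a minimal sketch, not the poster's code: the helper name mapAccToLL and the small loc2acc table are made up for illustration (the LocusLink IDs are the ones printed in the corrected output, the version suffixes are invented), and it assumes a two-column layout with the LocusLink ID first and the versioned accession second. It uses match() so that the result keeps the order of the input and unmapped accessions come back as NA.

```r
# Toy stand-in for the loc2acc file: LocusLink ID, versioned accession.
# The version suffixes here are invented for illustration.
loc2acc <- data.frame(
  LL = c(1, 1, 1828, 1830),
  GB = c("AC010642.4", "AF414429.1", "X56654.1", "Y08432.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map accession numbers to LocusLink IDs.
mapAccToLL <- function(ids, table) {
  # Drop the version number (everything from the first dot onwards)
  acc <- sub("\\..*$", "", table$GB)
  # match() gives the first hit for each id and NA where there is none
  ll <- table$LL[match(ids, acc)]
  names(ll) <- ids
  ll
}

ids <- c("X56654", "AA683077", "AC010642")
mapped <- mapAccToLL(ids, loc2acc)
print(mapped)
```

Note that match() keeps only the first hit per accession, so an accession listed under several loci would need a different approach (for example split() on the accession column).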

ADD COMMENT • link 20.7 years ago John Zhang ★ 2.9k

John Zhang ★ 2.9k

@john-zhang-6

Last seen 10.4 years ago

I forgot to mention in my previous email that once you have the mappings from your target IDs to LocusLink IDs, you can use them to get UniGene IDs via another file, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2UG.
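The second hop, from LocusLink ID to UniGene ID, might look roughly like this. It is a sketch under assumptions: the loc2UG file is taken to have two columns (LocusLink ID, UniGene cluster ID), the table below is a toy stand-in for it, and the cluster IDs are placeholders rather than real UniGene assignments. A left join with merge() keeps accessions that have no LocusLink entry, so the gaps stay visible.

```r
# Hypothetical accession -> LocusLink pairs, as produced by the loc2acc step.
gb2ll <- data.frame(GB = c("X56654", "Y08432", "AA683077"),
                    LL = c(1828, 1830, NA),
                    stringsAsFactors = FALSE)

# Toy stand-in for loc2UG: LocusLink ID, UniGene cluster ID.
# These cluster IDs are placeholders, not real UniGene assignments.
loc2ug <- data.frame(LL = c(1828, 1830),
                     UG = c("Hs.PLACEHOLDER1", "Hs.PLACEHOLDER2"),
                     stringsAsFactors = FALSE)

# Left join: keep every accession, with UG = NA where no locus is known.
gb2ug <- merge(gb2ll, loc2ug, by = "LL", all.x = TRUE)
gb2ug <- gb2ug[, c("GB", "LL", "UG")]
print(gb2ug)
```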

ADD COMMENT • link 20.7 years ago John Zhang ★ 2.9k

Dave Waddell ▴ 160

@dave-waddell-323

Last seen 10.4 years ago

There are a number of problems in all of the solutions proposed.

1. Flat files like Hs are huge and grepping them takes forever.
2. Keeping flat files up to date is a waste of bandwidth.
3. The annotation really needs to be in some kind of database such as SOURCE, MatchMiner, DAVID or whatever, with indexes on each field so that searches can complete in a reasonable period of time.
4. HTML-based tools are handy for small searches but useless if you want to perform searches with a large number of terms where you expect to get back parseable data.
5. Many GenBank accession numbers (ESTs in particular) don't map to LocusLink, so going from accession number to LocusLink to UniGene simply doesn't work, e.g. for AA683077.

MatchMiner works for me because I'm calling Rserve and MatchMiner from Java, the response is relatively quick, and I don't have to worry about keeping the data current.

Dave.

-----Original Message-----
From: Gordon Smyth [mailto:smyth@wehi.edu.au]
Sent: Thursday, April 15, 2004 8:48 PM
To: rossini@u.washington.edu; James MacDonald; Dave Waddell; Jean Yee Hwa Yang
Subject: RE: [BioC] Genbank to Unigene IDs

Dear Jean, Tony, James and Dave,

Many thanks for your very helpful replies. Just to reiterate, my interest was to map from GenBank to UniGene IDs within R, i.e., to write a function that will take a character vector or list of GenBank IDs and return the corresponding vector or list of UniGene IDs.

If one ignores R, the easiest way that I know of to map GenBank to UniGene IDs is to download Hs.data.gz and to grep or otherwise search for the GenBank IDs as text strings. (My lab keeps a mirror of the usual databases, so downloading isn't actually required if the code is to be used within my own lab.)

As far as R is concerned, you've described a number of methods by which the job could be done in principle, but no one has shown actual code to answer my example question, "What's Unigene for GB="NM_004551"?" Would it be a fair statement to say that there isn't a reasonably easy way to do the job using Bioconductor, and that I would be better off sticking to the download and grep idea (which of course could be done within R if need be)?

Cheers
Gordon

PS. There seems to be no way to use AnnBuilder in R 1.9.0 for Windows. Amongst other problems, AnnBuilder won't load without the XML package, and that package is not available for R 1.9.0 under Windows.
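The download-and-grep idea in the quoted message can be done in base R by parsing the flat file once instead of searching it once per ID. The sketch below is not from the thread and rests on assumptions: it assumes the Hs.data record layout is an "ID" line giving the cluster, "SEQUENCE" lines carrying "ACC=...", and a "//" terminator; the helper name buildAcc2UG is hypothetical; and the cluster IDs and accessions in the sample lines are invented.

```r
# A few lines in the style of a UniGene Hs.data flat file. The record layout
# is an assumption and the cluster IDs and accessions below are invented.
hsLines <- c(
  "ID          Hs.000001",
  "TITLE       hypothetical cluster one",
  "SEQUENCE    ACC=XX000001.1; NID=g0000001; SEQTYPE=mRNA",
  "SEQUENCE    ACC=XX000002.2; NID=g0000002; SEQTYPE=EST",
  "//",
  "ID          Hs.000002",
  "TITLE       hypothetical cluster two",
  "SEQUENCE    ACC=XX000003.1; NID=g0000003; SEQTYPE=mRNA",
  "//"
)

# Hypothetical helper: build a named vector, accession -> cluster ID.
buildAcc2UG <- function(lines) {
  isID  <- grepl("^ID[[:space:]]", lines)
  isSeq <- grepl("^SEQUENCE[[:space:]]", lines)
  clusterIDs <- sub("^ID[[:space:]]+", "", lines[isID])
  # cumsum(isID) numbers each line by the record it belongs to;
  # lines before the first ID line (record 0) stay NA
  rec <- cumsum(isID)
  clusterOfLine <- rep(NA_character_, length(lines))
  clusterOfLine[rec > 0] <- clusterIDs[rec[rec > 0]]
  # Pull out the accession and drop its version suffix
  acc <- sub("^SEQUENCE[[:space:]]+ACC=([^.;]+).*$", "\\1", lines[isSeq])
  ug <- clusterOfLine[isSeq]
  names(ug) <- acc
  ug
}

acc2ug <- buildAcc2UG(hsLines)
gb <- c("XX000003", "XX000001", "NOTTHERE")
result <- unname(acc2ug[gb])
print(result)
```

With a local copy of the real file, the lines could be read with readLines(gzfile("Hs.data.gz")) and passed to the same function; whether the real file matches the layout assumed here would need checking first.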

ADD COMMENT • link 20.7 years ago Dave Waddell ▴ 160

rgentleman ★ 5.5k

@rgentleman-7725

Last seen 9.7 years ago

United States

On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:

> There are a number of problems in all of the solutions proposed.
> 1. Flat files like Hs are huge and grepping them takes forever.

Yes, but I don't think that anyone is doing that for a production system (for a one-off, it may in fact be more efficient, depending on how you measure efficiency).

> 2. Keeping flat files up to date is a waste of bandwidth.

Is there really an option, given that you want to keep up to date? I know of no standard diff format that would allow us to keep up to date. Virtually every one of the important public databases uses different formats and conventions. But if there is one, please do let us know.

> 3. The annotation really needs to be in some kind of database such as
> SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
> searches can complete in a reasonable period of time.

Yes, and you can easily do that locally, if that is what you want, or do it over the net. The advantage of a local database is that you have faster access and you can tailor it to your needs. Another option would be to treat these as web services. I do not think that they support it, although your comments below suggest that they might; my scanning of the relevant web pages turned up no clear callable interface, but I certainly could have missed something. If one exists, then this can be made very simple using the XML packages and R's connections (no need for Java, nor any need to exclude it either, if it is your favorite language).

> 4. HTML based tools are handy for small searches but useless if you want to
> perform searches with a large number of terms where you expect to get back
> parseable data.

Yes, XML is preferable and many of these DBs could provide it with little extra effort, but I think we need to start asking them to do so.

> 5. Many Genbank Accession numbers (ESTs in particular) don't map to
> Locuslink therefore going from Accession number to Locuslink to Unigene
> simply doesn't work i.e. AA683077.

A very good point.

> Matchminer works for me because I'm calling Rserve and Matchminer from Java,
> the response is relatively quick, and I don't have to worry about keeping
> the data current.

Yes, but you do have to worry about repeatability (if they update between queries). Do they always tell you, and can you determine which actual data resources they used? I'm not saying you cannot, just raising one of the points of difference between a locally amalgamated and managed meta-data resource and an online one. There are good points for both (and bad points for both). Doing your own amalgamation allows for more control over how disparate data sources get merged (and for some folks that is important).

Thanks for the interesting comments,

Robert

--
Robert Gentleman, Associate Professor
Department of Biostatistics, Harvard School of Public Health
phone: (617) 632-5250, fax: (617) 632-2444, office: M1B20
email: rgentlem@jimmy.harvard.edu
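The two points made in this reply, an indexed local lookup and knowing which data release an answer came from, can be illustrated in a few lines of base R. This is only a sketch under assumptions, not anything proposed in the thread: the helper names buildIndex and lookupUG are hypothetical, the accession and cluster IDs are invented, and a hashed environment stands in for a real database index.

```r
# Hypothetical accession -> UniGene pairs; in practice these would come from
# a parsed flat file or a database dump.
pairs <- data.frame(GB = c("XX000001", "XX000002", "XX000003"),
                    UG = c("Hs.000001", "Hs.000001", "Hs.000002"),
                    stringsAsFactors = FALSE)

# Build the index once: a hashed environment gives fast lookup by key.
buildIndex <- function(pairs, sourceLabel) {
  idx <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_len(nrow(pairs))) {
    assign(pairs$GB[i], pairs$UG[i], envir = idx)
  }
  # Record what the index was built from, for repeatability
  list(index = idx, source = sourceLabel, built = Sys.Date())
}

# Look up a vector of accessions; unknown accessions give NA.
lookupUG <- function(db, ids) {
  out <- vapply(ids, function(id) {
    if (exists(id, envir = db$index, inherits = FALSE)) {
      get(id, envir = db$index, inherits = FALSE)
    } else {
      NA_character_
    }
  }, character(1))
  unname(out)
}

db <- buildIndex(pairs, sourceLabel = "toy data, not a real UniGene build")
res <- lookupUG(db, c("XX000003", "ZZ999999"))
print(res)
```

Storing the source label and build date next to the index is one simple way to tell later which snapshot produced a given mapping.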

ADD COMMENT • link 20.7 years ago rgentleman ★ 5.5k

There are other issues as well, such as licensing.

For DAVID: http://david.niaid.nih.gov/david/ease.htm

For SOURCE: There are no restrictions on its use by non-profit institutions as long as its content is in no way modified and this statement is not removed. Usage by and for commercial entities requires a license agreement (see http://www.isb-sib.ch/announce/ or send an email to license@isb-sib.ch).

And for GOMiner/MatchMiner, Barry Zeeberg [zeebergb@mail.nih.gov] says:

Unofficially, pending any corrections from David Kane, as far as I know, there are no restrictions on either. At the moment, neither is available as open source, and we are engaged internally in making a decision about this issue. Both programs have command line interfaces, which allow a great deal of flexibility in incorporating them in your own custom data processing stream. There is no restriction whatever on how you choose to do so. Our basic idea was to make these as freely available as possible, without even requiring free registration, to lower the barrier to someone using it. There are frequent updates, as we either fix a problem, add a feature, or make changes required by changes in external databases from which these programs draw information, so it is advisable to be on our email list to be kept up to date.

This is an important issue, for me at least, as we annotate microarrays to GO (and many other databases). IMHO, having one of these databases available from within Bioconductor would greatly increase its value as a tool for carrying out a complete analysis. A single authoritative database that consistently provided results and was maintained by a competent organization could reduce the need to download flat files. MatchMiner is not 100% reliable right now, as can be seen in the output from one of the earlier posts in this thread, but with a little effort (assuming they go open source) this could be fixed. XML output would definitely be a boon.

Dave.
-----Original Message-----
From: Robert Gentleman [mailto:rgentlem@jimmy.harvard.edu]
Sent: Monday, April 19, 2004 1:23 PM
To: Dave Waddell
Cc: Bioconductor
Subject: Re: [BioC] Genbank to Unigene IDs

On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:
> There are a number of problems in all of the solutions proposed.
> 1. Flat files like Hs are huge and grepping them takes forever.

Yes, but I don't think that anyone is doing that for a production system (for a one-off, it may in fact be more efficient, depending on how you measure efficiency).

> 2. Keeping flat files up to date is a waste of bandwidth.

Is there really an option, given that you want to keep up to date? I know of no standard diff format that would allow us to keep up to date. Virtually every one of the important public databases uses different formats and conventions. But if so, please do let us know.

> 3. The annotation really needs to be in some kind of database such as
> SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
> searches can complete in a reasonable period of time.

Yes, and you can easily do that locally, if that is what you want, or do it over the net. The advantage to local is that you have faster access and you can tailor the database to your needs.

Another option would be to treat these as web services (I do not think that they support it, although your comments below suggest that they might; my scanning of the relevant webpages turned up no clear callable interface, but I certainly could have missed something). If one exists, then this can be made very simple using the XML packages and R's connections (no need for Java, nor any need to exclude it either, if it is your favorite language).

> 4. HTML based tools are handy for small searches but useless if you want to
> perform searches with a large number of terms where you expect to get back
> parseable data.

Yes, XML is preferable and many of these DBs could provide it with little extra effort, but I think we need to start asking them to do so.

> 5. Many Genbank Accession numbers (ESTs in particular) don't map to
> Locuslink therefore going from Accession number to Locuslink to Unigene
> simply doesn't work, e.g. AA683077.

A very good point.

> Matchminer works for me because I'm calling Rserve and Matchminer from Java,
> the response is relatively quick, and I don't have to worry about keeping
> the data current.

Yes, but you do have to worry about repeatability (if they update between queries). Do they always tell you, and can you determine which actual data resources they used? I'm not saying you cannot, just raising one of the points of difference between a locally amalgamated and managed meta-data resource and an on-line one. There are good points for both (and bad points for both).

Doing your own amalgamation allows for more control over how disparate data sources get merged (and for some folks that is important).

Thanks for the interesting comments,
Robert

> Dave.
>
> -----Original Message-----
> From: Gordon Smyth [mailto:smyth@wehi.edu.au]
> Sent: Thursday, April 15, 2004 8:48 PM
> To: rossini@u.washington.edu; James MacDonald; Dave Waddell; Jean Yee Hwa Yang
> Subject: RE: [BioC] Genbank to Unigene IDs
>
> Dear Jean, Tony, James and Dave,
>
> Many thanks for your very helpful replies. Just to re-iterate, my interest
> was to map from GenBank to UniGene IDs within R, i.e., write a function
> that will take a character vector or list of GenBank IDs and will return
> the corresponding vector or list of UniGene IDs.
>
> If one ignores R, the easiest way that I know of to map GenBank to
> UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for
> the GenBank IDs as text strings. (My lab keeps a mirror of the usual
> databases, so downloading isn't actually required if the code is to be used
> within my own lab.)
>
> As far as R is concerned, you've described a number of methods by which
> the job could be done in principle, but no one has shown actual code to
> answer my example question, "What's Unigene for GB="NM_004551"?" Would it be
> a fair statement to say that there isn't a reasonably easy way to do the
> job using Bioconductor, and I would be better to stick to the download and
> grep idea (which of course could be done within R if need be)?
>
> Cheers
> Gordon
>
> PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst
> other problems, AnnBuilder won't load without the XML package, and that
> package is not available for R 1.9.0 under Windows.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

--
Robert Gentleman, Associate Professor
Department of Biostatistics, Harvard School of Public Health
phone: (617) 632-5250, fax: (617) 632-2444, office: M1B20
email: rgentlem@jimmy.harvard.edu

ADD REPLY • link 20.7 years ago Dave Waddell ▴ 160
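
Gordon's request quoted above (a function that takes a character vector of GenBank IDs and returns the matching UniGene IDs) and Robert's point about indexing the data locally can be combined in a few lines of base R. The following is only a sketch, not code from anyone in the thread. It assumes the UniGene flat file uses `ID` lines to open a cluster record, `SEQUENCE ... ACC=<accession>.<version>;` lines for member sequences, and `//` to close a record; that layout is an assumption here and should be checked against the real Hs.data. The function names `build_gb2ug` and `genbank_to_unigene` are hypothetical, and the toy accessions and cluster IDs are invented.

```r
# Toy records in the assumed UniGene flat-file layout.
# The accessions and cluster IDs below are invented for illustration.
hs_lines <- c(
  "ID          Hs.900001",
  "TITLE       toy cluster one",
  "SEQUENCE    ACC=ZZ000001.1; NID=g1; SEQTYPE=mRNA",
  "SEQUENCE    ACC=ZZ000002.2; NID=g2; SEQTYPE=EST",
  "//",
  "ID          Hs.900002",
  "TITLE       toy cluster two",
  "SEQUENCE    ACC=ZZ000003.1; NID=g3; SEQTYPE=EST",
  "//"
)

# Build a named character vector: names are GenBank accessions (version
# suffix stripped), values are the cluster IDs whose records contain them.
build_gb2ug <- function(lines) {
  is_id  <- grepl("^ID[[:space:]]", lines)
  is_seq <- grepl("^SEQUENCE[[:space:]]+ACC=", lines)
  ids <- sub("^ID[[:space:]]+", "", lines[is_id])
  # Carry each cluster ID forward to every line of its record
  # (lines before the first ID line get NA).
  cluster <- c(NA_character_, ids)[cumsum(is_id) + 1L]
  acc <- sub("^SEQUENCE[[:space:]]+ACC=([^.;]+).*$", "\\1", lines[is_seq])
  out <- cluster[is_seq]
  names(out) <- acc
  out
}

# Vectorised lookup; accessions that are not in the map come back as NA.
genbank_to_unigene <- function(gb, map) {
  unname(map[sub("[.].*$", "", gb)])
}

gb2ug  <- build_gb2ug(hs_lines)
result <- genbank_to_unigene(c("ZZ000002", "ZZ000003.1", "ZZ999999"), gb2ug)
print(result)
```

For a real file, `hs_lines` would presumably come from `readLines()` on the downloaded data (through `gzfile()` if it is still compressed). Building the vector once and indexing it by name avoids re-scanning the file for every query, which is the cost Dave objects to in his first point.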

We are very interested in participating with either for-profit or not-for-profit organizations, and feedback on what would be helpful would be fed into our workflow. Any problems with MatchMiner or GoMiner are of concern to us, and we prioritize correcting these.

In addition to the concrete suggestion of XML output, could you elaborate on the MatchMiner unreliability issue? It is possible that we have fixed this already in a not yet released update, but we would like to track and correct any residual problems.

There is a great emphasis now at NIH on technology transfer, and we could all benefit from the successful use of one of our resources in your product.

barry

ADD REPLY • link 20.7 years ago Barry Zeeberg ▴ 20
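
The repeatability concern raised earlier in the thread (results can change when an upstream resource is updated between queries) is easy to guard against for a locally managed file. The sketch below, in base R, records a checksum and a timestamp for whatever annotation file a lookup was built from. It is an illustration under assumptions, not anyone's published method: the function name `snapshot_source`, the list layout, and the stand-in file contents are all hypothetical.

```r
# Record provenance for a local annotation file so that a lookup can be
# tied to the exact data it was built from. All names are hypothetical.
snapshot_source <- function(path) {
  list(
    file      = basename(path),
    md5       = unname(tools::md5sum(path)),  # checksum of the file contents
    retrieved = Sys.time(),                   # when this snapshot was taken
    r_version = R.version.string
  )
}

# Stand-in for a downloaded flat file (contents are invented).
path <- tempfile(fileext = ".data")
writeLines(c("ID          Hs.900001", "//"), path)
before <- snapshot_source(path)

# Simulate an upstream update to the file.
writeLines(c("ID          Hs.900001", "//", "ID          Hs.900002", "//"), path)
after <- snapshot_source(path)

# Differing checksums mean results from the two builds may not be comparable.
changed <- !identical(before$md5, after$md5)
print(changed)
```

Storing such a record next to the mapping gives a later analysis something concrete to compare against. An online service would need to expose an equivalent release or build identifier for the same check to be possible.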

Kane, David ▴ 10

@kane-david-735

Last seen 10.4 years ago

Dave,

We have discussed your suggestion about an XML interface, and we would be interested in including one. In fact, that feature had been on our queue of possible features for some time, but we did not have an obvious consumer for the interface. I have a couple of questions.

1) In reading this thread, it sounds as if there is more interest in the Lookup interface than the Merge interface. Is that correct?

2) What sort of usage level did you expect? I know Bioconductor is very popular, and I want to make sure that if we commit to providing a service for use from Bioconductor, we are able to meet the expected level of usage.

Sincerely,
David Kane

P.S. For those of you who have been cc'd on this note, but not on the other messages between Dave and me, I believe the connection problem that Dave alluded to in his earlier note has been resolved in the latest build on our web site. If there are other issues, please let us know.

ADD COMMENT • link 20.7 years ago Kane, David ▴ 10
