AnnBuilder's parseData() doesn't recognize ACCs with underscore?
Benjamin Otto ▴ 830
@benjamin-otto-1519
Last seen 10.3 years ago
Hi,

parseData() seems to have problems in recognition of accession numbers including an underscore like "NM_001815". The function just doesn't find them although they do exist in the database file.

Here is the example I'm trying to get working:

    library(AnnBuilder)
    pkgpath <- .find.package("AnnBuilder")
    # unigene infos
    ugUrl <- "C:/Programme/R/R-2.4.1/library/AnnBuilder/data/Ths.data"
    # parsing
    ug <- UG(srcUrl = ugUrl,
             parser = file.path(pkgpath, "scripts", "gbUGParser"),
             baseFile = "geneNMap", organism = "Homo sapiens",
             built = "N/A", fromWeb = FALSE)
    parseData(ug)

The geneNMap file has the entries:

    32468_f_at  D90278;M16652
    32469_at    L00693
    NM_001815   NM_001815
    BF897514    BF897514
    38912_at    D90042
    BC028014    BC028014
    D90042      D90042

I get out:

               [,1]         [,2]
    32468_f_at "32468_f_at" "1084;63036"
    32469_at   "32469_at"   "1084"
    38912_at   "38912_at"   "10"
    BF897514   "BF897514"   "1084"
    D90042     "D90042"     "10"

Thanks a lot for your help in advance.

Regards,
Benjamin

--
Benjamin Otto
Universitaetsklinikum Eppendorf Hamburg
Institut fuer Klinische Chemie
Martinistrasse 52
20246 Hamburg
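To see exactly which identifiers drop out, the keys of the base file can be compared with the row names of the result. The following is a minimal base R sketch; the values are copied from the listing in the post, and the variable names are illustrative only, not part of AnnBuilder.

```r
# Keys from the first column of the geneNMap base file, as listed in the post.
base_ids <- c("32468_f_at", "32469_at", "NM_001815", "BF897514",
              "38912_at", "BC028014", "D90042")

# Row names of the result matrix shown in the post.
mapped_ids <- c("32468_f_at", "32469_at", "38912_at", "BF897514", "D90042")

# Keys that are present in the base file but absent from the result.
missing_ids <- setdiff(base_ids, mapped_ids)
print(missing_ids)
```

With these values the comparison reports both NM_001815 and BC028014. The second ID contains no underscore, so the post alone does not show whether it is dropped for the same reason or is simply not present in the UniGene data.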
John Zhang ★ 2.9k
@john-zhang-6
Last seen 10.3 years ago
> parseData() seems to have problems in recognition of accession numbers
> including an underscore like "NM_001815". The function just doesn't find
> them although they do exist in the database file.

You have used the wrong parser. There are parsers, such as egRefseqParser and gbNRef2LLParser, that handle RefSeq IDs with underscores. You need to pick one that fits your data.

Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084
Hi John,

You're right, my problem comes from the mix of GenBank accessions and RefSeq IDs, so strictly speaking gbUGParser wouldn't be expected to find the RefSeqs (my description of "accessions including underscores" was pretty dopey, I admit). I just thought, probably in an attack of wild speculation, that the "gb" scripts would automatically include the RefSeqs, because there are no REF2xxxParsers and gbNRef2LLParser is the only parser with RefSeq on the input side (as far as I can remember). gbNRef2LLParser returns LocusLink IDs, but I would like to match UniGene IDs, and there seems to be no "gbNREF2UGParser".

So I should probably rename a copy of gbUGParser to "gbNREF2UGParser" and add the "_" to the regular expression.

Regards,
Benjamin

-----Original Message-----
From: John Zhang [mailto:jzhang at jimmy.harvard.edu]
Sent: 17 January 2007 15:12
To: bioconductor at stat.math.ethz.ch; b.otto at uke.uni-hamburg.de
Subject: Re: [BioC] AnnBuilders paseData() doesn't recognize ACCs with underscore?

> parseData() seems to have problems in recognition of accession numbers
> including an underscore like "NM_001815". The function just doesn't
> find them although they do exist in the database file.

You have used a wrong parser. There are parsers, such as egRefseqParser and gbNRef2LLParser, that handles RefSeq ids with undersores. You need to pick one that fits your data.
Hi John,

Here comes a correction to my last email. Probably my brain is working in power-save mode today, but now I'm a little bit confused:

1) gbUGParser should take GenBank IDs (accessions) and return UniGene IDs, right?
2) NM_xxxxxx might denote reference sequences, but they still ARE accessions, right? AND they are GenBank identifiers.

So gbUGParser SHOULD recognize them as valid identifiers.

Regards,
Benjamin
Benjamin Otto ▴ 830
@benjamin-otto-1519
Last seen 10.3 years ago
Ah, sorry, just solved the problem. I had to add the "_" to the regular expression in the gbUGParser file in the scripts folder.

Regards,
Benjamin

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On behalf of Benjamin Otto
Sent: 17 January 2007 14:50
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] AnnBuilders paseData() doesn't recognize ACCs with underscore?
John Zhang ★ 2.9k
@john-zhang-6
Last seen 10.3 years ago
> The gbNRef2LLParser returns LocusLink Ids but I would like to match
> unigene ids and there seems to be no "gbNREF2UGParser"...
> So probably I should rename a copy of the gbUGParser to "gbNREF2UGParser"
> and add the "_" to regular expression.

Yes, you can always write your own parsers to meet special requirements.

Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084
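The fix discussed in this thread, adding "_" to the parser's regular expression, can be illustrated in R. The two patterns below are hypothetical stand-ins and are not the actual expression from the gbUGParser script, which the thread does not show. They only demonstrate how a pattern that does not allow "_" fails to match a RefSeq-style accession.

```r
# Accessions taken from the thread: two GenBank-style, one RefSeq-style.
ids <- c("D90042", "BF897514", "NM_001815")

# Hypothetical pattern: uppercase letters followed by digits, no underscore.
strict <- "^[A-Z]+[0-9]+$"

# Same pattern with an optional underscore after the letter prefix.
relaxed <- "^[A-Z]+_?[0-9]+$"

strict_hits <- grepl(strict, ids)    # does not match "NM_001815"
relaxed_hits <- grepl(relaxed, ids)  # matches all three IDs

print(data.frame(id = ids, strict = strict_hits, relaxed = relaxed_hits))
```

If a copy of the parser is edited in this way, giving it a new name (as suggested above with "gbNREF2UGParser") should keep the original script intact when the package is updated or reinstalled.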