getBM returns shorter vectors than values


Lescai, Francesco ▴ 380

@lescai-francesco-5078

Last seen 6.2 years ago

Denmark

Hi,

I have the same problem, and I think it has been this way for as long as I have used biomaRt. Is there any way to force getBM to return NA when no attribute can be found for a filter value? Then, when annotating your results, you would get vectors of the same length as the input, which would be much easier to combine into data.frames.

Thanks for any suggestions,
cheers,
Francesco

On 29 Aug 2013, at 05:40, Atul <atulkakrana@outlook.com> wrote:

> Hi All,
>
> I am using the oligo package to analyse samples generated using the HuEx 1.0 ST v2 chip. The problem I am facing is with annotating the results.
>
> Here is my code (simplified):
>
> celFilesA <- list.celfiles()
> AF_data.A <- read.celfiles(celFilesA,pkgname='pd.huex.1.0.st.v2')
> AF.eset.RMA <- rma(AF_data.A,target='core')
>
> > dim(exprs(AF.eset.RMA))
> [1] 22011 10
>
> ##Attempt to annotate
> library(biomaRt)
> ID <- rownames(AF.eset.RMA)
> ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
> Anno <- getBM(attributes=c("strand","transcript_start","chromosome_name","hgnc_symbol"),filters=c("affy_huex_1_0_st_v2"),values=ID,mart=ensembl)
>
> > dim(Anno)
> [1] 1635 4
>
> As you can see, out of a total of 22011 genes/probesets I can annotate only 1635. Is there any way I can get the annotations for all of the genes/probesets and add them back to my expression set (AF.eset.RMA), so that the annotations are included in the final results?
>
> Usually, with other chips I do this:
>
> ID <- featureNames(AF.eset.RMA)
> Symbol <- getSYMBOL(ID, 'mouse4302.db')
> Name <- as.character(lookUp(ID, "mouse4302.db", "GENENAME"))
> tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name,stringsAsFactors=FALSE)
> tmp[tmp=="NA"] <- NA
> fData(AF.eset.RMA) <- tmp
>
> And this is what I want to achieve in the present case. I would appreciate your help.
>
> Thanks
> AK
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

annotate biomaRt oligo • 5.5k views

updated 11.2 years ago by Steffen Durinck ▴ 540 • written 11.2 years ago by Lescai, Francesco ▴ 380

I feel for you. I guess this would be complicated to change on the BioMart side. But hey,

dplyr::left_join should be able to fill in the missing values with NA (as mentioned, in the manner of a wrapper: build a data frame from the input parameter "values" and left-join the getBM result onto it).
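
The left_join idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the `anno` data frame is a hand-made stand-in for a getBM() result (no query is sent), and the third ID is made up so that it has no hit.

```r
library(dplyr)

# Input IDs in the order we want to keep.
# The third ID is invented for this example and has no annotation.
ids <- c("ENSMUSG00000027255", "ENSMUSG00000020472", "ENSMUSG00000000000")

# Stand-in for a getBM() result: only matched IDs, in a different order.
anno <- data.frame(
  ensembl_gene_id  = c("ENSMUSG00000020472", "ENSMUSG00000027255"),
  external_gene_id = c("Zkscan17", "Arfgap2"),
  stringsAsFactors = FALSE
)

# left_join keeps every row of the left table in its original order and
# fills the annotation columns with NA where there is no match.
annotated <- left_join(
  data.frame(ensembl_gene_id = ids, stringsAsFactors = FALSE),
  anno,
  by = "ensembl_gene_id"
)
```

Note that if the annotation table holds several rows for one ID, the join will return several rows for it, so the result can be longer than the input.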

7.1 years ago Jerry • 0

Steffen Durinck ▴ 540

@steffen-durinck-4465

Last seen 10.2 years ago

Hi Francesco,

That is correct: biomaRt doesn't return anything if it can't find it. It is designed to work just like the BioMart web services at www.biomart.org, which behave the same way.

I usually add the filter as an attribute so I can match things up and figure out which values did return a result. Your query would be:

    Anno <- getBM(attributes=c("affy_huex_1_0_st_v2","strand","transcript_start","chromosome_name","hgnc_symbol"),
                  filters=c("affy_huex_1_0_st_v2"),values=ID,mart=ensembl)

If you want a vector back with the same length as ID and with NAs where you didn't get a result, you could write a wrapper function around getBM that does that for you.

Best,
Steffen

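One possible shape for such a wrapper is sketched below. It is not part of biomaRt: `getBM_aligned` and `mock_getBM` are hypothetical names, the probeset IDs and gene symbols are invented, and the query is passed in as a function so the example runs without contacting a BioMart server. It assumes the query result includes the filter as a column, as suggested in the answer above.

```r
# Hypothetical helper, not a biomaRt function.
# ids:       input values, possibly containing IDs with no annotation
# id_col:    name of the column in the result that holds the filter values
# query_fun: function taking a vector of values and returning a data frame
#            (in real use, a thin wrapper around biomaRt::getBM())
getBM_aligned <- function(ids, id_col, query_fun) {
  res <- query_fun(unique(ids))
  # One row per input ID, remembering its original position.
  skeleton <- data.frame(input_pos = seq_along(ids), id = ids,
                         stringsAsFactors = FALSE)
  names(skeleton)[2] <- id_col
  # all.x = TRUE keeps IDs without a hit; their attribute columns become NA.
  out <- merge(skeleton, res, by = id_col, all.x = TRUE, sort = FALSE)
  # merge() does not guarantee row order, so restore the input order.
  out <- out[order(out$input_pos), , drop = FALSE]
  out$input_pos <- NULL
  rownames(out) <- NULL
  out
}

# Offline stand-in for the server; IDs and symbols are made up.
mock_getBM <- function(values) {
  db <- data.frame(
    affy_huex_1_0_st_v2 = c("2315101", "2315100"),
    hgnc_symbol         = c("GENE_B", "GENE_A"),
    stringsAsFactors    = FALSE
  )
  db[db$affy_huex_1_0_st_v2 %in% values, , drop = FALSE]
}

ID   <- c("2315100", "2315999", "2315101")
Anno <- getBM_aligned(ID, "affy_huex_1_0_st_v2", mock_getBM)
```

In real use, `query_fun` could be something like `function(v) getBM(attributes=c("affy_huex_1_0_st_v2","hgnc_symbol"), filters="affy_huex_1_0_st_v2", values=v, mart=ensembl)`. If one ID maps to several result rows, the output will have more rows than the input, so a further step would be needed to collapse them.
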
11.2 years ago Steffen Durinck ▴ 540

Hi Steffen,

Thanks for your reply. Yes, it works this way :-)

However, getBM doesn't seem to return results in the same order as the input. Here's a simple test:

    > tesgenes
    [1] "ENSMUSG00000027255" "ENSMUSG00000020472" "ENSMUSG00000020807" "ENSMUSG00000086769" "ENSMUSG00000016024"
    > getBM(filters=c("ensembl_gene_id"), attributes=c("ensembl_gene_id", "external_gene_id"), values=tesgenes, mart=ensembl)
         ensembl_gene_id external_gene_id
    1 ENSMUSG00000016024              Lbp
    2 ENSMUSG00000020472         Zkscan17
    3 ENSMUSG00000020807    4933427D14Rik
    4 ENSMUSG00000027255          Arfgap2
    5 ENSMUSG00000086769          Gm15587

Therefore, if I have a data.frame with gene IDs and I just cbind the result to it, the rows don't match. I solved it by merging the two data.frames on their ID columns, like this:

    MyResults <- merge(
      MyResults,
      getBM(filters=c("ensembl_gene_id"), attributes=c("ensembl_gene_id", "external_gene_id"), values=MyResults$geneID, mart=ensembl),
      by.x="geneID",
      by.y="ensembl_gene_id"
    )

Is there any way to make getBM() return data in the same order as the vector of values, or is this behaviour due to the way the query works?

Thanks for your prompt reply,
Francesco

11.2 years ago Lescai, Francesco ▴ 380

Hi Francesco,

This is due to the actual BioMart server, which is accessed by the Bioconductor package biomaRt. Unless I am unaware of a recent change in the BioMart server, there is no way to preserve the order of the input (or keep duplicates, or indicate which ID does not have a result, etc.).

Of course, there is a quick and dirty (and bad) solution: you loop over your gene IDs and make an individual request for each gene....

Regards,
Hans-Rudolf

11.2 years ago Hotz, Hans-Rudolf ▴ 400

Francesco,

It is an extra step, but see the match() and merge() functions to reconcile your input vector with the results from biomaRt.

Sean

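As a rough illustration of the match() route, the snippet below reorders the result table from Francesco's test so that it lines up with `tesgenes`. The `res` data frame is typed in by hand from the output shown in the thread, so nothing is queried; the names `res`, `idx` and `symbols` are just for this example.

```r
tesgenes <- c("ENSMUSG00000027255", "ENSMUSG00000020472", "ENSMUSG00000020807",
              "ENSMUSG00000086769", "ENSMUSG00000016024")

# Result table as printed in the thread (sorted by ID, not in input order).
res <- data.frame(
  ensembl_gene_id  = c("ENSMUSG00000016024", "ENSMUSG00000020472",
                       "ENSMUSG00000020807", "ENSMUSG00000027255",
                       "ENSMUSG00000086769"),
  external_gene_id = c("Lbp", "Zkscan17", "4933427D14Rik", "Arfgap2", "Gm15587"),
  stringsAsFactors = FALSE
)

# For each input ID, match() gives the row of res that holds it,
# or NA when the ID is absent from the result.
idx     <- match(tesgenes, res$ensembl_gene_id)
symbols <- res$external_gene_id[idx]
```

Because indexing with NA yields NA, `symbols` always has the same length as `tesgenes`, and a duplicated input ID simply gets the same value twice. match() only uses the first matching row, so merge() is the better fit when one ID can map to several rows.
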
11.2 years ago Sean Davis 21k