Question

bioMart GO inconsistency, normal?


R Tagett ▴ 30


Hello,

I am a graduate student at Wayne State University in Detroit. I am running BioMart to collect GO terms for lists of genes, and I noticed that some genes are annotated with top nodes (e.g. "biological_process") and others are not. I wonder if anyone can tell me why. Example code and my sessionInfo are below.

In this example, I collect all human HUGO gene symbols using the HGNChelper package. I then use BioMart to get the GO terms for these genes and keep only the "biological_process" (BP) annotations. There are 15368 unique genes that have BP annotations (uniqGenesInGO). Then I split the list of all BP annotations into the genes that are annotated with "GO:0008150" (the "biological_process" term itself) and the genes that are not. 596 genes are annotated with "GO:0008150", and 14772 are not. This is inconsistent!

Thanks for your help,
Becky

    library("biomaRt")
    ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    library(HGNChelper)
    data(hgnc.table)  # gives names of all approved symbols
    allHgnc <- unique(hgnc.table[,2])

    allGOhgnc <- getBM(attributes = c("go_id", "go_linkage_type", "entrezgene",
                                      "hgnc_symbol", "namespace_1003"),
                       filters = "hgnc_symbol",
                       values = allHgnc,
                       mart = ensembl)
    # load("allGOhgnc.RData")  # 266749

    BP <- allGOhgnc[which(allGOhgnc$namespace_1003 == "biological_process"),]  # 126789 human BP terms in GO
    BP <- BP[!duplicated(BP),]  # 108019
    uniqGenesInGO <- unique(BP$hgnc_symbol)  # 15368

    # "GO:0008150" is "biological_process"
    hasBPtab <- BP[which(BP$go_id == "GO:0008150"), ]
    hasBP <- unique(hasBPtab$hgnc_symbol)
    length(hasBP)  # 596

    # logical negation instead of -which(): -which() drops every row when nothing matches
    noBPtab <- BP[!(BP$hgnc_symbol %in% hasBP), ]
    length(unique(noBPtab$hgnc_symbol))  # 14772

    # 14772 + 596 = 15368
    # why are some genes annotated with the top node and others are not??
    > sessionInfo()
    R version 3.0.0 (2013-04-03)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.16.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 XML_3.96-1.1
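The split described above can be tried offline on a small table before running it against Ensembl. The sketch below is only an illustration: the data frame is a made-up stand-in for the `getBM()` result (the gene symbols `GENE1` to `GENE4` and all rows are hypothetical), and the names `is_root`, `root_codes`, `other_codes`, and `both` are introduced here. It also tabulates `go_linkage_type` for the root-term rows, on the assumption that the evidence codes might help explain the difference. GO uses the code ND ("no biological data available") for annotations made directly to a root term, so that would be one thing to check in the real data.

```r
# Toy stand-in for the getBM() result; every row here is invented for illustration.
BP <- data.frame(
  go_id           = c("GO:0008150", "GO:0006915", "GO:0008150", "GO:0007165", "GO:0006915"),
  go_linkage_type = c("ND", "IDA", "ND", "IEA", "TAS"),
  hgnc_symbol     = c("GENE1", "GENE2", "GENE3", "GENE2", "GENE4"),
  stringsAsFactors = FALSE
)

root <- "GO:0008150"  # the "biological_process" root term

# Genes with at least one annotation directly to the root term
is_root <- BP$go_id == root
hasBP <- unique(BP$hgnc_symbol[is_root])

# Genes with no root annotation; logical negation also works when hasBP is empty
noBPtab <- BP[!(BP$hgnc_symbol %in% hasBP), ]
noBP <- unique(noBPtab$hgnc_symbol)

# Evidence codes on root-term rows versus all other rows
root_codes  <- table(BP$go_linkage_type[is_root])
other_codes <- table(BP$go_linkage_type[!is_root])

# Genes that carry the root term and also a more specific BP term
both <- intersect(hasBP, unique(BP$hgnc_symbol[!is_root]))

print(hasBP)
print(noBP)
print(root_codes)
print(other_codes)
print(both)
```

On the real table, the same `table(BP$go_linkage_type[BP$go_id == "GO:0008150"])` call would show which evidence codes the 596 root-annotated genes carry, and `both` would show whether any of them also have more specific BP terms.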

GO biomaRt

11.7 years ago • R Tagett