AnnotationDBI Select with incomplete keys tags

Rockbar • 0

In relation to the topic "how to use "non-standard" species for KEGG / GO analysis in limma?", I also want to further annotate some CHO microarray files and extract gene symbols from GenBank IDs.

The data I want to use are the following:

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-30321/files/A-GEOD-13791.adf.txt

Unfortunately, the author published only the accession, not the full accession.version notation. As a result, the IDs do not match the annotation package's "ACCNUM" keys, which include the version number, like "EGW04382.1". Is there a way to match on part of the key name rather than requiring an exact match? That would make it possible to match on the accession number alone.

In brief: I want to match, for example, "EGW04382.1" (annotation package) with "EGW04382" (published microarray data) using AnnotationDbi. Is there a way?

Thank you!

score 1 · Answer 1 · 2016-03-25

Ideally the database wouldn't have the version suffix on the accession numbers. In general we strip it off, because it isn't really necessary for annotation (e.g., the changes between, say, EGW048382.1 and EGW048382.2 won't change which gene we are talking about).

Anyway, the easiest thing to do is to make a data.frame with the (version-stripped) accession numbers in one column and the gene symbols in the other, and use match to match them up.

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-03-09
> z <- hub[["AH48061"]]
> mapper <- select(z, keys(z), c("ACCNUM","SYMBOL"))
'select()' returned 1:many mapping between keys and columns
> mapper$ACCNUM <- sub("\\.[0-9]+$", "", as.character(mapper$ACCNUM))
> head(mapper)
      GID    ACCNUM SYMBOL
1 3979178  ABD49734   ND4L
2 3979178  ACC86255   ND4L
3 3979178 YP_537127   ND4L
4 3979179  ABD49735    ND4
5 3979179 YP_537128    ND4
6 3979180  ABD49736    ND5
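
As a quick sanity check on the stripping step, the sketch below applies an anchored pattern to a few made-up accession strings (they are invented for illustration and not taken from the package). It assumes the version is always written as a trailing dot followed by digits; under that assumption the pattern also copes with versions of 10 or higher and leaves IDs without a version untouched.

```r
# Hypothetical accession strings, invented for illustration only
acc <- c("EGW04382.1", "EGW04382.12", "YP_537127", "NM_001246717.2")

# Remove a trailing ".<digits>" version suffix, if there is one.
# The "$" anchor restricts the match to the end of the string.
stripped <- sub("\\.[0-9]+$", "", acc)
stripped
```

If the real IDs use some other version convention, the pattern would need adjusting accordingly.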

Then say you have a set of accession numbers (here I fake some up by sampling from the mapper):

> accnum <- mapper$ACCNUM[sample(1:5000, 30)]
> accnum
 [1] "JP059326"     "JI889453"     "EGW08349"     "AAA74140"     "NP_001230979"
 [6] "AAD30976"     "NM_001246717" "XP_007646054" "EGW01067"     "XP_007646591"
[11] "AAL57738"     "XP_007645054" "EGW08308"     "XM_007628285" "FN825776"    
[16] "ABQ85432"     "NM_001246755" "XP_007621204" "XM_007645844" "XP_007639173"
[21] "BAA34652"     "BAA88319"     "JI869646"     "JP056468"     "XM_007641315"
[26] "XM_003514799" "XP_007622787" "XM_007653487" "NP_001233694" "NP_001233637"
> mapped <- mapper[match(accnum, mapper$ACCNUM),]
> mapped
           GID       ACCNUM       SYMBOL
3426 100689473     JP059326        Hspd1
2156 100689312     JI889453        Gosr1
4551 100750715     EGW08349        Ints7
321  100689017     AAA74140       Srebf2
800  100689064 NP_001230979         Ldha
627  100689049     AAD30976        Mpdu1
3316 100689459 NM_001246717        Cenpa
1803 100689245 XP_007646054        Pparg
490  100689036     EGW01067         Fut9
3497 100736552 XP_007646591       Scarb1
239  100689008     AAL57738 LOC100689008
1203 100689177 XP_007645054        Pam16
4900 100750819     EGW08308        Dnm1l
3834 100750426 XM_007628285         Btrc
470  100689031     FN825776      Slc35a2
1832 100689247     ABQ85432         Cnbp
2223 100689322 NM_001246755      Slc35a1
697  100689055 XP_007621204      Lrrfip1
4796 100750781 XM_007645844        Mtmr3
3617 100750381 XP_007639173 LOC100750381
481  100689033     BAA34652      Cyp2a14
2681 100689377     BAA88319        Ercc1
833  100689069     JI869646        Gnao1
3132 100689432     JP056468         Ugcg
2954 100689407 XM_007641315      Slc19a1
4663 100750748 XM_003514799      Arfgef1
1083 100689099 XP_007622787          Vim
4620 100750728 XM_007653487 LOC100750728
2303 100689332 NP_001233694        Prdx1
3250 100689449 NP_001233637         Pgs1
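
One thing to keep in mind with real array data: match returns NA for any ID that is not in the table, and only the first hit when an ID occurs more than once. The following is a minimal sketch of how unmatched IDs could be tracked, using a toy lookup table in place of the select() result. The table contents, the symbol "GeneA", the ID "ZZZ00000", and the names array_ids, result and unmatched are all made up for illustration.

```r
# Toy lookup table standing in for the select() output (values invented)
mapper <- data.frame(
  ACCNUM = c("EGW04382.1", "ABD49734.1", "ABD49735.2"),
  SYMBOL = c("GeneA", "ND4L", "ND4"),
  stringsAsFactors = FALSE
)

# Strip the trailing version suffix so keys look like the published IDs
mapper$ACCNUM <- sub("\\.[0-9]+$", "", mapper$ACCNUM)

# Hypothetical IDs from the array annotation; the last one has no match
array_ids <- c("ABD49735", "EGW04382", "ZZZ00000")

# Position of the first matching row for each ID, NA where none exists
idx <- match(array_ids, mapper$ACCNUM)

result <- data.frame(
  ACCNUM = array_ids,
  SYMBOL = mapper$SYMBOL[idx],
  stringsAsFactors = FALSE
)

# IDs that could not be annotated
unmatched <- array_ids[is.na(idx)]
result
```

Keeping the original IDs in their own column, rather than indexing the whole mapper, means unmatched probes stay identifiable instead of turning into all-NA rows.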