Question

trouble using select() to get all GO terms when have duplicate query IDs

0

Entering edit mode

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

Last seen 3 months ago

United States

Hello,

I just spent WAY too much time tracking this down, but there's been a change somewhere between R 3.1.3 and R 3.2.2 in how select() handles duplicated query IDs, particularly for GO terms that have 1:many relationships. (I'm having trouble getting R 3.2.0 and 3.2.1 not to use 3.2.2's packages, so I'm not sure when the change happened). In R 3.1.3, select() would give you a warning about duplicate query keys, but would still give you all the mappings:

R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"
#lines removed...
> library(org.Sc.sgd.db)
#lines removed...
> 
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16389
> 
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and
return rows
> dim(all.sgd.go)
[1] 87827     4
> 
> #duplicate just a few SGD:
> 
> dup.sgd <- c(all.sgd,all.sgd[1:3])
> 
> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' and duplicate query keys resulted in 1:many
mapping between keys and return rows
> dim(dup.sgd.go)
[1] 87827     4
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] org.Sc.sgd.db_3.0.0  RSQLite_1.0.0       
[3] DBI_0.3.1            AnnotationDbi_1.28.1
[5] GenomeInfoDb_1.2.4   IRanges_2.0.1       
[7] S4Vectors_0.4.0      Biobase_2.26.0      
[9] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):
[1] tools_3.1.3

In R 3.2.2, I do like that select now outputs whether you get 1:1, 1:many or many:many mappings, but there seems to be a bug in how it treats many:many mapping, because it's only listing one GO term per query key, including duplicates:

R version 3.2.2 (2015-08-14) -- "Fire Safety"
#lines removed
> library(org.Sc.sgd.db)
#lines removed
> 
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16450
> 
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
'select()' returned 1:many mapping between keys and columns
> dim(all.sgd.go)
[1] 88004     4
> 
> #duplicate just a few SGD:
> 
> dup.sgd <- c(all.sgd,all.sgd[1:3])

> length(dup.sgd)
[1] 16453

> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
'select()' returned many:many mapping between keys and columns
> dim(dup.sgd.go)
[1] 16453     4
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] org.Sc.sgd.db_3.2.3  RSQLite_1.0.0        DBI_0.3.1           
[4] AnnotationDbi_1.32.0 IRanges_2.4.1        S4Vectors_0.8.2     
[7] Biobase_2.30.0       BiocGenerics_0.16.1 

loaded via a namespace (and not attached):
[1] tools_3.2.2

Is this a bug or a desired behavior?

Thanks,

Jenny

AnnotationDbi • 12k views

ADD COMMENT • link updated 9.4 years ago by James W. MacDonald 68k • written 9.4 years ago by Jenny Drnevich ★ 2.0k

Martin Morgan · Answer 1 · 2015-11-20

Hi Jenny,

It's probably a bug, but what was happening before was probably a bug as well. Or maybe not. It depends on expectations, I suppose.

In other words, what should be returned if there are duplicate keys? You could argue that the return object should conform to the inputs, in which case if you pass in the same key 5 times you should get five sets of results back out, which is what happens currently:

> select(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "ORF", "SGD")
'select()' returned many:1 mapping between keys and columns
         SGD     ORF
1 S000000001 YAL001C
2 S000000001 YAL001C
3 S000000001 YAL001C
4 S000000001 YAL001C
5 S000000001 YAL001C

But not consistently. I patched AnnotationDbi to do what it used to do, and this is what you get:

> select(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "GO", "SGD")
'select()' returned many:many mapping between keys and columns
          SGD         GO EVIDENCE ONTOLOGY
1  S000000001 GO:0000127      IDA       CC
2  S000000001 GO:0000799      IDA       CC
3  S000000001 GO:0001002      IDA       MF
4  S000000001 GO:0001003      IDA       MF
5  S000000001 GO:0001005      IDA       MF
6  S000000001 GO:0001008      IDA       MF
7  S000000001 GO:0001009      IDA       BP
8  S000000001 GO:0001041      IDA       BP
9  S000000001 GO:0003677      IEA       MF
10 S000000001 GO:0005634      IEA       CC
11 S000000001 GO:0005739      IDA       CC
12 S000000001 GO:0005739      IEA       CC
13 S000000001 GO:0006351      IEA       BP
14 S000000001 GO:0006355      IEA       BP
15 S000000001 GO:0008301      IDA       MF
16 S000000001 GO:0042791      IDA       BP
17 S000000001 GO:0042791      IMP       BP
18 S000000001 GO:0071168      IMP       BP

That is not the same - to be consistent we should get these results repeated five times. But that's sort of silly, isn't it? But mapIds() does more what I suppose we should expect:

> mapIds(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "GO", "SGD", multiVals="CharacterList")
'select()' returned many:many mapping between keys and columns
CharacterList of length 5
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168

You could argue that duplicate keys imply an error on the part of the end user, in which case you would first subset to unique keys and then do the mapping, with a message saying that duplicate keys have been removed. That would certainly make the code simpler, and would remove any of these ambiguities. But maybe there is a reasonable use case for duplicated keys, in which case select() should be further modified to return duplicated results.