Hello,
I just spent WAY too much time tracking this down, but something changed between R 3.1.3 and R 3.2.2 in how select() handles duplicated query keys, particularly for GO terms, which have 1:many relationships. (I'm having trouble getting R 3.2.0 and 3.2.1 not to use 3.2.2's packages, so I'm not sure when the change happened.) In R 3.1.3, select() would give you a warning about duplicate query keys, but would still give you all the mappings:
R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"
#lines removed...
> library(org.Sc.sgd.db)
#lines removed...
>
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16389
>
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
> dim(all.sgd.go)
[1] 87827 4
>
> #duplicate just a few SGD:
>
> dup.sgd <- c(all.sgd,all.sgd[1:3])
>
> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' and duplicate query keys resulted in 1:many mapping between keys and return rows
> dim(dup.sgd.go)
[1] 87827 4
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base

other attached packages:
[1] org.Sc.sgd.db_3.0.0 RSQLite_1.0.0
[3] DBI_0.3.1 AnnotationDbi_1.28.1
[5] GenomeInfoDb_1.2.4 IRanges_2.0.1
[7] S4Vectors_0.4.0 Biobase_2.26.0
[9] BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] tools_3.1.3
In R 3.2.2, I do like that select() now reports whether you got a 1:1, 1:many or many:many mapping, but there seems to be a bug in how it treats many:many mappings: it returns only one GO term per query key, including the duplicated keys:
R version 3.2.2 (2015-08-14) -- "Fire Safety"
#lines removed
> library(org.Sc.sgd.db)
#lines removed
>
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16450
>
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
'select()' returned 1:many mapping between keys and columns
> dim(all.sgd.go)
[1] 88004 4
>
> #duplicate just a few SGD:
>
> dup.sgd <- c(all.sgd,all.sgd[1:3])
> length(dup.sgd)
[1] 16453
> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
'select()' returned many:many mapping between keys and columns
> dim(dup.sgd.go)
[1] 16453 4
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] org.Sc.sgd.db_3.2.3 RSQLite_1.0.0 DBI_0.3.1
[4] AnnotationDbi_1.32.0 IRanges_2.4.1 S4Vectors_0.8.2
[7] Biobase_2.30.0 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] tools_3.2.2
Is this a bug or a desired behavior?
Thanks,
Jenny
Hi Jenny,
I have patched AnnotationDbi to do what is consistent, which is NOT the same thing as it did before. In other words, if you pass in duplicated keys, you get duplicated results, even if it's a 1:many mapping. As an example:
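A minimal base-R sketch of that behavior, using a hypothetical toy mapping table rather than org.Sc.sgd.db and the patched select() itself. The names `map` and `lookup_all` are made up for illustration, and the row order is an assumption:

```r
# Hypothetical toy mapping (not real SGD/GO data): key "A" maps to two terms.
map <- data.frame(
  SGD = c("A", "A", "B"),
  GO  = c("GO:1", "GO:2", "GO:3"),
  stringsAsFactors = FALSE
)

# Emulate "duplicated keys give duplicated results": return every matching
# row for each query key, in query order, repeating the rows for any key
# that appears more than once in the query.
lookup_all <- function(keys, map) {
  rows <- lapply(keys, function(k) map[map$SGD == k, , drop = FALSE])
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

uniq <- lookup_all(c("A", "B"), map)       # unique keys
dup  <- lookup_all(c("A", "B", "A"), map)  # "A" passed twice

nrow(uniq)  # 3 rows: two for "A", one for "B"
nrow(dup)   # 5 rows: the two "A" rows appear again
```

So with a 1:many mapping the number of returned rows grows with every repeated key, instead of collapsing back to the unique-key result.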
Previously you would have received the same result as if you passed in a set of unique keys, but that wasn't a consistent result, as you got duplicates for a 1:1 mapping. This also affects mapIds():
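A rough base-R sketch of the mapIds() side, assuming its default of returning one value per key (`multiVals = "first"`). The toy `map` table and `keys` vector are hypothetical, and `match()` only stands in for the real lookup:

```r
# Hypothetical toy mapping (not real SGD/GO data): key "A" maps to two terms.
map <- data.frame(
  SGD = c("A", "A", "B"),
  GO  = c("GO:1", "GO:2", "GO:3"),
  stringsAsFactors = FALSE
)

keys <- c("A", "B", "A")  # "A" is duplicated in the query

# match() gives the first matching row for each key, so the result has
# exactly one element per input key, duplicates included.
first_go <- setNames(map$GO[match(keys, map$SGD)], keys)

length(first_go) == length(keys)  # TRUE
```

Here the output is always as long as the input, which is the one-row-per-input-key consistency that select() with duplicated keys now follows too.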
Thanks for the bug report!
Thanks for making the changes. I'm out all this week so I won't get to try it out until next week. I think an argument can be made that, for an X:many mapping using select(), any duplicated keys could safely be removed because there is a variable number of return rows per key. However, I can see why you'd want the behavior to be similar to mapIds() or to select() with X:1 mappings, where it makes sense to output one row per input key. Most important is that select() with many:many no longer just outputs one of the many!
Happy Thanksgiving,
Jenny