AnnotationDbi: Select - Join order
3
0
Entering edit mode
@faustfrankenstein-19555
Last seen 4.2 years ago

Hey everyone,

I am working with the AnnotationForge package, but realised a more general issue wrt. to the select command. The results seem to be sensitive to the order passed to the colums parameter, and do in some cases not contain all the data.

Let me give a working example to illustrate my issue:

Gid    <- paste0("transcript",c(1:8))
Sym    <- c(paste0("rp",1:8))
Chr    <- c(rep(1,3), rep(2,5))
Gid2   <- paste0("transcript",c(7:13,7))
Pro    <- c(7:14)
fSym    <- data.frame(GID=Gid, SYMBOL=Sym, CHROM=as.integer(Chr))
fSym$GID <- as.character(fSym$GID)
fPro    <- data.frame(GID=Gid2, PROTEIN=Pro)
fPro$GID <- as.character(fPro$GID)

makeOrgPackage(gene_info=fSym, pro_info=fPro,
               version="0.1",
               maintainer="Some One <so@someplace.org>",
               author="Some One <so@someplace.org>",
               outputDir = ".",
               tax_id="42",
               genus="Omnia",
               species="Marsupilami")

install.packages("./org.OMarsupilami.eg.db", repos=NULL)

Calling now the select command three times

select(x = org.OMarsupilami.eg.db, 
          keys = keys(org.OMarsupilami.eg.db), 
          columns = c("GID","CHROM","PROTEIN"), 
          keytype = "GID")
select(x = org.OMarsupilani.eg.db, 
          keys = keys(org.OMarsupilami.eg.db), 
          columns = c("CHROM","PROTEIN"), 
          keytype = "GID")
select(x = org.OMarsupilami.eg.db, 
          keys = keys(org.OMarsupilami.eg.db), 
          columns = c("PROTEIN","CHROM"), 
          keytype = "GID")

where all examples query for the columns PROTEINS and CHROM. The last two example query them in the two possible orders c("CHROM","PROTEIN") and c("PROTEIN", "CHROM"), and example one includes also the GID column. So, I had expected their output to not deviate much from each other, however the results are somehow remarkable:

            GID CHROM PROTEIN
1   transcript1     1    <NA>
2   transcript2     1    <NA>
3   transcript3     1    <NA>
4   transcript4     2    <NA>
5   transcript5     2    <NA>
6   transcript6     2    <NA>
7   transcript7     2       7
8   transcript7     2      14
9   transcript8     2       8
10  transcript9  <NA>       9
11 transcript10  <NA>      10
12 transcript11  <NA>      11
13 transcript12  <NA>      12
14 transcript13  <NA>      13

for the first example, which is pretty exactly what I expected. However, the other two made me raise my eyebrows.

            GID CHROM PROTEIN
1   transcript1     1    <NA>
2   transcript2     1    <NA>
3   transcript3     1    <NA>
4   transcript4     2    <NA>
5   transcript5     2    <NA>
6   transcript6     2    <NA>
7   transcript7     2       7
8   transcript7     2      14
9   transcript8     2       8
10  transcript9  <NA>    <NA>
11 transcript10  <NA>    <NA>
12 transcript11  <NA>    <NA>
13 transcript12  <NA>    <NA>
14 transcript13  <NA>    <NA>

            GID PROTEIN CHROM
1   transcript1    <NA>  <NA>
2   transcript2    <NA>  <NA>
3   transcript3    <NA>  <NA>
4   transcript4    <NA>  <NA>
5   transcript5    <NA>  <NA>
6   transcript6    <NA>  <NA>
7   transcript7       7     2
8   transcript7      14     2
9   transcript8       8     2
10  transcript9       9  <NA>
11 transcript10      10  <NA>
12 transcript11      11  <NA>
13 transcript12      12  <NA>
14 transcript13      13  <NA>

It is somehow striking that neither of both examples is returning all the data, and furthermore the data in the last column seems to be dominated by the first column wrt. to an intersection restriction, i.e. is order dependent. This dominance of the first column is my attempt to explain why all data is returned in the first example.

So, this behaviour boiled down in my mind to the three following questions : 1. Is this behaviour due to a certain SQL join (left, right, ...) order ? 2. Is there a plain way to overcome it and return a full joint ? 3. Or is the workaround to have GID always included in the query as the very first column?

Anyone out there an idea?

AnnotationDbi • 2.7k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 22 hours ago
United States

SQLite, which is the database underlying the AnnotationDbi only supports left joins, so it's not easy to do a full outer join. You can emulate a full outer join by doing two left joins where you swap the order of the tables, but that (probably) becomes programmatically difficult as the number of tables increases. An alternative is to use mapIds and cbind the results:

> as.data.frame(lapply(c("PROTEIN","CHROM"), 
                        function(x) mapIds(org.OMarsupilami.eg.db, keys(org.OMarsupilami.eg.db), x, "GID")), 
                        col.names = c("PROTEIN","CHROM"))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
             PROTEIN CHROM
transcript1     <NA>     1
transcript2     <NA>     1
transcript3     <NA>     1
transcript4     <NA>     2
transcript5     <NA>     2
transcript6     <NA>     2
transcript7        7     2
transcript8        8     2
transcript9        9  <NA>
transcript10      10  <NA>
transcript11      11  <NA>
transcript12      12  <NA>
transcript13      13  <NA>
ADD COMMENT
0
Entering edit mode

And you could argue that a DataFrame is an even better thing to use, as there are one to many mappings in there:

> z <- DataFrame(lapply(c("PROTEIN","CHROM"), function(x) mapIds(org.OMarsupilami.eg.db, keys(org.OMarsupilami.eg.db), x, "GID", multiVals = "CharacterList")))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns

> names(z) <- c("PROTEIN","CHROM")
> z
DataFrame with 13 rows and 2 columns
                     PROTEIN           CHROM
             <CharacterList> <CharacterList>
transcript1               NA               1
transcript2               NA               1
transcript3               NA               1
transcript4               NA               2
transcript5               NA               2
...                      ...             ...
transcript9                9              NA
transcript10              10              NA
transcript11              11              NA
transcript12              12              NA
transcript13              13              NA

> z[5:8,]
DataFrame with 4 rows and 2 columns
                    PROTEIN           CHROM
            <CharacterList> <CharacterList>
transcript5              NA               2
transcript6              NA               2
transcript7            7,14               2
transcript8               8               2

ADD REPLY
0
Entering edit mode
@faustfrankenstein-19555
Last seen 4.2 years ago

Uha, I see. First of all, thank you very much for your thoroughly answer. I somehow suspected the underlying SQLite architecture (indeed, I hadn't have a look at it, yet) would only permit such a rather user-unfriendly solution. Thing is, I would like to deploy an organism package and relieve the user at the back-end from such considerations as left, right, full joins etc. I guess, I had to mask or even overwrite the select function, which is probably not that straight-forward. Might even be not a good idea at all. Will have to get my head around it tomorrow, when I am back in my office.

ADD COMMENT
0
Entering edit mode

In future, if you want to add a reply, please use the ADD REPLY or ADD COMMENT buttons rather than adding an answer, which you by definition are not doing.

ADD REPLY
0
Entering edit mode

I was realizing it the moment I hit the 'submit' button. Sometimes, late at night, 'answer', 'respond' and 'reply' appear to be semantically equivalent concepts. Seriously, I beg your pardon for it.

The SQLite join is giving me some headaches - I am wondering, whether it might be a more suitable way to create only one huge dataframe, including duplicates for some fields, and assigning new unique IDs. However, I am somehow doubting it will get along with corresponding BSgenome & TxDb objects, which eventually will be tied together by an OrganismDbi package instance.

ADD REPLY
0
Entering edit mode

Well, you already showed one fix, which is to require that the central ID be part of any query. I showed an alternative, which is to make individual queries using mapIds and converting to a data.frame. Is there something wrong with either of those ideas?

ADD REPLY
0
Entering edit mode

I would not say there is anything wrong with those ideas, for me it's rather a question of efficiency and elegant style. Let's say, I am choosing my annotation package to rely on one huge data.frame and assign unique ids, which cannot be transcript ids anymore - that's what I can assure you of. In this case, I am not able to foresee whether the select of the OrganismDbi package instance will still seamlessly work, as the primary keys of the BSgenome and TxDb objects are still transcript ids - how to change the latter ones into general ids is a current major concern about this fix. On the other side, masking or rewriting the select function only for an annotation object - while it seems mapIds is making use of it; I do not want to run into any recursive calls - might cause another compatibility issue at the OrganismDbi object level. As there are only three organisms being so comprehensive on Bioconductor - in terms of being tied together by an OrganismDbi object - , makes me lack some exemplary material; elaborating either of those two fixes will take me a couple of days, therefore I am just curious about a preference.

ADD REPLY
0
Entering edit mode

I just realized the fix you were referring to is not the one of adding an additional ID. I am sorry, didn't get that during my first read. Anyhow, I preferred your approach over messing around with a new ID, for this reason I did spend some time implementing yours. And yup, it happened exactly what I assumed - I ran into a recursive call blowing up my memory stack. Some google-fu led me to the vignette "Creating select Interfaces for custom Annotation resources", which gave me a hint where I had to spud in. So, I masked the generic select function

.select <- function(db, keys, columns, keytype){ DataFrame(lapply(columns, function(x) mapIds(db, keys, x, keytype))) } setMethod("select", signature("OrgDb"), function(x, keys, columns, keytype) { .select(x, keys=keys, columns=columns, keytype = keytype) } )

and got for the plain call select(org.OMarsupilami.eg.db, head(keys(org.OMarsupilami.eg.db)), columns = c("GID"), keytype = "GID") the error thrown

Error: C stack usage 7972692 is too close to the limit, which hints generally at a too deeply nested call, here the self-recurrent call.

Hence, as last resource, I will have to code the select function in plain SQLite omitting the left join restriction by querying the db for each column separately and bind the result appropriately.

ADD REPLY
0
Entering edit mode
@faustfrankenstein-19555
Last seen 4.2 years ago

For the sake of completion, one elegant way is to mask the global select function (, which is expecting an OrgDbj, - see your current available select functions via showMethods("select")) for your current organism package.

In your zzz file add to the utmost layer a class

 setClass("foo", representation(name="character"),
       prototype(name="DUMMY_VALUE"))

change in the .onLoad function the dbNewname to

dbNewname <- paste0(AnnotationDbi:::dbObjectName(pkgname,"OrgDb"),"0")

which was the organism's exported object you are referring to by default to in the select call. Make a replacement for the original object

Foo <- new("foo")
assign("org.REPLACEURORGANISM.eg.db", Foo, envir=ns)
namespaceExport(ns, "org.REPLACEURORGANISM.eg.db")`.

You are going to work now with the replacement as an object - most generally, you are able to deliberately declare and define any functions on it. For instance, the select function

setMethod("select", signature("foo"), 
          function(x, keys, columns, keytype)
          {
            .select(x, keys=keys, columns=columns, keytype = keytype)
          }
)

Even though, you are having now a deliberate freedom to define your select function, it seems natural to have it operating on the organism's database. So, whatever you might have in mind, we are simply extending the select's genuine functionality by overcoming the left join restriction.

Therefore mask, actually declare for your new class foo, the pre-existing select function, i.e. overload it by

.select <- function(db, keys, columns, keytype){
  org <- org.REPLACEURORGANISM.eg.db::org.REPLACEURORGANISM.eg.db0 
  res <- DataFrame(  lapply(columns , function(x) mapIds(org, keys, x, keytype, multiVals = "CharacterList")))
  colnames(res) <- c(columns)
  return(res)
}

The overloaded object orgDbi is now first called by the instance of foo, and does whatever you desires.

In our case, it returns the full join with multi-value results as a data frame. For providing this splendid fix for the full join, which made me find the rest, I'd like to thank James.

ADD COMMENT

Login before adding your answer.

Traffic: 387 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6