XStringSet operations discard names - how to address?
@nicholasbauer-10742

I'm trying to perform set operations (union, intersect, setdiff) on DNAStringSets, but doing so strips off the names. How can I do set operations while keeping the names intact?

biostrings • 1.5k views

Looking at the source, the SetOperation method calls unique on the arguments, converts them to character vectors, then performs the set operation on those vectors. But as.character strips attributes. I can see the use of this in some cases, but the behavior appears inconsistent with the rest of the class: no other part of the class seems to assume or enforce that sequences are unique, or that names should be dropped when they could be preserved.

I don't know the proper R way this could be done, but for the purposes of the set operations, could the name be appended to the character vector prior to the set operation and then extracted afterwards?
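
A minimal sketch of one way to get this effect without appending names to the strings: run the set operation on the sequences, then index back into the original object so its names come along. The helper name intersect_keep_names and the sample data are hypothetical, and when a sequence occurs more than once in x, only the name of its first occurrence is kept.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): intersect on the raw
# sequences, then subset x by position so that the names of x are retained.
intersect_keep_names <- function(x, y) {
    xs <- as.character(x)
    common <- intersect(xs, as.character(y))
    # match() returns the position of the first occurrence in x
    x[match(common, xs)]
}

# Made-up example data
x <- DNAStringSet(c(a = "ACGT", b = "GGCC", c = "ACGT"))
y <- DNAStringSet(c(d = "GGCC", e = "TTTT"))

res <- intersect_keep_names(x, y)
names(res)
```

The same pattern should carry over to setdiff by swapping the set operation; union would additionally need the elements of y that are absent from x.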


Discovered that as.character accepts use.names, but this has no effect on the result of the set operation.

@herve-pages-1542

Hi,

The implementation of the set operations for XStringSet objects is a relic from prehistoric times. A better (and more generic) implementation is:

setMethod("union", c("Vector", "Vector"),
    function(x, y) unique(c(x, y))
)
setMethod("intersect", c("Vector", "Vector"),
    function(x, y) unique(x[x %in% y])
)
setMethod("setdiff", c("Vector", "Vector"),
    function(x, y) unique(x[!(x %in% y)])
)

They don't coerce to character vector internally (so are more efficient) and they propagate the names and metadata columns of the first argument (x).

Note that right now, if you define the above methods (by copying and pasting the code into your session), the more specific methods for XStringSet objects will get in the way: dispatch will still select the methods for XStringSet objects. So for now, to work around this, you would need to replace the occurrences of Vector with XStringSet. I'm in the process of adding the above methods to the S4Vectors package (where they belong) and removing the old methods for XStringSet objects from the Biostrings package. I'll let you know when I'm done.
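
As a sketch of that workaround, the same three definitions can be registered with XStringSet in the signature. The objects x and y and their names are made up for illustration, and this should only be needed with package versions that still ship the old XStringSet methods.

```r
library(Biostrings)

# Same bodies as the Vector methods, with the signature changed to
# XStringSet so that dispatch selects these instead of the old methods.
setMethod("union", c("XStringSet", "XStringSet"),
    function(x, y) unique(c(x, y))
)
setMethod("intersect", c("XStringSet", "XStringSet"),
    function(x, y) unique(x[x %in% y])
)
setMethod("setdiff", c("XStringSet", "XStringSet"),
    function(x, y) unique(x[!(x %in% y)])
)

# Made-up example data
x <- DNAStringSet(c(a = "ACGT", b = "GGCC", c = "ACGT"))
y <- DNAStringSet(c(d = "GGCC", e = "TTTT"))

names(intersect(x, y))  # names are taken from x
names(setdiff(x, y))
names(union(x, y))
```

Because unique keeps the first occurrence of each sequence, a sequence that appears under several names should come back with the earliest one.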

Cheers,

H.


Awesome, thanks!


Done in S4Vectors 0.10.1 and Biostrings 2.40.1. It will take about 48 hours before they become available via biocLite().

Cheers,

H.
