DataFrame compatible left_join (merge) operation supporting S4 complex columns that never reorders rows
@mjsteinbaugh
Cambridge, MA

Is there a left join operation that works on DataFrame objects and never rearranges the rows? The merge operation for DataFrame (i.e. S4Vectors::merge) currently reorders rows, even when sort = FALSE.

# This supports S4 columns but reorders rows.
m <- S4Vectors::merge(
    x = x, y = y,
    by = "gene_id",
    all.x = TRUE,
    sort = FALSE
)

I'd like something like dplyr::left_join() that also supports complex S4 columns (e.g. CompressedCharacterList), rather than only the atomic and list columns supported in tibbles.

# This never reorders rows, but doesn't support S4 columns.
m <- dplyr::left_join(
    x = x, y = y,
    by = "gene_id"
)
@herve-pages-1542
Seattle, WA, United States

Hi,

A workaround is to perform the merge yourself with match() and cbind(), for example:

library(S4Vectors)
x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])
m <- match(x$gene_id, y$gene_id)
cbind(x, y[m, ])
# DataFrame with 7 rows and 4 columns
#         tx_id   gene_id   gene_id   gene_name
#   <character> <numeric> <integer> <character>
# 1           a         3         3           C
# 2           b        19        NA          NA
# 3           c         4         4           D
# 4           d         1         1           A
# 5           e         1         1           A
# 6           f         3         3           C
# 7           g         1         1           A
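
Note that the result above carries gene_id twice, once from each side. A minimal sketch of the same match()-based join that drops the key column from the right-hand side before binding, assuming the same x and y as above (the names y_cols and res are only illustrative):

```r
library(S4Vectors)

x <- DataFrame(tx_id = letters[1:7], gene_id = c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id = 1:5, gene_name = LETTERS[1:5])

# Position in y of each key in x (NA where x has no match in y).
m <- match(x$gene_id, y$gene_id)

# Keep every column of y except the join key, so it is not duplicated.
y_cols <- setdiff(colnames(y), "gene_id")

# drop = FALSE keeps the result a DataFrame even with a single column.
res <- cbind(x, y[m, y_cols, drop = FALSE])
res
```

The rows of res stay in the order of x, and gene_name is NA for the unmatched key 19.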

There is one problem, though, if the right DataFrame has an S4 column that doesn't support subsetting by a subscript containing NAs:

library(GenomicRanges)
y$range <- GRanges("chr1", IRanges(11:15, width=5))
y
# DataFrame with 5 rows and 3 columns
#     gene_id   gene_name      range
#   <integer> <character>  <GRanges>
# 1         1           A chr1:11-15
# 2         2           B chr1:12-16
# 3         3           C chr1:13-17
# 4         4           D chr1:14-18
# 5         5           E chr1:15-19

cbind(x, y[m, ])
# Error: subscript contains NAs

That's because GRanges objects don't accept NAs in the subscript:

y$range[m]
# Error: subscript contains NAs

One way to deal with this is to make sure that all the gene ids in the left DataFrame are mapped to a gene id in the right DataFrame. This will guarantee that the call to match() doesn't return any NA.
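
A quick way to check this before joining is to look at which keys in x have no counterpart in y. This is only a sketch using the example data from above; the variable name unmatched is illustrative:

```r
library(S4Vectors)

x <- DataFrame(tx_id = letters[1:7], gene_id = c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id = 1:5, gene_name = LETTERS[1:5])

m <- match(x$gene_id, y$gene_id)

# Keys present in x but absent from y; these are what produce the NAs in m.
unmatched <- unique(x$gene_id[is.na(m)])
if (length(unmatched) > 0L) {
    message("gene_id values in x with no match in y: ",
            paste(unmatched, collapse = ", "))
}
```

If unmatched is empty, m contains no NA and y[m, ] is safe even for columns such as GRanges.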

Another way is to exclude from the results the rows in x that are not matched to a row in y:

keep_idx <- !is.na(m)
cbind(x[keep_idx, ], y[m[keep_idx], ])
# DataFrame with 6 rows and 5 columns
#         tx_id   gene_id   gene_id   gene_name      range
#   <character> <numeric> <integer> <character>  <GRanges>
# 1           a         3         3           C chr1:13-17
# 2           c         4         4           D chr1:14-18
# 3           d         1         1           A chr1:11-15
# 4           e         1         1           A chr1:11-15
# 5           f         3         3           C chr1:13-17
# 6           g         1         1           A chr1:11-15

This is equivalent to calling merge() with all.x=FALSE (an inner join rather than a left join), except that the original order of the rows in x is preserved.
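
The two variants can be wrapped in a small helper. The function below is a hypothetical sketch, not part of S4Vectors; its name and arguments are assumptions made for illustration. It also guards against one difference from merge(): match() only returns the first hit, so duplicated keys in y would be silently ignored rather than expanded into extra rows.

```r
library(S4Vectors)

# Hypothetical helper (not an S4Vectors function): order-preserving join
# of two DataFrames on a single key column.
left_join_keep_order <- function(x, y, by, all.x = TRUE) {
    stopifnot(
        is(x, "DataFrame"), is(y, "DataFrame"),
        is.character(by), length(by) == 1L,
        by %in% colnames(x), by %in% colnames(y)
    )
    # match() keeps only the first match, so require unique keys in y.
    if (anyDuplicated(y[[by]]) > 0L) {
        stop("keys in 'y' must be unique")
    }
    m <- match(x[[by]], y[[by]])
    if (!all.x) {
        # Drop rows of x without a match (needed when a column of y
        # does not accept NAs in the subscript, e.g. GRanges).
        keep <- !is.na(m)
        x <- x[keep, , drop = FALSE]
        m <- m[keep]
    }
    # Bind all non-key columns of y, reordered to follow the rows of x.
    y_cols <- setdiff(colnames(y), by)
    cbind(x, y[m, y_cols, drop = FALSE])
}

x <- DataFrame(tx_id = letters[1:7], gene_id = c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id = 1:5, gene_name = LETTERS[1:5])

left <- left_join_keep_order(x, y, by = "gene_id")
inner <- left_join_keep_order(x, y, by = "gene_id", all.x = FALSE)
```

With all.x = TRUE the call still fails if y holds a column that rejects NA subscripts and some key is unmatched; in that case use all.x = FALSE or fix the keys first.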

Hope this helps,

H.

Thanks Hervé! That's really clever, and exactly what I'm looking for.

Best, Mike
