Question

IRanges: reduce(), findOverlaps(), and intersect() by group

0

Entering edit mode

pesquivel • 0

@pesquivel-9246

Last seen 9.4 years ago

United States

Hi,

Background I have two drug utilization datasets for two drugs A and B. Each row in the datasets represents a prescription, described by patient.id, drug.name, start.date , and days.supply. Both datasets have been filtered so they contain only seven-day-long prescriptions (days.supply = 7 for all rows) and only patients that consumed both drugs. All values of start.date are positive integers.

Question Using IRanges but without resorting to loops, how could one find the patients who simultaneously consumed both drugs? More precisely, find a no-loop IRanges algorithm that identifies every patient.id with at least one prescription for A and one prescription for B that mutually overlap by at least one day. Note that two A-prescriptions for the same patient should be considered one longer A-prescription; same goes for B-prescriptions.

Initial code and error message This question seems to me to be a time to use reduce() and either findOverlaps() or intersect(). While it is easy enough to apply these functions to each dataset as a whole, they however must instead be applied patient by patient.

ir.A <- IRanges(start = as.integer(A$start.date), width = as.integer(A$days.supply))
ir.B <- IRanges(start = as.integer(B$start.date), width = as.integer(B$days.supply))

split.A <- split(ir.A, A$patient.id)
split.B <- split(ir.B, B$patient.id)

red.A <- reduce(ir.A)
red.B <- reduce(ir.B)

x <- findOverlaps(red.A, red.B)

#Warning in View :
#'optional' and arguments in '...' are ignored
#Error in View : arguments imply differing number of rows: 1, 0, 2, 3, 4, 6

Session information

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] XVector_0.10.0      IRanges_2.4.4       S4Vectors_0.8.2     BiocGenerics_0.16.1
[5] data.table_1.9.6    bit64_0.9-5         bit_1.1-12         

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 tools_3.2.2     chron_2.3-47

iranges findoverlaps intersect group • 3.2k views

ADD COMMENT • link 9.4 years ago pesquivel • 0

0

Entering edit mode

Would you please provide a reproducible example of this error? That code should work.

ADD REPLY • link 9.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi,

I've added to my 'Initial code and error message' section the error-causing line of code, which I originally forgot:

x <- findOverlaps(red.A, red.B)

As for data that works, I have CSVs that become the data.tables A and B for drugs A and B (respectively). Would you want me to provide these data.tables as part of my reproducible error? If so, what is the best way for me to provide them?

ADD REPLY • link 9.4 years ago pesquivel • 0

0

Entering edit mode

Just a minimal example would suffice, i.e., a minimal subset of your full dataset. Ideally something that can be constructed directly in code.

ADD REPLY • link 9.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael,

Here is a minimal example:

a <- data.frame(as.character(c(seq(1:4), 4)),
                c("A", "A", "A", "A", "A"),
                c(as.Date("2013-10-30"), as.Date("2013-11-15"),
                  as.Date("2013-09-05"), as.Date("2013-11-29"), as.Date("2013-12-27")),
                c(as.Date("2013-11-06"), as.Date("2013-11-22"),
                  as.Date("2013-09-12"), as.Date("2013-12-06"), as.Date("2014-01-03")))
names(a) <- c("patient.id", "drug", "start.date", "end.date")
a.ir <- IRanges(as.integer(a$start.date), as.integer(a$end.date))
a.split <- split(a.ir, a$patient.id)
a.red <- reduce(a.split)

b <- data.frame(as.character(seq(1:4)),
                c("B", "B", "B", "B"),
                c(as.Date("2013-10-30"), as.Date("2013-11-15"),
                  as.Date("2013-09-05"), as.Date("2013-11-29")),
                c(as.Date("2013-11-06"), as.Date("2013-11-22"),
                  as.Date("2013-09-12"), as.Date("2013-12-06")))
names(b) <- c("patient.id", "drug", "start.date", "end.date")
b.ir <- IRanges(as.integer(b$start.date), as.integer(b$end.date))
b.split <- split(b.ir, b$patient.id)
b.red <- reduce(b.split)

x <- findOverlaps(a.red, b.red)
View(x)

Weirdly enough, this is the new message that arises from this minimal example (the full datasets produce the error message that I noted originally):

Warning messages: 1: In as.data.frame.Hits(x[[i]], optional = TRUE) : 'optional' and arguments in '...' are ignored 2: In as.data.frame.Hits(x[[i]], optional = TRUE) : 'optional' and arguments in '...' are ignored 3: In as.data.frame.Hits(x[[i]], optional = TRUE) : 'optional' and arguments in '...' are ignored 4: In as.data.frame.Hits(x[[i]], optional = TRUE) : 'optional' and arguments in '...' are ignored

What is weird is that the final output has unexpected columns such as queryHits.1, queryHits.2, and subjectHits.1.

ADD REPLY • link 9.4 years ago pesquivel • 0

0

Entering edit mode

The warning is unnecessary, so I will remove it. As for the weird columns, this comes down to there being no c,Hits method, and it is not obvious how to implement one. So we would need to do something at coercion to DataFrame.

ADD REPLY • link 9.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, Maybe this is useful to note in the context of the 'different number of rows' error: The number of rows in data.table A can be expected to differ from the number of rows in data.table B. But the number of patient.id values should be identical in each data.table because both data.tables refer to the same patient population.

ADD REPLY • link 9.4 years ago pesquivel • 0

score 0 · Answer 1 · 2015-12-17

0

Entering edit mode

pesquivel • 0

@pesquivel-9246

Last seen 9.4 years ago

United States

Michael, thanks for providing technical details.

Turns out that the answer to my particular question is easy to find. Instead of using findOverlaps, I simply used overlapsAny():

overlaps <- as.data.table(overlapsAny(a, b))
names(overlaps) <- c("group", "patient.id", "has.overlaps")
overlaps[, has.overlaps := as.numeric(has.overlaps)]
overlaps[, has.overlaps := max(has.overlaps), by = patient.id]
overlappers <- overlaps[has.overlaps == TRUE]

overlappers lists all the patients who simultaneously consumed drugs A and B, which was what we wanted.

ADD COMMENT • link 9.4 years ago pesquivel • 0

0

Entering edit mode

Yea, or just:

any(overlapsAny(a, b))

But I do want to iron out that issue with HitsList.

ADD REPLY • link 9.4 years ago Michael Lawrence ★ 11k