Hi,
Background
I have two drug utilization datasets for two drugs, A and B. Each row in the datasets represents a prescription, described by patient.id, drug.name, start.date, and days.supply. Both datasets have been filtered so that they contain only seven-day-long prescriptions (days.supply = 7 for all rows) and only patients who consumed both drugs. All values of start.date are positive integers.
Question
Using IRanges but without resorting to loops, how could one find the patients who simultaneously consumed both drugs? More precisely, find a no-loop IRanges algorithm that identifies every patient.id with at least one prescription for A and one prescription for B that overlap by at least one day. Note that two A-prescriptions for the same patient should be considered one longer A-prescription; the same goes for B-prescriptions.
Initial code and error message
This question seems to me to call for reduce() and either findOverlaps() or intersect(). While it is easy enough to apply these functions to each dataset as a whole, they must instead be applied patient by patient.
    library(IRanges)

    # One 7-day range per prescription
    ir.A <- IRanges(start = as.integer(A$start.date), width = as.integer(A$days.supply))
    ir.B <- IRanges(start = as.integer(B$start.date), width = as.integer(B$days.supply))
    # Group the ranges by patient
    split.A <- split(ir.A, A$patient.id)
    split.B <- split(ir.B, B$patient.id)
    # Merge each patient's prescriptions (reduce the per-patient lists, not the pooled ranges)
    red.A <- reduce(split.A)
    red.B <- reduce(split.B)
    x <- findOverlaps(red.A, red.B)
    View(x)
    #Warning in View :
    #'optional' and arguments in '...' are ignored
    #Error in View : arguments imply differing number of rows: 1, 0, 2, 3, 4, 6
Session information
    R version 3.2.2 (2015-08-14)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: OS X 10.11 (El Capitan)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] stats4 parallel stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] XVector_0.10.0 IRanges_2.4.4 S4Vectors_0.8.2 BiocGenerics_0.16.1
    [5] data.table_1.9.6 bit64_0.9-5 bit_1.1-12

    loaded via a namespace (and not attached):
    [1] zlibbioc_1.16.0 tools_3.2.2 chron_2.3-47
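For the question as stated above, one possible approach is sketched here. It is only a sketch under assumptions: the toy data frames are made up for illustration (they are not the poster's data), and it assumes a reasonably recent IRanges release in which intersect() works element-wise on two IRangesList objects and lengths() works on list-like objects. It reduces each patient's prescriptions, intersects the two per-patient lists, and keeps the patients whose intersection is non-empty.

```r
library(IRanges)

# Hypothetical toy data (not the poster's): every prescription lasts 7 days.
A <- data.frame(patient.id = c(1, 1, 2, 3),
                start.date = c(1, 8, 1, 20),
                days.supply = 7)
B <- data.frame(patient.id = c(1, 2, 3),
                start.date = c(10, 30, 26),
                days.supply = 7)

ir.A <- IRanges(start = as.integer(A$start.date), width = as.integer(A$days.supply))
ir.B <- IRanges(start = as.integer(B$start.date), width = as.integer(B$days.supply))

# Split by patient, then merge each patient's overlapping or adjacent ranges.
red.A <- reduce(split(ir.A, A$patient.id))
red.B <- reduce(split(ir.B, B$patient.id))

# The element-wise operation below pairs list elements by position,
# so both lists must hold the same patients in the same order.
stopifnot(identical(names(red.A), names(red.B)))

# Per-patient intersection; a non-empty result means at least one shared day.
common <- intersect(red.A, red.B)
concurrent.ids <- names(red.A)[lengths(common) > 0]
concurrent.ids
```

With this toy data, patients 1 and 3 are flagged: patient 3's prescriptions share only day 26, which still counts as an overlap of one day, while patient 2's prescriptions are weeks apart.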
Would you please provide a reproducible example of this error? That code should work.
Hi,
I've added to my 'Initial code and error message' section the error-causing line of code, which I originally forgot:
As for data that works, I have CSVs that become the data.tables A and B for drugs A and B (respectively). Would you like me to provide these data.tables as part of my reproducible example? If so, what is the best way for me to provide them?
Just a minimal example would suffice, i.e., a minimal subset of your full dataset. Ideally something that can be constructed directly in code.
Hi Michael,
Here is a minimal example:
Weirdly enough, this is the new message that arises from this minimal example (the full datasets produce the error message that I noted originally):
Warning messages:
1: In as.data.frame.Hits(x[[i]], optional = TRUE) :
'optional' and arguments in '...' are ignored
2: In as.data.frame.Hits(x[[i]], optional = TRUE) :
'optional' and arguments in '...' are ignored
3: In as.data.frame.Hits(x[[i]], optional = TRUE) :
'optional' and arguments in '...' are ignored
4: In as.data.frame.Hits(x[[i]], optional = TRUE) :
'optional' and arguments in '...' are ignored
What is weird is that the final output has unexpected columns such as queryHits.1, queryHits.2, and subjectHits.1.
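One way to sidestep the oddly named columns might be to build a long table by hand, with one row per overlapping pair of ranges, instead of viewing the list of hits directly. The sketch below is an assumption-laden workaround, not a fix to the package: the toy data and the column names A.range and B.range are invented for illustration, and Map() is an implicit loop over patients, so it does not meet a strict no-loop requirement.

```r
library(IRanges)

# Hypothetical toy data (not the poster's): all prescriptions last 7 days.
A <- data.frame(patient.id = c("p1", "p1", "p2"), start.date = c(1, 8, 1))
B <- data.frame(patient.id = c("p1", "p1", "p2"), start.date = c(3, 12, 30))

red.A <- reduce(split(IRanges(start = as.integer(A$start.date), width = 7L), A$patient.id))
red.B <- reduce(split(IRanges(start = as.integer(B$start.date), width = 7L), B$patient.id))

# One Hits object per patient, held in an ordinary named list.
hits <- Map(findOverlaps, as.list(red.A), as.list(red.B))

# Turn each Hits object into rows; rep() keeps the columns the same length,
# so patients without any overlap simply contribute zero rows.
rows <- Map(function(id, h) {
  data.frame(patient.id = rep(id, length(h)),
             A.range = queryHits(h),    # index into that patient's reduced A ranges
             B.range = subjectHits(h),  # index into that patient's reduced B ranges
             stringsAsFactors = FALSE)
}, names(hits), hits)

overlap.table <- do.call(rbind, rows)
rownames(overlap.table) <- NULL
overlap.table
```

Because every column of a given patient's block has the same length, the 'differing number of rows' problem cannot arise here, and unique(overlap.table$patient.id) lists the patients with at least one overlap.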
The warning is unnecessary, so I will remove it. As for the weird columns, this comes down to there being no c,Hits method, and it is not obvious how to implement one. So we would need to do something at coercion to DataFrame.
Hi,
Maybe this is useful to note in the context of the 'different number of rows' error: the number of rows in data.table A can be expected to differ from the number of rows in data.table B. But the number of distinct patient.id values should be identical in each data.table because both data.tables refer to the same patient population.