Hi all,
I am an advanced beginner in R who is extremely thankful for IRanges! It has accelerated the process I describe below by ~10-fold.
Desired advice: Because I am working with millions of records, I wonder whether further speed improvements could be obtained by creatively combining IRanges with functions from the dplyr package, which are generally fast. Any insights into using the data.table package would also be appreciated.
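For concreteness, this is the sort of thing I have in mind on the data.table side. It is an untested sketch only; I do not know whether calling the IRanges functions inside a grouped data.table step like this is valid, let alone faster. The Claims table and the reduce()/gaps() logic are described below:

library(data.table)
library(IRanges)

# Untested sketch: run reduce()/gaps() once per member inside data.table's
# grouped j expression instead of splitting the data with plyr::ddply()
ClaimsDT <- as.data.table(Claims)
Gaps_DT  <- ClaimsDT[,
    as.data.table(gaps(reduce(IRanges(start = as.numeric(startdate),
                                      end   = as.numeric(enddate))))),
    by = member_id]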
Background: My job involves working with the medication records of de-identified health-insurance plan members. My team increasingly needs to exclude patients from downstream analyses of our Claims data if their medication history contains any gap longer than 60 days. We identify these patients by looking at the startdate and enddate of each record in our dataset; each record also contains the member_id of the member who made the purchase. Unfortunately, identifying 'gappy' patients is not as simple as a pairwise comparison of records. This simplified excerpt from Claims illustrates why (dates are represented as integers below):
member_id  startdate  enddate
A                  1       90
A                 14       15
A                121      180
B                  1       30
B               2001     2030
...              ...      ...
Patient B should obviously be removed, since he has a gap of length 2001 − 30 − 1 = 1970 > 60. Patient A, however, should be retained despite the apparent gap of length 121 − 15 − 1 = 105 > 60 between his second and third prescriptions: his second prescription is entirely covered by his first, so the only true gap in his medication history is the one of length 121 − 90 − 1 = 30 < 60 between his first and third prescriptions.
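To make the Patient A case concrete, this is the behaviour I rely on (a minimal illustration, not part of my actual pipeline):

library(IRanges)

# Patient A's three records; the second is entirely covered by the first
A_claims <- IRanges(start = c(1, 14, 121), end = c(90, 15, 180))
gaps(reduce(A_claims))   # a single gap: start 91, end 120, width 30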
Current approach and situation: I have been able to take these issues into account using a custom function called smart (it definitely is smarter than the previous loop-based function we employed). smart invokes IRanges::reduce and IRanges::gaps.
smart <- function(Claims)
{
    # build ranges from one member's claim windows
    MemberClaims_I <- IRanges(start = as.numeric(Claims$startdate), end = as.numeric(Claims$enddate))
    # merge overlapping or adjacent claims, then extract the remaining gaps
    MemberClaims_Red <- reduce(MemberClaims_I)
    MemberGaps <- as.data.table(gaps(MemberClaims_Red))
    MemberGaps
}
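For example, applied only to Patient A's rows from the excerpt above, smart() returns a single row with start 91, end 120 and width 30.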
This custom function is currently applied to Claims, one member at a time, using plyr::ddply:
> member_id <- levels(Claims$member_id)
> #system.time(Claims_Smart <- ddply(Claims, .(member_id), smart))
> Claims_Smart <- ddply(Claims, .(member_id), smart)
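For completeness (this is not the part I am asking about), the downstream exclusion step then looks roughly like this:

# rough sketch: drop any member with at least one gap wider than 60 days
gappy_ids   <- unique(Claims_Smart$member_id[Claims_Smart$width > 60])
Claims_Kept <- Claims[!(Claims$member_id %in% gappy_ids), ]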
The commented-out system.time() call above tells me that ~20,000 rows covering ~1,000 patients are processed in ~8 seconds; a dataset with 3 million rows and 600,000 patients gets processed in ~8 hours. Here are my session details:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3 zoo_1.7-12 stringr_1.0.0 fields_8.3-5 maps_3.0.0-2
[7] spam_1.3-0 shiny_0.12.2 RPostgreSQL_0.4 DBI_0.3.1 zipcode_1.0 visreg_2.2-0
[13] plyr_1.8.3 IRanges_2.4.1 S4Vectors_0.8.2 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.2 magrittr_1.5 xtable_1.8-0 lattice_0.20-33 R6_2.1.1 tools_3.2.2 htmltools_0.2.6 lazyeval_0.1.10
[9] assertthat_0.1 digest_0.6.8 mime_0.4 stringi_1.0-1 jsonlite_0.9.17 chron_2.3-47 httpuv_1.3.3
Thanks for your help!