I have a large, wide data.table (20M rows) keyed by a person ID, with many columns (~150) that contain lots of NA values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations, and there are about 500,000 people in the set. Values from one person cannot 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.
For demonstration purposes - here's a very small sample input:
library(data.table)

DT = data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)
It looks like this:
id aa bb cc
1: 1 A NA 1
2: 1 NA NA NA
3: 1 B NA NA
4: 1 C NA NA
5: 2 NA NA NA
6: 2 NA NA 4
7: 2 D NA NA
8: 2 E NA 5
9: 3 F NA 6
10: 3 NA NA NA
11: 3 NA NA 7
12: 3 NA NA NA
My expected output looks like this:
id aa bb cc
1: 1 A NA 1
2: 1 A NA 1
3: 1 B NA 1
4: 1 C NA 1
5: 2 NA NA NA
6: 2 NA NA 4
7: 2 D NA 4
8: 2 E NA 5
9: 3 F NA 6
10: 3 F NA 6
11: 3 F NA 7
12: 3 F NA 7
I've found a data.table solution using zoo::na.locf() that works, but it's terribly slow on my large data sets:

library(zoo)
DT[, na.locf(.SD, na.rm = FALSE), by = id]
I've found an equivalent dplyr solution that is equally slow:

library(dplyr)
GRP = DT %>% group_by(id)
data.table(GRP %>% mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))))
I was hopeful that I could come up with a rolling 'self' join using the data.table functionality, but I just can't seem to get it right (I suspect I would need to use .N, but I just haven't figured it out).
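The closest I've gotten is a per-column rolling join along a row counter (the rn and nonNA names are mine; this is a sketch of the idea, not a tuned solution, and I haven't benchmarked it at scale):

```r
library(data.table)

DT = data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

DT[, rn := .I]                       # row counter to roll along
cols = setdiff(names(DT), c("id", "rn"))
for (col in cols) {
  # keep only the non-NA observations of this column
  nonNA = DT[!is.na(get(col)), .(id, rn, val = get(col))]
  # rolling join: for each row, take the last non-NA value at or
  # before its rn within the same id (roll = TRUE carries forward)
  DT[, (col) := nonNA[DT, on = .(id, rn), roll = TRUE, x.val]]
}
DT[, rn := NULL]
```

On the sample data this reproduces the expected output, including the leading NAs at the start of id 2 (nothing to carry forward yet).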
At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped LOCF. I'm new to R, but I'm not new to C++, so I'm confident I could do it. I just feel like there should be an efficient way to do this in R using data.table.
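One idea I'm toying with is a fully vectorized fill based on a cummax() index trick, which avoids calling anything per group at all (grouped_locf is a name I invented; a sketch only, not benchmarked on the full data):

```r
# Grouped LOCF without by-group calls: an "anchor" is any non-NA value
# or the first row of a group; cummax() then gives, for each row, the
# position of the most recent anchor at or before it, so values never
# bleed across an id boundary (a group's leading NAs stay NA).
grouped_locf <- function(x, id) {
  new_grp <- c(TRUE, id[-1] != id[-length(id)])  # TRUE where a new id starts
  anchor  <- !is.na(x) | new_grp
  x[cummax(seq_along(x) * anchor)]               # 0 where not an anchor
}

# Applied per column inside data.table it would look something like:
# cols <- setdiff(names(DT), "id")
# DT[, (cols) := lapply(.SD, grouped_locf, id = id), .SDcols = cols]
id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
aa <- c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA)
grouped_locf(aa, id)
```

This assumes the table is already sorted so that each id's rows are contiguous, which holds here since the table is keyed by the person ID.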