r - Compute rolling sum by id variables, with missing timepoints

Question

Welcome To Ask or Share your Answers For Others

r - Compute rolling sum by id variables, with missing timepoints

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - Compute rolling sum by id variables, with missing timepoints

I'm trying to learn R and there are a few things I've done for 10+ years in SAS that I cannot quite figure out the best way to do in R. Take this data:

 id  class           t  count  desired
 --  -----  ----------  -----  -------
  1      A  2010-01-15      1        1
  1      A  2010-02-15      2        3
  1      B  2010-04-15      3        3
  1      B  2010-09-15      4        4
  2      A  2010-01-15      5        5
  2      B  2010-06-15      6        6
  2      B  2010-08-15      7       13
  2      B  2010-09-15      8       21

I want to calculate the column desired as a rolling sum by id, class, and within a 4 months rolling window. Notice that not all months are present for each combination of id and class.

In SAS I'd typically do this in one of 2 ways:

RETAIN plus a by id & class.
PROC SQL with a left join from df as df1 to df as df2 on id, class and the df1.d-df2.d within the appropriate window

What is the best R approach to this type of problem?

t <- as.Date(c("2010-01-15","2010-02-15","2010-04-15","2010-09-15",
               "2010-01-15","2010-06-15","2010-08-15","2010-09-15"))
class <- c("A","A","B","B","A","B","B","B")
id <- c(1,1,1,1,2,2,2,2)
count <- seq(1,8,length.out=8)
desired <- c(1,3,3,4,5,6,13,21)
df <- data.frame(id,class,t,count,desired)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:38:45+0000

Here are a few solutions:

1) zoo Using ave, for each group create a monthly series, m, by merging the original series, z, with a grid, g. Then calculate the rolling sum and retain only the original time points:

library(zoo)
f <- function(i) { 
    z <- with(df[i, ], zoo(count, t))
    g <- zoo(, seq(start(z), end(z), by = "month"))
    m <- merge(z, g)
    window(rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE), time(z))
}
df$desired <- ave(1:nrow(df), df$id, df$class, FUN = f)

which gives:

> df
  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note We have assumed the times are ordered within each group (as in the question). If that is not so then sort df first.

2) sqldf

library(sqldf)
sqldf("select id, class, a.t, a.'count', sum(b.'count') desired 
   from df a join df b 
   using(id, class) 
   where a.t - b.t between 0 and 100
   group by id, class, a.t")

which gives:

  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note: If the merge should be too large to fit into memory then use sqldf("...", dbname = tempfile()) to cause the intermediate results to be stored in a database which it creates on the fly and automatically destroys afterwards.

3) Base R The sqldf solution motivates this base R solution which just translates the SQL into R:

m <- merge(df, df, by = 1:2)
s <- subset(m, t.x - t.y >= 0 & t.x - t.y <= 100)
ag <- aggregate(count.y ~ t.x + class + id, s, sum)
names(ag) <- c("t", "class", "id", "count", "desired")

The result is:

> ag
           t class id count desired
1 2010-01-15     A  1     1       1
2 2010-02-15     A  1     2       3
3 2010-04-15     B  1     3       3
4 2010-09-15     B  1     4       4
5 2010-01-15     A  2     5       5
6 2010-06-15     B  2     6       6
7 2010-08-15     B  2     7      13
8 2010-09-15     B  2     8      21

Note: This does do a merge in memory which might be a problem if the data set is very large.

UPDATE: Minor simplifications of first solution and also added second solution.

UPDATE 2: Added third solution.

Categories

r - Compute rolling sum by id variables, with missing timepoints

r - Compute rolling sum by id variables, with missing timepoints

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags