Performance properties of time-series operations in R at scale (mainly xts and data.table)

Question

Welcome To Ask or Share your Answers For Others

Performance properties of time-series operations in R at scale (mainly xts and data.table)

asked Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

Performance properties of time-series operations in R at scale (mainly xts and data.table)

I am undertaking a new project encompassing large time-series datasets from which dependent calculations are fed into a shiny application. As such, efficiency is of interest to me. The operations are typically restricted to elementary period conversions and subsequent summary statistics for risk metrics.

I am investigating which library/approach to build the calculation scripts with. At present, I am OK with xts and data.table. Although I can resort to libraries as quantmod and TTR, I am hesitant to deploy blackbox functions in production and would prefer to maintain full traceability.

Thus far, I have carried out the following benchmarking exercise in which a data.frame of daily prices is converted into monthly returns. The packages used thus far are xts, data.table and quantmod (as reference). The code is pasted below but can also be found on GitHub.

Benchmark Code

# Simple return exercise: Daily Prices to Monthly Returns
# Input: Nx2 data.frame with columns (N days, price) 
# Output: Mx2 object with columns (M months, return)
# Three different functions: 1. xts, 2. data.table, 3. quantmod

rm(list = ls()); gc()

library(data.table) 
library(zoo)
library(xts)
library(ggplot2)
library(quantmod)

# Asset params
spot = 100
r = 0.01
sigma = 0.02
N = 1e5

# Input data: Nx2 data.frame (date, price)
pmat = data.frame( 
    date = seq.Date(as.Date('1970-01-01'), by = 1, length.out = N),
    price = spot * exp(cumsum((r - 0.5 * sigma**2) * 1/N + (sigma * (sqrt(1/N)) * rnorm(N, mean = 0, sd = 1))))
)

# Output functions

      # 1. xts standalone 
      xtsfun = function(mat){
        xtsdf = as.xts(mat[, 2], order.by = mat[, 1])
        eom_prices = to.monthly(xtsdf)[, 4]
        mret = eom_prices/lag.xts(eom_prices) - 1; mret[1] = eom_prices[1]/xtsdf[1] - 1
        mret
      }
      
      # 2. data.table standalone 
      dtfun = function(mat){
        dt = setNames(as.data.table(mat), c('V1', 'V2'))
        dt[, .(EOM = last(V2)), .(Month = as.yearmon(V1))][, .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
      }
      
      # 3. quantmod (black box library)
      qmfun = function(mat){
        qmdf = as.xts(mat[, 2], order.by = mat[, 1])
        monthlyReturn(qmdf)
      }

# Check 1 == 2 == 3:
all.equal(
    unlist(dtfun(pmat[1:1000,])[, Return]),
    as.numeric(xtsfun(pmat[1:1000,])),
    as.numeric(qmfun(pmat[1:1000,])),
    scale = NULL
)
    
# Benchmark
library(microbenchmark)
gc()

mbm = microbenchmark(
  xts = xtsfun(pmat),
  data.table = dtfun(pmat),
  quantmod = qmfun(pmat),
  times = 50
)

mbm

Results

For N = 1e5, the three approaches perform similarly:

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
        xts 20.62520 22.93372 25.14445 23.84235 27.25468  39.29402    50
 data.table 21.23984 22.29121 27.28266 24.05491 26.25416  98.35812    50
   quantmod 14.21228 16.71663 19.54709 17.19368 19.38106 102.56189    50

However, for N = 1e6 I observe a substantial performance divergence for data.table:

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
        xts  296.8969  380.7494  408.7696  397.4292  431.1306  759.7227    50
 data.table 1562.3613 1637.8787 1669.8513 1651.4729 1688.2312 1969.4942    50
   quantmod  144.1901  244.2427  278.7676  268.4302  331.4777  418.7951    50

I am very curious as to what drives this result, particularly since data.table normally excels at large N. Of course, dtfun could just be poorly written (and I very much appreciate any code improvements), but I achieve similar results using other approaches including self-joins on EOM dates and cumprod on daily returns.

Do xts and/or quantmod benefit from any internal rcpp or eqv calls that improves their performance at scale? Lastly, if you are aware of any other competitive standalone solution (base?, dplyr?) for large-scale TS, I am all ears.

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

Categories

Performance properties of time-series operations in R at scale (mainly xts and data.table)

Performance properties of time-series operations in R at scale (mainly xts and data.table)

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags