There are other posts about row-wise operators on datatable. They are either too simple or solves a specific scenario
My question here is more generic. There is a solution using dplyr. I have played around but failed to find a an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduce the same results than the dplyr version?
EDIT 1: Summary of benchmarks of the suggested solutions on real dataset (10MB, 73000 rows, stats made on 24 numeric columns). The benchmark results is subjective. However, the elapsed time is consistently reproducible.
| Solution By | Speed compared to dplyr |
|-------------|-----------------------------|
| Metrics v1 | 4.3 times SLOWER (use .SD) |
| Metrics v2 | 5.6 times FASTER |
| ExperimenteR| 15 times FASTER |
| Arun v1 | 3 times FASTER (Map func)|
| Arun v2 | 3 times FASTER (foo func)|
| Ista | 4.5 times FASTER |
EDIT 2: I have added NACount column a day after. This is why this column is not found in the solutions suggested by various contributors.
Data Setup
library(data.table)
dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
Country = c("CA", "FR", "FR", "CA", "CA"),
Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22, 8, NA, 5, NA),
Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))
# ProductName Country Q1 Q2 Q3 Q4
# 1: Lettuce CA NA 22 51 79
# 2: Beetroot FR 61 8 NA 10
# 3: Spinach FR 40 NA NA 49
# 4: Kale CA 54 5 16 NA
# 5: Carrot CA NA NA NA NA
SOLUTION using dplyr + rowwise()
library(dplyr) ; library(magrittr)
dt %>% rowwise() %>%
transmute(ProductName, Country, Q1, Q2, Q3, Q4,
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))
# ProductName Country Q1 Q2 Q3 Q4 AVG MIN MAX SUM NAcnt
# 1 Lettuce CA NA 22 51 79 50.66667 22 79 152 1
# 2 Beetroot FR 61 8 NA 10 26.33333 8 61 79 1
# 3 Spinach FR 40 NA NA 49 44.50000 40 49 89 2
# 4 Kale CA 54 5 16 NA 25.00000 5 54 75 1
# 5 Carrot CA NA NA NA NA NaN Inf -Inf 0 4
ERROR with data.table (compute entire column instead of per-row)
dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]
# ProductName Country Q1 Q2 Q3 Q4 AVG MIN MAX SUM NAcnt
# 1: Lettuce CA NA 22 51 79 35.90909 5 79 395 9
# 2: Beetroot FR 61 8 NA 10 35.90909 5 79 395 9
# 3: Spinach FR 40 NA NA 49 35.90909 5 79 395 9
# 4: Kale CA 54 5 16 NA 35.90909 5 79 395 9
# 5: Carrot CA NA NA NA NA 35.90909 5 79 395 9
ALMOST solution but more complex and missing Q1,Q2,Q3,Q4 output columns
dtmelt <- reshape2::melt(dt, id=c("ProductName", "Country"),
variable.name="Quarter", value.name="Qty")
dtmelt[, .(AVG = mean(Qty, na.rm=TRUE),
MIN = min (Qty, na.rm=TRUE),
MAX = max (Qty, na.rm=TRUE),
SUM = sum (Qty, na.rm=TRUE),
NAcnt= sum(is.na(Qty))), by = list(ProductName, Country)]
# ProductName Country AVG MIN MAX SUM NAcnt
# 1: Lettuce CA 50.66667 22 79 152 1
# 2: Beetroot FR 26.33333 8 61 79 1
# 3: Spinach FR 44.50000 40 49 89 2
# 4: Kale CA 25.00000 5 54 75 1
# 5: Carrot CA NaN Inf -Inf 0 4
See Question&Answers more detail:
os