I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
In the past, I've only had to deal with cases of mis-matched repetitions, which has been easy with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit)
(or the default setting). I'll get averages for each lot over three values for var1
and over four values for var2
.
Unfortunately, this will leave me with a dataset completely missing lot A
in this case:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
name var1 var2 var3
1 B 2 6 2
2 C 2 10 6
If I use na.pass
, however, I also don't get what I want:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
name var1 var2 var3
1 A NA 2.5 NA
2 B NA 6.5 2.5
3 C NA 10.5 6.5
Now I lose the good data I had in var1
since it contained instances of NA
.
What I'd like is:
NA
as the output of mean()
if all unique combinations of varN
~ name
are NA
s
- Output of
mean()
if there are one or more actual values for varN
~ name
I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply
for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate()
like so:
ddply(test, .(name), summarise,
var1 = mean(var1, na.rm = T),
var2 = mean(var2, na.rm = T),
var3 = mean(var3, na.rm = T))
Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate()
or 2) shorter syntax for ddply
.
See Question&Answers more detail:
os