Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Grouping functions (tapply, by, aggregate) and the *apply family

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain which one to use when?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output.

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix).

  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names.

  5. by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/dataframe. Pretty print the grouping and the value of f at each column.

  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

  asked by grautur (translated from Stack Overflow)


1 Answer

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

     # Two dimensional matrix
     M <- matrix(seq(1,16), 4, 4)

     # apply min to rows
     apply(M, 1, min)
     [1] 1 2 3 4

     # apply max to columns
     apply(M, 2, max)
     [1]  4  8 12 16

     # 3 dimensional array
     M <- array( seq(32), dim = c(4,4,2))

     # Apply sum across each M[*, , ] - i.e. sum across 2nd and 3rd dimension
     apply(M, 1, sum)
     # Result is one-dimensional
     [1] 120 128 136 144

     # Apply sum across each M[*, *, ] - i.e. sum across 3rd dimension
     apply(M, c(1,2), sum)
     # Result is two-dimensional
          [,1] [,2] [,3] [,4]
     [1,]   18   26   34   42
     [2,]   20   28   36   44
     [3,]   22   30   38   46
     [4,]   24   32   40   48

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
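As a rough illustration of the two points made for apply (a sketch, not a benchmark): the dedicated row/column helpers return the same numbers as apply(), and calling apply() on a data frame with mixed column types means the function sees rows of a character matrix. The variable names below are made up for this example.

```r
# Row sums two ways: the general apply() and the dedicated rowSums()
M <- matrix(1:16, 4, 4)
via_apply   <- apply(M, 1, sum)
via_rowsums <- rowSums(M)
same_values <- isTRUE(all.equal(via_apply, via_rowsums))

# apply() converts a data frame with as.matrix() first; with one numeric and
# one character column, every row arrives in the function as character
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
row_classes <- apply(df, 1, class)
```

If the columns of a data frame need to keep their own types, working column by column with lapply or sapply is usually the safer route.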

  • lapply - When you want to apply a function to each element of a list in turn and get a list back. This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

     x <- list(a = 1, b = 1:3, c = 10:100)
     lapply(x, FUN = length)
     $a
     [1] 1
     $b
     [1] 3
     $c
     [1] 91

     lapply(x, FUN = sum)
     $a
     [1] 1
     $b
     [1] 6
     $c
     [1] 5005

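Two further lapply patterns that come up often, sketched here with made-up data: a data frame is itself a list of columns, so lapply visits each column, and any arguments given after the function are passed along to it for every element.

```r
# A data frame is a list of columns, so lapply() works column by column
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
col_ranges <- lapply(df, range)

# Extra arguments after FUN are forwarded to FUN on every call
vals  <- list(a = c(1, NA, 3), b = c(4, 5, NA))
means <- lapply(vals, mean, na.rm = TRUE)
```

Both results are lists that keep the names of the input (x and y, a and b).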
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list. If you find yourself typing unlist(lapply(...)), stop and consider sapply.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Compare with above; a named vector, not a list
     sapply(x, FUN = length)
      a  b  c
      1  3 91

     sapply(x, FUN = sum)
        a    b    c
        1    6 5005

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

     sapply(1:5,function(x) rnorm(3,x)) 

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

     sapply(1:5,function(x) matrix(x,2,2)) 

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

     sapply(1:5,function(x) matrix(x,2,2), simplify = "array") 

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
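The sapply simplification rules described above can be checked by looking at the shape of each result. This is a small sketch with deterministic functions in place of rnorm so the shapes are easy to verify; the variable names are illustrative only.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# sapply() gives the same named vector as unlist(lapply(...))
same <- identical(unlist(lapply(x, length)), sapply(x, length))

# Equal-length vectors become the columns of a matrix: 3 rows, 5 columns
cols <- sapply(1:5, function(i) c(i, i^2, i^3))

# Each 2 x 2 matrix is flattened to a column of length 4: 4 rows, 5 columns
flat <- sapply(1:5, function(i) matrix(i, 2, 2))

# simplify = "array" stacks the matrices instead: 2 x 2 x 5
arr <- sapply(1:5, function(i) matrix(i, 2, 2), simplify = "array")

# Results of different lengths cannot be simplified, so a list comes back
ragged <- sapply(1:3, function(i) seq_len(i))
```

The last case is the one to watch for: when the lengths differ, sapply quietly returns a list rather than raising an error.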

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code. For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Note that since the advantage here is mainly speed, this
     # example is only for illustration. We're telling R that
     # everything returned by length() should be an integer of
     # length 1.
     vapply(x, FUN = length, FUN.VALUE = 0L)
      a  b  c
      1  3 91

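Besides speed, the template passed as FUN.VALUE also acts as a check on what the function returns. The sketch below (names are illustrative) shows vapply raising an error on a type mismatch, and the difference between sapply and vapply on empty input.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# The template integer(1) says: each call must return one integer
lens <- vapply(x, length, FUN.VALUE = integer(1))

# A function that returns a character string violates the template,
# so vapply() stops with an error; it is caught here for demonstration
outcome <- tryCatch(
  vapply(x, function(v) "oops", FUN.VALUE = integer(1)),
  error = function(e) "error"
)

# On empty input sapply() returns an empty list,
# while vapply() returns an empty vector of the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))
```

That predictability is why vapply is often suggested inside functions and packages, where the shape of the result matters to later code.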
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply. This is multivariate in the sense that your function must accept multiple arguments.

     # Sums the 1st elements, the 2nd elements, etc.
     mapply(sum, 1:5, 1:5, 1:5)
     [1]  3  6  9 12 15

     # To do rep(1,4), rep(2,3), etc.
     mapply(rep, 1:4, 4:1)
     [[1]]
     [1] 1 1 1 1

     [[2]]
     [1] 2 2 2

     [[3]]
     [1] 3 3

     [[4]]
     [1] 4

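A slightly fuller mapply sketch, using made-up inputs: the function receives one element from each vector per call, and arguments that should stay the same on every call can be supplied through MoreArgs.

```r
# One element from each vector per call: 1 + 4, 2 + 5, 3 + 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Names on the first vector carry over to the result
named <- mapply(function(x, y) x * y, c(a = 1, b = 2), c(10, 20))

# MoreArgs holds arguments that are fixed across all calls (here, sep)
padded <- mapply(
  function(x, n, sep) paste(rep(x, n), collapse = sep),
  c("a", "b"), 2:1,
  MoreArgs = list(sep = "-")
)
```

Here sums is 5 7 9, named is a named vector with a = 10 and b = 40, and padded holds "a-a" and "b".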
  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

     Map(sum, 1:5, 1:5, 1:5)
     [[1]]
     [1] 3

     [[2]]
     [1] 6

     [[3]]
     [1] 9

     [[4]]
     [1] 12

     [[5]]
     [1] 15

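The relationship between Map and mapply can be confirmed directly. A minimal sketch, with an illustrative helper function add:

```r
add <- function(x, y) x + y

# Map() always returns a list, one element per set of arguments
via_map    <- Map(add, 1:3, 4:6)

# The same call spelled out with mapply()
via_mapply <- mapply(add, 1:3, 4:6, SIMPLIFY = FALSE)

same <- identical(via_map, via_mapply)
```

Because the result is always a list, Map is convenient when the function returns objects that should not be squashed into a vector, such as data frames or fitted models.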
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively. To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

     # Append ! to string, otherwise increment
     myFun <- function(x){
         if(is.character(x)){
             return(paste(x,"!",sep=""))
         } else {
             return(x + 1)
         }
     }

     # A nested list structure
     l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
               b = 3, c = "Yikes",
               d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

     # Result is named vector, coerced to character
     rapply(l, myFun)

     # Result is a nested list like l, with values altered
     rapply(l, myFun, how="replace")

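rapply also takes a classes argument, which restricts the function to leaves of a given class and avoids writing the type check inside the function. A small sketch on a reduced version of the nested list (the list and names here are illustrative):

```r
l <- list(a = list(a1 = "Boo", b1 = 2), b = 3, c = "Yikes")

# Double only the numeric leaves; with how = "replace" the other leaves
# are left as they are and the nesting is preserved
doubled <- rapply(l, function(x) x * 2, classes = "numeric", how = "replace")

# With how = "unlist", leaves that do not match classes are dropped
nums <- rapply(l, function(x) x * 2, classes = "numeric", how = "unlist")
```

In doubled the strings "Boo" and "Yikes" are unchanged while 2 and 3 become 4 and 6; nums contains only the two doubled numbers.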
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor. The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

     x <- 1:20 

    A factor (of the same length!) defining groups:

     y <- factor(rep(letters[1:5], each = 4)) 

    Add up the values in x within each subgroup defined by y:

     tapply(x, y, sum)
      a  b  c  d  e
     10 26 42 58 74

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
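For the multi-factor case, the grouping argument is a list of factors and the result gains one dimension per factor. A sketch reusing x and y from above, with a second factor z that is invented for this example:

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))

# A second, made-up grouping: whether each value of x is odd or even
z <- factor(rep(c("odd", "even"), times = 10))

# One cell per combination of y and z: a 5 x 2 matrix
tab <- tapply(x, list(y, z), sum)
```

The rows of tab are labelled a to e and the columns even and odd; for group a the odd values 1 and 3 sum to 4 and the even values 2 and 4 sum to 6.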

    tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
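To make the comparison with the other base split-apply-combine functions concrete, here is the same grouped sum computed four ways. This is a sketch on a small made-up data frame; ddply is left out because it belongs to the plyr package rather than base R.

```r
df <- data.frame(x = 1:20, g = rep(letters[1:5], each = 4))

# tapply: vector in, one named value per group out
t_sums <- tapply(df$x, df$g, sum)

# aggregate: data frame in, data frame out (one row per group)
a_sums <- aggregate(x ~ g, data = df, FUN = sum)

# ave: result has the same length as x, each value replaced by its group sum
v_sums <- ave(df$x, df$g, FUN = sum)

# by: the function receives each group's sub-data-frame
b_sums <- by(df, df$g, function(d) sum(d$x))
```

All four produce the group sums 10, 26, 42, 58, 74; what differs is the container: a named array, a data frame, a vector aligned with the original rows, and an object of class "by" that prints group by group.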

