R语言中reshape2函数族
前言
前几天放出来的那个R的展示中,有说到其实学R的过程更多的就是熟悉各种函数的过程(学习统计模型不在此列...我个人还是倾向于不要借助软件来学习理论知识,虽然可以直接看codes...笔和纸上的推导还是不可或缺的基本功),然后各种基础函数熟悉了之后很多被打包好的函数就是缩短代码长度的利器了。
excel里面有神奇的“数据透视表(pivot table)”,其实很多时候真的已经很神奇了....不过我还是喜欢R,喜欢R直接输出csv或者xlsx的简洁。揉数据呢(学名貌似叫数据整理),我也还是喜欢写出来代码的形式,而不是直接向excel那样面对结果。只是感觉更加不容易出错吧。
揉数据,顾名思义,就是在原有的数据格式基础上,变化出来其他的形式。比如,长长的时间序列变成宽一点的~当然这个可以简单的借助reshape()函数了。可惜我还是不死心,想找一个更好用的,于是就自然而然的看到了reshape2这个包。
这个包里面函数精华在melt()和*cast()。说实话melt()耗了我一段时间来理解,尤其是为什么需要先melt再cast...后来发现这个步骤简直是无敌啊,什么样的形状都变得更加容易揉了,大赞。
warm-up完毕,还是回到正题吧,怎么用reshape2揉数据呢?虽然reshape2支持array, list和data.frame,但是我一般还是习惯于用data.frame,所以还是说说这东西怎么揉吧。揉数据的第一步就是调用melt()函数,不用担心你的input是什么格式,这个函数array, list和data.frame通吃。然后,要告诉他哪些变量是(唯一)识别一个个体的,这句话是什么意思呢?我们先看melt()的参数:
melt(data, id.vars, measure.vars,
variable.name = "variable", ..., na.rm = FALSE,
value.name = "value")
其中id.vars可以指定一系列变量,然后measure.vars就可以留空了,这样生成的新数据会保留id.vars的所有列,然后增加两个新列:variable和value,一个存储变量的名称一个存储变量值。这样就相当于面板数据的长格式了。直接拷一个作者给出的例子:
原数据:
head(airquality)
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
dim(airquality)
[1] 153 6
然后我们将month和day作为识别个体记录的变量,调用melt(airquality, id=c("month", "day")):
require(reshape2)
head(melt(airquality, id=c("month", "day")))
month day variable value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
dim(melt(airquality, id=c("month", "day")))
[1] 612 4
嗯,这样数据就变长了~然后,就可以随意的cast了...dcast()会给出宽格式的数据,比如我们想把day作为唯一的识别,那么:
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
head(dcast(aqm, day ~ variable+month))
day ozone_5 ozone_6 ozone_7 ozone_8 ozone_9 solar.r_5 solar.r_6 solar.r_7 solar.r_8 solar.r_9 wind_5 wind_6 wind_7
1 1 41 NA 135 39 96 190 286 269 83 167 7.4 8.6 4.1
2 2 36 NA 49 9 78 118 287 248 24 197 8.0 9.7 9.2
3 3 12 NA 32 16 73 149 242 236 77 183 12.6 16.1 9.2
4 4 18 NA NA 78 91 313 186 101 NA 189 11.5 9.2 10.9
5 5 NA NA 64 35 47 NA 220 175 NA 95 14.3 8.6 4.6
6 6 28 NA 40 66 32 NA 264 314 NA 92 14.9 14.3 10.9
wind_8 wind_9 temp_5 temp_6 temp_7 temp_8 temp_9
1 6.9 6.9 67 78 84 81 91
2 13.8 5.1 72 74 85 81 92
3 7.4 2.8 74 67 81 82 93
4 6.9 4.6 62 84 84 86 93
5 7.4 7.4 56 85 83 85 87
6 4.6 15.5 66 79 83 87 84
或者对于每个月,求平均数:
head(dcast(aqm, month ~ variable, mean, margins = c("month", "variable")))
month ozone solar.r wind temp (all)
1 5 23.61538 181.2963 11.622581 65.54839 68.70696
2 6 29.44444 190.1667 10.266667 79.10000 87.38384
3 7 59.11538 216.4839 8.941935 83.90323 93.49748
4 8 59.96154 171.8571 8.793548 83.96774 79.71207
5 9 31.44828 167.4333 10.180000 76.90000 71.82689
6 (all) 42.12931 185.9315 9.957516 77.88235 80.05722
当然还有更强大的acast(),配合.函数:
library(plyr) # needed to access . function
acast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
5 6 7 8 9
ozone 23.61538 29.44444 59.11538 59.96154 31.44828
嗯,基本上数据就可以这么揉来揉去了...哈哈。怎么感觉有点像数据透视表捏?只是更加灵活,还可以自定义函数。
此外还有recast()可以一步到位,只是返回的是list;colsplit()可以分割变量名...函数不多,却精华的很啊。
Example_1
# code_1
require(reshape2)
x = data.frame(subject = c("John", "Mary"),
time = c(1,1),
age = c(33,NA),
weight = c(90, NA),
height = c(2,2))
x
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
------------------------------------------------------
# code_2
molten = melt(x, id = c("subject", "time"))
molten
subject time variable value
1 John 1 age 33
2 Mary 1 age NA
3 John 1 weight 90
4 Mary 1 weight NA
5 John 1 height 2
6 Mary 1 height 2
------------------------------------------------------
# code_3
molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
molten
subject time variable value
1 John 1 age 33
3 John 1 weight 90
5 John 1 height 2
6 Mary 1 height 2
------------------------------------------------------
# 语句
dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time ~ variable)
dcast(molten, formula = subject ~ variable)
dcast(molten, formula = ... ~ variable)
# 结果
> dcast(molten, formula = time + subject ~ variable)
time subject age weight height
1 1 John 33 90 2
2 1 Mary NA NA 2
> dcast(molten, formula = subject + time ~ variable)
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
> dcast(molten, formula = subject ~ variable)
subject age weight height
1 John 33 90 2
2 Mary NA NA 2
> dcast(molten, formula = ... ~ variable)
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
------------------------------------------------------
# 语句
acast(molten, formula = subject ~ time ~ variable)
# 结果
> acast(molten, formula = subject ~ time ~ variable)
, , age
1
John 33
Mary NA
, , weight
1
John 90
Mary NA
, , height
1
John 2
Mary 2
------------------------------------------------------
# Melt French Fries dataset
data(french_fries)
head(french_fries)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
head(ffm)
# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length)
# Passing further arguments through ...
dcast(ffm, treatment ~ ., sum)
dcast(ffm, treatment ~ ., sum, trim = 0.1)
------------------------------------------------------
Example_2
Converting data between wide and long format
Problem
You want to do convert data from a wide format to a long format.
Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.
Solution
There are two sets of methods that are explained below:
gather()
andspread()
from the tidyr package. This is a newer interface to the reshape2 package.melt()
anddcast()
from the reshape2 package.
There are a number of other methods which aren’t covered here, since they are not as easy to use:
- The
reshape()
function, which is confusingly not part of the reshape2 package; it is part of the base install of R. stack()
andunstack()
Sample data
These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.
olddata_wide <- read.table(header=TRUE, text=\'
subject sex control cond1 cond2
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
\')
# Make sure the subject column is a factor
olddata_wide$subject <- factor(olddata_wide$subject)
olddata_long <- read.table(header=TRUE, text=\'
subject sex condition measurement
1 M control 7.9
1 M cond1 12.3
1 M cond2 10.7
2 F control 6.3
2 F cond1 10.6
2 F cond2 11.1
3 F control 9.5
3 F cond1 13.1
3 F cond2 13.8
4 M control 11.5
4 M cond1 13.4
4 M cond2 12.9
\')
# Make sure the subject column is a factor
olddata_long$subject <- factor(olddata_long$subject)
tidyr
From wide to long
Use gather
:
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(tidyr)
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
data_long <- gather(olddata_wide, condition, measurement, control:cond2)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
In this example, the source columns that are gathered are specified with control:cond2
. This means to use all the columns, positionally, between control
and cond2
. Another way of doing it is to name the columns individually, as in:
gather(olddata_wide, condition, measurement, control, cond1, cond2)
If you need to use gather()
programmatically, you may need to use variables containing column names. To do this, you should use the gather_()
function instead, which takes strings instead of bare (unquoted) column names.
keycol <- "condition"
valuecol <- "measurement"
gathercols <- c("control", "cond1", "cond2")
gather_(olddata_wide, keycol, valuecol, gathercols)
Optional: Rename the factor levels of the variable column, and sort.
# Rename factor names from "cond1" and "cond2" to "first" and "second"
levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
# Sort by subject first, then by condition
data_long <- data_long[order(data_long$subject, data_long$condition), ]
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 5 1 M first 12.3
#> 9 1 M second 10.7
#> 2 2 F control 6.3
#> 6 2 F first 10.6
#> 10 2 F second 11.1
#> 3 3 F control 9.5
#> 7 3 F first 13.1
#> 11 3 F second 13.8
#> 4 4 M control 11.5
#> 8 4 M first 13.4
#> 12 4 M second 12.9
From long to wide
Use spread
:
olddata_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 1 M cond1 12.3
#> 3 1 M cond2 10.7
#> 4 2 F control 6.3
#> 5 2 F cond1 10.6
#> 6 2 F cond2 11.1
#> 7 3 F control 9.5
#> 8 3 F cond1 13.1
#> 9 3 F cond2 13.8
#> 10 4 M control 11.5
#> 11 4 M cond1 13.4
#> 12 4 M cond2 12.9
library(tidyr)
# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values
data_wide <- spread(olddata_long, condition, measurement)
data_wide
#> subject sex cond1 cond2 control
#> 1 1 M 12.3 10.7 7.9
#> 2 2 F 10.6 11.1 6.3
#> 3 3 F 13.1 13.8 9.5
#> 4 4 M 13.4 12.9 11.5
Optional: A few things to make the data look nicer.
# Rename cond1 to first, and cond2 to second
names(data_wide)[names(data_wide)=="cond1"] <- "first"
names(data_wide)[names(data_wide)=="cond2"] <- "second"
# Reorder the columns
data_wide <- data_wide[, c(1,2,5,3,4)]
data_wide
#> subject sex control first second
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
reshape2
From wide to long
Use melt
:
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(reshape2)
# Specify id.vars: the variables to keep but not split apart on
melt(olddata_wide, id.vars=c("subject", "sex"))
#> subject sex variable value
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
There are options for melt
that can make the output a little easier to work with:
data_long <- melt(olddata_wide,
# ID variables - all the variables to keep but not split apart on
id.vars=c("subject", "sex"),
# The source columns
measure.vars=c("control", "cond1", "cond2" ),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="condition",
value.name="measurement"
)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
If you leave out the measure.vars
, melt
will automatically use all the other variables as the id.vars
. The reverse is true if you leave out id.vars
.
If you don’t specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "measurement".
Optional: Rename the factor levels of the variable column.
# Rename factor names from "cond1" and "cond2" to "first" and "second"
levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
# Sort by subject first, then by condition
data_long <- data_long[ order(data_long$subject, data_long$condition), ]
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 5 1 M first 12.3
#> 9 1 M second 10.7
#> 2 2 F control 6.3
#> 6 2 F first 10.6
#> 10 2 F second 11.1
#> 3 3 F control 9.5
#> 7 3 F first 13.1
#> 11 3 F second 13.8
#> 4 4 M control 11.5
#> 8 4 M first 13.4
#> 12 4 M second 12.9
From long to wide
The following code uses dcast
to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast
instead.
olddata_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 1 M cond1 12.3
#> 3 1 M cond2 10.7
#> 4 2 F control 6.3
#> 5 2 F cond1 10.6
#> 6 2 F cond2 11.1
#> 7 3 F control 9.5
#> 8 3 F cond1 13.1
#> 9 3 F cond2 13.8
#> 10 4 M control 11.5
#> 11 4 M cond1 13.4
#> 12 4 M cond2 12.9
# From the source:
# "subject" and "sex" are columns we want to keep the same
# "condition" is the column that contains the names of the new column to put things in
# "measurement" holds the measurements
library(reshape2)
data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement")
data_wide
#> subject sex cond1 cond2 control
#> 1 1 M 12.3 10.7 7.9
#> 2 2 F 10.6 11.1 6.3
#> 3 3 F 13.1 13.8 9.5
#> 4 4 M 13.4 12.9 11.5
Optional: A few things to make the data look nicer.
# Rename cond1 to first, and cond2 to second
names(data_wide)[names(data_wide)=="cond1"] <- "first"
names(data_wide)[names(data_wide)=="cond2"] <- "second"
# Reorder the columns
data_wide <- data_wide[, c(1,2,5,3,4)]
data_wide
#> subject sex control first second
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
请发表评论