R语言数据重塑cbind+rbind+merge+ melt+cast

OStack程序员社区-中国程序员成长平台 › 门户 › 编程› R语言›R语言教程

原作者: [db:作者] 来自: [db:来源] 收藏邀请

R语言中的数据重塑是关于变化的数据分为行和列的方式。大多数R地数据处理的时候是通过将输入的数据作为一个数据帧进行。这是很容易提取一个数据帧的行和列数据，但在某些情况，当我们需要的数据帧的格式是不同的来自收到它的格式。 R有许多函数用来分割，合并，改变行列，反之亦然在一个数据帧。

接合列和行中的数据帧

我们可以加入多个向量创建使用 cbind()函数返回数据帧。同时，我们也可以使用 rbind()函数合并两个数据帧。

cbind:重点是将多个向量合并成一个数据帧和 data.frame 还是有一定的差别（cbind数据都带有双引号，data.frame数据去双引号）

rbind:重点是将多个振进行拼接联在一起，主要是追加和拼接

merge：重点是多个数据帧筛选共同项

melt：主要讲一张表格拆分成多个数据（以列为区分项）表

case：将多行相同数据合并等操作

# Create vector objects.
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
zipcode <- c(33602,98104,06161,80294)

# Combine above three vectors into one data frame.
addresses <- cbind(city,state,zipcode)

# Print a header.
cat("# # # # The First data frame\n") 

# Print the data frame.
print(addresses)

# Create another data frame with similar columns
new.address <- data.frame(
   city = c("Lowry","Charlotte"),
   state = c("CO","FL"),
   zipcode = c("80230","33949"),
   stringsAsFactors=FALSE
)

# Print a header.
cat("# # # The Second data frame\n") 

# Print the data frame.
print(new.address)

# Combine rows form both the data frames.
all.addresses <- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n") 

# Print the result.
print(all.addresses)

当我们上面的代码执行时，它产生以下结果：

# # # # The First data frame
     city       state zipcode
[1,] "Tampa"    "FL"  "33602"
[2,] "Seattle"  "WA"  "98104"
[3,] "Hartford" "CT"  "6161" 
[4,] "Denver"   "CO"  "80294"
# # # The Second data frame
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
# # # The combined data frame
       city state zipcode
1     Tampa    FL   33602
2   Seattle    WA   98104
3  Hartford    CT    6161
4    Denver    CO   80294
5     Lowry    CO   80230
6 Charlotte    FL   33949

合并数据帧

我们可以通过使用 merge()函数合并两个数据帧。该数据帧必须在其上合并发生相同的列名。

在下面的例子中，我们考虑对皮马印第安人妇女的糖尿病在可用的数据集库名称 "MASS". 我们合并基础血压（“BP”）和身体质量指数（“BMI”）的值，两个数据集。上用于合并选择这两列，其中，这两个变量的值匹配在两个数据集组合在一起的记录，以形成一个单一的数据帧。

library(MASS)
merged.Pima <- merge(x=Pima.te, y=Pima.tr,
                    by.x=c("bp", "bmi"),
                    by.y=c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)

当我们上面的代码执行时，它产生以下结果：

   bp  bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1  60 33.8       1   117     23 0.466    27     No       2   125     20 0.088
2  64 29.7       2    75     24 0.370    33     No       2   100     23 0.368
3  64 31.2       5   189     33 0.583    29    Yes       3   158     13 0.295
4  64 33.2       4   117     27 0.230    24     No       1    96     27 0.289
5  66 38.1       3   115     39 0.150    28     No       1   114     36 0.289
6  68 38.5       2   100     25 0.324    26     No       7   129     49 0.439
7  70 27.4       1   116     28 0.204    21     No       0   124     20 0.254
8  70 33.1       4    91     32 0.446    22     No       9   123     44 0.374
9  70 35.4       9   124     33 0.282    34     No       6   134     23 0.542
10 72 25.6       1   157     21 0.123    24     No       4    99     17 0.294
11 72 37.7       5    95     33 0.370    27     No       6   103     32 0.324
12 74 25.9       9   134     33 0.460    81     No       8   126     38 0.162
13 74 25.9       1    95     21 0.673    36     No       8   126     38 0.162
14 78 27.6       5    88     30 0.258    37     No       6   125     31 0.565
15 78 27.6      10   122     31 0.512    45     No       6   125     31 0.565
16 78 39.4       2   112     50 0.175    24     No       4   112     40 0.236
17 88 34.5       1   117     24 0.403    40    Yes       4   127     11 0.598
   age.y type.y
1     31     No
2     21     No
3     24     No
4     21     No
5     21     No
6     43    Yes
7     36    Yes
8     40     No
9     29    Yes
10    28     No
11    55     No
12    39     No
13    39     No
14    49    Yes
15    49    Yes
16    38     No
17    28     No
[1] 17

熔化和转换

R语言编程的最有趣的地方是关于改变多个步骤中的数据的形状来获得所希望的形状。用来做这种函数被称为 melt() 和 cast()。

我们认为数据集被称为 ships 出现在库被称为 "MASS".

library(MASS)
print(ships)

当我们上面的代码执行时，它产生以下结果：

   type year period service incidents
1     A   60     60     127         0
2     A   60     75      63         0
3     A   65     60    1095         3
4     A   65     75    1095         4
5     A   70     60    1512         6
.............
.............
8     A   75     75    2244        11
9     B   60     60   44882        39
10    B   60     75   17176        29
11    B   65     60   28609        58
............
............
17    C   60     60    1179         1
18    C   60     75     552         1
19    C   65     60     781         0
............
............

融化数据

现在，我们融化数据需要组织其转换类型(type), 并且 year 到多行以外的所有列。

molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)

当我们上面的代码执行时，它产生以下结果：

    type year  variable value
1      A   60    period    60
2      A   60    period    75
3      A   65    period    60
4      A   65    period    75
............
............
9      B   60    period    60
10     B   60    period    75
11     B   65    period    60
12     B   65    period    75
13     B   70    period    60
...........
...........
41     A   60   service   127
42     A   60   service    63
43     A   65   service  1095
...........
...........
70     D   70   service  1208
71     D   75   service     0
72     D   75   service  2051
73     E   60   service    45
74     E   60   service     0
75     E   65   service   789
...........
...........
101    C   70 incidents     6
102    C   70 incidents     2
103    C   75 incidents     0
104    C   75 incidents     1
105    D   60 incidents     0
106    D   60 incidents     0
...........
...........

转换数据

我们可以转化数据转换成在创建每种类型的 ships 每年的汇总的新形式。它是通过使用 case()函数。

recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)

当我们上面的代码执行时，它产生以下结果：

   type year period service incidents
1     A   60    135     190         0
2     A   65    135    2190         7
3     A   70    135    4865        24
4     A   75    135    2244        11
5     B   60    135   62058        68
6     B   65    135   48979       111
7     B   70    135   20163        56
8     B   75    135    7117        18
9     C   60    135    1731         2
10    C   65    135    1457         1
11    C   70    135    2731         8
12    C   75    135     274         1
13    D   60    135     356         0
14    D   65    135     480         0
15    D   70    135    1557        13
16    D   75    135    2051         4
17    E   60    135      45         0
18    E   65    135    1226        14
19    E   70    135    3318        17
20    E   75    135     542         1