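Different fields use different names for the rows and columns of a dataset. Statisticians call them observations and variables, database analysts call them records and fields, and researchers in data mining and machine learning call them examples and attributes.
R has many structures for storing data, including scalars, vectors, arrays, data frames, and lists. This variety gives R a great deal of flexibility in handling data.
The data types (modes) that R can handle include numeric, character, logical (TRUE/FALSE), complex (numbers with an imaginary part), and raw (bytes).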
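As a small illustration of the modes listed above, the sketch below builds one value of each kind and asks R for its mode. The variable names `values` and `modes` are chosen here only for illustration.

```r
# One example value for each basic mode mentioned in the text
values <- list(
  numeric   = 3.14,
  character = "abc",
  logical   = TRUE,
  complex   = 2 + 3i,
  raw       = as.raw(255)
)

# mode() reports the storage mode of each value
modes <- sapply(values, mode)
print(modes)
```

Each element of `modes` should match the name it is stored under, for example `mode(2 + 3i)` is `"complex"`.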
2.2 Data structures

2.2.1 Vectors
a <- c("k", "j", "h", "a", "c", "m")
a[3]
## [1] "h"
a[c(1, 3, 5)]
## [1] "k" "h" "c"

2.2.2 Matrices
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = list(char_vector_rownames, char_vector_colnames))

Here char_vector_rownames and char_vector_colnames are placeholders; the actual default is dimnames = NULL.

nrow: the desired number of rows.
ncol: the desired number of columns.
byrow: logical. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
dimnames: a dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.
rownames.force: (an argument of as.matrix() for data frames, not of matrix()) logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has 'automatic' row.names or for a zero-row data frame.
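To make the byrow and dimnames arguments described above more concrete, here is a minimal sketch. The object names `by_col`, `by_row`, and `named` are illustrative only.

```r
# Default: fill column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE: fill row by row
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# A named dimnames list also names the dimensions themselves
named <- matrix(1:4, nrow = 2,
                dimnames = list(row = c("a", "b"), col = c("x", "y")))
names(dimnames(named))
## [1] "row" "col"
```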
is.matrix returns TRUE if x is a vector and has a "dim" attribute of length 2 and FALSE otherwise. Note that a data.frame is not a matrix by this test.

rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
y <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(rnames, cnames))
y
##    C1 C2
## R1  1  2
## R2  3  4
is.matrix(y)
## [1] TRUE

as.matrix is a generic function. The method for data frames will return a character matrix if there are only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.

da <- data.frame(lot1 = c(1, 2), lot2 = c("a", "b"))
ma <- as.matrix(da)
da
##   lot1 lot2
## 1    1    a
## 2    2    b
ma
##      lot1 lot2
## [1,] "1"  "a"
## [2,] "2"  "b"
str(da[1, 1])
##  num 1
str(ma[1, 1])
##  Named chr "1"
##  - attr(*, "names")= chr "lot1"

As the example shows, the numeric column was converted to character.

If you just want to convert a vector to a matrix, something like the following works:

x <- 1:6
dim(x) <- c(2, 3)
dimnames(x) <- list(c("a", "b"), c("c", "d", "e"))
x
##   c d e
## a 1 3 5
## b 2 4 6
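Rows, columns, and single elements of such a matrix are selected with subscripts. The following sketch rebuilds the 2 x 3 matrix x from the example above so that it runs on its own, then shows a few common forms.

```r
# Rebuild the 2 x 3 matrix from the text
x <- 1:6
dim(x) <- c(2, 3)
dimnames(x) <- list(c("a", "b"), c("c", "d", "e"))

x[2, ]         # entire second row: 2 4 6
x[, 2]         # entire second column: 3 4
x[1, 3]        # single element: 5
x["b", "d"]    # the same idea using row and column names: 4
x[1, c(2, 3)]  # first row, columns 2 and 3: 3 5
```

Leaving a subscript empty selects everything along that dimension.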
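Elements are selected with subscripts such as x[2, ] or x[1, 3]; unlike MATLAB, no colon is needed to denote an entire row or column.

2.2.3 Arrays

Arrays are similar to matrices but can have more than two dimensions.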
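Since the text gives no array example, here is a minimal sketch using the base array() function. The dimension labels (`A1`, `B1`, `C1`, and so on) and the name `z` are made up for illustration.

```r
# Labels for each of the three dimensions (illustrative names)
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

# A 2 x 3 x 4 array, filled in column-major order like a matrix
z <- array(1:24, dim = c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(z)
## [1] 2 3 4
z["A2", "B3", "C1"]  # subscripting by name
## [1] 6
z[1, 2, 4]           # subscripting by position
## [1] 21
```

Subscripting works as for matrices, with one subscript per dimension.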
2.2.4 Data frames
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor
table(patientdata$diabetes, patientdata$status)
##
##         Excellent Improved Poor
##   Type1         1        0    2
##   Type2         0        1    0

Columns can be referenced by position, as in patientdata[1:2]; by name, as in patientdata[c("diabetes", "status")]; or with the $ notation, as in patientdata$age. table() generates a contingency table.

attach() adds a data frame to R's search path, which saves typing the data frame name before every variable name. When R encounters a variable name, it checks the data frames on the search path.
For example:

attach(mtcars)
summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.40   15.43   19.20   20.09   22.80   33.90
plot(mpg, disp)
detach(mtcars)
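detach() removes the data frame from the search path. Note that attach() can lead to masking when an object with the same name already exists.

An alternative is with():

with(mtcars, {
  print(summary(mpg))
  plot(mpg, disp)
})

The statements between the braces { } are all evaluated against the data frame mtcars, so there is no need to worry about name conflicts. If you need to create objects that persist outside the with() construct, use the special assignment operator <<- instead of the standard one (<-); it saves the object to the global environment outside with().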
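The difference between <- and <<- inside with() can be seen in a short sketch. The names `local_mean` and `global_mean` are hypothetical, and the example assumes it is run at top level in a session where neither name is already defined.

```r
with(mtcars, {
  local_mean  <- mean(mpg)   # exists only while with() is being evaluated
  global_mean <<- mean(mpg)  # saved outside with()
})

exists("local_mean")   # FALSE, assuming no such object existed beforehand
exists("global_mean")  # TRUE
global_mean            # about 20.09, matching the Mean in the summary above
```

Because <<- writes outside the with() call, it can overwrite an existing object of the same name, so it is best used sparingly.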
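Case identifiers

In the patient data, the patient ID (patientID) distinguishes the individuals in the dataset. In R, case identifiers can be specified with the row.names option of the functions that create data frames.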
patientdata <- data.frame(patientID, age, diabetes, status, row.names = patientID)

2.2.5 Factors
Categorical (nominal) variables and ordered categorical (ordinal) variables are called factors in R.
factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)

levels: an optional vector of the unique values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).
labels: either an optional character vector of labels for the levels (in the same order as levels after removing those in exclude), or a character string of length 1. Duplicated values in labels can be used to map different values of x to the same factor level.
exclude: a vector of values to be excluded when forming the set of levels. This may be a factor with the same level set as x or should be a character.
ordered: logical flag to determine if the levels should be regarded as ordered (in the order given), TRUE or FALSE.
nmax: an upper bound on the number of levels.

diabetes <- c("Type1", "Type2", "Type1", "Type1")
x <- factor(diabetes)
x
## [1] Type1 Type2 Type1 Type1
## Levels: Type1 Type2
str(x)
##  Factor w/ 2 levels "Type1","Type2": 1 2 1 1
is.factor(x)
## [1] TRUE
as.integer(x)
## [1] 1 2 1 1
y <- factor(diabetes, levels = c("Type2", "Type1"))
y
## [1] Type1 Type2 Type1 Type1
## Levels: Type2 Type1
z <- factor(diabetes, labels = c(2, 1))
z
## [1] 2 1 2 2
## Levels: 2 1
ex <- factor(diabetes, exclude = c("Type1"))
ex
## [1] <NA>  Type2 <NA>  <NA>
## Levels: Type2
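The examples above cover nominal factors only. For the ordered case, a minimal sketch using the status values from the patient data might look as follows. The level order Poor < Improved < Excellent is an assumption made here for illustration, as is the name `status_f`.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# ordered = TRUE treats the levels as ranked, in the order given
status_f <- factor(status,
                   levels  = c("Poor", "Improved", "Excellent"),
                   ordered = TRUE)
status_f
## [1] Poor      Improved  Excellent Poor
## Levels: Poor < Improved < Excellent

as.integer(status_f)       # underlying integer codes
## [1] 1 2 3 1
status_f[2] > status_f[1]  # comparisons are meaningful for ordered factors
## [1] TRUE
```

Without an explicit levels argument, the levels would be sorted alphabetically (Excellent, Improved, Poor), which rarely matches the intended ranking.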