R语言学习笔记之十

摘要: 仅用于记录R语言学习过程：

内容提要：

描述性统计；t检验；数据转换；方差分析；卡方检验；回归分析与模型诊断；生存分析；COX回归

写在正文前的话，关于基础知识，此篇为终结篇，笔记来自医学方的课程，仅用于学习R的过程。

正文：

描述性统计

n 如何去生成table1 用table()函数，快速汇总频数

u 生成四格表：table(行名，列名)

> table(tips$sex,tips$smoker)

No Yes

Female 54 33

Male 97 60

u addmargins()函数：对生成的table表格进行计算

> table(esoph$agegp,esoph$ncases)

0 1 2 3 4 5 6 8 9 17

25-34 14 1 0 0 0 0 0 0 0 0

35-44 10 2 2 1 0 0 0 0 0 0

45-54 3 2 2 2 3 2 2 0 0 0

55-64 0 0 2 4 3 2 2 1 2 0

65-74 1 4 2 2 2 2 1 0 0 1

75+ 1 7 3 0 0 0 0 0 0 0

> tt <- table(esoph$agegp,esoph$ncases)

> addmargins(tt,margin = c(1,2)) # margin 1表示行，2表示列

0 1 2 3 4 5 6 8 9 17 Sum

25-34 14 1 0 0 0 0 0 0 0 0 15

35-44 10 2 2 1 0 0 0 0 0 0 15

45-54 3 2 2 2 3 2 2 0 0 0 16

55-64 0 0 2 4 3 2 2 1 2 0 16

65-74 1 4 2 2 2 2 1 0 0 1 15

75+ 1 7 3 0 0 0 0 0 0 0 11

Sum 29 16 11 9 8 6 5 1 2 1 88

n xtabs()函数：在数据集中取子集，生成表格

> hightip <- tips[,\'tip\'] > mean(tips[,\'tip\'] )

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker==\'No\'))

#as.data.frame转换成数据框；~后面的公式类似table括号中的内容，为分类变量；~左边需添加的是连续型变量；有一个子集subset可进行提取

tips.sex hightip Freq

1 Female FALSE 31

2 Male FALSE 49

3 Female TRUE 23

4 Male TRUE 48

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c(\'Sun\',\'Sat\')))

tips.sex hightip Freq

1 Female FALSE 21

2 Male FALSE 53

3 Female TRUE 25

4 Male TRUE 64

示例2：

> xtabs(ncontrols~agegp + alcgp,data = esoph)

alcgp

agegp 0-39g/day 40-79 80-119 120+

25-34 61 45 5 5

35-44 89 80 20 10

45-54 78 81 39 15

55-64 89 84 43 26

65-74 71 53 29 8

75+ 27 12 2 3

> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))

alcgp

agegp 0-39g/day 40-79 80-119 120+ Sum

25-34 61 45 5 5 116

35-44 89 80 20 10 199

45-54 78 81 39 15 213

55-64 89 84 43 26 242

65-74 71 53 29 8 161

75+ 27 12 2 3 44

Sum 415 355 138 67 975

#cbind(数值型，数值型)

> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)

, , = ncases

alcgp

agegp 0-39g/day 40-79 80-119 120+

25-34 0 0 0 1

35-44 1 4 0 4

45-54 1 20 12 13

55-64 12 22 24 18

65-74 11 25 13 6

75+ 4 4 2 3

, , = ncontrols

alcgp

agegp 0-39g/day 40-79 80-119 120+

25-34 61 45 5 5

35-44 89 80 20 10

45-54 78 81 39 15

55-64 89 84 43 26

65-74 71 53 29 8

75+ 27 12 2 3

n ftable()函数：扁平表格，接受xtabs对象，进行调整

> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))

ncases ncontrols

agegp alcgp

25-34 0-39g/day 0 61

40-79 0 45

80-119 0 5

120+ 1 5

35-44 0-39g/day 1 89

40-79 4 80

80-119 0 20

120+ 4 10

45-54 0-39g/day 1 78

40-79 20 81

80-119 12 39

120+ 13 15

55-64 0-39g/day 12 89

40-79 22 84

80-119 24 43

120+ 18 26

65-74 0-39g/day 11 71

40-79 25 53

80-119 13 29

120+ 6 8

75+ 0-39g/day 4 27

40-79 4 12

80-119 2 2

120+ 3 3

n 数据汇总可结合前面学习的函数，如：

u summary(数据集)函数

u describe（数据集）函数（psych包里的）

u describe.data.frame(数据集)函数（Hmisc包里的）

u apply()家族等

t检验

n 假设检验:服从正态分布，方差齐性（如果不齐，可以用t’检验），

n 示例：

u 生成随机数据，并检验是否符合正态分布：（shapiro.test()函数）

> data1 <- sample(1:100,50)

> #检验正态性 shapiro.test()函数

> shapiro.test(data1)

Shapiro-Wilk normality test

data: data1

W = 0.96424, p-value = 0.1338 #不能拒绝原假设，服从正态分布

除此之外，还有以下5个函数可用于检验是否符合正态分布：

library(nortest)

lillie.test(data1)

ad.test(data1)

cvm.test(data1)

pearson.test(data1)

sf.test(data1)

如：

> lillie.test(data1)

Lilliefors (Kolmogorov-Smirnov) normality test

data: data1

D = 0.064492, p-value = 0.8693

> ad.test(data1)

Anderson-Darling normality test

data: data1

A = 0.40918, p-value = 0.333

> cvm.test(data1)

Cramer-von Mises normality test

data: data1

W = 0.054212, p-value = 0.4437

> pearson.test(data1)

Pearson chi-square normality test

data: data1

P = 4, p-value = 0.7798

> sf.test(data1)

Shapiro-Francia normality test

data: data1

W = 0.97496, p-value = 0.3075

u 生成服从正态分布的数据：rnorm()函数

> #生成服从正态分布的数据

> data3 <- rnorm(100,3,5)

> data4 <- rnorm(200,3.4,8)

> shapiro.test(data3)

Shapiro-Wilk normality test

data: data3

W = 0.99369, p-value = 0.9258

> shapiro.test(data4)

Shapiro-Wilk normality test

data: data4

W = 0.98946, p-value = 0.149

u 检验方差齐性：var.test()函数：仅适用于两组

> var.test(data3,data4)

F test to compare two variances

data: data3 and data4

F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.2893355 0.5739700

sample estimates:

ratio of variances

0.4034794 #p值小于0.05，说明方差不齐性

u 进行t检验函数：t.test()函数

u 两组均数t检验

> t.test(data3,data4,var.equal = F) #方差不齐则var.equal设置为F，默认也为FALSE，若其设置为TRUE，如为FALSE,则会使用t’检验

Welch Two Sample t-test

data: data3 and data4

t = -1.3681, df = 281.41, p-value = 0.1724 #说明两组均数无显著统计学差异

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-2.587004 0.465494

sample estimates:

mean of x mean of y

2.468549 3.529304

u 样本均数与总体均数t检验

> #样本均数与总体均数的比较

> t.test(data3,mu =3.2)

One Sample t-test

data: data3

t = -1.4117, df = 99, p-value = 0.1612 #样本与总体无统计学差异

alternative hypothesis: true mean is not equal to 3.2

95 percent confidence interval:

1.440424 3.496675

sample estimates:

mean of x

2.468549

u 配对数据的t检验

data3 <- rnorm(200,3,5)

> data4 <- rnorm(200,3.4,5)

> #配对数据

> #进行配对t检验

> t.test(data3,data4,paired = TRUE)

Paired t-test

data: data3 and data4

t = 0.59079, df = 199, p-value = 0.5553 #p大于0.05，说明无显著差异

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.6946177 1.2888581

sample estimates:

mean of the differences

0.2971202

数据变换

n 数据不满足正态分布时该如何处理：有时可采用取log()、sqrt()、1/x等等方式，

n runif()平均分布函数

> mydata <- runif(100,min = 1,max = 2)

> shapiro.test(mydata)

Shapiro-Wilk normality test

data: mydata

W = 0.93149, p-value = 6.05e-05

> shapiro.test(log(mydata))

Shapiro-Wilk normality test

data: log(mydata)

W = 0.93082, p-value = 5.54e-05

> shapiro.test(sqrt(mydata))

Shapiro-Wilk normality test

data: sqrt(mydata)

W = 0.93236, p-value = 6.794e-05

> shapiro.test(1/mydata)

Shapiro-Wilk normality test

data: 1/mydata

W = 0.9195, p-value = 1.325e-05

#几种方式均不可行。

n boxcox转换---对公式的，加载MASS包，运用里面的boxcox()函数，#boxcox()转换，核心为找到lambda的值进行相应的转换

bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,

lambda = seq(-0.25,0.25,length =10))

#Volume为拟操作的变量，Height和Girth为用此两个数据进行估计，但要取log，trees为数据集，lambda后面的公式为取最佳值

u 用公式：(x^lambda-1)/lambda 进行数据转换（lambda 不等于1），新的Volume_bc就服从正态分布了。

> Volume_bc <- (trees$Volume^lambda-1)/lambda

> shapiro.test(Volume_bc)

Shapiro-Wilk normality test

data: Volume_bc

W = 0.96431, p-value = 0.3776

u 可用qqnorm(Volume_bc)和qqline(Volume_bc)c查看图，是否符合正态分布

n boxcox转换---对数值的，加载forecast包，利用包中的BoxCox.lambda()函数

> BoxCox.lambda(trees$Volume)

[1] -0.4954451 #lambda的值，采用1/sqrt(x)

分以下几种情况：

ü lambda接近-1时 1/x

ü -0.5 1/sqrt(x)

ü 0 ln(x) 或log(x)

ü 0.5 sqrt()

ü 1 不用变换了

完整示例：

> BoxCox.lambda(trees$Volume)

[1] -0.4954451

> new_volum <- 1/sqrt(trees$Volume)

> shapiro.test(new_volum)

Shapiro-Wilk normality test

data: new_volum

W = 0.94523, p-value = 0.1152

n 利用car包中powerTransform得到lambda值，再进行正态分布分析

> library(car)

> powerTransform(trees$Volume)

Estimated transformation parameter

trees$Volume

-0.07476608

> new_volum <- trees$Volume^-0.07476608

> shapiro.test(new_volum)

Shapiro-Wilk normality test

data: new_volum

W = 0.96428, p-value = 0.3768

方差分析

n 用于多组均值的比较。两组均值的t检验与方差分析等价，但是对于>=3组的均数比较，t检验不适用，需用方差分析。

n 假设检验，需满足的条件：

u 正态性

u 方差齐性

u 独立性

n 方差齐性的检验

u 安装multcomp包，加载cholesterol数据集(里面包含response组和trt组)

u 方差齐性的检验：

l R语言的内置包：bartlett.test()

> bartlett.test(response~trt,data = cholesterol)

Bartlett test of homogeneity of variances

data: response by trt

Bartlett\'s K-squared = 0.57975, df = 4, p-value =

0.9653

> #p = 0.9653 方差是齐性的

> #正态性检验

> shapiro.test(cholesterol$response)

Shapiro-Wilk normality test

data: cholesterol$response

W = 0.97722, p-value = 0.4417

l R语言的内置包：fligner.test() 与bartlett.test()检验原理不同，但公式一样

> #方差齐性检验

> fligner.test(response~trt,data = cholesterol)

Fligner-Killeen test of homogeneity of variances

data: response by trt

Fligner-Killeen:med chi-squared = 0.74277, df = 4,

p-value = 0.946

l car包中的ncvTest()函数：

> ncvTest(lm(response~trt,data = cholesterol))

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.1338923 Df = 1 p = 0.71443

l car包中的leveneTest()函数：

> leveneTest(response~trt,data = cholesterol)

Levene\'s Test for Homogeneity of Variance (center = median)

Df F value Pr(>F)

group 4 0.0755 0.9893

n 方差分析

u 单因素方差分析：aov()函数

> aov(response~trt,data = cholesterol)

Call:

aov(formula = response ~ trt, data = cholesterol)

Terms:

trt Residuals

Sum of Squares 1351.3690 468.7504

Deg. of Freedom 4 45

Residual standard error: 3.227488

Estimated effects may be unbalanced

> summary(fit)

Df Sum Sq Mean Sq F value Pr(>F)

trt 4 1351.4 337.8 32.43 9.82e-13 ***

Residuals 45 468.8 10.4

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#用gplots包中的plotmeans可视化

plotmeans(response~trt,data = cholesterol)

u 单因素方差分析：oneway.test()函数

> oneway.test(response~trt,data = cholesterol)

One-way analysis of means (not assuming equal

variances)

data: response and trt

F = 30.709, num df = 4.00, denom df = 22.46, p-value =

8.227e-09

> oneway.test(response~trt,data = cholesterol,var.equal = T)

One-way analysis of means

data: response and trt

F = 32.433, num df = 4, denom df = 45, p-value =

9.819e-13

u 两两比较：TukeyHSD()函数

> TukeyHSD(fit)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = response ~ trt, data = cholesterol)

$trt

diff lwr upr p adj

2times-1time 3.44300 -0.6582817 7.544282 0.1380949

4times-1time 6.59281 2.4915283 10.694092 0.0003542

drugD-1time 9.57920 5.4779183 13.680482 0.0000003

drugE-1time 15.16555 11.0642683 19.266832 0.0000000

4times-2times 3.14981 -0.9514717 7.251092 0.2050382

drugD-2times 6.13620 2.0349183 10.237482 0.0009611

drugE-2times 11.72255 7.6212683 15.823832 0.0000000

drugD-4times 2.98639 -1.1148917 7.087672 0.2512446

drugE-4times 8.57274 4.4714583 12.674022 0.0000037

drugE-drugD 5.58635 1.4850683 9.687632 0.0030633

#可视化

plot(TukeyHSD(fit))

u 多因素方差分析：aov()函数

> data(\'ToothGrowth\')

> head(\'ToothGrowth\')

len supp dose

1 4.2 VC 0.5

2 11.5 VC 0.5

3 7.3 VC 0.5

4 5.8 VC 0.5

5 6.4 VC 0.5

6 10.0 VC 0.5

> levels(ToothGrowth$supp)

[1] "OJ" "VC"

> levels(ToothGrowth$dose)

NULL

> levels(as.factor(ToothGrowth$dose))

[1] "0.5" "1" "2"

> table(ToothGrowth$supp,ToothGrowth$dose)

0.5 1 2

OJ 10 10 10

VC 10 10 10

#公式的写法关注交互作用

> fangcha <- aov (len~supp+dose,data = ToothGrowth)

> summary(fangcha)

Df Sum Sq Mean Sq F value Pr(>F)

supp 1 205.4 205.4 11.45 0.0013 **

dose 1 2224.3 2224.3 123.99 6.31e-16 ***

Residuals 57 1022.6 17.9

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#交互作用

> fangcha <- aov (len~supp*dose,data = ToothGrowth)

> summary(fangcha)

Df Sum Sq Mean Sq F value Pr(>F)

supp 1 205.4 205.4 12.317 0.000894 ***

dose 1 2224.3 2224.3 133.415 < 2e-16 ***

supp:dose 1 88.9 88.9 5.333 0.024631 * #交互作用不能忽视

Residuals 56 933.6 16.7

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#可视化交互作用图

>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = \'b\',

+ pch = c(1,10),col = c(\'red\',\'green\'))

u 单因素协方差分析：aov()函数

> data(\'litter\')

> head(litter)

dose weight gesttime number

1 0 28.05 22.5 15

2 0 33.33 22.5 14

3 0 36.37 22.0 14

4 0 35.52 22.0 13

5 0 36.77 21.5 15

6 0 29.60 23.0 5

> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)

> summary(facn)

Df Sum Sq Mean Sq F value Pr(>F)

gesttime 1 134.3 134.30 8.289 0.00537 ** #协变量确实有影响，该如何去除？

dose 3 137.1 45.71 2.821 0.04556 *

gesttime:dose 3 81.9 27.29 1.684 0.17889

Residuals 66 1069.4 16.20

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

u 去除协变量的影响：加载effects包，用其中的effect()函数

> effect(\'dose\',facn)

NOTE: dose is not a high-order term in the model

dose effect

dose

0 5 50 500

32.30722 28.50625 30.61078 29.29005

#常见研究设计的表达式

卡方检验

n 对分类变量的检验方法

n 分类：

u 拟合优度检验：用chisq.test()函数

（针对样本数据的分布与已知总体分布是否一致）

> men <- c(11,120,60,45)

> women <- c(20,102,39,30)

> df <- as.data.frame(rbind(men,women))

> colnames(df) <- c(\'AB\',\'O\',\'A\',\'B\')

> chisq.test(men)

Chi-squared test for given probabilities

data: men

X-squared = 105.46, df = 3, p-value < 2.2e-16

> chisq.test(men,p = c(0.1,0.5,0.2,0.2)) #p 中为总体人群中各血型的比例

Chi-squared test for given probabilities

data: men

X-squared = 10.335, df = 3, p-value = 0.01592

u 卡方齐性检验：用于比较不同分类水平下，各个类型的比例是否一致。

> chisq.test(df) #行变量与列变量的相关性检验

Pearson\'s Chi-squared test

data: df

X-squared = 6.8607, df = 3, p-value = 0.07647 #男女不同血型的分布一致

u 卡方独立性检验：

> chisq.test(df)

Pearson\'s Chi-squared test

data: df

X-squared = 6.8607, df = 3, p-value = 0.07647 # 行变量和列变量无关联，血型分布和性别无关

u CMH检验：分层检验三维，行变量为无序，列变量为有序数据；判断是否有混杂因素

#采用mantelhaen.test()函数

> Rabbits <-

+ array(c(0,0,6,5,

+ 3,0,3,6,

+ 6,2,0,4,

+ 5,6,1,0,

+ 2,5,0,0),

+ dim = c(2,2,5),

+ dimnames = list(

+ Delay = c(\'None\',\'1.5h\'),

+ Response = c(\'Cured\',\'Died\'),

+ Penicillin.Level = c(\'1/8\',\'1/4\',\'1/2\',\'1\',\'4\')))

> Rabbits

, , Penicillin.Level = 1/8

Response

Delay Cured Died

None 0 6

1.5h 0 5

, , Penicillin.Level = 1/4

Response

Delay Cured Died

None 3 3

1.5h 0 6

, , Penicillin.Level = 1/2

Response

Delay Cured Died

None 6 0

1.5h 2 4

, , Penicillin.Level = 1

Response

Delay Cured Died

None 5 1

1.5h 6 0

, , Penicillin.Level = 4

Response

Delay Cured Died

None 2 0

1.5h 5 0

> mantelhaen.test(Rabbits)

Mantel-Haenszel chi-squared test with continuity

correction

data: Rabbits

Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =

0.04747

alternative hypothesis: true common odds ratio is not equal to 1

95 percent confidence interval:

1.026713 47.725133

sample estimates:

common odds ratio

7 #可以判定盘尼西林是否为混杂因素

u CMH检验：有序分类的卡方检验：连续型变量

#采用mantelhaen.test()函数

> Satisfaction <-

+ as.table(array(c(1,2,0,0,3,3,1,2,

+ 11,17,8,4,2,3,5,2,

+ 1,0,0,0,1,3,0,1,

+ 2,5,7,9,1,1,3,6),

+ dim = c(4,4,2),

+ dimnames =

+ list(Income =

+ c(\'<5000\',\'5000-15000\',

+ \'15000-25000\',\'>25000\'),

+ \'Job Satisfaction\' =

+ c(\'V_D\',\'L_S\',\'M_S\',\'V_S\'),

+ Gender = c(\'Female\',\'Male\'))))

> Satisfaction

, , Gender = Female

Job Satisfaction

Income V_D L_S M_S V_S

<5000 1 3 11 2

5000-15000 2 3 17 3

15000-25000 0 1 8 5

>25000 0 2 4 2

, , Gender = Male

Job Satisfaction

Income V_D L_S M_S V_S

<5000 1 1 2 1

5000-15000 0 3 5 1

15000-25000 0 0 7 3

>25000 0 1 9 6

> mantelhaen.test(Satisfaction)

Cochran-Mantel-Haenszel test

data: Satisfaction

Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =

0.3345 #工资水平与满意度无关

u 配对四格表的卡方检验：采用mcnemar.test()函数

> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c(\'A\',\'B\'),control = c(\'A\',\'B\'))))

> paired

control

case A B

A 157 69

B 24 18

> mcnemar.test(paired)

McNemar\'s Chi-squared test with continuity correction

data: paired

McNemar\'s chi-squared = 20.817, df = 1, p-value =

5.053e-06

> chisq.test(paired)

Pearson\'s Chi-squared test with Yates\' continuity

correction

data: paired

X-squared = 1.9244, df = 1, p-value = 0.1654

u 的

回归分析与模型诊断

前提：是否适合做线性回归，是否满足正态分布

只要因变量是连续型变量即可做线性回归，因变量是分类变量需要做Logistic回归；自变量是连续型或者分类变量无影响

n 一元回归分析：lm(因变量~自变量)

u x为连续型变量

> x<- seq(1,5,len =100)

> noise <- rnorm(n=100,mean = 0,sd = 1)

> beta0 = 1

> beta1 <-2

> y <- beta0 + beta1*x + noise

> plot(y~x) #看一下是否适合做线性回归

> model <- lm(y~x)

> summary(model)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-2.49732 -0.64794 0.00215 0.75355 3.06579

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.77938 0.28733 2.712 0.00789 **

x 2.05561 0.08927 23.027 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.041 on 98 degrees of freedom

Multiple R-squared: 0.844, Adjusted R-squared: 0.8424 #模型可解释度

F-statistic: 530.3 on 1 and 98 DF, p-value: < 2.2e-16 #模型可靠性

u x为分类变量

> x <- factor(rep(c(0,1,2),each = 20))

> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))

> model <- lm(y~x)

> summary(model)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-2.50566 -0.82826 0.01137 0.87966 2.41338

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.05338 0.24331 0.219 0.827

x1 0.81708 0.34409 2.375 0.021 *

x2 1.99903 0.34409 5.810 2.94e-07 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.088 on 57 degrees of freedom

Multiple R-squared: 0.3745, Adjusted R-squared: 0.3525

F-statistic: 17.06 on 2 and 57 DF, p-value: 1.558e-06

> exp( 1.99903) #转化成OR值

[1] 7.381892

n 模型诊断

u 非正态性：shapiro.test()判断

> #检验残差是否服从正态分布

> data(LMdata,package = \'rinds\')

> model <- lm(y~x,data = LMdata$NonL)

> #找残差

> res1 <- residuals(model)

> shapiro.test(res1) #残差不符合正态分布，需要重新做

Shapiro-Wilk normality test

data: res1

W = 0.93524, p-value = 1e-04 #残差不符合正态分布，需要重新做

u 非线性

> #2.非线性

> model2 <- lm(y~x,data = LMdata$NonL)

> plot(model2)

#y 与x2的关系是否成线性

model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

summary(model2)

#踢掉x

> model3 <- update(model2,y~.-x)

> summary(model3)

Call:

lm(formula = y ~ 1, data = LMdata$NonL)

Residuals:

Min 1Q Median 3Q Max

-19.456 -12.907 -2.464 10.725 28.697

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 21.829 1.429 15.28 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.29 on 99 degrees of freedom

> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

> summary(model2)

Call:

lm(formula = y ~ x + I(x^2), data = LMdata$NonL)

Residuals:

Min 1Q Median 3Q Max

-2.32348 -0.60702 0.00701 0.56018 2.27346

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.98702 0.62216 1.586 0.116

x 0.11085 0.45405 0.244 0.808

I(x^2) 1.97966 0.07456 26.552 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.907 on 97 degrees of freedom

Multiple R-squared: 0.9961, Adjusted R-squared: 0.996

F-statistic: 1.224e+04 on 2 and 97 DF, p-value: < 2.2e-16

> model3 <- update(model2,y~.-x)

> summary(model3)

Call:

lm(formula = y ~ I(x^2), data = LMdata$NonL)

Residuals:

Min 1Q Median 3Q Max

-2.34289 -0.60692 0.01722 0.58186 2.29601

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.13378 0.15963 7.103 1.97e-10 ***

I(x^2) 1.99760 0.01271 157.195 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9026 on 98 degrees of freedom

Multiple R-squared: 0.996, Adjusted R-squared: 0.996

F-statistic: 2.471e+04 on 1 and 98 DF, p-value: < 2.2e-16

> AIC(model,model2,model3)

df AIC

model 3 478.4558

model2 4 269.2121

model3 3 267.2736 #拟合效果越来越好

plot(model3$residuals~LMdata$NonL) #在0左右分布

u 异方差：考虑的是噪声对模型的影响

model4 <- lm(y~x,data = LMdata$Hetero)

plot(model4$residuals~LMdata$Hetero$x)

#处理方法：使用加权最小二乘法：对于不同的样本点给予不同的权重

> summary(model5)

Call:

lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)

Weighted Residuals:

Min 1Q Median 3Q Max

-2.25132 -0.68409 -0.03924 0.63997 2.61098

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.2611 0.5139 2.454 0.0159 *

x 1.8973 0.2317 8.190 9.98e-13 ***

---

Signif. codes:

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 98 degrees of freedom

Multiple R-squared: 0.4063, Adjusted R-squared: 0.4003

F-statistic: 67.07 on 1 and 98 DF, p-value: 9.977e-13

> summary(model4)

Call:

lm(formula = y ~ x, data = LMdata$Hetero)

Residuals:

Min 1Q Median 3Q Max

-8.3371 -1.6383 -0.1533 1.4075 12.3423

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6175 1.0019 1.615 0.11

x 1.7671 0.3113 5.677 1.4e-07 ***

---

Signif. codes:

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.63 on 98 degrees of freedom

Multiple R-squared: 0.2475, Adjusted R-squared: 0.2398

F-statistic: 32.23 on 1 and 98 DF, p-value: 1.396e-07

#处理方法：使用迭代重复加权最小二乘法采用nlme包中的gls()函数，找到最合适的weights公式，得到最小AIC

> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)

> summary(glsmodel)

Generalized least squares fit by REML

Model: y ~ x

Data: LMdata$Hetero

AIC BIC logLik

517.5286 525.2835 -255.7643

Variance function:

Structure: fixed weights

Formula: ~x

Coefficients:

Value Std.Error t-value p-value

(Intercept) 1.421324 0.7091780 2.004184 0.0478

x 1.832543 0.2603653 7.038351 0.0000

Correlation:

(Intr)

x -0.908

Standardized residuals:

Min Q1 Med Q3

-2.17451000 -0.55322766 -0.03454023 0.51060224

Max

2.93969900

Residual standard error: 1.890127

Degrees of freedom: 100 total; 98 residual

u 自相关：利用lmtest包中的DW检验函数：dwtest()函数

> model6 <- lm(y~x,data = LMdata$AC)

> dwtest(model6)

Durbin-Watson test

data: model6

DW = 0.65556, p-value = 2.683e-12 #存在自相关，不能使用最小二乘法，可使用gls()函数，实现广义最小二乘法，注意gls中的correlation 参数，设置为corAR1()

alternative hypothesis: true autocorrelation is greater than 0

> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)

> summary(glsmodel2)

Generalized least squares fit by REML

Model: y ~ x

Data: LMdata$AC

AIC BIC logLik

268.7617 279.1016 -130.3809

Correlation Structure: AR(1)

Formula: ~1

Parameter estimate(s):

Phi

0.7041665

Coefficients:

Value Std.Error t-value p-value

(Intercept) 1.335059 0.7792019 1.713367 0.0898

x 2.072936 0.2405513 8.617438 0.0000

Correlation:

(Intr)

x -0.926

Standardized residuals:

Min Q1 Med Q3

-1.667086282 -0.571601900 0.001724114 0.568360354

Max

2.306177988

Residual standard error: 1.25329

Degrees of freedom: 100 total; 98 residual

> summary(model6)

Call:

lm(formula = y ~ x, data = LMdata$AC)

Residuals:

Min 1Q Median 3Q Max

-2.10131 -0.72894 -0.03329 0.66138 2.77461

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.2626 0.3303 3.822 0.000232 ***

x 2.1118 0.1026 20.578 < 2e-16 ***

---

Signif. codes:

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.197 on 98 degrees of freedom

Multiple R-squared: 0.8121, Adjusted R-squared: 0.8101

F-statistic: 423.4 on 1 and 98 DF, p-value: < 2.2e-16

u 异常值：离群点，杠杆点、高影响点

model7 <- lm(y~x,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier)

abline(model7) #画出回归直线

#利用car包中的infuencePlot()函数进行判断，圆圈越大，对模型影响越大，做线性回归模型时需踢掉该点，用update函数

model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier) #比较去除前后的差异

abline(model7,col = \'red\') #比较去除前后的差异

abline(model8,col = \'green\') #比较去除前后的差异

u 多重共线性的判断：自变量之间存在线性关系

#x1,x2,x3与y之间p值无显著差异，但是总体上的p值和R值非常显著，说明x1,x2,x3之间可能是存在相关性的，但与y没有相关性，不能进行回归分析，需要用函数进行检验确认：vif()函数计算方差膨胀因子

> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)

> summary(model9)

Call:

lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)

Residuals:

Min 1Q Median 3Q Max

-1.29100 -0.26369 0.00141 0.32529 0.91182

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.3848 0.5813 0.662 0.5095

x1 7.2022 4.8552 1.483 0.1412

x2 -14.0916 12.1385 -1.161 0.2486

x3 8.2312 4.8559 1.695 0.0933 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.499 on 96 degrees of freedom

Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984

F-statistic: 2.057e+04 on 3 and 96 DF, p-value: < 2.2e-16

#vif计算方差膨胀因子，一般而言，大于10则认为存在共线性

> vif(model9)

x1 x2 x3

7560.819 214990.752 222630.742

#运用自动回归判断是x中的哪些变量存在共线性：step()函数

> model10 <- step(model9)

Start: AIC=-135.11

y ~ x1 + x2 + x3

Df Sum of Sq RSS AIC

- x2 1 0.33560 24.241 -135.71

<none> 23.905 -135.11

- x1 1 0.54795 24.453 -134.84

- x3 1 0.71550 24.621 -134.16

Step: AIC=-135.71

y ~ x1 + x3

Df Sum of Sq RSS AIC

<none> 24.2 -135.71

- x1 1 189.2 213.4 79.81

- x3 1 15276.4 15300.7 507.05

> summary(model10)

Call:

lm(formula = y ~ x1 + x3, data = LMdata$Mult)

Residuals:

Min 1Q Median 3Q Max

-1.24046 -0.27519 -0.02751 0.32824 0.91344

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.38158 0.58230 0.655 0.514

x1 1.56614 0.05692 27.514 <2e-16 ***

x3 2.59393 0.01049 247.241 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4999 on 97 degrees of freedom

Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984

F-statistic: 3.074e+04 on 2 and 97 DF, p-value: < 2.2e-16

n Logistic回归

与线性回归的区别：y为分类变量。

u 使用glm()函数：下载HSAUR2包的数据集plasma

> library(HSAUR2)

> data(\'plasma\')

> head(plasma)

fibrinogen globulin ESR

1 2.52 38 ESR < 20

2 2.56 31 ESR < 20

3 2.19 33 ESR < 20

4 2.18 31 ESR < 20

5 3.41 37 ESR < 20

6 2.46 36 ESR < 20

> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,

+ family = binomial()) #family

客服电话

电子邮件

请发表评论

全部评论

上一篇：

下一篇：

关于我们

产品与服务

解决方案

139-2527-9053