摘要: 仅用于记录R语言学习过程:
内容提要:
描述性统计;t检验;数据转换;方差分析;卡方检验;回归分析与模型诊断;生存分析;COX回归
写在正文前的话,关于基础知识,此篇为终结篇,笔记来自医学方的课程,仅用于学习R的过程。
正文:
描述性统计
n 如何去生成table1 用table()函数,快速汇总频数
u 生成四格表:table(行名,列名)
> table(tips$sex,tips$smoker)
No Yes
Female 54 33
Male 97 60
u addmargins()函数:对生成的table表格进行计算
> table(esoph$agegp,esoph$ncases)
0 1 2 3 4 5 6 8 9 17
25-34 14 1 0 0 0 0 0 0 0 0
35-44 10 2 2 1 0 0 0 0 0 0
45-54 3 2 2 2 3 2 2 0 0 0
55-64 0 0 2 4 3 2 2 1 2 0
65-74 1 4 2 2 2 2 1 0 0 1
75+ 1 7 3 0 0 0 0 0 0 0
> tt <- table(esoph$agegp,esoph$ncases)
> addmargins(tt,margin = c(1,2)) # margin 1表示行,2表示列
0 1 2 3 4 5 6 8 9 17 Sum
25-34 14 1 0 0 0 0 0 0 0 0 15
35-44 10 2 2 1 0 0 0 0 0 0 15
45-54 3 2 2 2 3 2 2 0 0 0 16
55-64 0 0 2 4 3 2 2 1 2 0 16
65-74 1 4 2 2 2 2 1 0 0 1 15
75+ 1 7 3 0 0 0 0 0 0 0 11
Sum 29 16 11 9 8 6 5 1 2 1 88
n xtabs()函数:在数据集中取子集,生成表格
> hightip <- tips[,\'tip\'] > mean(tips[,\'tip\'] )
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker==\'No\'))
#as.data.frame转换成数据框;~后面的公式类似table括号中的内容,为分类变量;~左边需添加的是连续型变量;有一个子集subset可进行提取
tips.sex hightip Freq
1 Female FALSE 31
2 Male FALSE 49
3 Female TRUE 23
4 Male TRUE 48
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c(\'Sun\',\'Sat\')))
tips.sex hightip Freq
1 Female FALSE 21
2 Male FALSE 53
3 Female TRUE 25
4 Male TRUE 64
示例2:
> xtabs(ncontrols~agegp + alcgp,data = esoph)
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))
alcgp
agegp 0-39g/day 40-79 80-119 120+ Sum
25-34 61 45 5 5 116
35-44 89 80 20 10 199
45-54 78 81 39 15 213
55-64 89 84 43 26 242
65-74 71 53 29 8 161
75+ 27 12 2 3 44
Sum 415 355 138 67 975
#cbind(数值型,数值型)
> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)
, , = ncases
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 0 0 0 1
35-44 1 4 0 4
45-54 1 20 12 13
55-64 12 22 24 18
65-74 11 25 13 6
75+ 4 4 2 3
, , = ncontrols
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
n ftable()函数:扁平表格,接受xtabs对象,进行调整
> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))
ncases ncontrols
agegp alcgp
25-34 0-39g/day 0 61
40-79 0 45
80-119 0 5
120+ 1 5
35-44 0-39g/day 1 89
40-79 4 80
80-119 0 20
120+ 4 10
45-54 0-39g/day 1 78
40-79 20 81
80-119 12 39
120+ 13 15
55-64 0-39g/day 12 89
40-79 22 84
80-119 24 43
120+ 18 26
65-74 0-39g/day 11 71
40-79 25 53
80-119 13 29
120+ 6 8
75+ 0-39g/day 4 27
40-79 4 12
80-119 2 2
120+ 3 3
n 数据汇总可结合前面学习的函数,如:
u summary(数据集)函数
u describe(数据集)函数(psych包里的)
u describe.data.frame(数据集)函数 (Hmisc包里的)
u apply()家族等
t检验
n 假设检验:服从正态分布,方差齐性(如果不齐,可以用t’检验),
n 示例:
u 生成随机数据,并检验是否符合正态分布:(shapiro.test()函数)
> data1 <- sample(1:100,50)
>
> #检验正态性 shapiro.test()函数
> shapiro.test(data1)
Shapiro-Wilk normality test
data: data1
W = 0.96424, p-value = 0.1338 #不能拒绝原假设,服从正态分布
除此之外,还有以下5个函数可用于检验是否符合正态分布:
library(nortest)
lillie.test(data1)
ad.test(data1)
cvm.test(data1)
pearson.test(data1)
sf.test(data1)
如:
> lillie.test(data1)
Lilliefors (Kolmogorov-Smirnov) normality test
data: data1
D = 0.064492, p-value = 0.8693
> ad.test(data1)
Anderson-Darling normality test
data: data1
A = 0.40918, p-value = 0.333
> cvm.test(data1)
Cramer-von Mises normality test
data: data1
W = 0.054212, p-value = 0.4437
> pearson.test(data1)
Pearson chi-square normality test
data: data1
P = 4, p-value = 0.7798
> sf.test(data1)
Shapiro-Francia normality test
data: data1
W = 0.97496, p-value = 0.3075
u 生成服从正态分布的数据:rnorm()函数
> #生成服从正态分布的数据
> data3 <- rnorm(100,3,5)
> data4 <- rnorm(200,3.4,8)
>
> shapiro.test(data3)
Shapiro-Wilk normality test
data: data3
W = 0.99369, p-value = 0.9258
>
> shapiro.test(data4)
Shapiro-Wilk normality test
data: data4
W = 0.98946, p-value = 0.149
u 检验方差齐性:var.test()函数:仅适用于两组
> var.test(data3,data4)
F test to compare two variances
data: data3 and data4
F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2893355 0.5739700
sample estimates:
ratio of variances
0.4034794 #p值小于0.05,说明方差不齐性
u 进行t检验函数:t.test()函数
u 两组均数t检验
> t.test(data3,data4,var.equal = F) #方差不齐则var.equal设置为F,默认也为FALSE,若其设置为TRUE,如为FALSE,则会使用t’检验
Welch Two Sample t-test
data: data3 and data4
t = -1.3681, df = 281.41, p-value = 0.1724 #说明两组均数无显著统计学差异
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.587004 0.465494
sample estimates:
mean of x mean of y
2.468549 3.529304
u 样本均数与总体均数t检验
> #样本均数与总体均数的比较
> t.test(data3,mu =3.2)
One Sample t-test
data: data3
t = -1.4117, df = 99, p-value = 0.1612 #样本与总体无统计学差异
alternative hypothesis: true mean is not equal to 3.2
95 percent confidence interval:
1.440424 3.496675
sample estimates:
mean of x
2.468549
u 配对数据的t检验
data3 <- rnorm(200,3,5)
> data4 <- rnorm(200,3.4,5)
>
> #配对数据
> #进行配对t检验
> t.test(data3,data4,paired = TRUE)
Paired t-test
data: data3 and data4
t = 0.59079, df = 199, p-value = 0.5553 #p大于0.05,说明无显著差异
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6946177 1.2888581
sample estimates:
mean of the differences
0.2971202
数据变换
n 数据不满足正态分布时该如何处理:有时可采用取log()、sqrt()、1/x等等方式,
n runif()平均分布函数
> mydata <- runif(100,min = 1,max = 2)
> shapiro.test(mydata)
Shapiro-Wilk normality test
data: mydata
W = 0.93149, p-value = 6.05e-05
> shapiro.test(log(mydata))
Shapiro-Wilk normality test
data: log(mydata)
W = 0.93082, p-value = 5.54e-05
> shapiro.test(sqrt(mydata))
Shapiro-Wilk normality test
data: sqrt(mydata)
W = 0.93236, p-value = 6.794e-05
> shapiro.test(1/mydata)
Shapiro-Wilk normality test
data: 1/mydata
W = 0.9195, p-value = 1.325e-05
#几种方式均不可行。
n boxcox转换---对公式的,加载MASS包,运用里面的boxcox()函数,#boxcox()转换,核心为找到lambda的值进行相应的转换
bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,
lambda = seq(-0.25,0.25,length =10))
#Volume为拟操作的变量,Height和Girth为用此两个数据进行估计,但要取log,trees为数据集,lambda后面的公式为取最佳值
u 用公式:(x^lambda-1)/lambda 进行数据转换(lambda 不等于1),新的Volume_bc就服从正态分布了。
> Volume_bc <- (trees$Volume^lambda-1)/lambda
> shapiro.test(Volume_bc)
Shapiro-Wilk normality test
data: Volume_bc
W = 0.96431, p-value = 0.3776
u 可用qqnorm(Volume_bc)和qqline(Volume_bc)c查看图,是否符合正态分布
n boxcox转换---对数值的,加载forecast包,利用包中的BoxCox.lambda()函数
> BoxCox.lambda(trees$Volume)
[1] -0.4954451 #lambda的值 ,采用1/sqrt(x)
分以下几种情况:
ü lambda接近-1时 1/x
ü -0.5 1/sqrt(x)
ü 0 ln(x) 或log(x)
ü 0.5 sqrt()
ü 1 不用变换了
完整示例:
> BoxCox.lambda(trees$Volume)
[1] -0.4954451
> new_volum <- 1/sqrt(trees$Volume)
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.94523, p-value = 0.1152
n 利用car包中powerTransform得到lambda值,再进行正态分布分析
> library(car)
> powerTransform(trees$Volume)
Estimated transformation parameter
trees$Volume
-0.07476608
> new_volum <- trees$Volume^-0.07476608
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.96428, p-value = 0.3768
方差分析
n 用于多组均值的比较。两组均值的t检验与方差分析等价,但是对于>=3组的均数比较,t检验不适用,需用方差分析。
n 假设检验,需满足的条件:
u 正态性
u 方差齐性
u 独立性
n 方差齐性的检验
u 安装multcomp包,加载cholesterol数据集(里面包含response组和trt组)
u 方差齐性的检验:
l R语言的内置包:bartlett.test()
> bartlett.test(response~trt,data = cholesterol)
Bartlett test of homogeneity of variances
data: response by trt
Bartlett\'s K-squared = 0.57975, df = 4, p-value =
0.9653
> #p = 0.9653 方差是齐性的
>
> #正态性检验
> shapiro.test(cholesterol$response)
Shapiro-Wilk normality test
data: cholesterol$response
W = 0.97722, p-value = 0.4417
l R语言的内置包:fligner.test() 与bartlett.test()检验原理不同,但公式一样
> #方差齐性检验
> fligner.test(response~trt,data = cholesterol)
Fligner-Killeen test of homogeneity of variances
data: response by trt
Fligner-Killeen:med chi-squared = 0.74277, df = 4,
p-value = 0.946
l car包中的ncvTest()函数:
> ncvTest(lm(response~trt,data = cholesterol))
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.1338923 Df = 1 p = 0.71443
l car包中的leveneTest()函数:
> leveneTest(response~trt,data = cholesterol)
Levene\'s Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.0755 0.9893
45
n 方差分析
u 单因素方差分析:aov()函数
> aov(response~trt,data = cholesterol)
Call:
aov(formula = response ~ trt, data = cholesterol)
Terms:
trt Residuals
Sum of Squares 1351.3690 468.7504
Deg. of Freedom 4 45
Residual standard error: 3.227488
Estimated effects may be unbalanced
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
trt 4 1351.4 337.8 32.43 9.82e-13 ***
Residuals 45 468.8 10.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#用gplots包中的plotmeans可视化
plotmeans(response~trt,data = cholesterol)
u 单因素方差分析:oneway.test()函数
> oneway.test(response~trt,data = cholesterol)
One-way analysis of means (not assuming equal
variances)
data: response and trt
F = 30.709, num df = 4.00, denom df = 22.46, p-value =
8.227e-09
> oneway.test(response~trt,data = cholesterol,var.equal = T)
One-way analysis of means
data: response and trt
F = 32.433, num df = 4, denom df = 45, p-value =
9.819e-13
u 两两比较:TukeyHSD()函数
> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = response ~ trt, data = cholesterol)
$trt
diff lwr upr p adj
2times-1time 3.44300 -0.6582817 7.544282 0.1380949
4times-1time 6.59281 2.4915283 10.694092 0.0003542
drugD-1time 9.57920 5.4779183 13.680482 0.0000003
drugE-1time 15.16555 11.0642683 19.266832 0.0000000
4times-2times 3.14981 -0.9514717 7.251092 0.2050382
drugD-2times 6.13620 2.0349183 10.237482 0.0009611
drugE-2times 11.72255 7.6212683 15.823832 0.0000000
drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
drugE-4times 8.57274 4.4714583 12.674022 0.0000037
drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
#可视化
plot(TukeyHSD(fit))
u 多因素方差分析:aov()函数
> data(\'ToothGrowth\')
> head(\'ToothGrowth\')
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
> levels(ToothGrowth$supp)
[1] "OJ" "VC"
> levels(ToothGrowth$dose)
NULL
> levels(as.factor(ToothGrowth$dose))
[1] "0.5" "1" "2"
> table(ToothGrowth$supp,ToothGrowth$dose)
0.5 1 2
OJ 10 10 10
VC 10 10 10
#公式的写法 关注交互作用
> fangcha <- aov (len~supp+dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 11.45 0.0013 **
dose 1 2224.3 2224.3 123.99 6.31e-16 ***
Residuals 57 1022.6 17.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#交互作用
> fangcha <- aov (len~supp*dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 12.317 0.000894 ***
dose 1 2224.3 2224.3 133.415 < 2e-16 ***
supp:dose 1 88.9 88.9 5.333 0.024631 * #交互作用不能忽视
Residuals 56 933.6 16.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#可视化 交互作用图
>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = \'b\',
+ pch = c(1,10),col = c(\'red\',\'green\'))
u 单因素协方差分析:aov()函数
> data(\'litter\')
> head(litter)
dose weight gesttime number
1 0 28.05 22.5 15
2 0 33.33 22.5 14
3 0 36.37 22.0 14
4 0 35.52 22.0 13
5 0 36.77 21.5 15
6 0 29.60 23.0 5
> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)
> summary(facn)
Df Sum Sq Mean Sq F value Pr(>F)
gesttime 1 134.3 134.30 8.289 0.00537 ** #协变量确实有影响,该如何去除?
dose 3 137.1 45.71 2.821 0.04556 *
gesttime:dose 3 81.9 27.29 1.684 0.17889
Residuals 66 1069.4 16.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
u 去除协变量的影响:加载effects包,用其中的effect()函数
> effect(\'dose\',facn)
NOTE: dose is not a high-order term in the model
dose effect
dose
0 5 50 500
32.30722 28.50625 30.61078 29.29005
#常见研究设计的表达式
卡方检验
n 对分类变量的检验方法
n 分类:
u 拟合优度检验:用chisq.test()函数
(针对样本数据的分布与已知总体分布是否一致)
> men <- c(11,120,60,45)
> women <- c(20,102,39,30)
> df <- as.data.frame(rbind(men,women))
> colnames(df) <- c(\'AB\',\'O\',\'A\',\'B\')
> chisq.test(men)
Chi-squared test for given probabilities
data: men
X-squared = 105.46, df = 3, p-value < 2.2e-16
> chisq.test(men,p = c(0.1,0.5,0.2,0.2)) #p 中为总体人群中各血型的比例
Chi-squared test for given probabilities
data: men
X-squared = 10.335, df = 3, p-value = 0.01592
u 卡方齐性检验:用于比较不同分类水平下,各个类型的比例是否一致。
> chisq.test(df) #行变量与列变量的相关性检验
Pearson\'s Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 #男女不同血型的分布一致
u 卡方独立性检验:
> chisq.test(df)
Pearson\'s Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 # 行变量和列变量无关联,血型分布和性别无关
u CMH检验:分层检验 三维,行变量为无序,列变量为有序数据;判断是否有混杂因素
#采用mantelhaen.test()函数
> Rabbits <-
+ array(c(0,0,6,5,
+ 3,0,3,6,
+ 6,2,0,4,
+ 5,6,1,0,
+ 2,5,0,0),
+ dim = c(2,2,5),
+ dimnames = list(
+ Delay = c(\'None\',\'1.5h\'),
+ Response = c(\'Cured\',\'Died\'),
+ Penicillin.Level = c(\'1/8\',\'1/4\',\'1/2\',\'1\',\'4\')))
> Rabbits
, , Penicillin.Level = 1/8
Response
Delay Cured Died
None 0 6
1.5h 0 5
, , Penicillin.Level = 1/4
Response
Delay Cured Died
None 3 3
1.5h 0 6
, , Penicillin.Level = 1/2
Response
Delay Cured Died
None 6 0
1.5h 2 4
, , Penicillin.Level = 1
Response
Delay Cured Died
None 5 1
1.5h 6 0
, , Penicillin.Level = 4
Response
Delay Cured Died
None 2 0
1.5h 5 0
> mantelhaen.test(Rabbits)
Mantel-Haenszel chi-squared test with continuity
correction
data: Rabbits
Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =
0.04747
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.026713 47.725133
sample estimates:
common odds ratio
7 #可以判定盘尼西林是否为混杂因素
u CMH检验:有序分类的卡方检验:连续型变量
#采用mantelhaen.test()函数
> Satisfaction <-
+ as.table(array(c(1,2,0,0,3,3,1,2,
+ 11,17,8,4,2,3,5,2,
+ 1,0,0,0,1,3,0,1,
+ 2,5,7,9,1,1,3,6),
+ dim = c(4,4,2),
+ dimnames =
+ list(Income =
+ c(\'<5000\',\'5000-15000\',
+ \'15000-25000\',\'>25000\'),
+ \'Job Satisfaction\' =
+ c(\'V_D\',\'L_S\',\'M_S\',\'V_S\'),
+ Gender = c(\'Female\',\'Male\'))))
> Satisfaction
, , Gender = Female
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 3 11 2
5000-15000 2 3 17 3
15000-25000 0 1 8 5
>25000 0 2 4 2
, , Gender = Male
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 1 2 1
5000-15000 0 3 5 1
15000-25000 0 0 7 3
>25000 0 1 9 6
> mantelhaen.test(Satisfaction)
Cochran-Mantel-Haenszel test
data: Satisfaction
Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =
0.3345 #工资水平与满意度无关
u 配对四格表的卡方检验:采用mcnemar.test()函数
> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c(\'A\',\'B\'),control = c(\'A\',\'B\'))))
> paired
control
case A B
A 157 69
B 24 18
> mcnemar.test(paired)
McNemar\'s Chi-squared test with continuity correction
data: paired
McNemar\'s chi-squared = 20.817, df = 1, p-value =
5.053e-06
> chisq.test(paired)
Pearson\'s Chi-squared test with Yates\' continuity
correction
data: paired
X-squared = 1.9244, df = 1, p-value = 0.1654
u 的
回归分析与模型诊断
前提:是否适合做线性回归,是否满足正态分布
只要因变量是连续型变量即可做线性回归,因变量是分类变量需要做Logistic回归;自变量是连续型或者分类变量无影响
n 一元回归分析:lm(因变量~自变量)
u x为连续型变量
> x<- seq(1,5,len =100)
> noise <- rnorm(n=100,mean = 0,sd = 1)
> beta0 = 1
> beta1 <-2
> y <- beta0 + beta1*x + noise
> plot(y~x) #看一下是否适合做线性回归
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.49732 -0.64794 0.00215 0.75355 3.06579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77938 0.28733 2.712 0.00789 **
x 2.05561 0.08927 23.027 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.041 on 98 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.8424 #模型可解释度
F-statistic: 530.3 on 1 and 98 DF, p-value: < 2.2e-16 #模型可靠性
u x为分类变量
> x <- factor(rep(c(0,1,2),each = 20))
> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.50566 -0.82826 0.01137 0.87966 2.41338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05338 0.24331 0.219 0.827
x1 0.81708 0.34409 2.375 0.021 *
x2 1.99903 0.34409 5.810 2.94e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.088 on 57 degrees of freedom
Multiple R-squared: 0.3745, Adjusted R-squared: 0.3525
F-statistic: 17.06 on 2 and 57 DF, p-value: 1.558e-06
> exp( 1.99903) #转化成OR值
[1] 7.381892
n 模型诊断
u 非正态性:shapiro.test()判断
> #检验残差是否服从正态分布
> data(LMdata,package = \'rinds\')
> model <- lm(y~x,data = LMdata$NonL)
> #找残差
>
> res1 <- residuals(model)
> shapiro.test(res1) #残差不符合正态分布,需要重新做
Shapiro-Wilk normality test
data: res1
W = 0.93524, p-value = 1e-04 #残差不符合正态分布,需要重新做
u 非线性
> #2.非线性
> model2 <- lm(y~x,data = LMdata$NonL)
> plot(model2)
#y 与x2的关系是否成线性
model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
summary(model2)
#踢掉x
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ 1, data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-19.456 -12.907 -2.464 10.725 28.697
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.829 1.429 15.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.29 on 99 degrees of freedom
> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
> summary(model2)
Call:
lm(formula = y ~ x + I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.32348 -0.60702 0.00701 0.56018 2.27346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.98702 0.62216 1.586 0.116
x 0.11085 0.45405 0.244 0.808
I(x^2) 1.97966 0.07456 26.552 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.907 on 97 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
F-statistic: 1.224e+04 on 2 and 97 DF, p-value: < 2.2e-16
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.34289 -0.60692 0.01722 0.58186 2.29601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.13378 0.15963 7.103 1.97e-10 ***
I(x^2) 1.99760 0.01271 157.195 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9026 on 98 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.996
F-statistic: 2.471e+04 on 1 and 98 DF, p-value: < 2.2e-16
> AIC(model,model2,model3)
df AIC
model 3 478.4558
model2 4 269.2121
model3 3 267.2736 #拟合效果越来越好
plot(model3$residuals~LMdata$NonL) #在0左右分布
u 异方差:考虑的是噪声对模型的影响
model4 <- lm(y~x,data = LMdata$Hetero)
plot(model4$residuals~LMdata$Hetero$x)
#处理方法:使用加权最小二乘法:对于不同的样本点给予不同的权重
> summary(model5)
Call:
lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.25132 -0.68409 -0.03924 0.63997 2.61098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2611 0.5139 2.454 0.0159 *
x 1.8973 0.2317 8.190 9.98e-13 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025 on 98 degrees of freedom
Multiple R-squared: 0.4063, Adjusted R-squared: 0.4003
F-statistic: 67.07 on 1 and 98 DF, p-value: 9.977e-13
> summary(model4)
Call:
lm(formula = y ~ x, data = LMdata$Hetero)
Residuals:
Min 1Q Median 3Q Max
-8.3371 -1.6383 -0.1533 1.4075 12.3423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6175 1.0019 1.615 0.11
x 1.7671 0.3113 5.677 1.4e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.63 on 98 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2398
F-statistic: 32.23 on 1 and 98 DF, p-value: 1.396e-07
#处理方法:使用迭代重复加权最小二乘法 采用nlme包中的gls()函数,找到最合适的weights公式,得到最小AIC
> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)
> summary(glsmodel)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$Hetero
AIC BIC logLik
517.5286 525.2835 -255.7643
Variance function:
Structure: fixed weights
Formula: ~x
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.421324 0.7091780 2.004184 0.0478
x 1.832543 0.2603653 7.038351 0.0000
Correlation:
(Intr)
x -0.908
Standardized residuals:
Min Q1 Med Q3
-2.17451000 -0.55322766 -0.03454023 0.51060224
Max
2.93969900
Residual standard error: 1.890127
Degrees of freedom: 100 total; 98 residual
u 自相关:利用lmtest包中的DW检验函数:dwtest()函数
> model6 <- lm(y~x,data = LMdata$AC)
> dwtest(model6)
Durbin-Watson test
data: model6
DW = 0.65556, p-value = 2.683e-12 #存在自相关,不能使用最小二乘法,可使用gls()函数,实现广义最小二乘法,注意gls中的correlation 参数,设置为corAR1()
alternative hypothesis: true autocorrelation is greater than 0
> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)
> summary(glsmodel2)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$AC
AIC BIC logLik
268.7617 279.1016 -130.3809
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.7041665
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.335059 0.7792019 1.713367 0.0898
x 2.072936 0.2405513 8.617438 0.0000
Correlation:
(Intr)
x -0.926
Standardized residuals:
Min Q1 Med Q3
-1.667086282 -0.571601900 0.001724114 0.568360354
Max
2.306177988
Residual standard error: 1.25329
Degrees of freedom: 100 total; 98 residual
> summary(model6)
Call:
lm(formula = y ~ x, data = LMdata$AC)
Residuals:
Min 1Q Median 3Q Max
-2.10131 -0.72894 -0.03329 0.66138 2.77461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2626 0.3303 3.822 0.000232 ***
x 2.1118 0.1026 20.578 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.197 on 98 degrees of freedom
Multiple R-squared: 0.8121, Adjusted R-squared: 0.8101
F-statistic: 423.4 on 1 and 98 DF, p-value: < 2.2e-16
u 异常值:离群点,杠杆点、高影响点
model7 <- lm(y~x,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier)
abline(model7) #画出回归直线
#利用car包中的infuencePlot()函数进行判断,圆圈越大,对模型影响越大,做线性回归模型时需踢掉该点,用update函数
model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier) #比较去除前后的差异
abline(model7,col = \'red\') #比较去除前后的差异
abline(model8,col = \'green\') #比较去除前后的差异
u 多重共线性的判断:自变量之间存在线性关系
#x1,x2,x3与y之间p值无显著差异,但是总体上的p值和R值非常显著,说明x1,x2,x3之间可能是存在相关性的,但与y没有相关性,不能进行回归分析,需要用函数进行检验确认:vif()函数计算方差膨胀因子
> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)
> summary(model9)
Call:
lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.29100 -0.26369 0.00141 0.32529 0.91182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3848 0.5813 0.662 0.5095
x1 7.2022 4.8552 1.483 0.1412
x2 -14.0916 12.1385 -1.161 0.2486
x3 8.2312 4.8559 1.695 0.0933 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.499 on 96 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 2.057e+04 on 3 and 96 DF, p-value: < 2.2e-16
#vif计算方差膨胀因子,一般而言,大于10则认为存在共线性
> vif(model9)
x1 x2 x3
7560.819 214990.752 222630.742
#运用自动回归判断是x中的哪些变量存在共线性:step()函数
> model10 <- step(model9)
Start: AIC=-135.11
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC
- x2 1 0.33560 24.241 -135.71
<none> 23.905 -135.11
- x1 1 0.54795 24.453 -134.84
- x3 1 0.71550 24.621 -134.16
Step: AIC=-135.71
y ~ x1 + x3
Df Sum of Sq RSS AIC
<none> 24.2 -135.71
- x1 1 189.2 213.4 79.81
- x3 1 15276.4 15300.7 507.05
> summary(model10)
Call:
lm(formula = y ~ x1 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.24046 -0.27519 -0.02751 0.32824 0.91344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38158 0.58230 0.655 0.514
x1 1.56614 0.05692 27.514 <2e-16 ***
x3 2.59393 0.01049 247.241 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4999 on 97 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.074e+04 on 2 and 97 DF, p-value: < 2.2e-16
n Logistic回归
与线性回归的区别:y为分类变量。
u 使用glm()函数:下载HSAUR2包的数据集plasma
> library(HSAUR2)
> data(\'plasma\')
> head(plasma)
fibrinogen globulin ESR
1 2.52 38 ESR < 20
2 2.56 31 ESR < 20
3 2.19 33 ESR < 20
4 2.18 31 ESR < 20
5 3.41 37 ESR < 20
6 2.46 36 ESR < 20
> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,
+ family = binomial()) #family
请发表评论