93、R语言教程详解

OStack程序员社区-中国程序员成长平台 › 门户 › 编程› R语言›R语言教程
原作者: [db:作者] 来自: [db:来源] 收藏邀请
加载数据
> w<-read.table("test.prn",header = T)
> w
  X.. X...1
1   A     2
2   B     3
3   C     5
4   D     5
> library(readxl)
> dat<-read_excel("test.xlsx")
> dat
# A tibble: 4 x 2
  `商品` `价格`
   <chr>  <dbl>
1      A      2
2      B      3
3      C      5
4      D      5
> bank=read.table("bank-full.csv",header = TRUE,sep=",")
查看数据结构
> str(bank)
'data.frame':    41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
查看数据的最小值，最大值，中位数，平均数，分位数
> summary(bank)
      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
 Mean   :40.02   services   : 3969   unknown :   80  
 3rd Qu.:47.00   management : 2924                   
 Max.   :98.00   retired    : 1720                   
                 (Other)    : 6156                   
               education        default         housing     
 university.degree  :12168   no     :32588   no     :18622  
 high.school        : 9515   unknown: 8597   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576  
 professional.course: 5243                                  
 basic.4y           : 4176                                  
 basic.6y           : 2292                                  
 (Other)            : 1749                                  
      loan            contact          month       day_of_week
 no     :33950   cellular :26144   may    :13769   fri:7827   
 unknown:  990   telephone:15044   jul    : 7174   mon:8514   
 yes    : 6248                     aug    : 6178   thu:8623   
                                   jun    : 5318   tue:8090   
                                   nov    : 4101   wed:8134   
                                   apr    : 2632              
                                   (Other): 2016              
    duration         campaign          pdays      
 Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
 1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0  
 Median : 180.0   Median : 2.000   Median :999.0  
 Mean   : 258.3   Mean   : 2.568   Mean   :962.5  
 3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
 Max.   :4918.0   Max.   :56.000   Max.   :999.0  
                                                  
    previous            poutcome      emp.var.rate     
 Min.   :0.000   failure    : 4252   Min.   :-3.40000  
 1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000  
 Median :0.000   success    : 1373   Median : 1.10000  
 Mean   :0.173                       Mean   : 0.08189  
 3rd Qu.:0.000                       3rd Qu.: 1.40000  
 Max.   :7.000                       Max.   : 1.40000  
                                                       
 cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median :93.75   Median :-41.8   Median :4.857  
 Mean   :93.58   Mean   :-40.5   Mean   :3.621  
 3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
 Max.   :94.77   Max.   :-26.9   Max.   :5.045  
                                                
  nr.employed     y        
 Min.   :4964   no :36548  
 1st Qu.:5099   yes: 4640  
 Median :5191              
 Mean   :5167              
 3rd Qu.:5228              
 Max.   :5228              
                           
> psych::describe(bank)
        方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度
                                  绝对偏差
                             最小值
                            之后
                            的平均数

               vars     n    mean     sd  median trimmed   mad     min     max   range  skew    kurtosis
age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00   81.00  0.78     0.79
job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00   11.00  0.45    -1.39
marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00    3.00 -0.06    -0.34
education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00    7.00 -0.24    -1.21
default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00    2.00  1.44     0.07
housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00    2.00 -0.14    -1.95
loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00    2.00  1.82     1.38
contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00    1.00  0.56    -1.69
month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00    9.00 -0.31    -1.03
day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00    4.00  0.01    -1.27
duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00 4918.00  3.26    20.24
campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00   55.00  4.76    36.97
pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00  999.00 -4.92    22.23
previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00    7.00  3.83    20.11
poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00    2.00 -0.88     3.98
emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40    4.80 -0.72    -1.06
cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77    2.57 -0.23    -0.83
cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90   23.90  0.30    -0.36
euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04    4.41 -0.71    -1.41
nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00
y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00    1.00  2.45     4.00

               se
age            0.05
job*           0.02
marital*       0.00
education*     0.01
default*       0.00
housing*       0.00
loan*          0.00
contact*       0.00
month*         0.01
day_of_week*   0.01
duration       1.28
campaign       0.01
pdays          0.92
previous       0.00
poutcome*      0.00
emp.var.rate   0.01
cons.price.idx 0.00
cons.conf.idx  0.02
euribor3m      0.01
nr.employed    0.36
y*             0.00

查看数据是否有缺失值
> sapply(bank,anyNA)
           age            job        marital      education 
         FALSE          FALSE          FALSE          FALSE 
       default        housing           loan        contact 
         FALSE          FALSE          FALSE          FALSE 
         month    day_of_week       duration       campaign 
         FALSE          FALSE          FALSE          FALSE 
         pdays       previous       poutcome   emp.var.rate 
         FALSE          FALSE          FALSE          FALSE 
cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
         FALSE          FALSE          FALSE          FALSE 
             y 
         FALSE 

成功与不成功的个数
> table(bank$y)

   no   yes 
36548  4640 

在是否结婚这个属性的取值与
是否成功的数量比较
> table(bank$y,bank$marital)
     
      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12

> xtabs(~y+marital,data=bank)
     marital
y     divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> tab=table(bank$y,bank$marital)
> tab
     
      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12

在是否结婚这个属性上的取值
> margin.table(tab,2)

divorced  married   single  unknown 
    4612    24928    11568       80 
> margin.table(tab,1)

   no   yes 
36548  4640 

在是否结婚这个属性上横向看概率
> prop.table(tab,1)
     
         divorced     married      single     unknown
  no  0.113166247 0.612783189 0.272189997 0.001860567
  yes 0.102586207 0.545689655 0.349137931 0.002586207
在是否结婚这个属性上纵向看概率

> prop.table(tab,2)
     
       divorced   married    single   unknown
  no  0.8967910 0.8984275 0.8599585 0.8500000
  yes 0.1032090 0.1015725 0.1400415 0.1500000


平的列联表
以第一列和第二列，展开分类group by 1,2
以col.vars 的取值 进行次数统计
> ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")
                             y   no  yes
marital  education                      
divorced basic.4y               406   83
         basic.6y               169   13
         basic.9y               534   31
         high.school           1086  107
         illiterate               1    1
         professional.course    596   61
         university.degree     1177  160
         unknown                167   20
married  basic.4y              2915  313
         basic.6y              1628  139
         basic.9y              3858  298
         high.school           4683  475
         illiterate              12    3
         professional.course   2799  357
         university.degree     5573  821
         unknown                928  126
single   basic.4y               422   31
         basic.6y               301   36
         basic.9y              1174  142
         high.school           2702  448
         illiterate               1    0
         professional.course   1247  177
         university.degree     3723  683
         unknown                378  103
unknown  basic.4y                 5    1
         basic.6y                 6    0
         basic.9y                 6    2
         high.school             13    1
         illiterate               0    0
         professional.course      6    0
         university.degree       25    6
         unknown                  7    2

卡方检验，在p值小于2.2e-16时，拒绝原假设，认为数据不服从卡方分布
> chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 122.66, df = 3, p-value < 2.2e-16

画直方图
> hist(bank$age)
> library(lattice)

画连续变量的分布，就是把直方图的中位数连接起来
以年龄为横轴，y为纵轴，数据是bank,画图，auto.key是否有图例
> densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE)

画Box图
> boxplot(age~y,data=bank)

双样本t分布检验，p值小于0.05时拒绝原假设
这里的原假设是两个样本没有相关性
得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设
这里认为两个样本有相关性
> t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)

    Welch Two Sample t-test

data:  age by y
t = -4.7795, df = 5258.5, p-value = 1.805e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.4129336 -0.5909889
sample estimates:
 mean in group no mean in group yes 
         39.91119          40.91315 


数据可视化
画饼图
> tab=table(bank$marital)
> pie(tab)

画直方图
> tab=table(bank$marital)
> barplot(tab)

画下面这个图
> tab=table(bank$marital,bank$y)
> plot(tab)
 


画层叠直方图
> tab=table(bank$marital,bank$y)
> lattice::barchart(tab,auto.key=TRUE)
 


加载这个包，准备画图
> library(dplyr)
> data=group_by(bank,marital,y)
> data=tally(data)
！！！！！！！！！！！！！
> ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")



数据预处理
分组之后再画图
> labels=c('青年','中年','老年')
> bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)
> library(ggplot2)
> ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count")
 






衍生变量
直接使用$符向原数据框添加新的变量
> bank$log.cons.price.idx=log(bank$cons.price.idx)
使用transform函数向原数据框添加变量
> bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))
使用dplyr包里的mutate函数增加变量
> bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))
使用dplyr包里的transmute函数只保留新生成的变量
> bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

中心化

> v=1:10
> v1=v-mean(v)
> v2=scale(v,center=TRUE,scale = FALSE)

无量纲化

> V1=v/sqrt(sum(v^2)/(length(v)-1))
> v2=scale(v,center=FALSE,scale=TRUE)

根据最大最小值进行归一化

> v3=(v-min(v))/(max(v)-min(v))


进行标准正态化


> v1=(v-mean(v))/sd(v)
> v2=scale(v,center = TRUE,scale=TRUE)




Box-Cox变换
使用car包里的boxCox函数
> install.packages("car")
> library(car)
> boxCox(age~.,data=bank)
 
 






使用caret包，做Box-Cox变换
> install.packages("caret")
> library(caret)
> dat<-subset(bank,select="age")
> trans<-preProcess(dat,method=C("BoxCox"))


数据预处理下
违反常识的异常值
基于数据分布的异常值(离群点)识别
bank.dirty=read.csv("bank-dirty.csv")
summary(bank.dirty)

     age                  job            marital                    education    
 Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165  
 1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515  
 Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043  
 Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242  
 3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175  
 Max.   :123.00   (Other)    : 7546                    (Other)            : 2310  
 NA's   :2        NA's       :  330                    NA's               : 1738  
 default      housing        loan            contact          month      
 no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769  
 yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174  
 NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178  
                                                          jun    : 5318  
                                                          nov    : 4101  
                                                          apr    : 2632  
                                                          (Other): 2016  
 day_of_week    duration         campaign          pdays          previous    
 fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
 mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
 thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
 tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
 wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
             Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                              
        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
 failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
 nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
 success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                     Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                       
   euribor3m      nr.employed     y        
 Min.   :0.634   Min.   :4964   no :36548  
 1st Qu.:1.344   1st Qu.:5099   yes: 4640  
 Median :4.857   Median :5191              
 Mean   :3.621   Mean   :5167              
 3rd Qu.:4.961   3rd Qu.:5228              
 Max.   :5.045   Max.   :5228              


常识告诉我们，虽然123岁的老人存在，但概率也极低，也不太可能是银行的客户
找出在年龄这一列的上离群值和下离群值

> head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)
      age
39494 123
38453  98
38456  98
27827  95
38922  94
> tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)
      age
37559  17
37580  17
38275  17
120    NA
156    NA

异常值的处理
当作缺失值处理
> bank.dirty$age[which(bank.dirty$age>98)]<-NA
删除或者插补


重编码
职业类型有12个分类，不利于后续分析，把除了unknown以外的分类进行重新编码，简化成4类
Month有12个分类，把它转化成季度
Education的分类，除了unknow之外有7类

进行重编码
levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",
                        "management","unemployed",  "entrepreneur","services",
                        "unemployed","services","unemployed","unknown" )
> levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",
                         "Q1","Q2","Q4","Q4","Q3")
> 
> levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",
                              "primary","tertiary","tertiary","unknown")


缺失值
分类较多，分类是unknown,不能给我们提供信息
有些模型不能处理缺失值，比如Logistic回归
缺失值插补的方法
1、    用中位数或众数插补
> library(imputeMissings)
> bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))
2、    最邻近(knn)插补
library(DMwR)
bank.clean=knnImputation(bank.dirty,k=5)

3、    随机森林插补
library(missForest)
 Imp = missForest(bank.dirty)
 bank.clean = Imp$ximp

缺失值插补的R包
1、    imputeMissings包
2、    DMwR包






用Logistic回归建立客户响应模型
1、    广义线性模型
广义线性模型擅长于处理因变量不是连续变量的问题
1)    Y是分类变量
2)    Y是定序变量
3)    Y是离散取值
2、    当Y取值是0-1二分类变量是，就是Logistic回归

Logistic回归在R中的实现
数据重编码
bank$y=ifelse(bank$y=='yes',1,0)
改成以Q1为参考因子
bank$month<-relevel(bank$month,ref="Q1")
构建Logistic回归模型
> model<-glm(y~.,data=bank,family = 'binomial')
> summary(model)

Call:
glm(formula = y ~ ., family = "binomial", data = bank)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9958  -0.3082  -0.1887  -0.1333   3.4283  

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***
age                           1.851e-03  2.415e-03   0.767 0.443289    
jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***
jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    
jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    
jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    
jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** 
jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    
jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  
jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  
jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    
jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    
jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    
maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    
maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    
maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    
educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    
educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    
educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    
educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    
educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    
educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  
educationunknown              1.361e-01  1.196e-01   1.138 0.255314    
defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***
defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    
housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    
housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    
loanunknown                          NA         NA      NA       NA    
loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    
contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***
monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***
monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***
monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***
day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  
day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    
day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    
day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  
duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***
campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***
pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***
previous                     -5.806e-02  5.879e-02  -0.988 0.323369    
poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***
poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***
emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***
cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***
cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***
euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    
nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 17199  on 41141  degrees of freedom
AIC: 17293

Number of Fisher Scoring iterations: 10


> exp(coef(model))
                 (Intercept)                          age               jobblue-collar 
                9.856544e-86                 1.001853e+00                 7.665077e-01 
             jobentrepreneur                 jobhousemaid                jobmanagement 
                8.163314e-01                 9.643733e-01                 9.226187e-01 
                  jobretired             jobself-employed                  jobservices 
                1.340142e+00                 8.453874e-01                 8.609387e-01 
                  jobstudent                jobtechnician                jobunemployed 
                1.306514e+00                 1.003468e+00                 1.008550e+00 
                  jobunknown               maritalmarried                maritalsingle 
                9.226922e-01                 1.015789e+00                 1.068445e+00 
              maritalunknown            educationbasic.6y            educationbasic.9y 
                1.065061e+00                 1.101276e+00                 9.786948e-01 
        educationhigh.school          educationilliterate educationprofessional.course 
                1.034388e+00                 3.101297e+00                 1.120248e+00 
  educationuniversity.degree             educationunknown               defaultunknown 
                1.237856e+00                 1.145744e+00                 7.367445e-01 
                  defaultyes               housingunknown                   housingyes 
                7.851906e-04                 9.288126e-01                 9.962671e-01 
                 loanunknown                      loanyes             contacttelephone 
                          NA                 9.383587e-01                 5.450980e-01 
                     monthQ2                      monthQ3                      monthQ4 
                1.116739e-01                 2.314802e-01                 1.360620e-01 
              day_of_weekmon               day_of_weekthu               day_of_weektue 
                8.854888e-01                 1.065828e+00                 1.071082e+00 
              day_of_weekwed                     duration                     campaign 
                1.154380e+00                 1.004678e+00                 9.555850e-01 
                       pdays                     previous          poutcomenonexistent 
                9.990378e-01                 9.435960e-01                 1.569466e+00 
             poutcomesuccess                 emp.var.rate               cons.price.idx 
                2.552531e+00                 2.493091e-01                 6.140533e+00 
               cons.conf.idx                    euribor3m                  nr.employed 
                1.034103e+00                 1.062408e+00                 1.004949e+00 


Job变量的基准水平是management,从上面的结果看，服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍，未就业人员购买银行产品的几率是管理岗人员的1.25倍


> summary(model.step)
向前逐步回归
> model.step=step(model,direction = "backward")
向后逐步回归
> model.step = step(model, direction = "forward")
双向逐步回归
> model.step = step(model, direction = "both")
> summary(model.step)

Call:
glm(formula = y ~ job + education + default + contact + month + 
    day_of_week + duration + campaign + pdays + poutcome + emp.var.rate + 
    cons.price.idx + cons.conf.idx + nr.employed, family = "binomial", 
    data = bank)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9884  -0.3088  -0.1887  -0.1332   3.4026  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***
jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***
jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003    
jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590    
jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670    
jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***
jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435    
jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 .  
jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 *  
jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665    
jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715    
jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057    
educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024    
educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992    
educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573    
educationilliterate           1.118e+00  7.398e-01   1.511 0.130744    
educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686    
educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 *  
educationunknown              1.363e-01  1.195e-01   1.140 0.254110    
defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***
defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831    
contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***
monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***
monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***
monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***
day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 .  
day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066    
day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    
day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 *  
duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***
campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***
pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***
poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***
poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***
emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***
cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***
cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***
nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 17203  on 41150  degrees of freedom
AIC: 17279

Number of Fisher Scoring iterations: 10





模型预测
用predict函数，参数type=’response’
Newdata参数是要预测的数据集

> prob<-predict(model.step,type = 'response')
> head(prob)
          1           2           3           4           5           6 
0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 

假设以0.5为临界值
> pre<-ifelse(prob>0.5,1,0)
> table(pre,bank$y)
   
pre     0     1
  0 35596  2667
  1   952  1973

> 

预测的准确率
> (35592+1964)/(35592+2676+956+1964)
[1] 0.911819



实际有响应的客户被识别出了多少
> 1964/(1964+2676)
[1] 0.4232759




模型评估

> confusionMatrix(bank$y,pre,pos='1')
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 35596   952
         1  2667  1973
                                          
               Accuracy : 0.9121          
                 95% CI : (0.9094, 0.9149)
    No Information Rate : 0.929           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.476           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.67453 
                       
                    
                     
                      



鲜花




握手




雷人




路过




鸡蛋