在线时间:8:00-16:00
迪恩网络APP
随时随地掌握行业动态
扫描二维码
关注迪恩网络微信公众号
加载数据 > w<-read.table("test.prn",header = T) > w X.. X...1 1 A 2 2 B 3 3 C 5 4 D 5 > library(readxl) > dat<-read_excel("test.xlsx") > dat # A tibble: 4 x 2 `商品` `价格` <chr> <dbl> 1 A 2 2 B 3 3 C 5 4 D 5 > bank=read.table("bank-full.csv",header = TRUE,sep=",") 查看数据结构 > str(bank) 'data.frame': 41188 obs. of 21 variables: $ age : int 56 57 37 40 56 45 59 41 24 25 ... $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ... $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ... $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ... $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ... $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ... $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ... $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ... $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ... $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ... $ duration : int 261 149 226 151 307 198 139 217 380 50 ... $ campaign : int 1 1 1 1 1 1 1 1 1 1 ... $ pdays : int 999 999 999 999 999 999 999 999 999 999 ... $ previous : int 0 0 0 0 0 0 0 0 0 0 ... $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ... $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ... $ cons.price.idx: num 94 94 94 94 94 ... $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ... $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ... $ nr.employed : num 5191 5191 5191 5191 5191 ... $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ... 查看数据的最小值,最大值,中位数,平均数,分位数 > summary(bank) age job marital Min. :17.00 admin. :10422 divorced: 4612 1st Qu.:32.00 blue-collar: 9254 married :24928 Median :38.00 technician : 6743 single :11568 Mean :40.02 services : 3969 unknown : 80 3rd Qu.:47.00 management : 2924 Max. :98.00 retired : 1720 (Other) : 6156 education default housing university.degree :12168 no :32588 no :18622 high.school : 9515 unknown: 8597 unknown: 990 basic.9y : 6045 yes : 3 yes :21576 professional.course: 5243 basic.4y : 4176 basic.6y : 2292 (Other) : 1749 loan contact month day_of_week no :33950 cellular :26144 may :13769 fri:7827 unknown: 990 telephone:15044 jul : 7174 mon:8514 yes : 6248 aug : 6178 thu:8623 jun : 5318 tue:8090 nov : 4101 wed:8134 apr : 2632 (Other): 2016 duration campaign pdays Min. : 0.0 Min. : 1.000 Min. : 0.0 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 Median : 180.0 Median : 2.000 Median :999.0 Mean : 258.3 Mean : 2.568 Mean :962.5 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 Max. :4918.0 Max. :56.000 Max. :999.0 previous poutcome emp.var.rate Min. :0.000 failure : 4252 Min. :-3.40000 1st Qu.:0.000 nonexistent:35563 1st Qu.:-1.80000 Median :0.000 success : 1373 Median : 1.10000 Mean :0.173 Mean : 0.08189 3rd Qu.:0.000 3rd Qu.: 1.40000 Max. :7.000 Max. : 1.40000 cons.price.idx cons.conf.idx euribor3m Min. :92.20 Min. :-50.8 Min. :0.634 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 Median :93.75 Median :-41.8 Median :4.857 Mean :93.58 Mean :-40.5 Mean :3.621 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 Max. :94.77 Max. :-26.9 Max. :5.045 nr.employed y Min. :4964 no :36548 1st Qu.:5099 yes: 4640 Median :5191 Mean :5167 3rd Qu.:5228 Max. :5228 > psych::describe(bank) 方差 个数 平均值 标准差 均值 去掉最大 中位数 最小值 最大值 极差 偏差 峰度 绝对偏差 最小值 之后 的平均数 vars n mean sd median trimmed mad min max range skew kurtosis age 1 41188 40.02 10.42 38.00 39.30 10.38 17.00 98.00 81.00 0.78 0.79 job* 2 41188 4.72 3.59 3.00 4.48 2.97 1.00 12.00 11.00 0.45 -1.39 marital* 3 41188 2.17 0.61 2.00 2.21 0.00 1.00 4.00 3.00 -0.06 -0.34 education* 4 41188 4.75 2.14 4.00 4.88 2.97 1.00 8.00 7.00 -0.24 -1.21 default* 5 41188 1.21 0.41 1.00 1.14 0.00 1.00 3.00 2.00 1.44 0.07 housing* 6 41188 2.07 0.99 3.00 2.09 0.00 1.00 3.00 2.00 -0.14 -1.95 loan* 7 41188 1.33 0.72 1.00 1.16 0.00 1.00 3.00 2.00 1.82 1.38 contact* 8 41188 1.37 0.48 1.00 1.33 0.00 1.00 2.00 1.00 0.56 -1.69 month* 9 41188 5.23 2.32 5.00 5.31 2.97 1.00 10.00 9.00 -0.31 -1.03 day_of_week* 10 41188 3.00 1.40 3.00 3.01 1.48 1.00 5.00 4.00 0.01 -1.27 duration 11 41188 258.29 259.28 180.00 210.61 139.36 0.00 4918.00 4918.00 3.26 20.24 campaign 12 41188 2.57 2.77 2.00 1.99 1.48 1.00 56.00 55.00 4.76 36.97 pdays 13 41188 962.48 186.91 999.00 999.00 0.00 0.00 999.00 999.00 -4.92 22.23 previous 14 41188 0.17 0.49 0.00 0.05 0.00 0.00 7.00 7.00 3.83 20.11 poutcome* 15 41188 1.93 0.36 2.00 2.00 0.00 1.00 3.00 2.00 -0.88 3.98 emp.var.rate 16 41188 0.08 1.57 1.10 0.27 0.44 -3.40 1.40 4.80 -0.72 -1.06 cons.price.idx 17 41188 93.58 0.58 93.75 93.58 0.56 92.20 94.77 2.57 -0.23 -0.83 cons.conf.idx 18 41188 -40.50 4.63 -41.80 -40.60 6.52 -50.80 -26.90 23.90 0.30 -0.36 euribor3m 19 41188 3.62 1.73 4.86 3.81 0.16 0.63 5.04 4.41 -0.71 -1.41 nr.employed 20 41188 5167.04 72.25 5191.00 5178.43 55.00 4963.60 5228.10 264.50 -1.04 0.00 y* 21 41188 1.11 0.32 1.00 1.02 0.00 1.00 2.00 1.00 2.45 4.00 se age 0.05 job* 0.02 marital* 0.00 education* 0.01 default* 0.00 housing* 0.00 loan* 0.00 contact* 0.00 month* 0.01 day_of_week* 0.01 duration 1.28 campaign 0.01 pdays 0.92 previous 0.00 poutcome* 0.00 emp.var.rate 0.01 cons.price.idx 0.00 cons.conf.idx 0.02 euribor3m 0.01 nr.employed 0.36 y* 0.00 查看数据是否有缺失值 > sapply(bank,anyNA) age job marital education FALSE FALSE FALSE FALSE default housing loan contact FALSE FALSE FALSE FALSE month day_of_week duration campaign FALSE FALSE FALSE FALSE pdays previous poutcome emp.var.rate FALSE FALSE FALSE FALSE cons.price.idx cons.conf.idx euribor3m nr.employed FALSE FALSE FALSE FALSE y FALSE 成功与不成功的个数 > table(bank$y) no yes 36548 4640 在是否结婚这个属性的取值与 是否成功的数量比较 > table(bank$y,bank$marital) divorced married single unknown no 4136 22396 9948 68 yes 476 2532 1620 12 > xtabs(~y+marital,data=bank) marital y divorced married single unknown no 4136 22396 9948 68 yes 476 2532 1620 12 > tab=table(bank$y,bank$marital) > tab divorced married single unknown no 4136 22396 9948 68 yes 476 2532 1620 12 在是否结婚这个属性上的取值 > margin.table(tab,2) divorced married single unknown 4612 24928 11568 80 > margin.table(tab,1) no yes 36548 4640 在是否结婚这个属性上横向看概率 > prop.table(tab,1) divorced married single unknown no 0.113166247 0.612783189 0.272189997 0.001860567 yes 0.102586207 0.545689655 0.349137931 0.002586207 在是否结婚这个属性上纵向看概率 > prop.table(tab,2) divorced married single unknown no 0.8967910 0.8984275 0.8599585 0.8500000 yes 0.1032090 0.1015725 0.1400415 0.1500000 平的列联表 以第一列和第二列,展开分类group by 1,2 以col.vars 的取值 进行次数统计 > ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y") y no yes marital education divorced basic.4y 406 83 basic.6y 169 13 basic.9y 534 31 high.school 1086 107 illiterate 1 1 professional.course 596 61 university.degree 1177 160 unknown 167 20 married basic.4y 2915 313 basic.6y 1628 139 basic.9y 3858 298 high.school 4683 475 illiterate 12 3 professional.course 2799 357 university.degree 5573 821 unknown 928 126 single basic.4y 422 31 basic.6y 301 36 basic.9y 1174 142 high.school 2702 448 illiterate 1 0 professional.course 1247 177 university.degree 3723 683 unknown 378 103 unknown basic.4y 5 1 basic.6y 6 0 basic.9y 6 2 high.school 13 1 illiterate 0 0 professional.course 6 0 university.degree 25 6 unknown 7 2 卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布 > chisq.test(tab) Pearson's Chi-squared test data: tab X-squared = 122.66, df = 3, p-value < 2.2e-16 画直方图 > hist(bank$age) > library(lattice) 画连续变量的分布,就是把直方图的中位数连接起来 以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例 > densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE) 画Box图 > boxplot(age~y,data=bank) 双样本t分布检验,p值小于0.05时拒绝原假设 这里的原假设是两个样本没有相关性 得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设 这里认为两个样本有相关性 > t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE) Welch Two Sample t-test data: age by y t = -4.7795, df = 5258.5, p-value = 1.805e-06 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.4129336 -0.5909889 sample estimates: mean in group no mean in group yes 39.91119 40.91315 数据可视化 画饼图 > tab=table(bank$marital) > pie(tab) 画直方图 > tab=table(bank$marital) > barplot(tab) 画下面这个图 > tab=table(bank$marital,bank$y) > plot(tab) 画层叠直方图 > tab=table(bank$marital,bank$y) > lattice::barchart(tab,auto.key=TRUE) 加载这个包,准备画图 > library(dplyr) > data=group_by(bank,marital,y) > data=tally(data) !!!!!!!!!!!!! > ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity") 数据预处理 分组之后再画图 > labels=c('青年','中年','老年') > bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels) > library(ggplot2) > ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count") 衍生变量 直接使用$符向原数据框添加新的变量 > bank$log.cons.price.idx=log(bank$cons.price.idx) 使用transform函数向原数据框添加变量 > bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed)) 使用dplyr包里的mutate函数增加变量 > bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx)) 使用dplyr包里的transmute函数只保留新生成的变量 > bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed)) 中心化 > v=1:10 > v1=v-mean(v) > v2=scale(v,center=TRUE,scale = FALSE) 无量纲化 > V1=v/sqrt(sum(v^2)/(length(v)-1)) > v2=scale(v,center=FALSE,scale=TRUE) 根据最大最小值进行归一化 > v3=(v-min(v))/(max(v)-min(v)) 进行标准正态化 > v1=(v-mean(v))/sd(v) > v2=scale(v,center = TRUE,scale=TRUE) Box-Cox变换 使用car包里的boxCox函数 > install.packages("car") > library(car) > boxCox(age~.,data=bank) 使用caret包,做Box-Cox变换 > install.packages("caret") > library(caret) > dat<-subset(bank,select="age") > trans<-preProcess(dat,method=C("BoxCox")) 数据预处理下 违反常识的异常值 基于数据分布的异常值(离群点)识别 bank.dirty=read.csv("bank-dirty.csv") summary(bank.dirty) age job marital education Min. : 17.00 admin. :10422 divorced: 4612 university.degree :12165 1st Qu.: 32.00 blue-collar: 9254 married :24928 high.school : 9515 Median : 38.00 technician : 6743 single :11568 basic.9y : 6043 Mean : 40.03 services : 3969 NA's : 80 professional.course: 5242 3rd Qu.: 47.00 management : 2924 basic.4y : 4175 Max. :123.00 (Other) : 7546 (Other) : 2310 NA's :2 NA's : 330 NA's : 1738 default housing loan contact month no :32588 no :18622 no :33950 cellular :26144 may :13769 yes : 3 yes :21576 yes : 6248 telephone:15044 jul : 7174 NA's: 8597 NA's: 990 NA's: 990 aug : 6178 jun : 5318 nov : 4101 apr : 2632 (Other): 2016 day_of_week duration campaign pdays previous fri:7827 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.000 mon:8514 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 thu:8623 Median : 180.0 Median : 2.000 Median :999.0 Median :0.000 tue:8090 Mean : 258.3 Mean : 2.568 Mean :962.5 Mean :0.173 wed:8134 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000 Max. :4918.0 Max. :56.000 Max. :999.0 Max. :7.000 poutcome emp.var.rate cons.price.idx cons.conf.idx failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8 nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 success : 1373 Median : 1.10000 Median :93.75 Median :-41.8 Mean : 0.08189 Mean :93.58 Mean :-40.5 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 Max. : 1.40000 Max. :94.77 Max. :-26.9 euribor3m nr.employed y Min. :0.634 Min. :4964 no :36548 1st Qu.:1.344 1st Qu.:5099 yes: 4640 Median :4.857 Median :5191 Mean :3.621 Mean :5167 3rd Qu.:4.961 3rd Qu.:5228 Max. :5.045 Max. :5228 常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户 找出在年龄这一列的上离群值和下离群值 > head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5) age 39494 123 38453 98 38456 98 27827 95 38922 94 > tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5) age 37559 17 37580 17 38275 17 120 NA 156 NA 异常值的处理 当作缺失值处理 > bank.dirty$age[which(bank.dirty$age>98)]<-NA 删除或者插补 重编码 职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类 Month有12个分类,把它转化成季度 Education的分类,除了unknow之外有7类 进行重编码 levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur", "management","unemployed", "entrepreneur","services", "unemployed","services","unemployed","unknown" ) > levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2", "Q1","Q2","Q4","Q4","Q3") > > levels(bank.dirty$education) <- c( "primary","primary","primary","secondary", "primary","tertiary","tertiary","unknown") 缺失值 分类较多,分类是unknown,不能给我们提供信息 有些模型不能处理缺失值,比如Logistic回归 缺失值插补的方法 1、 用中位数或众数插补 > library(imputeMissings) > bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode")) 2、 最邻近(knn)插补 library(DMwR) bank.clean=knnImputation(bank.dirty,k=5) 3、 随机森林插补 library(missForest) Imp = missForest(bank.dirty) bank.clean = Imp$ximp 缺失值插补的R包 1、 imputeMissings包 2、 DMwR包 用Logistic回归建立客户响应模型 1、 广义线性模型 广义线性模型擅长于处理因变量不是连续变量的问题 1) Y是分类变量 2) Y是定序变量 3) Y是离散取值 2、 当Y取值是0-1二分类变量是,就是Logistic回归 Logistic回归在R中的实现 数据重编码 bank$y=ifelse(bank$y=='yes',1,0) 改成以Q1为参考因子 bank$month<-relevel(bank$month,ref="Q1") 构建Logistic回归模型 > model<-glm(y~.,data=bank,family = 'binomial') > summary(model) Call: glm(formula = y ~ ., family = "binomial", data = bank) Deviance Residuals: Min 1Q Median 3Q Max -5.9958 -0.3082 -0.1887 -0.1333 3.4283 Coefficients: (1 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.957e+02 1.935e+01 -10.116 < 2e-16 *** age 1.851e-03 2.415e-03 0.767 0.443289 jobblue-collar -2.659e-01 7.942e-02 -3.348 0.000814 *** jobentrepreneur -2.029e-01 1.248e-01 -1.626 0.103924 jobhousemaid -3.628e-02 1.475e-01 -0.246 0.805705 jobmanagement -8.054e-02 8.501e-02 -0.947 0.343423 jobretired 2.928e-01 1.067e-01 2.743 0.006092 ** jobself-employed -1.680e-01 1.176e-01 -1.428 0.153332 jobservices -1.497e-01 8.552e-02 -1.751 0.079969 . jobstudent 2.674e-01 1.106e-01 2.416 0.015680 * jobtechnician 3.462e-03 7.096e-02 0.049 0.961086 jobunemployed 8.514e-03 1.273e-01 0.067 0.946686 jobunknown -8.046e-02 2.390e-01 -0.337 0.736420 maritalmarried 1.567e-02 6.824e-02 0.230 0.818420 maritalsingle 6.620e-02 7.791e-02 0.850 0.395473 maritalunknown 6.303e-02 4.113e-01 0.153 0.878211 educationbasic.6y 9.647e-02 1.202e-01 0.803 0.422195 educationbasic.9y -2.154e-02 9.494e-02 -0.227 0.820557 educationhigh.school 3.381e-02 9.188e-02 0.368 0.712895 educationilliterate 1.132e+00 7.395e-01 1.531 0.125887 educationprofessional.course 1.136e-01 1.013e-01 1.121 0.262175 educationuniversity.degree 2.134e-01 9.188e-02 2.322 0.020211 * educationunknown 1.361e-01 1.196e-01 1.138 0.255314 defaultunknown -3.055e-01 6.712e-02 -4.552 5.32e-06 *** defaultyes -7.150e+00 1.135e+02 -0.063 0.949784 housingunknown -7.385e-02 1.390e-01 -0.531 0.595260 housingyes -3.740e-03 4.121e-02 -0.091 0.927695 loanunknown NA NA NA NA loanyes -6.362e-02 5.725e-02 -1.111 0.266454 contacttelephone -6.068e-01 7.124e-02 -8.518 < 2e-16 *** monthQ2 -2.192e+00 1.125e-01 -19.479 < 2e-16 *** monthQ3 -1.463e+00 1.148e-01 -12.747 < 2e-16 *** monthQ4 -1.995e+00 1.240e-01 -16.088 < 2e-16 *** day_of_weekmon -1.216e-01 6.588e-02 -1.846 0.064887 . day_of_weekthu 6.375e-02 6.382e-02 0.999 0.317842 day_of_weektue 6.867e-02 6.545e-02 1.049 0.294118 day_of_weekwed 1.436e-01 6.530e-02 2.199 0.027911 * duration 4.667e-03 7.397e-05 63.092 < 2e-16 *** campaign -4.543e-02 1.158e-02 -3.922 8.77e-05 *** pdays -9.627e-04 2.162e-04 -4.452 8.50e-06 *** previous -5.806e-02 5.879e-02 -0.988 0.323369 poutcomenonexistent 4.507e-01 9.372e-02 4.809 1.51e-06 *** poutcomesuccess 9.371e-01 2.106e-01 4.451 8.56e-06 *** emp.var.rate -1.389e+00 7.693e-02 -18.057 < 2e-16 *** cons.price.idx 1.815e+00 1.193e-01 15.218 < 2e-16 *** cons.conf.idx 3.353e-02 6.664e-03 5.033 4.84e-07 *** euribor3m 6.054e-02 1.126e-01 0.537 0.590987 nr.employed 4.937e-03 1.873e-03 2.635 0.008413 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 28999 on 41187 degrees of freedom Residual deviance: 17199 on 41141 degrees of freedom AIC: 17293 Number of Fisher Scoring iterations: 10 > exp(coef(model)) (Intercept) age jobblue-collar 9.856544e-86 1.001853e+00 7.665077e-01 jobentrepreneur jobhousemaid jobmanagement 8.163314e-01 9.643733e-01 9.226187e-01 jobretired jobself-employed jobservices 1.340142e+00 8.453874e-01 8.609387e-01 jobstudent jobtechnician jobunemployed 1.306514e+00 1.003468e+00 1.008550e+00 jobunknown maritalmarried maritalsingle 9.226922e-01 1.015789e+00 1.068445e+00 maritalunknown educationbasic.6y educationbasic.9y 1.065061e+00 1.101276e+00 9.786948e-01 educationhigh.school educationilliterate educationprofessional.course 1.034388e+00 3.101297e+00 1.120248e+00 educationuniversity.degree educationunknown defaultunknown 1.237856e+00 1.145744e+00 7.367445e-01 defaultyes housingunknown housingyes 7.851906e-04 9.288126e-01 9.962671e-01 loanunknown loanyes contacttelephone NA 9.383587e-01 5.450980e-01 monthQ2 monthQ3 monthQ4 1.116739e-01 2.314802e-01 1.360620e-01 day_of_weekmon day_of_weekthu day_of_weektue 8.854888e-01 1.065828e+00 1.071082e+00 day_of_weekwed duration campaign 1.154380e+00 1.004678e+00 9.555850e-01 pdays previous poutcomenonexistent 9.990378e-01 9.435960e-01 1.569466e+00 poutcomesuccess emp.var.rate cons.price.idx 2.552531e+00 2.493091e-01 6.140533e+00 cons.conf.idx euribor3m nr.employed 1.034103e+00 1.062408e+00 1.004949e+00 Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍 > summary(model.step) 向前逐步回归 > model.step=step(model,direction = "backward") 向后逐步回归 > model.step = step(model, direction = "forward") 双向逐步回归 > model.step = step(model, direction = "both") > summary(model.step) Call: glm(formula = y ~ job + education + default + contact + month + day_of_week + duration + campaign + pdays + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + nr.employed, family = "binomial", data = bank) Deviance Residuals: Min 1Q Median 3Q Max -5.9884 -0.3088 -0.1887 -0.1332 3.4026 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.031e+02 1.426e+01 -14.246 < 2e-16 *** jobblue-collar -2.700e-01 7.917e-02 -3.411 0.000648 *** jobentrepreneur -2.043e-01 1.242e-01 -1.645 0.100003 jobhousemaid -2.832e-02 1.464e-01 -0.193 0.846590 jobmanagement -8.368e-02 8.409e-02 -0.995 0.319670 jobretired 3.234e-01 9.130e-02 3.542 0.000397 *** jobself-employed -1.670e-01 1.176e-01 -1.421 0.155435 jobservices -1.528e-01 8.545e-02 -1.789 0.073666 . jobstudent 2.682e-01 1.046e-01 2.565 0.010316 * jobtechnician 4.389e-03 7.093e-02 0.062 0.950665 jobunemployed 8.975e-03 1.271e-01 0.071 0.943715 jobunknown -6.363e-02 2.378e-01 -0.268 0.789057 educationbasic.6y 8.993e-02 1.196e-01 0.752 0.452024 educationbasic.9y -2.716e-02 9.416e-02 -0.288 0.772992 educationhigh.school 2.890e-02 9.053e-02 0.319 0.749573 educationilliterate 1.118e+00 7.398e-01 1.511 0.130744 educationprofessional.course 1.084e-01 1.004e-01 1.079 0.280686 educationuniversity.degree 2.103e-01 9.017e-02 2.332 0.019678 * educationunknown 1.363e-01 1.195e-01 1.140 0.254110 defaultunknown -3.017e-01 6.666e-02 -4.526 6.02e-06 *** defaultyes -7.141e+00 1.135e+02 -0.063 0.949831 contacttelephone -6.011e-01 7.069e-02 -8.504 < 2e-16 *** monthQ2 -2.210e+00 1.108e-01 -19.939 < 2e-16 *** monthQ3 -1.475e+00 1.146e-01 -12.869 < 2e-16 *** monthQ4 -1.982e+00 1.183e-01 -16.755 < 2e-16 *** day_of_weekmon -1.210e-01 6.584e-02 -1.837 0.066174 . day_of_weekthu 6.208e-02 6.374e-02 0.974 0.330066 day_of_weektue 6.851e-02 6.538e-02 1.048 0.294651 day_of_weekwed 1.420e-01 6.525e-02 2.176 0.029592 * duration 4.667e-03 7.396e-05 63.099 < 2e-16 *** campaign -4.587e-02 1.158e-02 -3.960 7.49e-05 *** pdays -8.822e-04 2.024e-04 -4.358 1.31e-05 *** poutcomenonexistent 5.219e-01 6.356e-02 8.211 < 2e-16 *** poutcomesuccess 9.996e-01 2.028e-01 4.928 8.31e-07 *** emp.var.rate -1.376e+00 6.885e-02 -19.980 < 2e-16 *** cons.price.idx 1.845e+00 1.041e-01 17.725 < 2e-16 *** cons.conf.idx 3.622e-02 4.853e-03 7.464 8.42e-14 *** nr.employed 5.883e-03 9.765e-04 6.024 1.70e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 28999 on 41187 degrees of freedom Residual deviance: 17203 on 41150 degrees of freedom AIC: 17279 Number of Fisher Scoring iterations: 10 模型预测 用predict函数,参数type=’response’ Newdata参数是要预测的数据集 > prob<-predict(model.step,type = 'response') > head(prob) 1 2 3 4 5 6 0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 假设以0.5为临界值 > pre<-ifelse(prob>0.5,1,0) > table(pre,bank$y) pre 0 1 0 35596 2667 1 952 1973 > 预测的准确率 > (35592+1964)/(35592+2676+956+1964) [1] 0.911819 实际有响应的客户被识别出了多少 > 1964/(1964+2676) [1] 0.4232759 模型评估 > confusionMatrix(bank$y,pre,pos='1') Confusion Matrix and Statistics Reference Prediction 0 1 0 35596 952 1 2667 1973 Accuracy : 0.9121 95% CI : (0.9094, 0.9149) No Information Rate : 0.929 P-Value [Acc > NIR] : 1 Kappa : 0.476 Mcnemar's Test P-Value : <2e-16 Sensitivity : 0.67453 全部评论
专题导读
上一篇:拓端tecdat|R语言代写实现层次聚类模型发布时间:2022-07-18下一篇:图解OpenAI的秘密武器GPT-2:可视化Transformer语言模型发布时间:2022-07-18热门推荐
热门话题
阅读排行榜
|
请发表评论