Learning XGBoost
Compared with traditional GBDT, the XGBoost model adds explicit control over model complexity plus post-hoc pruning, so the learned model is less prone to overfitting.
XGBoost only works with numeric vectors, so you need to convert all other kinds of data into numeric form. A simple way to turn categorical variables into numeric vectors is one-hot encoding. The term comes from digital circuit design, where it refers to an array of binary signals whose only legal values are 0 and 1.
One-hot encoding is very simple in R. The step below builds a sparse matrix with a flag column for each possible value of each categorical variable. A sparse matrix is a matrix in which most values are zero; conversely, a dense matrix is one in which most values are nonzero.
Suppose you have a dataset called "campaign" and want to convert all categorical variables except the response variable into flag columns, as follows:
sparse_matrix <- Matrix::sparse.model.matrix(response ~ .-1, data = campaign)
Now let's break this code down. The parentheses after sparse.model.matrix contain all the input arguments. Putting "response" on the left of the formula tells the command to exclude the response variable from the predictors, and "-1" drops the intercept column that would otherwise be created as the first column of the matrix. Finally, you specify the dataset name. To convert the target variable itself, you can use the following code:
output_vector = df[,response] == "Responder"
What this code does: it initializes output_vector to 0, sets the entries of output_vector where the response variable equals "Responder" to 1, and returns output_vector.
Run the following code to load the sample data:
require(xgboost)
## Loading required package: xgboost
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
This dataset asks us to judge, from a number of attributes of a mushroom, whether the variety is poisonous. The presence or absence of each attribute is marked as 1 or 0, so the sample data comes as a sparse matrix:
class(train$data)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Don't worry: xgboost accepts sparse matrices as input. The following command trains the model:
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nround = 2, objective = "binary:logistic")
## [1] train-error:0.046522
## [2] train-error:0.022263
We ran two boosting iterations, and the function printed the model's error after each one. The data here is a sparse matrix, but ordinary dense matrices are also supported. If the data file is too large to read into R, we can set data = 'path_to_file' so that xgboost reads and processes it directly from disk; currently files in libsvm format can be read this way.
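A minimal sketch of loading libsvm data from disk; the file name 'train.libsvm' below is hypothetical:

# Build a DMatrix directly from a libsvm-format file on disk (hypothetical path)
dtrain_disk <- xgb.DMatrix('train.libsvm')
bst_disk <- xgboost(data = dtrain_disk, max.depth = 2, eta = 1,
                    nround = 2, objective = "binary:logistic")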
Prediction takes just one line:
pred <- predict(bst, test$data)
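To get a quick sense of accuracy, the official vignette thresholds the predicted probabilities at 0.5 and compares against the test labels:

# Simple test error: threshold the predicted probabilities at 0.5
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error =", err))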
The cross-validation function takes essentially the same arguments as the training function; just set nfold on top of the existing parameters:
cv.res <- xgb.cv(data = train$data, label = train$label, max.depth = 2,
                 eta = 1, nround = 2, objective = "binary:logistic", nfold = 5)
## [1] train-error:0.046522+0.001438 test-error:0.046522+0.005751
## [2] train-error:0.022263+0.000831 test-error:0.022262+0.003323
cv.res
## ##### xgb.cv 5-folds
## iter train_error_mean train_error_std test_error_mean test_error_std
## 1 0.0465224 0.0014377670 0.0465216 0.005750512
## 2 0.0222632 0.0008309254 0.0222622 0.003323332
The cross-validation function returns its results as a data.table, which makes it easy to monitor performance on the training and test folds and determine the optimal number of boosting rounds, as sketched below.
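A small sketch, assuming the evaluation-log field names shown in the printout above (newer releases expose them under cv.res$evaluation_log):

# Pick the round with the lowest mean test error (field names as printed above)
best_nround <- which.min(cv.res$evaluation_log$test_error_mean)
print(best_nround)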
1. Data import and package loading
The case study looks at whether taking a placebo affects recovery from illness; the other variables are age and sex.
require(xgboost)
require(Matrix)
## Loading required package: Matrix
require(data.table)
## Loading required package: data.table
if (!require('vcd')) install.packages('vcd')
## Loading required package: vcd
## Loading required package: grid
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
# Next we do some processing on the data.
# := adds a new column
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
## ID Treatment Sex Age Improved AgeDiscret
## 1: 57 Treated Male 27 Some 3
## 2: 46 Treated Male 29 None 3
## 3: 77 Treated Male 30 None 3
## 4: 17 Treated Male 32 Marked 3
## 5: 36 Treated Male 46 Marked 5
## 6: 23 Treated Male 58 Marked 6
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
# ifelse assigns "Old"/"Young" based on the age-30 threshold
## ID Treatment Sex Age Improved AgeDiscret AgeCat
## 1: 57 Treated Male 27 Some 3 Young
## 2: 46 Treated Male 29 None 3 Young
## 3: 77 Treated Male 30 None 3 Young
## 4: 17 Treated Male 32 Marked 3 Old
## 5: 36 Treated Male 46 Marked 5 Old
## 6: 23 Treated Male 58 Marked 6 Old
df[,ID:=NULL] # drop the ID column; it carries no information
2. Generating the required data format
# Convert to a sparse matrix, where zeros are stored as ".", minimizing memory use
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
This generates one-hot encoded data. Improved is the Y variable, and -1 removes the intercept so that nominal variables such as Treatment are expanded into one flag column per level.
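A quick check of the generated column names (the same expression, sparse_matrix@Dimnames[[2]], is used again below):

# Inspect the one-hot column names produced by sparse.model.matrix
head(sparse_matrix@Dimnames[[2]])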
3. Setting the dependent variable (binary classification)
output_vector = df[,Improved] == "Marked"
4. Fitting the xgboost model
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nround = 10, objective = "binary:logistic")
## [1] train-error:0.202381
## [2] train-error:0.166667
## [3] train-error:0.166667
## [4] train-error:0.166667
## [5] train-error:0.154762
## [6] train-error:0.154762
## [7] train-error:0.154762
## [8] train-error:0.166667
## [9] train-error:0.166667
## [10] train-error:0.166667
Here nround is the number of boosting iterations, which can be tuned to manage overfitting; nthread is the number of threads to run on (if unspecified, all available threads are used); objective specifies the learning task: binary:logistic performs nonlinear binary classification via logistic loss. Other options include reg:linear (the default), reg:logistic, count:poisson (for Poisson-distributed counts), and multi:softmax (multi-class classification); see the sketch below.
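A hedged sketch of a multi-class configuration: multi:softmax additionally requires num_class, and the labels must be integers from 0 to num_class-1 (the values below are illustrative):

# Illustrative multi-class parameter list; num_class is required by multi:softmax
param_multi <- list(objective = "multi:softmax", num_class = 3,
                    max.depth = 4, eta = 1)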
5. Ranking feature importance
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
head(importance)
## Feature Gain Cover Frequency
## 1: Age 0.622031651 0.67251706 0.67241379
## 2: TreatmentPlacebo 0.285750607 0.11916656 0.10344828
## 3: SexMale 0.048744054 0.04522027 0.08620690
## 4: AgeDiscret6 0.016604647 0.04784637 0.05172414
## 5: AgeDiscret3 0.016373791 0.08028939 0.05172414
## 6: AgeDiscret4 0.009270558 0.02858801 0.01724138
Several metrics are reported. Gain is the improvement in accuracy a feature contributes on the branches where it appears, and is the main criterion for tree splits; Cover is the relative number of observations covered by the feature; Frequency is a cruder version of Gain that simply counts how many times a feature is used across all generated trees (use with caution!).
6. Feature selection and testing
Knowing which features are important is one thing; now we want to know how age affects the treatment outcome, so we need a way to show that. The following comes with the package.
importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst,
                                data = sparse_matrix, label = output_vector)
## Warning in `[.data.table`(result, , `:=`("RealCover", as.numeric(vec)), :
## with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
# Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)] # drop the Cover and Frequence columns together
## Warning in `[.data.table`(importanceRaw, , `:=`(Cover = NULL, Frequence =
## NULL)): Adding new column 'Frequence' then assigning NULL (deleting it).
head(importanceClean)
## Feature Split Gain Frequency RealCover
## 1: TreatmentPlacebo -9.53674e-07 0.28575061 0.10344828 7
## 2: Age 61.5 0.16374034 0.05172414 12
## 3: Age 39 0.08705750 0.01724138 8
## 4: Age 57.5 0.06947553 0.03448276 11
## 5: SexMale -9.53674e-07 0.04874405 0.08620690 4
## 6: Age 53.5 0.04620627 0.05172414 10
## RealCover %
## 1: 0.2500000
## 2: 0.4285714
## 3: 0.2857143
## 4: 0.3928571
## 5: 0.1428571
## 6: 0.3571429
Compared with the first call there is an extra Split column giving the threshold at which the feature was split; for example, row 2, Age 61.5, means a split point at age 61.5. There are also two new columns: RealCover, the number of observations at this feature's split, and RealCover %, that number as a proportion.
Plotting the importance graph:
xgb.plot.importance(importance_matrix = importanceRaw)
This requires installing the Ckmeans.1d.dp package (install.packages("Ckmeans.1d.dp")). The plot groups the features into two clusters by default; the number of clusters can be customized, e.g. up to 10, as sketched below.
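A hedged sketch: the cluster-count argument below (numberOfClusters) matches older releases of the R package and is an assumption; check args(xgb.plot.importance) for your installed version.

# Allow up to 10 clusters when grouping the importance bars (argument name
# from older xgboost releases; verify against your installed version)
if (!require('Ckmeans.1d.dp')) install.packages('Ckmeans.1d.dp')
xgb.plot.importance(importance_matrix = importanceRaw, numberOfClusters = c(1:10))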
To test the influence between variables, the package authors use a chi-squared test:
c2 <- chisq.test(df$Age, output_vector)
## Warning in chisq.test(df$Age, output_vector): Chi-squared approximation may
## be incorrect
This tests the influence of age on the final outcome; the same pattern applies to the other variables, as sketched below.
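Printing the result shows the statistic and p-value; the second test below applies the same pattern to the discretized age column:

print(c2)
# Same test on the discretized age column (sketch following the pattern above)
c2_discret <- chisq.test(df$AgeDiscret, output_vector)
print(c2_discret)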
8. Trying some advanced features
As a competition algorithm it is truly excellent. Here are some features I particularly value:
8.1 Per-fold predictions from cross-validation
This makes it easy to pick out high-quality validation folds.
# do cross validation with prediction values for each fold
res <- xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5, prediction = TRUE)
res$evaluation_log
length(res$pred)
Cross-validation can return the model's predictions on each fold when that fold serves as the held-out set, which is convenient for building ensemble models; a self-contained sketch follows.
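The snippet above assumes param, dtrain and nround are already defined; here is a self-contained version using the agaricus data loaded earlier (the parameter values are illustrative):

# Self-contained version of the snippet above (illustrative parameter values)
dtrain <- xgb.DMatrix(train$data, label = train$label)
param <- list(max.depth = 2, eta = 1, objective = "binary:logistic")
nround <- 2
res <- xgb.cv(params = param, data = dtrain, nrounds = nround,
              nfold = 5, prediction = TRUE)
length(res$pred) # one out-of-fold prediction per training row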
8.2 Incremental training
The user can run, say, 1000 iterations first, inspect the model's predictions, then continue for another 1000 iterations; the final model is equivalent to training 2000 iterations in one go.
# do predict with output_margin=TRUE, will always give you margin
# values before logistic transformation
ptrain <- predict(bst, dtrain, outputmargin=TRUE)
ptest <- predict(bst, dtest, outputmargin=TRUE)
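The margins are then fed back as the starting point for further boosting. A hedged completion following the official boost_from_prediction demo, assuming dtrain and dtest are xgb.DMatrix objects (dtest built from the test data analogously to dtrain in the sketch under 8.1):

# Continue boosting from the current predictions (per the official
# boost_from_prediction demo); dtrain/dtest as defined earlier
setinfo(dtrain, "base_margin", ptrain)
setinfo(dtest, "base_margin", ptest)
bst2 <- xgb.train(params = param, data = dtrain, nrounds = nround)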
8.3 Which leaf each tree assigns a sample to
# training the model for two rounds
bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
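The leaf assignments themselves come from predict with predleaf = TRUE, which returns one leaf index per tree for every sample; a sketch, assuming dtest as defined earlier:

# One row per sample, one column per tree: the index of the leaf
# that sample falls into
pred_leaf <- predict(bst, dtest, predleaf = TRUE)
head(pred_leaf)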
8.4 A linear model instead of trees
You can replace the tree booster with a linear model, yielding linear or logistic regression with an L1+L2 penalty.
# you can also set lambda_bias which is L2 regularizer on the bias term
param <- list(objective = "binary:logistic", booster = "gblinear",
              nthread = 2, alpha = 0.0001, lambda = 1)
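Usage sketch: train with the linear booster defined above (dtrain and nround as before):

# alpha is the L1 penalty and lambda the L2 penalty on the weights
bst_linear <- xgb.train(params = param, data = dtrain, nrounds = nround)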