The forecastxgb package
1. Installation

if (!require(forecastxgb)) devtools::install_github("ellisp/forecastxgb-r-package/pkg")
## Loading required package: forecastxgb
## Loading required package: forecast
## Loading required package: xgboost
2.1 Univariate time series

As an example, take the monthly Australian gas production data set gas, covering 1956 to 1995:

model <- xgbar(gas)

By default, xgbar() chooses the best number of xgboost iterations by cross-validation to avoid overfitting; the final model is then fitted to the full data set. The relative importance of each predictor can be inspected with importance_xgb(), or more simply with summary():
summary(model)
##
## Importance of features in the xgboost model:
## Feature Gain Cover Frequency
## 1: lag12 5.095860e-01 0.1107688262 0.066558442
## 2: lag11 2.795726e-01 0.0729943460 0.043831169
## 3: lag13 1.043121e-01 0.0231605838 0.024350649
## 4: lag24 7.806692e-02 0.1017713245 0.056818182
## 5: lag1 1.586581e-02 0.1735447151 0.183441558
## 6: lag23 5.628623e-03 0.0384882694 0.038961039
## 7: lag9 2.610440e-03 0.0848657888 0.061688312
## 8: lag2 6.993670e-04 0.0558634033 0.069805195
## 9: lag14 6.029356e-04 0.0453444033 0.030844156
## 10: lag10 5.519954e-04 0.0350508105 0.045454545
## 11: lag6 3.845262e-04 0.0108571107 0.029220779
## 12: lag4 2.337476e-04 0.0076074910 0.034090909
## 13: lag22 2.248777e-04 0.0091665571 0.017857143
## 14: lag16 2.053888e-04 0.0068561339 0.012987013
## 15: lag21 2.019046e-04 0.0161729624 0.024350649
## 16: lag18 1.971903e-04 0.0218457088 0.029220779
## 17: lag3 1.868054e-04 0.0280256213 0.038961039
## 18: lag5 1.522874e-04 0.0230478802 0.034090909
## 19: lag17 1.495012e-04 0.0321956534 0.025974026
## 20: lag15 1.385763e-04 0.0211131356 0.029220779
## 21: lag8 1.140981e-04 0.0156094446 0.029220779
## 22: season7 1.049438e-04 0.0052595000 0.011363636
## 23: lag19 8.884593e-05 0.0255649266 0.016233766
## 24: lag20 6.416017e-05 0.0080958732 0.012987013
## 25: lag7 4.402814e-05 0.0167928321 0.019480519
## 26: season4 5.631148e-06 0.0008452768 0.003246753
## 27: season5 2.862451e-06 0.0014463625 0.004870130
## 28: season6 2.627577e-06 0.0014087946 0.001623377
## 29: season10 1.064577e-06 0.0059920732 0.001623377
## 30: season9 9.222462e-08 0.0002441911 0.001623377
## Feature Gain Cover Frequency
##
## 35 features considered.
## 476 original observations.
## 452 effective observations after creating lagged features.
We can clearly see that the most important predictor of gas production is the production twelve months earlier (lag12). Looking more closely, the model uses 35 predictors in total; the original series has 476 time points, and since maxlag = 24, creating the lagged variables leaves 452 effective observations for xgboost() to work with.
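The arithmetic above is easy to check directly. A sketch (assuming the gas series loaded with the forecast package, as above) that refits the model with maxlag set explicitly:

```r
library(forecastxgb)

# gas is monthly, so the default maxlag is max(8, 2 * 12) = 24; setting it
# explicitly makes the trade-off visible: each extra lag costs one effective
# observation at the start of the series.
model24 <- xgbar(gas, maxlag = 24)

length(gas)  # 476 original observations; 476 - 24 = 452 remain after lagging
```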
Forecasting is one of the main features of the forecastxgb package: forecasts are produced with forecast() and plotted with plot():

fc <- forecast(model, h = 12)
plot(fc)
2.2 Multivariate panel data

Handling multiple predictor variables works much like the univariate case: simply pass them via xreg = X. The package author recommends supplying xreg as a matrix, even when X contains only a single column.

The following example uses the usconsumption data set from fpp, the companion data package to the book by Athanasopoulos and Hyndman. The data set contains two series, 'income' and 'consumption'. Here 'income' is used as a covariate to forecast 'consumption', so the predictor set contains not only the lags of 'consumption' but also 'income' and its lags.
library(fpp)

## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
##
## Loading required package: tseries
consumption <- usconsumption[ ,1]
income <- matrix(usconsumption[ ,2], dimnames = list(NULL, "Income"))
consumption_model <- xgbar(y = consumption, xreg = income)
summary(consumption_model)
##
## Importance of features in the xgboost model:
## Feature Gain Cover Frequency
## 1: lag2 2.473931e-01 0.050257495 0.082585278
## 2: lag1 2.136168e-01 0.062994298 0.170556553
## 3: Income_lag0 1.172917e-01 0.153209491 0.084380610
## 4: lag3 6.405769e-02 0.069891484 0.070017953
## 5: lag8 5.518218e-02 0.094353504 0.050269300
## 6: Income_lag8 4.953923e-02 0.047268714 0.048473968
## 7: Income_lag1 4.760696e-02 0.081524738 0.046678636
## 8: Income_lag6 4.258574e-02 0.046992827 0.055655296
## 9: lag6 3.363114e-02 0.048096377 0.068222621
## 10: Income_lag2 2.079926e-02 0.049705720 0.044883303
## 11: Income_lag5 1.900147e-02 0.046533015 0.043087971
## 12: lag7 1.834403e-02 0.062120655 0.046678636
## 13: lag5 1.780785e-02 0.034302005 0.032315978
## 14: Income_lag4 1.752774e-02 0.024967813 0.035906643
## 15: Income_lag7 1.154118e-02 0.060051499 0.035906643
## 16: lag4 9.921818e-03 0.039451904 0.039497307
## 17: Income_lag3 8.983342e-03 0.013150635 0.025134650
## 18: season4 3.274300e-03 0.003172706 0.007181329
## 19: season3 1.885444e-03 0.010299798 0.010771993
## 20: season2 9.052612e-06 0.001655325 0.001795332
##
## 20 features considered.
## 164 original observations.
## 156 effective observations after creating lagged features.
We can see that the most important predictors of 'consumption' are its own values from the previous two quarters, followed by current 'income'.

Using any variable other than Y itself raises an unavoidable question: will the predictor values be available in advance? Predictors such as the month, the date, the day of the week, or public holidays can be determined ahead of time, which makes things easy; many other variables cannot. The forecastxgb author offers a workaround: first forecast the future values of the covariates, then feed those forecasts back into the original model to forecast Y:
income_future <- matrix(forecast(xgbar(usconsumption[ ,2]), h = 10)$mean,
                        dimnames = list(NULL, "Income"))
plot(forecast(consumption_model, xreg = income_future))
3 Overview of the core forecastxgb functions

(1) The core function xgbar()

forecastxgb applies the xgboost algorithm (xgb) in an autoregressive (ar) framework. Its core function xgbar() predicts Y from the lags of the dependent variable Y (Yt-1, ..., Yt-n) and, if supplied, the covariates X and their lags (Xt-1, ..., Xt-n):

xgbar(y, xreg = NULL, maxlag = max(8, 2 * frequency(y)), nrounds = 100,
      nrounds_method = c("cv", "v", "manual"),
      nfold = ifelse(length(y) > 30, 10, 5), lambda = 1, verbose = FALSE,
      seas_method = c("dummies", "decompose", "fourier", "none"),
      K = max(1, min(round(f/2 - 1), 10)),
      trend_method = c("none", "differencing"), ...)

Arguments:

y: the dependent variable (must be a univariate time series).

xreg: optional covariates used to predict y; must have the same number of rows as y has observations.

maxlag: the maximum number of lags of y (and of x, if supplied) used as features.

nrounds: the maximum number of xgboost() iterations.

nrounds_method: how the number of iterations is determined. With nrounds_method = 'cv', nrounds is passed to xgboost() and the number of rounds is chosen by cross-validation; with 'v', the data are split 8:2 into training and validation sets; with 'manual', xgboost() runs for exactly nrounds iterations on the full data.

nfold: the number of folds when nrounds_method = 'cv'.

lambda: a transformation parameter for y (similar to a Box-Cox transformation, but lambda may take negative values), applied before xgboost() is run and inverted afterwards to return to the original scale. The default lambda = 1 leaves y untransformed. The transformation is applied only to y, never to x.

verbose: with verbose = FALSE (the default) only the final number of iterations is reported; with verbose = TRUE the error at each iteration is shown.

seas_method: how the seasonality of y is handled, one of "dummies", "decompose", "fourier" or "none". With "dummies" (the default) or "fourier", seasonal indicator features are generated as predictors; with "decompose", y is seasonally decomposed before xgboost() is run; with "none", seasonality is left untreated.

K: determines the order of the Fourier terms when seas_method = "fourier".

trend_method: how the trend of y is handled. The default is trend_method = "none"; with "differencing", the series is differenced ARMA-style, with the order of differencing chosen by the KPSS test so that the remaining series is stationary.

...: when nrounds_method = "cv" or "manual", further xgboost() arguments can be passed here; see the xgboost package for details.
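As an illustrative sketch of the signature above (argument values chosen for demonstration only, and AirPassengers, a base R monthly series, standing in for the reader's own data), a fully specified call might look like:

```r
library(forecastxgb)

# Monthly series: seasonal dummies, ARMA-style differencing for trend,
# and 10-fold cross-validation to pick the number of boosting rounds.
model <- xgbar(AirPassengers,
               maxlag         = 24,
               nrounds        = 100,
               nrounds_method = "cv",
               nfold          = 10,
               seas_method    = "dummies",
               trend_method   = "differencing")

summary(model)
forecast(model, h = 12)
```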
(2) How different seasonal treatments of Y change the predictor set:

Besides the lagged variables, the predictor set varies with the seas_method argument:

When seas_method = 'none', the seasonality of Y is left untreated, so no seasonal features appear. When seas_method = 'decompose', Y is seasonally decomposed and the seasonally adjusted series is used as the dependent variable, so again no seasonal features appear.

When seas_method = 'dummies' or 'fourier', however, features encoding the seasonality of Y are constructed and fed into xgboost(), so the predictor set contains these seasonal features in addition to the lags.
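The difference is easy to verify by fitting the same series under both settings and comparing the importance tables (a sketch using the gas series from the earlier example; output omitted):

```r
library(forecastxgb)

# seas_method = "dummies": summary() lists seasonN indicator features
# alongside the lag features (as in the gas example above).
m_dummies <- xgbar(gas, seas_method = "dummies")
summary(m_dummies)

# seas_method = "decompose": the seasonal component is removed before
# fitting, so the importance table contains lag features only.
m_decomp <- xgbar(gas, seas_method = "decompose")
summary(m_decomp)
```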