1. Installation

if (!require(forecastxgb)) devtools::install_github("ellisp/forecastxgb-r-package/pkg")

## Loading required package: forecastxgb

## Loading required package: forecast

## Loading required package: xgboost

2.1 Univariate time series

As an example, take the monthly Australian gas production dataset covering 1956 to 1995:

model <- xgbar(gas)

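By default, xgbar() uses row-wise cross-validation to choose the number of xgboost iterations, which guards against overfitting; the final model is then fitted to the whole dataset. The relative importance of each predictor can be viewed with importance_xgb() or, more simply, with summary():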

summary(model)

##
## Importance of features in the xgboost model:
## Feature Gain Cover Frequency
## 1: lag12 5.095860e-01 0.1107688262 0.066558442
## 2: lag11 2.795726e-01 0.0729943460 0.043831169
## 3: lag13 1.043121e-01 0.0231605838 0.024350649
## 4: lag24 7.806692e-02 0.1017713245 0.056818182
## 5: lag1 1.586581e-02 0.1735447151 0.183441558
## 6: lag23 5.628623e-03 0.0384882694 0.038961039
## 7: lag9 2.610440e-03 0.0848657888 0.061688312
## 8: lag2 6.993670e-04 0.0558634033 0.069805195
## 9: lag14 6.029356e-04 0.0453444033 0.030844156
## 10: lag10 5.519954e-04 0.0350508105 0.045454545
## 11: lag6 3.845262e-04 0.0108571107 0.029220779
## 12: lag4 2.337476e-04 0.0076074910 0.034090909
## 13: lag22 2.248777e-04 0.0091665571 0.017857143
## 14: lag16 2.053888e-04 0.0068561339 0.012987013
## 15: lag21 2.019046e-04 0.0161729624 0.024350649
## 16: lag18 1.971903e-04 0.0218457088 0.029220779
## 17: lag3 1.868054e-04 0.0280256213 0.038961039
## 18: lag5 1.522874e-04 0.0230478802 0.034090909
## 19: lag17 1.495012e-04 0.0321956534 0.025974026
## 20: lag15 1.385763e-04 0.0211131356 0.029220779
## 21: lag8 1.140981e-04 0.0156094446 0.029220779
## 22: season7 1.049438e-04 0.0052595000 0.011363636
## 23: lag19 8.884593e-05 0.0255649266 0.016233766
## 24: lag20 6.416017e-05 0.0080958732 0.012987013
## 25: lag7 4.402814e-05 0.0167928321 0.019480519
## 26: season4 5.631148e-06 0.0008452768 0.003246753
## 27: season5 2.862451e-06 0.0014463625 0.004870130
## 28: season6 2.627577e-06 0.0014087946 0.001623377
## 29: season10 1.064577e-06 0.0059920732 0.001623377
## 30: season9 9.222462e-08 0.0002441911 0.001623377
## Feature Gain Cover Frequency
##
## 35 features considered.
## 476 original observations.
## 452 effective observations after creating lagged features.

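The most important predictor of gas production is clearly the production 12 months earlier (lag12). Looking more closely, the model considers 35 predictors in total: 24 lags plus 11 monthly seasonal dummies. The original dataset contains 476 time points; because maxlag = 24, creating the lagged values of y leaves 452 time points (476 - 24) for the xgboost() fit.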
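The counts above can be reproduced with a short base R sketch. It does not call forecastxgb; it only assumes that the lagged features behave like the columns produced by embed() and that the seasonal dummies are ordinary month indicators. The series y is a synthetic placeholder with the same length and frequency as gas.

```r
# Synthetic monthly series with the same length and frequency as `gas`
# (476 observations, frequency 12); the values are placeholders.
y <- ts(seq_len(476), frequency = 12)

# Default maxlag taken from the xgbar() signature: max(8, 2 * 12) = 24
maxlag <- max(8, 2 * frequency(y))

# embed() puts y_t in column 1 and lag k in column k + 1,
# dropping the first `maxlag` rows that have incomplete lags
lagged <- embed(as.numeric(y), maxlag + 1)
colnames(lagged) <- c("y", paste0("lag", 1:maxlag))

# Monthly indicator columns for the rows that survive (first month = baseline)
month <- factor(as.integer(cycle(y)))[-(1:maxlag)]
dummies <- model.matrix(~ month)[, -1]

n_effective <- nrow(lagged)            # observations left after lagging
n_features <- maxlag + ncol(dummies)   # lags plus seasonal dummies
```

Here n_effective and n_features correspond to the "effective observations" and "features considered" lines of the summary() output.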

Forecasting is one of the main features of the forecastxgb package: forecast() produces the forecasts and plot() draws them:

fc <- forecast(model, h = 12)
plot(fc)

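2.2 Multiple variables (external regressors)

As with a univariate time series, handling additional predictors only requires setting xreg = X. The author of forecastxgb recommends passing xreg as a matrix, even when X has only one column.

The example dataset usconsumption comes from fpp, the companion data package for the book by Athanasopoulos and Hyndman. It contains two series, 'income' and 'consumption'. This example uses 'income' as the explanatory variable to forecast 'consumption'; the predictor set contains the lags of 'consumption' as well as 'income' and its lags.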

library(fpp)

## Loading required package: fma

## Loading required package: expsmooth

## Loading required package: lmtest

## Loading required package: zoo

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric

## Loading required package: tseries

consumption <- usconsumption[ ,1]
income <- matrix(usconsumption[ ,2], dimnames = list(NULL, "Income"))
consumption_model <- xgbar(y = consumption, xreg = income)
summary(consumption_model)

##
## Importance of features in the xgboost model:
## Feature Gain Cover Frequency
## 1: lag2 2.473931e-01 0.050257495 0.082585278
## 2: lag1 2.136168e-01 0.062994298 0.170556553
## 3: Income_lag0 1.172917e-01 0.153209491 0.084380610
## 4: lag3 6.405769e-02 0.069891484 0.070017953
## 5: lag8 5.518218e-02 0.094353504 0.050269300
## 6: Income_lag8 4.953923e-02 0.047268714 0.048473968
## 7: Income_lag1 4.760696e-02 0.081524738 0.046678636
## 8: Income_lag6 4.258574e-02 0.046992827 0.055655296
## 9: lag6 3.363114e-02 0.048096377 0.068222621
## 10: Income_lag2 2.079926e-02 0.049705720 0.044883303
## 11: Income_lag5 1.900147e-02 0.046533015 0.043087971
## 12: lag7 1.834403e-02 0.062120655 0.046678636
## 13: lag5 1.780785e-02 0.034302005 0.032315978
## 14: Income_lag4 1.752774e-02 0.024967813 0.035906643
## 15: Income_lag7 1.154118e-02 0.060051499 0.035906643
## 16: lag4 9.921818e-03 0.039451904 0.039497307
## 17: Income_lag3 8.983342e-03 0.013150635 0.025134650
## 18: season4 3.274300e-03 0.003172706 0.007181329
## 19: season3 1.885444e-03 0.010299798 0.010771993
## 20: season2 9.052612e-06 0.001655325 0.001795332
##
## 20 features considered.
## 164 original observations.
## 156 effective observations after creating lagged features.

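The most important predictors of 'consumption' are its own values in the previous two quarters (lag2 and lag1), followed by the current value of 'income' (Income_lag0).

Forecasting with variables other than Y always raises one question: can those predictors be known in advance? If they are the month, the date, the day of the week, public holidays and other quantities that are fixed ahead of time, there is no difficulty. Many predictors, however, are hard to know in advance. The author of forecastxgb offers a tip: first forecast the future values of the explanatory variables, then feed those forecasts into the original model to forecast Y: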

income_future <- matrix(forecast(xgbar(usconsumption[,2]), h = 10)$mean, dimnames = list(NULL, "Income"))
plot(forecast(consumption_model, xreg = income_future))
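The same two-step idea can be tried without forecastxgb. The sketch below is only an analogy: it swaps xgbar() for stats::arima() and runs on simulated data, so the series, the AR(1) orders and the object names are all assumptions made for illustration.

```r
set.seed(1)
n <- 120
h <- 8   # forecast horizon

# Simulated quarterly series standing in for 'income' and 'consumption'
income <- ts(as.numeric(arima.sim(list(ar = 0.5), n = n)), frequency = 4)
consumption <- ts(0.6 * as.numeric(income) + rnorm(n, sd = 0.3), frequency = 4)

# Step 1: forecast the explanatory variable on its own
income_fit <- arima(income, order = c(1, 0, 0))
income_future <- predict(income_fit, n.ahead = h)$pred

# Step 2: fit the model for Y with the regressor, then forecast Y
# using the forecast regressor values as the future xreg
cons_fit <- arima(consumption, order = c(1, 0, 0),
                  xreg = cbind(Income = as.numeric(income)))
cons_fc <- predict(cons_fit, n.ahead = h,
                   newxreg = cbind(Income = as.numeric(income_future)))$pred
```

Note that the uncertainty in the forecast of the explanatory variable is not carried through to the forecast of Y in this approach.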

3 Overview of the core function of the forecastxgb package

(1) The core function xgbar():

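forecastxgb applies the xgboost algorithm ("xgb") in an autoregressive ("ar") fashion: the core function xgbar() predicts Y from its own lags (Yt-1, …, Yt-n) together with the explanatory variables X and their lags (Xt-1, …, Xt-n):

xgbar(y, xreg = NULL, maxlag = max(8, 2 * frequency(y)), nrounds = 100, nrounds_method = c("cv", "v", "manual"), nfold = ifelse(length(y) > 30, 10, 5), lambda = 1, verbose = FALSE, seas_method = c("dummies", "decompose", "fourier", "none"), K = max(1, min(round(f/2 - 1), 10)), trend_method = c("none", "differencing"), ...)

Parameters:

y: the response variable; it must be a univariate time series.

xreg: the explanatory variables, used when additional variables help predict the response; it must have the same number of rows as y.

maxlag: the maximum number of lags of y and of the explanatory variables (if any).

nrounds: the maximum number of iterations xgboost() will perform.

nrounds_method: how the number of iterations actually used by xgboost is determined. With nrounds_method = "cv", it is chosen by cross-validation, with nrounds as the upper limit; with nrounds_method = "v", the data are split 8:2 into a training set and a test set for validation; with nrounds_method = "manual", xgboost() runs nrounds iterations on the full dataset.

nfold: the number of folds used when nrounds_method = "cv".

lambda: the coefficient of the transformation applied to y. The transformation is similar to Box-Cox but also works when y contains negative values. It is applied before xgboost() is called, and the results are transformed back to the original scale afterwards. With the default lambda = 1, y is not transformed. The transformation applies only to y, not to x.

verbose: with the default verbose = FALSE, only the final number of iterations is shown; with verbose = TRUE, the error at each iteration is shown.

seas_method: how the seasonality of y is handled, one of "dummies", "decompose", "fourier" and "none". With seas_method = "dummies" (the default) or "fourier", seasonal indicator predictors are created; with seas_method = "decompose", y is seasonally decomposed before xgboost() is used; with seas_method = "none", the seasonality of y is not treated.

K: when seas_method = "fourier", K determines the order of the Fourier series.

trend_method: how the trend in y is handled. The default is trend_method = "none"; with trend_method = "differencing", the series is differenced in the ARIMA style, with the order of differencing chosen by a KPSS test so that the remaining series is stationary.

...: further arguments passed to xgboost(); these can be used when nrounds_method = "cv" or "manual". See the xgboost package for the full list of parameters.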
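To make the lambda parameter more concrete, here is a minimal base R sketch of a modulus power transformation. It assumes the John and Draper form of the transformation, which is an assumption about the exact formula the package uses; the function names modulus_transform() and modulus_inverse() are hypothetical and not part of forecastxgb.

```r
# Modulus power transformation (John and Draper form, assumed here).
# Unlike Box-Cox, it is defined for negative values of y.
modulus_transform <- function(y, lambda = 1) {
  if (lambda != 0) {
    sign(y) * ((abs(y) + 1)^lambda - 1) / lambda
  } else {
    sign(y) * log(abs(y) + 1)
  }
}

# Inverse transformation, used to return to the original scale
modulus_inverse <- function(z, lambda = 1) {
  if (lambda != 0) {
    sign(z) * ((abs(z) * lambda + 1)^(1 / lambda) - 1)
  } else {
    sign(z) * (exp(abs(z)) - 1)
  }
}

y <- c(-10, -1, 0, 2.5, 40)
z <- modulus_transform(y, lambda = 0.5)   # transformed scale
y_back <- modulus_inverse(z, lambda = 0.5)  # back on the original scale
```

With lambda = 1 the formula reduces to sign(y) * abs(y), which is y itself, matching the statement that the default leaves y untransformed.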
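(2) How the treatment of seasonality in Y changes the predictor set:
Besides the lags, the set of predictors depends on the seas_method argument:
With seas_method = "none", the seasonality of Y is not treated, so no seasonal predictors appear. With seas_method = "decompose", Y is seasonally decomposed and the adjusted series Y' is used as the response, so no seasonal predictors appear either.

With seas_method = "dummies" or "fourier", however, predictors that express the seasonality of Y are constructed and passed to xgboost(), so the predictor set contains seasonal predictors in addition to the lags.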
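The two kinds of seasonal predictors can be sketched in base R as follows. This is an illustration only: how forecastxgb builds these columns internally is not shown in this post, and the Fourier column names (S1, C1, ...) are hypothetical. The series is a quarterly placeholder with 164 observations, the same length as the consumption series above.

```r
f <- 4      # quarterly frequency
n <- 164    # number of observations
y <- ts(rep(0, n), frequency = f)   # placeholder values

# "dummies": one indicator column per season, first season as baseline
season <- factor(as.integer(cycle(y)))
dummies <- model.matrix(~ season)[, -1, drop = FALSE]

# "fourier": K pairs of sine and cosine terms;
# the default K is taken from the xgbar() signature
K <- max(1, min(round(f / 2 - 1), 10))
t_idx <- seq_len(n)
fourier_terms <- do.call(cbind, lapply(seq_len(K), function(k) {
  cbind(sin(2 * pi * k * t_idx / f), cos(2 * pi * k * t_idx / f))
}))
colnames(fourier_terms) <- paste0(rep(c("S", "C"), times = K),
                                  rep(seq_len(K), each = 2))
```

For quarterly data the dummy columns are named season2, season3 and season4, the same names that appear in the importance table of the consumption model.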


xuefliang

The forecastxgb package
