The goal of this lecture is to bring together knowledge about regression, time series trends, and autocorrelation to obtain correct inference in regression problems involving time series. You should become capable of suggesting a model that meets the goal of the data analysis (testing certain relationships) while satisfying modeling assumptions (e.g., uncorrelated residuals) and minimizing the risk of spurious results.
Objectives
Model autocorrelated residuals (in other words, incorporate or account for autocorrelation) so the final residuals satisfy modeling assumptions.
Apply and interpret Granger causality that is based on time series predictability.
Reading materials
Chapters 3.8, 5.5, and 5.6 in Shumway and Stoffer (2017)
Chapter 3 in Kirchgässner and Wolters (2007) on Granger causality
9.1 Introduction
Here we explore a regression model with autocorrelated residuals. We already know how to
forecast, construct prediction intervals, and test hypotheses about regression coefficients when the residuals satisfy the ordinary least squares (OLS) assumptions, i.e., the residuals are white noise and have a joint normal distribution \(N(0, \sigma^{2})\);
model serial dependence (i.e., autocorrelation) in stationary univariate time series, such as using ARMA models.
Here we bring these two skills together.
If the residuals do not satisfy the independence assumption, we can sometimes describe them in terms of a specific correlation model involving one or more new parameters and some ‘new residuals.’ These new residuals essentially satisfy the OLS assumptions and can replace the original correlated residuals (which we will sometimes call the ‘old’ residuals) in the model.
When modeling time series, we often want to find so-called leading indicators, i.e., exogenous variables whose lagged values can be used for predicting the response, so for the forecasts we do not need to predict \(X\)-variables that are employed in a model. Hence, our models may include lagged versions of both the response and predictor variables (a general form of such a model was given in Equation 2.2).
9.2 Cross-correlation function
Cross-correlation is a correlation between different time series, typically with a time offset \(k\) (a.k.a. lag): \[
\mathrm{cor}(X_{t+k},Y_t).
\tag{9.1}\] (Recall that autocorrelation is also a correlation with a lag present, but for the same time series.) By calculating cross-correlations at each lag \(k = 0, \pm 1, \pm 2, \dots\), we obtain the cross-correlation function (CCF). CCF shows how the correlation coefficient varies with the lag \(k\) and helps us to identify important lags and find leading indicators for predictive models. For example, the relationship between nutrient input and algal blooms in a marine environment is not immediate since the algae need time to reproduce. The correlation between a time series of births and demand for baby formula would be distributed over smaller lags \(l\)\[
\mathrm{cor}(\mathrm{Births}_{t - l}, \mathrm{FormulaDemand}_t)
\] than the correlation lags \(L\) between births and school enrollments \[
\mathrm{cor}(\mathrm{Births}_{t - L}, \mathrm{SchoolEnrollment}_t),
\] where \(l,L > 0\) and \(l < L\).
The functions ccf() and forecast::ggCcf() estimate the cross-correlation function and plot it as a base-R or a ggplot2 graph, respectively. These functions follow the notation of Equation 9.1, hence when using ccf(x, y) we are typically interested in identifying negative \(k\)'s, which correspond to correlations between past values of the predictor \(X_t\) and current values of the response \(Y_t\).
For example, Figure 9.1 shows the CCF for two time series of sales that have been detrended by taking differences. From this plot, lagged values of the first series, \(\Delta \mathrm{BJsales.lead}_{t - 2}\) and \(\Delta \mathrm{BJsales.lead}_{t - 3}\), are significantly correlated with current values of the second series, \(\Delta \mathrm{BJsales}_{t}\). Hence, we can call BJsales.lead a leading indicator for BJsales.
The function ccf() assumes that the two time series are evenly spaced and have no missing data. We can use the argument na.action = na.pass to ignore any missing values.
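As a minimal sketch, the kind of CCF estimates shown in Figure 9.1 could be obtained for the detrended BJsales series as follows (the choice of lag.max is illustrative):

x <- diff(BJsales.lead)
y <- diff(BJsales)
ccf(x, y, lag.max = 15, na.action = na.pass)  # base-R plot; negative k's relate past x to current y
forecast::ggCcf(x, y, lag.max = 15)           # the same estimates as a ggplot2 graph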
Note that the 95% confidence band shown in Figure 9.1 is based on how i.i.d. series of length \(n\) would correlate (the same way the band is produced for an autocorrelation function with acf()). However, when \(n\) is small, there are not so many pairs \((X_{t+k}, Y_t)\) to calculate the correlations at large lags, hence the confidence bands should be adjusted by using the smaller \(n'(k) = n - |k|\) to account for this lack of confidence.
Another concern is the autocorrelation of individual time series that may lead to spurious cross-correlation (Dean and Dunsmuir 2016). Intuitively, say if \(X_t\) and \(Y_t\) are independent from each other but both \(X_t\) and \(Y_t\) are positively autocorrelated – high (or low) values in \(X_t\) follow each other, and high (or low) values in \(Y_t\) follow each other – there could be a time alignment among our considered lags \(k = 0, \pm 1, \pm 2, \dots\) when the ups (or downs) in \(X_{t+k}\) match those in \(Y_t\). This may result in spurious cross-correlation. To address the concern of autocorrelation, we can apply bootstrapping to approximate the distribution of cross-correlation coefficients corresponding not to the i.i.d. series but to stationary time series with the same autocorrelation structure as the input data. The function funtimes::ccf_boot() implements the sieve bootstrap and returns results for both Pearson and Spearman correlations (Figure 9.2).
Code
set.seed(123)
res <- funtimes::ccf_boot(diff(BJsales.lead), diff(BJsales), B = 10000, plot = "none")
p1 <- res %>%
  ggplot(aes(x = Lag, y = r_P, xend = Lag, yend = 0)) +
  geom_ribbon(aes(ymin = lower_P, ymax = upper_P), fill = 4) +
  geom_point() +
  geom_segment() +
  geom_hline(yintercept = 0) +
  ylab("Pearson correlation")
p2 <- res %>%
  ggplot(aes(x = Lag, y = r_S, xend = Lag, yend = 0)) +
  geom_ribbon(aes(ymin = lower_S, ymax = upper_S), fill = 4) +
  geom_point() +
  geom_segment() +
  geom_hline(yintercept = 0) +
  ylab("Spearman correlation")
p1 + p2 + plot_annotation(tag_levels = 'A')
Finally, remember that the reported Pearson correlations in Figure 9.1 and Figure 9.2 A measure the direction and strength of linear relationships, while Spearman correlations in Figure 9.2 B correspond to monotonic relationships. To further investigate whether the lagged relationships are linear, monotonic, or of some other form, we need to plot the lagged scatterplots (Figure 9.3). Also see other functions in the package astsa, such as the function astsa::lag1.plot() for exploring nonlinear autocorrelations.
9.3 Regression with autocorrelated residuals
We begin with the simplest case, a constant mean model, where the residuals are serially correlated and follow an AR(1) model, that is \[
Y_{t} = \beta_{0} + \epsilon_{t},
\tag{9.2}\] where \(\epsilon_{t} \sim\) AR(1), i.e., \[
\epsilon_{t} = \phi \epsilon_{t - 1} + a_{t}.
\tag{9.3}\]
Here \(\phi\) is a real number satisfying \(0 < |\phi| < 1\); \(a_{t}\) is white noise with zero mean and the variance \(\nu^{2}\), i.e., \(a_{t} \sim \mathrm{WN}(0,\nu^2)\). We also assume that \(\mathrm{cov}(a_{t}, \epsilon_{s}) = 0\) for all \(s < t\) (i.e., that the residuals \(\epsilon_s\) are not correlated with the future white noise \(a_t\)).
Equation 9.3 allows us to express Equation 9.2 as \[
Y_{t} = \beta_{0} + \phi \epsilon_{t - 1} + a_{t}.
\tag{9.4}\]
The advantage of this representation is that the new residuals \(a_{t}\) satisfy the OLS assumptions. In particular, since \(a_{t}\) is white noise, \(a_{t}\) is homoskedastic and uncorrelated.
We shall also assume that \(a_{t}\) are normally distributed (for justification of the construction of confidence intervals and prediction intervals). However, even if \(a_{t}\) are not normal, \(a_{t}\) are uncorrelated, which is a big improvement over the serially correlated \(\epsilon_{t}\).
Our goal is to remove the \(\epsilon_{t}\) entirely from the constant mean model and replace them with \(a_{t}\) acting as new residuals. This can be done as follows. First, write Equation 9.2 for \(t-1\) and multiply both sides by \(\phi\), \[
\phi Y_{t -1} = \phi \beta_{0} + \phi \epsilon_{t - 1}.
\tag{9.5}\]
Subtracting Equation 9.5 from Equation 9.2 and noting from Equation 9.3 that \(\epsilon_{t} - \phi \epsilon_{t-1} = a_{t}\), we can rewrite the constant mean model in Equation 9.2 as \[
Y_{t} = (1 - \phi) \beta_{0} + \phi Y_{t - 1} + a_{t}.
\tag{9.6}\]
In general, for any multiple linear regression \[
Y_{t} = \beta_{0} + \sum^{k}_{j =1} \beta_{j} X_{t,j} + \epsilon_{t}, ~~ \text{where} ~~ \epsilon_{t} \sim \mbox{AR(1)},
\tag{9.7}\] we can perform a similar procedure of eliminating \(\epsilon_{t}\).
This elimination procedure leads to the alternate expression \[
Y_{t} = (1 - \phi) \beta_{0} + \phi Y_{t -1} + \sum^{k}_{j = 1} \beta_{j} (X_{t,j} - \phi X_{t-1, j}) + a_{t},
\tag{9.8}\] where \(a_{t}\) is white noise, i.e., homoskedastic with constant (zero) mean and uncorrelated. See Appendix B describing the method of generalized least squares and an example of \(k=1\).
Note that rewriting the model in this way pulls the autocorrelation parameter for the old residuals, \(\phi\), into the regression part of the model. Thus there are now \(k + 2\) unknown regression parameters (\(\beta_{0}, \beta_{1}, \dots, \beta_{k}\), and \(\phi\)). The introduction of an additional parameter into the regression part of the model can be regarded as the price to be paid for extracting new residuals \(a_{t}\) that satisfy the OLS assumptions.
Note that the new residuals \(a_{t}\) have smaller variance than the \(\epsilon_{t}\). In fact, \[
\begin{split}
\sigma^{2} & = \mathrm{var} (\epsilon_{t}) = \mathrm{var} (\phi \epsilon_{t - 1} + a_{t}) \\
& = \phi^{2} \mathrm{var}(\epsilon_{t - 1}) + \mathrm{var} (a_{t}) \quad \text{since } \mathrm{cov}(a_{t}, \epsilon_{t - 1}) = 0\\
& = \phi^{2}\sigma^{2} + \nu^{2},
\end{split}
\] leading to the relation \[
\nu^{2} = \sigma^{2} (1 - \phi^{2} ).
\tag{9.9}\]
However, comparing Equation 9.6 with Equation 9.2, and Equation 9.8 with Equation 9.7, we see that the rewritten form of the model is not linear in the parameters \(\beta_{0}, \beta_{1}, \dots, \beta_{k}\) and \(\phi\). For example, the intercept term \((1 - \phi) \beta_{0}\) involves a product of two of the parameters. This nonlinearity makes OLS, implemented in the R functions lm() and lsfit(), a poor method for obtaining parameter estimates. Instead, we will use the method of maximum likelihood (ML) carried out through R functions such as arima().
The function arima() allows us to input the model in its original form, as in Equation 9.7. It then internally rewrites the model to put it in the form of Equation 9.8. (So we do not have to rewrite the model!) It then makes the assumption that the \(a_t\) are normal and constructs the multivariate normal likelihood function \[
L (Y_{1} , \dots , Y_{n}; Q ),
\] where \(n\) is the sample size and \(Q\) is the vector of all unknown parameters. In general, for an AR(1) model with \(k\) original predictors, we have \(Q = (\phi, \beta_{0}, \beta_{1}, \dots, \beta_{k}, \nu^{2})\). Recall that \(\nu^{2} = \mathrm{var}(a_{t}) = \mathrm{cov}(a_{t} , a_{t})\).
The function arima() then uses the historical data \(Y_{1}, \dots, Y_{n}\) to find the parameter estimates \[
\hat{Q} = \left( \hat{\phi}, \hat{\beta}_{0} , \hat{\beta}_{1} , \dots, \hat{\beta}_{k} , \hat{\nu}^2 \right),
\] which maximize the likelihood \(L\). These estimates (and other things, such as the standard errors of the estimates) can be saved to an output object in R. We will use an example to illustrate the use and interpretation of the function arima().
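As a hedged sketch with simulated data (the variable names are illustrative, not from the lecture's datasets), the call below estimates a regression with one predictor and AR(1) errors by maximum likelihood:

set.seed(1)
x <- rnorm(100)                                # exogenous predictor
e <- arima.sim(list(ar = 0.6), n = 100)        # AR(1) 'old' residuals
y <- 2 + 3 * x + e                             # response
fit <- arima(y, order = c(1, 0, 0), xreg = x)  # estimates phi, beta_0 (the 'intercept'), beta_1, and nu^2
fit                                            # coefficient estimates and their standard errors
sqrt(diag(fit$var.coef))                       # standard errors extracted directly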
Moreover, we can extend the regression model in Equation 9.7 with AR(1) errors to a model with a more general form of errors, ARMA, by assuming \(\epsilon_t \sim \text{ARMA}(p,q)\) (see Chapter 6.6 in Brockwell and Davis 2002 and https://robjhyndman.com/hyndsight/arimax/): \[
\begin{split}
Y_t &= \sum^{k}_{j =1} \beta_{j} X_{t,j} + \epsilon_{t},\\
\epsilon_{t} &= \phi_1 \epsilon_{t-1} + \dots + \phi_p \epsilon_{t-p} + \theta_1 a_{t-1} + \dots + \theta_q a_{t-q} + a_t.
\end{split}
\tag{9.10}\] This model in Equation 9.10 can be specified in R functions arima(), fable::ARIMA(), forecast::Arima(), and forecast::auto.arima().
Note
The R package forecast has been superseded by the new package fable that uses an alternate parameterization of constants, see ?fable::ARIMA.
Note
Remember that the variables \(Y_t\) and \(X_{t,j}\) (\(j = 1,\dots,k\)) should be detrended prior to the analysis (to avoid spurious regression results, see the previous lecture on dealing with trends in regression). If differencing is chosen as the method of detrending, the orders of differences \(d\) and \(D\) can be specified directly within the mentioned ARIMA functions (so R will do the differencing for us).
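For example, a hedged sketch of such a specification with forecast::auto.arima() (y and x are placeholder series) could look as follows; setting d = 1 requests first differencing as part of the fitted model, so we do not need to difference the series beforehand:

library(forecast)
fit <- auto.arima(y, xreg = x, d = 1)  # regression with ARIMA(p,1,q) errors; p and q chosen automatically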
9.4 ARIMAX
ARIMAX (‘X’ stands for ‘external regressor’) models are closely related to Equation 9.10, but there is an important difference. For simplicity of notation, we can present an ARMAX(\(p,q\)) model that is a regular ARMA(\(p,q\)) model for \(Y_t\) plus the external regressors: \[
\begin{split}
Y_t &= \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \theta_1 a_{t-1} + \dots + \theta_q a_{t-q} + a_t\\
&+\sum^{k}_{j =1} \beta_{j} X_{t,j},
\end{split}
\tag{9.11}\] where \(a_t\) is still a zero-mean white noise process. Interestingly, at this time there is no convenient way to estimate this model in R. One could manually create lagged values of \(Y_t\) as external regressors (i.e., create new variables in R for \(Y_{t-1},\dots, Y_{t-p}\)), use these variables in the R functions mentioned above, and force \(p = 0\) (e.g., Soliman et al. 2019 used the functions from the R package forecast). A sketch of this workaround is given below.
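The following hedged sketch (with placeholder series y and x) illustrates the idea for an ARMAX(1,\(q\)) fit: the lagged response enters through xreg, while the AR order is forced to zero and only an MA part is kept for the errors.

library(forecast)
Y_lag1 <- c(NA, y[-length(y)])                       # manually created Y_{t-1}
fit_armax <- Arima(y[-1],
                   xreg = cbind(x = x[-1], Y_lag1 = Y_lag1[-1]),
                   order = c(0, 0, 1))               # AR order forced to 0; MA(1) errors here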
The difference between Equation 9.10 and Equation 9.11 is the presence of lagged values of the response variable, \(Y_t\), in Equation 9.11. As Hyndman points out, the regression coefficients \(\beta_j\) in Equation 9.11 lose their usual interpretability: they do not show the effect on \(Y_t\) of a one-unit increase in \(X_t\). Instead, the \(\beta\)'s in the ARMAX model Equation 9.11 are interpreted conditionally on the previous values of the response variable (https://robjhyndman.com/hyndsight/arimax/). Therefore, the model formulation in Equation 9.10 may be preferred.
Note
Other applicable models include mixed-effects models, such as those used for repeated measures ANOVA (e.g., implemented in R using nlme::gls() and nlme::lme(); see ?nlme::corClasses for the available correlation structures). Without a grouping factor, the results of nlme::lme(..., correlation = corARMA(...)) should be similar to estimating the model in Equation 9.10. Other regression functions often borrow the functionality (and syntax) of the package nlme for estimating random effects, so autocorrelated residuals can be incorporated into a generalized additive model (GAM, mgcv::gamm()) or a generalized additive model for location scale and shape (GAMLSS, gamlss::gamlss()), which can also be used as a plain GAM if the scale and shape parameters are not modeled. A slightly different solution is possible using a generalized autoregressive moving average model (GARMA) demonstrated in Section 9.5, see gamlss.util::garmaFit().
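A hedged sketch of the GLS route (assuming a hypothetical data frame dat with response y, predictor x, and time index t):

library(nlme)
fit_gls <- gls(y ~ x, data = dat,
               correlation = corARMA(p = 1, q = 1, form = ~ t))  # ARMA(1,1) residual correlation
summary(fit_gls)  # coefficient tests account for the residual autocorrelation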
Example: Golden tilefish ARIMAX
Here we have time series of 1918–2017 annual landings of golden tilefish (tonne) in the U.S. North Atlantic region and the Atlantic Multi-decadal Oscillation (AMO) index characterizing climatic conditions (Figure 9.4). These time series were described in Nesslage et al. (2021) with R code by Lyubchich and Nesslage (2020). The goal is to develop a regression model to explore the relationship between the landings and AMO.
#> Year Landings AMO
#> Min. :1918 Min. : 5 Min. :-0.434
#> 1st Qu.:1943 1st Qu.: 454 1st Qu.:-0.148
#> Median :1968 Median : 749 Median : 0.014
#> Mean :1968 Mean : 952 Mean : 0.001
#> 3rd Qu.:1992 3rd Qu.:1204 3rd Qu.: 0.149
#> Max. :2017 Max. :3968 Max. : 0.358
#> NA's :3
Code
p1 <- ggplot(D, aes(x = Year, y = Landings)) +
  geom_line() + xlab("Year") + ylab("Landings (tonne)")
p2 <- ggplot(D, aes(x = Year, y = AMO)) +
  geom_line() + xlab("Year") + ylab("AMO")
p1 + p2 + plot_annotation(tag_levels = 'A')
Interpolation of missing data (a.k.a. imputation) is a separate and extensive topic. For a univariate time series, linear interpolation can be implemented using forecast::na.interp(). The landings time series also required a power transformation, so the square root transformation was applied (Figure 9.5).
Based on the strongest lagged correlations (Figure 9.6), implement a model \[
\sqrt{Landings_t} = \beta_0 + \beta_{1} AMO_{t-7} + \epsilon_{t},
\] where \(\epsilon_{t} \sim\) ARMA(\(p,q\)), and the orders \(p\) and \(q\) are selected automatically based on Akaike information criterion.
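A hedged sketch of such a fit (assuming the three missing landings are interpolated as described above; the object name m_tile is illustrative):

library(forecast)
D$Landings <- as.numeric(na.interp(D$Landings))  # linear interpolation of the missing landings
AMO_lag7 <- dplyr::lag(D$AMO, 7)                 # leading indicator at lag 7
keep <- !is.na(AMO_lag7)
m_tile <- auto.arima(sqrt(D$Landings[keep]), xreg = AMO_lag7[keep],
                     d = 0, ic = "aic")          # ARMA(p,q) errors selected by AIC
summary(m_tile)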
We forgot to check the stationarity of the time series! The landings series appears to behave like a unit-root process, so differencing is needed; hence, the model is modified by applying differencing (the order \(d\) can be specified directly in the ARIMA fitting functions).
9.5 GARMA
Let \(Y_t\) be the observed time series and \(\boldsymbol{X_t}\) exogenous regressors. Then, we can model the conditional distribution of \(Y_t\), given \(Y_1,\dots,Y_{t-1}\), \(\boldsymbol{X_1,\dots,X_{t}}\), as \[
g(\mu_t)=\boldsymbol{X}'_t\beta+\sum_{j=1}^p\phi_j\{g(Y_{t-j})- \boldsymbol{X}'_{t-j}\beta\}
+\sum_{j=1}^q\theta_j \{g(Y_{t-j})-g(\mu_{t-j})\},
\tag{9.12}\] where \(g(\cdot)\) is an appropriate link function; \(\mu_t\) is a conditional mean of the dependent variable; \(\boldsymbol{\beta}\) is a vector of regression coefficients; \(\phi_j\), \(j=1,\dots, p\), are the autoregressive coefficients; \(\theta_j\), \(j=1,\dots,q\), are the moving average coefficients, while \(p\) and \(q\) are the autoregressive and moving average orders, respectively.
In certain cases, the function \(g(\cdot)\) requires some transformation of the original series \(Y_{t-j}\) to avoid the non-existence of \(g(Y_{t-j})\)(Benjamin et al. 2003).
The generalized autoregressive moving average model (Equation 9.12), GARMA(\(p,q\)), represents a flexible observation-driven modification of the classical Box–Jenkins methodology and GLMs for integer-valued time series. GARMA further advances the classical Gaussian ARMA model to a case where the distribution of the dependent variable is not only non-Gaussian but can be discrete. The dependent variable is assumed to belong to a conditional exponential family distribution given the past information of the process and thus the GARMA can be used to model a variety of discrete distributions (Benjamin et al. 2003). The GARMA model also extends the work of Zeger and Qaqish (1988), who proposed an autoregressive exponential family model, and Li (1994), who introduced its moving average counterpart.
Example: Insurance claims GARMA model
Consider the weekly number of house insurance claims related to water and weather damage in one Canadian city. The number is standardized by the daily number of insured properties in that city. Explore the relationship between the number of claims and weekly total precipitation (mm).
Based on the distribution plots in Figure 9.7, the data are highly right-skewed (have heavy right tails). The number of claims is also a discrete variable. Therefore, we deal with non-normal distributions and need to use generalized-type models, like the generalized linear or additive models (GLMs or GAMs). Since there are many zeros in the counts of claims, we can start with the zero-adjusted Poisson distribution to model the number of claims (Gupta et al. 1996; Stasinopoulos and Rigby 2007).
In our case, there is just a slight chance that past-week precipitation affects the current-week insurance claims. Hence, we will still keep the current precipitation and additionally explore the lagged effects, using the cross-correlation function (Figure 9.8).
Based on the estimated CCFs (Figure 9.8), past-week precipitation is significantly correlated with the current-week number of claims, so we can add the lagged predictor into our models.
Code
Insurance <- Insurance %>%
  mutate(Precipitation_lag1 = dplyr::lag(Precipitation, 1),
         Week = 1:nrow(Insurance),
         Year = rep(2002:2011, each = 52),
         Claims_ln = log(Claims + logconstant))
Based on Figure 9.9, there might be an increasing trend in the number of claims that we might be able to approximate with a linear function.
Code
p1 <- ggplot2::autoplot(as.ts(log(Insurance$Claims + logconstant))) +
  xlab("Week") + ylab("ln(Number of claims)")
p2 <- forecast::ggAcf(as.ts(log(Insurance$Claims + logconstant)), lag.max = 110) +
  ggtitle("") + xlab("Lag (weeks)")
p1 + p2 + plot_annotation(tag_levels = 'A')
Plot the data once again after the transformations (Figure 9.10).
Fit a GARMA model relating the weekly number of insurance claims to the total precipitation during that and previous weeks.
# The model function doesn't accept NAs, so remove them
Insurance_noNA <- na.omit(Insurance)
library(gamlss.util)
m00zip <- garmaFit(Claims ~ Precipitation + Week + Precipitation_lag1,
                   family = ZIP,
                   data = Insurance_noNA)
#> deviance of linear model= 3014
Obtain ACF and PACF plots of the model residuals to select ARMA order (Figure 9.11).
Code
p1 <- forecast::ggAcf(m00zip$residuals) +
  ggtitle("") + xlab("Lag (weeks)")
p2 <- forecast::ggAcf(m00zip$residuals, type = "partial") +
  ggtitle("") + xlab("Lag (weeks)")
p1 + p2 + plot_annotation(tag_levels = 'A')
Based on the observed ACF and PACF patterns in Figure 9.11, an appropriate model for the temporal dependence could be ARMA(3,0). Refit the GARMA model specifying these orders. Then verify that the temporal dependence in residuals was removed (Figure 9.12), and assess other assumptions (Figure 9.13), including homogeneity and normality of the quantile residuals.
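A hedged sketch of the refit, assuming the order argument of garmaFit() takes the ARMA orders as c(p, q):

m01zip <- garmaFit(Claims ~ Precipitation + Week + Precipitation_lag1,
                   order = c(3, 0),   # ARMA(3, 0) dependence
                   family = ZIP,
                   data = Insurance_noNA)
summary(m01zip)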
#> ******************************************************************
#> Summary of the Randomised Quantile Residuals
#> mean = 0.257
#> variance = 1.98
#> coef. of skewness = 1.23
#> coef. of kurtosis = 6.76
#> Filliben correlation coefficient = 0.967
#> ******************************************************************
See Dunn and Smyth (1996) for the details on randomized quantile residuals. Overall, their poor correspondence with the standard normal distribution shows a lack of fit of the model. See ?gamlss.family for other distribution families, continuous and discrete; somewhat out-of-date tables with many of these distributions listed are available from Stasinopoulos and Rigby (2007).
Here we try one other distribution appropriate for modeling overdispersed count data – negative binomial distribution. See the model summary below and the verification of residuals in Figure 9.14 and Figure 9.15.
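A hedged sketch of the refit with the negative binomial family (same structure as before; NBI is the negative binomial type I family of gamlss):

m01nbi <- garmaFit(Claims ~ Precipitation + Week + Precipitation_lag1,
                   order = c(3, 0),
                   family = NBI,
                   data = Insurance_noNA)
summary(m01nbi)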
#> ******************************************************************
#> Summary of the Randomised Quantile Residuals
#> mean = 0.0133
#> variance = 1.06
#> coef. of skewness = 0.682
#> coef. of kurtosis = 7.24
#> Filliben correlation coefficient = 0.979
#> ******************************************************************
The residual diagnostics look better for the latter model. One could also consider modeling a strictly periodic component, for example, using Fourier series; however, Figure 9.9 did not show prominent seasonality.
9.6 GAMLSS
Stasinopoulos and Rigby (2007) provide an extension of a generalized additive model to \(k=1,2,3,4\) parameters \(\boldsymbol{\theta}_k\) of a distribution in a so-called generalized additive model for location scale and shape (GAMLSS). The \(k\) parameters represent the location \(\mu\) (this is what we typically model with regression), the scale \(\sigma\), and two shape parameters: skewness and kurtosis. The model can be further generalized for \(k>4\) if needed. It allows us to fit \(k\) individual models to study relationships between regressors \(\boldsymbol{x}\) and parameters of the response distribution: \[
g_k(\boldsymbol{\theta}_k) = h_k\left(\boldsymbol{X}_k,\boldsymbol{\beta}_k\right) + \sum_{j=1}^{J_k}h_{jk}(\boldsymbol{x}_{jk}),
\tag{9.13}\] where \(k=1\) produces model for the mean; \(h_k(\cdot)\) and \(h_{jk}(\cdot)\) are nonlinear functions; \(\boldsymbol{\beta}_k\) is a parameter vector of length \(J_k\); \(\boldsymbol{X}_k\) is an \(n\times J_k\) design matrix; \(\boldsymbol{x}_{jk}\) are vectors of length \(n\). The terms in the GAMLSS provide a flexible framework to specify nonlinearities, random effects, and correlation structure as in the mixed effects models (Zuur et al. 2009); see Table 3 by Stasinopoulos and Rigby (2007) for the possible specifications of the additive terms. Hence, a model of the form Equation 9.13 may accommodate non-normal distributions, possibly nonlinear relationships, and spatiotemporal dependencies in the data.
Example: Insurance claims GAMLSS
Here, we use insights from the latest GARMA model and specify the same regressors and distribution family of the response variable in a GAMLSS.
Negative binomial distribution has two parameters (see ?NBI) – location and scale – so the GAMLSS using this distribution family can include up to two equations (\(k = 1, 2\) in Equation 9.13).
We will keep a linear relationship between claims and the week number (parametric linear trend; note that the variable Week here is the time index, not the week of the year) but will use nonparametric additive smooths for the relationships between number of claims and precipitation amounts.
We will model the scale (variability) of the number of claims using a smooth term of the week number.
We start by fitting a model as follows: \[
\begin{split}
Claims_t &\sim NegBin(\mu_t, \sigma_t) \\
\ln(\mu_t) &= a_0 + a_1 Week_t + f_1(Precipitation_t) + f_2(Precipitation_{t-1}) + \epsilon_t \\
\ln(\sigma_t) &= b_0 + f_3(Week_t) \\
\end{split}
\tag{9.14}\] where \(\epsilon_t \sim \mathrm{WN}(0,\sigma^2)\); \(a_0\), \(a_1\), and \(b_0\) are parametric coefficients; \(f_1\), \(f_2\), and \(f_3\) are nonparametric smooths.
In its expandable collection of smoothers, the R package gamlss features penalized B-splines. See the help files ?gamlss and ?pb for more details and the review of spline functions in R by Perperoglou et al. (2019). However, we used the original P-splines ps() because they provided more visually smooth results than pb() under the default settings. Note that the model summary below has a table for each parameter of the distribution. The obtained term plots are in Figure 9.16 and Figure 9.17.
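The fit below mirrors the Call reported in the model summary that follows (ps() are the P-spline smooth terms, and sigma.formula specifies the model for the scale parameter):

library(gamlss)
m00_gamlss <- gamlss(Claims ~ ps(Precipitation) + Week + ps(Precipitation_lag1),
                     sigma.formula = ~ ps(Week),
                     family = NBI, data = Insurance_noNA,
                     control = gamlss.control(c.crit = 0.01, trace = FALSE))
summary(m00_gamlss)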
#> ******************************************************************
#> Family: c("NBI", "Negative Binomial type I")
#>
#> Call: gamlss(formula = Claims ~ ps(Precipitation) + Week +
#> ps(Precipitation_lag1), sigma.formula = ~ps(Week),
#> family = NBI, data = Insurance_noNA, control = gamlss.control(c.crit = 0.01,
#> trace = FALSE))
#>
#> Fitting method: RS()
#>
#> ------------------------------------------------------------------
#> Mu link function: log
#> Mu Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.572497 0.092566 6.18 1.3e-09 ***
#> ps(Precipitation) 0.014442 0.002691 5.37 1.2e-07 ***
#> Week 0.001520 0.000269 5.64 2.8e-08 ***
#> ps(Precipitation_lag1) 0.011326 0.002908 3.90 0.00011 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> Sigma link function: log
#> Sigma Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.371642 0.218617 -1.70 0.08975 .
#> ps(Week) -0.002379 0.000717 -3.32 0.00098 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> NOTE: Additive smoothing terms exist in the formulas:
#> i) Std. Error for smoothers are for the linear effect only.
#> ii) Std. Error for the linear terms maybe are not accurate.
#> ------------------------------------------------------------------
#> No. of observations in the fit: 519
#> Degrees of Freedom for the fit: 15
#> Residual Deg. of Freedom: 504
#> at cycle: 6
#>
#> Global Deviance: 2214
#> AIC: 2244
#> SBC: 2308
#> ******************************************************************
Code
gamlss.ggplots::fitted_terms(m00_gamlss, rug = TRUE, nrow = 1, what = "mu")
Code
gamlss.ggplots::fitted_terms(m00_gamlss, rug = TRUE, nrow = 1, what = "sigma")
For the correlation estimates to work, we need to specify a grouping factor for the random effects. To create a grouping factor, we added a variable with the city ID; here it has only one level "City A" and hence does not affect the estimates substantially. If we had information from multiple cities, the specified form of autoregressive structure form = ~Week|City could readily accommodate it.
The code below estimates the model from Equation 9.14 again but assumes the new autocorrelation structure in the residuals: \(\epsilon_t \sim\) ARMA(0,3). The control arguments change the default settings to speed up the algorithm convergence. See the model summary below and the verification of residuals in Figure 9.18.
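A hedged sketch of one possible specification (the exact call may differ): the re() term from gamlss passes the grouping factor City and an ARMA(0,3) correlation structure on to nlme.

library(gamlss)
library(nlme)
m01_gamlss <- gamlss(Claims ~ ps(Precipitation) + Week + ps(Precipitation_lag1) +
                       re(random = ~ 1 | City,
                          correlation = corARMA(q = 3, form = ~ Week | City)),
                     sigma.formula = ~ ps(Week),
                     family = NBI, data = Insurance_noNA,
                     control = gamlss.control(c.crit = 0.01, trace = FALSE))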
#> ******************************************************************
#> Summary of the Randomised Quantile Residuals
#> mean = 0.0141
#> variance = 0.924
#> coef. of skewness = 0.446
#> coef. of kurtosis = 5.68
#> Filliben correlation coefficient = 0.987
#> ******************************************************************
Hence, compared with the GARMA model from the previous section, the GAMLSS Equation 9.14 uses the same predictors and distribution family and specifies ARMA(0,3) dependence in the residuals. What differs is that the GAMLSS allows nonlinear relationships between the number of claims and precipitation and models the variability changing nonlinearly across weeks.
9.7 Granger causality
Let \(I_t\) be the total information set available at the time \(t\). This information set includes the two time series \(X\) and \(Y\). Let \(\bar{X}_t\) be the set of all current and past values of \(X\), i.e., \(\bar{X}_t = \{X_{t}, X_{t-1}, \dots, X_{t-k}, \dots \}\), and analogously \(\bar{Y}_t\) for \(Y\). Let \(\sigma^2(\cdot)\) be the variance of the corresponding forecast error.
Note
The difference of two sets, \(A\) and \(B\), is denoted by \(A \setminus B\); sometimes the minus sign is used, \(A - B\).
Granger causality
\(X\) is (simply) Granger causal to \(Y\) if future values of \(Y\) can be predicted better, i.e., with a smaller forecast error variance, if current and past values of \(X\) are used: \[
\sigma^2(Y_{t+1}|I_t) < \sigma^2(Y_{t+1}|I_t \setminus \bar{X}_t).
\tag{9.15}\]
Instantaneous Granger causality
\(X\) is instantaneously Granger causal to \(Y\) if the future value of \(Y\), \(Y_{t+1}\), can be predicted better, i.e., with a smaller forecast error variance, if the future value of \(X\), \(X_{t+1}\), is used in addition to the current and past values of \(X\): \[
\sigma^2(Y_{t+1}|\{I_t, X_{t+1}\}) < \sigma^2(Y_{t+1}|I_t ).
\tag{9.16}\]
Feedback
There is feedback between \(X\) and \(Y\) if \(X\) is causal to \(Y\) and \(Y\) is causal to \(X\). Feedback is only defined for the case of simple causal relations.
The test for Equation 9.15 and Equation 9.16 is, essentially, an \(F\)-test comparing two nested models: with additional predictors \(X\) and without. In other words, consider the model: \[
Y_t = \beta_0 + \sum_{k=1}^{k_1}\beta_k Y_{t-k} + \sum_{k=k_0}^{k_2}\alpha_k X_{t-k} + U_t
\tag{9.17}\] with \(k_0 = 1\). An \(F\)-test is applied to test the null hypothesis, H\(_0\): \(\alpha_1 = \alpha_2 = \dots = \alpha_{k_2} = 0\). By switching \(X\) and \(Y\) in Equation 9.17, it can be tested whether a simple causal relation from \(Y\) to \(X\) exists. There is a feedback relation if the null hypothesis is rejected in both directions (\(X\rightarrow Y\) and \(Y\rightarrow X\)). To test whether there is an instantaneous causality, we finally set \(k_0 = 0\) and perform a \(t\) or \(F\)-test for the null hypothesis H\(_0\): \(\alpha_0 = 0\).
The problem with this test is that the results depend strongly on the number of lags of the explanatory variable, \(k_2\). There is a trade-off: the more lagged values we include, the better the influence of this variable can be captured, which argues for a high maximal lag. On the other hand, the more lagged values are included, the lower the power of the test (Chapter 3 of Kirchgässner and Wolters 2007). Two general procedures can be used to select the lags: inspecting the sensitivity of results to different \(k_2\) (sensitivity analysis) or using one of the information criteria for model selection.
Example: Insurance claims and precipitation Granger causality test
For example, use the insurance data to test the relationships between the log-transformed number of home insurance claims and precipitation. A quick test has been performed to check that the time series are stationary and do not have a strong seasonality, so the chance of detecting spurious relationships is minimized. We can use these series in Equation 9.17.
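A hedged sketch of the two directions of the test with lag order 1 (the second call corresponds to the output shown below):

library(lmtest)
grangertest(Claims_ln ~ Precipitation, order = 1, data = Insurance_noNA)  # does precipitation Granger-cause claims?
grangertest(Precipitation ~ Claims_ln, order = 1, data = Insurance_noNA)  # reverse direction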
#> Granger causality test
#>
#> Model 1: Precipitation ~ Lags(Precipitation, 1:1) + Lags(Claims_ln, 1:1)
#> Model 2: Precipitation ~ Lags(Precipitation, 1:1)
#> Res.Df Df F Pr(>F)
#> 1 515
#> 2 516 -1 0.2 0.65
Reverse testing does not confirm a statistically significant Granger causality. Hence, we do not have enough evidence to claim that insurance claims are a Granger cause of precipitation (we could expect this conclusion based on our knowledge of how things work), so there is no evidence of feedback between losses and precipitation.
The function lmtest::grangertest() does not allow us to set order = 0 (to test instantaneous Granger causality), but we can do it manually. First, to show that ‘manual’ results match the lmtest::grangertest() output, repeat the test above using two nested models with lag 1.
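A hedged sketch of this 'manual' comparison, followed by setting \(k_0 = 0\) for the instantaneous test (replicating the direction of the output above; variable names are illustrative):

d <- na.omit(data.frame(y  = Insurance_noNA$Precipitation,
                        y1 = dplyr::lag(Insurance_noNA$Precipitation, 1),
                        x1 = dplyr::lag(Insurance_noNA$Claims_ln, 1),
                        x0 = Insurance_noNA$Claims_ln))
m_restricted <- lm(y ~ y1, data = d)      # without lagged claims
m_full <- lm(y ~ y1 + x1, data = d)       # with lagged claims, as in grangertest()
anova(m_restricted, m_full)               # F-test of H0: alpha_1 = 0
m_inst <- lm(y ~ y1 + x1 + x0, data = d)  # add the contemporaneous term (k0 = 0)
anova(m_full, m_inst)                     # test of instantaneous causality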
As an extension to the demonstrated techniques for testing Granger causality, we may consider more complex generalized and additive models, to relax the assumptions of normality and linearity in the modeling process.
Note
The lmtest::grangertest() options set the same value for both \(k_1\) and \(k_2\) in Equation 9.17; in our example, \(k_1 = k_2 = 1\). The 'manual' test using the function anova() can be used for models with \(k_1 = k_2\) or \(k_1 \neq k_2\).
9.8 About predictions
Let \(\hat{Y}_T(h)\) be a forecast \(h\) steps ahead made at the time \(T\). If \(\hat{Y}_T(h)\) only uses information up to time \(T\), the resulting forecasts are called out-of-sample forecasts. Economists call them ex-ante forecasts. We have discussed several ways to select the optimal method or model for forecasting, e.g., using PMAE, PMSE, or coverage – all calculated on a testing set. Chatfield (2000) mentions several ways to unfairly ‘improve’ forecasts:
Fitting the model to all the data including the test set.
Fitting several models to the training set and choosing the model which gives the best ‘forecasts’ of the test set. The selected model is then used (again) to produce forecasts of the test set, even though the latter has already been used in the modeling process.
Using the known test-set values of ‘future’ observations on the explanatory variables in multivariate forecasting. This will improve forecasts of the dependent variable in the test set, but these future values will not of course be known at the time the forecast is supposedly made (though in practice the ‘forecast’ is made at a later date). Economists call such forecasts ex-post forecasts to distinguish them from ex-ante forecasts. The latter, being genuinely out-of-sample, use forecasts of future values of explanatory variables, where necessary, to compute forecasts of the response variable. Ex-post forecasts can be useful for assessing the effects of explanatory variables, provided the analyst does not pretend that they are genuine out-of-sample forecasts.
So what should we do if we have put a lot of effort into building a regression model using time series and need to forecast the response, \(Y_t\), which is modeled using different independent variables \(X_{t,k}\) (\(k=1,\dots,K\))? Two options are possible.
Leading indicators
If the \(X_{t,k}\)'s are leading indicators with lags starting at \(l\), we generally do not need their future values to obtain the forecasts \(\hat{Y}_T(h)\), where \(h\leqslant l\). For example, the model for losses tested in Section 9.7 shows that precipitation with lag 1 is a good predictor of current losses, i.e., precipitation is a leading indicator. The 1-week-ahead forecast of \(Y_{t+1}\) can be obtained using the current precipitation \(X_t\) (all data are available). If \(h>l\), we will be forced to forecast the independent variables \(X_{t,k}\) – see the next option.
Forecast of predictors
If we opt for forecasting the \(X_{t,k}\)'s, the errors (uncertainty) of such forecasts will be larger because the future \(X_{t,k}\)'s will themselves be estimates. Nevertheless, it might be the only choice when leading indicators are not available. Building a full and comprehensive model with all diagnostics for each regressor is usually infeasible, and even problematic if we plan to consider multivariate models for the regressors (the complexity of the models will quickly escalate). As an alternative, it is common to use automatic or semi-automatic univariate procedures to forecast each of the \(X_{t,k}\)'s, for example, exponential smoothing, Holt–Winters smoothing, and auto-selected SARIMA/ARIMA/ARMA/AR/MA models – all of these can be automated when a large number of forecasts is needed.
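A hedged sketch of this workflow (y and x are placeholder series; the forecast horizon of 4 is illustrative):

library(forecast)
fit_x <- auto.arima(x)                      # automatic univariate model for the predictor
fx <- forecast(fit_x, h = 4)                # forecasts of the predictor 4 steps ahead
fit_y <- auto.arima(y, xreg = cbind(x = x)) # regression of y on x with ARMA errors
forecast(fit_y, xreg = cbind(x = fx$mean))  # ex-ante forecasts of the response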
9.9 Conclusion
Multivariate models are still much more difficult to fit than univariate ones. Multiple regression remains a treacherous procedure when applied to time series data. Many observed time series exhibit nonlinear characteristics, but nonlinear models often fail to give better out-of-sample forecasts than linear models, perhaps because the latter are more robust to departures from model assumptions. It is always a good idea to end with the so-called eyeball test. Plot the forecasts on a time plot of the data and check that they look intuitively reasonable (Chatfield 2000).
Benjamin MA, Rigby RA, Stasinopoulos DM (2003) Generalized autoregressive moving average models. Journal of the American Statistical Association 98:214–223. https://doi.org/10.1198/016214503388619238
Brockwell PJ, Davis RA (2002) Introduction to time series and forecasting, 2nd edn. Springer, New York, NY, USA
Chatfield C (2000) Time-series forecasting. CRC Press, Boca Raton, FL, USA
Dean RT, Dunsmuir WTM (2016) Dangers and uses of cross-correlation in analyzing time series in perception, performance, movement, and neuroscience: The importance of constructing transfer function autoregressive models. Behavior Research Methods 48:783–802. https://doi.org/10.3758/s13428-015-0611-2
Dunn PK, Smyth GK (1996) Randomized quantile residuals. Journal of Computational and Graphical Statistics 5:236–244. https://doi.org/10.2307/1390802
Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37:424–438. https://doi.org/10.2307/1912791
Nesslage G, Lyubchich V, Nitschke P, et al (2021) Environmental drivers of golden tilefish (Lopholatilus chamaeleonticeps) commercial landings and catch-per-unit-effort. Fisheries Oceanography 30:608–622. https://doi.org/10.1111/fog.12540
Pearl J (2009) Causality: Models, reasoning, and inference, 2nd edn. Cambridge University Press, Cambridge, UK
Perperoglou A, Sauerbrei W, Abrahamowicz M, Schmid M (2019) A review of spline function procedures in R. BMC Medical Research Methodology 19: https://doi.org/10.1186/s12874-019-0666-3
Rebane G, Pearl J (1987) The recovery of causal poly-trees from statistical data. In: Proceedings of the third annual conference on uncertainty in artificial intelligence. pp 222–228
Soliman M, Lyubchich V, Gel YR (2019) Complementing the power of deep learning with statistical model fusion: Probabilistic forecasting of influenza in Dallas County, Texas, USA. Epidemics 28:100345. https://doi.org/10.1016/j.epidem.2019.05.004
Stasinopoulos DM, Rigby RA (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software 23:1–46. https://doi.org/10.18637/jss.v023.i07
Zeger SL, Qaqish B (1988) Markov regression models for time series: A quasi-likelihood approach. Biometrics 44:1019–1031. https://doi.org/10.2307/2531732