# 3 Smoothing, Detrending, and Deseasonalizing

The goal of this lecture is to learn a variety of methods used for *trend visualization* (such that make a trend in noisy data more apparent), *trend modeling* (such that can help us to come up with ‘an equation’ for the trend, which could be further studied or used for interpolation, forecasting, or detrending), and *detrending* (just to remove trend from the series, with or without modeling it). You should be able to choose a method based on the goals of your analysis. We will start with simple and widely used methods that can be further combined into more elaborate algorithms.

**Objectives**

- Implement smoothing methods often used for trend
- visualization (example: moving average, lowess);
- modeling (example: exponential smoothing, polynomial smoothing);
- removal (example: differencing, however, the above methods also can be used to remove trends).

- Accomplish each of the tasks above when handling a time series with cycles (seasonality).
- Understand the concepts of trend- and difference-stationary time series.

**Reading materials**

## 3.1 Introduction

Recall the classical decomposition \[ Y_t = M_t + S_t + \epsilon_t, \tag{3.1}\] where \(M_t\) is a slowly changing function (trend component), \(\epsilon_t\) is stationary random noise component, and \(S_t\) is the periodic term with the period \(m\geqslant 2\) (seasonal component) and scaling factors \(\lambda_k > 0\) such that \(S_{t+km} = \lambda_kS_t\) for \(1 \leqslant t \leqslant m\) and each \(k \geqslant 1\). For identification, we need \(\sum_{t=1}^m S_t = 0\) and \(\mathrm{E}(\epsilon_t)=0\).

Our goal is to estimate and extract \(M_t\) and \(S_t\) with the hope that the residual or noise component \(\epsilon_t\) will turn out to be a stationary time series (Section 3.2–Section 3.5). Alternatively, Box–Jenkins models use difference operators repeatedly to the series \(Y_t\) until a stationary time series is obtained (see Section 3.6 on differencing).

The seasonal component is said to have *constant seasonal variation* if the scaling factors satisfy \(\lambda_{k} = 1\) for all \(k\). In other words, \(S_{t+m} = S_{t}\) for all \(t\). The constant variation is an assumption of most modeling techniques we will be using in this course. Unfortunately, many real seasonal time series do not have this constancy property and it is necessary to first perform a *variance-stabilizing transformation* to modify the original time series into one with constant variation. To some extent, this is a matter of trial and error, where we work from ‘weaker’ transformations to ‘stronger’ transformations as needed. Typically, power transformations are used \(Y_{t} \rightarrow Y^{\lambda}_{t}\), where \(0 < \lambda < 1\) or, log or even log log transformations, e.g., \(Y_{t} \rightarrow \log Y_{t}\).

The log transformation of \(Y_t\) is convenient when we assume not an additive model as Equation 3.1 is, but a multiplicative model as \[
Y_t = M_t \cdot S_t \cdot \epsilon_t.
\tag{3.2}\] When applying variance-stabilizing log transformation to Equation 3.2, we get an additive result: \[
\log Y_t = \log M_t + \log S_t + \log \epsilon_t,
\tag{3.3}\] which is now similar to Equation 3.1. We will refer to Equation 3.1 as *additive seasonality*, and to Equation 3.2 as *multiplicative seasonality*. Only a few methods can deal with the multiplicative case Equation 3.2 directly; most methods we consider in this course will require us to apply a transformation as in Equation 3.3.

In the following sections, we consider methods for the estimation and elimination of \(M_t\) (for non-seasonal data, when \(S_t = 0\) in the additive case Equation 3.1 or \(S_t = 1\) in the multiplicative case Equation 3.2) and of both \(M_t\) and \(S_t\) (for seasonal data).

## 3.2 Finite moving average smoothing

The non-seasonal model with a trend: \[ Y_t = M_t + \epsilon_t, \] where \(\mathrm{E}(\epsilon_t)=0\) and \(t=1,\dots,n\).

Let \(q\) be a nonnegative integer (\(q \in \mathbb{N}^{+}\)) and consider the *two-sided moving average* of the series \(Y_t\): \[
\begin{split}
W_t &= \frac{1}{2q+1}\sum_{j=-q}^q Y_{t-j}\\
&= \frac{1}{2q+1}\sum_{j=-q}^q M_{t-j} + \frac{1}{2q+1}\sum_{j=-q}^q \epsilon_{t-j} \\
&\approx M_t,
\end{split}
\tag{3.4}\] where \(w = 2q+1\) is the size of the moving window. Note that the above approximation is correct if the average value of \(\epsilon_{t}\) within each window is close to 0 (important for selecting \(q\)).

There are multiple ways to calculate moving averages in R, and formulas for \(W_t\) may vary. If using someone’s else functions, remember to read the help files or source code to find out the exact formula in place of Equation 3.4. Below are some examples.

For example, `forecast::ma()`

has the default option `centre = TRUE`

, and the `order`

argument represents the window size. Odd `order`

and `centre = TRUE`

correspond exactly to our definitions in Equation 3.4:

`forecast::ma(shampoo, order = 5)`

```
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1 NA NA 179 159 177 185 200 188 222 213 206 198
#> 2 215 203 204 222 238 256 260 306 301 324 332 362
#> 3 341 376 387 407 434 452 501 516 544 559 NA NA
```

For example, `pracma::movavg()`

averages the last `n`

data points:

`pracma::movavg(shampoo, n = 5)`

```
#> [1] 266 206 198 179 179 159 177 185 200 188 222 213 206 198 215 203 204 222 238
#> [20] 256 260 306 301 324 332 362 341 376 387 407 434 452 501 516 544 559
```

The base-R function `stats::filter()`

can also be used. Its odd window size and `sides = 2`

correspond to our definitions in Equation 3.4:

```
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1 NA NA 179 159 177 185 200 188 222 213 206 198
#> 2 215 203 204 222 238 256 260 306 301 324 332 362
#> 3 341 376 387 407 434 452 501 516 544 559 NA NA
```

The `sides = 1`

in `stats::filter()`

corresponds to the `pracma::movavg()`

results above:

```
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1 NA NA NA NA 179 159 177 185 200 188 222 213
#> 2 206 198 215 203 204 222 238 256 260 306 301 324
#> 3 332 362 341 376 387 407 434 452 501 516 544 559
```

See Figure 3.2 and Figure 3.3 for the effects of the window size.

### 3.2.1 Moving average smoothing for seasonal data

Come back to the trend-seasonal model Equation 3.1: \[ Y_t = M_t + S_t + \epsilon_t. \]

We apply the following steps for estimating seasonality and trend:

- Apply a moving average filter specially chosen to eliminate the seasonal component and dampen the noise – get \(\hat{M}_t\). Remember that the sum of \(S_t\) within each period is 0. Hence, to smooth out the seasonal component (and noise) and get an estimate of just \(M_t\), we will use a moving window \(w\) sized as the seasonal period \(m\) (more generally, \(w = km\), where \(k\in \mathbb{N}^{+}\), if a greater smoothness is desired). It will often lead to the even window size \(w\). For even \(w\), Equation 3.4 is modified as follows (we now use \(km + 1\) elements so the window is still centered, but the elements on the ends receive half the weight): \[ \hat{M}_t = \frac{0.5Y_{t-km/2} + Y_{t-km/2 + 1} +\dots+Y_{t+km/2-1} + 0.5Y_{t+km/2}}{km}. \tag{3.5}\]
- Use \(Y_t - \hat{M}_t\) to estimate the seasonal component – get \(\hat{S}_t\). If needed, correct the estimates, to sum up to 0 in each period. Let \(\hat{S}^*_t\) be the corrected values.
- Use deseasonalized data \(Y_t - \hat{S}^*_t\) to re-estimate the trend. Let \(\hat{M}^*_t\) be the corrected trend estimate.
- The estimated random noise is then: \(\hat{\epsilon}_t = Y_t - \hat{M}^*_t - \hat{S}^*_t\).

The above algorithm is implemented in `stats::decompose()`

, but its main disadvantage is in the centered moving average filter used to estimate \(M_t\) that does not produce smoothed values for the most recent observations. We can elaborate this algorithm, first, by using different estimators for \(M_t\); second, by also replacing estimators for \(S_t\) (e.g., see the next section).

## 3.3 Lowess smoothing

One of the alternatives for estimating trends \(M_t\) is *locally weighted regression* (shortened as ‘lowess’ or ‘loess’) (Cleveland 1979). We can apply lowess as a regression of \(Y_t\) on time, for example, using the algorithm outlined by Berk (2016):

- Choose the smoothing parameter such as bandwidth, \(f\), which is a proportion between 0 and 1.
- Choose a point \(t_0\) and its \(w = f n\) nearest points on the time axis.
- For these \(w\) nearest neighbor points, compute a weighted least squares regression line for \(Y_t\) on \(t\). The coefficients \(\boldsymbol{\beta}\) of such regression are estimated by minimizing the residual sum of squares \[ \text{RSS}^*(\boldsymbol{\beta}) = (\boldsymbol{Y}^* - \boldsymbol{X}^* \boldsymbol{\beta})^{\top} \boldsymbol{W}^* (\boldsymbol{Y}^* - \boldsymbol{X}^* \boldsymbol{\beta}), \] where the asterisk indicates that only the observations in the window are included. The regressor matrix \(\boldsymbol{X}^*\) can contain polynomial terms of time, \(t\). The \(\boldsymbol{W}^*\) is a diagonal matrix conforming to \(\boldsymbol{X}^*\), with diagonal elements being a function of distance from \(t_0\) (observations closer to \(t_0\) receive higher weights).
- Calculate the fitted value \(\tilde{Y}_t\) for that single \(t_0\).
- Repeat Steps 2–4 for each \(t_0 = 1,\dots,n\).

By selecting large bandwidth \(f\) in the lowess algorithm, we can obtain greater smoothing (see Figure 3.7).

## Code

### 3.3.1 Lowess smoothing for seasonal data

Now use the Secchi time series (Figure 3.4) and implement the procedure of detrending and deseasonalizing from Section 3.2.1, but replace the simple moving average estimates of \(M_t\) with lowess estimates.

## 3.4 Exponential smoothing

Exponential smoothing (ES) is a successful forecasting technique. It turns out that ES can be modified to be used effectively for time series with

- slowly drifting trend (double exponential smoothing);
- trends (Holt’s method);
- seasonal patterns;
- combination of trend and seasonality (Holt–Winters method).

ES is easy to adjust for past errors and easy to prepare follow-on forecasts. ES is ideal for situations where many forecasts must be prepared. Several different functional forms of ES are used depending on the presence of trend or cyclical variations.

In short, an ES is an averaging technique that uses unequal weights while the weights applied to past observations decline in an exponential manner.

**Single exponential smoothing** calculates the smoothed series as a damping coefficient \(\alpha\) (\(\alpha \in [0, 1]\)) times the actual series plus \(1 - \alpha\) times the lagged value of the smoothed series. For the model \[
Y_{t} = M_t + \epsilon_{t},
\] the updating equation is \[
\begin{split}
\hat{M}_1 &= Y_1\\
\hat{M}_t &= \alpha Y_t + (1 - \alpha) \hat{M}_{t - 1}
\end{split}
\] and the forecast is \[
\hat{Y}_{t+1} = \hat{M}_t.
\]

An exponential smoothing over an already smoothed time series is called *double exponential smoothing*. In some cases, it might be necessary to extend it even to a *triple exponential smoothing*.

**Holt’s linear exponential smoothing** Suppose that the series \(Y_t\) is non-seasonal but displays a trend. Now we need to estimate both the current mean (or *level* in some professional jargon) and the current trend.

The updating equations express ideas similar to those for exponential smoothing. However, now we have two smoothing parameters, \(\alpha\) and \(\beta\) (\(\alpha \in [0, 1]\); \(\beta \in [0, 1]\)).

The updating equations are \[ a_{t} = \alpha Y_{t} + \left( 1- \alpha \right) \left( a_{t - 1} + b_{t - 1} \right) \] for the mean and \[ b_{t} = \beta \left( a_{t} - a_{t-1} \right) + \left( 1 - \beta \right) b_{t-1} \] for the trend.

Then the forecasting for \(k\) steps into the future is \[ \hat{Y}_{t+k} = a_{t} + kb_{t}. \]

Usually, the initial (starting) values are \[ \begin{split} a_{1} & = Y_{2}, \\ b_{1} & = Y_{2} - Y_{1}. \end{split} \]

### 3.4.1 Exponential smoothing for seasonal data

**Multiplicative Holt–Winters procedure**

Now in addition to the Holt parameters, suppose that the series exhibits multiplicative seasonality and let \(S_{t}\) be the multiplicative seasonal factor at the time \(t\).

Suppose also that there are \(m\) observations in one period (in a year). For example, \(m = 4\) for quarterly data, and \(m = 12\) for monthly data.

In some time series, seasonal variation is so strong it obscures any trends or other cycles, which are important for our understanding of the observed process. Holt–Winters smoothing method can remove seasonality and makes long-term fluctuations in the series stand out more clearly.

A simple way of detecting a trend in seasonal data is to take averages over a certain period. If these averages change with time we can say that there is evidence of a trend in the series.

We now use three smoothing parameters: \(\alpha\), \(\beta\), and \(\gamma\) (\(\alpha \in [0, 1]\); \(\beta \in [0, 1]\); \(\gamma \in [0, 1]\)).

The updating equations for the level (\(a_t\)), local trend (\(b_t\)), and seasonal factor (\(S_t\)) are: \[ \begin{split} a_{t} & = \alpha Y_{t} / S_{t - m} + (1 - \alpha) ( a_{t - 1} + b_{t - 1}), \\ b_{t} & = \beta (a_{t} - a_{t - 1}) + (1 - \beta) b_{t - 1}, \\ S_{t} & = \gamma Y_{t} / a_{t} + (1 - \gamma) S_{t - m}. \end{split} \]

Then the forecasting for \(k\) steps into the future is \[ \hat{Y}_{t+k} = (a_{t} + kb_{t}) S_{t+k-m}, \] where \(k = 1, 2, \dots, m\).

To obtain starting values, one may use an average over the first few periods (years) of the data.

The smoothing parameters \(\alpha\), \(\beta\) and \(\gamma\) are estimated by minimizing the sum of the squared one-step prediction errors.

**Additive Holt–Winters procedure**

The updating equations are \[ \begin{split} a_{t} & = \alpha (Y_{t} - S_{t - m}) + (1 - \alpha) (a_{t - 1} + b_{t - 1}), \\ b_{t} & = \beta (a_{t} - a_{t - 1}) + (1 - \beta) b_{t - 1} \\ S_{t} & = \gamma (Y_{t} - a_{t}) + (1 - \gamma) S_{t - m}. \end{split} \]

Then the forecasting for \(k\) steps into the future is \[ \hat{Y}_{t+k} = a_{t} + kb_{t} + S_{t + k - m}, \] where \(k = 1, 2, \dots, m\).

## 3.5 Polynomial regression on time

This is a very intuitive procedure for fitting parametric trend functions to time series:

- Make a parametric model assumption about \(M_t\) as a function of time \(t\), for example, quadratic trend: \[ M_t = \beta_0 + \beta_1 t + \beta_2 t^2 \]
- Fit the model using the usual regression estimators (least squares or maximum likelihood)

### 3.5.1 Seasonal regression with dummy variables

To each time \(t\) and each season \(k\) we will assign an indicator that ‘turns on’ if and only if \(t\) falls into that season. In principle, a seasonal period of any positive integer length \(m \geqslant 2\) is possible, however, in most cases, the length of the seasonal cycle is one year and so the seasonal period of the time series depends on the number of observations per year. The most common periods are \(m = 12\) (monthly data) and \(m = 4\) (quarterly data). For each \(k = 1, \dots, m\), we define the indicator \[ X_{k,t} = \left\{ \begin{array}{cl} 1, & \mbox{if} ~ t ~ \text{corresponds to season} ~ k, \\ 0, & \mbox{if} ~ t ~ \text{does not correspond to season} ~ k. \\ \end{array} \right. \]

Note that for each \(t\) we have \(X_{1,t} + X_{2,t} + \dots + X_{m,t} = 1\) since each \(t\) corresponds to exactly one season. Thus, given any \(m - 1\) of the variables, the remaining variable is known and thus redundant. Because of this linear dependence, we must drop one of the indicator variables (seasons) from the model. It does not matter which season is dropped although sometimes there are choices that are simpler from the point of view of constructing or labeling the design matrix. Thus the general form of a seasonal model looks like \[ \begin{split} f (Y_{t}) &= M_{t} + \beta_{2} X_{2,t} + \beta_{3} X_{3,t} + \dots + \beta_{m} X_{m,t} + \epsilon_{t} \\ &= M_{t} + \sum_{i=2}^{m}\beta_{i} X_{i,t} + \epsilon_{t}, \end{split} \] where \(M_{t}\) is a trend term which may also include some \(\beta\) parameters and explanatory variables. The function \(f\) represents the appropriate variance-stabilizing transformations as needed. (Here the 1st season has been dropped from the model.)

### 3.5.2 Seasonal regression with Fourier series

It is a spoiler for the future lecture on spectral analysis of time series, but seasonal regression modeling would not be complete without introducing the trigonometric Fourier series.

We can fit a linear regression model using several pairs of trigonometric functions as predictors. For example, for monthly observations \[ \begin{split} cos_{k} (i) &= \cos (2 \pi ki / 12), \\ sin_{k} (i) &= \sin (2 \pi ki / 12), \end{split} \] where \(i\) is the month within the year, and the trigonometric function has \(k\) cycles per year.

## 3.6 Differencing

A big subset of time series models (particularly, models dealing with random walk series) applies the differencing to eliminate trends.

Let \(\Delta\) be the difference operator (\(\nabla\) notation is also used sometimes), then the simple first-order differences are: \[ \begin{split} \Delta Y_t &= Y_t - Y_{t-1} \\ &=(1-B)Y_t, \end{split} \] where \(B\) is a backshift operator.

The **backshift operator** and difference operator are useful for a convenient representation of higher-order differences. We will often use backshift operators in future lectures: \[
\begin{split}
B^0Y_{t} &= Y_{t} \\
BY_{t} &= Y_{t-1} \\
B^{2}Y_{t} &= Y_{t-2} \\
& \vdots \\
B^{k}Y_{t} &= Y_{t-k}.
\end{split}
\] The convenience is because the powers of the operators can be treated as powers of elements of usual polynomials, so we can make different operations for more complex cases.

For example, if we took second-order differences, \(\Delta^2Y_t = (1-B)^2Y_t\), to remove a trend and then used differences with the seasonal lag of 12 months to remove a strong seasonal component, what would be the final form of the transformed series? The transformed series, \(Y^*_t\), will be: \[ \begin{split} Y^*_t & = (1-B)^2 (1 - B^{12})Y_t \\ & = (1 - 2B + B^2) (1 - B^{12})Y_t \\ & = (1 - 2B + B^2 - B^{12} + 2B^{13} - B^{14})Y_t \\ & = Y_t - 2Y_{t-1} + Y_{t-2} - Y_{t-12} + 2Y_{t-13}-Y_{t-14}, \end{split} \] but we will often use just the top-row notations.

We will discuss formal tests to identify an appropriate order of differences in future lectures (unit-root tests), but for now use the *rule of thumb:* for time trends looking linear use 1st order differences (\(\Delta Y_t = Y_t - Y_{t-1}\)), for parabolic shapes use differences of the 2nd order, etc. Apply the differencing of higher orders until the time series looks stationary (plot the transformed series and its ACF at each step). In practice, differences of order 3 or higher are rarely needed.

A deterministic linear trend can be also eliminated by differencing. For example, consider a time series \(X-t\) with a deterministic linear trend: \[ X_t = a + b t + Z_t, \] where \(Z_t \sim \mathrm{WN}(0,\sigma^2)\). The first-order differences of \(X_t\) remove the linear trend: \[ \begin{align} (1 - B)X_t &= X_t - X_{t-1} \\ &= a + b t + Z_t - (a + b (t - 1) + Z_{t-1}) \\ &= a + b t + Z_t - a - b t + b - Z_{t-1} \\ &= b + Z_t + Z_{t-1}. \end{align} \]

## 3.7 Conclusion

We have implemented various methods for trend visualization, modeling, and trend removal. In each case, the method could be extended to additionally remove from a time series a strong seasonal signal. The choice of a method depends on many things, such as:

- The methods require different skill levels (and time commitment) for their implementation and can be categorized as automatic and non-automatic.
- Some methods are very flexible when dealing with seasonality (e.g., Holt–Winters), while others are not.
- The goal of smoothing and desired degree of smoothness in the output.
- Some of the methods have a convenient way of getting forecasts (e.g., exponential smoothing and polynomial regression), while other methods are more suitable for interpolation (polynomial regression) and simple visualization (simple moving average), or just trend elimination (all of the considered methods are capable of eliminating trends, but differencing is the method that does nothing else but the elimination).

Remember that forecasted values (point forecasts) are not valuable if their uncertainty is not quantified. To quantify the uncertainty, we provide prediction intervals that are based on the past behavior of residuals and careful residual diagnostics (such as described in the previous lectures).

The evaluation of the methods may involve splitting the time series into training and testing periods. Overall, this lecture focused on presenting the methods themselves, while *residual diagnostics* and *evaluation on a testing set* were omitted due to time constraints.

While the considered methods are among the most common, there are many more smoothing methods that can be applied to time series: splines (such as used in generalized additive models, GAMs), kernels, machine learning techniques, etc.