Forecasting time series with gradient boosting: Skforecast, XGBoost, LightGBM, Scikit-learn and CatBoost

Joaquín Amat Rodrigo, Javier Escobar Ortiz
February, 2021 (last update December 2023)

Introduction

Gradient boosting models have gained popularity in the machine learning community due to their ability to achieve excellent results in a wide range of use cases, including both regression and classification. Although these models have traditionally been less common in forecasting, they can be highly effective in this domain. Some of the key benefits of using gradient boosting models for forecasting include:

  • The ease with which exogenous variables can be included in the model, in addition to autoregressive variables.

  • The ability to capture non-linear relationships between variables.

  • High scalability, allowing models to handle large volumes of data.

  • Some implementations allow the inclusion of categorical variables without the need for additional encoding, such as one-hot encoding.

Despite these benefits, the use of machine learning models for forecasting presents several challenges that can make analysts reluctant to adopt them, the main ones being:

  • Transforming the data so that it can be used as a regression problem.

  • Depending on how many future predictions are needed (prediction horizon), an iterative process may be required in which each new prediction builds on the previous ones (see the sketch after this list).

  • Model validation requires specific strategies such as backtesting, walk-forward validation or time series cross-validation. Traditional cross-validation cannot be used.
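To illustrate the first two challenges, the following minimal sketch (not part of the original analysis; the toy series and the 3-lag window are arbitrary choices) shows how a univariate series can be reframed as a regression matrix of lagged values, and how a recursive multi-step forecast feeds each prediction back in as a new lag.

# Sketch: time series as a regression problem and recursive prediction
# ==============================================================================
import numpy as np
import pandas as pd

y = pd.Series([10., 12., 14., 13., 15., 17., 16.], name='y')  # toy series

# Lagged values of the series become the predictors (here, 3 lags)
n_lags = 3
X = pd.concat([y.shift(lag) for lag in range(1, n_lags + 1)], axis=1)
X.columns = [f'lag_{lag}' for lag in range(1, n_lags + 1)]
train_matrix = pd.concat([X, y], axis=1).dropna()

# Recursive multi-step forecast: each prediction is appended to the window of
# lags used for the next step (a fitted regressor would replace np.mean here)
last_window = list(y.iloc[-n_lags:])
predictions = []
for _ in range(5):
    pred = np.mean(last_window)
    predictions.append(pred)
    last_window = last_window[1:] + [pred]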

The skforecast library provides automated solutions to these challenges, making it easier to apply and validate machine learning models in forecasting problems. The library supports several advanced gradient boosting models, including XGBoost, LightGBM, CatBoost and scikit-learn's HistGradientBoostingRegressor. This document shows how to use them to build accurate forecasting models. To ensure a smooth learning experience, an initial exploration of the data is performed. Then, the modeling process is explained step by step, starting with a recursive model that uses a LightGBM regressor and moving on to a model that incorporates exogenous variables and various encoding strategies. The document concludes by demonstrating the use of other gradient boosting implementations, including XGBoost, CatBoost, and the scikit-learn HistGradientBoostingRegressor.

✎ Note

Machine learning models do not always outperform statistical learning models such as AR, ARIMA or Exponential Smoothing. Which one works best depends largely on the characteristics of the use case to which it is applied. Visit ARIMA and SARIMAX models with skforecast to learn more about statistical models.

✎ Note

Other great examples of how gradient boosting models can be used for forecasting can be found in the documents Forecasting energy demand with machine learning and Global Forecasting Models.

Use case

Bicycle sharing is a popular shared transport service that provides bicycles to individuals for short-term use. These systems typically provide bike docks where riders can borrow a bike and return it to any dock belonging to the same system. The docks are equipped with special bike racks that secure the bike and only release it via computer control. One of the major challenges faced by operators of these systems is the need to redistribute bikes to ensure that there are bikes available at all docks, as well as free spaces for returns.

In order to improve the planning and execution of bicycle distribution, it is proposed to create a model capable of forecasting the number of users over the next 36 hours. In this way, at 12:00 every day, the company in charge of managing the system will be able to know the expected users for the rest of the day (12 hours) and the next day (24 hours).

For illustrative purposes, the current example models only a single station, but it can easily be extended to cover multiple stations using global multi-series forecasting, thereby improving the management of bike-sharing systems on a larger scale.

Libraries

Libraries used in this document.

In [1]:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from astral.sun import sun
from astral import LocationInfo
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
poff.init_notebook_mode(connected=True)
plt.style.use('seaborn-v0_8-darkgrid')

# Modelling and Forecasting
# ==============================================================================
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFECV
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

from skforecast.ForecasterBaseline import ForecasterEquivalentDate
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import bayesian_search_forecaster
from skforecast.model_selection import backtesting_forecaster
import shap
shap.initjs()

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data

The data in this document represent the hourly usage of the bike share system in the city of Washington, D.C. during the years 2011 and 2012. In addition to the number of users per hour, information about weather conditions and holidays is available. The original data was obtained from the UCI Machine Learning Repository and has been previously cleaned (code) applying the following modifications:

  • Renamed columns with more descriptive names.

  • Renamed categories of the weather variables. The category of heavy_rain has been combined with that of rain.

  • Denormalized temperature, humidity and wind variables.

  • date_time variable created and set as index.

  • Imputed missing values by forward filling.

In [2]:
# Downloading data
# ==============================================================================
data = fetch_dataset('bike_sharing', raw=True)
bike_sharing
------------
Hourly usage of the bike share system in the city of Washington D.C. during the
years 2011 and 2012. In addition to the number of users per hour, information
about weather conditions and holidays is available.
Fanaee-T,Hadi. (2013). Bike Sharing Dataset. UCI Machine Learning Repository.
https://doi.org/10.24432/C5W894.
Shape of the dataset: (17544, 12)
In [3]:
# Preprocessing data (setting index and frequency)
# ==============================================================================
data = data[['date_time', 'users', 'holiday', 'weather', 'temp', 'atemp', 'hum', 'windspeed']]
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
data.head()
Out[3]:
users holiday weather temp atemp hum windspeed
date_time
2011-01-01 00:00:00 16.0 0.0 clear 9.84 14.395 81.0 0.0
2011-01-01 01:00:00 40.0 0.0 clear 9.02 13.635 80.0 0.0
2011-01-01 02:00:00 32.0 0.0 clear 9.02 13.635 80.0 0.0
2011-01-01 03:00:00 13.0 0.0 clear 9.84 14.395 75.0 0.0
2011-01-01 04:00:00 1.0 0.0 clear 9.84 14.395 75.0 0.0

To facilitate the training of the models, the search for optimal hyperparameters and the evaluation of their predictive accuracy, the data are divided into three separate sets: training, validation and test.

In [4]:
# Split train-validation-test
# ==============================================================================
end_train = '2012-03-31 23:59:00'
end_validation = '2012-08-31 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validacion : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Dates train      : 2011-01-01 00:00:00 --- 2012-03-31 23:00:00  (n=10944)
Dates validation : 2012-04-01 00:00:00 --- 2012-08-31 23:00:00  (n=3672)
Dates test       : 2012-09-01 00:00:00 --- 2012-12-31 23:00:00  (n=2928)

Data exploration

Graphical exploration of time series can be an effective way of identifying trends, patterns, and seasonal variations. This, in turn, helps to guide the selection of the most appropriate forecasting model.

Plot time series

Full time series

In [5]:
# Interactive plot of time series
# ==============================================================================
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_train.index, y=data_train['users'], mode='lines', name='Train'))
fig.add_trace(go.Scatter(x=data_val.index, y=data_val['users'], mode='lines', name='Validation'))
fig.add_trace(go.Scatter(x=data_test.index, y=data_test['users'], mode='lines', name='Test'))
fig.update_layout(
    title  = 'Number of users',
    xaxis_title="Time",
    yaxis_title="Users",
    legend_title="Partition:",
    width=800,
    height=350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1,
        xanchor="left",
        x=0.001
    )
)
#fig.update_xaxes(rangeslider_visible=True)
fig.show()
In [6]:
# Static plot of time series with zoom
# ==============================================================================
zoom = ('2011-08-01 00:00:00', '2011-08-15 00:00:00')

fig = plt.figure(figsize=(8, 4))
grid = plt.GridSpec(nrows=8, ncols=1, hspace=0.1, wspace=0)

main_ax = fig.add_subplot(grid[1:3, :])
data_train['users'].plot(ax=main_ax, label='train', alpha=0.5)
data_val['users'].plot(ax=main_ax, label='validation', alpha=0.5)
data_test['users'].plot(ax=main_ax, label='test', alpha=0.5)
min_y = min(data['users'])
max_y = max(data['users'])
main_ax.fill_between(zoom, min_y, max_y, facecolor='blue', alpha=0.5, zorder=0)
main_ax.set_title(f'Number of users: {data.index.min()}, {data.index.max()}', fontsize=10)
main_ax.set_xlabel('')
main_ax.legend(loc='lower center', ncol=3, bbox_to_anchor=(0.5, -0.8))

zoom_ax = fig.add_subplot(grid[5:, :])
data.loc[zoom[0]: zoom[1]]['users'].plot(ax=zoom_ax, color='blue', linewidth=1)
zoom_ax.set_title(f'Number of users: {zoom}', fontsize=10)
zoom_ax.set_xlabel('')

plt.subplots_adjust(hspace=1)
plt.show();

Seasonality plots

Seasonal plots are a useful tool for identifying seasonal patterns and trends in a time series. They are created by aggregating the values of the series for each seasonal period, for example each month, day of the week, or hour of the day, and plotting the result.

In [7]:
# Annual, weekly and daily seasonality
# ==============================================================================
fig, axs = plt.subplots(2, 2, figsize=(8.5, 5.5), sharex=False, sharey=True)
axs = axs.ravel()

# Users distribution by month
data['month'] = data.index.month
data.boxplot(column='users', by='month', ax=axs[0])
data.groupby('month')['users'].median().plot(style='o-', linewidth=0.8, ax=axs[0])
axs[0].set_ylabel('Users')
axs[0].set_title('Users distribution by month')

# Users distribution by week day
data['week_day'] = data.index.day_of_week + 1
data.boxplot(column='users', by='week_day', ax=axs[1])
data.groupby('week_day')['users'].median().plot(style='o-', linewidth=0.8, ax=axs[1])
axs[1].set_ylabel('Users')
axs[1].set_title('Users distribution by week day')

# Users distribution by the hour of the day
data['hour_day'] = data.index.hour + 1
data.boxplot(column='users', by='hour_day', ax=axs[2])
data.groupby('hour_day')['users'].median().plot(style='o-', linewidth=0.8, ax=axs[2])
axs[2].set_ylabel('Users')
axs[2].set_title('Users distribution by the hour of the day')

# Users distribution by week day and hour of the day
mean_day_hour = data.groupby(["week_day", "hour_day"])["users"].mean()
mean_day_hour.plot(ax=axs[3])
axs[3].set(
    title       = "Mean users during week",
    xticks      = [i * 24 for i in range(7)],
    xticklabels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    xlabel      = "Day and hour",
    ylabel      = "Number of users"
)

fig.suptitle("Seasonality plots", fontsize=20)
fig.tight_layout()

There is a clear difference between weekdays and weekends. There is also a clear intra-day pattern, with a different influx of users depending on the time of day.

Autocorrelation plots

Auto-correlation plots are a useful tool for identifying the order of an autoregressive model. The autocorrelation function (ACF) is a measure of the correlation between the time series and a lagged version of itself. The partial autocorrelation function (PACF) is a measure of the correlation between the time series and a lagged version of itself, controlling for the values of the time series at all shorter lags. These plots are useful for identifying the lags to be included in the autoregressive model.

In [8]:
# Autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2))
plot_acf(data['users'], ax=ax, lags=72)
plt.show()
In [9]:
# Partial autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2))
plot_pacf(data['users'], ax=ax, lags=72, method='ywm')
plt.show()

The results of the autocorrelation study show that there is a significant correlation between the number of users in previous hours, as well as the days before, and the number of users in the future. This means that knowing the number of users during certain periods in the past could be valuable in predicting the number of users in the future.

Baseline

When faced with a forecasting problem, it is important to establish a baseline model. This is usually a very simple model that can be used as a reference to assess whether more complex models are worth implementing.

Skforecast allows you to easily create a baseline with its class ForecasterEquivalentDate. This model, also known as Seasonal Naive Forecasting, simply returns the value observed in the same period of the previous season (e.g., the same working day from the previous week, the same hour from the previous day, etc.).

Based on the exploratory analysis performed, the baseline model will be the one that predicts each hour using the value of the same hour on the previous day.

✎ Note

In the following code cells, a baseline forecaster is trained and its predictive ability is evaluated using a backtesting process. If this concept is new to you, do not worry: it is explained in detail throughout the document. For now, it is sufficient to know that backtesting consists of training the model with a certain amount of data and evaluating its predictive ability on data the model has not seen. The resulting error metric serves as a reference for assessing the more complex models implemented later.
In [10]:
# Create baseline: value of the same hour of the previous day
# ==============================================================================
forecaster = ForecasterEquivalentDate(
                 offset    = pd.DateOffset(days=1),
                 n_offsets = 1
             )

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'users'])
forecaster
Out[10]:
======================== 
ForecasterEquivalentDate 
======================== 
Offset: <DateOffset: days=1> 
Number of offsets: 1 
Aggregation function: mean 
Window size: 24 
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-08-31 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Creation date: 2023-12-14 09:54:34 
Last fit date: 2023-12-14 09:54:34 
Skforecast version: 0.11.0 
Python version: 3.11.4 
Forecaster id: None 
In [11]:
# Backtesting
# ==============================================================================
metric_baseline, predictions = backtesting_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data['users'],
                                   steps              = 36,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data.loc[:end_validation]),
                                   refit              = False,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )

print(f"Backtest error (MAE): {metric_baseline}")
Backtest error (MAE): 91.66871584699453

The baseline model achieves an MAE of 91.7, which is used as a reference to assess whether the more complex models are worth implementing.

Recursive multi-step forecasting with LightGBM

LightGBM is a highly efficient implementation of the gradient boosting algorithm and has become a benchmark in the field of machine learning. In addition to its native API, the LightGBM library provides a scikit-learn-compatible API (LGBMRegressor), which makes it compatible with skforecast.

First, a ForecasterAutoreg model is trained using only past values (lags) of the response variable as predictors. Then, exogenous variables are added to the model and the improvement in its performance is assessed. Since gradient boosting models have a large number of hyperparameters, a Bayesian search is performed using the bayesian_search_forecaster() function to find the best combination of hyperparameters and lags. Finally, the predictive ability of the model is evaluated using a backtesting process.

Forecaster

In [12]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=15926, verbose=-1),
                 lags      = 24
             )

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'users'])
forecaster
Out[12]:
================= 
ForecasterAutoreg 
================= 
Regressor: LGBMRegressor(random_state=15926, verbose=-1) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
Transformer for y: None 
Transformer for exog: None 
Window size: 24 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-08-31 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1} 
fit_kwargs: {} 
Creation date: 2023-12-14 09:54:36 
Last fit date: 2023-12-14 09:54:36 
Skforecast version: 0.11.0 
Python version: 3.11.4 
Forecaster id: None 
In [13]:
# Predict
# ==============================================================================
forecaster.predict(steps=10)
Out[13]:
2012-09-01 00:00:00    110.553303
2012-09-01 01:00:00     75.218776
2012-09-01 02:00:00     42.928564
2012-09-01 03:00:00     24.890883
2012-09-01 04:00:00     10.654948
2012-09-01 05:00:00     16.922900
2012-09-01 06:00:00     41.333408
2012-09-01 07:00:00     92.882118
2012-09-01 08:00:00    221.375747
2012-09-01 09:00:00    374.074368
Freq: H, Name: pred, dtype: float64

Backtesting

In order to obtain a robust estimate of the model's predictive ability, a backtesting process is carried out. Backtesting consists of generating a forecast for each observation in the test set, following the same procedure that would be used in production, and then comparing the predicted values with the actual values.

In [14]:
# Backtest model on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['users'],
                          steps              = 36,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = True, # Change to False to see less information
                          show_progress      = True
                      )

predictions.head()
Information of backtesting process
----------------------------------
Number of observations used for initial training: 14616
Number of observations used for backtesting: 2928
    Number of folds: 82
    Number of steps per fold: 36
    Number of steps to exclude from the end of each train set before test (gap): 0
    Last fold only includes 12 observations.

Fold: 0
    Training:   2011-01-01 00:00:00 -- 2012-08-31 23:00:00  (n=14616)
    Validation: 2012-09-01 00:00:00 -- 2012-09-02 11:00:00  (n=36)
Fold: 1
    Training:   2011-01-01 00:00:00 -- 2012-08-31 23:00:00  (n=14616)
    Validation: 2012-09-02 12:00:00 -- 2012-09-03 23:00:00  (n=36)
... (folds 2 to 79 omitted for brevity: every fold uses the same training range and a 36-hour validation window that advances 36 hours at a time) ...
Fold: 80
    Training:   2011-01-01 00:00:00 -- 2012-08-31 23:00:00  (n=14616)
    Validation: 2012-12-30 00:00:00 -- 2012-12-31 11:00:00  (n=36)
Fold: 81
    Training:   2011-01-01 00:00:00 -- 2012-08-31 23:00:00  (n=14616)
    Validation: 2012-12-31 12:00:00 -- 2012-12-31 23:00:00  (n=12)

Out[14]:
pred
2012-09-01 00:00:00 110.553303
2012-09-01 01:00:00 75.218776
2012-09-01 02:00:00 42.928564
2012-09-01 03:00:00 24.890883
2012-09-01 04:00:00 10.654948
In [15]:
# Backtesting error
# ==============================================================================
print(f'Backtest error (MAE): {metric}')
Backtest error (MAE): 76.25868057062392

The prediction error of the autoregressive model achieves a lower MAE than that of the baseline model.

Hyperparameter tuning

Hyperparameter tuning is a critical step in developing effective machine learning models. The ForecasterAutoreg used in the previous section included the first 24 lags and an LGBMRegressor with the default hyperparameters. However, there is no reason why these values should be the most appropriate ones. To find the best hyperparameters, a Bayesian search is performed using the bayesian_search_forecaster() function. The search is carried out using the same backtesting process as before, but each time the model is trained with a different combination of hyperparameters and lags. It is important to note that the hyperparameter search must be done using the validation set, so the test data is never used.

The search is performed by testing each combination of hyperparameters and lags as follows:

  1. Train the model using only the training set.

  2. Evaluate the model on the validation set via backtesting.

  3. Select the combination of hyperparameters and lags that gives the lowest error.

  4. Retrain the model with the best combination found, this time using both the training and validation data.

By following these steps, one can obtain a model with optimized hyperparameters and avoid overfitting.

In [16]:
# Hyperparameters search
# ==============================================================================
# Lags grid
lags_grid = [48, 72, [1, 2, 3, 23, 24, 25, 167, 168, 169]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'n_estimators'  : trial.suggest_int('n_estimators', 400, 1200, step=100),
        'max_depth'     : trial.suggest_int('max_depth', 3, 10, step=1),
        'learning_rate' : trial.suggest_float('learning_rate', 0.01, 0.5),
        'reg_alpha'     : trial.suggest_float('reg_alpha', 0, 1, step=0.1),
        'reg_lambda'    : trial.suggest_float('reg_lambda', 0, 1, step=0.1),
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'users'], # Test data not used
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data_train),
                                   fixed_train_size   = False,
                                   n_trials           = 20, # Increase this value for a more exhaustive search
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 60,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [  1   2   3  23  24  25 167 168 169] 
  Parameters: {'n_estimators': 1200, 'max_depth': 3, 'learning_rate': 0.01954178119726445, 'reg_alpha': 0.0, 'reg_lambda': 1.0}
  Backtesting metric: 63.71133013848451

In [17]:
# Search results
# ==============================================================================
results_search.head(10)
Out[17]:
lags params mean_absolute_error n_estimators max_depth learning_rate reg_alpha reg_lambda
52 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1200, 'max_depth': 3, 'learni... 63.711330 1200.0 3.0 0.019542 0.0 1.0
53 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1200, 'max_depth': 3, 'learni... 63.725766 1200.0 3.0 0.018067 0.2 0.0
50 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1200, 'max_depth': 3, 'learni... 65.437956 1200.0 3.0 0.047360 0.1 1.0
55 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1200, 'max_depth': 4, 'learni... 65.692969 1200.0 4.0 0.013309 0.2 0.0
54 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1100, 'max_depth': 4, 'learni... 66.287923 1100.0 4.0 0.022655 0.2 0.0
51 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1200, 'max_depth': 3, 'learni... 66.860866 1200.0 3.0 0.074507 0.0 1.0
58 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 900, 'max_depth': 7, 'learnin... 67.396393 900.0 7.0 0.015123 0.0 0.2
48 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 900, 'max_depth': 3, 'learnin... 67.506372 900.0 3.0 0.165470 0.4 0.9
59 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 1000, 'max_depth': 3, 'learni... 69.434228 1000.0 3.0 0.084459 0.1 0.8
42 [1, 2, 3, 23, 24, 25, 167, 168, 169] {'n_estimators': 700, 'max_depth': 8, 'learnin... 71.266863 700.0 8.0 0.224900 0.0 0.4

Since return_best has been set to True, the forecaster object is automatically updated with the best configuration found and trained on the entire dataset. This final model can then be used for future predictions on new data.
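As a quick usage note, and only as a minimal sketch rather than a cell of the original notebook, the refitted forecaster can now generate forecasts beyond its training range with the same predict method shown earlier:

# Forecast the next 36 hours with the tuned forecaster
# ==============================================================================
predictions = forecaster.predict(steps=36)
predictions.head()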

In [18]:
# Best model
# ==============================================================================
forecaster
Out[18]:
================= 
ForecasterAutoreg 
================= 
Regressor: LGBMRegressor(learning_rate=0.01954178119726445, max_depth=3, n_estimators=1200,
              random_state=15926, reg_lambda=1.0, verbose=-1) 
Lags: [  1   2   3  23  24  25 167 168 169] 
Transformer for y: None 
Transformer for exog: None 
Window size: 169 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-08-31 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.01954178119726445, 'max_depth': 3, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 1200, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.0, 'reg_lambda': 1.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1} 
fit_kwargs: {} 
Creation date: 2023-12-14 09:54:36 
Last fit date: 2023-12-14 09:57:41 
Skforecast version: 0.11.0 
Python version: 3.11.4 
Forecaster id: None 

Backtesting on test data

Once the best combination of hyperparameters has been identified using the validation data, the predictive capacity of the model is evaluated when applied to the test set. It is highly recommended to review the documentation for the backtesting_forecaster() function to gain a better understanding of its capabilities. This will help to utilize its full potential to analyze the predictive ability of the model.

In [19]:
# Backtest final model on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['users'],
                          steps              = 36,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False,
                          show_progress      = True
                      )

print(f"Backtest error: {metric:.2f}")
predictions.head()
Backtest error: 63.51
Out[19]:
pred
2012-09-01 00:00:00 121.716867
2012-09-01 01:00:00 96.685655
2012-09-01 02:00:00 66.315687
2012-09-01 03:00:00 37.115801
2012-09-01 04:00:00 17.879112
In [20]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width=800,
    height=350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

After optimizing lags and hyperparameters, the prediction error of the autoregressive model achieves an MAE that is significantly lower than that of the baseline model. Next, the predictive capacity of the model is evaluated when exogenous variables are added.

Exogenous Variables

So far, only lagged values of the time series have been used as predictors. However, it is possible to include other variables as predictors. These variables are known as exogenous variables (features) and their use can improve the predictive capacity of the model. A very important point to keep in mind is that the values of the exogenous variables must be known at the time of prediction.

Common examples of exogenous variables are those derived from the calendar, such as the day of the week, month, year, or holidays. Weather variables such as temperature, humidity, and wind also fall into this category, as do economic variables such as inflation and interest rates.

Warning

Exogenous variables must be known at the time of the forecast. For example, if temperature is used as an exogenous variable, the temperature value for the next hour must be known at the time of the forecast. If the temperature value is not known, the forecast will not be possible.

Weather variables should be used with caution. When the model is deployed into production, future weather conditions are not known, but are predictions made by meteorological services. Because they are predictions, they introduce errors into the forecasting model. As a result, the model's predictions are likely to get worse. One way to anticipate this problem, and to know (not avoid) the expected performance of the model, is to use the weather forecasts available at the time the model is trained, rather than the recorded conditions.

Next, exogenous variables are created based on calendar information, sunrise and sunset times, temperature, and holidays. These new variables are then added to the training, validation and test sets, and used as predictors in the autoregressive model.
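Before creating these features, the sketch below outlines how exogenous variables are typically supplied to a skforecast forecaster: they are passed to both fit and predict through the exog argument. This is only an illustrative sketch, not a cell of the original notebook; exog stands for a DataFrame of exogenous features aligned with the series and free of missing values, such as the one built in the following cells.

# Sketch: passing exogenous variables to fit() and predict()
# ==============================================================================
forecaster.fit(
    y    = data.loc[:end_validation, 'users'],
    exog = exog.loc[:end_validation, :]                  # assumed NaN-free DataFrame
)
predictions = forecaster.predict(
                  steps = 36,
                  exog  = exog.loc[data_test.index[:36], :]  # future values must be known
              )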

Calendar and meteorological features

In [21]:
# Calendar features
# ==============================================================================
calendar_features = pd.DataFrame(index=data.index)
calendar_features['month'] = calendar_features.index.month
calendar_features['week_of_year'] = calendar_features.index.isocalendar().week
calendar_features['week_day'] = calendar_features.index.day_of_week + 1
calendar_features['hour_day'] = calendar_features.index.hour + 1

# Sunlight features
# ==============================================================================
location = LocationInfo(
    name='Washington DC',
    region='USA',
    timezone='US/Eastern',
    latitude=40.516666666666666,
    longitude=-77.03333333333333
)
sunrise_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunrise'].hour
    for date in data.index
]
sunset_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunset'].hour
    for date in data.index
]
sun_light_features = pd.DataFrame({
                         'sunrise_hour': sunrise_hour,
                         'sunset_hour': sunset_hour}, 
                         index = data.index
                     )
sun_light_features['daylight_hours'] = (
    sun_light_features['sunset_hour'] - sun_light_features['sunrise_hour']
)
sun_light_features['is_daylight'] = np.where(
                                        (data.index.hour >= sun_light_features['sunrise_hour']) & \
                                        (data.index.hour < sun_light_features['sunset_hour']),
                                        1,
                                        0
                                    )

# Holiday features
# ==============================================================================
holiday_features = data[['holiday']].astype(int)
holiday_features['holiday_previous_day'] = holiday_features['holiday'].shift(24)
holiday_features['holiday_next_day'] = holiday_features['holiday'].shift(-24)

# Temperature features
# ==============================================================================
temp_features = data[['temp']].copy()
temp_features['temp_roll_mean_1_day'] = temp_features['temp'].rolling(24, closed='left').mean()
temp_features['temp_roll_mean_7_day'] = temp_features['temp'].rolling(24*7, closed='left').mean()
temp_features['temp_roll_max_1_day'] = temp_features['temp'].rolling(24, closed='left').max()
temp_features['temp_roll_min_1_day'] = temp_features['temp'].rolling(24, closed='left').min()
temp_features['temp_roll_max_7_day'] = temp_features['temp'].rolling(24*7, closed='left').max()
temp_features['temp_roll_min_7_day'] = temp_features['temp'].rolling(24*7, closed='left').min()


# Merge all exogenous variables
# ==============================================================================
df_exogenous_features = pd.concat([
                            calendar_features,
                            sun_light_features,
                            temp_features,
                            holiday_features
                        ], axis=1)

df_exogenous_features.head(4)
Out[21]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight temp temp_roll_mean_1_day temp_roll_mean_7_day temp_roll_max_1_day temp_roll_min_1_day temp_roll_max_7_day temp_roll_min_7_day holiday holiday_previous_day holiday_next_day
date_time
2011-01-01 00:00:00 1 52 6 1 7 16 9 0 9.84 NaN NaN NaN NaN NaN NaN 0 NaN 0.0
2011-01-01 01:00:00 1 52 6 2 7 16 9 0 9.02 NaN NaN NaN NaN NaN NaN 0 NaN 0.0
2011-01-01 02:00:00 1 52 6 3 7 16 9 0 9.02 NaN NaN NaN NaN NaN NaN 0 NaN 0.0
2011-01-01 03:00:00 1 52 6 4 7 16 9 0 9.84 NaN NaN NaN NaN NaN NaN 0 NaN 0.0

Features with a cyclical pattern

Certain aspects of the calendar, such as hours or days, are cyclical. For example, the hour-day cycle ranges from 0 to 23 hours. These types of features can be handled in several ways, each with its advantages and disadvantages.

  • One approach is to use the features directly as numeric values without any transformation. This method avoids creating a lot of new features, but may impose an incorrect linear order on the values. For example, hour 23 of one day and hour 00 of the next are very far apart in their linear representation, when in fact there is only one hour difference between them.

  • Another possibility is to treat cyclical features as categorical variables to avoid imposing a linear order. However, this approach may result in the loss of the cyclical information inherent in the variable.

  • There is a third way to deal with cyclical features that is often preferred to the other two methods. This involves transforming the features using the sine and cosine of their period. This approach generates only two new features that capture the cyclicality of the data more accurately than the previous two methods because it preserves the natural order of the feature and avoids imposing a linear order.

✎ Note

For a more detailed explanation of the different ways of dealing with cyclical features, see the Cyclical features in time series forecasting document.
In [22]:
# Cyclical encoding of calendar and sunlight features
# ==============================================================================
def cyclical_encoding(data: pd.Series, cycle_length: int) -> pd.DataFrame:
    """
    Encode a cyclical feature with two new features, sine and cosine.
    The minimum value of the feature is assumed to be 0. The length of the
    cycle is passed as an argument.
      
    Parameters
    ----------
    data : pd.Series
        Series with the feature to encode.
    cycle_length : int
        The length of the cycle. For example, 12 for months, 24 for hours, etc.
        This value is used to calculate the angle of the sin and cos.

    Returns
    -------
    result : pd.DataFrame
        Dataframe with the two new features sin and cos.

    """

    sin = np.sin(2 * np.pi * data/cycle_length)
    cos = np.cos(2 * np.pi * data/cycle_length)
    result =  pd.DataFrame({
                  f"{data.name}_sin": sin,
                  f"{data.name}_cos": cos
              })

    return result


month_encoded = cyclical_encoding(df_exogenous_features['month'], cycle_length=12)
week_of_year_encoded = cyclical_encoding(df_exogenous_features['week_of_year'], cycle_length=52)
week_day_encoded = cyclical_encoding(df_exogenous_features['week_day'], cycle_length=7)
hour_day_encoded = cyclical_encoding(df_exogenous_features['hour_day'], cycle_length=24)
sunrise_hour_encoded = cyclical_encoding(df_exogenous_features['sunrise_hour'], cycle_length=24)
sunset_hour_encoded = cyclical_encoding(df_exogenous_features['sunset_hour'], cycle_length=24)

cyclical_features = pd.concat([
                        month_encoded,
                        week_of_year_encoded,
                        week_day_encoded,
                        hour_day_encoded,
                        sunrise_hour_encoded,
                        sunset_hour_encoded
                    ], axis=1)

df_exogenous_features = pd.concat([df_exogenous_features, cyclical_features], axis=1)
df_exogenous_features.head(3)
Out[22]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight temp temp_roll_mean_1_day ... week_of_year_sin week_of_year_cos week_day_sin week_day_cos hour_day_sin hour_day_cos sunrise_hour_sin sunrise_hour_cos sunset_hour_sin sunset_hour_cos
date_time
2011-01-01 00:00:00 1 52 6 1 7 16 9 0 9.84 NaN ... 0.0 1.0 -0.781831 0.62349 0.258819 0.965926 0.965926 -0.258819 -0.866025 -0.5
2011-01-01 01:00:00 1 52 6 2 7 16 9 0 9.02 NaN ... 0.0 1.0 -0.781831 0.62349 0.500000 0.866025 0.965926 -0.258819 -0.866025 -0.5
2011-01-01 02:00:00 1 52 6 3 7 16 9 0 9.02 NaN ... 0.0 1.0 -0.781831 0.62349 0.707107 0.707107 0.965926 -0.258819 -0.866025 -0.5

3 rows × 30 columns

✎ Note

For illustrative purposes, the calendar features and their cyclical encoding have been created manually. However, for a more efficient and automated approach, it is recommended to use the DatetimeFeatures and CyclicalFeatures transformers available in the Feature-engine library and pass them directly to the transformer_exog argument of the forecaster.

Feature interaction

In many cases, exogenous variables are not isolated. Rather, their effect on the target variable depends on the value of other variables. For example, the effect of temperature on electricity demand depends on the time of day. The interaction between exogenous variables can be captured by new variables obtained by multiplying existing variables together. These interactions are easily obtained with the PolynomialFeatures class from scikit-learn.

In [23]:
# Interaction between exogenous variables
# ==============================================================================
transformer_poly = PolynomialFeatures(
                       degree           = 2,
                       interaction_only = True,
                       include_bias     = False
                   ).set_output(transform="pandas")

poly_cols = [
    'month_sin', 
    'month_cos',
    'week_of_year_sin',
    'week_of_year_cos',
    'week_day_sin',
    'week_day_cos',
    'hour_day_sin',
    'hour_day_cos',
    'sunrise_hour_sin',
    'sunrise_hour_cos',
    'sunset_hour_sin',
    'sunset_hour_cos',
    'daylight_hours',
    'is_daylight',
    'holiday_previous_day',
    'holiday_next_day',
    'temp_roll_mean_1_day',
    'temp_roll_mean_7_day',
    'temp_roll_max_1_day',
    'temp_roll_min_1_day',
    'temp_roll_max_7_day',
    'temp_roll_min_7_day',
    'temp',
    'holiday'
]

poly_features = transformer_poly.fit_transform(df_exogenous_features[poly_cols].dropna())
poly_features = poly_features.drop(columns=poly_cols)
poly_features.columns = [f"poly_{col}" for col in poly_features.columns]
poly_features.columns = poly_features.columns.str.replace(" ", "__")
df_exogenous_features = pd.concat([df_exogenous_features, poly_features], axis=1)
df_exogenous_features.head(4)
Out[23]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight temp temp_roll_mean_1_day ... poly_temp_roll_min_1_day__temp_roll_max_7_day poly_temp_roll_min_1_day__temp_roll_min_7_day poly_temp_roll_min_1_day__temp poly_temp_roll_min_1_day__holiday poly_temp_roll_max_7_day__temp_roll_min_7_day poly_temp_roll_max_7_day__temp poly_temp_roll_max_7_day__holiday poly_temp_roll_min_7_day__temp poly_temp_roll_min_7_day__holiday poly_temp__holiday
date_time
2011-01-01 00:00:00 1 52 6 1 7 16 9 0 9.84 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 1 52 6 2 7 16 9 0 9.02 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-01 02:00:00 1 52 6 3 7 16 9 0 9.02 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-01 03:00:00 1 52 6 4 7 16 9 0 9.84 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4 rows × 306 columns

Categorical features

There are several approaches to incorporating categorical variables into LightGBM (and other gradient boosting frameworks):

  • One option is to transform the data by converting categorical values to numerical values using methods such as one-hot encoding or ordinal encoding. This approach is applicable to all machine learning models.

  • Alternatively, LightGBM can handle categorical variables internally without any preprocessing. This can be done automatically by storing the variables with the category dtype in a pandas DataFrame and setting the parameter categorical_feature='auto'. Users can also specify the features to be treated as categorical by passing a list of column names to the categorical_feature parameter (a minimal sketch of this native approach is shown after the next code cell).

No single method is always better than the others. Some general guidelines are:

  • When the cardinality of categorical variables is high (many different values), it is better to use the native support for categorical variables than to use one-hot encoding.

  • With one-hot encoded data, more split points (i.e., more depth) are needed to recover an equivalent split that could be obtained with a single split point using native handling.

  • When a categorical variable is converted to multiple dummy variables using one-hot encoding, its importance is diluted across the resulting columns, making feature importance analysis harder to interpret.

In [24]:
# Store categorical variables as category type
# ==============================================================================
data["weather"] = data["weather"].astype("category")

One hot encoding

ColumnTransformers in scikit-learn provide a powerful way to define transformations and apply them to specific features. By encapsulating the transformations in a ColumnTransformer object, it can be passed to a Forecaster using the transformer_exog argument.

✎ Note

It is possible to apply a transformation to the entire dataset independently of the Forecaster. However, it is crucial to ensure that the transformations are learned only from the training data to avoid information leakage. In addition, the same transformation must be applied to the input data at prediction time. It is therefore advisable to include the transformation in the Forecaster so that it is handled internally. This approach ensures consistency in the application of transformations and reduces the likelihood of errors.
In [25]:
# One hot encoding transformer
# ==============================================================================
one_hot_encoder = make_column_transformer(
                      (
                          OneHotEncoder(sparse_output=False, drop='if_binary'),
                          make_column_selector(dtype_include=['category', 'object']),
                      ),
                      remainder="passthrough",
                      verbose_feature_names_out=False,
                  ).set_output(transform="pandas")  
In [26]:
# Create a forecaster with a transformer for exogenous features
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = LGBMRegressor(random_state=15926, verbose=-1),
                 lags             = 72,
                 transformer_exog = one_hot_encoder
             )

To examine how the data is transformed, it is possible to use the create_train_X_y method to generate the matrices that the Forecaster uses to train the model. This approach provides insight into the specific data manipulations that occur during the training process.

In [27]:
# View training Matrix
# ==============================================================================
exog_features = ['weather']
                 
X_train, y_train = forecaster.create_train_X_y(
                       y    = data.loc[:end_validation, 'users'],
                       exog = data.loc[:end_validation, exog_features]
                   )
X_train.head(3)
Out[27]:
lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 ... lag_66 lag_67 lag_68 lag_69 lag_70 lag_71 lag_72 weather_clear weather_mist weather_rain
date_time
2011-01-04 00:00:00 12.0 20.0 52.0 52.0 110.0 157.0 157.0 76.0 72.0 77.0 ... 2.0 1.0 1.0 13.0 32.0 40.0 16.0 1.0 0.0 0.0
2011-01-04 01:00:00 5.0 12.0 20.0 52.0 52.0 110.0 157.0 157.0 76.0 72.0 ... 3.0 2.0 1.0 1.0 13.0 32.0 40.0 1.0 0.0 0.0
2011-01-04 02:00:00 2.0 5.0 12.0 20.0 52.0 52.0 110.0 157.0 157.0 76.0 ... 8.0 3.0 2.0 1.0 1.0 13.0 32.0 1.0 0.0 0.0

3 rows × 75 columns

The One Hot Encoder strategy has been shown for didactic purposes. For the rest of the document, however, the native support for categorical variables is used.

Native implementation for categorical features

The gradient boosting libraries (XGBoost, LightGBM, CatBoost, and HistGradientBoostingRegressor) assume that the input categories are integers from 0 to n_categories - 1, i.e. [0, 1, ..., n_categories-1]. In most real cases, categorical variables are not encoded as numbers but as strings, so an intermediate transformation step is necessary. Two options are:

  • Set the columns containing categorical variables to the category type. Internally, this data structure consists of an array of categories and an array of integer values (codes) that point to the actual values in the array of categories. In other words, it is internally a numeric array with a mapping that relates each code to a category. Models are able to automatically identify the columns of type category and access their internal codes. This approach is applicable to XGBoost, LightGBM and CatBoost.

  • Preprocess the categorical columns with an OrdinalEncoder to transform their values to integers and then explicitly indicate which features should be treated as categorical.

To use the automatic detection in skforecast, categorical variables must first be encoded as integers and then stored again as the category type. This is because skforecast internally uses a numeric numpy array to speed up the calculation. A small illustration of how the category type stores its codes is shown below.
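
As a small illustration (using hypothetical values from the weather column), the following shows the integer codes that a pandas category column stores internally, which are what the boosting libraries read:

# Internal integer codes of a pandas 'category' column
# ==============================================================================
import pandas as pd

s = pd.Series(["clear", "mist", "rain", "clear"], dtype="category")
print(s.cat.categories)       # Index(['clear', 'mist', 'rain'], dtype='object')
print(s.cat.codes.tolist())   # [0, 1, 2, 0]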

Warning

The four gradient boosting frameworks – LightGBM, scikit-learn's HistGradientBoostingRegressor, XGBoost, and CatBoost – are capable of directly handling categorical features within the model. However, each framework has its own configuration, benefits and potential problems. To understand how to use each of them, it is highly recommended to consult the skforecast user guide.

When deploying models in production, it is strongly recommended to avoid using automatic detection based on pandas category type columns. Although pandas provides an internal coding for these columns, it is not consistent across different datasets and may vary depending on the categories present in each dataset. It is therefore crucial to be aware of this issue and to take appropriate measures to ensure consistency in the coding of categorical features when deploying models in production. More details on this issue can be found in Native implementation for categorical features.
In [28]:
# Transformer: Ordinal encoding + cast to category type
# ==============================================================================
pipeline_categorical = make_pipeline(
                           OrdinalEncoder(
                               dtype=int,
                               handle_unknown="use_encoded_value",
                               unknown_value=-1,
                               encoded_missing_value=-1
                           ),
                           FunctionTransformer(
                               func=lambda x: x.astype('category'),
                               feature_names_out= 'one-to-one'
                           )
                       )

transformer_exog = make_column_transformer(
                       (
                           pipeline_categorical,
                           make_column_selector(dtype_exclude=np.number)
                       ),
                       remainder="passthrough",
                       verbose_feature_names_out=False,
                   ).set_output(transform="pandas")

When creating a Forecaster with LGBMRegressor, it is necessary to specify how to handle the categorical columns using the fit_kwargs argument. This is because the categorical_feature argument is only specified in the fit method of LGBMRegressor, and not during its initialization.

In [29]:
# Create a forecaster with automatic categorical detection
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = LGBMRegressor(random_state=15926, verbose=-1),
                 lags             = 72,
                 transformer_exog = transformer_exog,
                 fit_kwargs       = {"categorical_feature": "auto"}
             )

This is the strategy that will be used throughout the rest of this document.

Evaluate model with exogenous features

The forecaster is trained again, but this time, the exogenous variables are also included as predictors. For categorical features, the native implementation is used.

In [30]:
# Select exogenous variables to be included in the model
# ==============================================================================
exog_features = []
# Columns that end with _sin or _cos are selected
exog_features.extend(df_exogenous_features.filter(regex='_sin$|_cos$').columns.tolist())
# Columns that start with temp_ are selected
exog_features.extend(df_exogenous_features.filter(regex='^temp_.*').columns.tolist())
# Columns that start with holiday_ are selected
exog_features.extend(df_exogenous_features.filter(regex='^holiday_.*').columns.tolist())
exog_features.extend(['temp', 'holiday', 'weather'])

df_exogenous_features = df_exogenous_features.filter(exog_features, axis=1)
In [31]:
# Merge target and exogenous variables in the same dataframe
# ==============================================================================
data = data[['users', 'weather']].merge(
           df_exogenous_features,
           left_index=True,
           right_index=True,
           how='left'
       )

# Due to the creation of moving averages, there are missing values at the beginning
# of the series. And due to holiday_next_day there are missing values at the end.
# Numeric columns are cast to float32.
data = data.dropna()
data = data.astype({col: np.float32 for col in data.select_dtypes("number").columns})
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

Because many more features are now included in the model, previously identified hyperparameters may no longer be optimal, so a new hyperparameter search is performed. This time, however, the search is tailored to the range of values identified in the previous search.

In [32]:
# Hyperparameters search
# ==============================================================================

# Create forecaster with a transformer categorical features
forecaster = ForecasterAutoreg(
                 regressor        = LGBMRegressor(random_state=15926, verbose=-1),
                 lags             = 72,
                 transformer_exog = transformer_exog,
                 fit_kwargs       = {"categorical_feature": "auto"}
             )

# Lags grid
lags_grid = [[1, 2, 3, 23, 24, 25, 167, 168, 169]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'n_estimators' : trial.suggest_int('n_estimators', 800, 1400, step=100),
        'max_depth'    : trial.suggest_int('max_depth', 3, 8, step=1),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
        'reg_alpha'    : trial.suggest_float('reg_alpha', 0, 1, step=0.1),
        'reg_lambda'   : trial.suggest_float('reg_lambda', 0, 1, step=0.1),
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'users'],
                                   exog               = data.loc[:end_validation, exog_features],
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data_train),
                                   fixed_train_size   = False,
                                   n_trials           = 20,
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 20,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [  1   2   3  23  24  25 167 168 169] 
  Parameters: {'n_estimators': 1100, 'max_depth': 6, 'learning_rate': 0.012833462456300015, 'reg_alpha': 0.2, 'reg_lambda': 0.1}
  Backtesting metric: 64.36159818293837

In [33]:
# Backtesting model with exogenous variables on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['users'],
                          exog               = data[exog_features],
                          steps              = 36,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False,
                          show_progress      = True
                      )

print(f"Backtest error: {metric:.2f}")
Backtest error: 47.54
In [34]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width=800,
    height=350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

The inclusion of exogenous variables as predictors further improves the predictive capacity of the model.

Model explainability

Due to the complex nature of many modern machine learning models, such as ensemble methods, they often function as black boxes, making it difficult to understand why a particular prediction was made. Explainability techniques aim to demystify these models, providing insight into their inner workings and helping to build trust, improve transparency, and meet regulatory requirements in various domains. Enhancing model explainability not only helps to understand model behavior, but also helps to identify biases, improve model performance, and enable stakeholders to make more informed decisions based on machine learning insights.

Skforecast is compatible with some of the most popular model explainability methods: model-specific feature importances, SHAP values, and partial dependence plots.

In [35]:
# Create and train forecaster
# ==============================================================================
params = {
    'n_estimators': 900,
    'max_depth': 7,
    'learning_rate': 0.022655,
    'reg_alpha': 0.8,
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = [1, 2, 3, 23, 24, 25, 167, 168, 169],
                 transformer_exog = transformer_exog,
                 fit_kwargs       = {"categorical_feature": "auto"}
             )
forecaster.fit(
    y    = data.loc[:end_validation, 'users'],
    exog = data.loc[:end_validation, exog_features]
)

Model-specific feature importance

In [36]:
# Extract feature importance
# ==============================================================================
importance = forecaster.get_feature_importances()
importance.sort_values(by='importance', ascending=False).head(10)
Out[36]:
feature importance
0 lag_1 2543
7 lag_168 1974
8 lag_169 1420
4 lag_24 1362
6 lag_167 1283
1 lag_2 1133
2 lag_3 1081
3 lag_23 1007
5 lag_25 779
96 temp 750

Warning

The get_feature_importances() method will only return values if the forecaster's regressor has either the coef_ or feature_importances_ attribute, as is the case for most scikit-learn regressors.

Shap values

SHAP (SHapley Additive exPlanations) values are a popular method for explaining machine learning models, as they help to understand how variables and values influence predictions visually and quantitatively.

It is possible to generate SHAP-value explanations for skforecast models with just two essential elements:

  • The internal regressor of the forecaster.

  • The training matrices created from the time series and used to fit the forecaster.

By leveraging these two components, users can create insightful and interpretable explanations for their skforecast models. These explanations can be used to verify the reliability of the model, identify the most significant factors that contribute to model predictions, and gain a deeper understanding of the underlying relationship between the input variables and the target variable.

In [37]:
# Training matrices used by the forecaster to fit the internal regressor
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                       y    = data.loc[:end_validation, 'users'],
                       exog = data.loc[:end_validation, exog_features]
                   )

display(X_train.head(3))
display(y_train.head(3))
lag_1 lag_2 lag_3 lag_23 lag_24 lag_25 lag_167 lag_168 lag_169 weather ... temp_roll_mean_1_day temp_roll_mean_7_day temp_roll_max_1_day temp_roll_min_1_day temp_roll_max_7_day temp_roll_min_7_day holiday_previous_day holiday_next_day temp holiday
date_time
2011-01-15 01:00:00 28.0 27.0 36.0 1.0 5.0 14.0 16.0 16.0 25.0 1 ... 6.594167 6.535595 9.84 4.1 9.84 3.28 0.0 0.0 6.56 0.0
2011-01-15 02:00:00 20.0 28.0 27.0 1.0 1.0 5.0 7.0 16.0 16.0 1 ... 6.696667 6.530715 9.84 4.1 9.84 3.28 0.0 0.0 6.56 0.0
2011-01-15 03:00:00 12.0 20.0 28.0 1.0 1.0 1.0 1.0 7.0 16.0 1 ... 6.799167 6.525833 9.84 4.1 9.84 3.28 0.0 0.0 6.56 0.0

3 rows × 98 columns

date_time
2011-01-15 01:00:00    20.0
2011-01-15 02:00:00    12.0
2011-01-15 03:00:00     8.0
Freq: H, Name: y, dtype: float32
In [38]:
# Create SHAP explainer (for tree-based models)
# ==============================================================================
explainer = shap.TreeExplainer(forecaster.regressor)

# Sample 50% of the data to speed up the calculation
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train)*0.5), replace=False)
X_train_sample = X_train.loc[sample, :]

shap_values = explainer.shap_values(X_train_sample)

✎ Note

The SHAP library includes several explainers, each designed for a different type of model. The shap.TreeExplainer is used for tree-based models, such as the LGBMRegressor used in this example. For more information, see the SHAP documentation.
In [39]:
# Shap summary plot (top 10)
# ==============================================================================
shap.summary_plot(shap_values, X_train_sample, max_display=10, show=False)
fig, ax = plt.gcf(), plt.gca()
ax.set_title("SHAP Summary plot")
ax.tick_params(labelsize=8)
fig.set_size_inches(10, 4.5)
In [40]:
# Force plot for the first observation
# ==============================================================================
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train_sample.iloc[0,:])
Out[40]:
(Interactive SHAP force plot omitted: rendering it requires the SHAP Javascript library, which is not available in this static version of the notebook.)

Feature selection

Feature selection is the process of selecting a subset of relevant features for use in model construction. It is an important step in the machine learning process, as it can help to reduce overfitting, improve model accuracy, and reduce training time. Since the underlying regressors of skforecast follow the scikit-learn API, it is possible to use the feature selection methods available in scikit-learn. Two of the most popular methods are Recursive Feature Elimination and Sequential Feature Selection.

💡 Tip

Feature selection is a powerful tool for improving the performance of machine learning models. However, it is computationally expensive and can be time-consuming. Since the goal is to find the best subset of features, not the best model, it is not necessary to use the entire data set or a highly complex model. Instead, it is recommended to use a small subset of the data and a simple model. Once the best subset of features has been identified, the model can then be trained using the entire dataset and a more complex configuration.

For example, in this use case, the best model is an LGBMRegressor with 900 trees and a maximum depth of 7. However, to find the best subset of features, only 100 trees and a maximum depth of 5 are used.
In [41]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = LGBMRegressor(random_state=15926, verbose=-1),
                 lags             = [1, 2, 3, 23, 24, 25, 167, 168, 169],
                 transformer_exog = transformer_exog,
                 fit_kwargs       = {"categorical_feature": "auto"}
             )

# Underlying matrices used by the forecaster to fit the internal regressor
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                       y    = data_train['users'],
                       exog = data_train[exog_features]
                   )

Scikit-learn's RFECV starts by training a model on the initial set of features, and obtaining the importance of each feature (through attributes such as coef_ or feature_importances_). Then, in each round, the least important features are iteratively removed, followed by cross-validation to calculate the performance of the model with the remaining features. This process continues until further feature removal doesn't improve or starts to degrade the performance of the model (based on a chosen metric), or the min_features_to_select is reached.

The final result is an optimal subset of features that ideally balances model simplicity and predictive power, as determined by the cross-validation process.

In [42]:
# Recursive feature elimination with cross-validation
# ==============================================================================
# Sample 50% of the data to speed up the calculation
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train)*0.5), replace=False)
X_train_sample = X_train.loc[sample, :]
y_train_sample = y_train.loc[sample]

regressor = LGBMRegressor(
                n_estimators = 100,
                max_depth    = 5,
                random_state = 15926,
                verbose      = -1
            )

selector = RFECV(
    estimator              = regressor,
    step                   = 1,
    cv                     = 3,
    min_features_to_select = 25,
    n_jobs                 = -1
)
selector.fit(X_train_sample, y_train_sample)
selected_features_rfe = selector.get_feature_names_out()

print("Recursive feature elimination")
print("-----------------------------")
print(f"Total number of features available: {X_train.shape[1]}") 
print(f"Total number of records available: {X_train.shape[0]}")
print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")
print(f"Number of features selected: {len(selected_features_rfe)}")
print(f"Selected features : \n {selected_features_rfe}")
Recursive feature elimination
-----------------------------
Total number of features available: 98
Total number of records available: 10607
Total number of records used for feature selection: 5303
Number of features selected: 65
Selected features : 
 ['lag_1' 'lag_2' 'lag_3' 'lag_23' 'lag_24' 'lag_25' 'lag_167' 'lag_168'
 'lag_169' 'weather' 'month_sin' 'week_of_year_sin' 'week_day_sin'
 'hour_day_sin' 'hour_day_cos' 'poly_month_sin__week_of_year_sin'
 'poly_month_sin__week_of_year_cos' 'poly_month_sin__week_day_sin'
 'poly_month_sin__hour_day_sin' 'poly_month_sin__hour_day_cos'
 'poly_month_sin__sunrise_hour_cos' 'poly_month_sin__sunset_hour_cos'
 'poly_month_cos__week_of_year_sin' 'poly_month_cos__week_day_cos'
 'poly_month_cos__hour_day_sin' 'poly_month_cos__hour_day_cos'
 'poly_month_cos__sunset_hour_cos' 'poly_week_of_year_sin__week_day_sin'
 'poly_week_of_year_sin__week_day_cos'
 'poly_week_of_year_sin__hour_day_sin'
 'poly_week_of_year_sin__hour_day_cos'
 'poly_week_of_year_sin__sunrise_hour_cos'
 'poly_week_of_year_sin__sunset_hour_cos'
 'poly_week_of_year_cos__week_day_sin'
 'poly_week_of_year_cos__week_day_cos'
 'poly_week_of_year_cos__hour_day_sin'
 'poly_week_of_year_cos__hour_day_cos'
 'poly_week_of_year_cos__sunrise_hour_sin'
 'poly_week_of_year_cos__sunset_hour_sin'
 'poly_week_of_year_cos__sunset_hour_cos'
 'poly_week_day_sin__week_day_cos' 'poly_week_day_sin__hour_day_sin'
 'poly_week_day_sin__hour_day_cos' 'poly_week_day_sin__sunrise_hour_cos'
 'poly_week_day_sin__sunset_hour_sin' 'poly_week_day_sin__sunset_hour_cos'
 'poly_week_day_cos__hour_day_sin' 'poly_week_day_cos__hour_day_cos'
 'poly_week_day_cos__sunrise_hour_cos'
 'poly_week_day_cos__sunset_hour_sin' 'poly_week_day_cos__sunset_hour_cos'
 'poly_hour_day_sin__hour_day_cos' 'poly_hour_day_sin__sunrise_hour_sin'
 'poly_hour_day_sin__sunrise_hour_cos'
 'poly_hour_day_sin__sunset_hour_sin' 'poly_hour_day_sin__sunset_hour_cos'
 'poly_hour_day_cos__sunrise_hour_sin'
 'poly_hour_day_cos__sunset_hour_sin' 'poly_hour_day_cos__sunset_hour_cos'
 'temp_roll_mean_1_day' 'temp_roll_mean_7_day' 'temp_roll_max_1_day'
 'temp_roll_min_1_day' 'temp' 'holiday']
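
Sequential Feature Selection, the other method mentioned above, can be applied in the same way. Below is a minimal sketch using scikit-learn's SequentialFeatureSelector on the same sampled training matrices; it is not executed here, and all results reported in the rest of the document correspond to the RFECV selection. Note that forward selection fits many models and is usually slower than RFECV with this number of features.

# Sequential feature selection (sketch, not executed)
# ==============================================================================
from sklearn.feature_selection import SequentialFeatureSelector

selector_sfs = SequentialFeatureSelector(
                   estimator            = regressor,  # same small LGBMRegressor as above
                   n_features_to_select = 25,
                   direction            = "forward",
                   scoring              = "neg_mean_absolute_error",
                   cv                   = 3,
                   n_jobs               = -1
               )
selector_sfs.fit(X_train_sample, y_train_sample)
selected_features_sfs = selector_sfs.get_feature_names_out()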

The forecaster is trained and re-evaluated using the best subset of features.

In [43]:
# Create forecaster with the selected features
# ==============================================================================
params = {
    'n_estimators': 900, 
    'max_depth': 7, 
    'learning_rate': 0.022655,
    'reg_alpha': 0.8, 
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}

selected_lags = [
    int(feature.replace("lag_", ""))
    for feature in selected_features_rfe
    if feature.startswith("lag_")
]
selected_exog_features = [
    feature
    for feature in selected_features_rfe
    if not feature.startswith("lag_")
]

forecaster = ForecasterAutoreg(
                 regressor        = LGBMRegressor(**params),
                 lags             = selected_lags,
                 transformer_exog = transformer_exog,
                 fit_kwargs       = {"categorical_feature": "auto"}
             )

forecaster.fit(y=data_train['users'], exog=data_train[selected_exog_features])

# Backtesting model with exogenous variables on test data
# ==============================================================================
metric_lgbm, predictions = backtesting_forecaster(
                               forecaster         = forecaster,
                               y                  = data['users'],
                               exog               = data[selected_exog_features],
                               steps              = 36,
                               metric             = 'mean_absolute_error',
                               initial_train_size = len(data[:end_validation]),
                               refit              = False,
                               n_jobs             = 'auto',
                               verbose            = False,
                               show_progress      = True
                           )

print(f"Backtest error: {metric_lgbm:.2f}")
Backtest error: 47.39

The performance of the model remains similar to that of the model trained with all the features. However, the model is now much simpler, which will make it faster to train and less prone to overfitting. For the rest of the document, the model will be trained using only the most important features.

In [44]:
# Update exogenous variables
# ==============================================================================
exog_features = selected_exog_features

XGBoost, CatBoost and HistGradientBoostingRegressor

Following the success of gradient boosting as a machine learning algorithm, several implementations have been developed. In addition to LightGBM, three other implementations are very popular: XGBoost, CatBoost and HistGradientBoostingRegressor. All of them are compatible with skforecast.

The following sections show how to use these implementations to build forecasting models, with an emphasis on using their native support for categorical features.

XGBoost

In [45]:
# Transformer: Ordinal encoding + cast to category type
# ==============================================================================
pipeline_categorical = make_pipeline(
                           OrdinalEncoder(
                               dtype=int,
                               handle_unknown="use_encoded_value",
                               unknown_value=-1,
                               encoded_missing_value=-1
                           ),
                           FunctionTransformer(
                               func=lambda x: x.astype('category'),
                               feature_names_out= 'one-to-one'
                           )
                       )

transformer_exog = make_column_transformer(
                       (
                           pipeline_categorical,
                           make_column_selector(dtype_exclude=np.number)
                       ),
                       remainder="passthrough",
                       verbose_feature_names_out=False,
                   ).set_output(transform="pandas")
In [46]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = XGBRegressor(
                                 tree_method = 'hist',
                                 enable_categorical = True,
                                 random_state = 123
                             ),
                 lags = 24,
                 transformer_exog = transformer_exog
             )
In [47]:
# Hyperparameters search
# ==============================================================================
# Lags grid
lags_grid = [[1, 2, 3, 23, 24, 25, 167, 168, 169]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'n_estimators'    : trial.suggest_int('n_estimators', 400, 1200, step=100),
        'max_depth'       : trial.suggest_int('max_depth', 3, 10, step=1),
        'learning_rate'   : trial.suggest_float('learning_rate', 0.01, 1),
        'subsample'       : trial.suggest_float('subsample', 0.1, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1),
        'gamma'           : trial.suggest_float('gamma', 0, 1),
        'reg_alpha'       : trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda'      : trial.suggest_float('reg_lambda', 0, 1),
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'users'],
                                   exog               = data.loc[:end_validation, exog_features],
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data_train),
                                   fixed_train_size   = False,
                                   n_trials           = 20,
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 20,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [  1   2   3  23  24  25 167 168 169] 
  Parameters: {'n_estimators': 600, 'max_depth': 9, 'learning_rate': 0.16952423395254382, 'subsample': 0.8288920720686112, 'colsample_bytree': 0.8159829075422989, 'gamma': 0.40403004662054776, 'reg_alpha': 0.42402512958262695, 'reg_lambda': 0.007444368802853596}
  Backtesting metric: 65.44580236997675

In [48]:
# Backtesting test data
# ==============================================================================
metric_xgboost, predictions = backtesting_forecaster(
                                  forecaster         = forecaster,
                                  y                  = data['users'],
                                  exog               = data[exog_features],
                                  initial_train_size = len(data.loc[:end_validation]),
                                  fixed_train_size   = False,
                                  steps              = 36,
                                  refit              = False,
                                  metric             = 'mean_absolute_error',
                                  n_jobs             = 'auto',
                                  verbose            = False
                              )

print(f"Backtest error: {metric_xgboost:.2f}")
Backtest error: 50.62

HistGradientBoostingRegressor

When creating a Forecaster using HistGradientBoostingRegressor, the names of the categorical columns should be specified during the regressor instantiation by passing them as a list to the categorical_features argument.

In [49]:
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (non-numerical) features
# using ordinal encoding. Numerical features are left untouched. Missing values
# are encoded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = ['weather']
transformer_exog = make_column_transformer(
                       (
                           OrdinalEncoder(
                               dtype=int,
                               handle_unknown="use_encoded_value",
                               unknown_value=-1,
                               encoded_missing_value=-1
                           ),
                           categorical_features
                       ),
                       remainder="passthrough",
                       verbose_feature_names_out=False,
                   ).set_output(transform="pandas")
In [50]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = HistGradientBoostingRegressor(
                                 categorical_features=categorical_features,
                                 random_state=123
                             ),
                 lags = 24,
                 transformer_exog = transformer_exog
             )
In [51]:
# Hyperparameters search
# ==============================================================================
# Lags grid
lags_grid = [[1, 2, 3, 23, 24, 25, 167, 168, 169]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'max_iter'          : trial.suggest_int('max_iter', 400, 1200, step=100),
        'max_depth'         : trial.suggest_int('max_depth', 3, 10, step=1),
        'learning_rate'     : trial.suggest_float('learning_rate', 0.01, 1),
        'min_samples_leaf'  : trial.suggest_int('min_samples_leaf', 1, 20, step=1),
        'l2_regularization' : trial.suggest_float('l2_regularization', 0, 1)
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'users'],
                                   exog               = data.loc[:end_validation, exog_features],
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data_train),
                                   fixed_train_size   = False,
                                   n_trials           = 20,
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 20,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [  1   2   3  23  24  25 167 168 169] 
  Parameters: {'max_iter': 1100, 'max_depth': 10, 'learning_rate': 0.025545046292043277, 'min_samples_leaf': 6, 'l2_regularization': 0.1718958203525952}
  Backtesting metric: 62.312120332509

In [52]:
# Backtesting model with exogenous variables on test data
# ==============================================================================
metric_histgb, predictions = backtesting_forecaster(
                                 forecaster         = forecaster,
                                 y                  = data['users'],
                                 exog               = data[exog_features],
                                 steps              = 36,
                                 metric             = 'mean_absolute_error',
                                 initial_train_size = len(data[:end_validation]),
                                 refit              = False,
                                 n_jobs             = 'auto',
                                 verbose            = False,
                                 show_progress      = True
                             )

print(f"Backtest error: {metric_histgb:.2f}")
Backtest error: 48.51

CatBoost

Unfortunately, the current version of skforecast is not compatible with CatBoost's built-in handling of categorical features. The issue arises because CatBoost only accepts categorical features as integers, while skforecast converts input data to floats for faster computation using numpy arrays in the internal prediction process. To work around this limitation, it is necessary to apply either the one-hot encoding or label encoding strategy to the categorical features before using them with CatBoost.
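
The next cells use the one-hot encoding strategy. As an alternative sketch, the label (ordinal) encoding workaround could be implemented with a transformer such as the following, which converts the categorical columns into plain integer codes that CatBoost then treats as numeric features (note that, unlike the pipeline used for LightGBM, there is no cast back to the category type):

# Alternative sketch: ordinal encoding without the cast to 'category' (not executed)
# ==============================================================================
ordinal_encoder = make_column_transformer(
                      (
                          OrdinalEncoder(
                              dtype=int,
                              handle_unknown="use_encoded_value",
                              unknown_value=-1,
                              encoded_missing_value=-1
                          ),
                          make_column_selector(dtype_exclude=np.number),
                      ),
                      remainder="passthrough",
                      verbose_feature_names_out=False,
                  ).set_output(transform="pandas")

# forecaster = ForecasterAutoreg(
#                  regressor        = CatBoostRegressor(random_state=123, silent=True),
#                  lags             = 24,
#                  transformer_exog = ordinal_encoder
#              )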

In [53]:
# One hot encoding
# ==============================================================================
one_hot_encoder = make_column_transformer(
                      (
                          OneHotEncoder(sparse_output=False, drop='if_binary'),
                          make_column_selector(dtype_exclude=np.number),
                      ),
                      remainder="passthrough",
                      verbose_feature_names_out=False,
                  ).set_output(transform="pandas")

# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = CatBoostRegressor(
                                 random_state=123, 
                                 silent=True, 
                                 allow_writing_files=False,
                                 boosting_type = 'Plain', # Faster training
                                 leaf_estimation_iterations = 3, # Faster training
                             ),
                 lags = 24,
                 transformer_exog = one_hot_encoder
             )
In [54]:
# Hyperparameters search
# ==============================================================================
# Lags grid
lags_grid = [[1, 2, 3, 23, 24, 25, 167, 168, 169]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'n_estimators'  : trial.suggest_int('n_estimators', 100, 1000, step=100),
        'max_depth'     : trial.suggest_int('max_depth', 3, 10, step=1),
        'learning_rate' : trial.suggest_float('learning_rate', 0.01, 1),
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'users'],
                                   exog               = data.loc[:end_validation, exog_features],
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   initial_train_size = len(data_train),
                                   fixed_train_size   = False,
                                   n_trials           = 20,
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 20,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [  1   2   3  23  24  25 167 168 169] 
  Parameters: {'n_estimators': 1000, 'max_depth': 3, 'learning_rate': 0.021472902970099742}
  Backtesting metric: 59.34519333454538

In [55]:
# Backtesting on test data
# ==============================================================================
metric_catboost, predictions = backtesting_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data['users'],
                                   exog               = data[exog_features],
                                   initial_train_size = len(data.loc[:end_validation]),
                                   fixed_train_size   = False,
                                   steps              = 36,
                                   refit              = False,
                                   metric             = 'mean_absolute_error',
                                   n_jobs             = 'auto',
                                   verbose            = False
                               )

print(f"Backtest error: {metric_catboost:.2f}")
Backtest error: 53.06

Conclusion

  • Using gradient boosting models in forecasting problems is straightforward thanks to the functionality offered by skforecast.

  • As shown in this document, the inclusion of exogenous variables as predictors can significantly improve forecasting performance.

  • Categorical features can be included as exogenous variables without the need for preprocessing. This is possible thanks to the native support for categorical features offered by LightGBM, XGBoost, CatBoost and HistGradientBoostingRegressor.

In [56]:
pd.DataFrame({
    'Model': ['Baseline', 'LGBMRegressor', 'XGBRegressor',
              'HistGradientBoostingRegressor', 'CatBoostRegressor'],
    'MAE': [metric_baseline, metric_lgbm, metric_xgboost,
            metric_histgb, metric_catboost]
}).round(2).sort_values(by='MAE')
Out[56]:
Model MAE
1 LGBMRegressor 47.39
3 HistGradientBoostingRegressor 48.51
2 XGBRegressor 50.62
4 CatBoostRegressor 53.06
0 Baseline 91.67

✎ Note

To illustrate the process, the hyperparameter search has been kept small. However, to achieve optimal results, it may be necessary to perform a more extensive search for each model.

Session information

In [57]:
import session_info
session_info.show(html=False)
-----
astral              3.2
catboost            1.2.2
lightgbm            4.1.0
matplotlib          3.7.2
numpy               1.25.2
optuna              3.2.0
pandas              2.0.3
plotly              5.17.0
session_info        1.0.0
shap                0.43.0
skforecast          0.11.0
sklearn             1.3.0
statsmodels         0.14.0
xgboost             1.7.6
-----
IPython             8.14.0
jupyter_client      8.3.0
jupyter_core        5.3.1
jupyterlab          4.0.5
notebook            6.5.4
-----
Python 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]
Linux-5.15.0-1050-aws-x86_64-with-glibc2.31
-----
Session information updated at 2023-12-14 10:09


Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Forecasting time series with gradient boosting: Skforecast, XGBoost, LightGBM and CatBoost by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under an Attribution-NonCommercial-ShareAlike 4.0 International license at https://www.cienciadedatos.net/documentos/py39-forecasting-time-series-with-skforecast-xgboost-lightgbm-catboost.html

How to cite skforecast

If you use skforecast for a scientific publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2023). skforecast (v0.10.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2023). skforecast (Version 0.10.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.10.0}, month = {9}, year = {2023}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }



Creative Commons Licence
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.