Forecasting energy demand with machine learning

If you like Skforecast, help us by giving it a star on GitHub! ⭐️


Joaquín Amat Rodrigo, Javier Escobar Ortiz
April, 2021 (last update December 2023)

Introduction

Energy demand forecasting plays a critical role in effectively managing and planning resources for power generation, distribution, and utilization. Predicting energy demand is a complex task influenced by factors such as weather patterns, economic conditions, and societal behavior. This document shows how to build machine learning forecasting models to predict energy demand.

Time series and forecasting

A time series is a sequence of chronologically ordered data at equal or unequal intervals. The forecasting process consists of predicting the future value of a time series, either by modelling the series solely on the basis of its past behavior (autoregressive) or by using other external variables.


When working with time series, it is rarely necessary to predict only the next element in the series ($t_{+1}$). Instead, the most common goal is to forecast a whole future interval ($t_{+1}$, ..., $t_{+n}$) or a point far in the future ($t_{+n}$). Several strategies can be used to generate this type of forecast; skforecast implements the following for univariate time series forecasting:

  • Recursive multi-step forecasting: since the value $t_{n-1}$ is needed to predict $t_{n}$, and $t_{n-1}$ is unknown, a recursive process is applied in which each new prediction is based on the previous one. This process is known as recursive forecasting or recursive multi-step forecasting and can be easily generated with the ForecasterAutoreg and ForecasterAutoregCustom classes (see the sketch after this list).

  • Direct multi-step forecasting: this method consists of training a different model for each step of the forecast horizon. For example, to predict the next 5 values of a time series, 5 different models are trained, one for each step. As a result, the predictions are independent of each other. This entire process is automated in the ForecasterAutoregDirect class.
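
As an illustration of the recursive strategy, the following minimal sketch (illustrative only, not part of the notebook) assumes a hypothetical scikit-learn-style regressor, model, already fitted to predict one step ahead from the previous lag values. skforecast's ForecasterAutoreg automates exactly this loop.

# Minimal sketch of recursive multi-step forecasting. `model` is a hypothetical
# fitted one-step-ahead regressor that takes the previous lag values as input.
import numpy as np

def recursive_forecast(model, last_window, steps):
    """Predict `steps` future values, feeding each prediction back as a new lag."""
    window = np.asarray(last_window, dtype=float)
    predictions = []
    for _ in range(steps):
        pred = model.predict(window.reshape(1, -1))[0]  # one-step-ahead prediction
        predictions.append(pred)
        window = np.append(window[1:], pred)  # slide the lag window forward
    return np.array(predictions)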

Libraries

The libraries used in this document are:

In [1]:
# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd
from astral.sun import sun
from astral import LocationInfo
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.renderers.default = 'notebook' 
pio.templates.default = "seaborn"
poff.init_notebook_mode(connected=True)
plt.style.use('seaborn-v0_8-darkgrid')

# Modelling and Forecasting
# ==============================================================================
from lightgbm import LGBMRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFECV
from skforecast.ForecasterBaseline import ForecasterEquivalentDate
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
from skforecast.model_selection import bayesian_search_forecaster
from skforecast.model_selection import backtesting_forecaster
import shap
shap.initjs()

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data

A time series of electricity demand (MW) for the state of Victoria (Australia) is available from 2011-12-31 to 2014-12-31. The data used in this document has been obtained from the R tsibbledata package. The dataset contains 5 columns and 52,608 complete records. The information in each column is:

  • Time: date and time of the record.
  • Date: date of the record.
  • Demand: electricity demand (MW).
  • Temperature: temperature in Melbourne, the capital of Victoria.
  • Holiday: indicates if the day is a public holiday.
In [2]:
# Data download
# ==============================================================================
data = fetch_dataset(name='vic_electricity', raw=True)
data.info()
vic_electricity
---------------
Half-hourly electricity demand for Victoria, Australia
O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022).tsibbledata: Diverse
Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,
https://github.com/tidyverts/tsibbledata/.
https://tsibbledata.tidyverts.org/reference/vic_elec.html
Shape of the dataset: (52608, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52608 entries, 0 to 52607
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Time         52608 non-null  object 
 1   Demand       52608 non-null  float64
 2   Temperature  52608 non-null  float64
 3   Date         52608 non-null  object 
 4   Holiday      52608 non-null  bool   
dtypes: bool(1), float64(2), object(2)
memory usage: 1.7+ MB

The Time column is stored as a string. To convert it to datetime, the function pd.to_datetime() is used. Once in datetime format, and to make use of pandas functionalities, it is set as an index. Since the data was recorded every 30 minutes, the frequency '30min' is also specified.

In [3]:
# Data preparation
# ==============================================================================
data = data.copy()
data['Time'] = pd.to_datetime(data['Time'], format='%Y-%m-%dT%H:%M:%SZ')
data = data.set_index('Time')
data = data.asfreq('30min')
data = data.sort_index()
data.head(2)
Out[3]:
Demand Temperature Date Holiday
Time
2011-12-31 13:00:00 4382.825174 21.40 2012-01-01 True
2011-12-31 13:30:00 4263.365526 21.05 2012-01-01 True

One of the first analyses to be carried out when working with time series is to check that the series is complete, that is, that there are no missing values.

In [4]:
# Verify that the time index is complete
# ==============================================================================
is_index_complete = (data.index == pd.date_range(
                         start = data.index.min(),
                         end   = data.index.max(),
                         freq  = data.index.freq)
                    ).all()
print(f"Index complete: {is_index_complete}")
print(f"Fraction of rows with missing values: {data.isnull().any(axis=1).mean()}")
Index complete: True
Fraction of rows with missing values: 0.0
In [5]:
# Fill gaps in a time index
# ==============================================================================
# data.asfreq(freq='30min', fill_value=np.nan)

Although the data is at 30-minute intervals, the aim is to create a model capable of predicting hourly electricity demand, so the data needs to be aggregated. This type of transformation can be done very quickly by combining the pandas datetime index and its resample() method.

It is very important to use the closed='left' and label='right' arguments correctly to avoid introducing future information into the training (leakage). Suppose that values are available for 10:10, 10:30, 10:45, 11:00, 11:12, and 11:30. To obtain the hourly average, the value assigned to 11:00 must be calculated using the values for 10:10, 10:30, and 10:45; and the value assigned to 12:00 must be calculated using the values for 11:00, 11:12, and 11:30.

Diagram of temporal data aggregation excluding forward-looking information.

The 11:00 average does not include the 11:00 point value because in reality the value is not available at that exact time.
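
This behavior can be checked with a small toy example (illustrative only; it uses the timestamps from the paragraph above):

# Toy example: closed='left' and label='right' assign the 10:10, 10:30 and 10:45
# values to the bucket labelled 11:00, and 11:00, 11:12 and 11:30 to 12:00.
import pandas as pd

toy = pd.Series(
          [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
          index=pd.to_datetime(['2024-01-01 10:10', '2024-01-01 10:30',
                                '2024-01-01 10:45', '2024-01-01 11:00',
                                '2024-01-01 11:12', '2024-01-01 11:30'])
      )
print(toy.resample(rule='H', closed='left', label='right').mean())
# 11:00 -> mean(1, 2, 3) = 2.0 ; 12:00 -> mean(4, 5, 6) = 5.0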

In [6]:
# Aggregating in 1H intervals
# ==============================================================================
# The Date column is eliminated so that it does not generate an error when aggregating.
# The Holiday column does not generate an error since it is Boolean and is treated as 0-1.
data = data.drop(columns='Date')
data = data.resample(rule='H', closed='left', label='right').mean()
data
Out[6]:
Demand Temperature Holiday
Time
2011-12-31 14:00:00 4323.095350 21.225 1.0
2011-12-31 15:00:00 3963.264688 20.625 1.0
2011-12-31 16:00:00 3950.913495 20.325 1.0
2011-12-31 17:00:00 3627.860675 19.850 1.0
2011-12-31 18:00:00 3396.251676 19.025 1.0
... ... ... ...
2014-12-31 09:00:00 4069.625550 21.600 0.0
2014-12-31 10:00:00 3909.230704 20.300 0.0
2014-12-31 11:00:00 3900.600901 19.650 0.0
2014-12-31 12:00:00 3758.236494 18.100 0.0
2014-12-31 13:00:00 3785.650720 17.200 0.0

26304 rows × 3 columns

The dataset starts on 2011-12-31 14:00:00 and ends on 2014-12-31 13:00:00. The first 10 and the last 14 records are discarded so that the series starts on 2012-01-01 00:00:00 and ends on 2014-12-30 23:00:00. In addition, in order to optimize the hyperparameters of the model and evaluate its predictive ability, the data is divided into 3 sets: training, validation and test.

In [7]:
# Split data into train-val-test
# ==============================================================================
data = data.loc['2012-01-01 00:00:00': '2014-12-30 23:00:00'].copy()
end_train = '2013-12-31 23:59:00'
end_validation = '2014-11-30 23:59:00'
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

print(f"Train dates      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Validation dates : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Test dates       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates      : 2012-01-01 00:00:00 --- 2013-12-31 23:00:00  (n=17544)
Validation dates : 2014-01-01 00:00:00 --- 2014-11-30 23:00:00  (n=8016)
Test dates       : 2014-12-01 00:00:00 --- 2014-12-30 23:00:00  (n=720)

Graphic exploration

Graphical exploration of time series can be an effective way of identifying trends, patterns, and seasonal variations. This, in turn, helps to guide the selection of the most appropriate forecasting model.

Plot time series

Full time series

In [8]:
# Interactive plot of time series
# ==============================================================================
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_train.index, y=data_train['Demand'], mode='lines', name='Train'))
fig.add_trace(go.Scatter(x=data_val.index, y=data_val['Demand'], mode='lines', name='Validation'))
fig.add_trace(go.Scatter(x=data_test.index, y=data_test['Demand'], mode='lines', name='Test'))
fig.update_layout(
    title  = 'Hourly energy demand',
    xaxis_title="Time",
    yaxis_title="Demand",
    legend_title="Partition:",
    width=850,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1,
        xanchor="left",
        x=0.001
    )
)
#fig.update_xaxes(rangeslider_visible=True)
fig.show()

The graph above shows that electricity demand has an annual seasonality. There is a peak in July and very pronounced peaks in demand between January and March.

Section of the time series

Due to the variance of the time series, the intraday pattern cannot be appreciated in a single chart covering the whole period.

In [9]:
# Zooming time series chart
# ==============================================================================
zoom = ('2013-05-01 14:00:00','2013-06-01 14:00:00')

fig = plt.figure(figsize=(8, 4))
grid = plt.GridSpec(nrows=8, ncols=1, hspace=0.6, wspace=0)

main_ax = fig.add_subplot(grid[:3, :])
data.Demand.plot(ax=main_ax, c='black', alpha=0.5, linewidth=0.5)
min_y = min(data.Demand)
max_y = max(data.Demand)
main_ax.fill_between(zoom, min_y, max_y, facecolor='blue', alpha=0.5, zorder=0)
main_ax.set_title(f'Electricity demand: {data.index.min()}, {data.index.max()}', fontsize=10)
main_ax.set_xlabel('')

zoom_ax = fig.add_subplot(grid[5:, :])
data.loc[zoom[0]: zoom[1]].Demand.plot(ax=zoom_ax, color='blue', linewidth=1)
zoom_ax.set_title(f'Electricity demand: {zoom}', fontsize=10)
zoom_ax.set_xlabel('')

plt.subplots_adjust(hspace=1)
plt.show();

When the time series is zoomed in, a clear weekly seasonality can be seen, with higher consumption during the working week (Monday to Friday) and lower consumption at weekends. There is also a clear correlation between the consumption of one day and that of the previous day.

Seasonality plots

Seasonal plots are a useful tool for identifying seasonal patterns and trends in a time series. They are created by averaging the values of each season over time and then plotting them against time.

Annual, weekly and daily seasonality

In [10]:
# Demand distribution by month
# ==============================================================================
fig, ax = plt.subplots(figsize=(5.5, 2.5))
data['month'] = data.index.month
data.boxplot(column='Demand', by='month', ax=ax,)
data.groupby('month')['Demand'].median().plot(style='o-', linewidth=0.8, ax=ax)
ax.set_ylabel('Demand')
ax.set_title('Demand distribution by month')
fig.suptitle('')
plt.show();

It is observed that there is an annual seasonality, with higher (median) demand values in June, July, and August, and with high demand peaks in November, December, January, February, and March.

In [11]:
# Demand distribution by week day
# ==============================================================================
fig, ax = plt.subplots(figsize=(5.5, 2.5))
data['week_day'] = data.index.day_of_week + 1
data.boxplot(column='Demand', by='week_day', ax=ax)
data.groupby('week_day')['Demand'].median().plot(style='o-', linewidth=0.8, ax=ax)
ax.set_ylabel('Demand')
ax.set_title('Demand distribution by week day')
fig.suptitle('')
plt.show();

Weekly seasonality shows lower demand values during the weekend.

In [12]:
# Demand distribution by the hour of the day
# ==============================================================================
fig, ax = plt.subplots(figsize=(5.5, 2.5))
data['hour_day'] = data.index.hour + 1
data.boxplot(column='Demand', by='hour_day', ax=ax)
data.groupby('hour_day')['Demand'].median().plot(style='o-', linewidth=0.8, ax=ax)
ax.set_ylabel('Demand')
ax.set_title('Demand distribution by the hour of the day')
fig.suptitle('')
plt.show();

There is also a daily seasonality, with demand falling between 16:00 and 21:00 hours.

Autocorrelation plots

Auto-correlation plots are a useful tool for identifying the order of an autoregressive model. The autocorrelation function (ACF) is a measure of the correlation between the time series and a lagged version of itself. The partial autocorrelation function (PACF) is a measure of the correlation between the time series and a lagged version of itself, controlling for the values of the time series at all shorter lags. These plots are useful for identifying the lags to be included in the autoregressive model.

In [13]:
# Autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2))
plot_acf(data.Demand, ax=ax, lags=60)
plt.show()
In [14]:
# Partial autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2))
plot_pacf(data.Demand, ax=ax, lags=60)
plt.show()

The autocorrelation plot demonstrates a strong correlation between demand in one hour and prior hours, as well as between demand in one hour and the corresponding hour in preceding days. This observed correlation suggests that autoregressive models may be effective in this scenario.

Baseline

When faced with a forecasting problem, it is important to establish a baseline model. This is usually a very simple model that can be used as a reference to assess whether more complex models are worth implementing.

Skforecast allows you to easily create a baseline with its class ForecasterEquivalentDate. This model, also known as Seasonal Naive Forecasting, simply returns the value observed in the same period of the previous season (e.g., the same working day from the previous week, the same hour from the previous day, etc.).

Based on the exploratory analysis performed, the baseline model will be the one that predicts each hour using the value of the same hour on the previous day.

✎ Note

In the following code cells, a baseline forecaster is trained and its predictive ability is evaluated using a backtesting process. If this concept is new to you, do not worry; it will be explained in detail throughout the document. For now, it is sufficient to know that the backtesting process consists of training the model with a certain amount of data and evaluating its predictive ability on data the model has not seen. The error metric will be used as a reference to evaluate the predictive ability of the more complex models implemented later.
In [15]:
# Create baseline: value of the same hour of the previous day
# ==============================================================================
forecaster = ForecasterEquivalentDate(
                 offset    = pd.DateOffset(days=1),
                 n_offsets = 1
             )

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'Demand'])
forecaster
Out[15]:
======================== 
ForecasterEquivalentDate 
======================== 
Offset: <DateOffset: days=1> 
Number of offsets: 1 
Aggregation function: mean 
Window size: 24 
Training range: [Timestamp('2012-01-01 00:00:00'), Timestamp('2014-11-30 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Creation date: 2023-12-09 10:29:11 
Last fit date: 2023-12-09 10:29:11 
Skforecast version: 0.11.0 
Python version: 3.11.5 
Forecaster id: None 
In [16]:
# Backtesting
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['Demand'],
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data.loc[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False,
                          show_progress      = True
                      )

print(f"Backtest error (MAE): {metric}")
Backtest error (MAE): 308.3752715958334

Recursive multi-step forecasting

A recursive autoregressive model ForecasterAutoreg is trained using a gradient boosting regressor LGBMRegressor as the base regressor. A time window of 24 hours (24 lags) is used to predict the next hour's demand. This means that the demand values of the previous 24 hours are used as predictors. The hyperparameters of the underlying regressor are left at their default values.

In [17]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=15926, verbose=-1),
                 lags      = 24
             )

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'Demand'])
forecaster
Out[17]:
================= 
ForecasterAutoreg 
================= 
Regressor: LGBMRegressor(random_state=15926, verbose=-1) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
Transformer for y: None 
Transformer for exog: None 
Window size: 24 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2012-01-01 00:00:00'), Timestamp('2014-11-30 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1} 
fit_kwargs: {} 
Creation date: 2023-12-09 10:29:12 
Last fit date: 2023-12-09 10:29:12 
Skforecast version: 0.11.0 
Python version: 3.11.5 
Forecaster id: None 

Backtesting

To obtain a robust estimate of the model's predictive ability, a backtesting process is performed. The backtesting process consists of generating a forecast for each observation in the test set, following the same procedure as would be done in production, and then comparing the predicted value to the actual value.

The backtesting process is applied using the backtesting_forecaster() function. For this use case, the simulation is carried out as follows: the model is trained with data from 2012-01-01 00:00 to 2014-11-30 23:00, and then it predicts the next 24 hours every day at 23:59. The error metric used is the Mean Absolute Error (MAE).
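
For $n$ predicted values $\hat{y}_i$ and observed values $y_i$, the MAE is defined as:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$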

✎ Note

Since skforecast 0.9.0, all backtesting and grid search functions have been extended to include the n_jobs argument, allowing multi-process parallelization for improved performance. This applies to all functions of the different model_selection modules. The benefits of parallelization depend on several factors, including the regressor used, the number of fits to be performed, and the volume of data involved. When the n_jobs parameter is set to 'auto', the level of parallelization is automatically selected based on heuristic rules that aim to choose the best option for each scenario. For more information, see the backtesting section of the documentation.
In [18]:
# Backtesting
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['Demand'],
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data.loc[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = True,
                          show_progress      = True
                      )
Information of backtesting process
----------------------------------
Number of observations used for initial training: 25560
Number of observations used for backtesting: 720
    Number of folds: 30
    Number of steps per fold: 24
    Number of steps to exclude from the end of each train set before test (gap): 0

Fold: 0
    Training:   2012-01-01 00:00:00 -- 2014-11-30 23:00:00  (n=25560)
    Validation: 2014-12-01 00:00:00 -- 2014-12-01 23:00:00  (n=24)
... (folds 1 to 28 omitted: each uses the same training range and the next consecutive 24-hour validation window) ...
Fold: 29
    Training:   2012-01-01 00:00:00 -- 2014-11-30 23:00:00  (n=25560)
    Validation: 2014-12-30 00:00:00 -- 2014-12-30 23:00:00  (n=24)

In [19]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['Demand'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Demand",
    width=800,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()
In [20]:
# Backtesting error
# ==============================================================================
print(f'Backtest error (MAE): {metric}')
Backtest error (MAE): 236.38849515592196

The autoregressive model achieves a lower MAE than the baseline model. This means that the autoregressive model is able to predict the next day's electricity demand more accurately than the baseline model.

Hyperparameter tuning

The trained ForecasterAutoreg object used the first 24 lags and an LGBMRegressor model with the default hyperparameters. However, there is no reason why these values are the most appropriate. To find the best hyperparameters, a Bayesian search is performed using the bayesian_search_forecaster() function. The search is carried out using the same backtesting process as before, but each time the model is trained with a different combination of hyperparameters and lags. It is important to note that the hyperparameter search must be done using the validation set, so the test data is never used.

In [21]:
# Hyperparameters search
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=15926, verbose=-1),
                 lags      = 24, # This value will be replaced in the grid search
             )

# Lags used as predictors
lags_grid = [24, [1, 2, 3, 23, 24, 25, 47, 48, 49]]

# Regressor hyperparameters search space
def search_space(trial):
    search_space  = {
        'n_estimators'  : trial.suggest_int('n_estimators', 600, 1200, step=100),
        'max_depth'     : trial.suggest_int('max_depth', 3, 10, step=1),
        'learning_rate' : trial.suggest_float('learning_rate', 0.01, 0.5),
        'reg_alpha'     : trial.suggest_float('reg_alpha', 0, 1, step=0.1),
        'reg_lambda'    : trial.suggest_float('reg_lambda', 0, 1, step=0.1),
    } 
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster         = forecaster,
                                   y                  = data.loc[:end_validation, 'Demand'],
                                   steps              = 24,
                                   metric             = 'mean_absolute_error',
                                   search_space       = search_space,
                                   lags_grid          = lags_grid,
                                   initial_train_size = len(data[:end_train]),
                                   refit              = False,
                                   n_trials           = 20, # Increase for more exhaustive search
                                   random_state       = 123,
                                   return_best        = True,
                                   n_jobs             = 'auto',
                                   verbose            = False,
                                   show_progress      = True
                               )
Number of models compared: 40,
         20 bayesian search in each lag configuration.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
  Parameters: {'n_estimators': 1100, 'max_depth': 10, 'learning_rate': 0.07087975104890648, 'reg_alpha': 0.8, 'reg_lambda': 0.2}
  Backtesting metric: 231.1119966842797

In [22]:
# Search results
# ==============================================================================
results_search.head(10)
Out[22]:
lags params mean_absolute_error n_estimators max_depth learning_rate reg_alpha reg_lambda
15 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1100, 'max_depth': 10, 'learn... 231.111997 1100.0 10.0 0.070880 0.8 0.2
11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1200, 'max_depth': 8, 'learni... 233.187901 1200.0 8.0 0.071589 1.0 0.0
14 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1200, 'max_depth': 10, 'learn... 233.293696 1200.0 10.0 0.074394 0.8 0.2
17 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1100, 'max_depth': 9, 'learni... 234.490443 1100.0 9.0 0.071958 0.7 0.2
16 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1100, 'max_depth': 9, 'learni... 239.766898 1100.0 9.0 0.167380 0.8 0.2
10 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1200, 'max_depth': 8, 'learni... 242.800676 1200.0 8.0 0.038751 0.9 0.0
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1000, 'max_depth': 5, 'learni... 242.914116 1000.0 5.0 0.121157 0.6 0.7
6 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 600, 'max_depth': 6, 'learnin... 243.588252 600.0 6.0 0.221123 0.5 0.4
18 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 900, 'max_depth': 7, 'learnin... 243.741637 900.0 7.0 0.158435 1.0 0.1
3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 1100, 'max_depth': 4, 'learni... 244.666519 1100.0 4.0 0.095971 0.5 0.5

Since return_best has been set to True, the forecaster object is automatically updated with the best configuration found and trained on the entire dataset. This final model can then be used for future predictions on new data.
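
As a minimal sketch (the forecaster is already fitted on the full dataset thanks to return_best=True), predictions for the next 24 hours could be obtained directly:

# Sketch: predict the next 24 hours with the refitted forecaster.
# This model uses no exogenous variables, so only the horizon is needed.
predictions = forecaster.predict(steps=24)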

In [23]:
# Best model
# ==============================================================================
forecaster
Out[23]:
================= 
ForecasterAutoreg 
================= 
Regressor: LGBMRegressor(learning_rate=0.07087975104890648, max_depth=10,
              n_estimators=1100, random_state=15926, reg_alpha=0.8,
              reg_lambda=0.2, verbose=-1) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
Transformer for y: None 
Transformer for exog: None 
Window size: 24 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2012-01-01 00:00:00'), Timestamp('2014-11-30 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.07087975104890648, 'max_depth': 10, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 1100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.8, 'reg_lambda': 0.2, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1} 
fit_kwargs: {} 
Creation date: 2023-12-09 10:29:13 
Last fit date: 2023-12-09 10:32:36 
Skforecast version: 0.11.0 
Python version: 3.11.5 
Forecaster id: None 

Backtesting on test data

Once the best combination of hyperparameters has been identified using the validation data, the predictive capacity of the model is evaluated when applied to the test set. It is highly recommended to review the documentation for the backtesting_forecaster() function to gain a better understanding of its capabilities. This will help to utilize its full potential to analyze the predictive ability of the model.

In [24]:
# Backtest final model on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['Demand'],
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False, # Change to True to see detailed information
                          show_progress      = True
                      )

print(f"Backtest error: {metric:.2f}")
predictions.head()
Backtest error: 210.06
Out[24]:
pred
2014-12-01 00:00:00 5586.288619
2014-12-01 01:00:00 5592.875437
2014-12-01 02:00:00 5567.013856
2014-12-01 03:00:00 5553.220511
2014-12-01 04:00:00 5521.734716

After optimization of lags and hyperparameters, the prediction error was notably reduced.

Forecasting with exogenous variables

So far, only lagged values of the time series have been used as predictors. However, it is possible to include other variables as predictors. These variables are known as exogenous variables (features) and their use can improve the predictive capacity of the model. A very important point to keep in mind is that the values of the exogenous variables must be known at the time of prediction.

Common examples of exogenous variables are those derived from the calendar, such as the day of the week, month, year, or holidays. Weather variables such as temperature, humidity, and wind also fall into this category, as do economic variables such as inflation and interest rates.
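
As a sketch of the API used later in this section, exogenous variables are passed to the forecaster through the exog argument. Here, exog_train and exog_future are hypothetical DataFrames of exogenous features, aligned with the index of the series.

# Sketch of the exog API. `exog_train` and `exog_future` are hypothetical
# DataFrames of exogenous features, aligned with the index of the series.
forecaster.fit(y=data_train['Demand'], exog=exog_train)
predictions = forecaster.predict(steps=24, exog=exog_future)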

Warning

Exogenous variables must be known at the time of the forecast. For example, if temperature is used as an exogenous variable, the temperature value for the next hour must be known at the time of the forecast. If the temperature value is not known, the forecast will not be possible.

Weather variables should be used with caution. When the model is deployed into production, future weather conditions are not known, but are predictions made by meteorological services. Because they are predictions, they introduce errors into the forecasting model. As a result, the model's predictions are likely to get worse. One way to anticipate this problem, and to know (not avoid) the expected performance of the model, is to use the weather forecasts available at the time the model is trained, rather than the recorded conditions.

Exogenous variables

Next, exogenous variables are created based on calendar information, sunrise and sunset times, temperature, and holidays. These new variables are then added to the training, validation and test sets, and used as predictors in the autoregressive model.

In [25]:
# Calendar features
# ==============================================================================
calendar_features = pd.DataFrame(index=data.index)
calendar_features['month'] = calendar_features.index.month
calendar_features['week_of_year'] = calendar_features.index.isocalendar().week
calendar_features['week_day'] = calendar_features.index.day_of_week + 1
calendar_features['hour_day'] = calendar_features.index.hour + 1

# Sunlight features
# ==============================================================================
location = LocationInfo(
    "Melbourne",
    "Australia",
    latitude=-37.8,
    longitude=144.95,
    timezone='Australia/Melbourne'
)
sunrise_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunrise'].hour
    for date in data.index
]
sunset_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunset'].hour
    for date in data.index
]
sun_light_features = pd.DataFrame({
                         'sunrise_hour': sunrise_hour,
                         'sunset_hour': sunset_hour}, 
                         index = data.index
                     )
sun_light_features['daylight_hours'] = (
    sun_light_features['sunset_hour'] - sun_light_features['sunrise_hour']
)
sun_light_features['is_daylight'] = np.where(
                                        (data.index.hour >= sun_light_features['sunrise_hour']) & \
                                        (data.index.hour < sun_light_features['sunset_hour']),
                                        1,
                                        0
                                    )

# Holiday features
# ==============================================================================
holiday_features = data[['Holiday']].astype(int)
holiday_features['holiday_previous_day'] = holiday_features['Holiday'].shift(24)
holiday_features['holiday_next_day'] = holiday_features['Holiday'].shift(-24)

# Temperature features
# ==============================================================================
temp_features = data[['Temperature']].copy()
temp_features['temp_roll_mean_1_day'] = temp_features['Temperature'].rolling(24, closed='left').mean()
temp_features['temp_roll_mean_7_day'] = temp_features['Temperature'].rolling(24*7, closed='left').mean()
temp_features['temp_roll_max_1_day'] = temp_features['Temperature'].rolling(24, closed='left').max()
temp_features['temp_roll_min_1_day'] = temp_features['Temperature'].rolling(24, closed='left').min()
temp_features['temp_roll_max_7_day'] = temp_features['Temperature'].rolling(24*7, closed='left').max()
temp_features['temp_roll_min_7_day'] = temp_features['Temperature'].rolling(24*7, closed='left').min()


# Merge all exogenous variables
# ==============================================================================
exogenous_features = pd.concat([
                         calendar_features,
                         sun_light_features,
                         temp_features,
                         holiday_features
                     ], axis=1)

exogenous_features.head(4)
Out[25]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight Temperature temp_roll_mean_1_day temp_roll_mean_7_day temp_roll_max_1_day temp_roll_min_1_day temp_roll_max_7_day temp_roll_min_7_day Holiday holiday_previous_day holiday_next_day
Time
2012-01-01 00:00:00 1 52 7 1 6 20 14 0 27.000 NaN NaN NaN NaN NaN NaN 1 NaN 1.0
2012-01-01 01:00:00 1 52 7 2 6 20 14 0 29.650 NaN NaN NaN NaN NaN NaN 1 NaN 1.0
2012-01-01 02:00:00 1 52 7 3 6 20 14 0 31.650 NaN NaN NaN NaN NaN NaN 1 NaN 1.0
2012-01-01 03:00:00 1 52 7 4 6 20 14 0 32.675 NaN NaN NaN NaN NaN NaN 1 NaN 1.0

Certain aspects of the calendar, such as hours or days, are cyclical. For example, the hour-day cycle ranges from 0 to 23 hours. Although interpreted as a continuous variable, the hour 23:00 is only one hour away from 00:00. The same is true for the month-year cycle, since December is only one month away from January. The use of trigonometric functions such as sine and cosine transformations makes it possible to represent cyclic patterns and avoid inconsistencies in data representation. This approach is known as cyclic encoding and can significantly improve the predictive ability of models.

In [26]:
# Cyclical encoding of calendar and sunlight features
# ==============================================================================

def cyclical_encoding(data: pd.Series, cycle_length: int) -> pd.DataFrame:
    """
    Encode a cyclical feature with two new features sine and cosine.
    The minimum value of the feature is assumed to be 0. The length of the
    cycle is passed as an argument.
      
    Parameters
    ----------
    data : pd.Series
        Series with the feature to encode.
    cycle_length : int
        The length of the cycle. For example, 12 for months, 24 for hours, etc.
        This value is used to calculate the angle of the sin and cos.

    Returns
    -------
    result : pd.DataFrame
        Dataframe with the two new features sin and cos.

    """

    sin = np.sin(2 * np.pi * data/cycle_length)
    cos = np.cos(2 * np.pi * data/cycle_length)
    result =  pd.DataFrame({
                  f"{data.name}_sin": sin,
                  f"{data.name}_cos": cos
              })

    return result


month_encoded = cyclical_encoding(exogenous_features['month'], cycle_length=12)
week_of_year_encoded = cyclical_encoding(exogenous_features['week_of_year'], cycle_length=52)
week_day_encoded = cyclical_encoding(exogenous_features['week_day'], cycle_length=7)
hour_day_encoded = cyclical_encoding(exogenous_features['hour_day'], cycle_length=24)
sunrise_hour_encoded = cyclical_encoding(exogenous_features['sunrise_hour'], cycle_length=24)
sunset_hour_encoded = cyclical_encoding(exogenous_features['sunset_hour'], cycle_length=24)

cyclical_features = pd.concat([
                        month_encoded,
                        week_of_year_encoded,
                        week_day_encoded,
                        hour_day_encoded,
                        sunrise_hour_encoded,
                        sunset_hour_encoded
                    ], axis=1)

exogenous_features = pd.concat([exogenous_features, cyclical_features], axis=1)
exogenous_features.head(3)
Out[26]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight Temperature temp_roll_mean_1_day ... week_of_year_sin week_of_year_cos week_day_sin week_day_cos hour_day_sin hour_day_cos sunrise_hour_sin sunrise_hour_cos sunset_hour_sin sunset_hour_cos
Time
2012-01-01 00:00:00 1 52 7 1 6 20 14 0 27.00 NaN ... 0.0 1.0 -2.449294e-16 1.0 0.258819 0.965926 1.0 6.123234e-17 -0.866025 0.5
2012-01-01 01:00:00 1 52 7 2 6 20 14 0 29.65 NaN ... 0.0 1.0 -2.449294e-16 1.0 0.500000 0.866025 1.0 6.123234e-17 -0.866025 0.5
2012-01-01 02:00:00 1 52 7 3 6 20 14 0 31.65 NaN ... 0.0 1.0 -2.449294e-16 1.0 0.707107 0.707107 1.0 6.123234e-17 -0.866025 0.5

3 rows × 30 columns

✎ Note

For illustrative purposes, calendar features and their cyclic encoding have been done manually. However, for a more efficient and automated approach, it is recommended to use the transformers DatetimeFeatures and CyclicalFeatures available in Feature-engine library, and pass them directly to the transformer_exog argument of the forecaster.
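
A minimal sketch of this alternative approach could look as follows (an assumed API, requiring a Feature-engine version where DatetimeFeatures can work directly on the index):

# Sketch using Feature-engine transformers (assumed API; illustrative only).
# The resulting pipeline could be passed to the forecaster's transformer_exog argument.
from sklearn.pipeline import make_pipeline
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures

transformer_exog = make_pipeline(
    DatetimeFeatures(
        variables           = 'index',
        features_to_extract = ['month', 'week', 'day_of_week', 'hour']
    ),
    CyclicalFeatures(drop_original=True)  # sine/cosine pairs for each feature
)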

In many cases, exogenous variables are not isolated. Rather, their effect on the target variable depends on the value of other variables. For example, the effect of temperature on electricity demand depends on the time of day. The interactions between exogenous variables can be captured by new variables obtained by multiplying existing variables together. These interactions are easily obtained with the PolynomialFeatures class from scikit-learn.

In [27]:
# Interaction between exogenous variables
# ==============================================================================
transformer_poly = PolynomialFeatures(
                       degree           = 2,
                       interaction_only = True,
                       include_bias     = False
                   ).set_output(transform="pandas")

poly_cols = [
    'month_sin',
    'month_cos',
    'week_of_year_sin',
    'week_of_year_cos',
    'week_day_sin',
    'week_day_cos',
    'hour_day_sin',
    'hour_day_cos',
    'sunrise_hour_sin',
    'sunrise_hour_cos',
    'sunset_hour_sin',
    'sunset_hour_cos',
    'daylight_hours',
    'is_daylight',
    'holiday_previous_day',
    'holiday_next_day',
    'temp_roll_mean_1_day',
    'temp_roll_mean_7_day',
    'temp_roll_max_1_day',
    'temp_roll_min_1_day',
    'temp_roll_max_7_day',
    'temp_roll_min_7_day',
    'Temperature',
    'Holiday'
]

poly_features = transformer_poly.fit_transform(exogenous_features[poly_cols].dropna())
poly_features = poly_features.drop(columns=poly_cols)
poly_features.columns = [f"poly_{col}" for col in poly_features.columns]
poly_features.columns = poly_features.columns.str.replace(" ", "__")
exogenous_features = pd.concat([exogenous_features, poly_features], axis=1)
exogenous_features.head(4)
Out[27]:
month week_of_year week_day hour_day sunrise_hour sunset_hour daylight_hours is_daylight Temperature temp_roll_mean_1_day ... poly_temp_roll_min_1_day__temp_roll_max_7_day poly_temp_roll_min_1_day__temp_roll_min_7_day poly_temp_roll_min_1_day__Temperature poly_temp_roll_min_1_day__Holiday poly_temp_roll_max_7_day__temp_roll_min_7_day poly_temp_roll_max_7_day__Temperature poly_temp_roll_max_7_day__Holiday poly_temp_roll_min_7_day__Temperature poly_temp_roll_min_7_day__Holiday poly_Temperature__Holiday
Time
2012-01-01 00:00:00 1 52 7 1 6 20 14 0 27.000 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2012-01-01 01:00:00 1 52 7 2 6 20 14 0 29.650 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2012-01-01 02:00:00 1 52 7 3 6 20 14 0 31.650 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2012-01-01 03:00:00 1 52 7 4 6 20 14 0 32.675 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4 rows × 306 columns

In [28]:
# Select exogenous variables to be included in the model
# ==============================================================================
exog_features = []

# Columns that end with _sin or _cos are selected
exog_features.extend(exogenous_features.filter(regex='_sin$|_cos$').columns.tolist())

# Columns that start with temp_ are selected
exog_features.extend(exogenous_features.filter(regex='^temp_.*').columns.tolist())

# Columns that start with holiday_ are selected
exog_features.extend(exogenous_features.filter(regex='^holiday_.*').columns.tolist())

# Include original features
exog_features.extend(['Temperature', 'Holiday'])
In [29]:
# Merge target and exogenous variables in the same DataFrame
# ==============================================================================
data = data[['Demand']].merge(
           exogenous_features[exog_features],
           left_index=True,
           right_index=True,
           how='left'
       )

# Due to the creation of moving averages, there are missing values at the beginning
# of the series. Due to holiday_next_day there are missing values at the end.
data = data.dropna()
data = data.astype('float32')

# Split data into train-val-test
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

The hyperparameters and lags identified as optimal in the previous section are again used to train the model, but this time, the exogenous variables are also included as predictors.

In [30]:
# Create forecaster
# ==============================================================================
params = {
    'n_estimators': 1100,
    'max_depth': 10,
    'learning_rate': 0.0709,
    'reg_alpha': 0.8,
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = 24
             )
In [31]:
# Backtesting model
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['Demand'],
                          exog               = data[exog_features],
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False,
                          show_progress      = True
                      )

print(f"Backtest error: {metric:.2f}")
predictions.head()
Backtest error: 132.77
Out[31]:
pred
2014-12-01 00:00:00 5552.413345
2014-12-01 01:00:00 5489.721561
2014-12-01 02:00:00 5465.588132
2014-12-01 03:00:00 5451.111073
2014-12-01 04:00:00 5525.933964

The inclusion of exogenous variables as predictors further improves the predictive capacity of the model.

Prediction intervals

A prediction interval defines the interval within which the true value of the target variable can be expected to be found with a given probability. Rob J Hyndman and George Athanasopoulos, in their book Forecasting: Principles and Practice, list multiple ways to estimate prediction intervals, most of which require that the residuals (errors) of the model be normally distributed. If this cannot be assumed, one can resort to bootstrapping, which requires only that the residuals be uncorrelated. This is one of the methods available in skforecast. A more detailed explanation of prediction intervals can be found in the Probabilistic forecasting: prediction intervals and prediction distribution user guide.
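
To make the idea concrete, here is a simplified conceptual sketch of bootstrapped intervals on toy data (skforecast's actual procedure propagates the resampled residuals through the recursive predictions, so this is not the exact algorithm):

# Simplified conceptual sketch of bootstrapped prediction intervals on toy data.
# skforecast's implementation propagates resampled residuals through the
# recursive predictions; here a residual is simply added to each point forecast.
import numpy as np

rng = np.random.default_rng(123)
point_forecast = np.array([5500.0, 5450.0, 5400.0])  # toy 3-step point forecast
residuals = rng.normal(loc=0, scale=150, size=500)   # toy out-of-sample errors
noise = rng.choice(residuals, size=(1000, len(point_forecast)), replace=True)
boot_predictions = point_forecast + noise            # 1000 simulated paths
lower_bound = np.percentile(boot_predictions, 10, axis=0)  # 10th percentile
upper_bound = np.percentile(boot_predictions, 90, axis=0)  # 90th percentile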

The following code shows how to generate prediction intervals for the autoregressive model. The predict_interval() method is used to generate the prediction intervals for each predicted step. Then, the backtesting_forecaster() function is used to generate the prediction intervals for the entire test set. The interval argument is used to specify the desired coverage probability of the prediction intervals. In this case, interval is set to [10, 90], which means that the prediction intervals are calculated for the 10th and 90th percentiles, resulting in a theoretical coverage probability of 80%. The n_boot argument is used to specify the number of bootstrap samples used to estimate the prediction intervals. The larger the number of samples, the more accurate the prediction intervals will be, but the longer the calculation will take.

In [32]:
# Create and train forecaster
# ==============================================================================
params = {
    'n_estimators': 1100,
    'max_depth': 10,
    'learning_rate': 0.0709,
    'reg_alpha': 0.8,
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = 24
             )

forecaster.fit(
    y    = data.loc[:end_train, 'Demand'],
    exog = data.loc[:end_train, exog_features]
)
In [33]:
# Predict intervals
# ==============================================================================
# Since the model has been trained with exogenous variables, they must be provided
# for the prediction.
predictions = forecaster.predict_interval(
                  exog     = data.loc[end_train:, exog_features],
                  steps    = 24,
                  interval = [10, 90], 
              )
predictions.head()
Out[33]:
pred lower_bound upper_bound
2014-01-01 00:00:00 3600.929322 3572.643772 3624.916926
2014-01-01 01:00:00 3700.916873 3665.079895 3725.453531
2014-01-01 02:00:00 3758.736410 3727.163598 3807.674444
2014-01-01 03:00:00 3751.756941 3718.551059 3812.145448
2014-01-01 04:00:00 3750.495804 3717.478448 3827.383248

By default, intervals are calculated using in-sample residuals (residuals from the training set). However, this can result in intervals that are too narrow (overly optimistic). To avoid this, the set_out_sample_residuals() method is used to specify out-sample residuals computed with a validation set through backtesting.

In [34]:
# Backtesting on validation data to obtain out-sample residuals
# ==============================================================================
_, predictions_val = backtesting_forecaster(
                         forecaster         = forecaster,
                         y                  = data.loc[:end_validation, 'Demand'], # Train + Validation
                         exog               = data.loc[:end_validation, exog_features],
                         steps              = 24,
                         metric             = 'mean_absolute_error',
                         initial_train_size = len(data.loc[:end_train]),
                         refit              = False,
                         n_jobs             = 'auto',
                         verbose            = False,
                         show_progress      = True
                     )
residuals = predictions_val['pred'] - data.loc[predictions_val.index, 'Demand']
residuals = residuals.dropna().to_numpy()

# Store out sample residuals in the forecaster
# ==============================================================================
forecaster.set_out_sample_residuals(residuals=residuals)

Then, backtesting is run on the test set using the out-of-sample residuals by setting the in_sample_residuals argument to False.

In [35]:
# Backtesting with prediction intervals in test data using out-sample residuals
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster          = forecaster,
                          y                   = data['Demand'], # Full dataset
                          exog                = data[exog_features],
                          steps               = 24,
                          metric              = 'mean_absolute_error',
                          initial_train_size  = len(data.loc[:end_validation]),
                          refit               = False,
                          interval            = [10, 90],
                          n_boot              = 1000,
                          in_sample_residuals = False, # Use out-sample residuals
                          n_jobs              = 'auto',
                          verbose             = False,
                          show_progress       = True
                      )
predictions.head(5)
Out[35]:
pred lower_bound upper_bound
2014-12-01 00:00:00 5552.413345 5393.982951 5751.242281
2014-12-01 01:00:00 5489.721561 5254.015126 5840.891039
2014-12-01 02:00:00 5465.588132 5196.604384 5974.429540
2014-12-01 03:00:00 5451.111073 5168.466441 6054.435283
2014-12-01 04:00:00 5525.933964 5220.894376 6250.380375
In [36]:
# Plot prediction intervals vs real value
# ==============================================================================
fig = go.Figure([
    go.Scatter(
        name='Prediction',
        x=predictions.index,
        y=predictions['pred'],
        mode='lines',
    ),
    go.Scatter(
        name='Real value',
        x=data_test.index,
        y=data_test['Demand'],
        mode='lines',
    ),
    go.Scatter(
        name='Upper Bound',
        x=predictions.index,
        y=predictions['upper_bound'],
        mode='lines',
        marker=dict(color="#444"),
        line=dict(width=0),
        showlegend=False
    ),
    go.Scatter(
        name='Lower Bound',
        x=predictions.index,
        y=predictions['lower_bound'],
        marker=dict(color="#444"),
        line=dict(width=0),
        mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)',
        fill='tonexty',
        showlegend=False
    )
])
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Demand",
    width=800,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    hovermode="x",
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()
In [37]:
# Predicted interval coverage (on test data)
# ==============================================================================
inside_interval = (
                      (data.loc[end_validation:, 'Demand'] >= predictions["lower_bound"]) &
                      (data.loc[end_validation:, 'Demand'] <= predictions["upper_bound"])
                  )

coverage = inside_interval.mean()
print(f"Predicted interval coverage: {round(100*coverage, 2)} %")
Predicted interval coverage: 93.39 %

Empirical coverage exceeding the expected 80% suggests two possibilities: either the intervals are conservatively estimated, leading to wider ranges than necessary, or the model is more effectively capturing data variability than anticipated. Extending the training set to include more data may help to distinguish between these two possibilities.
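
A quick complementary check is the average width of the intervals: together with the empirical coverage, it indicates whether the extra coverage comes at the price of unnecessarily wide ranges. A minimal sketch, assuming the predictions DataFrame and the coverage value computed in the cells above:

# Average interval width as a complement to the coverage diagnostic
# (sketch; assumes `predictions` and `coverage` from the cells above)
avg_width = (predictions['upper_bound'] - predictions['lower_bound']).mean()
print("Nominal coverage of the interval [10, 90]: 80 %")
print(f"Empirical coverage: {round(100 * coverage, 2)} %")
print(f"Average interval width: {avg_width:.2f}")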

Model explainability

Due to the complex nature of many modern machine learning models, such as ensemble methods, they often function as black boxes, making it difficult to understand why a particular prediction was made. Explainability techniques aim to demystify these models, providing insight into their inner workings and helping to build trust, improve transparency, and meet regulatory requirements in various domains. Enhancing model explainability not only helps to understand model behavior, but also helps to identify biases, improve model performance, and enable stakeholders to make more informed decisions based on machine learning insights.

Skforecast is compatible with some of the most popular model explainability methods: model-specific feature importances, SHAP values, and partial dependence plots (a sketch of the latter is shown at the end of this section).

In [38]:
# Create and train forecaster
# ==============================================================================
params = {
    'n_estimators': 1100,
    'max_depth': 10,
    'learning_rate': 0.0709,
    'reg_alpha': 0.8,
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = 24
             )
forecaster.fit(
    y    = data.loc[:end_validation, 'Demand'],
    exog = data.loc[:end_validation, exog_features]
)

Model-specific feature importance

In [39]:
# Model-specific feature importances
# ==============================================================================
feature_importances = forecaster.get_feature_importances()
feature_importances.sort_values(by='importance', ascending=False).head(10)
Out[39]:
feature importance
0 lag_1 2429
110 Temperature 1300
1 lag_2 1219
23 lag_24 1182
105 temp_roll_min_1_day 821
22 lag_23 814
2 lag_3 788
102 temp_roll_mean_1_day 727
104 temp_roll_max_1_day 718
21 lag_22 659

Warning

The get_feature_importances() method will only return values if the forecaster's regressor has either the coef_ or feature_importances_ attribute, as is the case for most scikit-learn regressors.
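
If the regressor exposes neither attribute, permutation importance from scikit-learn is a model-agnostic alternative that can be applied directly to the forecaster's internal regressor and training matrices. A minimal sketch (it can be slow on the full training matrix; n_repeats=5 is an illustrative choice):

# Permutation importance as a model-agnostic alternative
# (sketch; applied to the internal regressor and its training matrices)
from sklearn.inspection import permutation_importance

X_train, y_train = forecaster.create_train_X_y(
                       y    = data.loc[:end_validation, 'Demand'],
                       exog = data.loc[:end_validation, exog_features]
                   )
result = permutation_importance(
             estimator    = forecaster.regressor,
             X            = X_train,
             y            = y_train,
             n_repeats    = 5,  # illustrative value
             random_state = 123,
             n_jobs       = -1
         )
importances = pd.Series(result.importances_mean, index=X_train.columns)
importances.sort_values(ascending=False).head(10)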

SHAP values

SHAP (SHapley Additive exPlanations) values are a popular method for explaining machine learning models, as they help to understand how variables and values influence predictions visually and quantitatively.

It is possible to generate SHAP explanations for skforecast models with just two essential elements:

  • The internal regressor of the forecaster.

  • The training matrices created from the time series and used to fit the forecaster.

By leveraging these two components, users can create insightful and interpretable explanations for their skforecast models. These explanations can be used to verify the reliability of the model, identify the most significant factors that contribute to model predictions, and gain a deeper understanding of the underlying relationship between the input variables and the target variable.

In [40]:
# Training matrices used by the forecaster to fit the internal regressor
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                       y    = data.loc[:end_validation, 'Demand'],
                       exog = data.loc[:end_validation, exog_features]
                   )

display(X_train.head(3))
display(y_train.head(3))
lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 ... temp_roll_mean_1_day temp_roll_mean_7_day temp_roll_max_1_day temp_roll_min_1_day temp_roll_max_7_day temp_roll_min_7_day holiday_previous_day holiday_next_day Temperature Holiday
Time
2012-01-09 00:00:00 4886.074219 4664.689453 4390.567871 3873.921631 3498.672607 3348.528809 3458.686523 3739.490967 3630.717285 3897.311035 ... 20.281250 22.389137 27.075001 14.8 39.525002 14.35 0.0 0.0 19.200001 0.0
2012-01-09 01:00:00 4958.630371 4886.074219 4664.689453 4390.567871 3873.921631 3498.672607 3348.528809 3458.686523 3739.490967 3630.717285 ... 20.223959 22.308035 27.075001 14.8 39.525002 14.35 0.0 0.0 20.525000 0.0
2012-01-09 02:00:00 5039.643066 4958.630371 4886.074219 4664.689453 4390.567871 3873.921631 3498.672607 3348.528809 3458.686523 3739.490967 ... 20.141666 22.224852 27.075001 14.8 39.525002 14.35 0.0 0.0 21.100000 0.0

3 rows × 112 columns

Time
2012-01-09 00:00:00    4958.630371
2012-01-09 01:00:00    5039.643066
2012-01-09 02:00:00    5090.203125
Freq: H, Name: y, dtype: float32
In [41]:
# Create SHAP explainer (for tree-based models)
# ==============================================================================
explainer = shap.TreeExplainer(forecaster.regressor)

# Sample 50% of the data to speed up the calculation
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train)*0.5), replace=False)
X_train_sample = X_train.loc[sample, :]

shap_values = explainer.shap_values(X_train_sample)

✎ Note

The SHAP library has several explainers, each designed for a different type of model. The shap.TreeExplainer is used for tree-based models, such as the LGBMRegressor in this example. For more information, see the SHAP documentation.
In [42]:
# Shap summary plot (top 10)
# ==============================================================================
shap.summary_plot(shap_values, X_train_sample, max_display=10, show=False)
fig, ax = plt.gcf(), plt.gca()
ax.set_title("SHAP Summary plot")
ax.tick_params(labelsize=8)
fig.set_size_inches(8, 4.5)
In [66]:
# Force plot for the first observation
# ==============================================================================
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train_sample.iloc[0,:])
Out[66]:
(Interactive SHAP force plot for the first observation; the JavaScript visualization only renders in a live notebook session.)
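
Partial dependence plots, the third method mentioned at the beginning of this section, show the marginal effect of a feature on the model's predictions. A minimal sketch using scikit-learn's PartialDependenceDisplay on the forecaster's internal regressor (the two features are illustrative choices; assumes X_train_sample from the previous cells):

# Partial dependence of the predictions on two features
# (sketch; assumes the fitted forecaster and X_train_sample defined above)
from sklearn.inspection import PartialDependenceDisplay

display = PartialDependenceDisplay.from_estimator(
              estimator = forecaster.regressor,
              X         = X_train_sample,
              features  = ['Temperature', 'lag_1']
          )
display.figure_.set_size_inches(8, 3.5)
display.figure_.suptitle("Partial dependence plots")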

Feature selection

Feature selection is the process of selecting a subset of relevant features for use in model construction. It is an important step in the machine learning process, as it can help to reduce overfitting, improve model accuracy, and reduce training time. Since the underlying regressors of skforecast follow the scikit-learn API, it is possible to use the feature selection methods available in scikit-learn. Two of the most popular methods are Recursive Feature Elimination and Sequential Feature Selection; the former is applied below, and a sketch of the latter follows the RFECV results.

💡 Tip

Feature selection is a powerful tool for improving the performance of machine learning models. However, it is computationally expensive and can be time-consuming. Since the goal is to find the best subset of features, not the best model, it is not necessary to use the entire data set or a highly complex model. Instead, it is recommended to use a small subset of the data and a simple model. Once the best subset of features has been identified, the model can then be trained using the entire dataset and a more complex configuration. For example, in this use case, the best model is an LGBMRegressor with 1100 trees and a maximum depth of 10, but to find the best subset of features, only 100 trees and a maximum depth of 5 are used.
In [43]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=123, verbose=-1),
                 lags = 24
             )

# Underlying matrices used by the forecaster to fit the internal regressor
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                       y    = data_train['Demand'],
                       exog = data_train[exog_features]
                   )

Scikit-learn's RFECV starts by training a model on the initial set of features, and obtaining the importance of each feature (through attributes such as coef_ or feature_importances_). Then, in each round, the least important features are iteratively removed, followed by cross-validation to calculate the performance of the model with the remaining features. This process continues until further feature removal doesn't improve or starts to degrade the performance of the model (based on a chosen metric), or the min_features_to_select is reached.

The final result is an optimal subset of features that ideally balances model simplicity and predictive power, as determined by the cross-validation process.

In [44]:
# Recursive feature elimination with cross-validation
# ==============================================================================
# Sample 50% of the data to speed up the calculation
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train)*0.5), replace=False)
X_train_sample = X_train.loc[sample, :]
y_train_sample = y_train.loc[sample]

print(f"Total number of features available: {X_train.shape[1]}") 
print(f"Total number of records available: {X_train.shape[0]}")
print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")

regressor = LGBMRegressor(
                n_estimators = 100,
                max_depth = 5,
                random_state = 15926,
                verbose = -1
            )
selector = RFECV(
    estimator              = regressor,
    step                   = 1,
    cv                     = 3,
    min_features_to_select = 25,
    n_jobs                 = -1
)
selector.fit(X_train_sample, y_train_sample)
selected_features_rfe = selector.get_feature_names_out()

print("\nRecursive feature elimination")
print("-----------------------------")
print(f"Number of features selected: {len(selected_features_rfe)}")
print(f"Selected features : \n {selected_features_rfe}")
Total number of features available: 112
Total number of records available: 17352
Total number of records used for feature selection: 8676

Recursive feature elimination
-----------------------------
Number of features selected: 74
Selected features : 
 ['lag_1' 'lag_2' 'lag_3' 'lag_4' 'lag_5' 'lag_6' 'lag_7' 'lag_8' 'lag_9'
 'lag_10' 'lag_11' 'lag_12' 'lag_13' 'lag_14' 'lag_15' 'lag_16' 'lag_17'
 'lag_18' 'lag_19' 'lag_20' 'lag_21' 'lag_22' 'lag_23' 'lag_24'
 'month_cos' 'week_of_year_sin' 'week_of_year_cos' 'week_day_sin'
 'hour_day_sin' 'hour_day_cos' 'sunset_hour_cos'
 'poly_month_sin__week_of_year_sin' 'poly_month_sin__week_of_year_cos'
 'poly_month_sin__week_day_sin' 'poly_month_sin__week_day_cos'
 'poly_month_sin__hour_day_sin' 'poly_month_sin__hour_day_cos'
 'poly_month_cos__week_of_year_sin' 'poly_month_cos__hour_day_sin'
 'poly_month_cos__hour_day_cos' 'poly_week_of_year_sin__week_day_sin'
 'poly_week_of_year_sin__week_day_cos'
 'poly_week_of_year_sin__hour_day_sin'
 'poly_week_of_year_sin__hour_day_cos'
 'poly_week_of_year_cos__week_day_sin'
 'poly_week_of_year_cos__week_day_cos'
 'poly_week_of_year_cos__hour_day_sin'
 'poly_week_of_year_cos__hour_day_cos'
 'poly_week_of_year_cos__sunrise_hour_sin'
 'poly_week_day_sin__week_day_cos' 'poly_week_day_sin__hour_day_sin'
 'poly_week_day_sin__hour_day_cos' 'poly_week_day_sin__sunset_hour_sin'
 'poly_week_day_cos__hour_day_sin' 'poly_week_day_cos__hour_day_cos'
 'poly_hour_day_sin__hour_day_cos' 'poly_hour_day_sin__sunrise_hour_sin'
 'poly_hour_day_sin__sunrise_hour_cos'
 'poly_hour_day_sin__sunset_hour_sin' 'poly_hour_day_sin__sunset_hour_cos'
 'poly_hour_day_cos__sunrise_hour_sin'
 'poly_hour_day_cos__sunrise_hour_cos'
 'poly_hour_day_cos__sunset_hour_sin' 'poly_hour_day_cos__sunset_hour_cos'
 'poly_sunrise_hour_sin__sunset_hour_sin' 'temp_roll_mean_1_day'
 'temp_roll_mean_7_day' 'temp_roll_max_1_day' 'temp_roll_min_1_day'
 'temp_roll_max_7_day' 'temp_roll_min_7_day' 'holiday_previous_day'
 'Temperature' 'Holiday']
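
Sequential Feature Selection, mentioned earlier, can be applied with the same ingredients. Because it refits the estimator for every candidate feature at each round, it is considerably slower than RFECV. The sketch below reuses the reduced regressor and the sampled training data; n_features_to_select=25 is an illustrative choice. The rest of the document continues with the subset selected by RFECV.

# Sequential (forward) feature selection as an alternative to RFECV
# (sketch; reuses `regressor`, `X_train_sample` and `y_train_sample`)
from sklearn.feature_selection import SequentialFeatureSelector

selector_sfs = SequentialFeatureSelector(
                   estimator            = regressor,
                   n_features_to_select = 25,  # illustrative value
                   direction            = 'forward',
                   cv                   = 3,
                   n_jobs               = -1
               )
selector_sfs.fit(X_train_sample, y_train_sample)
selected_features_sfs = selector_sfs.get_feature_names_out()
print(f"Number of features selected: {len(selected_features_sfs)}")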

The forecaster is trained and re-evaluated using the best subset of features.

In [45]:
# Create a forecaster with the selected features
# ==============================================================================
params = {
    'n_estimators': 1100,
    'max_depth': 10,
    'learning_rate': 0.0709,
    'reg_alpha': 0.8,
    'reg_lambda': 0.2,
    'random_state': 15926,
    'verbose': -1
}

selected_lags = [
    int(feature.replace("lag_", ""))
    for feature in selected_features_rfe
    if feature.startswith("lag_")
]
selected_exog_features = [
    feature
    for feature in selected_features_rfe
    if not feature.startswith("lag_")
]

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = selected_lags,
             )

forecaster.fit(y=data_train['Demand'], exog=data_train[selected_exog_features])

# Backtesting model with exogenous variables on test data
# ==============================================================================
metric_lgbm, predictions = backtesting_forecaster(
                               forecaster         = forecaster,
                               y                  = data['Demand'],
                               exog               = data[selected_exog_features],
                               steps              = 24,
                               metric             = 'mean_absolute_error',
                               initial_train_size = len(data[:end_validation]),
                               refit              = False,
                               n_jobs             = 'auto',
                               verbose            = False,
                               show_progress      = True
                           )

print(f"Backtest error: {metric_lgbm:.2f}")
Backtest error: 134.38

The performance of the model is hardly affected; only a small increase in MAE is observed. The number of features is reduced from 112 to 74, making the model simpler and faster to train. In addition, the risk of overfitting is reduced because the model is less likely to learn noise from irrelevant features.

Direct multi-step forecasting

The ForecasterAutoreg and ForecasterAutoregCustom models follow a recursive strategy in which each new prediction is based on the previous one. Another strategy for multi-step forecasting involves training a separate model for each step to be predicted. This is known as direct multi-step forecasting and, although more computationally expensive than the recursive approach due to the need to train multiple models, it may yield better results.
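
Conceptually, the direct strategy amounts to fitting one regressor per step of the horizon, each trained with the target shifted by its step. The following hand-rolled sketch for two steps is illustrative only; ForecasterAutoregDirect automates the whole process, including feature alignment and prediction:

# Hand-rolled illustration of the direct strategy: one model per step
# (sketch; ForecasterAutoregDirect automates and generalizes this)
lags_matrix = pd.concat(
    {f'lag_{i}': data_train['Demand'].shift(i) for i in range(1, 25)},
    axis = 1
)

models = {}
for step in (1, 2):
    # Step 1 predicts y_t from lags 1-24; step 2 predicts y_{t+1}, and so on
    target = data_train['Demand'].shift(-(step - 1))
    valid  = lags_matrix.notna().all(axis=1) & target.notna()
    models[step] = LGBMRegressor(verbose=-1).fit(lags_matrix[valid], target[valid])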

In [46]:
# Forecaster with direct method
# ==============================================================================
forecaster = ForecasterAutoregDirect(
                 regressor        = LGBMRegressor(**params),
                 steps            = 24, # Steps to forecast
                 lags             = selected_lags
             )

# Backtesting model
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['Demand'],
                          exog               = data[selected_exog_features],
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(data[:end_validation]),
                          refit              = False,
                          n_jobs             = 'auto',
                          verbose            = False,
                          show_progress      = True
                      )

print(f"Backtest error: {metric:.2f}")
predictions.head()
Backtest error: 125.51
Out[46]:
pred
2014-12-01 00:00:00 5603.557489
2014-12-01 01:00:00 5733.570513
2014-12-01 02:00:00 5808.531093
2014-12-01 03:00:00 5932.285155
2014-12-01 04:00:00 6230.603757

The direct multi-step forecasting model outperforms the predictive capability of the recursive model, further reducing the mean absolute error. However, it is important to keep in mind its higher computational cost in assessing whether it is worth implementing.

Anticipated daily forecast

So far, the model has been evaluated under the assumption that the forecasts for the next day are generated at exactly 11:59 PM each day. This approach is impractical, as it leaves no time to plan for the early hours of the following day.

Now, suppose that the predictions for the next day need to be generated at 11:00 AM each day to allow sufficient lead time. This means that, at 11:00 on day $D$, the hours [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] of the same day and the hours [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] of day $D+1$ have to be forecast: a total of 36 hours into the future, of which only the last 24 need to be stored.

This type of evaluation can be easily done using the backtesting_forecaster() function and its gap argument. In addition, the allow_incomplete_fold argument allows the last fold to be excluded from the analysis if it does not have the same size as the required number of steps. The process adapted to this scenario is run on a daily basis and consists of the following steps:

  1. At 11:00 a.m. on the first day of the test set, the next 36 hours (the remaining 12 hours of the day plus 24 hours of the following day) are predicted.

  2. Only the forecasts for the next day are stored, i.e. starting from position 12.

  3. Actual data for the next day, up to 11:00 a.m., is added to the set of known values.

  4. The process is repeated.

Thus, at 11:00 a.m. each day, the model has access to the actual demand values recorded up to that time.
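
The arithmetic of steps and gap can be verified directly. A small sketch of the first fold's timeline (timestamps follow the hourly index used throughout this document):

# Timeline of the first backtesting fold with steps=24 and gap=12 (sketch)
last_train   = pd.Timestamp('2014-12-01 11:00')  # end of the extended training set
gap_hours    = pd.date_range(last_train + pd.Timedelta(hours=1), periods=12, freq='H')
stored_hours = pd.date_range(gap_hours[-1] + pd.Timedelta(hours=1), periods=24, freq='H')
print(f"Discarded (gap): {gap_hours[0]}  ->  {gap_hours[-1]}")
print(f"Stored (steps) : {stored_hours[0]}  ->  {stored_hours[-1]}")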


Warning

There are two considerations to take into account:
  • Training data must end where the gap begins. In this case, the initial_train_size must be extended by 12 positions so that the first point to be predicted is 2014-12-01 12:00:00.

  • In this example, although only the last 24 predictions (steps) are stored for model evaluation, the total number of predicted steps in each fold is 36 (steps + gap).
In [47]:
# End of initial_train_size + 12 positions
# ==============================================================================
data.iloc[:len(data.loc[:end_validation])+12].tail(2)
Out[47]:
Demand month_sin month_cos week_of_year_sin week_of_year_cos week_day_sin week_day_cos hour_day_sin hour_day_cos sunrise_hour_sin ... temp_roll_mean_1_day temp_roll_mean_7_day temp_roll_max_1_day temp_roll_min_1_day temp_roll_max_7_day temp_roll_min_7_day holiday_previous_day holiday_next_day Temperature Holiday
Time
2014-12-01 10:00:00 5084.011230 -2.449294e-16 1.0 -0.354605 0.935016 0.781832 0.62349 2.588190e-01 -0.965926 0.965926 ... 25.589582 18.573214 30.950001 20.0 33.849998 11.15 0.0 0.0 19.90 0.0
2014-12-01 11:00:00 4851.066895 -2.449294e-16 1.0 -0.354605 0.935016 0.781832 0.62349 1.224647e-16 -1.000000 0.965926 ... 25.129168 18.601191 29.500000 19.9 33.849998 11.15 0.0 0.0 19.35 0.0

2 rows × 89 columns

In [48]:
# Backtesting with gap
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(**params),
                 lags      = selected_lags,
             )

metric, predictions = backtesting_forecaster(
                          forecaster            = forecaster,
                          y                     = data['Demand'],
                          exog                  = data[selected_exog_features],
                          steps                 = 24,
                          metric                = 'mean_absolute_error',
                          initial_train_size    = len(data.loc[:end_validation])+12,
                          gap                   = 12, # Number of positions to skip
                          allow_incomplete_fold = True,
                          refit                 = False,
                          n_jobs                = 'auto',
                          verbose               = False,
                          show_progress         = True  
                      )

print("")
print('Backtesting metric:', metric)
predictions.head(5)
Backtesting metric: 155.9956576546436
Out[48]:
pred
2014-12-02 00:00:00 5491.432984
2014-12-02 01:00:00 5572.316160
2014-12-02 02:00:00 5626.441703
2014-12-02 03:00:00 5773.025609
2014-12-02 04:00:00 5928.848210

As expected, the error increases as the forecast horizon increases from 24 to 36 hours.

Conclusions

Gradient boosting models have proven to be a powerful tool for forecasting energy demand. One of their main advantages is that they can easily incorporate exogenous variables, which can significantly improve predictive performance. In addition, explainability techniques help to understand, visually and quantitatively, how variables and their values affect predictions. All of these tasks are easily addressed with the skforecast library.

Forecaster                            Exogenous Variables        MAE backtest
ForecasterEquivalentDate (Baseline)   False                      308.4
ForecasterAutoreg                     False                      210.1
ForecasterAutoreg                     True                       132.8
ForecasterAutoreg                     True (Feature Selection)   134.4
ForecasterAutoregDirect               True (Feature Selection)   125.5

Session information

In [49]:
import session_info
session_info.show(html=False)
-----
astral              3.2
lightgbm            4.1.0
matplotlib          3.8.2
numpy               1.26.2
optuna              3.4.0
pandas              2.1.3
plotly              5.18.0
session_info        1.0.0
shap                0.44.0
skforecast          0.11.0
sklearn             1.3.2
statsmodels         0.14.0
-----
IPython             8.17.2
jupyter_client      8.6.0
jupyter_core        5.5.0
notebook            6.4.12
-----
Python 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
Windows-10-10.0.19045-SP0
-----
Session information updated at 2023-12-09 10:46


Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source. Thank you!

Forecasting energy demand with machine learning by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://www.cienciadedatos.net/py29-forecasting-electricity-power-demand-python.html

How to cite skforecast

If you use skforecast for a scientific publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2023). skforecast (v0.10.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2023). skforecast (Version 0.10.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.10.0},
  month   = {9},
  year    = {2023},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}


Did you like the article? Your support is important

Maintaining a website has a high cost; your contribution will help us to continue generating free educational content. Many thanks! 😊


Creative Commons Licence
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.