Intermittent demand forecasting with skforecast

Joaquín Amat Rodrigo, Javier Escobar OrtizApril, 2023

More use cases of forecasting with skforecast: skforecast examples

Introduction¶

Intermittent demand forecasting is a statistical method used to predict the demand for products that have sporadic or irregular sales patterns. These types of products are characterized by periods of high demand followed by periods of little or no demand. Intermittent demand forecasting is used in the manufacturing, retail, and healthcare industries to manage inventory levels, optimize production schedules, and reduce out-of-stocks and excess inventory costs.

Regular intermittent demand refers to predictable demand patterns with known intervals between demand periods. This type of intermittent demand occurs with a degree of regularity, such as a product that has seasonal demand or a product that is ordered every month on a certain date. Forecasting of regular intermittent demand can be done using statistical methods, such as Croston's method, which is a common technique used to forecast demand in these situations. However, these methods often do not take into account important exogenous variables that can provide valuable information about demand intervals. As a result, machine learning-based forecasting can be an appealing alternative as it can incorporate these variables and improve accuracy.

On the other hand, irregular intermittent demand refers to unpredictable demand patterns, with no known intervals between demand periods. This type of intermittent demand is random and unpredictable, such as a product that is ordered only a few times a year and in varying quantities. Forecasting irregular intermittent demand is challenging because there is no predictable pattern to the demand. Traditional forecasting methods may not work effectively in these situations and alternative forecasting methods, such as bootstrapping or simulation, may be required.

In summary, regular intermittent demand has a predictable pattern whereas irregular intermittent demand does not. Forecasting regular intermittent demand is easier than forecasting irregular intermittent demand due to the predictability of the demand pattern.

This document demonstrates how the Python library skforecast can be used to forecast regular intermittent demand scenarios. By using this library, the machine learning model can focus on learning to predict demand during periods of interest, while avoiding the influence of periods when there is no demand.

Libraries¶

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "seaborn"

# Modeling
# ==============================================================================
from lightgbm import LGBMRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import mean_absolute_error
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster

Data¶

The data used in this example represents the number of users who visited a store during its opening hours from Monday to Friday, between 7:00 and 20:00. Therefore, any predictions outside this period are not useful and can either be ignored or set to 0.

# Downloading data
# ======================================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine'
       '-learning-python/master/data/intermittent_demand.csv')
data = pd.read_csv(url, sep=',')
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
data.head(3)

# Split train-val-test
# ======================================================================================
end_train = '2012-03-31 23:59:00'
end_validation = '2012-08-31 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validation : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Dates train      : 2011-01-01 00:00:00 --- 2012-03-31 23:00:00  (n=10944)
Dates validation : 2012-04-01 00:00:00 --- 2012-08-31 23:00:00  (n=3672)
Dates test       : 2012-09-01 00:00:00 --- 2012-12-31 23:00:00  (n=2928)

# Plot time series
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_train.index, y=data_train['users'], name="train", mode="lines")
trace2 = go.Scatter(x=data_val.index, y=data_val['users'], name="validation", mode="lines")
trace3 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.update_layout(
    title="Time series of users",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 800,
    height = 400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

# Boxplot for weekly seasonality
# ==============================================================================
data['week_day'] = data.index.day_of_week + 1
fig = px.box(
        data,
        x="week_day",
        y="users",
        title = 'Distribution of users per day of week',
        width=600,
        height=300
     )
median_values = data.groupby('week_day')['users'].median()
fig.add_trace(
    go.Scatter(
        x=median_values.index,
        y=median_values.values,
        mode='lines+markers',
        line=dict(color='blue', dash='dash'),
        showlegend=False
    )
)
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

# Boxplot for daily seasonality
# ==============================================================================
data['hour_day'] = data.index.hour + 1
fig = px.box(
        data,
        x="hour_day",
        y="users",
        title = 'Distribution of users per hour of day',
        width=600,
        height=300
     )
median_values = data.groupby('hour_day')['users'].median()
fig.add_trace(
    go.Scatter(
        x=median_values.index,
        y=median_values.values,
        mode='lines+markers',
        line=dict(color='blue', dash='dash'),
        showlegend=False
    )
)
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

Define a custom metric to evaluate the model¶

To accurately evaluate the performance of the model, it is crucial to define a metric that closely reflects the business scenario in which the model will be used. Specifically, in this case, the model's performance should be optimized during weekdays from 9:00 to 20:00.

Fortunately, skforecast provides the flexibility to use custom functions as metrics for backtesting and model tuning.

def custom_metric(y_true, y_pred):
    """
    Calculate the mean absolute error using only the predicted values for weekdays
    from 9:00 AM to 8:00 PM
    """

    day_of_week = y_true.index.day_of_week
    hour_of_day = y_true.index.hour
    mask = day_of_week.isin([0, 1, 2, 3, 4]) | ((hour_of_day > 7) | (hour_of_day < 20))
    metric = mean_absolute_error(y_true[mask], y_pred[mask])
    
    return metric

Forecasting¶

# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(
                                 learning_rate = 0.1,
                                 max_depth     = 5,
                                 n_estimators  = 500,
                                 random_state  = 123,
                             ),
                 lags = 24
             )
forecaster

================= 
ForecasterAutoreg 
================= 
Regressor: LGBMRegressor(max_depth=5, n_estimators=500, random_state=123) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
Transformer for y: None 
Transformer for exog: None 
Window size: 24 
Weight function included: False 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: None 
Training index type: None 
Training index frequency: None 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0} 
Creation date: 2023-04-30 12:00:01 
Last fit date: None 
Skforecast version: 0.7.0 
Python version: 3.10.9 
Forecaster id: None

# Backtesting test period
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['users'],
                          initial_train_size = len(data.loc[:end_validation]),
                          fixed_train_size   = False,
                          steps              = 36,
                          refit              = False,
                          metric             = custom_metric,
                          verbose            = False # Change to True to see detailed information
                      )

print(f"Backtest error: {metric}")

Backtest error: 108.04513580389987

# Set to zero predictions for closed hours
# ==============================================================================
hour = data_test.index.hour
day_of_week = data_test.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
predictions[is_closed] = 0

# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 800,
    height = 400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

When analysing the estimated predictions, it is clear that the model struggles to accurately capture the patterns of store opening and closing times. In addition, the influence of lags leads to an underestimation of the first few hours of the day and the days following the closing days.

# Distribution of residuals by day of week and hour of day
# ==============================================================================
residuals = (predictions['pred'] - data_test['users']).to_frame('residuals')
residuals['week_day'] = residuals.index.day_of_week + 1
residuals['hour_day'] = residuals.index.hour + 1

fig = px.box(
        residuals,
        x="week_day",
        y="residuals",
        title = 'Distribution of residuals per hour of day',
        width=600,
        height=300
     )
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

fig = px.box(
        residuals,
        x="hour_day",
        y="residuals",
        title = 'Distribution of residuals per hour of day',
        width=600,
        height=300
     )
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

Inform the model when the store is closed¶

Exogenous variables can be used in a forecasting model to provide additional information and improve the model's ability to detect patterns. This approach offers the advantage of incorporating external factors that could influence the accuracy of the forecast, leading to a more reliable and accurate forecasting model. The exog argument makes it incredibly easy to integrate exogenous variables into the forecaster.

# Create exogenous variable
# ==============================================================================
hour = data.index.hour
day_of_week = data.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
data['is_closed'] = is_closed

end_train = '2012-03-31 23:59:00'
end_validation = '2012-08-31 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

# Backtesting test period
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = data['users'],
    exog               = data['is_closed'],
    initial_train_size = len(data.loc[:end_validation]),
    fixed_train_size   = False,
    steps              = 36,
    refit              = False,
    metric             = custom_metric,
    verbose            = False # Change to True to see detailed information
)

print(f"Backtest error: {metric}")

Backtest error: 52.93064967792484

# Set to zero predictions for closed hours
# ==============================================================================
hour = data_test.index.hour
day_of_week = data_test.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
predictions[is_closed] = 0

# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 800,
    height = 400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

Incorporating an exogenous variable into the model led to a significant improvement in forecasting accuracy, with the error being halved. This highlights the importance of using external factors to enhance the performance of the forecasting model.

Model tunning¶

Hyperparameter tuning is a critical aspect of developing accurate and effective machine learning models. Hyperparameters are values that cannot be learned from data and must be set by the user before the model is trained. These hyperparameters can have a significant impact on the performance of the model, and careful tuning can improve its accuracy and generalisation to new data. In the case of forecasting models, the lags included in the model can be considered as an additional hyperparameter.

Hyperparameter tuning involves systematically testing different values or combinations of hyperparameters (including lags) to find the optimal configuration that gives the best results. The skforecast library offers several hyperparameter tuning strategies, including grid search, random search, and Bayesian search, to identify the optimal combination of lags and hyperparameters that achieve the best prediction performance.

# Grid search hyperparameters and lags
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(
                                 learning_rate = 0.1,
                                 max_depth     = 5,
                                 n_estimators  = 500,
                                 random_state  = 123,
                             ),
                 lags = 24
             )

# Lags used as predictors
lags_grid = [24, 48, 72]

# Regressor hyperparameters
param_grid = {'n_estimators': [50, 100, 500],
              'max_depth': [5, 10, 15],
              'learning_rate': [0.01, 0.1]}

results_grid = grid_search_forecaster(
                   forecaster         = forecaster,
                   y                  = data.loc[:end_validation, 'users'],
                   exog               = data.loc[:end_validation, 'is_closed'],
                   param_grid         = param_grid,
                   lags_grid          = lags_grid,
                   steps              = 36,
                   refit              = False,
                   metric             = custom_metric,
                   initial_train_size = len(data.loc[:end_train]),
                   fixed_train_size   = False,
                   return_best        = True,
                   verbose            = False
               )

Number of models compared: 54.

loop lags_grid: 100%|███████████████████████████████████████| 3/3 [01:50<00:00, 36.71s/it]

`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72] 
  Parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100}
  Backtesting metric: 39.30052275988041

# Backtesting test period
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data['users'],
                          exog               = data['is_closed'],
                          initial_train_size = len(data.loc[:end_validation]),
                          fixed_train_size   = False,
                          steps              = 36,
                          refit              = False,
                          metric             = custom_metric,
                          verbose            = False # Change to True to see detailed information
                      )

print(f"Backtest error: {metric}")

Backtest error: 39.86155496691617

# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 800,
    height = 400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

An improved set of hyperparameters has been discovered through grid search, resulting in a further reduction of forecast errors.

Session information¶

import session_info
session_info.show(html=False)

-----
lightgbm            3.3.5
matplotlib          3.6.3
numpy               1.23.5
pandas              1.5.3
plotly              5.13.1
session_info        1.0.0
skforecast          0.7.0
sklearn             1.2.2
-----
IPython             8.10.0
jupyter_client      7.3.4
jupyter_core        5.2.0
-----
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
Linux-5.15.0-1035-aws-x86_64-with-glibc2.31
-----
Session information updated at 2023-04-30 12:02

%%html

How to cite this document?

Intermittent demand forecasting with skforecast by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a CC BY-NC-SA 4.0 at https://www.cienciadedatos.net/documentos/py48-intermittent-demand-forecasting.html

Did you like the article? Your support is important

Website maintenance has high cost, your contribution will help me to continue generating free educational content. Many thanks! 😊

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.

	users
date_time
2011-01-01 00:00:00	0.0
2011-01-01 01:00:00	0.0
2011-01-01 02:00:00	0.0