Forecasting web traffic with machine learning and Python

If you like  Skforecast ,  please give us a star on   GitHub! ⭐️

Forecasting web traffic with machine learning and Python

Joaquín Amat Rodrigo, Javier Escobar Ortiz
September, 2021 (last update July 2023)

Introduction

A time series is a succession of chronologically ordered data spaced at equal or unequal intervals. The forecasting process consists of predicting the future value of a time series, either by modeling the series solely based on its past behavior (autoregressive) or by using other external variables. This document describes how to use machine learning and statistical models in order to forecast the number of users who visit a website.

Use case

The history of daily visits to the website cienciadedatos.net is available since 07/01/2020. The goal is to generate a forecasting model capable of predicting the web traffic during the next 7 days. The user wants to be able to run the model every Monday and obtain daily traffic predictions for the rest of the week.

To evaluate the performance of the model according to its intended use, it is advisable not to predict only the last 7 days of the time series, but to simulate the whole process. Backtesting is a special type of cross-validation applied to the previous period(s) and can be used with different strategies:

Backtesting with refit and increasing training size (fixed origin)

The model is trained each time before making predictions. With this configuration, the model uses all the data available so far. It is a variation of the standard cross-validation but, instead of making a random distribution of the observations, the training set increases sequentially, maintaining the temporal order of the data.

Backtesting with refit and fixed training size (rolling origin)

A technique similar to the previous one but, in this case, the forecast origin rolls forward, therefore, the size of training remains constant. This is also known as time series cross-validation or walk-forward validation.

Backtesting with intermittent refit

The model is retrained every $n$ iterations of predictions.

Backtesting without refit

After an initial train, the model is used sequentially without updating it and following the temporal order of the data. This strategy has the advantage of being much faster since the model is trained only once. However, the model does not incorporate the latest information available, so it may lose predictive capacity over time.


The most appropriate validation method will depend on the strategy to be used in production, whether the model will be periodically retrained or not before the prediction process. Regardless of the strategy used, it is important not to include test data in the search process to avoid overfitting problems.

Libreries

Libraries used in this document are:

In [1]:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas
%matplotlib inline
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 10

# Modelling and forecasting
# ==============================================================================
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster
from skforecast.ForecasterSarimax import ForecasterSarimax
from skforecast.model_selection_sarimax import backtesting_sarimax
from skforecast.model_selection_sarimax import grid_search_sarimax

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from pmdarima import ARIMA

# Warnings config
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

Data

Data has been obtained from the google analytics service integrated into the website and can be downloaded here. The fields included are:

  • date: day/month/year

  • users: total number of users who visit the web

In [2]:
# Data downloading
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/' +
       'master/data/visitas_por_dia_web_cienciadedatos.csv')
data = pd.read_csv(url, sep=',')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421 entries, 0 to 420
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    421 non-null    object
 1   users   421 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 6.7+ KB

Column date is stored as object. It is converted to datetime type using pd.to_datetime () function. Furthermore, it is set as an index to take advantage of pandas functionalities and finally, its frequency is set to 1 day.

In [3]:
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%d/%m/%y')
data = data.set_index('date')
data = data.asfreq('1D')
data = data.sort_index()
data.head(3)
Out[3]:
users
date
2020-07-01 2324
2020-07-02 2201
2020-07-03 2146
In [4]:
# Check index is complete or there are missing values
# ==============================================================================
(data.index == pd.date_range(
                    start = data.index.min(),
                    end   = data.index.max(),
                    freq  = data.index.freq)).all()
Out[4]:
True
In [5]:
print(f"Missing values: {data.isnull().any(axis=1).sum()}")
Missing values: 0

The data set (starts on 2020-07-01 and ends on 2021-08-22), is divided into 3 partitions: one for training, one for validation and one for testing.

In [6]:
# Split data: train-validation-test
# ==============================================================================
end_train = '2021-03-30 23:59:00'
end_validation = '2021-06-30 23:59:00'

data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Training dates   : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Validation dates : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Test dates       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Training dates   : 2020-07-01 00:00:00 --- 2021-03-30 00:00:00  (n=273)
Validation dates : 2021-03-31 00:00:00 --- 2021-06-30 00:00:00  (n=92)
Test dates       : 2021-07-01 00:00:00 --- 2021-08-25 00:00:00  (n=56)

Graphical exploration

When working with time series, it is important to represent their values. This allows patterns such as trends and seasonality to be identified.

Time series

In [7]:
# Static plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(12, 4))
# data_train.users.plot(ax=ax, label='train', linewidth=1)
# data_val.users.plot(ax=ax, label='val', linewidth=1)
# data_test.users.plot(ax=ax, label='test', linewidth=1)
# ax.set_title('Daily visitors')
# ax.legend();

# Interactive plot
# ==============================================================================
plot_train = data_train.users.hvplot.line(label='train')
plot_val   = data_val.users.hvplot.line(label='val')
plot_test  = data_test.users.hvplot.line(label='test')

layout = plot_train * plot_val * plot_test
layout = layout.opts(title='Daily visitors', ylabel='users')
layout = layout.opts(height=300, width=550)
layout
Out[7]:

Seasonality

In [8]:
# Static plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(7, 3.5))
# data['month'] = data.index.month
# data.boxplot(column='users', by='month', ax=ax,)
# data.groupby('month')['users'].median().plot(style='o-', linewidth=0.8, ax=ax)
# ax.set_ylabel('users')
# ax.set_title('Monthly visitors')
# fig.suptitle('');

# Interactive plot
# ==============================================================================
data['month'] = data.index.month
boxplot  = data.sort_values('month').hvplot.box(
               y                  = 'users',
               by                 = 'month',
               legend             = False,
               box_fill_color     = None,
               outlier_fill_color = None
            )
lineplot = data.groupby('month')['users'].median().hvplot.line(legend=False)
scatterplot = data.groupby('month')['users'].median().hvplot.scatter(legend=False)
layout = boxplot * lineplot * scatterplot
layout = layout.opts(title='Monthly visitors', ylabel='users')
layout = layout.opts(height=300, width=500)
layout
Out[8]:
In [9]:
# Static plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(9, 3.5))
# data['month_day'] = pd.Series(data.index).dt.day.values
# data.boxplot(column='users', by='month_day', ax=ax,)
# data.groupby('month_day')['users'].median().plot(style='o-', linewidth=0.8, ax=ax)
# ax.set_ylabel('users')
# ax.set_title('Month-day visitors')
# fig.suptitle('');

# Interactive plot
# ==============================================================================
data['month_day'] = pd.Series(data.index).dt.day.values
boxplot = data.hvplot.box(
              y                  = 'users',
              by                 = 'month_day',
              legend             = False,
              box_fill_color     = None,
              outlier_fill_color = None
          )
lineplot = data.groupby('month_day')['users'].median().hvplot.line(legend=False)
scatterplot = data.groupby('month_day')['users'].median().hvplot.scatter(legend=False)
layout = boxplot * lineplot * scatterplot
layout = layout.opts(title='Month-day visitors', ylabel='users')
layout = layout.opts(height=300, width=500)
layout
Out[9]:
In [10]:
# Static plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(7, 3.5))
# data['week_day'] = data.index.day_of_week + 1
# data.boxplot(column='users', by='week_day', ax=ax)
# data.groupby('week_day')['users'].median().plot(style='o-', linewidth=0.8, ax=ax)
# ax.set_ylabel('users')
# ax.set_title('Week-day visitors')
# fig.suptitle('');

# Interactive plot
# ==============================================================================
data['week_day'] = data.index.day_of_week + 1
boxplot = data.sort_values('week_day').hvplot.box(
              y                  = 'users',
              by                 = 'week_day',
              legend             = False,
              box_fill_color     = None,
              outlier_fill_color = None
          )
lineplot = data.groupby('week_day')['users'].median().hvplot.line(legend=False)
scatterplot = data.groupby('week_day')['users'].median().hvplot.scatter(legend=False)
layout = boxplot * lineplot * scatterplot
layout = layout.opts(title = 'Week-day visitors', ylabel = 'users')
layout = layout.opts(height=300, width=500)
layout
Out[10]:

It cannot be determined if there is an annual seasonality since the data does not span two years. However, there is a weekly seasonality, with a reduction in web traffic on weekends.

Autocorrelation plots

In [11]:
# Autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 2))
plot_acf(data.users, ax=ax, lags=50)
plt.show()
In [12]:
# Partial autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 2))
plot_pacf(data.users, ax=ax, lags=50)
plt.show()

Autocorrelation and partial autocorrelation plots show a clear association between the number of users on a day and the previous days. This is an indication that autoregressive models may achive good predictions.

Autoregressive model

A autoregressive model (ForecasterAutoreg) is trained using a linear regressor with Ridge regularization, and a time window of 2 weeks (14 lags). The latter means that, for each prediction, the traffic the website had in the previous 14 days is used as predictors.

Ridge models require predictors to be standardized. A StandardScaler is added to the forecaster using the argument transformer_y.

Training Forecaster

In [13]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor     = Ridge(random_state=123),
                 lags          = 14,
                 transformer_y = StandardScaler(),
                 forecaster_id = 'web_traffic'
             )

forecaster.fit(y=data_train.users)
forecaster
Out[13]:
================= 
ForecasterAutoreg 
================= 
Regressor: Ridge(random_state=123) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14] 
Transformer for y: StandardScaler() 
Transformer for exog: None 
Window size: 14 
Weight function included: False 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2020-07-01 00:00:00'), Timestamp('2021-03-30 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: D 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001} 
fit_kwargs: {} 
Creation date: 2023-07-14 13:16:35 
Last fit date: 2023-07-14 13:16:35 
Skforecast version: 0.9.0 
Python version: 3.11.4 
Forecaster id: web_traffic 

Prediction (backtest)

In order to evaluate the model, it is trained using data from 2020-07-01 to 2021-06-30 and then, predictions are made seven days at a time, without retraining the model. This type of validation is known as backtesting, and can be easily applied with the function backtesting_forecaster ().

In [14]:
# Backtest
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data.users,
                          initial_train_size = len(data.loc[:end_validation]),
                          steps              = 7,
                          refit              = False,
                          fixed_train_size   = False,
                          metric             = 'mean_absolute_error',
                          verbose            = True,
                          show_progress      = False
                      )

print(f'Backtest error: {metric}')
predictions.head(5)
Information of backtesting process
----------------------------------
Number of observations used for initial training: 365
Number of observations used for backtesting: 56
    Number of folds: 8
    Number of steps per fold: 7
    Number of steps to exclude from the end of each train set before test (gap): 0

Fold: 0
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-07-01 00:00:00 -- 2021-07-07 00:00:00  (n=7)
Fold: 1
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-07-08 00:00:00 -- 2021-07-14 00:00:00  (n=7)
Fold: 2
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-07-15 00:00:00 -- 2021-07-21 00:00:00  (n=7)
Fold: 3
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-07-22 00:00:00 -- 2021-07-28 00:00:00  (n=7)
Fold: 4
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-07-29 00:00:00 -- 2021-08-04 00:00:00  (n=7)
Fold: 5
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-08-05 00:00:00 -- 2021-08-11 00:00:00  (n=7)
Fold: 6
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-08-12 00:00:00 -- 2021-08-18 00:00:00  (n=7)
Fold: 7
    Training:   2020-07-01 00:00:00 -- 2021-06-30 00:00:00  (n=365)
    Validation: 2021-08-19 00:00:00 -- 2021-08-25 00:00:00  (n=7)

Backtest error: 221.0161509090934
Out[14]:
pred
2021-07-01 3199.921107
2021-07-02 2998.675773
2021-07-03 2256.899442
2021-07-04 2038.767690
2021-07-05 2998.858487
In [15]:
# Plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(10, 3))
# data_test.loc[predictions.index, 'users'].plot(ax=ax, linewidth=2, label='test')
# predictions.plot(linewidth=2, label='prediction', ax=ax)
# ax.set_title('Predictions vs real values')
# ax.legend();

plot_test = data_test.users.hvplot.line(label='test')
plot_predict = predictions.hvplot.line(label='prediction')
layout = plot_test * plot_predict
layout = layout.opts(
             title = 'Predictions vs real values',
             ylabel = 'users',
             legend_position = 'bottom_left'
         )
layout = layout.opts(height=300, width=550)
layout
Out[15]:

Tuning the hyper-parameters

In the previous section, the first 14 lags have been used as predictors and a Ridge model with the default hyperparameters as regressor. However, there is no reason why these values should be the most appropriate.

In order to identify the best combination of lags and hyperparameters, a grid search is used. This process consists of training a model with each combination of hyperparameters-lags, and evaluating its predictive capacity using backtesting. It is important to evaluate the models using only the validation data and not include the test data.

In [16]:
# Grid search of hyper-parameters
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor     = Ridge(random_state=123),
                 lags          = 14,
                 transformer_y = StandardScaler(),
                 forecaster_id = 'web_traffic'
             )

# Regressor's hyper-parameters
param_grid = {'alpha': np.logspace(-3, 3, 10)}

# Lags used as predictors
lags_grid = [7, 14, 21, [7, 14, 21]]

grid_results = grid_search_forecaster(
                   forecaster         = forecaster,
                   y                  = data.loc[:end_validation, 'users'],
                   param_grid         = param_grid,
                   lags_grid          = lags_grid,
                   steps              = 7,
                   metric             = 'mean_absolute_error',
                   refit              = False,
                   initial_train_size = len(data_train),
                   fixed_train_size   = False,
                   return_best        = True,
                   show_progress      = True,
                   verbose            = False
               )
Number of models compared: 40.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14] 
  Parameters: {'alpha': 2.154434690031882}
  Backtesting metric: 214.7916766843454

In [17]:
# Grid search results
# ==============================================================================
grid_results.head(10)
Out[17]:
lags params mean_absolute_error alpha
15 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 2.154434690031882} 214.791677 2.154435
14 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 0.4641588833612777} 216.551409 0.464159
13 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 0.09999999999999999} 217.770757 0.100000
12 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 0.021544346900318832} 218.127429 0.021544
11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 0.004641588833612777} 218.206475 0.004642
10 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 0.001} 218.223609 0.001000
26 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'alpha': 10.0} 219.315042 10.000000
16 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] {'alpha': 10.0} 220.932702 10.000000
25 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'alpha': 2.154434690031882} 221.500069 2.154435
24 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'alpha': 0.4641588833612777} 226.720683 0.464159

The best results are obtained using lags [1 2 3 4 5 6 7 8 9 10 11 12 13 14] and a configuration of Ridge {'alpha': 2.154}. By indicating return_best = True in the grid_search_forecaster function, at the end of the process, the forecaster object is automatically retrained with the best configuration and the whole data set.

In [18]:
forecaster
Out[18]:
================= 
ForecasterAutoreg 
================= 
Regressor: Ridge(alpha=2.154434690031882, random_state=123) 
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14] 
Transformer for y: StandardScaler() 
Transformer for exog: None 
Window size: 14 
Weight function included: False 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2020-07-01 00:00:00'), Timestamp('2021-06-30 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: D 
Regressor parameters: {'alpha': 2.154434690031882, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001} 
fit_kwargs: {} 
Creation date: 2023-07-14 13:16:35 
Last fit date: 2023-07-14 13:16:37 
Skforecast version: 0.9.0 
Python version: 3.11.4 
Forecaster id: web_traffic 

Once the best model has been identified and trained (using both, the training and the validation set), its prediction error is calculated with the test set.

In [19]:
# Backtest final model using test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data.users,
                          initial_train_size = len(data.loc[:end_validation, :]),
                          steps              = 7,
                          refit              = False,
                          fixed_train_size   = False,
                          metric             = 'mean_absolute_error',
                          show_progress      = True,
                          verbose            = False
                      )

print(f'Backtest error using test data: {metric}')
Backtest error using test data: 216.35610241344355

After optimizing lags and hyperparameters, it has been possible to reduce the prediction error.

ARIMA model

SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors) is a generalization of the ARIMA model that incorporates both seasonality and exogenous variables. SARIMAX models are among the most widely used statistical forecasting models with excellent forecasting performance.

Skforcast uses the pmdarima implementation of the ARIMA model in conjunction with the ForecasterSarimax class. Validation and optimization can be performed using the backtesting_sarimax() and grid_search_sarimax() functions of the model_selection_sarimax module.

In [20]:
# ForecasterSarimax
# ==============================================================================
forecaster_sarimax = ForecasterSarimax(
                         regressor     = ARIMA(order=(14, 0, 0), maxiter=250),
                         fit_kwargs    = {'disp': 0},
                         forecaster_id = 'web_traffic_sarimax'
                     )

# Backtest ARIMA
# ==============================================================================
metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster_sarimax,
                          y                  = data.users,
                          initial_train_size = len(data.loc[:end_validation]),
                          steps              = 7,
                          metric             = 'mean_absolute_error',
                          refit              = False,
                          fixed_train_size   = False,
                          verbose            = False,
                          show_progress      = True
                      )

print(f'Backtest error: {metric}')
predictions.head(5)
Backtest error: 221.16168467861522
Out[20]:
pred
2021-07-01 3168.883918
2021-07-02 2982.103691
2021-07-03 2241.585640
2021-07-04 2007.877835
2021-07-05 2986.983560
In [21]:
# fig, ax = plt.subplots(figsize=(10, 3))
# data_test.loc[predictions.index, 'users'].plot(linewidth=2, label='test', ax=ax)
# predictions.plot(linewidth=2, label='prediction', ax=ax)
# ax.set_title('Predictions (ARIMA) vs real values')
# ax.legend();

plot_test = data_test.users.hvplot.line(label='test')
plot_predict = predictions['pred'].hvplot.line(label='prediction')
layout = plot_test * plot_predict
layout = layout.opts(
             title = 'Predictions (ARIMA) vs real values',
             ylabel = 'users',
             legend_position = 'bottom_left'
         )
layout = layout.opts(height=300, width=550)
layout
Out[21]:

Tuning the hyperparameters

Like most models, ARIMA has several hyperparameters that control its behavior:

  • $p$ is the order (number of time lags) of the autoregressive part of the model.

  • $d$ is the degree of differencing (the number of times that past values have been subtracted from the data).

  • $q$ is the order of the moving-average part of the model.

In the pmdarima implementation, these hyperparameters are specified by the order argument. Two good references for more details on ARIMA models are: https://openforecast.org/adam/ARIMA.html and https://otexts.com/fpp3/arima.html.

The grid_search_sarimax function can be used to perform a hyperparameter search by comparing the models according to a metric obtained by backtesting. Check other tuning strategies.

In [22]:
# Grid search of hyperparameters
# ==============================================================================
param_grid = {'order': [(14, 0, 0), (14, 2, 0), (14, 1, 0), (14, 1, 1), (14, 1, 4),
                        (21, 0, 0), (21, 0, 0), (21, 1, 0), (21, 1, 1), (21, 1, 4)]}

results = grid_search_sarimax(
              forecaster         = forecaster_sarimax,
              y                  = data.users,
              param_grid         = param_grid,
              initial_train_size = len(data.loc[:end_validation]),
              steps              = 7,
              metric             = 'mean_absolute_error',
              refit              = False,
              fixed_train_size   = False,
              return_best        = True,
              verbose            = False
          )

results
Number of models compared: 10.
`Forecaster` refitted using the best-found parameters, and the whole data set: 
  Parameters: {'order': (21, 1, 1)}
  Backtesting metric: 181.8471530226284

Out[22]:
params mean_absolute_error order
8 {'order': (21, 1, 1)} 181.847153 (21, 1, 1)
7 {'order': (21, 1, 0)} 183.105023 (21, 1, 0)
3 {'order': (14, 1, 1)} 194.957834 (14, 1, 1)
2 {'order': (14, 1, 0)} 195.250881 (14, 1, 0)
5 {'order': (21, 0, 0)} 201.219922 (21, 0, 0)
6 {'order': (21, 0, 0)} 201.219922 (21, 0, 0)
1 {'order': (14, 2, 0)} 201.630906 (14, 2, 0)
9 {'order': (21, 1, 4)} 210.298115 (21, 1, 4)
0 {'order': (14, 0, 0)} 221.161685 (14, 0, 0)
4 {'order': (14, 1, 4)} 237.137797 (14, 1, 4)
In [23]:
# Backtest final model using test data
# ==============================================================================
metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster_sarimax,
                          y                  = data.users,
                          initial_train_size = len(data.loc[:end_validation]),
                          steps              = 7,
                          metric             = 'mean_absolute_error',
                          refit              = False,
                          fixed_train_size   = False,
                          verbose            = False,
                          show_progress      = True
                      )

print(f'Backtest error: {metric}')
Backtest error: 181.8471530226284

Forecasting adding exogenous features

In the previous example, only lags of the target variable itself were used as predictors. In certain situations, it is possible to have information on other variables whose future value is known, which can be used as additional predictors in the model. Some typical examples are:

  • Holidays (local, national...)

  • Month of the year

  • Day of the week

  • Time of day

In this use case, the graphical analysis showed evidence that the number of visits to the website decreases at weekends. The day of the week that each date corresponds to can be known in advance, so it can be used as an exogenous variable. See how it affects the models when this information is included as a predictor.

In [24]:
# Creation of new exogenous features
# ==============================================================================
data = data.drop(columns=['month', 'month_day'])
# One hot encoding of week day
data = pd.get_dummies(data, columns=['week_day'], dtype='int64')
data.head(3)
Out[24]:
users week_day_1 week_day_2 week_day_3 week_day_4 week_day_5 week_day_6 week_day_7
date
2020-07-01 2324 0 0 1 0 0 0 0
2020-07-02 2201 0 0 0 1 0 0 0
2020-07-03 2146 0 0 0 0 1 0 0
In [25]:
# Split data train-val-test
# ==============================================================================
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

ForecasterAutoreg

In [26]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor     = Ridge(alpha=2.15, random_state=123),
                 lags          = 14,
                 transformer_y = StandardScaler(),
                 forecaster_id = 'web_traffic'
             )

col_exog = [column for column in data.columns if column.startswith(('week'))]
forecaster.fit(y=data_train.users, exog=data_train[col_exog])

# Backtest forecaster with exogenous features
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster         = forecaster,
                          y                  = data.users,
                          exog               = data[col_exog],
                          initial_train_size = len(data.loc[:end_validation, :]),
                          steps              = 7,
                          metric             = 'mean_absolute_error',
                          refit              = False,
                          fixed_train_size   = False,
                          verbose            = False,
                          show_progress      = True
                      )

print(f'Backtest error: {metric}')
predictions.head(5)
Backtest error: 195.8886834949196
Out[26]:
pred
2021-07-01 3286.890749
2021-07-02 3021.079567
2021-07-03 2256.401254
2021-07-04 1985.650262
2021-07-05 3039.877224

ARIMA

In [27]:
# ForecasterSarimax
# ==============================================================================
forecaster_sarimax = ForecasterSarimax(
                         regressor     = ARIMA(
                                             order = (21, 1, 1), 
                                             seasonal_order = (0,0,0,0),
                                             maxiter = 250
                                         ),
                         fit_kwargs    = {'disp': 0},
                         forecaster_id = 'web_traffic_sarimax'
                     )

# Backtest ARIMA with exogenous features
# ==============================================================================
metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster_sarimax,
                          y                  = data.users,
                          exog               = data[col_exog],
                          initial_train_size = len(data.loc[:end_validation]),
                          steps              = 7,
                          metric             = 'mean_absolute_error',
                          refit              = False,
                          fixed_train_size   = False,
                          verbose            = False,
                          show_progress      = True
                      )

print(f'Backtest error: {metric}')
predictions.head(5)
Backtest error: 204.2734788083199
Out[27]:
pred
2021-07-01 3283.747062
2021-07-02 3020.118688
2021-07-03 2282.196821
2021-07-04 2013.499728
2021-07-05 3106.419001

Prediction intervals

Both functions, backtesting_forecaster and backtesting_sarimax, allow obtaining their intervals in addition to the predictions.

In [28]:
# Backtest with prediction intervals
# ==============================================================================
metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster_sarimax,
                          y                  = data.users,
                          exog               = data[col_exog],
                          initial_train_size = len(data.loc[:end_validation]),
                          steps              = 7,
                          metric             = 'mean_absolute_error',
                          refit              = False,
                          fixed_train_size   = False,
                          alpha              = 0.05,
                          verbose            = False,
                          show_progress      = True
                      )

print(f'Backtest error: {metric}')
predictions.head(5)
Backtest error: 204.2734788083199
Out[28]:
pred lower_bound upper_bound
2021-07-01 3283.747062 2846.969992 3720.524132
2021-07-02 3020.118688 2382.541070 3657.696307
2021-07-03 2282.196821 1551.317084 3013.076558
2021-07-04 2013.499728 1201.558370 2825.441086
2021-07-05 3106.419001 2220.712192 3992.125811
In [29]:
# Static plot
# ==============================================================================
# fig, ax = plt.subplots(figsize=(10, 3))
# data_test.loc[predictions.index, 'users'].plot(linewidth=2, label='test', ax=ax)
# predictions.iloc[:, 0].plot(linewidth=2, label='prediction', ax=ax)
# ax.set_title('Predictions (ARIMA) vs real values')
# ax.fill_between(
#     predictions.index,
#     predictions.iloc[:, 1],
#     predictions.iloc[:, 2],
#     alpha = 0.2,
#     color = 'red',
#     label = 'prediction interval' 
# )
# ax.legend();

# Interactive plot
# ==============================================================================
plot_test = data_test.users.hvplot.line(label='test')
plot_predict = predictions['pred'].hvplot.line(label='prediction')
plot_intervalo = predictions.hvplot.area(
                     y     = 'lower_bound',
                     y2    = 'upper_bound',
                     color = 'red',
                     alpha = 0.2,
                     label = 'prediction interval'
                 )
layout = plot_intervalo * plot_test * plot_predict
layout = layout.opts(
             title = 'Predictions (ARIMA) vs real values',
             ylabel = 'users',
             legend_position = 'bottom_left'
         )
layout = layout.opts(height=300, width=550)
layout
Out[29]:

Conclusion

The best results are obtained with an ARIMA.

Model Exogenous features MAE backtest
ARIMA True 181.3
ARIMA False 181.8
Autoregressive-ridge True 195.9
Autoregressive-ridge False 216.4

How to further improve the model:

Session information

In [30]:
import session_info
session_info.show(html=False)
-----
holoviews           1.16.2
hvplot              0.8.4
matplotlib          3.7.2
numpy               1.24.4
pandas              2.0.3
pmdarima            2.0.3
seaborn             0.12.2
session_info        1.0.0
skforecast          0.9.0
sklearn             1.3.0
statsmodels         0.14.0
-----
IPython             8.14.0
jupyter_client      8.3.0
jupyter_core        5.3.1
notebook            6.5.4
-----
Python 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]
Linux-5.15.0-1039-aws-x86_64-with-glibc2.31
-----
Session information updated at 2023-07-14 13:21

References


Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. book

Time Series Analysis and Forecasting with ADAM Ivan Svetunkov book

Python for Finance: Mastering Data-Driven Finance

How to cite this document?

Forecasting web traffic with machine learning and Python by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a CC BY-NC-SA 4.0 at https://www.cienciadedatos.net/py36-forecasting-visitas-web-machine-learning.html DOI


Did you like the article? Your support is important

Website maintenance has high cost, your contribution will help me to continue generating free educational content. Many thanks! 😊


Creative Commons Licence
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.