Introduction

Although historical data is available in many real-world forecasting use cases, not all of it is reliable. Some examples of these scenarios are:

IoT sensors: within the Internet of Things, sensors capture the raw data from the physical world. Often the sensors are deployed or installed in harsh environments. This inevitably means that the sensors are prone to failure, malfunction, and rapid attrition, causing the sensor to produce unusual and erroneous readings.
Factory shutdown: after a certain period of operation, factories need to be shut down for repair, overhaul, or maintenance activities. These events cause production to stop, generating a gap in the data.
Pandemic (Covid-19): the Covid-19 pandemic changed population behavior significantly, directly impacting many time series such as production, sales, and transportation.

The presence of unreliable or unrepresentative values in the data history is a major problem, as it hinders model learning. For most forecasting algorithms, removing that part of the data is not an option because they require the time series to be complete. An alternative solution is to reduce the weight of the affected observations during model training. This document shows two examples of how skforecast makes it easy to apply this strategy.

✎ Note

In the following examples, a portion of the time series is excluded from model training by giving it a weight of zero. However, weights are not limited to including or excluding observations; they can also be used to balance the degree of influence of each observation on the forecasting model. For example, an observation with a weight of 10 has 10 times more impact on the model training than an observation with a weight of 1.
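
The effect of sample weights can be checked on a small example. The sketch below is an illustration only: it uses a hypothetical helper, `weighted_ridge`, that solves a weighted ridge regression without intercept in plain NumPy, and synthetic data. It is not the code used by skforecast or scikit-learn. It shows that a weight of 0 gives the same coefficients as removing the observation, and that a weight of 2 gives the same coefficients as duplicating it.

```python
import numpy as np


def weighted_ridge(X, y, w, alpha=1.0):
    """
    Hypothetical helper: minimize sum_i w_i * (y_i - x_i @ b)^2 + alpha * ||b||^2
    (no intercept) using the closed-form solution.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w  # equivalent to X.T @ diag(w)
    A = XtW @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, XtW @ y)


# Synthetic data: the first five observations are corrupted
rng = np.random.default_rng(123)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y[:5] += 20

# A weight of 0 on the corrupted rows is equivalent to dropping them
w_zero = np.ones(50)
w_zero[:5] = 0
coef_zero = weighted_ridge(X, y, w_zero)
coef_drop = weighted_ridge(X[5:], y[5:], np.ones(45))
print(np.allclose(coef_zero, coef_drop))

# A weight of 2 is equivalent to including the observation twice
w_two = np.ones(50)
w_two[0] = 2
coef_two = weighted_ridge(X, y, w_two)
coef_dup = weighted_ridge(np.vstack([X, X[:1]]), np.append(y, y[0]), np.ones(51))
print(np.allclose(coef_two, coef_dup))
```

This equivalence holds for models whose loss is a weighted sum over observations, such as this ridge sketch. As the warning that follows explains, it does not necessarily hold for gradient boosting implementations.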

⚠ Warning

In most gradient boosting implementations (LightGBM, XGBoost, CatBoost), samples with zero weight are ignored when calculating the gradients and hessians. However, the values for those samples are still considered when building the feature histograms. Therefore, the resulting model may differ from the model trained without the zero-weighted samples. See more details in this issue.

Libraries

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme

# Modelling and Forecasting
# ==============================================================================
import skforecast
import sklearn
from sklearn.linear_model import Ridge
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster

# Configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

color = '\033[1m\033[38;5;208m'
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")

Version skforecast: 0.16.0
Version scikit-learn: 1.6.1
Version pandas: 2.2.3
Version numpy: 2.2.5

Covid-19 lockdown

During the lockdown period imposed as a consequence of the Covid-19 pandemic, the behavior of the population was altered. An example of this can be seen in the use of the bicycle rental service in the city of Madrid (Spain).

Data

# Data download
# ==============================================================================
data = fetch_dataset('bicimad')
data.head()

bicimad
-------
This dataset contains the daily users of the bicycle rental service (BiciMad) in
the city of Madrid (Spain) from 2014-06-23 to 2022-09-30.
The original data was obtained from: Portal de datos abiertos del Ayuntamiento
de Madrid https://datos.madrid.es/portal/site/egob
Shape of the dataset: (3022, 1)

	users
date
2014-06-23	99
2014-06-24	72
2014-06-25	119
2014-06-26	135
2014-06-27	149

# Split data into train-test
# ==============================================================================
data = data.loc['2020-01-01': '2021-12-31']
end_train = '2021-06-01'
data_train = data.loc[: end_train, :]
data_test  = data.loc[end_train:, :]

print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates test  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Dates train : 2020-01-01 00:00:00 --- 2021-06-01 00:00:00  (n=518)
Dates test  : 2021-06-01 00:00:00 --- 2021-12-31 00:00:00  (n=214)

# Time series plot
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(8, 3))
data_train.users.plot(ax=ax, label='train', linewidth=1)
data_test.users.plot(ax=ax, label='test', linewidth=1)
ax.axvspan(
    pd.to_datetime('2020-03-16'),
    pd.to_datetime('2020-04-21'), 
    label="Covid-19 confinement",
    color="red",
    alpha=0.3
)

ax.axvspan(
    pd.to_datetime('2020-04-21'),
    pd.to_datetime('2020-05-31'), 
    label="Recovery time",
    color="white",
    alpha=0.3
)

ax.set_title('Number of users BiciMAD')
ax.legend();

Include the whole time series

A forecaster is initialized without taking into consideration the lockdown period.

# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor = Ridge(),
                 lags      = 21,
             )   
forecaster

ForecasterRecursive

General Information

Regressor: Ridge
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]
Window features: None
Window size: 21
Series name: None
Exogenous included: False
Weight function included: False
Differentiation order: None
Creation date: 2025-05-13 21:50:29
Last fit date: None
Skforecast version: 0.16.0
Python version: 3.12.9
Forecaster id: None

Exogenous Variables

None

Data Transformations

Transformer for y: None
Transformer for exog: None

Training Information

Training range: Not fitted
Training index type: Not fitted
Training index frequency: Not fitted

Regressor Parameters

{'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.0001}

Fit Kwargs

{}

Once the model is created, a backtesting process is run to simulate the behavior of the forecaster if it had predicted the test set in 7-day batches.

# Backtesting: predict next 7 days at a time.
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 7,
        initial_train_size = len(data.loc[:end_train]),
        fixed_train_size   = False,
        refit              = False,    
)
metric, predictions_backtest = backtesting_forecaster(
                                   forecaster = forecaster,
                                   y          = data.users,
                                   cv         = cv,
                                   metric     = 'mean_absolute_error',
                                   verbose    = False
                               )
metric

	mean_absolute_error
0	1469.144278
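
To make the backtesting setup more concrete, the following sketch enumerates the test folds produced by this configuration: 731 daily observations (2020-01-01 to 2021-12-31), an initial training set of 518 observations, and 7 steps per fold. The function `backtesting_folds` is a hypothetical helper written for illustration. It is not the implementation of `TimeSeriesFold`; it only reproduces the partitioning logic when the model is trained once (`refit = False`) and then predicts consecutive blocks.

```python
def backtesting_folds(n_obs, initial_train_size, steps):
    """
    Hypothetical helper: return the (start, end) positions of each test fold,
    with `end` exclusive. The last fold may contain fewer than `steps` values.
    """
    folds = []
    start = initial_train_size
    while start < n_obs:
        end = min(start + steps, n_obs)
        folds.append((start, end))
        start = end
    return folds


folds = backtesting_folds(n_obs=731, initial_train_size=518, steps=7)
print(len(folds))   # 31
print(folds[0])     # (518, 525)
print(folds[-1])    # (728, 731)
```

The last fold contains only 3 observations because the 213 remaining values are not a multiple of 7.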

Exclude part of the time series

To minimize the influence of these dates on the model, a custom function is created that assigns weights according to the following rules:

Weight of 0 if the index date is:
- Within the lockdown period (2020-03-16 to 2020-04-21).
- Within the recovery period (2020-04-21 to 2020-05-31).
- Within the 21 days after the recovery period, to avoid including impacted values as lags (2020-05-31 to 2020-06-21).
Weight of 1 otherwise.

If an observation has a weight of 0, it has no influence during model training.

# Custom function to create weights
# ==============================================================================
def custom_weights(index):
    """
    Return 0 if the index date is between 2020-03-16 and 2020-06-21 (inclusive), 1 otherwise.
    """
    weights = np.where((index >= '2020-03-16') & (index <= '2020-06-21'), 0, 1)
    
    return weights
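
Before passing the function to the forecaster, it can be useful to check the weights it generates. The sketch below repeats the function so that it is self-contained and applies it to a daily index covering the training period (2020-01-01 to 2021-06-01). The variable names `index` and `weights` are chosen for this illustration only.

```python
import numpy as np
import pandas as pd


def custom_weights(index):
    """
    Return 0 if the index date is between 2020-03-16 and 2020-06-21
    (inclusive), 1 otherwise.
    """
    return np.where((index >= '2020-03-16') & (index <= '2020-06-21'), 0, 1)


# Daily index equivalent to the training period of this example
index = pd.date_range('2020-01-01', '2021-06-01', freq='D')
weights = pd.Series(custom_weights(index), index=index)

# Number of observations with each weight
print(weights.value_counts())

# Both boundaries of the interval are included in the zero-weight region
print(weights.loc['2020-03-15':'2020-03-16'])
print(weights.loc['2020-06-21':'2020-06-22'])
```

Of the 518 dates in the training period, 98 receive a weight of 0 and the remaining 420 a weight of 1.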

Again, a ForecasterRecursive is initialized, but this time including the custom_weights function.

# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor   = Ridge(random_state=123),
                 lags        = 21,
                 weight_func = custom_weights
             )

# Backtesting: predict next 7 days at a time.
# ==============================================================================
metric, predictions_backtest = backtesting_forecaster(
                                   forecaster = forecaster,
                                   y          = data.users,
                                   cv         = cv,
                                   metric     = 'mean_absolute_error',
                                   verbose    = False
                               )

metric

	mean_absolute_error
0	1404.679364

Giving a weight of 0 to the lockdown period (excluding it from the model training) slightly improves the forecasting performance.

Power plant shutdown

Power plants used to generate energy are very complex installations that require a high level of maintenance. It is common for a plant to be shut down after a certain period of operation for repair, overhaul, or maintenance activities.

Data

# Data download
# ==============================================================================
url = ('https://raw.githubusercontent.com/skforecast/skforecast-datasets/refs/'
       'heads/main/data/energy_production_shutdown.csv')
data = pd.read_csv(url, sep=',')

# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('D')
data = data.sort_index()
data.head()

	production
date
2012-01-01	375.1
2012-01-02	474.5
2012-01-03	573.9
2012-01-04	539.5
2012-01-05	445.4

# Split data into train-test
# ==============================================================================
data = data.loc['2012-01-01 00:00:00': '2014-12-30 23:00:00']
end_train = '2013-12-31 23:59:00'
data_train = data.loc[: end_train, :]
data_test  = data.loc[end_train:, :]

print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates test  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Dates train : 2012-01-01 00:00:00 --- 2013-12-31 00:00:00  (n=731)
Dates test  : 2014-01-01 00:00:00 --- 2014-12-30 00:00:00  (n=364)

# Time series plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
data_train.production.plot(ax=ax, label='train', linewidth=1)
data_test.production.plot(ax=ax, label='test', linewidth=1)
ax.axvspan(
    pd.to_datetime('2012-06-01'),
    pd.to_datetime('2012-09-30'), 
    label="Shutdown",
    color="white",
    alpha=0.2
)
ax.set_title('Energy production')
ax.legend();

Include the whole time series

A forecaster is initialized without taking into consideration the shutdown period.

# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor = Ridge(random_state=123),
                 lags      = 21,
             )

# Backtesting: predict next 10 days at a time.
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 10,
        initial_train_size = len(data.loc[:end_train]),
        refit              = False,    
)
metric, predictions_backtest = backtesting_forecaster(
                                   forecaster = forecaster,
                                   y          = data.production,
                                   cv         = cv,
                                   metric     = 'mean_absolute_error',
                                   verbose    = False
                              )
metric

	mean_absolute_error
0	28.424722

Exclude part of the time series

The power plant shutdown took place from 2012-06-01 to 2012-09-30. To minimize the influence of these dates on the model, a custom function is created that returns a weight of 0 if the index date is within the shutdown period or the 21 days after it (the number of lags used by the model) and 1 otherwise. If an observation has a weight of 0, it has no influence at all during model training.

# Custom function to create weights
# ==============================================================================
def custom_weights(index):
    """
    Return 0 if the index date is between 2012-06-01 and 2012-10-21 (inclusive), 1 otherwise.
    """
    weights = np.where((index >= '2012-06-01') & (index <= '2012-10-21'), 0, 1)
    return weights
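
The end date 2012-10-21 is the end of the shutdown (2012-09-30) plus 21 days. The buffer is needed because an observation at date t uses the values from t-1 to t-21 as predictors, so rows up to 21 days after the shutdown still contain shutdown values among their lags. Instead of hard-coding the date, the interval could be derived from the number of lags. The following is a minimal sketch, assuming a daily DatetimeIndex; `make_zero_weight_func` is a hypothetical helper and is not part of skforecast.

```python
import numpy as np
import pandas as pd


def make_zero_weight_func(start, end, n_lags):
    """
    Hypothetical helper: build a function that returns a weight of 0 from
    `start` until `n_lags` days after `end` (inclusive) and 1 otherwise.
    Assumes the index is a daily DatetimeIndex.
    """
    start = pd.Timestamp(start)
    end_with_buffer = pd.Timestamp(end) + pd.Timedelta(days=n_lags)

    def weight_func(index):
        return np.where((index >= start) & (index <= end_with_buffer), 0, 1)

    return weight_func


shutdown_weights = make_zero_weight_func('2012-06-01', '2012-09-30', n_lags=21)

# Compare with the hard-coded interval on the training period
index = pd.date_range('2012-01-01', '2013-12-31', freq='D')
weights = shutdown_weights(index)
hard_coded = np.where((index >= '2012-06-01') & (index <= '2012-10-21'), 0, 1)
print(np.array_equal(weights, hard_coded))
print((weights == 0).sum())
```

With this approach, changing the number of lags only requires updating `n_lags`, and the zero-weight interval is adjusted accordingly.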

# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor   = Ridge(random_state=123),
                 lags        = 21,
                 weight_func = custom_weights
             )

# Backtesting: predict next 10 days at a time.
# ==============================================================================
metric, predictions_backtest = backtesting_forecaster(
                                   forecaster = forecaster,
                                   y          = data.production,
                                   cv         = cv,
                                   metric     = 'mean_absolute_error',
                                   verbose    = False
                               )
metric

	mean_absolute_error
0	27.808331

As in the previous example, excluding the observations during the shutdown period slightly improves the forecasting performance.
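
To quantify the improvement, the relative reduction in mean absolute error can be computed from the backtesting results reported in both examples. The snippet below is simple arithmetic on those values; the dictionary names are chosen for this illustration only.

```python
# Backtesting MAE reported above: (whole series, with zero weights)
mae = {
    'Covid-19 lockdown': (1469.144278, 1404.679364),
    'Power plant shutdown': (28.424722, 27.808331),
}

reductions = {}
for name, (mae_all, mae_weighted) in mae.items():
    # Relative reduction in MAE, expressed as a percentage
    reductions[name] = 100 * (mae_all - mae_weighted) / mae_all
    print(f"{name}: MAE reduction = {reductions[name]:.1f}%")
```

The reduction is roughly 4.4% for the bicycle rental data and 2.2% for the energy production data.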

Session information

import session_info
session_info.show(html=False)

-----
matplotlib          3.10.1
numpy               2.2.5
pandas              2.2.3
session_info        v1.0.1
skforecast          0.16.0
sklearn             1.6.1
-----
IPython             9.1.0
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.5.7
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Linux-6.11.0-25-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-05-13 21:50

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Mitigating the Impact of Covid on Forecasting Models by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://www.cienciadedatos.net/documentos/py45-weighted-time-series-forecasting.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.16.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.16.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.16.0}, month = {05}, year = {2025}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.

Under the following terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Mitigating the Impact of Covid on Forecasting Models

Joaquín Amat Rodrigo, Javier Escobar Ortiz

December 2022 (last update May 2025)
