Grid search de modelos Random Forest con out-of-bag error y early stopping

Grid search de modelos Random Forest con out-of-bag error y early stopping

Joaquín Amat Rodrigo
Enero, 2021

Más sobre ciencia de datos: cienciadedatos.net

Introducción


Los modelos Random Forest tienen, entre muchas otras, la ventaja de disponer del Out-of-Bag error, lo que permite obtener una estimación del error de test sin tener que recurrir a procesos validación cruzada, que son computacionalmente muy costosos. Esta característica, combinada con una estrategia de early stopping, puede emplearse para acelerar el proceso de búsqueda de hiperparámetros (grid search, random search,búsqueda bayesiana...)).

En el siguiente documento se muestra cómo adaptar los modelos RandomForestClassifier y RandomForestRegressor de scikit-learn para que tengan parada temprana y utilicen el Out-of-Bag error en la búsqueda de hiperparámetros.

Out-of-Bag Error


Dada la naturaleza del proceso de bagging en el que se basan los modelos de Random Forest, resulta posible estimar el error de test sin necesidad de recurrir a métodos de validación cruzada (cross-validation). El hecho de que los árboles se ajusten empleando muestras generadas por bootstrapping conlleva que, en promedio, cada ajuste solo use aproximadamente dos tercios de las observaciones originales. Al tercio restante se le llama out-of-bag (OOB).

Si para cada árbol ajustado en el proceso de bagging se registran las observaciones empleadas, se puede predecir la respuesta de la observación i haciendo uso de aquellos árboles en los que esa observación ha sido excluida. Siguiendo este proceso, se pueden obtener las predicciones para las n observaciones y con ellas calcular la métrica de interés. Como cada observación se predice empleando únicamente los árboles en cuyo ajuste no participó dicha observación, el OOB-error sirve como estimación del error de test. De hecho, si el número de árboles es suficientemente alto, el OOB-error es prácticamente equivalente al leave-one-out cross-validation error.

Esta es una ventaja añadida de los métodos de bagging, y por lo tanto de Random Forest, ya que evita tener que recurrir al proceso de validación cruzada (computacionalmente costoso) para la optimización de los hiperparámetros. Aun así, dos limitaciones se han de tener en cuenta a la hora de utilizar el Out-of-Bag Error:

  • El Out-of-Bag Error no es adecuado cuando las observaciones tienen una relación temporal (series temporales). Como la selección de las observaciones que participan en cada entrenamiento es aleatoria, no se respeta el orden temporal y se estaría introduciendo información a futuro.

  • El preprocesado de los datos de entrenamiento se hace de forma conjunta, por lo que las observaciones out-of-bag pueden sufrir data leakage). De ser así, las estimaciones del OOB-error son demasiado optimistas. Afortunadamente, los modelos de Random Forest requieren de pocas transformaciones, por ejemplo, no es necesario el escalado o normalización de los predictores.

En un muestreo por bootstrapping, si el tamaño de los datos de entrenamiento es n, cada observación tiene una probabilidad de ser elegida de $\frac{1}{n}$ . Por lo tanto, la probabilidad de no ser elegida en todo el proceso es de $(1−\frac{1}{n})^n$ , lo que converge en $\frac{1}{e}$ , que es aproximadamente un tercio.

Parada temprana (early stopping)


Una de las características de los modelos Random Forest es que, alcanzado un número suficiente de árboles, el modelo deja de mejorar. Aunque, a diferencia de otros modelos como Gradient Boosting, un exceso de árboles en modelos Random Forest no causa overfitting, es poco eficiente en términos de tiempo y computación añadir más árboles de los necesarios.

Para evitar este problema, se pueden emplear estrategias que detengan el proceso de entrenamiento a partir del momento en el que el modelo deja de mejorar, por ejemplo, monitorizando una métrica en un conjunto de validación, o con el out-of-bag error. Esta última es la que se muestra en este documento.

Código


Las siguientes 3 funciones permiten realizar la búsqueda de hiperparámetros (Grid Search) de modelos Random Forest aplicando en cada entrenamiento una estrategia de parada temprana y utilizando el out-of-bag error como métrica de comparación.

  • check_early_stopping(): dada una secuencia de valores y unas reglas, determina si se activa o no la parada temprana. Es necesario definir el tipo de métrica al que pertenecen los valores para determinar si el modelo es mejor a medida que aumenta el valor o al revés.

  • fit_RandomForest_early_stopping(): entrena un modelo RandomForestClassifier o RandomForestRegressor hasta que se cumple una condición de parada temprana o se alcanza el máximo de árboles (n_estimators) definido al crear el modelo.

  • custom_gridsearch_RandomForestClassifier(): grid search de un modelo RandomForestClassifier utilizando una métrica out-of-bag para comparar los modelos y activando la parada temprana en cada entrenamiento.

In [1]:
import pandas as pd
import numpy as np
import typing
from typing import Optional, Union, Tuple
import logging
import tqdm

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
from sklearn.metrics import  mean_absolute_error, mean_squared_error


logging.basicConfig(
    format = '%(asctime)-5s %(name)-10s %(levelname)-5s %(message)s', 
    level  = logging.INFO,
)


def check_early_stopping(
    scores: Union[list, np.ndarray],
    metric: str,
    stopping_rounds: int=4,
    stopping_tolerance: float=0.01,
    max_runtime_sec: int=None,
    start_time: pd.Timestamp=None) -> bool:
    
    """
    Check if early stopping condition is met.
    
    Parameters
    ----------
    
    scores: list, np.ndarray
        Scores used to evaluate early stopping conditions.
        
    metric: str
        Metric which scores referes to. Used to determine if higher score
        means a better model or the opposite.
        
    stopping_rounds: int, default 4
        Number of consecutive rounds without improvement needed to stop
        the training.
    
    stopping_tolerance: float, default 0.01
        Minimum percentage of positive change between two consecutive rounds
        needed to consider it as an improvement.
    
    max_runtime_sec: int, default `None`
        Maximum allowed runtime in seconds for model training. `None` means unlimited.
    
    start_time: pd.Timestamp, default `None`
        Time when training started. Used to determine if `max_runtime_sec` has been
        reached.
        
        
    Returns
    ------
    bool:
        `True` if any condition needed for early stopping is met. `False` otherwise.
        
    Notes
    -----
    
    Example of early stopping:
        
    Stop after 4 rounds without an improvement of 1% or higher: `stopping_rounds` = 4,
    `stopping_tolerance` = 0.01, `max_runtime_sec` = None.
    
    """
    
    allowed_metrics = ['accuracy', 'auc', 'f1', 'mse', 'mae', 'squared_error',
                       'absolute_error']
    
    if metric not in allowed_metrics:
        raise Exception(
                f"`metric` argument must be one of: {allowed_metrics}. "
                f"Got {metric}"
        )
    
    if isinstance(scores, list):
        scores = np.array(scores)
        
    if max_runtime_sec is not None:
        
        if start_time is None:
            start_time = pd.Timestamp.now()
            
        runing_time = (pd.Timestamp.now() - start_time).total_seconds()
        
        if runing_time > max_runtime_sec:
            logging.debug(
                f"Reached maximum time for training ({max_runtime_sec} seconds). "
                f"Early stopping activated."
            )
            return True
        
    if len(scores) < stopping_rounds:
        return False
    
    if metric in ['accuracy', 'auc', 'f1']:
        # The higher the metric, the better
        diff_scores = scores[1:] - scores[:-1]
        improvement = diff_scores / scores[:-1]
        
    if metric in ['mse', 'mae', 'squared_error', 'absolute_error']:
        # The lower the metric, the better
        
        # scores = -1 * scores 
        # diff_scores = scores[:-1] - scores[1:]
        # improvement = diff_scores / scores[1:]
        diff_scores = scores[1:] - scores[:-1]
        improvement = diff_scores / scores[:-1]
        improvement = -1 * improvement
        
    improvement = np.hstack((np.nan, improvement))
    logging.debug(f"Improvement: {improvement}")
    
    if (improvement[-stopping_rounds:] < stopping_tolerance).all():
        return True
    else:
        return False


    
def fit_RandomForest_early_stopping(
    model: Union[RandomForestClassifier, RandomForestRegressor],
    X: Union[np.ndarray, pd.core.frame.DataFrame],
    y: np.ndarray,
    metric: str,
    positive_class: int=1,
    score_tree_interval: int=None,
    stopping_rounds: int=4,
    stopping_tolerance: float=0.01,
    max_runtime_sec: int=None) -> np.ndarray:
    
    """
    Fit a RandomForest model until an early stopping condition is met or
    `n_estimatos` is reached.
    
    Parameters
    ----------
    
    model: RandomForestClassifier, RandomForestRegressor
        Model to be fitted.
        
    X: np.ndarray, pd.core.frame.DataFrame
        Training input samples. 
    
    y: np.ndarray, pd.core.frame.DataFrame
        Target value of the input samples. 
    
    scores: list, np.ndarray
        Scores used to evaluate early stopping conditions.
        
    metric: str
        Metric used to generate the score. Used to determine if higher score
        means a better model or the opposite.
        
    score_tree_interval: int, default `None`
        Score the model after this many trees. If `None`, the model is scored after
        `n_estimators` / 10.
        
    stopping_rounds: int
        Number of consecutive rounds without improvement needed to stop the training.
    
    stopping_tolerance: float, default 0.01
        Minimum percentage of positive change between two consecutive rounds
        needed to consider it as an improvement. 
    
    max_runtime_sec: int, default `None`
        Maximum allowed runtime in seconds for model training. `None` means unlimited.
        
        
    Returns
    ------
    oob_scores: np.ndarray
        Out of bag score for each scoring point.
    
    """
    
    if score_tree_interval is None:
        score_tree_interval = int(model.n_estimators / 10)
        
    allowed_metrics = ['accuracy', 'auc', 'f1', 'mse', 'mae', 'squared_error',
                       'absolute_error']
    
    if metric not in allowed_metrics:
        raise Exception(
                f"`metric` argument must be one of: {allowed_metrics}. "
                f"Got {metric}"
        )
    
    if not model.oob_score:
        model.set_params(oob_score=True)
        
    start_time = pd.Timestamp.now()
    oob_scores = []
    scoring_points = np.arange(0, model.n_estimators + 1, score_tree_interval)[1:]
    scoring_points = np.hstack((1, scoring_points))
    
    metrics = {
        'auc' : roc_auc_score,
        'accuracy' : accuracy_score,
        'f1': f1_score,
        'mse': mean_squared_error,
        'squared_error': mean_squared_error,
        'mae': mean_absolute_error,
        'absolute_error': mean_absolute_error,        
    }
    
    for i, n_estimators in enumerate(scoring_points):
        
        logging.debug(f"Training with n_stimators: {n_estimators}")
        model.set_params(n_estimators=n_estimators)
        model.fit(X=X, y=y)
        
        if metric == 'auc':
            oob_predictions = model.oob_decision_function_[:, positive_class]
            # If n_estimators is small it might be possible that a data point
            # was never left out during the bootstrap. In this case,
            # oob_decision_function_ might contain NaN.
            oob_score = metrics[metric](
                            y_true=y[~np.isnan(oob_predictions)],
                            y_score=oob_predictions[~np.isnan(oob_predictions)]
                        )
        else:
            oob_predictions = model.oob_decision_function_
            oob_predictions = np.argmax(oob_predictions, axis=1)
            oob_score = metrics[metric](
                            y_true=y[~np.isnan(oob_predictions)],
                            y_score=oob_predictions[~np.isnan(oob_predictions)]
                        )
            
        oob_scores.append(oob_score)
        
        early_stopping = check_early_stopping(
                            scores             = oob_scores,
                            metric             = metric,
                            stopping_rounds    = stopping_rounds,
                            stopping_tolerance = stopping_tolerance,
                            max_runtime_sec    = max_runtime_sec,
                            start_time         = start_time
                         )    
        
        if early_stopping:
            logging.debug(
                f"Early stopping activated at round {i + 1}: n_estimators = {n_estimators}"
            )
            break
        
    logging.debug(f"Out of bag score = {oob_scores[-1]}")
    
    return np.array(oob_scores), scoring_points[:len(oob_scores)]
    

def custom_gridsearch_RandomForestClassifier(
    model: RandomForestClassifier,
    X: Union[np.ndarray, pd.core.frame.DataFrame],
    y: np.ndarray,
    metric: str,
    param_grid: dict,
    positive_class: int=1,
    score_tree_interval: int=None,
    stopping_rounds: int=5,
    stopping_tolerance: float=0.01,
    model_max_runtime_sec: int=None,
    max_models: int=None,
    max_runtime_sec: int=None,
    return_best: bool=True) -> Tuple[pd.DataFrame, pd.DataFrame]:
    
    '''
    Grid search for RandomForestClassifier model based on out-of-bag metric and 
    early stopping for each model fit.
    
    Parameters
    ----------
    
    model: RandomForestClassifier
        Model to search over.
           
    X: np.ndarray, pd.core.frame.DataFrame
        The training input samples. 
    
    y: np.ndarray, pd.core.frame.DataFrame
        The target of input samples. 
    
    scores: list, np.ndarray
        Scores used to evaluate early stopping conditions.
        
    metric: str
        Metric used to generate the score. I is used to determine if higher score
        means a better model or the opposite.
        
    score_tree_interval: int, default `None`
        Score the model after this many trees. If `None`, the model is scored after
        `n_estimators` / 10.
        
    stopping_rounds: int
        Number of consecutive rounds without improvement needed to stop the training.
    
    stopping_tolerance: float, default 0.01
        Minimum percentage of positive change between two consecutive rounds
        needed to consider it as an improvement. 
    
    model_max_runtime_sec: int, default `None`
        Maximum allowed runtime in seconds for model training. `None` means unlimited.
        
    max_models: int, default `None`
        Maximum number of models trained during the search.
    
    max_runtime_sec: int, default `None`
        Maximum number of seconds for the search.
        
    return_best : bool
        Refit model using the best found parameters on the whole data.
        
        
    Returns
    ------
    
    results: pd.DataFrame
    
    '''
    
    results = {'params': [], 'oob_metric': []}
    start_time = pd.Timestamp.now()
    history_scores = {}
    history_scoring_points = np.array([], dtype = int)
    param_grid = list(ParameterGrid(param_grid))
    
    if not model.oob_score:
        model.set_params(oob_score=True)
    
    if max_models is not None and max_models < len(param_grid):
        param_grid = np.random.choice(param_grid, max_models)

    for params in tqdm.tqdm(param_grid):
        
        if max_runtime_sec is not None:
            runing_time = (pd.Timestamp.now() - start_time).total_seconds()
            if runing_time > max_runtime_sec:
                logging.info(
                    f"Reached maximum time for GridSearch ({max_runtime_sec} seconds). "
                    f"Search stopped."
                )
                break   
        
        model.set_params(**params)

        oob_scores, scoring_points = fit_RandomForest_early_stopping(
                                        model = clone(model), # Clone to avoid modification of n_estimators
                                        X = X,
                                        y = y,
                                        metric = metric,
                                        positive_class      = positive_class,
                                        score_tree_interval = score_tree_interval,
                                        stopping_rounds     = stopping_rounds,
                                        stopping_tolerance  = stopping_tolerance,
                                        max_runtime_sec     = model_max_runtime_sec
                                     )
      
        history_scoring_points = np.union1d(history_scoring_points,  scoring_points)        
        history_scores[str(params)] = oob_scores
        params['n_estimators'] = scoring_points[-1]
        results['params'].append(params)
        results['oob_metric'].append(oob_scores[-1])
        logging.debug(f"Modelo: {params} \u2713")

    results = pd.DataFrame(results)
    history_scores = pd.DataFrame(
                            dict([(k, pd.Series(v)) for k,v in history_scores.items()])
                         )
    history_scores['n_estimators'] = history_scoring_points
    
    if metric in ['accuracy', 'auc', 'f1']:
        results = results.sort_values('oob_metric', ascending=False)
    else:
        results = results.sort_values('oob_metric', ascending=True)
        
    results = results.rename(columns = {'oob_metric': f'oob_{metric}'})
    
    if return_best:
        best_params = results['params'].iloc[0]
        print(
            f"Refitting mode using the best found parameters and the whole data set: \n {best_params}"
        )
        
        model.set_params(**best_params)
        model.fit(X=X, y=y)
        
    results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
    results = results.drop(columns = 'params')
    
    return results, history_scores

Ejemplo

Datos

In [2]:
# Datos
# ==============================================================================
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/'
       'Estadistica-machine-learning-python/master/data/adult_custom_python.csv')
datos = pd.read_csv(url, sep=",")

datos.info()

X = datos.drop(columns='salario')
y = datos.salario

X = pd.get_dummies(X, drop_first=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age               45222 non-null  int64 
 1   workclass         45222 non-null  object
 2   final_weight      45222 non-null  int64 
 3   education         45222 non-null  object
 4   education_number  45222 non-null  int64 
 5   marital_status    45222 non-null  object
 6   occupation        45222 non-null  object
 7   relationship      45222 non-null  object
 8   race              45222 non-null  object
 9   sex               45222 non-null  object
 10  capital_gain      45222 non-null  int64 
 11  capital_loss      45222 non-null  int64 
 12  hours_per_week    45222 non-null  int64 
 13  native_country    45222 non-null  object
 14  salario           45222 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.2+ MB

Búsqueda basada en métrica out-of-bag

In [3]:
# Grid de valores sobre los que buscar
param_grid = {
             'max_depth'   : [3, 10, 20],
             'min_samples_leaf': [0.05, 0.1],
             'max_features': ['sqrt', 'log2'],
             'ccp_alpha': [0, 0.01]
            }
# Modelo
model = RandomForestClassifier(
            n_estimators = 1000,
            oob_score    = True,
            n_jobs       = -1,
            random_state = 123
        )

# Búsqueda de mejor modelo basada en métrica out-of-bag
start = pd.Timestamp.now()

resultados, history = custom_gridsearch_RandomForestClassifier(
                        model                 = model,
                        X                     = X,
                        y                     = y,
                        metric                = 'auc',
                        param_grid            = param_grid,
                        positive_class        = 1,
                        score_tree_interval   = 50,
                        stopping_rounds       = 4,
                        stopping_tolerance    = 0.01,
                        model_max_runtime_sec = None,
                        max_models            = None,
                        max_runtime_sec       = None,
                        return_best           = True
                      )

end = pd.Timestamp.now()
print(f"Duración búsqueda: {end-start}")
100%|██████████| 24/24 [02:40<00:00,  6.70s/it]
Refitting mode using the best found parameters and the whole data set: 
 {'ccp_alpha': 0, 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 0.05, 'n_estimators': 300}
Duración búsqueda: 0 days 00:02:42.687182
In [4]:
resultados
Out[4]:
oob_auc ccp_alpha max_depth max_features min_samples_leaf n_estimators
6 0.877553 0.00 10 log2 0.05 300
10 0.877553 0.00 20 log2 0.05 300
4 0.876136 0.00 10 sqrt 0.05 250
8 0.876136 0.00 20 sqrt 0.05 250
2 0.875290 0.00 3 log2 0.05 250
0 0.873859 0.00 3 sqrt 0.05 250
22 0.872979 0.01 20 log2 0.05 300
18 0.872979 0.01 10 log2 0.05 300
14 0.872352 0.01 3 log2 0.05 300
20 0.871002 0.01 20 sqrt 0.05 250
16 0.871002 0.01 10 sqrt 0.05 250
12 0.870019 0.01 3 sqrt 0.05 300
9 0.855211 0.00 20 sqrt 0.10 300
5 0.855211 0.00 10 sqrt 0.10 300
1 0.854974 0.00 3 sqrt 0.10 300
11 0.854084 0.00 20 log2 0.10 300
7 0.854084 0.00 10 log2 0.10 300
3 0.853899 0.00 3 log2 0.10 300
17 0.850821 0.01 10 sqrt 0.10 300
21 0.850821 0.01 20 sqrt 0.10 300
13 0.850496 0.01 3 sqrt 0.10 300
15 0.849019 0.01 3 log2 0.10 300
19 0.849019 0.01 10 log2 0.10 300
23 0.849019 0.01 20 log2 0.10 300
In [5]:
fig, ax = plt.subplots(1, 1, figsize=(7,5))
history.set_index('n_estimators').plot(legend=False, ax=ax)
ax.set_ylabel('AUC');
ax.set_title('Evolución de la métrica out-of-bag');

Búsqueda basada en validación cruzada

In [6]:
from sklearn.model_selection import GridSearchCV

# Grid de valores sobre los que buscar
param_grid = {
             'max_depth'   : [3, 10, 20],
             'min_samples_leaf': [0.05, 0.1],
             'max_features': ['sqrt', 'log2'],
             'ccp_alpha': [0, 0.01]
            }
# Modelo
model = RandomForestClassifier(
            n_estimators = 1000,
            oob_score    = False,
            n_jobs       = -1,
            random_state = 123
        )


# Búsqueda por grid search con validación cruzada
start = pd.Timestamp.now()

grid = GridSearchCV(
        estimator  = model,
        param_grid = param_grid,
        scoring    = 'roc_auc',
        cv         = 5, 
        refit      = False,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X, y = y)

end = pd.Timestamp.now()
print(f"Duración búsqueda: {end-start}")
Duración búsqueda: 0 days 00:08:10.084963
In [7]:
# Resultados
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False)
Out[7]:
param_ccp_alpha param_max_depth param_max_features param_min_samples_leaf mean_test_score std_test_score mean_train_score std_train_score
6 0 10 log2 0.05 0.878148 0.002636 0.878743 0.000537
10 0 20 log2 0.05 0.878148 0.002636 0.878743 0.000537
2 0 3 log2 0.05 0.876852 0.002807 0.877345 0.000652
4 0 10 sqrt 0.05 0.876741 0.002586 0.877291 0.000556
8 0 20 sqrt 0.05 0.876741 0.002586 0.877291 0.000556
0 0 3 sqrt 0.05 0.875353 0.002981 0.875781 0.000502
22 0.01 20 log2 0.05 0.874735 0.002897 0.875127 0.000503
18 0.01 10 log2 0.05 0.874735 0.002897 0.875127 0.000503
14 0.01 3 log2 0.05 0.873500 0.002960 0.873922 0.000638
20 0.01 20 sqrt 0.05 0.873017 0.002971 0.873407 0.000507
16 0.01 10 sqrt 0.05 0.873017 0.002971 0.873407 0.000507
12 0.01 3 sqrt 0.05 0.871871 0.003251 0.872314 0.000376
11 0 20 log2 0.1 0.857833 0.002547 0.858329 0.000767
7 0 10 log2 0.1 0.857833 0.002547 0.858329 0.000767
3 0 3 log2 0.1 0.857781 0.002554 0.858269 0.000759
9 0 20 sqrt 0.1 0.857431 0.002756 0.857903 0.000716
5 0 10 sqrt 0.1 0.857431 0.002756 0.857903 0.000716
1 0 3 sqrt 0.1 0.857338 0.002769 0.857790 0.000709
19 0.01 10 log2 0.1 0.855294 0.002501 0.855733 0.000834
23 0.01 20 log2 0.1 0.855294 0.002501 0.855733 0.000834
15 0.01 3 log2 0.1 0.855270 0.002514 0.855674 0.000779
17 0.01 10 sqrt 0.1 0.854539 0.002660 0.854884 0.000814
21 0.01 20 sqrt 0.1 0.854539 0.002660 0.854884 0.000814
13 0.01 3 sqrt 0.1 0.854433 0.002689 0.854813 0.000815

Comparación de resultados


Los dos métodos obtienen errores de validación similares e identifican como mejor modelo el que tiene la configuración:

  • param_ccp_alpha = 0
  • param_max_depth = 10
  • param_max_features = log2
  • param_min_samples_leaf = 0.05

La estrategia basada en métrica out-of-bag y parada temprana es aproximadamente 4x más rápida y los modelos resultantes tienen menos árboles.

Profiling


Profiling del código para identificar qué partes están requiriendo más tiempo de computo.

In [8]:
# Grid de valores sobre los que buscar
param_grid = {
             'max_depth'   : [3, 10, 20],
             'min_samples_leaf': [0.05, 0.1],
             'max_features': ['sqrt', 'log2'],
             'ccp_alpha': [0, 0.01]
            }
# Modelo
model = RandomForestClassifier(
            n_estimators = 1000,
            oob_score    = True,
            n_jobs       = -1,
            random_state = 123
        )
In [9]:
from line_profiler import LineProfiler

lp = LineProfiler()
lp_wrapper = lp(custom_gridsearch_RandomForestClassifier)
lp_wrapper(
    model                 = model,
    X                     = X,
    y                     = y,
    metric                = 'auc',
    param_grid            = param_grid,
    positive_class        = 1,
    score_tree_interval   = 50,
    stopping_rounds       = 4,
    stopping_tolerance    = 0.01,
    model_max_runtime_sec = None,
    max_models            = 5,
    max_runtime_sec       = None
)

lp.print_stats()
100%|██████████| 5/5 [00:36<00:00,  7.30s/it]
Refitting mode using the best found parameters and the whole data set: 
 {'ccp_alpha': 0, 'max_depth': 3, 'max_features': 'sqrt', 'min_samples_leaf': 0.05, 'n_estimators': 250}
Timer unit: 1e-06 s

Total time: 38.3506 s
File: <ipython-input-1-501723037cfe>
Function: custom_gridsearch_RandomForestClassifier at line 259

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   259                                           def custom_gridsearch_RandomForestClassifier(
   260                                               model: RandomForestClassifier,
   261                                               X: Union[np.ndarray, pd.core.frame.DataFrame],
   262                                               y: np.ndarray,
   263                                               metric: str,
   264                                               param_grid: dict,
   265                                               positive_class: int=1,
   266                                               score_tree_interval: int=None,
   267                                               stopping_rounds: int=5,
   268                                               stopping_tolerance: float=0.01,
   269                                               model_max_runtime_sec: int=None,
   270                                               max_models: int=None,
   271                                               max_runtime_sec: int=None,
   272                                               return_best: bool=True) -> Tuple[pd.DataFrame, pd.DataFrame]:
   273                                               
   274                                               '''
   275                                               Grid search for RandomForestClassifier model based on out-of-bag metric and 
   276                                               early stopping for each model fit.
   277                                               
   278                                               Parameters
   279                                               ----------
   280                                               
   281                                               model: RandomForestClassifier
   282                                                   Model to search over.
   283                                                      
   284                                               X: np.ndarray, pd.core.frame.DataFrame
   285                                                   The training input samples. 
   286                                               
   287                                               y: np.ndarray, pd.core.frame.DataFrame
   288                                                   The target of input samples. 
   289                                               
   290                                               scores: list, np.ndarray
   291                                                   Scores used to evaluate early stopping conditions.
   292                                                   
   293                                               metric: str
   294                                                   Metric used to generate the score. I is used to determine if higher score
   295                                                   means a better model or the opposite.
   296                                                   
   297                                               score_tree_interval: int, default `None`
   298                                                   Score the model after this many trees. If `None`, the model is scored after
   299                                                   `n_estimators` / 10.
   300                                                   
   301                                               stopping_rounds: int
   302                                                   Number of consecutive rounds without improvement needed to stop the training.
   303                                               
   304                                               stopping_tolerance: float, default 0.01
   305                                                   Minimum percentage of positive change between two consecutive rounds
   306                                                   needed to consider it as an improvement. 
   307                                               
   308                                               model_max_runtime_sec: int, default `None`
   309                                                   Maximum allowed runtime in seconds for model training. `None` means unlimited.
   310                                                   
   311                                               max_models: int, default `None`
   312                                                   Maximum number of models trained during the search.
   313                                               
   314                                               max_runtime_sec: int, default `None`
   315                                                   Maximum number of seconds for the search.
   316                                                   
   317                                               return_best : bool
   318                                                   Refit model using the best found parameters on the whole data.
   319                                                   
   320                                                   
   321                                               Returns
   322                                               ------
   323                                               
   324                                               results: pd.DataFrame
   325                                               
   326                                               '''
   327                                               
   328         1          3.0      3.0      0.0      results = {'params': [], 'oob_metric': []}
   329         1         23.0     23.0      0.0      start_time = pd.Timestamp.now()
   330         1          1.0      1.0      0.0      history_scores = {}
   331         1         13.0     13.0      0.0      history_scoring_points = np.array([], dtype = int)
   332         1         77.0     77.0      0.0      param_grid = list(ParameterGrid(param_grid))
   333                                               
   334         1          4.0      4.0      0.0      if not model.oob_score:
   335                                                   model.set_params(oob_score=True)
   336                                               
   337         1          2.0      2.0      0.0      if max_models is not None and max_models < len(param_grid):
   338         1        103.0    103.0      0.0          param_grid = np.random.choice(param_grid, max_models)
   339                                           
   340         6       5534.0    922.3      0.0      for params in tqdm.tqdm(param_grid):
   341                                                   
   342         5         10.0      2.0      0.0          if max_runtime_sec is not None:
   343                                                       runing_time = (pd.Timestamp.now() - start_time).total_seconds()
   344                                                       if runing_time > max_runtime_sec:
   345                                                           logging.info(
   346                                                               f"Reached maximum time for GridSearch ({max_runtime_sec} seconds). "
   347                                                               f"Search stopped."
   348                                                           )
   349                                                           break   
   350                                                   
   351         5       1930.0    386.0      0.0          model.set_params(**params)
   352                                           
   353         5          9.0      1.8      0.0          oob_scores, scoring_points = fit_RandomForest_early_stopping(
   354         5       4126.0    825.2      0.0                                          model = clone(model), # Clone to avoid modification of n_estimators
   355         5          9.0      1.8      0.0                                          X = X,
   356         5          5.0      1.0      0.0                                          y = y,
   357         5          6.0      1.2      0.0                                          metric = metric,
   358         5          7.0      1.4      0.0                                          positive_class      = positive_class,
   359         5          7.0      1.4      0.0                                          score_tree_interval = score_tree_interval,
   360         5          6.0      1.2      0.0                                          stopping_rounds     = stopping_rounds,
   361         5          5.0      1.0      0.0                                          stopping_tolerance  = stopping_tolerance,
   362         5   36497333.0 7299466.6     95.2                                          max_runtime_sec     = model_max_runtime_sec
   363                                                                                )
   364                                                 
   365         5        209.0     41.8      0.0          history_scoring_points = np.union1d(history_scoring_points,  scoring_points)        
   366         5         47.0      9.4      0.0          history_scores[str(params)] = oob_scores
   367         5          9.0      1.8      0.0          params['n_estimators'] = scoring_points[-1]
   368         5          8.0      1.6      0.0          results['params'].append(params)
   369         5          6.0      1.2      0.0          results['oob_metric'].append(oob_scores[-1])
   370         5         53.0     10.6      0.0          logging.debug(f"Modelo: {params} \u2713")
   371                                           
   372         1       1094.0   1094.0      0.0      results = pd.DataFrame(results)
   373         1          3.0      3.0      0.0      history_scores = pd.DataFrame(
   374         1       2798.0   2798.0      0.0                              dict([(k, pd.Series(v)) for k,v in history_scores.items()])
   375                                                                    )
   376         1       1038.0   1038.0      0.0      history_scores['n_estimators'] = history_scoring_points
   377                                               
   378         1          2.0      2.0      0.0      if metric in ['accuracy', 'auc', 'f1']:
   379         1        894.0    894.0      0.0          results = results.sort_values('oob_metric', ascending=False)
   380                                               else:
   381                                                   results = results.sort_values('oob_metric', ascending=True)
   382                                                   
   383         1        914.0    914.0      0.0      results = results.rename(columns = {'oob_metric': f'oob_{metric}'})
   384                                               
   385         1          2.0      2.0      0.0      if return_best:
   386         1        453.0    453.0      0.0          best_params = results['params'].iloc[0]
   387         1          2.0      2.0      0.0          print(
   388         1        102.0    102.0      0.0              f"Refitting mode using the best found parameters and the whole data set: \n {best_params}"
   389                                                   )
   390                                                   
   391         1        386.0    386.0      0.0          model.set_params(**best_params)
   392         1    1825797.0 1825797.0      4.8          model.fit(X=X, y=y)
   393                                                   
   394         1       6103.0   6103.0      0.0      results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
   395         1       1453.0   1453.0      0.0      results = results.drop(columns = 'params')
   396                                               
   397         1          2.0      2.0      0.0      return results, history_scores

In [10]:
lp = LineProfiler()
lp_wrapper = lp(fit_RandomForest_early_stopping)
lp_wrapper(
    model                 = model,
    X                     = X,
    y                     = y,
    metric                = 'auc',
    positive_class        = 1,
    score_tree_interval   = 50,
    stopping_rounds       = 4,
    stopping_tolerance    = 0.01,
    max_runtime_sec       = None
)

lp.print_stats()
Timer unit: 1e-06 s

Total time: 6.00894 s
File: <ipython-input-1-501723037cfe>
Function: fit_RandomForest_early_stopping at line 128

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   128                                           def fit_RandomForest_early_stopping(
   129                                               model: Union[RandomForestClassifier, RandomForestRegressor],
   130                                               X: Union[np.ndarray, pd.core.frame.DataFrame],
   131                                               y: np.ndarray,
   132                                               metric: str,
   133                                               positive_class: int=1,
   134                                               score_tree_interval: int=None,
   135                                               stopping_rounds: int=4,
   136                                               stopping_tolerance: float=0.01,
   137                                               max_runtime_sec: int=None) -> np.ndarray:
   138                                               
   139                                               """
   140                                               Fit a RandomForest model until an early stopping condition is met or
   141                                               `n_estimatos` is reached.
   142                                               
   143                                               Parameters
   144                                               ----------
   145                                               
   146                                               model: RandomForestClassifier, RandomForestRegressor
   147                                                   Model to be fitted.
   148                                                   
   149                                               X: np.ndarray, pd.core.frame.DataFrame
   150                                                   Training input samples. 
   151                                               
   152                                               y: np.ndarray, pd.core.frame.DataFrame
   153                                                   Target value of the input samples. 
   154                                               
   155                                               scores: list, np.ndarray
   156                                                   Scores used to evaluate early stopping conditions.
   157                                                   
   158                                               metric: str
   159                                                   Metric used to generate the score. Used to determine if higher score
   160                                                   means a better model or the opposite.
   161                                                   
   162                                               score_tree_interval: int, default `None`
   163                                                   Score the model after this many trees. If `None`, the model is scored after
   164                                                   `n_estimators` / 10.
   165                                                   
   166                                               stopping_rounds: int
   167                                                   Number of consecutive rounds without improvement needed to stop the training.
   168                                               
   169                                               stopping_tolerance: float, default 0.01
   170                                                   Minimum percentage of positive change between two consecutive rounds
   171                                                   needed to consider it as an improvement. 
   172                                               
   173                                               max_runtime_sec: int, default `None`
   174                                                   Maximum allowed runtime in seconds for model training. `None` means unlimited.
   175                                                   
   176                                                   
   177                                               Returns
   178                                               ------
   179                                               oob_scores: np.ndarray
   180                                                   Out of bag score for each scoring point.
   181                                               
   182                                               """
   183                                               
   184         1          1.0      1.0      0.0      if score_tree_interval is None:
   185                                                   score_tree_interval = int(model.n_estimators / 10)
   186                                                   
   187         1          2.0      2.0      0.0      allowed_metrics = ['accuracy', 'auc', 'f1', 'mse', 'mae', 'squared_error',
   188         1          1.0      1.0      0.0                         'absolute_error']
   189                                               
   190         1          1.0      1.0      0.0      if metric not in allowed_metrics:
   191                                                   raise Exception(
   192                                                           f"`metric` argument must be one of: {allowed_metrics}. "
   193                                                           f"Got {metric}"
   194                                                   )
   195                                               
   196         1          2.0      2.0      0.0      if not model.oob_score:
   197                                                   model.set_params(oob_score=True)
   198                                                   
   199         1         19.0     19.0      0.0      start_time = pd.Timestamp.now()
   200         1          1.0      1.0      0.0      oob_scores = []
   201         1         21.0     21.0      0.0      scoring_points = np.arange(0, model.n_estimators + 1, score_tree_interval)[1:]
   202         1         49.0     49.0      0.0      scoring_points = np.hstack((1, scoring_points))
   203                                               
   204                                               metrics = {
   205         1          1.0      1.0      0.0          'auc' : roc_auc_score,
   206         1          1.0      1.0      0.0          'accuracy' : accuracy_score,
   207         1          1.0      1.0      0.0          'f1': f1_score,
   208         1          1.0      1.0      0.0          'mse': mean_squared_error,
   209         1          0.0      0.0      0.0          'squared_error': mean_squared_error,
   210         1          1.0      1.0      0.0          'mae': mean_absolute_error,
   211         1          1.0      1.0      0.0          'absolute_error': mean_absolute_error,        
   212                                               }
   213                                               
   214         6         10.0      1.7      0.0      for i, n_estimators in enumerate(scoring_points):
   215                                                   
   216         6         63.0     10.5      0.0          logging.debug(f"Training with n_stimators: {n_estimators}")
   217         6       1813.0    302.2      0.0          model.set_params(n_estimators=n_estimators)
   218         6    5678314.0 946385.7     94.5          model.fit(X=X, y=y)
   219                                                   
   220         6         15.0      2.5      0.0          if metric == 'auc':
   221         6         18.0      3.0      0.0              oob_predictions = model.oob_decision_function_[:, positive_class]
   222                                                       # If n_estimators is small it might be possible that a data point
   223                                                       # was never left out during the bootstrap. In this case,
   224                                                       # oob_decision_function_ might contain NaN.
   225         6         11.0      1.8      0.0              oob_score = metrics[metric](
   226         6       3891.0    648.5      0.1                              y_true=y[~np.isnan(oob_predictions)],
   227         6     323401.0  53900.2      5.4                              y_score=oob_predictions[~np.isnan(oob_predictions)]
   228                                                                   )
   229                                                   else:
   230                                                       oob_predictions = model.oob_decision_function_
   231                                                       oob_predictions = np.argmax(oob_predictions, axis=1)
   232                                                       oob_score = metrics[metric](
   233                                                                       y_true=y[~np.isnan(oob_predictions)],
   234                                                                       y_score=oob_predictions[~np.isnan(oob_predictions)]
   235                                                                   )
   236                                                       
   237         6         15.0      2.5      0.0          oob_scores.append(oob_score)
   238                                                   
   239         6         11.0      1.8      0.0          early_stopping = check_early_stopping(
   240         6          5.0      0.8      0.0                              scores             = oob_scores,
   241         6          3.0      0.5      0.0                              metric             = metric,
   242         6          6.0      1.0      0.0                              stopping_rounds    = stopping_rounds,
   243         6          5.0      0.8      0.0                              stopping_tolerance = stopping_tolerance,
   244         6          5.0      0.8      0.0                              max_runtime_sec    = max_runtime_sec,
   245         6       1222.0    203.7      0.0                              start_time         = start_time
   246                                                                    )    
   247                                                   
   248         6          8.0      1.3      0.0          if early_stopping:
   249         1          1.0      1.0      0.0              logging.debug(
   250         1          7.0      7.0      0.0                  f"Early stopping activated at round {i + 1}: n_estimators = {n_estimators}"
   251                                                       )
   252         1          1.0      1.0      0.0              break
   253                                                   
   254         1          8.0      8.0      0.0      logging.debug(f"Out of bag score = {oob_scores[-1]}")
   255                                               
   256         1          5.0      5.0      0.0      return np.array(oob_scores), scoring_points[:len(oob_scores)]

Información de sesión

In [11]:
from sinfo import sinfo
sinfo()
The `sinfo` package has changed name and is now called `session_info` to become more discoverable and self-explanatory. The `sinfo` PyPI package will be kept around to avoid breaking old installs and you can downgrade to 0.3.2 if you want to use it without seeing this message. For the latest features and bug fixes, please install `session_info` instead. The usage and defaults also changed slightly, so please review the latest README at https://gitlab.com/joelostblom/session_info.
-----
line_profiler       3.3.0
matplotlib          3.3.2
numpy               1.19.2
pandas              1.2.4
sinfo               0.3.4
sklearn             0.24.2
tqdm                4.60.0
-----
IPython             7.22.0
jupyter_client      6.1.7
jupyter_core        4.6.3
jupyterlab          2.1.3
notebook            6.4.0
-----
Python 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
Linux-5.8.0-59-generic-x86_64-with-debian-bullseye-sid
8 logical CPU cores, x86_64
-----
Session information updated at 2021-07-25 16:59

¿Cómo citar este documento?

Grid search de modelos Random Forest con out-of-bag error y early stopping por Joaquín Amat Rodrigo, disponible con licencia CC BY-NC-SA 4.0 en https://www.cienciadedatos.net/documentos/py36-grid-search-random-forest-out-of-bag-error-early-stopping.html DOI


¿Te ha gustado el artículo? Tu ayuda es importante

Mantener un sitio web tiene unos costes elevados, tu contribución me ayudará a seguir generando contenido divulgativo gratuito. ¡Muchísimas gracias! 😊


Creative Commons Licence
Este contenido, creado por Joaquín Amat Rodrigo, tiene licencia Attribution-NonCommercial-ShareAlike 4.0 International.