More about Data Science and Statistics

Introduction¶

Identifying the type of distribution that a variable follows is a fundamental step in virtually all studies involving data, from hypothesis testing to building statistical learning and machine learning models.

Having a function that approximately describes the data offers multiple advantages. For example, it allows calculating the probability density of an observation, estimating the probability that it falls within a certain range, or simulating new synthetic values that follow the same behavior as the original data.

In general terms, fitting a distribution consists of finding the mathematical function that best describes a dataset. Among all candidate functions, the goal is usually to find the one that, with the highest probability (likelihood), could have generated the observed data.

One of the most practical approaches is to use parametric distributions. These are families of theoretical distributions whose behavior is governed by a fixed and finite number of parameters. For example, the normal distribution is fully determined by its mean ($\mu$) and standard deviation ($\sigma$); see more about this type of fitting in Distribution fitting and selection with Python.

However, when no standard parametric distribution fits the data correctly, it is necessary to resort to nonparametric fitting methods. Their goal is to generate a density function that adapts to the shape of the data without forcing them to follow a rigid predefined structure. One of the most commonly used methods is kernel density estimation (KDE).

In Python, there are several libraries that allow fitting distributions using KDE:

SciPy: gaussian_kde.
Statsmodels: KDEUnivariate and KDEMultivariate.
Scikit-learn: KernelDensity.

Throughout this document, the theoretical foundations of the KDE method and examples of how to use the Scikit-learn implementation are described.

✏️ Note

The term nonparametric distribution can be misleading. It does not mean they lack parameters, but rather that the number and nature of these parameters are not fixed beforehand by a closed formula (as in the Normal or Poisson distributions), but the model's complexity grows and adapts according to the available data.

Kernel Density Estimation (KDE)¶

In statistics, Kernel Density Estimation (KDE) is a nonparametric method that allows estimating the probability density function of a continuous random variable from a finite number of observations (sample). It was initially proposed by Fix and Hodges (1951) and later developed by Rosenblatt (1956) and Parzen (1962).

Given a value $x$, the function learned by the kernel density estimator returns the density of the distribution at that point. This density, whose value must be non-negative (range $[0, +\infty)$), is interpreted as a relative measure of likelihood. It is important to note that, for continuous variables, density is not a direct probability. However, it allows comparisons: if the density at point A is greater than at B, it indicates that observations are more likely to be found in the vicinity of A than B. Frequently, to facilitate numerical calculations and avoid overflow problems, the logarithm of the density (log-likelihood) is used; the interpretation remains the same: the higher its value, the greater the evidence that the observation is consistent with the estimated distribution.

Histogram¶

An intuitive way to understand how KDE works is by starting from the histogram, the most classic tool for representing data distributions. Its construction follows these steps:

The range of observed values (minimum and maximum) is identified.
That range is divided into intervals (bins), generally of equal size.
The number of observations falling within each interval is counted.
A bar is represented for each interval, where the height is proportional to the frequency (count).
Optionally, the height of the bars is normalized so that the total area of the histogram sums to 1. In this case, the $y$ axis represents the empirical probability density.

# Histogram example bimodal distribution
# ==============================================================================
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8

sample_1 = np.random.normal(loc=20, scale=5, size=200)
sample_2 = np.random.normal(loc=40, scale=5, size=500)
data     = np.hstack((sample_1, sample_2))

fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data, bins=30, density=True, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
ax.set_title('Histogram (normalized)  \n 700 observations')
ax.set_xlabel('x')
ax.set_ylabel('density');

The main drawback of histograms is that they depend greatly on the number of intervals (bins) used, and that the resulting distribution is stepped instead of continuous. This effect becomes more evident when there are few observations.

# Histogram example bimodal distribution with few observations
# ==============================================================================
sample_1 = np.random.normal(loc=20, scale=5, size=5)
sample_2 = np.random.normal(loc=40, scale=5, size=5)
data     = np.hstack((sample_1, sample_2))

fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(8, 3))
axs[0].hist(data, bins=30, density=True, color="#3182bd", alpha=0.5)
axs[0].plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
axs[0].set_title('Histogram (normalized) \n 10 observations, 30 bins')
axs[0].set_xlabel('x')
axs[0].set_ylabel('density');

axs[1].hist(data, bins=10, density=True, color="#3182bd", alpha=0.5)
axs[1].plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
axs[1].set_title('Histogram (normalized) \n 10 observations, 10 bins')
axs[1].set_xlabel('x')
axs[1].set_ylabel('density');

With only 10 observations, even with different numbers of bins, it is difficult for the histogram to achieve an adequate approximation to the true distribution from which the data come.

Fitting a KDE¶

Intuitive idea¶

Kernel density estimation (KDE) expands the idea of the histogram, "each observation increases the probability density in the area where it is located", but it does so in a way that the contributions come together creating a continuous and smooth curve. How does it achieve this?

Suppose there is a set of observations $X$:

# Create 4 observations
# ======================================================================================
X = np.array([2, 3, 4, 7])
fig, ax = plt.subplots(figsize=(5, 2.5))
ax.plot(X, np.full_like(X, 0.05), '|k', markeredgewidth=5)
ax.set_ylim(-0.05, 1)
ax.set_title('Observation positions')
ax.set_xlabel('x')
ax.set_ylabel('');

Instead of placing each observation $x_i$ in a rigid "bucket" (as in the histogram), the KDE method places a small mathematical function, called Kernel ($K$), centered exactly on each of the observed points $x_i$. One of the most commonly used kernels is the normal distribution.

Continuing with the example, a normal distribution is centered on each observation, with mean equal to the observation value and standard deviation of 1 (the choice of this value is detailed later). In this way, each observation contributes precisely at the position it occupies but also, gradually, in the nearby regions.

# Create normal distributions centered on each observation
# ======================================================================================
from scipy import stats
fig, ax = plt.subplots(figsize=(5, 2.5))
ax.plot(X, np.full_like(X, 0), '|k', markeredgewidth=4)

Xgrid = np.linspace(min(X) - 1, max(X) + 1, num=500)
for x in X:
    density = stats.norm.pdf(Xgrid, loc=x, scale=1)
    ax.axvline(x=x, linestyle='--', color='black')
    ax.plot(Xgrid, density)
    
ax.set_title('Normal distributions (one per observation)')
ax.set_xlabel('x')
ax.set_ylabel('density');

Finally, if the individual contributions are summed and divided by the total number of curves (observations), a final curve is obtained that describes the distribution of the observations.

# Aggregate at each point the probability density of the 4 distributions
# ======================================================================================
fig, ax = plt.subplots(figsize=(5, 2.5))
ax.plot(X, np.full_like(X, 0), '|k', markeredgewidth=4)

Xgrid = np.linspace(min(X) - 1, max(X) + 1, num=500)
suma = np.full_like(Xgrid, 0)
for x in X:
    density = stats.norm.pdf(Xgrid, loc=x, scale=1)
    suma = suma + density
    ax.axvline(x=x, linestyle='--', color='black')
    ax.plot(Xgrid, density)
    
suma = suma / len(X)
ax.plot(Xgrid, suma, color='b')
ax.fill_between(Xgrid, suma, alpha=0.3, color='b')
ax.set_title('Sum of distributions')
ax.set_xlabel('x')
ax.set_ylabel('density');

This simple yet powerful idea is what underlies the kernel density estimation (KDE) method: approximating a density function as the sum of functions (kernel) of each observation.

Mathematical definition¶

Given a dataset $x = \{x_1, x_2, \ldots, x_n \}$, the probability density function $f(x)$ can be approximated using a kernel density estimation (KDE) such that:

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n}K_h(x-x_i) = \frac{1}{nh}\sum_{i=1}^{n}K(\frac{x-x_{i}}{h})$$

$n$: is the number of data points (observations). Each of them is the center on which a kernel is placed.
$h$: is the bandwidth (bandwidth or smoothing parameter). It controls how much the influence of each observation is expanded. If a normal distribution is used as the kernel, it is equivalent to the standard deviation. This is the most determining value when fitting a KDE, since it conditions the level of overfitting.
$K$: is the Kernel, a symmetric and positive function that integrates to 1, which defines the shape and distribution of the influence (weight) associated with each observation. In the previous examples, the normal distribution has been used as the kernel.

Bandwidth selection¶

The bandwidth is crucial when estimating a density function using the KDE method. If its value is too low, overfitting is generated and the resulting function will be too influenced by the "noise" in the data. If its value is too high, the resulting function will not be able to learn the underlying distribution.

fig, axs = plt.subplots(ncols=3, nrows=1, figsize=(9, 3))
X = np.array([2, 3, 4, 7])
Xgrid = np.linspace(min(X) - 1, max(X) + 1, num=500)

axs[0].plot(X, np.full_like(X, 0), '|k', markeredgewidth=4)
suma = np.full_like(Xgrid, 0)
for x in X:
    density = stats.norm.pdf(Xgrid, loc=x, scale=10)
    suma = suma + density
suma = suma / len(X)
axs[0].plot(Xgrid, suma, color='b')
axs[0].fill_between(Xgrid, suma, alpha=0.3, color='b')
axs[0].set_title('Kernel: gaussian\n Bandwidth: 10')
axs[0].set_xlabel('x')
axs[0].set_ylabel('density');

axs[1].plot(X, np.full_like(X, 0), '|k', markeredgewidth=4)
suma = np.full_like(Xgrid, 0)
for x in X:
    density = stats.norm.pdf(Xgrid, loc=x, scale=1)
    suma = suma + density 
suma = suma / len(X)
axs[1].plot(Xgrid, suma, color='b')
axs[1].fill_between(Xgrid, suma, alpha=0.3, color='b');
axs[1].set_title('Kernel: gaussian \n Bandwidth: 1')
axs[1].set_xlabel('x')

axs[1].plot(X, np.full_like(X, 0), '|k', markeredgewidth=4)
suma = np.full_like(Xgrid, 0)
for x in X:
    density = stats.norm.pdf(Xgrid, loc=x, scale=0.1)
    suma = suma + density 
suma = suma / len(X)
axs[2].plot(Xgrid, suma, color='b')
axs[2].fill_between(Xgrid, suma, alpha=0.3, color='b');
axs[2].set_title('Kernel: gaussian \n Bandwidth: 0.1')
axs[2].set_xlabel('x');

Although there is no closed formula to know the optimal bandwidth value in advance, there are established strategies to identify appropriate values.

Empirical Rules (Heuristics)

These are quick formulas that assume some smoothness in the underlying data. The most common ones for Gaussian kernels are:

Scott's rule: $h \approx 1.06 \cdot \hat{\sigma} n^{-1/5}$
Silverman's rule: $h = 0.9 \cdot \min ( \hat{\sigma}, IQR/1.35 ) n^{-1/5}$

While they are computationally immediate, these methods tend to oversmooth when the true data distribution is multimodal or very different from a normal.

Cross-validation

It is the most rigorous method, as it does not assume any predefined form for the data. It is based on trying different values of $h$ and measuring how well they generalize to unseen data.

This method requires higher computational cost, but is applicable to any distribution. In Scikit-learn, this is typically implemented via GridSearchCV, using log-likelihood as the evaluation metric. When the number of observations is very low, it is highly recommended to use the Leave-One-Out Cross-Validation (LOOCV) strategy to maximize the use of available information in each iteration.

Kernel type¶

The kernel is the function that determines how the influence of each observation is distributed, therefore, it can have a notable impact on the estimation of the resulting density function. Although in the vast majority of cases a Gaussian kernel (normal distribution) is used, there are other possibilities. Those implemented in scikit-learn are:

Gaussian: assigns weights following the normal distribution with a standard deviation equivalent to the bandwidth.

$$K(x; h) \propto \exp(- \frac{x^2}{2h^2} )$$

Epanechnikov: observations that are at a distance between 0 and h have a weight between $\frac{3}{4}$ and 0 with quadratic decrease. Any observation outside this range has weight 0.

$$K(x; h) \propto 1 - \frac{x^2}{h^2}$$

Tophat: Assigns the same weight to all observations that are within the bandwidth.

$$K(x; h) \propto 1 \text{ if } x < h$$

Exponential: the weight decays exponentially.

$$K(x; h) \propto \exp(-x/h)$$

Linear: the weight decays linearly within the bandwidth. Beyond this the weight is 0.

$$K(x; h) \propto 1 - x/h \text{ if } x < h$$

Cosine: the weight within the bandwidth is proportional to the cosine.

$$K(x; h) \propto \cos(\frac{\pi x}{2h}) \text{ if } x < h$$

# Plots of available kernels (source: scikit-learn documentation)
# ----------------------------------------------------------------------
from sklearn.neighbors import KernelDensity
X_plot = np.linspace(-6, 6, 1000)[:, None]
X_src = np.zeros((1, 1))

fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(8,4))
fig.subplots_adjust(left=0.05, right=0.95, hspace=0.05, wspace=0.05)

def format_func(x, loc):
    if x == 0:
        return '0'
    elif x == 1:
        return 'h'
    elif x == -1:
        return '-h'
    else:
        return '%ih' % x

for i, kernel in enumerate(['gaussian', 'tophat', 'epanechnikov',
                            'exponential', 'linear', 'cosine']):
    axi = ax.ravel()[i]
    log_dens = KernelDensity(kernel=kernel).fit(X_src).score_samples(X_plot)
    axi.fill(X_plot[:, 0], np.exp(log_dens), '-k', fc='#AAAAFF')
    axi.text(-2.6, 0.95, kernel)

    axi.xaxis.set_major_formatter(plt.FuncFormatter(format_func))
    axi.xaxis.set_major_locator(plt.MultipleLocator(1))
    axi.yaxis.set_major_locator(plt.NullLocator())

    axi.set_ylim(0, 1.05)
    axi.set_xlim(-2.9, 2.9)

ax[0, 1].set_title('Available Scikit-learn kernels');

Univariate example¶

In the following example, the kernel density estimation (KDE) method is used to try to fit the density function of data simulated from a bimodal normal distribution (overlap of two normal distributions).

Libraries¶

The libraries used in this example are:

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8

# Distribution fitting
# ==============================================================================
from scipy import stats
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data¶

For this demonstration, simulated data are used from a mixture distribution formed by two normal distributions:

Distribution 1: $\mathcal{N}(1,\,0.5)$, weight 0.25
Distribution 2: $\mathcal{N}(-1,\,0.5)$, weight 0.75

The weights indicate the proportion in which each individual distribution contributes to the final distribution.

# Data
# ==============================================================================
n = 1000
np.random.seed(123)
sample_1 = np.random.normal(loc=1, scale=0.5, size=int(n * 0.75))
sample_2 = np.random.normal(loc=-1, scale=0.5, size=int(n * 0.25))
data = np.hstack((sample_1, sample_2))

Since the data have been simulated, the true density curve can be calculated.

# Histogram overlaid with the true distribution
# ==============================================================================
X_grid = np.linspace(-3, 4, 1000)
true_density = (stats.norm.pdf(loc=1, scale=0.5, x=X_grid)*0.75
              + stats.norm.pdf(loc=-1, scale=0.5, x=X_grid)*0.25)

fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data, bins=30, density=True, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
ax.plot(X_grid, true_density, color = 'blue', label='True distribution')
ax.set_title('Observed vs true distribution')
ax.set_xlabel('x')
ax.set_ylabel('density')
ax.legend();

KDE model¶

The KernelDensity class from Scikit-learn implements the algorithm to estimate the density function of univariate and multivariate distributions.

First, a model is trained using a linear kernel and a bandwidth of 1.

# KDE model fitting
# ==============================================================================
kde_model = KernelDensity(kernel='linear', bandwidth=1)
kde_model.fit(X=data.reshape(-1, 1))

KernelDensity(bandwidth=1, kernel='linear')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Once the model is trained, the score_samples() method can predict the estimated density for any new value. The returned value is on a logarithmic scale.

# Density prediction
# ==============================================================================
new_X = np.array([1])
log_density_pred = kde_model.score_samples(X=new_X.reshape(-1, 1))

# Apply exponent to undo the logarithm
density_pred = np.exp(log_density_pred)
density_pred

array([0.46014262])

If the prediction is calculated for the entire range of observed values, the obtained density function can be represented.

# Plot of estimated (prediction) vs true density distribution
# ==============================================================================
X_grid = np.linspace(-3, 4, 1000)
true_density = (
    stats.norm.pdf(loc=1, scale=0.5, x=X_grid) * 0.75 
    + stats.norm.pdf(loc=-1, scale=0.5, x=X_grid) * 0.25
)

log_density_pred = kde_model.score_samples(X_grid.reshape((-1, 1)))
# Apply exponent to undo the logarithm
density_pred = np.exp(log_density_pred)

fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data, bins=30, density=True, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
ax.plot(X_grid, true_density, color = 'blue', label='True distribution')
ax.plot(X_grid, density_pred, color = 'red', label='KDE')
ax.set_title('Distribution comparison')
ax.set_xlabel('x')
ax.set_ylabel('density')
ax.legend();

Kernel and bandwidth selection¶

The kernel type and bandwidth are hyperparameters whose optimal value cannot be known in advance. Cross-validation is used to try to identify the combination that achieves the best results.

When GridSearchCV is combined with KernelDensity, the score used is the total log-likelihood. The closer to zero, the better the KDE fits the data.

# Cross-validation to identify kernel and bandwidth
# ==============================================================================
param_grid = {
    'kernel': ['gaussian', 'epanechnikov', 'exponential', 'linear'],
    'bandwidth': np.logspace(-2, 3, 20)
}

grid = GridSearchCV(
        estimator  = KernelDensity(),
        param_grid = param_grid,
        n_jobs     = -1,
        cv         = 10, 
        verbose    = 0
      )
_ = grid.fit(X = data.reshape((-1, 1)))
results = (
    pd.DataFrame(grid.cv_results_)
    .loc[:, ['param_bandwidth', 'param_kernel', 'mean_test_score']]
    .sort_values(by='mean_test_score', ascending=False)
)
results.head(10)

	param_bandwidth	param_kernel	mean_test_score
25	0.379269	epanechnikov	-136.081077
27	0.379269	linear	-136.137295
20	0.206914	gaussian	-136.243434
18	0.112884	exponential	-136.427078
16	0.112884	gaussian	-136.603019
22	0.206914	exponential	-136.850927
31	0.695193	linear	-137.022066
29	0.695193	epanechnikov	-137.508029
14	0.061585	exponential	-137.599076
24	0.379269	gaussian	-138.651583

# Best hyperparameters by cross-validation
# ==============================================================================
print("----------------------------------------")
print("Best hyperparameters found (cv)")
print("----------------------------------------")
print(grid.best_params_, ":", grid.best_score_)

final_kde_model = grid.best_estimator_

----------------------------------------
Best hyperparameters found (cv)
----------------------------------------
{'bandwidth': np.float64(0.37926901907322497), 'kernel': 'epanechnikov'} : -136.081077189232

# Final model density distribution plots
# ==============================================================================
X_grid = np.linspace(-3, 4, 1000)
true_density = (
    stats.norm.pdf(loc=1, scale=0.5, x=X_grid) * 0.75
    + stats.norm.pdf(loc=-1, scale=0.5, x=X_grid) * 0.25
)

log_density_pred = final_kde_model.score_samples(X_grid.reshape((-1, 1)))
# Apply exponent to undo the logarithm
density_pred = np.exp(log_density_pred)

fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data, bins=30, density=True, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.001), '|k', markeredgewidth=1)
ax.plot(X_grid, true_density, color = 'blue', label='True distribution')
ax.plot(X_grid, density_pred, color = 'red', label='Kernel: epanechnikov \n bw:0.34')
ax.set_title('Distribution comparison')
ax.set_xlabel('x')
ax.set_ylabel('density')
ax.legend();

The density function obtained using an epanechnikov kernel with a bandwidth of 0.38 manages to closely approximate the true distribution.

It is worth remembering that this comparison is only possible because the data have been simulated. In practice, the true distribution is usually not known, otherwise there would be no need to apply a KDE approximation.

Multivariate example¶

Everything explained about fitting using kernel density estimation (KDE) applies to distributions of more than one dimension (multivariate). However, it is important to keep in mind that, due to the problem known as curse of dimensionality, results deteriorate exponentially as the number of variables increases.

Libraries¶

The libraries used in this document are:

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8

# Distribution fitting
# ==============================================================================
from scipy import stats
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.preprocessing import StandardScaler

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data¶

The geyser dataset contains information about the eruptions of the Old Faithful geyser located in Yellowstone National Park, Wyoming. Specifically, it records information about the duration of 272 eruptions, as well as the time interval between them.

# Data reading
# ==============================================================================
data = sns.load_dataset("geyser")
data = data[['duration', 'waiting']]
print(data.info())
data.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   duration  272 non-null    float64
 1   waiting   272 non-null    int64  
dtypes: float64(1), int64(1)
memory usage: 4.4 KB
None

	duration	waiting
0	3.600	79
1	1.800	54
2	3.333	74

# Histogram of each variable
# ==============================================================================
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axs[0].hist(data['duration'], bins=30, density=True, color="#3182bd", alpha=0.5)
axs[0].set_title('Duration distribution')
axs[0].set_xlabel('duration')
axs[0].set_ylabel('density')

axs[1].hist(data['waiting'], bins=30, density=True, color="#3182bd", alpha=0.5)
axs[1].set_title('Waiting distribution')
axs[1].set_xlabel('waiting')
axs[1].set_ylabel('density');

# Plot
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
ax.scatter(data['duration'], data['waiting'], color="#3182bd", alpha=0.5)
ax.set_title('Geyser eruptions distribution')
ax.set_xlabel('duration')
ax.set_ylabel('waiting');

Scaling data¶

In multivariate kernel density estimation (KDE), scaling the data is essential because the estimator relies on distances between points in the feature space. In the scikit-learn implementation, KernelDensity uses a single bandwidth for all dimensions together with a Euclidean distance metric by default. As a result, if the variables have different numerical scales or units, dimensions with larger ranges will dominate the distance computation and therefore the shape of the estimated density. This causes the KDE to reflect differences in scale rather than the true joint structure of the data. Scaling the variables (and, if necessary, whitening to handle correlations) ensures that each dimension contributes appropriately to the distance metric, makes the bandwidth parameter meaningful, and leads to a more faithful multivariate density estimate when using sklearn’s KDE.

# Scale data
# ==============================================================================
scaler = StandardScaler().set_output(transform='pandas')
data_scaled = scaler.fit_transform(data)

KDE model¶

Cross-validation is used to identify the best kernel type and bandwidth. Since there is limited data, the LeaveOneOut strategy is used. However, if computational resources are limited or many observations are available, K-Fold cross-validation with a reasonable number of splits (e.g., 5 or 10) can be used as an alternative.

# Cross-validation to identify kernel and bandwidth
# ==============================================================================
param_grid = {
    'kernel': ['gaussian', 'epanechnikov', 'exponential', 'linear'],
    'bandwidth': np.logspace(-2, 2, 20)
}

grid = GridSearchCV(
        estimator  = KernelDensity(),
        param_grid = param_grid,
        n_jobs     = -1,
        cv         = LeaveOneOut(), 
        verbose    = 0
      )
_ = grid.fit(X=data_scaled)

results = (
    pd.DataFrame(grid.cv_results_)
    .loc[:, ['param_bandwidth', 'param_kernel', 'mean_test_score']]
    .sort_values(by='mean_test_score', ascending=False)
)
results.head(10)

	param_bandwidth	param_kernel	mean_test_score
24	0.183298	gaussian	-1.469368
35	0.483293	linear	-1.471507
22	0.112884	exponential	-1.476535
33	0.483293	epanechnikov	-1.478855
18	0.069519	exponential	-1.490627
20	0.112884	gaussian	-1.492119
26	0.183298	exponential	-1.572039
28	0.297635	gaussian	-1.577044
39	0.784760	linear	-1.598214
14	0.042813	exponential	-1.628505

# Best hyperparameters by cross-validation
# ==============================================================================
print("----------------------------------------")
print("Best hyperparameters found (cv)")
print("----------------------------------------")
print(grid.best_params_, ":", grid.best_score_)
kde_model = grid.best_estimator_

----------------------------------------
Best hyperparameters found (cv)
----------------------------------------
{'bandwidth': np.float64(0.18329807108324356), 'kernel': 'gaussian'} : -1.4693680944036116

If the prediction is calculated for the entire range of observed values, the obtained density function can be represented. Since it is two-dimensional, a grid must be created that covers the entire surface.

# Probability density map
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 4))

# Grid of values extended a bit beyond data limits
x = np.linspace(data['duration'].min() * 0.8, data['duration'].max() * 1.1, 200)
y = np.linspace(data['waiting'].min() * 0.8, data['waiting'].max() * 1.1, 200)
xx, yy = np.meshgrid(x, y)
grid = pd.DataFrame(
    np.column_stack([xx.ravel(), yy.ravel()]),
    columns=['duration', 'waiting']
)
grid = scaler.transform(grid)
# Probability density of each grid value
log_density_pred = kde_model.score_samples(grid)
density_pred = np.exp(log_density_pred).reshape(xx.shape)

ax.contour(
    xx, yy,
    density_pred,
    levels=12,
    alpha=0.6
)
ax.scatter(data['duration'], data['waiting'], color="#3182bd", alpha=0.5)
ax.set_title('Estimated density function')
ax.set_xlabel("Duration")
ax.set_ylabel("Waiting time");

# Representation as heatmap
#===============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 4))
ax.scatter(grid['duration'], grid['waiting'], alpha=0.6, c=density_pred)
ax.set_title('Estimated density function')
ax.set_xlabel("Waiting time")
ax.set_ylabel("Duration");

# 3D representation
#===============================================================================
fig = plt.figure(figsize=(5, 5))
ax = plt.axes(projection='3d')
ax.plot_surface(xx, yy, density_pred, cmap='viridis')
ax.set_xlabel("Waiting time")
ax.set_ylabel("Duration")
ax.set_zlabel('density')
ax.set_title('KDE estimated density function');
plt.tight_layout();

Prediction¶

Once the density function is estimated, its value can be predicted for any other observation. For example, that an eruption has an interval of 120 and a duration of 1 (lower corner of the 3D plot).

new_data = np.array([120, 1]).reshape(1, 2)
log_density_pred = kde_model.score_samples(new_data)
density_pred = np.exp(log_density_pred)
density_pred

array([0.])

Among the observed data, there is no eruption with these characteristics, hence its density value is very low. Again, it should be noted that the obtained value is a probability density, not probability, so its range is bounded between [0, +$\infty$].

This value allows comparing how likely one observation is compared to another in this same distribution, i.e., it serves to make relative comparisons.

Session information¶

import session_info
session_info.show(html=False)

-----
matplotlib          3.10.8
mpl_toolkits        NA
numpy               2.2.6
pandas              2.3.3
scipy               1.15.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-20 00:04

Citation instructions¶

How to cite this document?

If you use this document or any part of it, we would appreciate your citation. Thank you very much!

Kernel Density Estimation (KDE) with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pystats02-kernel-density-estimation-kde-python.html

Did you like the article? Your help is important

Your contribution will help me continue generating free educational content. Thank you very much! 😊

This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

It is allowed to:

Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform and build upon the material.

Under the following terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Non-Commercial: You may not use the material for commercial purposes.
Share-Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

	bandwidth	1
	algorithm	'auto'
	kernel	'linear'
	metric	'euclidean'
	atol	0
	rtol	0
	breadth_first	True
	leaf_size	40
	metric_params	None