kernel

A class to implement diffusion kernels.

class pydiffmap.kernel.Kernel(kernel_type='gaussian', epsilon='bgh', k=64, neighbor_params=None, metric='euclidean', metric_params=None, bandwidth_type=None)

Class abstracting the evaluation of kernel functions on the dataset.

Parameters:

kernel_type (string or callable, optional) – Type of kernel to construct. Currently the only option is ‘gaussian’ (the default), but more will be implemented.
epsilon (string or scalar, optional) – Method for choosing the epsilon. Currently, the only options are to provide a scalar (epsilon is set to the provided scalar), ‘bgh’ (Berry, Giannakis and Harlim), or ‘bgh_generous’ (the ‘bgh’ method, with the answer multiplied by 2).
k (int, optional) – Number of nearest neighbors over which to construct the kernel.
neighbor_params (dict or None, optional) – Optional parameters for the nearest neighbor search. See the scikit-learn NearestNeighbors class for details.
metric (string, optional) – Distance metric to use in constructing the kernel. This can be selected from any of the scipy.spatial.distance metrics, or a callable function returning the distance.
metric_params (dict or None, optional) – Optional parameters required for the metric given.
bandwidth_type (callable, number, string, or None, optional) – Type of bandwidth to use in the kernel. If None (default), a fixed bandwidth kernel is used. If a callable, the data is passed to the function, which returns the bandwidth (note that the function must take in the entire dataset, not the points one by one). If a number, e.g. -.25, a kernel density estimate q is computed, and the bandwidth is taken to be q**(input_number). If a string, the input is assumed to be an evaluatable expression in terms of the dimension d, e.g. “-1/(d+2)”. The dimension is then estimated, and the bandwidth is set to q**(evaluated input string).

build_bandwidth_fxn(bandwidth_type)

Parses an input string or function specifying the bandwidth.

Parameters:	bandwidth_type (string or number or callable) – Bandwidth to use. If a number, taken to be the beta parameter in [1]. If a string, again taken to be beta, but given as an evaluatable expression in terms of the intrinsic dimension d, e.g. ‘1/(d+2)’. If a function, taken to be a function that outputs the bandwidth.

References

[1]	T. Berry and J. Harlim, Applied and Computational Harmonic Analysis 40, 68-96 (2016).
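
The three accepted input forms can be illustrated with a small standalone sketch. It is not the library's implementation: make_bandwidth_fxn is a hypothetical helper, and it assumes that the density values q and the dimension estimate d have already been computed elsewhere. Note that here a number or string yields a function of the density q, whereas a user-supplied callable is passed through unchanged.

```python
import numbers

import numpy as np


def make_bandwidth_fxn(spec, d=None):
    """Hypothetical helper: turn a bandwidth spec into a function.

    Returns None for a fixed-bandwidth kernel, the callable itself if one
    is given, and otherwise a function mapping density values q to q**beta.
    """
    if spec is None:
        return None
    if callable(spec):
        return spec
    if isinstance(spec, str):
        if d is None:
            raise ValueError("a string spec needs an estimate of d")
        # Evaluate the expression with only `d` in scope (no builtins).
        beta = float(eval(spec, {"__builtins__": {}}, {"d": d}))
    elif isinstance(spec, numbers.Number):
        beta = float(spec)
    else:
        raise TypeError("unsupported bandwidth spec: %r" % (spec,))
    return lambda q: np.asarray(q, dtype=float) ** beta


# Assumed, precomputed density values at three points.
q = np.array([1.0, 16.0, 81.0])
from_number = make_bandwidth_fxn(-0.25)(q)
from_string = make_bandwidth_fxn("-1/(d+2)", d=2)(q)
print(from_number, from_string)
```

With d = 2 the string “-1/(d+2)” evaluates to -0.25, so both calls produce the same bandwidths.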

choose_optimal_epsilon(epsilon=None)

Chooses the optimal value of epsilon and automatically detects the dimensionality of the data.

Parameters:	epsilon (string or scalar, optional) – Method for choosing the epsilon. Currently, the only options are to provide a scalar (epsilon is set to the provided scalar) or ‘bgh’ (Berry, Giannakis and Harlim).
Returns:	self (the object itself)

compute(Y=None, return_bandwidths=False)

Computes the sparse kernel matrix.

Parameters:

Y (array-like, shape (n_query, n_features), optional) – Data against which to calculate the kernel values. If not provided, calculates against the data provided in the fit.
return_bandwidths (boolean, optional) – If True, also returns the computed bandwidth for each point in Y.

Returns:

K (array-like, shape (n_query_X, n_query_Y)) – Values of the kernel matrix.
y_bandwidths (array-like, shape (n_query_Y)) – Bandwidth evaluated at each point in Y. Only returned if return_bandwidths is True.
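
To make the idea of a kernel restricted to nearest neighbors concrete, here is a dense NumPy sketch. It is an illustration under assumptions, not the library's code: knn_gaussian_kernel is a hypothetical name, the kernel is taken to be exp(-distance squared / epsilon) with a fixed bandwidth, rows index the query points, and a dense array stands in for the sparse matrix.

```python
import numpy as np


def knn_gaussian_kernel(X, Y=None, k=3, epsilon=1.0):
    """Dense sketch: K[i, j] = exp(-|y_i - x_j|^2 / epsilon) for the k
    points x_j nearest to each query y_i, and 0 elsewhere."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    # Squared Euclidean distances, shape (n_query, n_fit).
    distsq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Column indices of the k nearest fitted points for each query row.
    nearest = np.argsort(distsq, axis=1)[:, :k]
    rows = np.arange(Y.shape[0])[:, None]
    K = np.zeros_like(distsq)
    K[rows, nearest] = np.exp(-distsq[rows, nearest] / epsilon)
    return K


# Three points on a line; each row keeps only its 2 nearest neighbors
# (the point itself counts as one of them when Y is omitted).
X = np.array([[0.0], [1.0], [3.0]])
K = knn_gaussian_kernel(X, k=2, epsilon=1.0)
print(K)
```

Entries outside each row's k nearest neighbors stay exactly zero, which is what makes a sparse representation worthwhile for large datasets.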

fit(X)

Fits the kernel to the data X, constructing the nearest neighbor tree.

Parameters:	X (array-like, shape (n_query, n_features)) – Data upon which to fit the nearest neighbor tree.
Returns:	self (the object itself)

class pydiffmap.kernel.NNKDE(neighbors, k=8)

Class building a kernel density estimate with a variable bandwidth built from the k nearest neighbors.

Parameters:

neighbors (scikit-learn NearestNeighbors object) – NearestNeighbors object to use in constructing the KDE.
k (int, optional) – Number of nearest neighbors to use in the construction of the bandwidth. This must be less than or equal to the number of nearest neighbors used by the nearest neighbor object.

compute(Y)

Computes the density at each query point in Y.

Parameters:	Y (array-like, shape (n_query, n_features)) – Query points at which to evaluate the density.
Returns:	q (array-like, shape (n_query)) – Density evaluated at each point Y.

fit()

Fits the KDE object to the data provided in the nearest neighbor object.
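
A variable-bandwidth density estimate of this general kind can be sketched in plain NumPy. The formulas below are assumptions chosen for illustration and are not taken from the library: each fitted point gets a bandwidth equal to the root-mean-square distance to its k nearest neighbors, and the density is the average of normalized Gaussians centered on the fitted points. Both function names are hypothetical.

```python
import numpy as np


def knn_bandwidths(X, k=8):
    """RMS distance from each point to its k nearest neighbors (self excluded)."""
    distsq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Sort each row; column 0 is the zero self-distance, so skip it.
    nearest_sq = np.sort(distsq, axis=1)[:, 1:k + 1]
    return np.sqrt(nearest_sq.mean(axis=1))


def variable_bandwidth_kde(X, Y, k=8):
    """Average of Gaussians centered on the rows of X, each with its own bandwidth."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, dim = X.shape
    h = knn_bandwidths(X, k)  # one bandwidth per fitted point, shape (n,)
    distsq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (dim / 2.0)  # Gaussian normalization
    return (np.exp(-distsq / (2.0 * h ** 2)) / norm).mean(axis=1)


# Samples from a standard normal; query the center and a far tail.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
q = variable_bandwidth_kde(X, np.array([[0.0], [5.0]]), k=8)
print(q)
```

Points in sparse regions receive wider Gaussians, so the estimate is smoother in the tails than a fixed-bandwidth estimate with the same central resolution.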

pydiffmap.kernel.choose_optimal_epsilon_BGH(scaled_distsq, epsilons=None)

Calculates the optimal epsilon for kernel density estimation according to the criteria in Berry, Giannakis, and Harlim.

Parameters:

scaled_distsq (numpy array) – Scaled squared distances, in no particular order or shape. (This is the exponent in the Gaussian kernel, i.e. the quantity that gets divided by epsilon.)
epsilons (array-like, optional) – Values of epsilon from which to choose the optimum. If not provided, uses all powers of 2 from 2^-40 to 2^40.

Returns:

epsilon (float) – Estimated value of the optimal length-scale parameter.
d (int) – Estimated dimensionality of the system.

Notes

This code explicitly assumes the kernel is Gaussian, for now.

References

The algorithm given is based on [1]. If you use this code, please cite them.

[1]	T. Berry, D. Giannakis, and J. Harlim, Physical Review E 91, 032915 (2015).
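
The selection criterion can be sketched as follows. This is a simplified reading of the approach, not the library's code: for each candidate epsilon it sums the Gaussian kernel over all supplied squared distances, then picks the epsilon where the log of that sum grows fastest with log epsilon, and takes twice that slope as the dimension estimate. The function names, the factor of 4 in the exponent, and the choice to return the left endpoint of the steepest interval are all assumptions.

```python
import numpy as np


def log_sum_exp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))


def choose_epsilon_sketch(scaled_distsq, epsilons=None):
    scaled_distsq = np.ravel(np.asarray(scaled_distsq, dtype=float))
    if epsilons is None:
        epsilons = 2.0 ** np.arange(-40, 41)
    epsilons = np.sort(np.asarray(epsilons, dtype=float))
    # log T(eps), with T(eps) = sum of exp(-distsq / (4 eps)); the factor
    # of 4 is an assumed convention.
    log_T = np.array([log_sum_exp(-scaled_distsq / (4.0 * eps))
                      for eps in epsilons])
    # Finite-difference slope of log T against log eps.
    slope = np.diff(log_T) / np.diff(np.log(epsilons))
    best = int(np.argmax(slope))
    # For a Gaussian kernel the maximal slope estimates d / 2.
    d = int(np.round(2.0 * slope[best]))
    return epsilons[best], d


# Points drawn uniformly from the unit square, with all pairwise
# squared distances (including the zero self-distances) as input.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
distsq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
eps_opt, d_est = choose_epsilon_sketch(distsq)
print(eps_opt, d_est)
```

Working with the log of the sum keeps the computation stable across the very wide range of epsilon values in the default grid.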