kernel
A class to implement diffusion kernels.
-
class pydiffmap.kernel.Kernel(kernel_type='gaussian', epsilon='bgh', k=64, neighbor_params=None, metric='euclidean', metric_params=None, bandwidth_type=None)
Class abstracting the evaluation of kernel functions on the dataset.
Parameters: - kernel_type (string or callable, optional) – Type of kernel to construct. Currently the only option is ‘gaussian’ (the default), but more will be implemented.
- epsilon (string or scalar, optional) – Method for choosing epsilon. The options are a scalar (epsilon is set to the provided value), ‘bgh’ (Berry, Giannakis and Harlim), or ‘bgh_generous’ (the ‘bgh’ method, with the result multiplied by 2).
- k (int, optional) – Number of nearest neighbors over which to construct the kernel.
- neighbor_params (dict or None, optional) – Optional parameters for the nearest neighbor search. See the scikit-learn NearestNeighbors class for details.
- metric (string, optional) – Distance metric to use in constructing the kernel. This can be selected from any of the scipy.spatial.distance metrics, or a callable function returning the distance.
- metric_params (dict or None, optional) – Optional parameters required for the metric given.
- bandwidth_type (callable, number, string, or None, optional) – Type of bandwidth to use in the kernel. If None (default), a fixed bandwidth kernel is used. If a callable function, the data is passed to the function, and the bandwidth is output (note that the function must take in an entire dataset, not the points 1-by-1). If a number, e.g. -.25, a kernel density estimate is performed, and the bandwidth is taken to be q**(input_number). For a string input, the input is assumed to be an evaluatable expression in terms of the dimension d, e.g. “-1/(d+2)”. The dimension is then estimated, and the bandwidth is set to q**(evaluated input string).
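As a minimal sketch of the number and string cases described above, the following turns a density estimate q into bandwidths q**beta. It is an illustration only, not the library's code: the helper name `bandwidth_from_density`, the sample density values, and the way the string is evaluated are all assumptions.

```python
import numpy as np

def bandwidth_from_density(q, bandwidth_type, d=None):
    """Hypothetical helper: map density values q to bandwidths q**beta."""
    if isinstance(bandwidth_type, str):
        # Assumed convention: the string is an expression in the dimension d.
        # Builtins are disabled so that only `d` is visible to the expression.
        beta = eval(bandwidth_type, {"__builtins__": {}}, {"d": d})
    else:
        beta = float(bandwidth_type)
    return np.asarray(q, dtype=float) ** beta

q = np.array([0.5, 1.0, 2.0])  # made-up density values for illustration

# A numeric exponent and the equivalent string expression with d = 2.
bw_number = bandwidth_from_density(q, -0.25)
bw_string = bandwidth_from_density(q, "-1/(d+2)", d=2)
```

With a negative exponent, points in low-density regions receive larger bandwidths than points in high-density regions.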
-
build_bandwidth_fxn(bandwidth_type)
Parses an input string or function specifying the bandwidth.
Parameters: bandwidth_type (string or number or callable) – Bandwidth to use. If a number, taken to be the beta parameter in [1]. If a string, again taken to be beta, but given as an evaluatable expression in terms of the intrinsic dimension d, e.g. ‘1/(d+2)’. If a function, taken to be a function that outputs the bandwidth.
References
[1] T. Berry, and J. Harlim, Applied and Computational Harmonic Analysis 40, 68-96 (2016).
-
choose_optimal_epsilon(epsilon=None)
Chooses the optimal value of epsilon and automatically detects the dimensionality of the data.
Parameters: epsilon (string or scalar, optional) – Method for choosing epsilon. Currently, the only options are to provide a scalar (epsilon is set to the provided scalar) or ‘bgh’ (Berry, Giannakis and Harlim).
Returns: self (the object itself)
-
compute(Y=None, return_bandwidths=False)
Computes the sparse kernel matrix.
Parameters: - Y (array-like, shape (n_query, n_features), optional) – Data against which to calculate the kernel values. If not provided, calculates against the data provided in the fit.
- return_bandwidths (boolean, optional) – If True, also returns the computed bandwidth for each point in Y.
Returns: - K (array-like, shape (n_query_X, n_query_Y)) – Values of the kernel matrix.
- y_bandwidths (array-like, shape (n_query_y)) – Bandwidth evaluated at each point in Y. Only returned if return_bandwidths is True.
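To make the idea of a kernel built over the k nearest neighbors concrete, here is a small NumPy sketch. It is a dense stand-in for the sparse matrix described above and is not the library's implementation: the function name `knn_gaussian_kernel`, the brute-force neighbor search, and the scaling exp(-dist²/(4·epsilon)) are assumptions made for illustration.

```python
import numpy as np

def knn_gaussian_kernel(X, Y, epsilon, k):
    """Hypothetical dense sketch: Gaussian kernel kept only on k nearest neighbors.

    Rows index the query points Y, columns index the data points X.
    """
    # Squared Euclidean distance between every query point and every data point.
    distsq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.zeros_like(distsq)
    # Column indices of the k nearest data points for each query point.
    nn_idx = np.argsort(distsq, axis=1)[:, :k]
    rows = np.arange(Y.shape[0])[:, None]
    # Assumed Gaussian scaling; entries outside the neighbor set stay zero.
    K[rows, nn_idx] = np.exp(-distsq[rows, nn_idx] / (4.0 * epsilon))
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # synthetic data, 20 points in 3 dimensions
K_self = knn_gaussian_kernel(X, X, epsilon=0.5, k=4)
K_query = knn_gaussian_kernel(X, X[:5] + 0.1, epsilon=0.5, k=4)
```

A real implementation would typically store only the nonzero entries in a sparse format and use a tree-based neighbor search rather than sorting all pairwise distances.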
-
class pydiffmap.kernel.NNKDE(neighbors, k=8)
Class building a kernel density estimate with a variable bandwidth built from the k nearest neighbors.
Parameters: - neighbors (scikit-learn NearestNeighbors object) – NearestNeighbors object to use in constructing the KDE.
- k (int, optional) – Number of nearest neighbors to use in the construction of the bandwidth. This must be less than or equal to the number of nearest neighbors used by the nearest neighbor object.
-
compute(Y)
Computes the density at each query point in Y.
Parameters: Y (array-like, shape (n_query, n_features)) – Data against which to calculate the kernel values. If not provided, calculates against the data provided in the fit.
Returns: q (array-like, shape (n_query)) – Density evaluated at each point in Y.
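The general idea of a density estimate whose bandwidth comes from nearest-neighbor distances can be sketched as follows. This is a generic sample-point Gaussian KDE and should not be read as the formula NNKDE uses: the function name `knn_kde`, the choice of the k-th neighbor distance as the bandwidth, and the normalization are assumptions.

```python
import numpy as np

def knn_kde(X, Y, k):
    """Hypothetical sketch: Gaussian KDE with one bandwidth per data point."""
    n, dim = X.shape
    # Pairwise distances among the data points.
    dist_xx = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # After sorting, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    h = np.sort(dist_xx, axis=1)[:, k]
    # Squared distances from each query point to each data point.
    distsq_yx = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Each data point contributes a normalized Gaussian of width h[i].
    norm = (2.0 * np.pi * h ** 2) ** (dim / 2.0)
    return np.mean(np.exp(-distsq_yx / (2.0 * h ** 2)) / norm, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))               # synthetic 2-D Gaussian cloud
Y = np.array([[0.0, 0.0], [4.0, 4.0]])      # one central, one distant query
q = knn_kde(X, Y, k=8)
```

Points in sparse regions get wider Gaussians, which smooths the estimate in the tails while keeping resolution where data are dense.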
-
pydiffmap.kernel.choose_optimal_epsilon_BGH(scaled_distsq, epsilons=None)
Calculates the optimal epsilon for kernel density estimation according to the criteria in Berry, Giannakis, and Harlim.
Parameters: - scaled_distsq (numpy array) – Scaled squared distances, in no particular order or shape. (This is the exponent in the Gaussian kernel, i.e. the quantity that gets divided by epsilon.)
- epsilons (array-like, optional) – Values of epsilon from which to choose the optimum. If not provided, uses all powers of 2 from 2^-40 to 2^40.
Returns: - epsilon (float) – Estimated value of the optimal length-scale parameter.
- d (int) – Estimated dimensionality of the system.
Notes
This code explicitly assumes the kernel is gaussian, for now.
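A rough sketch of this kind of selection rule is shown below, assuming a Gaussian kernel of the form exp(-distsq/(4·epsilon)). It picks the epsilon where the log-log slope of the kernel sum is steepest and reads the dimension off as twice that slope. The function name `bgh_epsilon`, the tie-breaking choice of which grid point to return, and the synthetic data are assumptions; this is not the library's code.

```python
import numpy as np

def bgh_epsilon(scaled_distsq, epsilons=None):
    """Hypothetical sketch of a BGH-style choice of epsilon and dimension."""
    if epsilons is None:
        # Default grid: powers of 2 from 2^-40 to 2^40.
        epsilons = 2.0 ** np.arange(-40, 41)
    epsilons = np.sort(np.asarray(epsilons, dtype=float))
    x = np.ravel(scaled_distsq)
    log_T = []
    for eps in epsilons:
        # Numerically stable log of the sum of kernel values.
        a = -x / (4.0 * eps)
        m = a.max()
        log_T.append(m + np.log(np.exp(a - m).sum()))
    # Finite-difference slope of log T against log epsilon.
    slope = np.diff(np.array(log_T)) / np.diff(np.log(epsilons))
    i = int(np.argmax(slope))
    # Assumed convention: return the upper end of the steepest interval.
    return epsilons[i + 1], int(round(2.0 * slope[i]))

rng = np.random.default_rng(2)
pts = rng.uniform(size=(300, 2))            # synthetic points in the unit square
distsq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
epsilon, d = bgh_epsilon(distsq)
```

The slope is near zero for very small and very large epsilon, where the kernel sum saturates, so the maximum falls in the range where the kernel resolves the local geometry of the data.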
References
The algorithm given is based on [1]. If you use this code, please cite them.
[1] T. Berry, D. Giannakis, and J. Harlim, Physical Review E 91, 032915 (2015).