Theory

Diffusion maps is a dimension reduction technique for discovering low-dimensional structure in high-dimensional data. It assumes that the data points, which are given as points in a high-dimensional metric space, actually lie on a lower-dimensional structure. To uncover this structure, diffusion maps builds a neighborhood graph on the data based on the distances between nearby points. A graph Laplacian \(L\) is then constructed on this neighborhood graph. Many variants exist, each approximating a different differential operator. For example, standard diffusion maps approximates the differential operator

\[\mathcal{L}f = \Delta f - 2(1-\alpha)\nabla f \cdot \frac{\nabla q}{q}\]

where \(\Delta\) is the Laplace-Beltrami operator, \(\nabla\) is the gradient operator and \(q\) is the sampling density. The normalization parameter \(\alpha\), which is typically between 0.0 and 1.0, determines how much \(q\) is allowed to bias the operator \(\mathcal{L}\): at \(\alpha = 1\) the drift term vanishes and \(\mathcal{L}\) reduces to \(\Delta\). Standard diffusion maps on a dataset X, which has to be given as a numpy array with different rows corresponding to different observations, is implemented in pydiffmap as:

from pydiffmap import diffusion_map

mydmap = diffusion_map.DiffusionMap.from_sklearn(epsilon = my_epsilon, alpha = my_alpha)
mydmap.fit(X)

Here epsilon is a scale parameter used to rescale distances between data points. epsilon can also be chosen automatically, using an algorithm by Berry, Harlim and Giannakis:

mydmap = diffusion_map.DiffusionMap.from_sklearn(alpha = my_alpha, epsilon = 'bgh')

For additional optional arguments of the DiffusionMap class, see usage and documentation.
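
To make the construction concrete, the following is a minimal numpy sketch of an \(\alpha\)-normalized graph Laplacian. It is an illustration under stated assumptions, not pydiffmap's implementation: it uses a dense Gaussian kernel over all pairs of points rather than a sparse neighborhood graph, and the function name alpha_normalized_laplacian and the kernel scaling \(\exp(-d^2 / 4\epsilon)\) are choices made here for the example.

```python
import numpy as np


def alpha_normalized_laplacian(X, epsilon, alpha):
    """Dense sketch of an alpha-normalized graph Laplacian on the rows of X."""
    # Pairwise squared Euclidean distances between observations (rows of X).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative round-off

    # Gaussian kernel; the 4 * epsilon scaling is an assumed convention.
    K = np.exp(-sq_dists / (4.0 * epsilon))

    # Kernel density estimate q (up to a constant), divided out with power alpha.
    q = K.sum(axis=1)
    K_alpha = K / np.outer(q**alpha, q**alpha)

    # Row-normalize to a Markov matrix P, then form L = (P - I) / epsilon.
    P = K_alpha / K_alpha.sum(axis=1, keepdims=True)
    L = (P - np.eye(X.shape[0])) / epsilon
    return L, P


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 200 observations in 3 dimensions
L, P = alpha_normalized_laplacian(X, epsilon=0.5, alpha=0.5)
```

Each row of P sums to one, so L annihilates constant vectors, as a discretization of a differential operator without a zeroth-order term should. With alpha = 0.0 the density estimate is not divided out at all, while alpha = 1.0 removes its influence on the limiting operator.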

A variant of diffusion maps, ‘TMDmap’, unbiases with respect to \(q\) and approximates the differential operator

\[\mathcal{L}f = \Delta f + \nabla (\log\pi) \cdot \nabla f\]

where \(\pi\) is a ‘target distribution’ that defines the drift term and has to be known up to a normalization constant. TMDmap is implemented in pydiffmap as:

mydmap = diffusion_map.TMDmap(epsilon = my_epsilon, alpha = 1.0, change_of_measure=com_fxn)
mydmap.fit(X)

where com_fxn is a function that takes in a coordinate and outputs the value of the target distribution \(\pi\).
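
As an illustration, the target could be an unnormalized standard Gaussian, \(\pi(x) \propto \exp(-|x|^2/2)\), for which \(\nabla(\log\pi) = -x\) and the operator above becomes \(\Delta f - x \cdot \nabla f\). The sketch below defines such a com_fxn for a single coordinate vector and checks the drift numerically. The helper grad_log_pi is hypothetical and exists only for this check, and whether pydiffmap evaluates com_fxn on one point at a time or on a whole array is an assumption to verify against its documentation.

```python
import numpy as np


def com_fxn(x):
    """Unnormalized standard Gaussian target: pi(x) proportional to exp(-|x|^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * np.sum(x**2))


def grad_log_pi(x, h=1e-5):
    """Central finite-difference estimate of grad(log pi) at a 1-D point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (np.log(com_fxn(x + step)) - np.log(com_fxn(x - step))) / (2.0 * h)
    return grad


x0 = np.array([0.3, -1.2])
drift = grad_log_pi(x0)  # analytically, grad(log pi)(x) = -x for this target
```

Because \(\pi\) only enters through \(\nabla(\log\pi)\), multiplying com_fxn by any positive constant leaves the drift unchanged, which is why the normalization constant does not need to be known.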