Kernel Trick

For machine learning algorithms, the kernel trick is a way of mapping observations from a general set S into an inner product space V (equipped with its natural norm), without ever having to compute the mapping explicitly, in the hope that the observations will gain meaningful linear structure in V. Linear classifications in V are equivalent to generic classifications in S. The trick to avoid the explicit mapping is to use learning algorithms that only require dot products between the vectors in V, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.

For on, certain functions can be expressed as an inner product (in usually a different space). K is often referred to as a kernel or a kernel function. The word kernel is used in different ways throughout mathematics.

If one is lucky or insightful regarding a particular machine learning problem, one may manually construct such that

and verify that is indeed an inner product.

Furthermore, one does not even require an explicit representation for : it suffices to know that V is an inner product space. Conveniently, based on Mercer's theorem, it suffices to equip S with one's choice of measure and verify that in fact, satisfies Mercer's condition.

Mercer's theorem is stated in a general mathematical setting with implications in the theory of integral equations. However, the general statement is overkill for what is required for understanding the kernel trick. Given a finite observation set S, one can simply select the measure for all . Then the integral in Mercer's theorem reduces to a simple summation

for all finite sequences of points x1, ..., xn of S and all choices of real numbers c1, ..., cn (cf. positive definite kernel).

Some algorithms that depend on arbitrary relationships in the native space would, in fact, have a linear interpretation in a different setting: the range space of . The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute directly during computation, as is the case with support vector machines. Some cite this running time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms.

The kernel trick was first published in 1964 by Aizerman et al.

Theoretically, a kernel matrix K must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of K that do not satisfy Mercer's condition may still perform reasonably if K at least approximates the intuitive idea of similarity. Regardless of whether K is a Mercer kernel, K can still be referred to a "kernel". Suppose K is any square matrix, then is a PSD matrix.

Read more about Kernel Trick:  Applications

Famous quotes containing the words kernel and/or trick:

    We should never stand upon ceremony with sincerity. We should never cheat and insult and banish one another by our meanness, if there were present the kernel of worth and friendliness. We should not meet thus in haste.
    Henry David Thoreau (1817–1862)

    I have come back
    but disorder is not what it was.
    I have lost the trick of it!
    The innocence of it!
    Anne Sexton (1928–1974)