
Feature map for the Gaussian kernel

Cross Validated, asked on December 21, 2021

In SVM, the Gaussian kernel is defined as:
$$K(x,y)=\exp\left(-\frac{\|x-y\|_2^2}{2\sigma^2}\right)=\phi(x)^T\phi(y)$$ where $x, y\in \mathbb{R}^n$.
I do not know the explicit form of $\phi$, and I would like to know it.

I also want to know whether
$$\sum_i c_i\phi(x_i)=\phi \left(\sum_i c_i x_i \right)$$ where $c_i\in \mathbb{R}$. I think it is not equal, because a kernel is used precisely when a linear classifier does not work. I know $\phi$ maps $x$ into an infinite-dimensional space, so if $\phi$ were still linear, the SVM could not produce a good classification no matter how many dimensions that space has.

4 Answers

EXPLICIT EXPRESSION AND DERIVATION VIA DIRECT PROOF

The explicit expression for $\phi$ you are asking for is the following:

Lemma:

Given the Gaussian RBF kernel $K_\sigma$ between two $n$-dimensional vectors $x$ and $y$, for each $j$ from 0 to infinity and for every combination of $n$ indices (labeled $k$) that add up to $j$, the feature vector $\phi(x)$ has an entry of the form:

$$ \phi_{\sigma, j, k}(x) = c_{\sigma, j}(x) \cdot f_{j, k}(x) $$

Where:

$$ \begin{aligned} c_{\sigma, j}(x) &= \frac{K_\sigma(x, 0)}{\sigma^j \sqrt{j!}}\\ f_{j, k}(x) &= \begin{pmatrix} j\\k_1,k_2, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \end{aligned} $$

This can be directly derived as follows:


Definitions:

$$ \begin{aligned} K_\sigma(x, y) = &e^{-\frac{\|x-y\|_2^2}{2\sigma^2}}\\ \epsilon := &e^{\frac{1}{\sigma^2}}\\ \epsilon^x = &\sum_{j=0}^{\infty}\left\{ \frac{x^j}{\sigma^{2j} \cdot j!} \right\}\\ (x_1 + x_2 + \dots + x_n)^j = &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j\\k_1,k_2, \dots, k_n \end{pmatrix} \prod_{d=1}^n{x_d^{k_d}} \right\}\\ \end{aligned} $$


Direct Proof:

First, we decompose the squared Euclidean distance into its components and apply the Taylor expansion to the $\langle x, y \rangle$ factor:

$$ \begin{aligned} K(x,y)= &e^{-\frac{\|x-y\|_2^2}{2\sigma^2}} =\epsilon^{\langle x, y \rangle} \cdot\epsilon^{-\frac{\|x\|_2^2}{2}} \cdot \epsilon^{-\frac{\|y\|_2^2}{2}}\\ = &\sum_{j=0}^{\infty}\left\{ \frac{\langle x, y \rangle^j}{\sigma^{2j} \cdot j!} \right\} \cdot\epsilon^{-\frac{\|x\|_2^2}{2}} \cdot \epsilon^{-\frac{\|y\|_2^2}{2}} \end{aligned} $$

For further convenience, we refactor the expression (using $c$ for more compact notation):

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{\frac{\epsilon^{-\frac{\|x\|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \frac{\epsilon^{-\frac{\|y\|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \langle x, y \rangle^j \right\}\\ = &\sum_{j=0}^{\infty}\left\{ c_{\sigma, j}(x) \cdot c_{\sigma, j}(y) \cdot \langle x, y \rangle^j \right\}\\ \end{aligned} $$

And with the help of the multinomial theorem, we can express the power of the dot product as follows (using $f$ for more compact notation):

$$ \begin{aligned} \langle x, y \rangle^j = &\left(\sum_{d=1}^n x_d y_d \right)^j\\ = &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j\\k_1,k_2, \dots, k_n \end{pmatrix} \prod_{d=1}^n{(x_d y_d)^{k_d}} \right\}\\ = &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j\\k_1,\dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \cdot \begin{pmatrix} j\\k_1, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{y_d^{k_d}} \right\}\\ =: &\sum_{k_1+k_2+\dots+k_n=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\}\\ \end{aligned} $$

Now substituting back into $K$ completes the proof:

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{ c_{\sigma, j}(x) \cdot c_{\sigma, j}(y) \cdot \sum_{k_1+k_2+\dots+k_n=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\} \right\}\\ = &\sum_{j=0}^{\infty} \sum_{k_1+k_2+\dots+k_n=j}\left\{ c_{\sigma, j}(x) f_{j,k}(x) \cdot c_{\sigma, j}(y) f_{j, k}(y) \right\}\\ = &\langle \phi(x), \phi(y) \rangle\\ &\square \end{aligned} $$

Where each $\phi$ is a vector with one entry for every combination of $n$ indices (labeled $k$) that add up to $j$, and this for each $j$ from 0 to infinity.
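As a quick numerical sanity check of this (a sketch, not part of the derivation; numpy is assumed and the helper names are just illustrative), we can truncate the outer sum at a finite $j_{\max}$ and verify that $\langle \phi(x), \phi(y) \rangle \approx K_\sigma(x, y)$:

```python
import itertools
import math
import numpy as np

def phi_truncated(x, sigma=1.0, j_max=20):
    """Truncated explicit feature map phi_{sigma, j, k}(x) for j = 0..j_max."""
    x = np.asarray(x, dtype=float)
    n = x.size
    k_sigma_x0 = math.exp(-np.dot(x, x) / (2 * sigma**2))  # K_sigma(x, 0)
    feats = []
    for j in range(j_max + 1):
        c = k_sigma_x0 / (sigma**j * math.sqrt(math.factorial(j)))
        # all n-tuples of non-negative indices k with k_1 + ... + k_n = j
        for k in itertools.product(range(j + 1), repeat=n):
            if sum(k) != j:
                continue
            multinom = math.factorial(j) / math.prod(math.factorial(kd) for kd in k)
            f = math.sqrt(multinom) * math.prod(xd**kd for xd, kd in zip(x, k))
            feats.append(c * f)
    return np.array(feats)

def gaussian_kernel(x, y, sigma=1.0):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return math.exp(-np.dot(d, d) / (2 * sigma**2))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
print(phi_truncated(x) @ phi_truncated(y), gaussian_kernel(x, y))  # the two values agree closely
```

For vectors of moderate norm the truncated inner product matches the kernel value to many decimal places, since the tail terms decay like $\langle x, y \rangle^j / (\sigma^{2j} \, j!)$.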


hope this helps! Cheers,
Andres

Answered by fr_andres on December 21, 2021

For any valid psd kernel $k : \mathcal X \times \mathcal X \to \mathbb R$, there exists a feature map $\varphi : \mathcal X \to \mathcal H$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$. The space $\mathcal H$ and embedding $\varphi$ in fact need not be unique, but there is an important unique $\mathcal H$ known as the reproducing kernel Hilbert space (RKHS).

The RKHS is discussed by: Steinwart, Hush and Scovel, An Explicit Description of the Reproducing Kernel Hilbert Spaces of Gaussian RBF Kernels, IEEE Transactions on Information Theory 2006 (doi, free citeseer pdf).

It's somewhat complicated, and they need to analyze it via the extension of the Gaussian kernel to complex inputs and outputs, but it boils down to this: define $e_n : \mathbb R \to \mathbb R$ as $$ e_n(x) := \sqrt{\frac{(2 \sigma^2)^n}{n!}} x^n e^{-\sigma^2 x^2} $$ and, for a tuple $\nu = (\nu_1, \cdots, \nu_d) \in \mathbb N_0^d$, its tensor product $e_\nu : \mathbb R^d \to \mathbb R$ as $$ e_\nu(x) = e_{\nu_1}(x_1) \cdots e_{\nu_d}(x_d) .$$ Then their Proposition 3.6 says that any function $f \in \mathcal H_\sigma$, the RKHS for a Gaussian kernel of bandwidth $\sigma > 0$, can be written as $$ f(x) = \sum_{\nu \in \mathbb N_0^d} b_\nu e_\nu(x) \qquad \lVert f \rVert_{\mathcal H_\sigma(X)}^2 = \sum_{\nu \in \mathbb N_0^d} b_\nu^2 .$$

We can think of $\mathcal H_\sigma$ as being essentially the space of square-summable coefficients $(b_\nu)_{\nu \in \mathbb N_0^d}$.
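Since $\{e_\nu\}$ is presented as an orthonormal basis of the RKHS, the feature-map identity $k(x, y) = \sum_\nu e_\nu(x) e_\nu(y)$ should hold. A small one-dimensional check (a sketch, truncating the series; numpy assumed) confirms this, and also shows that in this parameterization the kernel is $\exp(-\sigma^2 (x-y)^2)$, i.e. $\sigma$ acts as an inverse bandwidth:

```python
import math
import numpy as np

def e_basis(x, sigma=1.0, n_max=30):
    """First n_max+1 basis functions e_n(x) = sqrt((2 sigma^2)^n / n!) x^n exp(-sigma^2 x^2)."""
    return np.array([
        math.sqrt((2 * sigma**2)**n / math.factorial(n)) * x**n * math.exp(-sigma**2 * x**2)
        for n in range(n_max + 1)
    ])

sigma, x, y = 1.0, 0.7, -0.3
lhs = e_basis(x, sigma) @ e_basis(y, sigma)   # sum_n e_n(x) e_n(y), truncated
rhs = math.exp(-sigma**2 * (x - y)**2)        # exp(-sigma^2 (x - y)^2)
print(lhs, rhs)  # the two values agree once n_max is large enough
```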

The question remains, though: what is the sequence $b_\nu$ for the function $\varphi(x)$? The paper doesn't seem to directly answer this question (unless I'm missing it as an obvious implication somewhere).


They do also give a more straightforward embedding into $L_2(\mathbb R^d)$, the Hilbert space of square-integrable functions from $\mathbb R^d \to \mathbb R$: $$ \Phi(x) = \frac{(2 \sigma)^{\frac{d}{2}}}{\pi^{\frac{d}{4}}} e^{- 2 \sigma^2 \lVert x - \cdot \rVert_2^2} .$$ Note that $\Phi(x)$ is itself a function from $\mathbb R^d$ to $\mathbb R$. It's basically the density of a $d$-dimensional Gaussian with mean $x$ and covariance $\frac{1}{4 \sigma^2} I$; only the normalizing constant is different. Thus when we take $$\langle \Phi(x), \Phi(y) \rangle_{L_2} = \int [\Phi(x)](t) \; [\Phi(y)](t) \,\mathrm d t ,$$ we're taking the product of Gaussian density functions, which is itself a certain constant times a Gaussian density function. When we then integrate over $t$, the constant that falls out is exactly $k(x, y)$.
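A minimal one-dimensional numerical check of this (a sketch; scipy is assumed, and the kernel is again in the $\exp(-\sigma^2 \lVert x - y \rVert_2^2)$ parameterization used above):

```python
import math
import numpy as np
from scipy.integrate import quad

sigma, d = 1.3, 1  # one-dimensional example

def Phi(x):
    """Phi(x) as a function of t: (2 sigma)^{d/2} / pi^{d/4} * exp(-2 sigma^2 (x - t)^2)."""
    const = (2 * sigma)**(d / 2) / math.pi**(d / 4)
    return lambda t: const * math.exp(-2 * sigma**2 * (x - t)**2)

x, y = 0.4, -0.9
inner, _ = quad(lambda t: Phi(x)(t) * Phi(y)(t), -np.inf, np.inf)  # <Phi(x), Phi(y)>_{L_2}
print(inner, math.exp(-sigma**2 * (x - y)**2))  # both give k(x, y)
```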


These are not the only embeddings that work.

Another is based on the Fourier transform, which the celebrated paper of Rahimi and Recht (Random Features for Large-Scale Kernel Machines, NIPS 2007) approximates to great effect.
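A minimal random Fourier features sketch, in the question's parameterization $k(x, y) = \exp(-\lVert x - y \rVert_2^2 / 2\sigma^2)$ (numpy assumed; this illustrates the idea rather than reproducing the paper's exact construction): sample frequencies from the kernel's spectral density, here $\mathcal N(0, \sigma^{-2} I)$, and the inner products of the random cosine features approximate the kernel:

```python
import numpy as np

def random_fourier_features(X, sigma=1.0, n_features=2000, seed=0):
    """Map rows of X to random cosine features whose inner products
    approximate the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=n_features)           # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, sigma=1.5)
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / (2 * 1.5**2))
print(np.abs(K_approx - K_exact).max())  # small, and shrinks as n_features grows
```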

You can also do it using Taylor series: effectively the infinite version of Cotter, Keshet, and Srebro, Explicit Approximations of the Gaussian Kernel, arXiv:1109.4603.

Answered by Danica on December 21, 2021

You can obtain the explicit equation of $\phi$ for the Gaussian kernel via the Taylor series expansion of $e^x$. For notational simplicity, assume $x\in \mathbb{R}^1$:

$$\phi(x) = e^{-x^2/2\sigma^2} \Big[ 1, \sqrt{\frac{1}{1!\sigma^2}}x,\sqrt{\frac{1}{2!\sigma^4}}x^2,\sqrt{\frac{1}{3!\sigma^6}}x^3,\ldots\Big]^T$$

This is also discussed in more detail in these slides by Chih-Jen Lin of NTU (slide 11 specifically). Note that in the slides $\gamma=\frac{1}{2\sigma^2}$ is used as the kernel parameter.
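A short numerical check of this one-dimensional map (a sketch, truncating the series; numpy assumed):

```python
import math
import numpy as np

def phi_1d(x, sigma=1.0, n_max=25):
    """Truncated 1-d map: e^{-x^2/(2 sigma^2)} * [..., sqrt(1/(n! sigma^{2n})) x^n, ...]."""
    pre = math.exp(-x**2 / (2 * sigma**2))
    return np.array([pre * math.sqrt(1.0 / (math.factorial(n) * sigma**(2 * n))) * x**n
                     for n in range(n_max + 1)])

x, y, sigma = 0.8, -0.5, 1.2
print(phi_1d(x, sigma) @ phi_1d(y, sigma),
      math.exp(-(x - y)**2 / (2 * sigma**2)))  # both values match
```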

The equation in the OP only holds for the linear kernel.

Answered by Marc Claesen on December 21, 2021

It seems to me that your second equation will only be true if $\phi$ is a linear mapping (and hence $K$ is a linear kernel). As the Gaussian kernel is non-linear, the equality will not hold (except perhaps in the limit as $\sigma$ goes to zero).
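This can also be checked numerically without knowing $\phi$ explicitly, because the squared distance between the two sides of the OP's equation expands purely in kernel evaluations: with $z = \sum_i c_i x_i$, $$\Big\lVert \sum_i c_i \phi(x_i) - \phi(z) \Big\rVert^2 = \sum_{i,j} c_i c_j K(x_i, x_j) - 2 \sum_i c_i K(x_i, z) + K(z, z).$$ A minimal sketch (numpy assumed, with arbitrary example values):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))   # points x_i
c = rng.normal(size=4)        # coefficients c_i
z = c @ X                     # sum_i c_i x_i

# || sum_i c_i phi(x_i) - phi(z) ||^2 expanded via the kernel trick
sq_dist = (sum(c[i] * c[j] * gaussian_kernel(X[i], X[j]) for i in range(4) for j in range(4))
           - 2 * sum(c[i] * gaussian_kernel(X[i], z) for i in range(4))
           + gaussian_kernel(z, z))
print(sq_dist)  # clearly nonzero, so the equality fails for the Gaussian kernel
```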

Answered by Dikran Marsupial on December 21, 2021
