Support Vector Machine (SVM) From a Mathematical Perspective: Solving Optimization Problems
Support Vector Machine (SVM) is a classic algorithm in machine learning. This article focuses on the formula derivations behind SVM: the detailed reasoning for the margin distance formula and the formulation of the primal and dual problems. It delves into the underlying optimization problems, including constructing the Lagrangian function to handle constrained optimization and using the KKT conditions to find the optimal solution. It also covers the characteristics of the polynomial and Gaussian kernel functions.
Margin Distance Derivation
In Support Vector Machine (SVM), the equations for the positive and negative hyperplanes are respectively: $$ \vec{w} \cdot \vec{x} + b = 1 \quad \text{(Positive Hyperplane)} $$ $$ \vec{w} \cdot \vec{x} + b = -1 \quad \text{(Negative Hyperplane)} $$ where $\vec{w}=(w_1, w_2)$ is the weight vector, $b$ is the bias term, and $\vec{x}=(x_1, x_2)$ is the data point.
Assume $\vec{x_m}$ is a point on the positive hyperplane, and $\vec{x_n}$ is a point on the negative hyperplane, then: $$ w_1 x_{1m} + w_2 x_{2m} + b = 1 \quad \text{(1)} $$ $$ w_1 x_{1n} + w_2 x_{2n} + b = -1 \quad \text{(2)} $$
Subtracting equation (2) from equation (1), we get: $$ w_1 (x_{1m} - x_{1n}) + w_2 (x_{2m} - x_{2n}) = 2 $$ In vector form: $$ \vec{w} \cdot (\vec{x_m} - \vec{x_n}) = 2 \quad \text{(3)} $$ Consider two points $\vec{x_0}$ and $\vec{x_p}$ on the decision hyperplane, which satisfy the decision hyperplane equation $\vec{w} \cdot \vec{x} + b = 0$, i.e.: $$ w_1 x_{10} + w_2 x_{20} + b = 0 $$ $$ w_1 x_{1p} + w_2 x_{2p} + b = 0 $$ Subtracting these two equations gives: $$ w_1 (x_{10} - x_{1p}) + w_2 (x_{20} - x_{2p}) = 0 $$ In vector form: $$ \vec{w} \cdot (\vec{x_0} - \vec{x_p}) = 0 \quad \text{(4)} $$ Equation (4) indicates that $\vec{w}$ is perpendicular to the vector difference between any two points on the decision hyperplane.
Equation (3) says that the dot product of $\vec{w}$ and $(\vec{x_m} - \vec{x_n})$ is 2, and equation (4) says that $\vec{w}$ is perpendicular to the decision hyperplane (and hence to the parallel positive and negative hyperplanes). According to the definition of the vector dot product $\vec{a} \cdot \vec{b}=|\vec{a}| \cdot |\vec{b}| \cdot \cos \theta$, where $\theta$ is the angle between $\vec{w}$ and $(\vec{x_m} - \vec{x_n})$, we have: $$ |\vec{x_m} - \vec{x_n}| \cdot \cos \theta \cdot |\vec{w}| = 2 $$ Let $L = |\vec{x_m} - \vec{x_n}| \cdot \cos \theta$; this is the projection of $(\vec{x_m} - \vec{x_n})$ onto the direction of $\vec{w}$, i.e., the perpendicular distance between the positive and negative hyperplanes. Then: $$ L \cdot |\vec{w}| = 2 $$ Solving for $L$ gives: $$ L=\frac{2}{|\vec{w}|} $$
Here, $L$ is the margin distance of the SVM.
In deriving the margin distance, we utilized the geometric meaning of the vector dot product, i.e., $\vec{a} \cdot \vec{b}=|\vec{a}| \cdot |\vec{b}| \cdot \cos \theta$, where $\theta$ is the angle between the two vectors. Through this relationship, we transformed the dot product into a relationship involving vector magnitudes and angles, thus deriving the expression for the margin distance.
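To make the result concrete, below is a minimal sketch, assuming NumPy and scikit-learn are available, that fits a linear SVM on a tiny separable toy dataset and computes the margin $L = 2/|\vec{w}|$ from the learned weight vector. The data and the large value of C are illustrative choices, not part of the derivation above.

```python
# A minimal sketch (assuming NumPy and scikit-learn are installed): fit a linear
# SVM on a tiny separable toy dataset and compute the margin L = 2 / |w| from
# the learned weight vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin formulation used in the derivation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # weight vector w
b = clf.intercept_[0]              # bias term b
margin = 2.0 / np.linalg.norm(w)   # L = 2 / |w|

print("w =", w, "b =", b, "margin L =", margin)
```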
Dual Equivalence Proof
In linear Support Vector Machine (SVM), the primal problem is to find the weight vector $w$ and bias $b$ that minimize the objective function:
$$ \min_w f(w) = \frac{1}{2} |w|^2 $$
Here, $|w|^2$ represents the squared Euclidean ($L_2$) norm of the vector $w$. Minimizing $|w|^2$ maximizes the margin width $\frac{2}{|w|}$ derived above, which yields better generalization ability. This problem is subject to the following constraints:
$$ y_j (w^T x_j + b) - 1 \geq 0 $$
Here, $x_j$ is the $j$-th training sample, and $y_j$ is the corresponding label, taking values of +1 or -1. This ensures that all data points are correctly classified with a functional margin of at least 1, i.e., a geometric distance of at least $\frac{1}{|w|}$ from the decision boundary.
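As an illustration, the primal problem can be handed directly to a generic convex solver. The sketch below assumes the cvxpy package and a small toy dataset; it is only meant to show the objective and constraints in code, not a production SVM implementation.

```python
# A sketch of solving the primal problem directly as a quadratic program,
# assuming the cvxpy package is available; X and y are illustrative toy data.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))       # (1/2) |w|^2
constraints = [cp.multiply(y, X @ w + b) >= 1]         # y_j (w^T x_j + b) - 1 >= 0
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)                # roughly [1, 0] and 0
```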
To handle this constrained optimization problem, we construct the Lagrangian function:
$$ L(w, b, \alpha) = f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) $$
Here, $\alpha_j \geq 0$ are the Lagrange multipliers used to introduce the constraint conditions of the primal problem $g_j(w, b) = y_j (w^T x_j + b) - 1 \geq 0$.
Next, we define the dual function $q(\alpha)$ as:
$$ q(\alpha) = \min_{w, b} L(w, b, \alpha) = \min_{w, b} \left( f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) \right) $$
Let $(w^*, b^*)$ be the optimal solution of the primal problem and $w$ any feasible point. Since $\alpha_j \geq 0$ and $g_j(w^{*}, b^{*}) \geq 0$, we can derive:
$$ q(\alpha) = \min_{w, b} \left( f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) \right) \leq f(w^*) - \sum_{j = 1}^n \alpha_j g_j(w^*, b^*) \leq f(w^*) \leq f(w) $$
This means the dual function provides a lower bound for the primal problem. Next, we need to find an $\alpha^*$ such that:
$$ q(\alpha) \leq q(\alpha^*) \leq f(w^*) \leq f(w) $$
The primal and dual problems of SVM can be expressed as:
$$ \max_{\alpha} q(\alpha) = \max_{\alpha} \min_{w, b} L(w, b, \alpha) $$
With the constraint: $ \alpha_j \geq 0 $
Weak duality always holds, giving $q(\alpha^*) \leq f(w^*)$; when strong duality holds, i.e., when Slater’s condition is satisfied, we have $q(\alpha^*) = f(w^*)$. Slater’s condition requires the existence of a feasible point at which all inequality constraints are strictly satisfied; since linear SVM assumes the data are linearly separable, such a point exists (scale any separating $(w, b)$ until every functional margin exceeds 1), so Slater’s condition is automatically satisfied.
Thus, we have:
$$ f(w) \geq f(w^*) = q(\alpha^*) \geq q(\alpha) $$
From the above chain, we can deduce:
$$ q(\alpha^*) \geq q(\alpha) $$ $$ f(w^*) \leq f(w) $$
The primal problem minimizes $f(w)$, the dual problem maximizes $q(\alpha)$, and their optimal values coincide, i.e.:
$ w^*, b^* $ are the solutions to the primal problem, $\alpha^*$ is the solution to the dual problem, and $f(w^*) = q(\alpha^*)$.
We can see that in linear SVM, when the specific condition (Slater’s condition) is met, the solutions to the primal and dual problems coincide. This provides an effective way to tackle complex optimization problems: when the primal problem is difficult to solve directly, we can solve the dual problem instead.
Simple Example
To more intuitively understand the equivalence of the solutions to the primal and dual problems mentioned above, consider a simple optimization problem defined as follows:
The primal problem is: $$ \min_x f(x) = x^2 $$ With the constraint: $$ x - 1 \geq 0 $$
The goal of this problem is to minimize the function $f(x) = x^2$, while $x$ needs to satisfy $x \geq 1$. Intuitively, we know that when $x = 1$, $f(x) = 1$, which is the minimum value under the given constraint.
To verify duality, we construct the Lagrangian function $L(x, \alpha) = x^2 - \alpha(x - 1)$ and define the dual function:
$$ q(\alpha) = \min_x L(x, \alpha) = \min_x \left( x^2 - \alpha(x - 1) \right) $$
Here, $\alpha \geq 0$ is the Lagrange multiplier used to introduce the constraint condition $x - 1 \geq 0$ from the primal problem. By constructing the Lagrangian function, we convert the constrained optimization problem into an unconstrained problem.
Next, we take the partial derivative of $L(x, \alpha)$ with respect to $x$ and set it to zero:
$$ \frac{\partial L}{\partial x} = 2x - \alpha = 0 $$
Solving for $x$ gives:
$$ x = \frac{\alpha}{2} $$
Substituting $x = \frac{\alpha}{2}$ back into $L(x, \alpha)$ gives the dual function:
$$ q(\alpha) = - \frac{\alpha^2}{4} + \alpha $$
Now we have obtained the form of the dual function $q(\alpha)$. Next, we solve the dual problem $\max_{\alpha} q(\alpha)$ subject to $\alpha \geq 0$.
To do this, we take the derivative of $q(\alpha)$ with respect to $\alpha$ and set it to zero:
$$ \frac{dq}{d\alpha} = - \frac{\alpha}{2} + 1 = 0 $$
Solving for $\alpha$ gives $$ \alpha = 2 $$
Substituting $\alpha = 2$ into $x = \frac{\alpha}{2}$, we get: $$ x = 1 $$
At this point, substituting $\alpha = 2$ into $q(\alpha)$, we calculate:
$$ q(\alpha) = - \frac{2^2}{4} + 2 = 1 $$
Through this simple example, we can see that the solution to the primal problem $x = 1$, $f(x) = 1$, is equivalent to the solution to the dual problem $\alpha = 2$, $q(\alpha) = 1$. This verifies that under certain conditions, the solutions to the dual problem and the primal problem are consistent.
By applying duality theory, we not only found the solution to the primal problem but also obtained the same optimal value by solving the dual problem, confirming the equivalence of the two problems.
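As a quick sanity check, the sketch below (plain NumPy) evaluates the primal objective over its feasible region $x \geq 1$ and the dual function $q(\alpha)$ over $\alpha \geq 0$ on a grid, and confirms that both optima are 1 (at $x = 1$ and $\alpha = 2$); the grid ranges are arbitrary illustrative choices.

```python
# A small numeric check of the example above: scan the primal objective over its
# feasible region x >= 1 and the dual function over alpha >= 0, and confirm that
# both optima equal 1 (at x = 1 and alpha = 2).
import numpy as np

x = np.linspace(1.0, 5.0, 100001)        # feasible region of the primal problem
alpha = np.linspace(0.0, 5.0, 100001)    # feasible region of the dual problem

f = x ** 2                               # primal objective f(x) = x^2
q = -alpha ** 2 / 4 + alpha              # dual function q(alpha) = -alpha^2/4 + alpha

print("primal minimum:", f.min(), "at x =", x[f.argmin()])
print("dual maximum:  ", q.max(), "at alpha =", alpha[q.argmax()])
```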
Solving with KKT Conditions
SVM Satisfying KKT Conditions
The original optimization problem of SVM is a convex optimization problem. The objective function of SVM, $\frac{1}{2}|w|^2$, is a quadratic function and therefore convex in $w$. At the same time, the constraint condition $y_i(w \cdot x_i + b) \geq 1$ is linear (an affine constraint) and therefore also convex. In convex optimization problems, a local optimal solution is a global optimal solution, and, given a constraint qualification such as Slater’s condition (which holds here), the KKT conditions are both necessary and sufficient. This means that if a point satisfies the KKT conditions, it is a global optimal solution.
The objective function $\frac{1}{2}|w|^2$ is continuous and differentiable, and the constraint condition $y_i(w \cdot x_i + b) \geq 1$ is also continuous and differentiable. This smoothness ensures the existence and uniqueness of gradients, allowing the gradient conditions in the KKT conditions (i.e., taking partial derivatives with respect to $w$ and $b$ and setting them to zero) to be effectively applied.
In convex optimization problems, the KKT conditions are not only necessary conditions but also sufficient conditions. That is, if a point satisfies the KKT conditions, it must be a global optimal solution. For SVM, by solving the KKT conditions, we can find the optimal $w^*$ and $b^*$, thereby determining the best separating hyperplane.
Solving Linear Support Vector Machine Using KKT Conditions
The original SVM optimization problem is to minimize $\frac{1}{2}|w|^{2}$ while satisfying the constraint $y_{i}(w\cdot x_{i}+b)\geqslant1$, where $i = 1,2,\cdots,N$.
First, construct the Lagrangian function $L(w,b,\alpha)=\frac{1}{2}|w|^{2}-\sum_{i = 1}^{N}\alpha_{i}(y_{i}(w\cdot x_{i}+b)-1)$, where $\alpha_{i}\geqslant0$ are the Lagrange multipliers. According to the KKT conditions, we have:
$$ \nabla_{w}L(w^*,b^*,\alpha^*) = w^*-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}=0 $$
$$ \nabla_{b}L(w^*,b^*,\alpha^*)=-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}=0 $$
$$ \alpha_{i}^*(y_{i}(w^*\cdot x_{i}+b^*)-1)=0 $$
$$ y_{i}(w^*\cdot x_{i}+b^*)-1\geqslant0 $$
$$ \alpha_{i}^*\geqslant0 $$
These conditions apply to all $i = 1,2,\cdots,N$.
From $\nabla_{w}L(w^*,b^*,\alpha^*) = w^*-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}=0$, we can derive
$$ w^*=\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i} \quad \text{(5)} $$ Note that at least one $\alpha_{j}^*>0$ must exist: if all $\alpha_{i}^*$ were zero, equation (5) would give $w^*=0$, which cannot define a separating hyperplane, a contradiction.
To solve for $b^*$, pick an index $j$ with $\alpha_{j}^*>0$; the complementary slackness condition then forces $y_{j}(w^*\cdot x_{j}+b^*)-1 = 0$. Substituting $w^*=\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}$ and noting that $y_{j}^{2}=1$, we obtain:
$$ b^*=y_{j}-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x_{i}\cdot x_{j}) \quad \text{(6)} $$
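To see equations (5) and (6) in action, consider a two-point toy problem with $x_1 = (1, 0)$, $y_1 = +1$ and $x_2 = (-1, 0)$, $y_2 = -1$; maximizing the dual by hand for this toy problem gives $\alpha_1^* = \alpha_2^* = 0.5$. The sketch below (plain NumPy) plugs these values into (5) and (6) and checks that both functional margins equal 1.

```python
# A minimal NumPy sketch of equations (5) and (6) on a two-point toy problem:
# x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1, whose dual optimum is
# alpha1* = alpha2* = 0.5.
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])                  # dual solution for this toy problem

# Equation (5): w* = sum_i alpha_i* y_i x_i
w_star = (alpha * y) @ X

# Equation (6): b* = y_j - sum_i alpha_i* y_i (x_i . x_j), for any j with alpha_j* > 0
j = 0
b_star = y[j] - (alpha * y) @ (X @ X[j])

print("w* =", w_star)                         # [1. 0.]
print("b* =", b_star)                         # 0.0
print("margins:", y * (X @ w_star + b_star))  # both equal 1, as required
```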
Based on the above theory, the separating hyperplane can be expressed as:
$$ \sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x\cdot x_{i})+b^*=0 $$
Thus, the classification decision function can be written as:
$$ f(x)=\text{sign}(\sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x\cdot x_{i})+b^*) $$
In SVM, the complementary slackness condition $\alpha_i (y_i(w \cdot x_i + b) - 1) = 0$ indicates that if a sample point $x_i$ is not a support vector (i.e., $y_i(w \cdot x_i + b) > 1$), then the corresponding Lagrange multiplier $\alpha_i$ must be zero. Conversely, if a sample point is a support vector (i.e., $y_i(w \cdot x_i + b) = 1$), then the corresponding $\alpha_i$ can be non-zero. This condition ensures that only support vectors contribute to the solution of the optimization problem, simplifying the problem-solving process.
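The sketch below, assuming scikit-learn, illustrates complementary slackness on a toy dataset: the two outer points have functional margin strictly greater than 1, so their $\alpha_i$ are zero and they are not stored as support vectors, while `dual_coef_` holds $y_i \alpha_i$ only for the support vectors.

```python
# A sketch (assuming scikit-learn) of complementary slackness: the two outer
# points have functional margin greater than 1, so their alpha_i are zero and
# they are not stored as support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C approximates a hard margin

print("support vector indices:", clf.support_)           # only the two inner points
print("y_i * alpha_i for support vectors:", clf.dual_coef_)
print("functional margins:", y * (X @ clf.coef_[0] + clf.intercept_[0]))
```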
Polynomial and Gaussian Kernel Functions
If the problem at hand is not linearly separable, we can map the data to a higher-dimensional space in which it becomes linearly separable. However, performing calculations directly in a high-dimensional feature space can be very expensive. From equations (5) and (6), we know that the solution depends on the data only through inner products between data points, so we never need the mapped coordinates themselves. The role of the kernel function is to avoid explicitly performing the high-dimensional feature mapping: the kernel value, computed in the original feature space, equals the inner product in the high-dimensional feature space.
The Gaussian kernel function is a common kernel function, with the form: $$ K(x, y) = \exp\left(-\gamma |x - y|^2\right) $$
where $\gamma$ is a positive parameter that controls the width of the kernel function.
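A minimal NumPy sketch of this kernel follows; the inputs and the value of $\gamma$ are arbitrary illustrative choices.

```python
# A minimal NumPy sketch of the Gaussian kernel K(x, y) = exp(-gamma |x - y|^2);
# the inputs and gamma below are arbitrary illustrative values.
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel value between two vectors x and y."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(gaussian_kernel(x, y))    # exp(-0.5 * 5) ≈ 0.0821
```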
We can perform a Taylor expansion on the exponential function:
$$ \exp(z) = \sum_{k=0}^{\infty} \frac{z^k}{k!} $$
Substituting $ z = -\gamma |x - y|^2 $ into the above formula, we get:
$$ K(x, y) = \exp\left(-\gamma |x - y|^2\right) = \sum_{k=0}^{\infty} \frac{(-\gamma |x - y|^2)^k}{k!} $$
The polynomial kernel function has the form:
$$ K_{\text{poly}}(x, y) = (x \cdot y + c)^d $$
where $ c $ is a constant term, and $ d $ is the degree of the polynomial.
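To connect this back to the kernel trick described at the start of this section, the sketch below checks, for $c = 1$ and $d = 2$ in two dimensions, that the polynomial kernel value equals the inner product of an explicit quadratic feature map, even though the kernel computation never constructs that map; the feature map `phi` is one standard choice written out by hand.

```python
# A sketch of the kernel trick for the polynomial kernel with c = 1, d = 2 in two
# dimensions: the kernel value equals the inner product of an explicit quadratic
# feature map phi, yet the kernel side never constructs phi.
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d

def phi(x):
    """One explicit feature map whose inner product reproduces (x . y + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
print(poly_kernel(x, y))         # (1*3 + 2*1 + 1)^2 = 36.0
print(np.dot(phi(x), phi(y)))    # 36.0 as well, via the explicit map
```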
$|x - y|^2$ can be expanded as:
$$ |x - y|^2 = (x - y) \cdot (x - y) = x \cdot x + y \cdot y - 2 x \cdot y $$
Substituting this expression into the Taylor expansion of the Gaussian kernel function:
$$ K(x, y) = \sum_{k=0}^{\infty} \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^k}{k!} $$
We can see that each term $ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^k}{k!} $ is essentially a polynomial term, i.e., a combination of different powers of $x$ and $y$. Looking more closely, the Gaussian kernel is in fact a weighted combination of polynomial kernels of every order: writing $K(x, y) = \exp\left(-\gamma |x|^2\right)\exp\left(-\gamma |y|^2\right)\exp\left(2\gamma\, x \cdot y\right)$ and Taylor-expanding the last factor gives polynomial kernel terms $(x \cdot y)^k$ with weights $\frac{(2\gamma)^k}{k!}$, scaled by factors that depend only on $|x|^2$ and $|y|^2$. In this sense, the $k$-th term of the expansion contributes a weighted $k$-th order polynomial kernel.
For example, when $ k = 1 $:
$$ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^1}{1!} = -\gamma (x \cdot x + y \cdot y - 2 x \cdot y) $$
When $ k = 2 $:
$$ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^2}{2!} = \frac{\gamma^2 (x \cdot x + y \cdot y - 2 x \cdot y)^2}{2} $$
These terms are polynomial forms of $x$ and $y$, weighted by the reciprocal factorials $\frac{1}{k!}$.
The Gaussian kernel function can thus be viewed as a weighted combination of polynomial kernels of all orders, corresponding to an implicit infinite-dimensional feature space. This allows the Gaussian kernel to capture more complex nonlinear relationships in high-dimensional feature spaces, which is why it is a good default choice in many nonlinear task scenarios.
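As a small numeric illustration of the expansion above, the sketch below compares partial sums of the series with the exact Gaussian kernel value; the inputs and $\gamma$ are arbitrary.

```python
# A quick numeric illustration of the expansion above: partial sums of
# sum_k (-gamma |x - y|^2)^k / k! approach exp(-gamma |x - y|^2) as more terms
# are included; the inputs and gamma are arbitrary.
import numpy as np
from math import factorial

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
gamma = 0.5
z = -gamma * np.sum((x - y) ** 2)    # z = -gamma |x - y|^2

exact = np.exp(z)
for K in (1, 3, 5, 10, 20):
    partial = sum(z ** k / factorial(k) for k in range(K + 1))
    print(f"K = {K:2d}: partial sum = {partial: .6f}, exact = {exact:.6f}")
```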