This blog documents my notes on 16. Learning: Support Vector Machines.
I try to fill in the gaps in the explanations from Prof. Patrick Winston's lecture.
Support Vector Machine problem statement
Given observations from two classes (assumed linearly separable for the sake of understanding the problem), find the line that maximizes the margin (the width of the "street") between the + and - observations!
Decision Rule
For an unknown observation $\mathbf{u}$, we classify it as follows:

$$\mathbf{w}\cdot\mathbf{u} + b \ge 0 \;\rightarrow\; \text{classified as } +$$

$$\mathbf{w}\cdot\mathbf{u} + b < 0 \;\rightarrow\; \text{classified as } -$$
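As a quick illustration (with made-up $\mathbf{w}$ and $b$, not values produced by any fitting procedure), the rule is just a sign check:

```python
import numpy as np

# Hypothetical weights and bias, purely for illustration; in practice they
# come out of the optimization described below.
w = np.array([1.0, 2.0])
b = -0.5

def classify(u):
    """Return +1 if w.u + b >= 0, otherwise -1."""
    return 1 if np.dot(w, u) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))    # +1
print(classify(np.array([-2.0, -1.0])))  # -1
```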
How do we choose $\mathbf{w}$ and $b$? We need to understand how the margin is defined in SVM.
How is the margin defined?
Here, the margin does not simply mean the minimum distance between one + and one - observation. Multiple observations from both classes can contribute to defining the margin, and it is the SVM algorithm's job to find out how many (and which) of them do.
For now, think of it heuristically: we want to draw a line that separates the observations of the two classes as widely as possible.
Margin definition
In the linearly separable scenario, this line should place all the + observations on one side and all the - observations on the other, with a buffer:

$$\mathbf{w}\cdot\mathbf{x}_+ + b \ge 1$$

$$\mathbf{w}\cdot\mathbf{x}_- + b \le -1$$
We can write these two inequalities as a single inequality by introducing a label $y_i$:

$$y_i = +1 \text{ if } \mathbf{x}_i \text{ is a + observation}, \qquad y_i = -1 \text{ if } \mathbf{x}_i \text{ is a - observation}$$
Then the inequality can be written as:
$$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$$
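A minimal numerical check of this combined constraint, with assumed (not fitted) $\mathbf{w}$ and $b$:

```python
import numpy as np

def satisfies_margin(X, y, w, b):
    """Check y_i * (w . x_i + b) >= 1 for every observation."""
    return bool(np.all(y * (X @ w + b) >= 1))

# Toy data and assumed parameters, just to exercise the function.
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1, -1])
print(satisfies_margin(X, y, w=np.array([0.25, 0.25]), b=0.0))  # True
```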
Observations on the gutters
We also add a stricter constraint for those observations that lie exactly on the gutters. They must satisfy:

$$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) = 1$$
Optimization
The weights that maximize the width of the street are the same weights that minimize $\frac{1}{2}\|\mathbf{w}\|^2$ (see the short derivation after the constraints below), subject to:

$$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$$
while the observations on the gutters must satisfy:

$$y_i(\mathbf{w}\cdot\mathbf{x}_{\text{gutter}} + b) = 1$$
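Why minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the width: take one $\mathbf{x}_+$ and one $\mathbf{x}_-$ on their respective gutters and project their difference onto the unit normal $\mathbf{w}/\|\mathbf{w}\|$,

$$\text{width} = (\mathbf{x}_+ - \mathbf{x}_-)\cdot\frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{(1 - b) - (-1 - b)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$

using $\mathbf{w}\cdot\mathbf{x}_+ + b = 1$ and $\mathbf{w}\cdot\mathbf{x}_- + b = -1$. Maximizing $2/\|\mathbf{w}\|$ is therefore the same as minimizing $\|\mathbf{w}\|$; the square and the $\frac{1}{2}$ are only there for mathematical convenience.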
This is handled with Lagrange multipliers $\alpha_i$. The optimization problem becomes:

$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\right]$$
Fact:
- Only the $\alpha_i$ that belong to observations on the gutters (the support vectors) receive nonzero values in this optimization problem!
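This is easy to check empirically. A small sketch with scikit-learn's SVC (assuming scikit-learn is installed), using a very large C so the soft-margin solver approximates the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))    # + class
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))  # - class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

svc = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# Only a handful of the 40 points end up as support vectors (nonzero alpha_i).
print("support vector indices:", svc.support_)
print("alpha_i * y_i for support vectors:", svc.dual_coef_)

# The support vectors sit on the gutters: y_i (w.x_i + b) = 1 (up to tolerance).
print(y[svc.support_] * svc.decision_function(X[svc.support_]))
```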
Setting the first derivatives of $L$ to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i = 0 \quad\Rightarrow\quad \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N}\alpha_i y_i = 0 \quad\Rightarrow\quad \sum_{i=1}^{N}\alpha_i y_i = 0$$
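Both stationarity conditions can be sanity-checked numerically. A sketch again with scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])

svc = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_times_y = svc.dual_coef_[0]                      # alpha_i * y_i per support vector
w_from_alphas = alpha_times_y @ svc.support_vectors_   # sum_i alpha_i y_i x_i

print(w_from_alphas, svc.coef_[0])  # the two should match
print(alpha_times_y.sum())          # ~ 0, i.e. sum_i alpha_i y_i = 0
```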
$L(\alpha_i, i = 1, \cdots, N)$
Substitute these back into $L$ to see how it can be expressed purely in terms of the $\alpha_i$.
$$L(\alpha_i, i = 1, \cdots, N) = \frac{1}{2}\left(\sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i\right)^T\left(\sum_{j=1}^{N}\alpha_j y_j \mathbf{x}_j\right) - \sum_{i=1}^{N}\alpha_i y_i \left(\sum_{k=1}^{N}\alpha_k y_k \mathbf{x}_k\right)^T \mathbf{x}_i - b\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$

(the $b\sum_{i=1}^{N}\alpha_i y_i$ term vanishes because $\sum_{i=1}^{N}\alpha_i y_i = 0$)
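The resulting dual objective only needs the Gram matrix of pairwise dot products. A minimal sketch of evaluating it for given $\alpha$, $y$, $X$ (the names here are purely illustrative placeholders):

```python
import numpy as np

def dual_objective(alpha, y, X):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = X @ X.T  # Gram matrix of dot products x_i . x_j
    return alpha.sum() - 0.5 * np.sum(np.outer(alpha, alpha) * np.outer(y, y) * G)
```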
This representation shows that the observations $\mathbf{x}_i$ enter the objective only through dot products of pairs, $\mathbf{x}_i^T\mathbf{x}_j$.
Therefore, when you introduce a kernel to add complexity, the kernel only needs to define a dot-product-like similarity between pairs of observations, $K(\mathbf{x}_i, \mathbf{x}_j)$.
The decision rule also depends only on dot products: $\mathbf{w}\cdot\mathbf{u} + b = \sum_{i=1}^{N}\alpha_i y_i\,\mathbf{x}_i\cdot\mathbf{u} + b$.
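So swapping the dot product for a kernel gives a decision rule that only ever evaluates kernel values between stored observations and the query point. A sketch with an RBF kernel (the $\alpha_i$, $y_i$, and $b$ are assumed to come from a solver; they are not computed here):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), a dot product in a lifted feature space."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decide(u, alphas, y, X, b, kernel=rbf_kernel):
    """sum_i alpha_i y_i K(x_i, u) + b -- only pairwise kernel values are needed."""
    score = sum(a_i * y_i * kernel(x_i, u)
                for a_i, y_i, x_i in zip(alphas, y, X)) + b
    return 1 if score >= 0 else -1
```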