
Stanford Machine Learning: (5). Support Vector Machines (SVM)

Support Vector Machine (SVM) - Optimization objective

So far, we've seen a range of different algorithms

With supervised learning algorithms - performance is pretty similar

What often matters more is:

The amount of training data
Skill of applying algorithms

One final supervised learning algorithm that is widely used - support vector machine (SVM)

Compared to both logistic regression and neural networks, an SVM sometimes gives a cleaner way of learning non-linear functions
Later in the course we'll do a survey of different supervised learning algorithms

An alternative view of logistic regression

Start with logistic regression, see how we can modify it to get the SVM

As before, the logistic regression hypothesis is as follows





And the sigmoid activation function looks like this





In order to explain the math, we use z = θᵀx as defined above

What do we want logistic regression to do?

We have an example where y = 1

Then we hope hθ(x) is close to 1
With hθ(x) close to 1, θᵀx must be much larger than 0





Similarly, when y = 0

Then we hope hθ(x) is close to 0
With hθ(x) close to 0, θᵀx must be much less than 0

This is our classic view of logistic regression

Let's consider another way of thinking about the problem

Alternative view of logistic regression

If you look at the cost function, each example contributes a term like the one below to the overall cost function



For the overall cost function, we sum over all the training examples using the above function, and have a 1/m term

If you then plug in the hypothesis definition hθ(x), you get an expanded cost function equation:



So each training example contributes that term to the cost function for logistic regression

If y = 1 then only the first term in the objective matters

If we plot the functions vs. z we get the following graph





This plot shows the cost contribution of an example when y = 1 given z

So if z is big, the cost is low - this is good!
But if z is 0 or negative the cost contribution is high
This is why, when logistic regression sees a positive example, it tries to make θᵀx very large

If y = 0 then only the second term matters

We can again plot it and get a similar graph





Same deal, if z is small then the cost is low

But if z is large then the cost is massive

SVM cost functions from logistic regression cost functions

To build a SVM we must redefine our cost functions

When y = 1

Take the y = 1 function and create a new cost function
Instead of a curved line, create two straight lines (magenta) which act as an approximation to the logistic regression y = 1 function





Take the point z = 1 on the z axis

Flat (zero cost) from z = 1 onwards
Grows linearly as z drops below 1

This means we have two straight lines

A flat line where the cost is 0 (for z >= 1)
A straight growing line as z decreases below 1

So this is the new y=1 cost function

Gives the SVM a computational advantage and an easier optimization problem
We call this function cost1(z)

Similarly

When y = 0

Do the equivalent with the y=0 function plot





We call this function cost0(z)

So here we define the two cost function terms for our SVM graphically
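As a rough sketch (the piecewise-linear hinge form below is the usual choice; the exact slope of the growing segment is a convention the lecture doesn't pin down), the two cost terms could be coded as:

```python
import numpy as np

def cost1(z):
    """Cost term for a y = 1 example: zero once z >= 1, grows linearly below that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost term for a y = 0 example: zero once z <= -1, grows linearly above that."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost1(z))  # large for very negative z, 0 once z >= 1
print(cost0(z))  # 0 while z <= -1, large for very positive z
```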

How do we implement this?

The complete SVM cost function

As a comparison/reminder we have logistic regression below



If this looks unfamiliar it's because we previously had the minus sign outside the expression

For the SVM we take our two logistic regression y=1 and y=0 terms described previously and replace with

cost1(θT x)
cost0(θT x)

So we get



SVM notation is slightly different

By convention, SVM notation renames a few things here
1) Get rid of the 1/m terms

This is just a slightly different convention
By removing 1/m we should still get the same optimal values of θ

1/m is a constant, so should get same optimization
e.g. say you have a minimization problem whose minimum is at u = 5

If you multiply the cost function by a constant, the minimum is still at u = 5
The minimal value itself is different, but that's irrelevant because we only care about the minimizing parameters

2) For logistic regression we had two terms;

Training data set term (i.e. that we sum over m) = A
Regularization term (i.e. that we sum over n) = B

So we could describe it as A + λB
Need some way to deal with the trade-off between regularization and data set terms
Set different values for λ to parametrize this trade-off

Instead of parameterizing this as A + λB

For SVMs the convention is to use a different parameter called C
So do CA + B
If C were equal to 1/λ then the two objectives (CA + B and A + λB) would give the same optimal values of θ

So, our overall equation is



Unlike logistic regression, hθ(x) doesn't give us a probability; instead we get a direct prediction of 1 or 0

So if θᵀx is greater than or equal to 0 --> hθ(x) = 1
Else --> hθ(x) = 0
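A minimal numpy sketch of this objective and prediction rule, assuming X carries a leading column of ones for θ0 and reusing the hinge-style cost1/cost0 above (illustrative only; real SVM solvers don't minimize it this literally):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """C*A + B: C times the summed per-example cost terms, plus half the sum of theta_j^2 (j >= 1)."""
    z = X @ theta                      # X includes a leading column of ones for theta_0
    A = np.sum(y * np.maximum(0, 1 - z) + (1 - y) * np.maximum(0, 1 + z))
    B = 0.5 * np.sum(theta[1:] ** 2)   # regularization term, skipping theta_0
    return C * A + B

def svm_predict(theta, X):
    """Direct 1/0 prediction: 1 if theta^T x >= 0, else 0 (no probability is produced)."""
    return (X @ theta >= 0).astype(int)
```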

Large margin intuition

Sometimes people refer to SVM as large margin classifiers

We'll consider what that means and what an SVM hypothesis looks like
The SVM cost function is as above, and we've drawn out the cost terms below



Left is cost1 and right is cost0
What does it take to make these terms small?

If y = 1

cost1(z) = 0 only when z >= 1

If y = 0

cost0(z) = 0 only when z <= -1

Interesting property of SVM

If you have a positive example, you only really need z to be greater than or equal to 0

If this is the case then you predict 1

SVM wants a bit more than that - doesn't want to *just* get it right, but have the value be quite a bit bigger than zero

Throws in an extra safety margin factor

Logistic regression does something similar
What are the consequences of this?

Consider a case where we set C to be huge

C = 100,000
So considering we're minimizing CA + B

If C is huge we're going to pick an A value so that A is equal to zero
What is the optimization problem here - how do we make A = 0?

Making A = 0

If y = 1

Then to make our "A" term 0 we need to find a value of θ so θᵀx is greater than or equal to 1

Similarly, if y = 0

Then to make "A" = 0 we need to find a value of θ so θᵀx is less than or equal to -1

So, if we think of our optimization problem as a way to ensure that this first "A" term is equal to 0, we can re-factor it into just minimizing the "B" (regularization) term, because

When A = 0 --> A*C = 0

So we're minimizing B, under the constraints shown below





Turns out when you solve this problem you get interesting decision boundaries





The green and magenta lines are functional decision boundaries which could be chosen by logistic regression

But they probably don't generalize too well

The black line, by contrast, is the one chosen by the SVM because of the safety margin imposed by the optimization

More robust separator

Mathematically, that black line has a larger minimum distance (margin) from any of the training examples





By separating with the largest margin you incorporate robustness into your decision making process

We looked at this in the case where C is very large

The SVM is more sophisticated than this large margin view might suggest

If you were just using large margin then SVM would be very sensitive to outliers





You would risk letting a single ridiculous example hugely impact your classification boundary

A single example might not represent a good reason to change an algorithm
If C is very large then we do use this quite naive maximize-the-margin approach





So we'd change the boundary from the black line to the magenta line

But if C is reasonably small, or not too large, then you stick with the black decision boundary

What about non-linearly separable data?

Then the SVM still does the right thing if you use a normal sized C
So the idea of the SVM being a large margin classifier is only really relevant when the data is easily linearly separable and has no outliers

Means we ignore a few outliers
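To see the effect of C concretely, here is a small sketch using scikit-learn's linear SVM (the data and the outlier are made up; sklearn's C plays the same role as the C above):

```python
import numpy as np
from sklearn.svm import SVC

# Two separable clusters plus one outlier labelled with the lower-left class
X = np.array([[1, 1], [1.5, 1.2], [2, 1], [4, 4], [4.5, 4.2], [5, 4], [1.2, 3.8]])
y = np.array([0, 0, 0, 1, 1, 1, 0])   # the last point is the outlier

for C in (100000.0, 1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # with a huge C the boundary bends to accommodate the outlier; with a modest C it largely ignores it
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]:.2f}")
```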

Large margin classification mathematics (optional)

Vector inner products

Have two (2D) vectors u and v - what is the inner product uᵀv?





Plot u on a graph

i.e. u1 vs. u2





One property which is good to have is the norm of a vector

Written as ||u||

This is the euclidean length of vector u

So ||u|| = sqrt(u1² + u2²), which is a real number

i.e. length of the arrow above
Can show via Pythagoras

For the inner product, take v and orthogonally project down onto u

First we can plot v on the same axis in the same way (v1 vs. v2)
Measure the length/magnitude of the projection





So here, the green line is the projection

p = length along u to the intersection
p is the magnitude of the projection of vector v onto vector u

Possible to show that

uᵀv = p * ||u||

So this is one way to compute the inner product

uᵀv = u1·v1 + u2·v2
So therefore

p * ||u|| = u1·v1 + u2·v2
This is an important rule in linear algebra

We can reverse this too

So we could do

vᵀu = v1·u1 + v2·u2
Which would obviously give you the same number

p is negative if the angle between them is greater than 90 degrees





So here p is negative

Use the vector inner product theory to try and understand SVMs a little better
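A quick numerical check of this projection identity, with arbitrary made-up 2D vectors:

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)   # ||u|| = sqrt(u1^2 + u2^2)
p = (v @ u) / norm_u         # signed length of the projection of v onto u
print(u @ v)                 # u1*v1 + u2*v2 -> 10.0
print(p * norm_u)            # p * ||u||    -> 10.0, the same number
```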

SVM decision boundary





For the following explanation we make two simplifications

Set θ0 = 0 (i.e. ignore the intercept term)
Set n = 2 - features (x1, x2)

i.e. each example has only 2 features

Given we only have two parameters we can simplify our function to





And, can be re-written as





Should give same thing

We may notice that





The term in red is the norm of θ

If we take θ as a 2x1 vector
If we assume θ0 = 0 it's still true

So, finally, this means our optimization function can be re-defined as





So the SVM is minimizing the squared norm

Given this, what is the θᵀx term doing?

Given θ and a given example x, what is this equal to?

We can look at this in a comparable manner to how we just looked at u and v

Say we have a single positive training example (red cross below)





Although we haven't been thinking about examples as vectors, each example can be described as one





Now, say we have our parameter vector θ and we plot that on the same axis





The next question is what is the inner product of these two vectors





p is in fact p^(i), the length of the projection for example i

Given our previous discussion we know

θᵀx^(i) = p^(i) * ||θ|| = θ1·x1^(i) + θ2·x2^(i)

So these are both equally valid ways of computing θᵀx^(i)

What does this mean?

The constraints we defined earlier

θᵀx >= 1 if y = 1
θᵀx <= -1 if y = 0

Can be replaced/substituted with the constraints

p^(i) * ||θ|| >= 1 if y = 1
p^(i) * ||θ|| <= -1 if y = 0

Writing that into our optimization objective





So, given we've redefined these functions let us now consider the training example below





Given this data, what boundary will the SVM choose? Note that we're still assuming θ0 = 0, which means the boundary has to pass through the origin (0,0)

Green line - small margins





The SVM would not choose this line

The decision boundary comes very close to the training examples
Let's discuss why the SVM would not choose this decision boundary

Looking at this line

We can show that θ is at 90 degrees to the decision boundary





θ is always at 90 degrees to the decision boundary (can show with linear algebra, although we're not going to!)

So now let's look at what this implies for the optimization objective

Look at first example (x1)





Project a line from x1 onto the θ vector (so it hits at 90 degrees)

The distance between the intersection and the origin is p1

Similarly, look at the second example (x2)

Project a line from x2 onto the θ vector
This gives the magenta projection, which will be negative (p2)

If we overlay these two projections we see a graphical representation of what's going on:





We find that both these p values are going to be pretty small
If we look back at our optimization objective

We know we need p1 * ||θ|| to be bigger than or equal to 1 for positive examples

If p is small

This means that ||θ|| must be pretty large

Similarly, for negative examples we need p2 * ||θ|| to be less than or equal to -1

We saw in this example p2 is a small negative number

So ||θ|| must be a large number

Why is this a problem?

The optimization objective is trying to find a set of parameters where the norm of theta is small

So this doesn't seem like a good direction for the parameter vector (because as p values get smaller, ||θ|| must get larger to compensate)

So we should make the p values larger, which allows ||θ|| to become smaller

So let's choose a different boundary





Now if you look at the projection of the examples onto θ we find that p1 becomes large and ||θ|| can become small

So with some values drawn in





This means that by choosing this second decision boundary we can make ||θ|| smaller

Which is why the SVM chooses this hypothesis as better
This is how we generate the large margin effect





The magnitude of this margin is a function of the p values

So by maximizing these p values we minimize ||θ||

Finally, we did this derivation assuming θ0 = 0

If this is the case we're entertaining only decision boundaries which pass through the origin (0,0)
If you allow θ0 to take other values then this simply means you can have decision boundaries which cross the axes at points other than (0,0)
It can be shown with basically the same logic that, even when θ0 is non-zero, with the optimization objective described above (when C is very large) the SVM is still looking for a large margin separator between the classes

Kernels - 1: Adapting SVM to non-linear classifiers

What are kernels and how do we use them

We have a training set

We want to find a non-linear boundary





Come up with a complex set of polynomial features to fit the data

Have hθ(x) which

Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
Else returns 0

Another way of writing this (new notation) is

That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms
e.g.

hθ(x) = θ0 + θ1·f1 + θ2·f2 + θ3·f3
Where

f1 = x1
f2 = x1·x2
f3 = ...
i.e. not specific values, but each of the terms from your complex polynomial function

Is there a better choice of feature f than the high order polynomials?

As we saw with computer imaging, high order polynomials become computationally expensive

New features

Define three features in this example (ignore x0)
Have a graph of x1 vs. x2 (don't plot the values, just define the space)
Pick three points in that space





These points l1, l2, and l3 were chosen manually and are called landmarks

Given x, define f1 as the similarity between x and l1

f1 = similarity(x, l1) = exp(-||x - l1||² / (2σ²))

||x - l1||² is the squared euclidean distance between the point x and the landmark l1

Discussed more later

If we remember our statistics, we know that

σ is the standard deviation
σ² is commonly called the variance

Remember, that as discussed





So, f2 is defined as

f2 = similarity(x, l2) = exp(-||x - l2||² / (2σ²))

And similarly

f3 = similarity(x, l3) = exp(-||x - l3||² / (2σ²))

This similarity function is called a kernel

This particular function is a Gaussian kernel

So, instead of writing similarity between x and l we might write

f1 = k(x, l1)
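A sketch of this similarity function in numpy (sigma2 stands for the σ² parameter; the example points are made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma2=1.0):
    """k(x, l) = exp(-||x - l||^2 / (2 sigma^2)): 1 when x sits on the landmark, ~0 far away."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

l1 = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.0, 5.0]), l1))    # 1.0 - x is exactly at the landmark
print(gaussian_kernel(np.array([10.0, 10.0]), l1))  # ~0  - x is far from the landmark
```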

Diving deeper into the kernel

So let's see what these kernels do and why the functions we defined make sense

Say x is close to a landmark

Then the squared distance will be ~0

So





Which is basically e^(-0)

Which is close to 1

Say x is far from a landmark

Then the squared distance is big

Gives e^(-large number)

Which is close to zero

Each landmark defines a new feature

If we plot f1 as a function of x we get a plot like this

Notice that when x = [3,5] then f1 = 1
As x moves away from [3,5] then the feature takes on values close to zero
So this measures how close x is to this landmark





What does σ do?

σ² is a parameter of the Gaussian kernel

Defines the steepness of the rise around the landmark

Above example σ² = 1
Below σ² = 0.5





We see here that as you move away from 3,5 the feature f1 falls to zero much more rapidly

The inverse can be seen if σ² = 3





Given this definition, what kinds of hypotheses can we learn?

With training examples x we predict "1" when θ0 + θ1·f1 + θ2·f2 + θ3·f3 >= 0

For our example, let's say we've already run an algorithm and got the following parameters

θ0 = -0.5
θ1 = 1
θ2 = 1
θ3 = 0

Given our placement of the three landmarks, what happens if we evaluate an example at the magenta dot below?





Looking at our formula, we know f1 will be close to 1, but f2 and f3 will be close to 0

So if we look at the formula we have

θ0 + θ1·f1 + θ2·f2 + θ3·f3 = -0.5 + 1 + 0 + 0 = 0.5

0.5 is greater than or equal to 0, so we predict 1

If we had another point far away from all three landmarks





This equates to -0.5

So we predict 0

Considering our parameters, for points near l1 and l2 you predict 1, but for points near l3 you predict 0
Which means we create a non-linear decision boundary that looks something like this:





Inside we predict y = 1
Outside we predict y = 0

So this shows how we can create a non-linear boundary with landmarks and the kernel function in the support vector machine
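Putting the worked example into code (the θ values are the ones given above, but the landmark positions and test points are made-up stand-ins for the figure):

```python
import numpy as np

def gaussian_kernel(x, l, sigma2=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

landmarks = [np.array([3.0, 2.0]), np.array([3.0, 4.0]), np.array([1.0, 1.0])]  # l1, l2, l3 (illustrative)
theta = np.array([-0.5, 1.0, 1.0, 0.0])  # theta_0 .. theta_3 from the example above

def predict(x):
    f = np.array([1.0] + [gaussian_kernel(x, l) for l in landmarks])  # f0 = 1
    return int(theta @ f >= 0)

print(predict(np.array([3.0, 2.1])))  # near l1: theta^T f is roughly -0.5 + 1 = 0.5 >= 0 -> predict 1
print(predict(np.array([8.0, 8.0])))  # far from every landmark: roughly -0.5 < 0 -> predict 0
```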

But

How do we get/choose the landmarks?
What other kernels can we use (other than the Gaussian kernel)?

Kernels II

Filling in missing detail and practical implications regarding kernels
Spoke about picking landmarks manually, defining the kernel, and building a hypothesis function

Where do we get the landmarks from?
For complex problems we probably want lots of them

Choosing the landmarks

Take the training data
For each example place a landmark at exactly the same location
So end up with m landmarks

One landmark per training example
Means our features measure how close to a training set example something is

Given a new example, compute all the f values

Gives you a feature vector f (f0 to fm)

f0 = 1 always

A more detailed look at generating the f vector

If we had a training example (x^(i), y^(i)), the features we compute would be based on x^(i)

So we just cycle through each landmark, calculating how close x^(i) actually is to that landmark

f1^(i) = k(x^(i), l1)
f2^(i) = k(x^(i), l2)
...
fm^(i) = k(x^(i), lm)

Somewhere in the list we compare x to itself... (i.e. when we're at fi^(i))

So because we're using the Gaussian kernel this evaluates to 1

Take these m features (f1, f2 ... fm) and group them into an [m + 1 x 1] dimensional vector called f

f^(i) is the feature vector for the ith example
And add a 0th term f0 = 1
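As a sketch, building these f vectors when every training example is a landmark (done as one matrix operation rather than nested loops; X here is made-up data):

```python
import numpy as np

def gaussian_kernel_matrix(X, L, sigma2=1.0):
    """Entry (i, j) is k(x_i, l_j) for the Gaussian kernel, computed for all pairs at once."""
    sq_dists = np.sum(X ** 2, axis=1)[:, None] + np.sum(L ** 2, axis=1)[None, :] - 2 * X @ L.T
    return np.exp(-sq_dists / (2.0 * sigma2))

X = np.random.rand(5, 2)                       # m = 5 training examples, n = 2 original features
K = gaussian_kernel_matrix(X, X)               # landmarks are the training examples themselves
F = np.hstack([np.ones((X.shape[0], 1)), K])   # prepend f0 = 1, giving an m x (m + 1) matrix
print(F.shape)     # (5, 6)
print(np.diag(K))  # comparing each example with itself gives 1, as noted above
```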

Given these kernels, how do we use a support vector machine

SVM hypothesis prediction with kernels

Predict y = 1 if θᵀf >= 0

Because θ is an [m + 1 x 1] vector
And f is an [m + 1 x 1] vector

So, this is how you make a prediction assuming you already have θ

How do you get θ?

SVM training with kernels

Use the SVM learning algorithm



Now, we minimize using f as the feature vector instead of x
By solving this minimization problem you get the parameters for your SVM

In this setup, m = n

Because number of features is the number of training data examples we have
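In practice you hand this minimization to a library rather than coding it yourself; a sketch with scikit-learn's RBF-kernel SVM, whose gamma parameter corresponds to 1/(2σ²) (the ring-shaped data is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up non-linearly separable data: y = 1 inside the unit circle, y = 0 outside
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

sigma2 = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma2)).fit(X, y)
print(clf.score(X, y))  # the Gaussian kernel lets the SVM fit the circular boundary
```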

One final mathematical detail (not crucial to understand)

If we ignore θ0 then the following is true





What many implementations do is





Where the matrix M depends on the kernel you use
Gives a slightly different minimization - means we determine a rescaled version of θ
Allows more efficient computation, and scaling to much bigger training sets
If you have a training set with 10,000 examples, that means you get 10,000 features

Solving for all these parameters can become expensive
So by adding this in we avoid a for loop and use a matrix multiplication algorithm instead

You can apply kernels to other algorithms

But they tend to be very computationally expensive
But the SVM is far more efficient - so more practical

Lots of good off the shelf software to minimize this function

SVM parameters (C)

Bias and variance trade off
Must choose C

C plays a role similar to 1/λ (where λ is the regularization parameter)

Large C gives a hypothesis of low bias high variance --> overfitting
Small C gives a hypothesis of high bias low variance --> underfitting

SVM parameters (σ2)

Parameter for calculating f values

Large σ² - f features vary more smoothly - higher bias, lower variance
Small σ² - f features vary more abruptly - lower bias, higher variance
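A sketch of picking C and σ² by cross-validation to manage this bias/variance trade-off (again with made-up data, and gamma = 1/(2σ²) for scikit-learn's RBF kernel):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

sigma2_values = [0.1, 0.5, 1.0, 3.0]
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1.0 / (2 * s) for s in sigma2_values]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)  # larger C / smaller sigma^2 -> lower bias, higher variance
```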

SVM - implementation and use

So far spoken about SVM in a very abstract manner
What do you need to do this

Use SVM software packages (e.g. liblinear, libsvm) to solve parameters θ
Need to specify

Choice of parameter C
Choice of kernel

Choosing a kernel

We've looked at the Gaussian kernel

Need to define σ (σ2)

Discussed σ2

When would you choose a Gaussian?

If n is small and/or m is large

e.g. 2D training set that's large

If you're using a Gaussian kernel then you may need to implement the kernel function

e.g. a function

fi = kernel(x1,x2)

Returns a real number

Some SVM packages will expect you to define the kernel
Although, some SVM implementations include the Gaussian and a few others

Gaussian is probably most popular kernel

NB - make sure you perform feature scaling before using a Gaussian kernel

If you don't, features with a large value will dominate the f value
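A sketch combining both points: scale the features first, then either use the package's built-in Gaussian kernel or hand it your own kernel function (scikit-learn's SVC accepts a callable; the data and names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def my_gaussian_kernel(A, B, sigma2=1.0):
    """Return the matrix of exp(-||a - b||^2 / (2 sigma^2)) values between rows of A and B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2.0 * sigma2))

rng = np.random.default_rng(2)
X = np.hstack([rng.normal(size=(100, 1)), 1000 * rng.normal(size=(100, 1))])  # wildly different scales
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)  # without this, the large-valued feature dominates ||x - l||^2
clf = SVC(kernel=my_gaussian_kernel, C=1.0).fit(X_scaled, y)
print(clf.score(X_scaled, y))
```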

Could use no kernel - linear kernel

Predict y = 1 if (θT x) >= 0

So no f vector
Get a standard linear classifier

Why do this?

If n is large and m is small then

Lots of features, few examples
Not enough data - risk overfitting in a high dimensional feature-space

Other choice of kernel

Linear and Gaussian are most common
Not all similarity functions you develop are valid kernels

Must satisfy Mercer's Theorem
SVM implementations use numerical optimization tricks

These mean certain optimizations can be made, but the kernel must satisfy the theorem

Polynomial Kernel

We measure the similarity of x and l by doing one of

(xᵀl)²
(xᵀl)³
(xᵀl + 1)³

General form is

(xᵀl + Con)^D

If they're similar then the inner product tends to be large
Not used that often
Two parameters

Degree of polynomial (D)
Number you add to xᵀl (Con)

Usually performs worse than the Gaussian kernel
Used when x and l are both non-negative
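A sketch of the polynomial kernel as a function, passed to scikit-learn as a custom kernel (degree stands for D and con for the added constant; the data is made up and kept non-negative as suggested above):

```python
import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(A, B, degree=3, con=1.0):
    """k(x, l) = (x^T l + Con)^D - large when x and l point in similar directions."""
    return (A @ B.T + con) ** degree

rng = np.random.default_rng(3)
X = rng.random((50, 2))                      # non-negative features
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)

clf = SVC(kernel=polynomial_kernel, C=1.0).fit(X, y)
# roughly equivalent built-in: SVC(kernel="poly", degree=3, coef0=1.0, gamma=1.0)
print(clf.score(X, y))
```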

String kernel

Used if input is text strings
Use for text classification

Chi-squared kernel
Histogram intersection kernel

Multi-class classification for SVM

Many packages have built in multi-class classification packages
Otherwise use one-vs all method
Not a big issue

Logistic regression vs. SVM

When should you use SVM and when is logistic regression more applicable
If n (features) is large vs. m (training set)

e.g. text classification problem

Feature vector dimension is 10 000
Training set is 10 - 1000
Then use logistic regression or SVM with a linear kernel

If n is small and m is intermediate

n = 1 - 1000
m = 10 - 10 000
Gaussian kernel is good

If n is small and m is large

n = 1 - 1000
m = 50 000+

SVM will be slow to run with Gaussian kernel

In that case

Manually create or add more features
Use logistic regression or SVM with a linear kernel

Logistic regression and SVM with a linear kernel are pretty similar

Do similar things
Get similar performance

A lot of the SVM's power comes from using different kernels to learn complex non-linear functions
For all these regimes a well-designed NN should work

But for some of these problems a NN might be slower to train - a well implemented SVM would be faster

SVM has a convex optimization problem - so you get a global minimum
It's not always clear how to choose an algorithm

Often more important to get enough data
Designing new features
Debugging the algorithm

The SVM is widely perceived as a very powerful learning algorithm