
Machine Learning - II. Linear Regression with One Variable (Week 1)

2015-01-25 17:50
http://blog.csdn.net/pipisorry/article/details/43115525

Notes on Andrew Ng's Machine Learning course

Linear regression with one variable



Model representation

Example:



This is a regression problem (one type of supervised learning), and specifically univariate linear regression (linear regression with one variable).

Notation (terminology):

m = Number of training examples

x’s = “input” variable / features

y’s = “output” variable / “target” variable

e.g. (x, y) denotes a single training example, while (x^(i), y^(i)) denotes the i-th training example.

Model representation



h stands for hypothesis; h maps x's to y's (in other words, it is simply a function from x to y).
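As a minimal sketch (variable names are my own, not from the course notes), the hypothesis of univariate linear regression is just a one-line function:

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x: maps an input x to a predicted y."""
    return theta0 + theta1 * x

# With theta0 = 1.0 and theta1 = 0.5, the input x = 4 is mapped to 3.0.
print(hypothesis(1.0, 0.5, 4))  # 3.0
```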

Cost function

In the previous example, h is set to the expression shown in the figure below. What we need to do is figure out how to choose the parameter values θ0 and θ1.



We try to minimize the squared difference between the output of the hypothesis and the actual price of the house.

The mathematical definition of the cost function:





We define this function as follows (this J is one kind of cost function): J(θ0, θ1) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))².



Why do we minimize 1/(2m) times the sum of squared errors?

We minimize 1/(2m) times the sum. Putting the constant one-half in front just makes some of the math a little easier.
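Concretely, the one-half is there so that the factor of 2 produced by differentiating the square cancels (a standard calculus step, spelled out here for completeness):

```latex
\frac{\partial}{\partial \theta_1}\,\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^2
  = \bigl(h_\theta(x)-y\bigr)\cdot\frac{\partial h_\theta(x)}{\partial \theta_1}
  = \bigl(h_\theta(x)-y\bigr)\,x
```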

Why do we take the squares of the errors?


It turns out that the squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that also work fairly well, but the squared error cost function is probably the most commonly used one for regression problems.
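A small Python sketch of the squared error cost function defined above (the function name and toy data are mine, for illustration only):

```python
def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = 1/(2m) * sum over i of (h_theta(x_i) - y_i)^2."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        prediction = theta0 + theta1 * xi   # h_theta(x_i)
        total += (prediction - yi) ** 2     # squared error for example i
    return total / (2 * m)

# Toy data: x could be house sizes and y prices; the numbers are made up.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.5, 3.5]
print(compute_cost(x, y, 0.0, 1.0))  # cost of the line y = x on this data
```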

Cost function intuition I





{For simplicity, set θ0 = 0, i.e., the function h passes through the origin.}

Each value of θ1 corresponds to a different hypothesis, i.e., a different straight-line fit on the left. And for each value of θ1 we can derive a different value of J(θ1).



For example, θ1 = 1 corresponds to this straight line through the data. For each value of θ1 we wind up with a different value of J(θ1).
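To make the intuition concrete, the following sketch sweeps a few values of θ1 (with θ0 fixed at 0) and prints J(θ1) for each, mirroring the plot described above; the data are made up:

```python
def cost_theta1(x, y, theta1):
    """J(theta1) with theta0 fixed at 0, i.e. h(x) = theta1 * x."""
    m = len(x)
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Perfectly linear toy data y = x, so J(theta1) should be minimized at theta1 = 1.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]
for theta1 in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1:4.1f}  ->  J(theta1) = {cost_theta1(x, y, theta1):.3f}")
```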

Cost function intuition II

Here we keep both parameters, θ0 and θ1.

The cost function shown as a 3D surface plot (how the graph of h(x) and the cost function J change for different values of θ0 and θ1):





The cost function shown as a contour plot (contour figure):





Gradient descent

(Used here for minimizing the cost function J of linear regression, but it can minimize other functions as well, not just this J.)

Problem:



Solution:



The gradient descent algorithm: repeat until convergence { θ_j := θ_j − α · ∂J(θ0, θ1)/∂θ_j } for j = 0 and j = 1, updating both parameters simultaneously.





Notation:



A := B means we set A to the value of B; it is a computer (assignment) operation where you set the value of A. A := A + 1 means take A and increase its value by one.

A = B is a truth assertion: it asserts that the value of A equals the value of B, i.e., it simply claims that the values of A and B are the same. We would never write A = A + 1, because that is simply false.

Note: compute the update term for both θ0 and θ1, and then update θ0 and θ1 simultaneously, at the same time.

The difference between the two versions (left vs. right): in the right-hand version, if you have already updated θ0, then you are using the new value of θ0 to compute the derivative term, which gives a different value of temp1 than the left-hand version, because the new value of θ0 has been plugged into the equation.

If you implement the non-simultaneous update, it will probably still work, but the algorithm on the right is not what people refer to as gradient descent; it is some other algorithm with different properties, and for various reasons it can behave in slightly stranger ways.
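A short sketch of what the simultaneous update looks like in code; the derivative helpers below are dummy placeholders of my own, only there to make the snippet runnable:

```python
# Stand-ins for the partial derivatives of J at (theta0, theta1); the real
# formulas for linear regression are derived later in these notes.
def d_theta0(t0, t1): return t0 - 1.0   # placeholder derivative w.r.t. theta0
def d_theta1(t0, t1): return t1 - 2.0   # placeholder derivative w.r.t. theta1

alpha = 0.1
theta0, theta1 = 0.0, 0.0

# Correct: compute BOTH update terms from the old values, then assign together.
temp0 = theta0 - alpha * d_theta0(theta0, theta1)
temp1 = theta1 - alpha * d_theta1(theta0, theta1)
theta0, theta1 = temp0, temp1
print(theta0, theta1)

# Incorrect (non-simultaneous): theta0 is overwritten first, so the second
# derivative would be evaluated at the NEW theta0 -- a different algorithm.
# theta0 = theta0 - alpha * d_theta0(theta0, theta1)
# theta1 = theta1 - alpha * d_theta1(theta0, theta1)
```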

Gradient descent intuition

Here we assume θ0 = 0.

The algorithm changes θ1 a little bit at each step.



Suppose you initialize θ1 at a local minimum. It turns out that at a local optimum the derivative is equal to zero, so the update leaves θ1 unchanged.

How the choice of α affects the cost function:





Two other situations that can arise when α is set too large:



How should the value of α be chosen?

Just try running gradient descent with a range of values for α, such as 0.001, 0.01, ..., plot J(θ) as a function of the number of iterations, and then pick the value of α that seems to cause J(θ) to decrease most rapidly.

Try out gradient descent with each candidate value of α about 3× larger than the previous one (see the sketch below).
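A compact sketch of such a sweep; the data, step count, and helper names are mine, and the inner loop is just the linear-regression update derived later in these notes:

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.0, 3.5, 4.0]
m = len(x)

def cost(t0, t1):
    return sum((t0 + t1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

def run_gd(alpha, steps=50):
    """Run gradient descent from (0, 0) and record J after every step."""
    t0 = t1 = 0.0
    history = []
    for _ in range(steps):
        d0 = sum((t0 + t1 * xi - yi) for xi, yi in zip(x, y)) / m
        d1 = sum((t0 + t1 * xi - yi) * xi for xi, yi in zip(x, y)) / m
        t0, t1 = t0 - alpha * d0, t1 - alpha * d1   # simultaneous update
        history.append(cost(t0, t1))
    return history

# Candidate learning rates, each about 3x the previous one.  A good alpha makes
# the final J small; a too-large alpha makes J grow (diverge) instead.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(f"alpha = {alpha:5.3f}  ->  J after 50 steps = {run_gd(alpha)[-1]:.6g}")
```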



Why the step size shrinks automatically during gradient descent (without changing α):

The derivative here is even smaller than it was at the green point (in the figure above).

As gradient descent runs, it automatically takes smaller and smaller steps, until eventually it is taking very small steps. So there is actually no need to decrease α over time.



Gradient descent algorithm for linear regression

Apply gradient descent to minimize our squared error cost function in linear regression.

(Applying gradient descent to linear regression.)



Take the partial derivatives of the cost function of the linear regression model: ∂J/∂θ0 = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) and ∂J/∂θ1 = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i).



Substitute the partial derivatives into the gradient descent algorithm: θ0 := θ0 − α (1/m) Σ (h_θ(x^(i)) − y^(i)) and θ1 := θ1 − α (1/m) Σ (h_θ(x^(i)) − y^(i)) · x^(i), with both parameters updated simultaneously.
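Putting the pieces together, here is a sketch of batch gradient descent for univariate linear regression using the update rules above; the data, learning rate, and iteration count are illustrative only:

```python
def gradient_descent(x, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for the model h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0 = theta1 = 0.0
    for _ in range(iterations):
        # Partial derivatives of J(theta0, theta1), summed over ALL m examples.
        d0 = sum((theta0 + theta1 * xi - yi) for xi, yi in zip(x, y)) / m
        d1 = sum((theta0 + theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

# Toy data lying roughly on y = 1 + 2x; the fit should come out close to that.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.0, 11.1]
print(gradient_descent(x, y, alpha=0.05, iterations=2000))
```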



Local optima

(Applying gradient descent to linear regression does not suffer from the local-minima problem.)

One of the issues with gradient descent in general is that it can be susceptible to local optima.

Depending on where you initialize, you can end up at different local optima.

But it turns out that the cost function for linear regression is always going to be a bow-shaped function like this.



The technical term for this is that this is called a convex function.

Informally, a convex function is a bow-shaped function; it has no local optima other than the single global optimum.

So when you run gradient descent on this type of cost function, which is what you get whenever you use linear regression, it will always converge to the global optimum, because there are no local optima other than the global optimum.



The iterative process of the algorithm


(As shown in the figure on the right, we descend along the negative gradient direction until we reach the minimum; the figure on the left shows the hypothesis h(x) corresponding to the current θ values.)



Batch gradient descent

(Another name for the gradient descent method above.)

The algorithm we just went over is sometimes called batch gradient descent; the name refers to the fact that every step of gradient descent looks at all of the training examples.



For stochastic gradient descent, see: Machine Learning - XVII. Large Scale Machine Learning (Week 10).

Gradient descent vs. the normal equation method

(Solve directly for the minimum of the cost function, instead of using gradient descent.)

There exists a method for solving numerically for the minimum of the cost function J without needing an iterative algorithm like gradient descent that has to iterate many times. There is also no longer a learning rate α that you need to worry about and set, so it can be much faster for some problems.

But gradient descent scales better to larger data sets than the normal equation method.

When the amount of data is small, we can take the partial derivatives of the cost function, set them to zero as a system of equations, and solve for each θ directly (just as in the method of least squares);

but when the amount of data is large (as in the figure below, with many features we need many θ parameters and a more complex function J), the system of equations becomes too large to solve in practice, and we instead use gradient descent to solve the problem iteratively.
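For comparison, a sketch of the direct solution with NumPy's linear solver; the design matrix gets a leading column of ones so that θ0 acts as the intercept (data are the same toy numbers as in the earlier sketch):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])   # roughly y = 1 + 2x

# Design matrix with a leading column of ones for the intercept theta0.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y, solved without explicit inversion.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [1.0, 2.0]

# Exact and free of any learning rate, but solving this system gets expensive
# as the number of features grows, which is where gradient descent scales better.
```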



Linear Regression with Multiple Variables

Linear algebra



(The benefits of using linear algebra notation in machine learning.)

Linear algebra gives us a notation and a set of operations that we can perform on matrices and vectors.

Linear algebra isn't just useful for linear regression models; these ideas of matrices and vectors will help us implement computationally efficient versions of many machine learning models. Matrices and vectors also give us an efficient way to organize large amounts of data when we work with larger training sets.
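As a small illustration of why this matters, the hypothesis and the cost for all training examples can be computed with one matrix-vector product instead of an explicit loop (a sketch with made-up numbers):

```python
import numpy as np

# m training examples, each row starting with a 1 for the intercept term.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.5, 3.5])
theta = np.array([0.5, 1.0])            # [theta0, theta1]

predictions = X @ theta                 # h_theta(x) for every example at once
errors = predictions - y
J = (errors @ errors) / (2 * len(y))    # vectorized squared error cost
print(predictions, J)
```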


ref: 《10 types of regressions. Which one to use?》 — an introduction to ten types of regression and how to choose among them