Stanford Machine Learning Course CS229: Notes on the Mathematical Derivations
2015-06-09 23:46
A preliminary fact: if $x$ is a column vector, then $x^Tx = \langle x, x \rangle = \|x\|_2^2 = \operatorname{tr}(x^Tx)$, since $x^Tx$ is a real number.
Linear regression
If $A$ and $B$ are square matrices, and $a$ is a real number, the following trace and gradient identities hold:

$$\begin{aligned}
\operatorname{tr}ABC &= \operatorname{tr}CAB = \operatorname{tr}BCA \\
\operatorname{tr}ABCD &= \operatorname{tr}DABC = \operatorname{tr}CDAB = \operatorname{tr}BCDA \\
\operatorname{tr}A &= \operatorname{tr}A^T \\
\operatorname{tr}(A+B) &= \operatorname{tr}A + \operatorname{tr}B \\
\operatorname{tr}\,aA &= a\operatorname{tr}A \\
\nabla_A \operatorname{tr}AB &= B^T \\
\nabla_{A^T} f(A) &= \left(\nabla_A f(A)\right)^T \\
\nabla_{A^T} \operatorname{tr}AB &= B \\
\nabla_A \operatorname{tr}ABA^TC &= CAB + C^TAB^T \\
\nabla_{A^T} \operatorname{tr}ABA^TC &= B^TA^TC^T + BA^TC \\
\text{if } C = I, \text{ then } \nabla_{A^T} \operatorname{tr}ABA^T &= B^TA^T + BA^T \\
\nabla_A |A| &= |A|(A^{-1})^T
\end{aligned}$$

Using these identities, the least-squares gradient is

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \tfrac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) \\
&= \tfrac{1}{2}\nabla_\theta \left(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\right) \\
&= \tfrac{1}{2}\nabla_\theta \operatorname{tr}\left(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\right) \\
&= \tfrac{1}{2}\nabla_\theta \left(\operatorname{tr}\theta^TX^TX\theta - 2\operatorname{tr}\vec{y}^TX\theta\right) \\
&= \tfrac{1}{2}\left(X^TX\theta + X^TX\theta - 2X^T\vec{y}\right) \\
&= X^TX\theta - X^T\vec{y}
\end{aligned}$$

Setting the gradient to zero gives the normal equations $X^TX\theta = X^T\vec{y}$, hence $\theta = (X^TX)^{-1}X^T\vec{y}$.
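The closed form can be checked numerically. Below is a minimal sketch (NumPy; the toy design matrix and targets are hypothetical) that solves the normal equations and verifies that the gradient vanishes at the solution:

```python
import numpy as np

# Hypothetical toy design matrix (intercept column plus two features) and targets.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 4.0, 1.0]])
y = np.array([1.0, 2.0, 4.0, 4.0, 6.0])

# Normal equations X^T X theta = X^T y; solving the linear system is
# numerically preferable to forming (X^T X)^{-1} explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
grad = X.T @ X @ theta - X.T @ y   # gradient X^T X theta - X^T y, zero at the minimizer
```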
Locally weighted linear regression
$$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right), \qquad X^TWX\theta = X^TW\vec{y}, \qquad \theta = (X^TWX)^{-1}X^TW\vec{y}$$
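The weighted normal equations above can be sketched directly (NumPy; the data, query point, and bandwidth $\tau$ are hypothetical):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Predict at one query point by solving X^T W X theta = X^T W y,
    with diagonal weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2))."""
    diffs = X - x_query
    w = np.exp(-np.sum(diffs**2, axis=1) / (2.0 * tau**2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Hypothetical data lying exactly on y = 2x (first column is the intercept).
X = np.array([[1.0, x] for x in np.linspace(0.0, 4.0, 9)])
y = 2.0 * X[:, 1]
pred = lwr_predict(X, y, np.array([1.0, 2.0]))
```

Since the data is exactly linear, the locally weighted fit recovers the global line and the prediction at $x = 2$ is $4$.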
Newton’s method:
$$\theta := \theta - H^{-1}\nabla_\theta \ell(\theta), \qquad H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}.$$
Fitting logistic regression (including the locally weighted variant) uses Newton's method on the log-likelihood. The log-likelihood function for logistic regression is

$$\ell(\theta) = \sum_{i=1}^m y^{(i)}\log h(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h(x^{(i)})\right)$$
The Hessian is negative semi-definite: for any vector $z$, it holds true that $z^THz \le 0$. Indeed,

$$\begin{aligned}
\frac{\partial \ell(\theta)}{\partial \theta_k} &= \sum_{i=1}^m \left(y^{(i)} - h(x^{(i)})\right)x_k^{(i)} \\
H_{kl} = \frac{\partial^2 \ell(\theta)}{\partial\theta_k\,\partial\theta_l} &= \sum_{i=1}^m -\frac{\partial h(x^{(i)})}{\partial\theta_l}\,x_k^{(i)} = \sum_{i=1}^m -h(x^{(i)})\left(1 - h(x^{(i)})\right)x_l^{(i)}x_k^{(i)}
\end{aligned}$$
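The Newton update with this gradient and Hessian can be sketched as follows (NumPy; the toy dataset is hypothetical and deliberately not linearly separable, so the maximizer is finite):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=10):
    """Maximize the logistic log-likelihood with theta := theta - H^{-1} grad."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                 # dl/dtheta_k from the sum above
        H = -(X.T * (h * (1.0 - h))) @ X     # H_kl; negative definite here
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Hypothetical toy data (intercept feature plus one input), not separable.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = logistic_newton(X, y)
grad_at_opt = X.T @ (y - sigmoid(X @ theta))  # should be ~0 after convergence
```

Newton's method converges quadratically near the optimum, so a handful of iterations drives the gradient to machine precision on this small concave problem.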
The Exponential family
We say that a class of distributions is in the exponential family if it can be written in the form

$$p(y;\eta) = b(y)\exp\!\left(\eta^T T(y) - a(\eta)\right).$$
Jensen's inequality
Suppose we start with the inequality in the basic definition of a convex function
$$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y) \quad \text{for } 0 \le \theta \le 1.$$
Using induction, this can be fairly easily extended to convex combinations of more than two points,
$$f\!\left(\sum_{i=1}^k \theta_i x_i\right) \le \sum_{i=1}^k \theta_i f(x_i) \quad \text{for } \sum_{i=1}^k \theta_i = 1,\ \theta_i \ge 0\ \forall i.$$
In fact, this can also be extended to infinite sums or integrals. In the latter case, the inequality can be written as
$$f\!\left(\int p(x)\,x\,dx\right) \le \int p(x)f(x)\,dx \quad \text{for } \int p(x)\,dx = 1,\ p(x) \ge 0\ \forall x.$$
Because p(x) integrates to 1, it is common to consider it as a probability density, in which case the previous equation can be written in terms of expectations,
$$f(\mathrm{E}[x]) \le \mathrm{E}[f(x)]$$
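A quick numerical illustration of the expectation form, using the convex function $f(x) = x^2$, for which the Jensen gap $\mathrm{E}[x^2] - (\mathrm{E}[x])^2$ is exactly the variance (the sampling distribution here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # an arbitrary distribution

f = np.square                                  # a convex function
lhs = f(np.mean(x))                            # f(E[x])
rhs = np.mean(f(x))                            # E[f(x)]
gap = rhs - lhs                                # equals the (biased) sample variance
```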
Learning theory
Lemma. (The union bound). Let A1,A2,…,Ak be k different events (that may not be independent). Then
$$P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)$$
Lemma. (Hoeffding inequality) Let $Z_1,\dots,Z_m$ be $m$ independent and identically distributed (iid) random variables drawn from a $\mathrm{Bernoulli}(\phi)$ distribution. Let $\hat\phi = (1/m)\sum_{i=1}^m Z_i$ be the mean of these random variables, and let any $\gamma > 0$ be fixed. Then
$$P\left(|\phi - \hat\phi| > \gamma\right) \le 2\exp(-2\gamma^2 m)$$
This lemma (which in learning theory is also called the Chernoff bound) says that if we take $\hat\phi$ (the average of $m$ $\mathrm{Bernoulli}(\phi)$ random variables) to be our estimate of $\phi$, then the probability of it being far from the true value is small, so long as $m$ is large.
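A small simulation (with hypothetical parameter choices) comparing the empirical tail probability against the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(42)
phi, m, gamma = 0.3, 500, 0.05           # hypothetical parameter choices
n_trials = 20_000

# Each trial: phi_hat = mean of m Bernoulli(phi) draws, via one Binomial sample.
phi_hat = rng.binomial(n=m, p=phi, size=n_trials) / m
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
hoeffding_bound = 2.0 * np.exp(-2.0 * gamma**2 * m)
```

The bound is distribution-free, so the empirical tail probability typically sits well below it.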
$$\hat h = \arg\min_{h\in\mathcal{H}} \hat\epsilon(h)$$
For a hypothesis $h$, we define the training error (also called the empirical risk or empirical error in learning theory) to be

$$\hat\epsilon(h) = \frac{1}{m}\sum_{i=1}^m 1\{h(x^{(i)}) \ne y^{(i)}\}$$
$$\epsilon(h) = P_{(x,y)\sim\mathcal{D}}\left(h(x) \ne y\right).$$
For a fixed $h_i$, defining $Z_j = 1\{h_i(x^{(j)}) \ne y^{(j)}\}$, the training error can be written as

$$\hat\epsilon(h_i) = \frac{1}{m}\sum_{j=1}^m Z_j.$$
$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma, \qquad \text{where } h^* = \arg\min_{h\in\mathcal{H}}\epsilon(h).$$
Theorem. Let $|\mathcal{H}| = k$, and let any $m, \delta$ be fixed. Then with probability at least $1 - \delta$, we have that

$$\epsilon(\hat h) \le \left(\min_{h\in\mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$
Corollary. Let $|\mathcal{H}| = k$, and let any $\gamma, \delta$ be fixed. Then for $\epsilon(\hat h) \le \min_{h\in\mathcal{H}} \epsilon(h) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that

$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\!\left(\frac{1}{\gamma^2}\log\frac{k}{\delta}\right).$$
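The corollary turns directly into a sample-size calculator; a sketch with hypothetical values of $k$, $\gamma$, and $\delta$ (note how doubling $k$ adds only a logarithmic amount):

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest integer m satisfying m >= (1 / (2 gamma^2)) * log(2 k / delta)."""
    return math.ceil(math.log(2.0 * k / delta) / (2.0 * gamma**2))

# Hypothetical values: k hypotheses, accuracy gamma, failure probability delta.
m_needed = sample_complexity(k=10_000, gamma=0.05, delta=0.01)
doubled_k = sample_complexity(k=20_000, gamma=0.05, delta=0.01)
```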
Factor Analysis:
Marginals and conditionals of Gaussians. Suppose $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ with mean $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$. Then

$$\mathrm{Cov}(x) = \Sigma = \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix} = \mathrm{E}\left[(x-\mu)(x-\mu)^T\right] = \mathrm{E}\left[\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}^T\right] = \mathrm{E}\begin{bmatrix}(x_1-\mu_1)(x_1-\mu_1)^T & (x_1-\mu_1)(x_2-\mu_2)^T\\ (x_2-\mu_2)(x_1-\mu_1)^T & (x_2-\mu_2)(x_2-\mu_2)^T\end{bmatrix}$$
The conditional distribution of $x_1$ given $x_2$ has mean and covariance

$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
To derive these conditionals, we define $V \in \mathbb{R}^{(m+n)\times(m+n)}$ with $V = \Sigma^{-1}$, and we need the block-inversion lemma below:

$$V = \begin{bmatrix}V_{AA} & V_{AB}\\ V_{BA} & V_{BB}\end{bmatrix} = \Sigma^{-1}, \qquad \begin{bmatrix}A & B\\ C & D\end{bmatrix}^{-1} = \begin{bmatrix}M^{-1} & -M^{-1}BD^{-1}\\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1}\end{bmatrix}$$
where $M = A - BD^{-1}C$. Using this formula, it follows that

$$\begin{bmatrix}\Sigma_{AA} & \Sigma_{AB}\\ \Sigma_{BA} & \Sigma_{BB}\end{bmatrix} = \begin{bmatrix}V_{AA} & V_{AB}\\ V_{BA} & V_{BB}\end{bmatrix}^{-1} = \begin{bmatrix}\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1} & -\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1}V_{AB}V_{BB}^{-1}\\ -V_{BB}^{-1}V_{BA}\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1} & \left(V_{BB} - V_{BA}V_{AA}^{-1}V_{AB}\right)^{-1}\end{bmatrix}$$
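One consequence worth checking numerically: combining the lemma with the conditional-covariance formula shows that the Schur complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ equals the inverse of the top-left block of $\Sigma^{-1}$. A quick check on a random positive-definite matrix (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)        # random symmetric positive definite

S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# Schur complement of Sigma_22 (the conditional covariance Sigma_{1|2}).
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

# Per the block-inversion lemma, this equals inv of the top-left block of V.
V = np.linalg.inv(Sigma)
check = np.linalg.inv(V[:2, :2])
```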
And the "completion of squares" trick. Consider the quadratic function $\frac{1}{2}z^TAz + b^Tz + c$, where $A$ is a symmetric, nonsingular matrix. Then one can verify directly that

$$\frac{1}{2}z^TAz + b^Tz + c = \frac{1}{2}\left(z + A^{-1}b\right)^TA\left(z + A^{-1}b\right) + c - \frac{1}{2}b^TA^{-1}b.$$
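The identity is easy to verify numerically on random inputs (a sketch; all values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)            # symmetric and nonsingular
b = rng.standard_normal(n)
c = 1.5
z = rng.standard_normal(n)

Ainv_b = np.linalg.solve(A, b)     # A^{-1} b without forming the inverse
lhs = 0.5 * z @ A @ z + b @ z + c
rhs = 0.5 * (z + Ainv_b) @ A @ (z + Ainv_b) + c - 0.5 * b @ Ainv_b
```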
EM for factor analysis
In the factor analysis model, we posit a joint distribution on $(x, z)$ as follows, where $z \in \mathbb{R}^k$ is a latent random variable:
$$z \sim \mathcal{N}(0, I), \qquad x|z \sim \mathcal{N}(\mu + \Lambda z, \Psi).$$
Here, the parameters of our model are the vector $\mu \in \mathbb{R}^n$, the matrix $\Lambda \in \mathbb{R}^{n\times k}$, and the diagonal matrix $\Psi \in \mathbb{R}^{n\times n}$. The value of $k$ is usually chosen to be smaller than $n$.
Thus, we imagine that each datapoint $x^{(i)}$ is generated by first sampling $z^{(i)}$ from a $k$-dimensional multivariate Gaussian. It is then mapped to a $k$-dimensional affine subspace of $\mathbb{R}^n$ by computing $\mu + \Lambda z^{(i)}$. Lastly, $x^{(i)}$ is generated by adding noise with covariance $\Psi$ to $\mu + \Lambda z^{(i)}$.
Equivalently,

$$z \sim \mathcal{N}(0, I), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad x = \mu + \Lambda z + \epsilon,$$
where ϵ and z are independent.
$$\begin{bmatrix}z\\ x\end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix}\vec{0}\\ \mu\end{bmatrix}, \begin{bmatrix}I & \Lambda^T\\ \Lambda & \Lambda\Lambda^T + \Psi\end{bmatrix}\right)$$
$$\ell(\mu,\Lambda,\Psi) = \log\prod_{i=1}^m \frac{1}{(2\pi)^{n/2}\left|\Lambda\Lambda^T + \Psi\right|^{1/2}} \exp\!\left(-\frac{1}{2}\left(x^{(i)}-\mu\right)^T\left(\Lambda\Lambda^T+\Psi\right)^{-1}\left(x^{(i)}-\mu\right)\right).$$
In the E-step,

$$z^{(i)}|x^{(i)};\mu,\Lambda,\Psi \sim \mathcal{N}\left(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}}\right)$$

where

$$\begin{aligned}
\mu_{z^{(i)}|x^{(i)}} &= \Lambda^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\left(x^{(i)} - \mu\right), \\
\Sigma_{z^{(i)}|x^{(i)}} &= I - \Lambda^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\Lambda.
\end{aligned}$$

The M-step updates are

$$\begin{aligned}
\Lambda &= \left(\sum_{i=1}^m \left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^m \mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)^{-1}, \\
\mu &= \frac{1}{m}\sum_{i=1}^m x^{(i)}, \\
\Phi &= \frac{1}{m}\sum_{i=1}^m x^{(i)}x^{(i)T} - x^{(i)}\mu_{z^{(i)}|x^{(i)}}^T\Lambda^T - \Lambda\mu_{z^{(i)}|x^{(i)}}x^{(i)T} + \Lambda\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\Lambda^T,
\end{aligned}$$
and we set $\Psi_{ii} = \Phi_{ii}$ (i.e., we let $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
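The E- and M-steps above can be sketched end-to-end as follows (NumPy; the synthetic data, the initialization, and the iteration count are assumptions, and the data is centered up front so the $\mu$ estimate is handled in closed form):

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=200, seed=0):
    """EM for factor analysis: x = mu + Lambda z + eps, z ~ N(0,I), eps ~ N(0,Psi)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                      # closed-form estimate of mu
    Xc = X - mu
    Lam = 0.1 * rng.standard_normal((n, k))  # small random initialization
    Psi = np.diag(np.var(Xc, axis=0))
    for _ in range(n_iter):
        # E-step: posterior z | x is N(mu_{z|x}, Sigma_{z|x}).
        G = Lam @ Lam.T + Psi
        Ginv = np.linalg.inv(G)
        Mz = Xc @ Ginv @ Lam                 # row i holds mu_{z^(i)|x^(i)}
        Sz = np.eye(k) - Lam.T @ Ginv @ Lam  # Sigma_{z|x}, shared by all i
        # M-step: Lambda = (sum (x-mu) mu_{z|x}^T)(sum mu mu^T + Sigma)^{-1}.
        Ezz = Mz.T @ Mz + m * Sz             # sum_i E[z z^T | x^(i)]
        Lam = np.linalg.solve(Ezz, Mz.T @ Xc).T
        Phi = (Xc.T @ Xc - Xc.T @ Mz @ Lam.T - Lam @ (Mz.T @ Xc)
               + Lam @ Ezz @ Lam.T) / m
        Psi = np.diag(np.diag(Phi))          # keep only the diagonal of Phi
    return mu, Lam, Psi

# Hypothetical synthetic data drawn from a true factor model.
rng = np.random.default_rng(3)
m, n, k = 2000, 5, 2
Lam0 = rng.standard_normal((n, k))
X = rng.standard_normal((m, k)) @ Lam0.T + 0.7 * rng.standard_normal((m, n))

mu, Lam, Psi = factor_analysis_em(X, k)
S = np.cov(X.T, bias=True)
err_fit = np.linalg.norm(Lam @ Lam.T + Psi - S)      # fitted covariance vs sample
err_diag = np.linalg.norm(np.diag(np.diag(S)) - S)   # diagonal-only baseline
```

At convergence the fitted marginal covariance $\Lambda\Lambda^T + \Psi$ should approximate the sample covariance far better than a purely diagonal model, which is the point of the low-rank-plus-diagonal structure.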