
Stanford CS229 Machine Learning Course: Notes on Mathematical Derivations

2015-06-09 23:46
If $x$ is a column vector, then $x^Tx = x\cdot x = \|x\|_2^2 = \mathrm{tr}(x^Tx)$, since the trace of a scalar is the scalar itself.

Linear regression

If $A$ and $B$ are square matrices, and $a$ is a real number:

$$\begin{aligned}
\mathrm{tr}\,ABC &= \mathrm{tr}\,CAB = \mathrm{tr}\,BCA \\
\mathrm{tr}\,ABCD &= \mathrm{tr}\,DABC = \mathrm{tr}\,CDAB = \mathrm{tr}\,BCDA \\
\mathrm{tr}\,A &= \mathrm{tr}\,A^T \\
\mathrm{tr}(A+B) &= \mathrm{tr}\,A + \mathrm{tr}\,B \\
\mathrm{tr}\,aA &= a\,\mathrm{tr}\,A \\
\nabla_A \mathrm{tr}\,AB &= B^T \\
\nabla_{A^T} f(A) &= \left(\nabla_A f(A)\right)^T \\
\nabla_{A^T} \mathrm{tr}\,AB &= B \\
\nabla_A \mathrm{tr}\,ABA^TC &= CAB + C^TAB^T \\
\nabla_{A^T} \mathrm{tr}\,ABA^TC &= B^TA^TC^T + BA^TC \\
\text{if } C = I \text{, then } \nabla_{A^T} \mathrm{tr}\,ABA^T &= B^TA^T + BA^T \\
\nabla_A |A| &= |A|\left(A^{-1}\right)^T
\end{aligned}$$

For linear regression, setting the gradient of the least-squares cost to zero gives the normal equations:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \tfrac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) \\
&= \tfrac{1}{2}\nabla_\theta\left(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\right) \\
&= \tfrac{1}{2}\nabla_\theta\,\mathrm{tr}\left(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\right) \\
&= \tfrac{1}{2}\nabla_\theta\left(\mathrm{tr}\,\theta^TX^TX\theta - 2\,\mathrm{tr}\,\vec{y}^TX\theta\right) \\
&= \tfrac{1}{2}\left(X^TX\theta + X^TX\theta - 2X^T\vec{y}\right) \\
&= X^TX\theta - X^T\vec{y}
\end{aligned}$$

$$X^TX\theta = X^T\vec{y} \quad\Longrightarrow\quad \theta = (X^TX)^{-1}X^T\vec{y}$$
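As a quick sanity check of the closed-form solution above, here is a minimal NumPy sketch; the function name fit_linear_regression and the synthetic data are illustrative, not part of the course notes:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage with synthetic data: the fitted theta should recover roughly [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one feature
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)
print(fit_linear_regression(X, y))
```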

Locally weighted linear regression

$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$

$$X^T W X \theta = X^T W \vec{y} \quad\Longrightarrow\quad \theta = (X^T W X)^{-1} X^T W \vec{y}$$
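A minimal sketch of this procedure, assuming a Gaussian weight for each training example and a diagonal weight matrix $W$; the function name lwr_predict is illustrative:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Predict at a single query point with locally weighted linear regression.

    Weights: w^(i) = exp(-||x^(i) - x_query||^2 / (2 tau^2)).
    Solves the weighted normal equations X^T W X theta = X^T W y.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```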

Newton’s method:

$$\theta := \theta - H^{-1}\nabla_\theta \ell(\theta), \qquad H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}.$$

Fitting logistic regression using Newton's method:

The log-likelihood function for logistic regression:

$$\ell(\theta) = \sum_{i=1}^m y^{(i)}\log h(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h(x^{(i)})\right)$$

For any vector $z$, it holds true that

$$z^T H z \le 0,$$

so the Hessian is negative semidefinite and $\ell(\theta)$ is concave. The gradient and Hessian are

$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \sum_{i=1}^m \left(y^{(i)} - h(x^{(i)})\right)x_k^{(i)}$$

$$H_{kl} = \frac{\partial^2 \ell(\theta)}{\partial\theta_k\,\partial\theta_l} = \sum_{i=1}^m -\frac{\partial h(x^{(i)})}{\partial\theta_l}\,x_k^{(i)} = \sum_{i=1}^m -h(x^{(i)})\left(1 - h(x^{(i)})\right)x_l^{(i)}x_k^{(i)}$$
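Putting the gradient and Hessian together, a minimal Newton's-method sketch for logistic regression might look like the following; it is a hypothetical implementation, not code from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, n_iters=10):
    """Fit logistic regression by maximizing l(theta) with Newton's method.

    grad_k = sum_i (y_i - h(x_i)) x_ik
    H_kl   = -sum_i h(x_i) (1 - h(x_i)) x_il x_ik   (negative semidefinite)
    update: theta := theta - H^{-1} grad
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                       # h(x^(i)) for every example
        grad = X.T @ (y - h)                         # gradient of the log-likelihood
        H = -(X.T * (h * (1 - h))) @ X               # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)     # Newton update
    return theta
```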

The Exponential family

We say that a class of distributions is in the exponential family if it can be written in the form

$$p(y;\eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right).$$
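For instance, the Bernoulli($\phi$) distribution can be written in this form (a standard worked example, included here only for illustration):

$$p(y;\phi) = \phi^y(1-\phi)^{1-y} = \exp\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right),$$

so that $\eta = \log\frac{\phi}{1-\phi}$, $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^\eta)$, and $b(y) = 1$.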

Jensen's inequality

Suppose we start with the inequality in the basic definition of a convex function

$$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y) \quad \text{for } 0 \le \theta \le 1.$$

Using induction, this can be fairly easily extended to convex combinations of more than two points,

$$f\left(\sum_{i=1}^k \theta_i x_i\right) \le \sum_{i=1}^k \theta_i f(x_i) \quad \text{for } \sum_{i=1}^k \theta_i = 1,\ \theta_i \ge 0\ \forall i.$$

In fact, this can also be extended to infinite sums or integrals. In the latter case, the inequality can be written as

$$f\left(\int p(x)\,x\,dx\right) \le \int p(x)f(x)\,dx \quad \text{for } \int p(x)\,dx = 1,\ p(x) \ge 0\ \forall x.$$

Because p(x) integrates to 1, it is common to consider it as a probability density, in which case the previous equation can be written in terms of expectations,

$$f(\mathrm{E}[x]) \le \mathrm{E}[f(x)]$$
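A quick numerical illustration of this expectation form, with an arbitrarily chosen convex function and distribution (both are assumptions for the demo, not part of the derivation):

```python
import numpy as np

# Check f(E[x]) <= E[f(x)] for the convex function f(x) = x**2
# and samples x drawn from an (arbitrarily chosen) exponential distribution.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)
f = lambda t: t ** 2

print(f(x.mean()))      # f(E[x])
print(f(x).mean())      # E[f(x)], should be the larger of the two
```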

Learning theory

Lemma. (The union bound). Let $A_1, A_2, \ldots, A_k$ be $k$ different events (that may not be independent). Then

$$P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)$$

Lemma. (Hoeffding inequality) Let $Z_1, \ldots, Z_m$ be $m$ independent and identically distributed (iid) random variables drawn from a Bernoulli($\phi$) distribution. Let $\hat\phi = (1/m)\sum_{i=1}^m Z_i$ be the mean of these random variables, and let any $\gamma > 0$ be fixed. Then

$$P\left(|\phi - \hat\phi| > \gamma\right) \le 2\exp(-2\gamma^2 m)$$

This lemma (which in learning theory is also called the Chernoff bound) says that if we take $\hat\phi$, the average of $m$ Bernoulli($\phi$) random variables, to be our estimate of $\phi$, then the probability of our being far from the true value is small, so long as $m$ is large.
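A small simulation can make the bound concrete; the parameter values below are arbitrary choices for illustration:

```python
import numpy as np

# Compare the empirical frequency of |phi_hat - phi| > gamma with the
# Hoeffding bound 2 * exp(-2 * gamma^2 * m).
rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.05, 20_000

samples = rng.binomial(1, phi, size=(trials, m))   # trials independent experiments
phi_hat = samples.mean(axis=1)                     # estimate of phi in each experiment
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma ** 2 * m)
print(empirical, bound)   # the empirical frequency should not exceed the bound
```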

$$\hat h = \arg\min_{h\in\mathcal{H}} \hat\epsilon(h)$$

For a hypothesis $h$, we define the training error (also called the empirical risk or empirical error in learning theory) to be

$$\hat\epsilon(h) = \frac{1}{m}\sum_{i=1}^m 1\{h(x^{(i)}) \ne y^{(i)}\}$$

The generalization error is defined as

$$\epsilon(h) = P_{(x,y)\sim\mathcal{D}}\left(h(x) \ne y\right).$$

Letting $Z_j = 1\{h_i(x^{(j)}) \ne y^{(j)}\}$, the training error can be written as

$$\hat\epsilon(h_i) = \frac{1}{m}\sum_{j=1}^m Z_j.$$

With $h^* = \arg\min_{h\in\mathcal{H}}\epsilon(h)$, uniform convergence gives

$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$$

Theorem. Let $|\mathcal{H}| = k$, and let any $m, \delta$ be fixed. Then with probability at least $1-\delta$, we have that

$$\epsilon(\hat h) \le \left(\min_{h\in\mathcal{H}}\epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

Corollary. Let $|\mathcal{H}| = k$, and let any $\gamma, \delta$ be fixed. Then for $\epsilon(\hat h) \le \min_{h\in\mathcal{H}}\epsilon(h) + 2\gamma$ to hold with probability at least $1-\delta$, it suffices that

$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\left(\frac{1}{\gamma^2}\log\frac{k}{\delta}\right).$$
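As a small worked example of this bound, the snippet below computes the smallest such $m$ for some arbitrarily chosen $k$, $\gamma$, and $\delta$ (the values are purely illustrative):

```python
import numpy as np

def sample_complexity(k, gamma, delta):
    """Smallest m with m >= (1 / (2 * gamma**2)) * log(2 * k / delta)."""
    return int(np.ceil(np.log(2 * k / delta) / (2 * gamma ** 2)))

print(sample_complexity(k=1000, gamma=0.05, delta=0.05))  # about 2120 for these values
```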

Factor Analysis:

Marginals and conditionals of Gaussians:

$$\mathrm{Cov}(x) = \Sigma = \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix} = E\left[(x-\mu)(x-\mu)^T\right] = E\left[\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}^T\right] = E\begin{bmatrix}(x_1-\mu_1)(x_1-\mu_1)^T & (x_1-\mu_1)(x_2-\mu_2)^T\\ (x_2-\mu_2)(x_1-\mu_1)^T & (x_2-\mu_2)(x_2-\mu_2)^T\end{bmatrix}$$

$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
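A minimal NumPy sketch of these conditional formulas, assuming the joint mean and covariance have already been partitioned into mu1, mu2 and the blocks S11, S12, S21, S22 (the names are illustrative):

```python
import numpy as np

def gaussian_conditional(mu1, mu2, S11, S12, S21, S22, x2):
    """Return (mu_{1|2}, Sigma_{1|2}) for a partitioned joint Gaussian.

    mu_{1|2}    = mu1 + S12 S22^{-1} (x2 - mu2)
    Sigma_{1|2} = S11 - S12 S22^{-1} S21
    """
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond
```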

To derive the formulas above, we define $V \in \mathbb{R}^{(m+n)\times(m+n)}$ and need the lemma below:

$$V = \begin{bmatrix}V_{AA} & V_{AB}\\ V_{BA} & V_{BB}\end{bmatrix} = \Sigma^{-1}, \qquad \begin{bmatrix}A & B\\ C & D\end{bmatrix}^{-1} = \begin{bmatrix}M^{-1} & -M^{-1}BD^{-1}\\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1}\end{bmatrix}$$

where $M = A - BD^{-1}C$. Using this formula, it follows that

$$\begin{bmatrix}\Sigma_{AA} & \Sigma_{AB}\\ \Sigma_{BA} & \Sigma_{BB}\end{bmatrix} = \begin{bmatrix}V_{AA} & V_{AB}\\ V_{BA} & V_{BB}\end{bmatrix}^{-1} = \begin{bmatrix}\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1} & -\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1}V_{AB}V_{BB}^{-1}\\ -V_{BB}^{-1}V_{BA}\left(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA}\right)^{-1} & \left(V_{BB} - V_{BA}V_{AA}^{-1}V_{AB}\right)^{-1}\end{bmatrix}$$

And the “completion of squares” trick. Consider the quadratic function $\frac{1}{2}z^TAz + b^Tz + c$ where $A$ is a symmetric, nonsingular matrix. Then, one can verify directly that

$$\frac{1}{2}z^TAz + b^Tz + c = \frac{1}{2}\left(z + A^{-1}b\right)^TA\left(z + A^{-1}b\right) + c - \frac{1}{2}b^TA^{-1}b.$$
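The identity is easy to spot-check numerically; the random matrices below are only for verification and carry no special meaning:

```python
import numpy as np

# Spot check of the completion-of-squares identity for a symmetric,
# nonsingular A and random b, z, c.
rng = np.random.default_rng(0)
n = 4
B = rng.normal(size=(n, n))
A = B @ B.T + np.eye(n)                # symmetric positive definite, hence nonsingular
b, z, c = rng.normal(size=n), rng.normal(size=n), 1.7
Ainv_b = np.linalg.solve(A, b)

lhs = 0.5 * z @ A @ z + b @ z + c
rhs = 0.5 * (z + Ainv_b) @ A @ (z + Ainv_b) + c - 0.5 * b @ Ainv_b
print(np.isclose(lhs, rhs))            # True
```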

EM for factor analysis

In the factor analysis model, we posit a joint distribution on $(x, z)$ as follows, where $z \in \mathbb{R}^k$ is a latent random variable:

$$z \sim \mathcal{N}(0, I)$$
$$x\,|\,z \sim \mathcal{N}(\mu + \Lambda z, \Psi).$$

Here, the parameters of our model are the vector $\mu \in \mathbb{R}^n$, the matrix $\Lambda \in \mathbb{R}^{n\times k}$, and the diagonal matrix $\Psi \in \mathbb{R}^{n\times n}$. The value of $k$ is usually chosen to be smaller than $n$.

Thus, we imagine that each datapoint $x^{(i)}$ is generated by sampling a $k$-dimensional multivariate Gaussian $z^{(i)}$. It is then mapped to a $k$-dimensional affine subspace of $\mathbb{R}^n$ by computing $\mu + \Lambda z^{(i)}$. Lastly, $x^{(i)}$ is generated by adding covariance-$\Psi$ noise to $\mu + \Lambda z^{(i)}$.

$$z \sim \mathcal{N}(0, I)$$
$$\epsilon \sim \mathcal{N}(0, \Psi)$$
$$x = \mu + \Lambda z + \epsilon$$

where ϵ and z are independent.

$$\begin{bmatrix}z\\ x\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\vec{0}\\ \mu\end{bmatrix}, \begin{bmatrix}I & \Lambda^T\\ \Lambda & \Lambda\Lambda^T + \Psi\end{bmatrix}\right)$$

$$\ell(\mu, \Lambda, \Psi) = \log\prod_{i=1}^m \frac{1}{(2\pi)^{n/2}\left|\Lambda\Lambda^T + \Psi\right|^{1/2}}\exp\left(-\frac{1}{2}\left(x^{(i)} - \mu\right)^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\left(x^{(i)} - \mu\right)\right).$$

In the E-step, the posterior over the latent variable is

$$z^{(i)}\,|\,x^{(i)};\mu,\Lambda,\Psi \sim \mathcal{N}\left(\mu_{z^{(i)}|x^{(i)}},\ \Sigma_{z^{(i)}|x^{(i)}}\right)$$

$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\left(x^{(i)} - \mu\right), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\Lambda$$

The M-step updates are

$$\Lambda = \left(\sum_{i=1}^m \left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^m \mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)^{-1}$$

$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}$$

$$\Phi = \frac{1}{m}\sum_{i=1}^m x^{(i)}x^{(i)T} - x^{(i)}\mu_{z^{(i)}|x^{(i)}}^T\Lambda^T - \Lambda\mu_{z^{(i)}|x^{(i)}}x^{(i)T} + \Lambda\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\Lambda^T$$

Finally, we set $\Psi_{ii} = \Phi_{ii}$ (i.e., we let $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
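A compact NumPy sketch of this EM loop, under the assumption that the data are first centered by their sample mean (so $\mu$ is fixed at its closed-form value and the updates above are applied to the centered $x^{(i)}$); all names are illustrative:

```python
import numpy as np

def factor_analysis_em(X, k, n_iters=100):
    """EM for factor analysis with latent dimension k.

    X: (m, n) data matrix. Returns mu (n,), Lambda (n, k), Psi (n, n) diagonal.
    """
    m, n = X.shape
    mu = X.mean(axis=0)                      # closed-form M-step value for mu
    Xc = X - mu                              # centered data
    rng = np.random.default_rng(0)
    Lam = rng.normal(size=(n, k))
    Psi = np.diag(Xc.var(axis=0))

    for _ in range(n_iters):
        # E-step: posterior q(z^(i)) = N(mu_zx[i], Sigma_zx)
        G = np.linalg.inv(Lam @ Lam.T + Psi)          # (Lambda Lambda^T + Psi)^{-1}
        mu_zx = Xc @ G @ Lam                          # (m, k); row i = mu_{z(i)|x(i)}
        Sigma_zx = np.eye(k) - Lam.T @ G @ Lam        # shared across all examples

        # M-step
        Ezz = mu_zx.T @ mu_zx + m * Sigma_zx          # sum_i E[z z^T | x^(i)]
        Lam = (Xc.T @ mu_zx) @ np.linalg.inv(Ezz)
        Phi = (Xc.T @ Xc - Xc.T @ mu_zx @ Lam.T
               - Lam @ mu_zx.T @ Xc + Lam @ Ezz @ Lam.T) / m
        Psi = np.diag(np.diag(Phi))                   # keep only the diagonal of Phi
    return mu, Lam, Psi
```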