
A Small Exercise in Bayesian Linear Regression

Building on the understanding of Bayesian linear regression from the previous post, 贝叶斯线性回归(单输出) (Bayesian linear regression with a single output), I grabbed the listed prices of houses for sale in one area together with their construction year and floor area (shown below), and ran Bayesian linear regression with a univariate linear basis-function model, a univariate polynomial basis-function model, and univariate and bivariate Gaussian basis-function models.

The house-price data are listed in the table below:

| No. | Year | Area (m²) | Price (10k CNY) | No. | Year | Area (m²) | Price (10k CNY) |
|----:|-----:|----------:|----------------:|----:|-----:|----------:|----------------:|
| 0 | 2001 | 100.83 | 410 | 10 | 1999 |  95.11 | 565 |
| 1 | 2005 |  90.90 | 500 | 11 | 2000 |  85.57 | 430 |
| 2 | 2007 | 130.03 | 550 | 12 | 1995 |  66.44 | 378 |
| 3 | 2004 |  78.88 | 410 | 13 | 2003 |  94.27 | 498 |
| 4 | 2006 |  74.22 | 460 | 14 | 2007 | 125.10 | 760 |
| 5 | 2005 |  90.40 | 497 | 15 | 2006 | 111.20 | 730 |
| 6 | 1983 |  64.59 | 370 | 16 | 2008 |  88.99 | 430 |
| 7 | 2000 | 164.06 | 610 | 17 | 2005 |  92.13 | 506 |
| 8 | 2003 | 147.50 | 560 | 18 | 2008 | 101.35 | 405 |
| 9 | 2003 |  58.51 | 408 | 19 | 2000 | 158.90 | 615 |

Univariate polynomial basis functions (including the linear basis)

Following the linear-regression figures in PRML, I first take area as the input variable and, for the initial state and for 1, 2, and 20 observed points, use the linear model y = a0 + a1·x to plot the likelihood, the prior/posterior distribution over the parameters, and the resulting regression curves. I then use the polynomial basis-function model to run the same process for a univariate cubic polynomial (feeding in 1, 2, 5, and 20 data points). With that many parameters, the likelihood and prior/posterior of each one cannot all be displayed at once, so at each stage only the fitted curve, the confidence band, and curves sampled from the current posterior are shown.
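For reference, the quantities computed in the code below are the standard conjugate updates from the previous post. With prior w ~ N(m₀, A⁻¹) (A is alpha and m₀ is mu_0 in the code) and noise precision β, the posterior after N points and the predictive distribution at a new x∗ are

S_N = (A + β·ΦᵀΦ)⁻¹
m_N = S_N·(A·m₀ + β·Φᵀt)
p(t∗ | x∗) = N( m_Nᵀϕ(x∗), 1/β + ϕ(x∗)ᵀ·S_N·ϕ(x∗) )

which correspond to S, mu, out_y_mean, and out_y_std in the code.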

The fitting code for the polynomial basis functions is given below:

# Univariate regression on the real data with a linear model (degree-n polynomial basis)
# highest degree is max_order
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg
from scipy.stats import norm, multivariate_normal

data = np.array([
[2001,100.83,410],[2005,90.9,500],[2007,130.03,550],[2004,78.88,410],[2006,74.22,460],
[2005,90.4,497],[1983,64.59,370],[2000,164.06,610],[2003,147.5,560],[2003,58.51,408],
[1999,95.11,565],[2000,85.57,430],[1995,66.44,378],[2003,94.27,498],[2007,125.1,760],
[2006,111.2,730],[2008,88.99,430],[2005,92.13,506],[2008,101.35,405],[2000,158.9,615]])

def phi(x, max_order=5):
    return x[:, None] ** np.arange(max_order + 1)

max_order = 3
beta = 2.5e-3  # precision of the observation noise on the house price
s = np.sqrt(1.0 / beta)  # standard deviation of the data noise
# prior precision of w: sets how tightly the initial w concentrates around mu_0
alpha = np.diag((1e-4,) + (1e-2,)*max_order)  # for 1, x, x^2, ..., x^max_order
mu_0 = np.array([150]+[0]*max_order)  # prior mean for w_0, w_1, ..., w_max_order

xx = data[:, 1]
tt = data[:, 2]
# generative model: tt = w0 + w1*xx + ... + w_max_order*xx^max_order + norm.rvs(loc=0, scale=s, size=(20,))

nums = [1, 2, 5, 20]
_, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
phi_x = phi(xx, max_order)  # phi_x.shape = (20, max_order+1)

for num, ax in zip(nums, axes):  # one panel per entry of nums = [1, 2, 5, 20]

    # posterior covariance S_num given the first num data points
    S = linalg.inv(alpha + beta * np.dot(phi_x[:num].T, phi_x[:num]))
    # posterior mean mu_num given the first num data points
    mu = beta * np.dot(np.dot(S, phi_x[:num].T), tt[:num]) + np.dot(np.dot(S, alpha), mu_0)

    # ###### plot fitted curves sampled from the current prior/posterior over w ######
    w = multivariate_normal.rvs(mean=mu, cov=S, size=10)  # randomly sample 10 weight vectors
    x_min, x_max = int(np.min(xx)) - 1, int(np.max(xx)) + 1
    out_x = np.linspace(x_min, x_max, 50)
    out_phi_x = phi(out_x, max_order)  # shape = (50, max_order+1)
    out_y = np.dot(w, out_phi_x.T)  # 10 sampled curves
    out_y_mean = np.dot(mu, out_phi_x.T)
    out_y_std = np.sqrt(1/beta + np.dot(np.dot(out_phi_x, S), out_phi_x.T).diagonal())
    for i in range(10):
        ax.plot(out_x, out_y[i], 'gray')
    ax.plot(out_x, out_y_mean, 'k', linewidth=2)
    ax.fill_between(out_x, out_y_mean - out_y_std, out_y_mean + out_y_std, color="lightgray")
    ax.plot(xx[:num], tt[:num], 'ro')
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(200, 800)
    ax.set_xlabel('square', fontsize='xx-large')
    ax.set_ylabel('price', fontsize='xx-large')
    # ###############################################################################
plt.suptitle('Price Estimating by Square with {}-Polynomial Linear Models, 10 sampled curves'.format(max_order), fontsize='xx-large')
plt.subplots_adjust(left=0.05, right=0.98, top=0.93)
plt.show()


The results are shown below. As training progresses, the distribution of w becomes increasingly concentrated, and the confidence band around the regression curve narrows accordingly. Both the linear and the polynomial model behave this way, and no overfitting appears as data points are added, since the prior over w acts as a regularizer.

[Figures: posterior samples and confidence bands for the linear and cubic-polynomial models after 1, 2, 5, and 20 data points.]
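The narrowing can also be checked numerically: the marginal posterior standard deviation of each weight is the square root of the diagonal of S, and it shrinks as points are absorbed. A minimal check, reusing alpha, beta and phi_x from the script above:

# marginal posterior std of each weight after 1, 2, 5 and 20 points
for num in [1, 2, 5, 20]:
    S_n = linalg.inv(alpha + beta * np.dot(phi_x[:num].T, phi_x[:num]))
    print(num, np.sqrt(np.diag(S_n)))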
Univariate and bivariate Gaussian basis-function models

The approach is the same as for the polynomial model; the only difference lies in the form of ϕ(x). Polynomial basis functions differ from one another in their powers, whereas Gaussian basis functions differ in their means μ. For the bivariate Gaussian basis used here, the two inputs are the house's construction year and its area; assuming these two variables are uncorrelated, the Σ of each bivariate Gaussian basis function is a diagonal matrix.
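Only the bivariate code is listed below; for the univariate case, a minimal sketch of the basis map is as follows. It mirrors the conventions of the bivariate phi further down (centers spread with linspace over the area range, width tied to the center spacing, first component replaced by a bias); the exact centers and widths of the original univariate run are not given, so these values are assumptions:

import numpy as np

def phi_gauss(x, components=5):
    # assumed centers over the observed area range, mirroring the bivariate version
    mu = np.linspace(40, 200, components + 1)
    s2 = ((mu[1] - mu[0]) / 0.5) ** 2  # squared width tied to the center spacing
    out = np.exp(-0.5 * (x[:, None] - mu) ** 2 / s2)
    out[:, 0] = 1.0  # first component serves as the bias term
    return out  # shape = (n, components+1)

Everything else (S, mu, predictive band) is computed exactly as in the polynomial script, with phi_gauss in place of phi.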

The code for the bivariate Gaussian basis-function model follows; as before, it fits on the first 1, 2, 5, and 20 data points.

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, multivariate_normal
from numpy import linalg
from mpl_toolkits.mplot3d import Axes3D

# Bivariate regression on the real data with a linear model (5 bivariate Gaussian basis functions)
data = np.array([
[2001,100.83,410],[2005,90.9,500],[2007,130.03,550],[2004,78.88,410],[2006,74.22,460],
[2005,90.4,497],[1983,64.59,370],[2000,164.06,610],[2003,147.5,560],[2003,58.51,408],
[1999,95.11,565],[2000,85.57,430],[1995,66.44,378],[2003,94.27,498],[2007,125.1,760],
[2006,111.2,730],[2008,88.99,430],[2005,92.13,506],[2008,101.35,405],[2000,158.9,615]])

# Assuming year and area in X are independent, so the covariance matrix of each
# bivariate normal basis function only has values on its diagonal.
def phi(X, components=5):  # input X.shape = (n, 2): (instances, variables)
    mu = np.c_[np.linspace(1990, 2010, components+1), np.linspace(40, 200, components+1)]  # shape = (components+1, 2)
    N, m = X.shape[0], mu.shape[0]
    def dot_(x, a):
        return np.dot(np.dot(x, a), x.T)
    alpha0 = np.diag(((mu[1] - mu[0]) / 0.5) ** (-2))  # per-basis precision, tied to the center spacing
    exps = np.array([[np.exp(-0.5 * dot_(X[i] - mu[j], alpha0)) for j in range(m)] for i in range(N)])
    exps[:, 0] = 1  # first component serves as the bias; shape = (n, components+1)
    return exps

components = 5
beta = 2.5e-3  # precision of the observation noise on the house price
s = np.sqrt(1/beta)  # standard deviation of the data noise
alpha = np.diag((1e-5,) + (1e-5,)*components)  # prior precision matrix of the coefficients
# prior on w: alpha sets how tightly the initial w concentrates around the mean (200, 0, ..., 0)
mu_0 = np.array([200]+[0]*components)  # prior mean for w_0, w_1, ..., w_components

X = data[:, :2]
tt = data[:, 2]
# generative model: tt = w0 + w1*phi_1(X) + ... + w_components*phi_components(X) + norm.rvs(loc=0, scale=s, size=(20,))
nums = [1, 2, 5, 20]
fig = plt.figure(figsize=(12, 12))
phi_x = phi(X, components)  # phi_x.shape = (20, components+1)

for i, num in enumerate(nums):  # one panel per entry of nums = [1, 2, 5, 20]
    ax = fig.add_subplot(2, 2, i+1, projection='3d')

    # posterior covariance S_num given the first num data points
    S = linalg.inv(alpha + beta * np.dot(phi_x[:num].T, phi_x[:num]))
    # posterior mean mu_num given the first num data points
    mu = beta * np.dot(np.dot(S, phi_x[:num].T), tt[:num]) + np.dot(np.dot(S, alpha), mu_0)

    # ###### plot the predictive mean surface and the confidence surfaces ######
    x_min, x_max = int(np.min(X[:, 1])) - 1, int(np.max(X[:, 1])) + 1
    out_X0, out_X1 = np.meshgrid(np.linspace(1980, 2010, 30), np.linspace(x_min, x_max, 100))
    out_X = np.c_[out_X0.ravel(), out_X1.ravel()]
    out_phi_X = phi(out_X, components)  # shape = (3000, components+1)
    out_y_mean = np.dot(mu, out_phi_X.T).reshape(out_X0.shape)
    out_y_std = np.sqrt(1/beta + np.dot(np.dot(out_phi_X, S), out_phi_X.T).diagonal()).reshape(out_X0.shape)

    ax.plot_wireframe(out_X0, out_X1, out_y_mean, rstride=10, cstride=1, antialiased=True)
    ax.plot_surface(out_X0, out_X1, out_y_mean - out_y_std, alpha=0.3)
    ax.plot_surface(out_X0, out_X1, out_y_mean + out_y_std, alpha=0.3)

    ax.scatter(X[:num, 0], X[:num, 1], tt[:num], c='r', marker='o')
    ax.set_xlim(1980, 2010)
    ax.set_ylim(x_min, x_max)
    ax.set_zlim(200, 800)
    ax.set_xlabel('year', fontsize='x-large')
    ax.set_ylabel('square', fontsize='x-large')
    ax.set_zlabel('price', fontsize='x-large')
    # ##########################################################################
plt.suptitle('House Price Estimation from Year and Square with Gaussian Basis Functions', fontsize='xx-large')
plt.subplots_adjust(left=0.05, right=0.98, top=0.93, hspace=0.01)
plt.show()


The univariate and bivariate results are shown below. Comparing the univariate Gaussian basis with the univariate polynomial basis, the two models end up producing regression curves that are very close to each other.

[Figures: univariate Gaussian basis fits and bivariate Gaussian basis predictive surfaces for 1, 2, 5, and 20 data points.]
With the real two-dimensional inputs, the bivariate Gaussian basis output above shows that the fit is not good. To fit real data well, the parameters α and β, the μ and Σ inside ϕ(x), and the number of basis functions M all need tuning. After some rough tuning I obtained the result below. In the 3D plot, the predictive mean is drawn as a blue wireframe and the top and bottom of the confidence band as blue and pink surfaces; in the title, 73.3 is the standard deviation of the training data around the corresponding points of the regression surface, and 24.4 is the average of the predictive standard deviations over the whole region. The right-hand panel is a 2D view of the regression surface on the left: the sale price rises as the construction year gets more recent and as the floor area grows. Predicting at the point (2004, 98.31) gives an estimated price of 511.5 (10k CNY) with a predictive standard deviation of 21.2 (10k CNY).

[Figure: tuned bivariate Gaussian basis fit, 3D predictive surface with confidence surfaces (left) and its 2D projection (right).]
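A minimal sketch of that prediction step is below, reusing mu, S, beta, phi, phi_x, tt, out_y_std and components from the bivariate script above after all 20 points have been absorbed. The tuned hyperparameters behind the quoted 73.3, 24.4, 511.5 and 21.2 are not given here, so running this with the untuned values above will yield different numbers.

# Hypothetical reproduction of the prediction step; assumes the variables from
# the bivariate script above (after the num = 20 pass) are still in scope.
x_star = np.array([[2004, 98.31]])  # query point: (year, area)
phi_star = phi(x_star, components)  # shape = (1, components+1)
y_mean = np.dot(phi_star, mu)       # predictive mean price
y_std = np.sqrt(1/beta + np.dot(np.dot(phi_star, S), phi_star.T).diagonal())

# The two statistics quoted in the figure title would be computed along these lines:
train_std = np.sqrt(np.mean((tt - np.dot(phi_x, mu)) ** 2))  # std of training data around the surface
mean_pred_std = out_y_std.mean()    # average predictive std over the plotted grid

print(y_mean[0], y_std[0], train_std, mean_pred_std)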
