Kaggle House Price Prediction: Model Summary

2020-05-04 09:50

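House Price Prediction Task

Goal: predict the final sale price of each house from its attributes.

(1): Analyzing the Data

  1. First, look at the features and the target: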
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('./data/train.csv')
df_train.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')

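Features:

  • MSSubClass: building class
  • MSZoning: general zoning classification
  • LotFrontage: linear feet of street connected to the property
  • LotArea: lot size in square feet
  • Street: type of road access
  • Alley: type of alley access
  • LotShape: general shape of the property
  • LandContour: flatness of the property
  • Utilities: utilities available (electricity, water, gas)
  • LotConfig: lot configuration
  • LandSlope: slope of the property
  • Neighborhood: physical location within the Ames city limits
  • Condition1: proximity to a main road or railroad
  • Condition2: proximity to a main road or railroad (if a second one is present)
  • BldgType: type of dwelling
  • HouseStyle: style of dwelling
  • OverallQual: overall material and finish quality of the house
  • OverallCond: overall condition of the house
  • YearBuilt: original construction date
  • YearRemodAdd: remodel date
  • RoofStyle: type of roof
  • RoofMatl: roof material
  • Exterior1st: exterior covering on the house
  • Exterior2nd: exterior covering on the house (if more than one material)
  • MasVnrType: masonry veneer type
  • MasVnrArea: masonry veneer area in square feet
  • ExterQual: exterior material quality
  • ExterCond: present condition of the exterior material
  • Foundation: type of foundation
  • BsmtQual: height of the basement
  • BsmtCond: general condition of the basement
  • BsmtExposure: walkout or garden-level basement walls
  • BsmtFinType1: quality of the basement finished area
  • BsmtFinSF1: type 1 finished square feet
  • BsmtFinType2: quality of the second finished area (if present)
  • BsmtFinSF2: type 2 finished square feet
  • BsmtUnfSF: unfinished square feet of basement area
  • TotalBsmtSF: total square feet of basement area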
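  • Heating: type of heating
  • HeatingQC: heating quality and condition
  • CentralAir: central air conditioning
  • Electrical: electrical system
  • 1stFlrSF: first-floor square feet
  • 2ndFlrSF: second-floor square feet
  • LowQualFinSF: low-quality finished square feet (all floors)
  • GrLivArea: above-grade (ground) living area in square feet
  • BsmtFullBath: basement full bathrooms
  • BsmtHalfBath: basement half bathrooms
  • FullBath: full bathrooms above grade
  • HalfBath: half bathrooms above grade
  • BedroomAbvGr: number of bedrooms above basement level
  • KitchenAbvGr: number of kitchens
  • KitchenQual: kitchen quality
  • TotRmsAbvGrd: total rooms above grade (excluding bathrooms)
  • Functional: home functionality rating
  • Fireplaces: number of fireplaces
  • FireplaceQu: fireplace quality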
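  • GarageType: garage location
  • GarageYrBlt: year the garage was built
  • GarageFinish: interior finish of the garage
  • GarageCars: size of the garage in car capacity
  • GarageArea: size of the garage in square feet
  • GarageQual: garage quality
  • GarageCond: garage condition
  • PavedDrive: paved driveway
  • WoodDeckSF: wood deck area in square feet
  • OpenPorchSF: open porch area in square feet
  • EnclosedPorch: enclosed porch area in square feet
  • 3SsnPorch: three-season porch area in square feet
  • ScreenPorch: screen porch area in square feet
  • PoolArea: pool area in square feet
  • PoolQC: pool quality
  • Fence: fence quality
  • MiscFeature: miscellaneous feature not covered by other categories
  • MiscVal: dollar value of the miscellaneous feature
  • MoSold: month sold
  • YrSold: year sold
  • SaleType: type of sale
  • SaleCondition: condition of sale

Target:

  • SalePrice: sale price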
df_train['SalePrice'].describe()

count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64

First, check whether the target follows a normal distribution:

sns.distplot(df_train['SalePrice']);


Compute the skewness and kurtosis:

#skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())

Skewness: 1.882876
Kurtosis: 6.536282

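Analysis: the distribution is roughly bell-shaped, but it is strongly right-skewed (skewness about 1.88, kurtosis about 6.54), so the target needs to be transformed toward a normal distribution.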
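
To make the skewness argument concrete, here is a minimal sketch on synthetic log-normal data. The data, seed, and parameters are assumptions standing in for SalePrice, not the Kaggle values. It compares the skewness reported by pandas before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (hypothetical stand-in for SalePrice)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skew_raw = prices.skew()            # clearly positive for log-normal data
skew_log = np.log1p(prices).skew()  # close to zero after the log transform

print("skew before: %.3f, after: %.3f" % (skew_raw, skew_log))
```

A skewness near zero after the transform is what the later normality check is looking for.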

# Above-grade living area in square feet
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));


A larger living area clearly goes with a higher price, but a few outliers show up here.

# Basement area in square feet
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

# Overall material and finish quality
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

# Original construction date
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);


Analysis: more recently built houses tend to cost more, but a house's history is also an important factor in its price.

var = 'Neighborhood'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
#fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

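Pick the 10 features most strongly correlated with the price and look at how they correlate with one another.

k = 10
# Correlations are computed on the numeric columns only
corrmat = df_train.select_dtypes(include=[np.number]).corr()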
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values,cmap='YlGnBu')
plt.show()
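
The selection step can be sketched on a toy frame. The helper name `top_k_correlated` and the numbers below are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_k_correlated(df, target, k):
    """Return the k numeric columns most correlated with `target` (target included)."""
    corr = df.select_dtypes(include=[np.number]).corr()
    return corr.nlargest(k, target)[target]

# Toy frame (made-up numbers, not the Kaggle data)
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [900, 1200, 1500, 2100, 2400],
    "YrSold":    [2008, 2006, 2010, 2007, 2009],
    "Street":    ["Pave", "Pave", "Grvl", "Pave", "Pave"],
})
top = top_k_correlated(toy, "SalePrice", k=2)
print(top)
```

Note that `nlargest` ranks by the signed correlation, so a feature with a strong negative correlation would not be picked up this way; ranking by the absolute value is one possible alternative.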

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show();


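Overall, the price rises as each of these features increases, but outliers are present as well.

(2): Checking Normality

A log transform brings the data closer to a normal distribution.

Distribution before the transform (here train is the training frame loaded in section (3), after the two outlier rows have been dropped, which is why mu below differs slightly from the mean reported earlier):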

sns.distplot(train['SalePrice'], fit=norm);

# Parameters of the fitted normal distribution
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Distribution plot
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

mu = 180932.92 and sigma = 79467.79


Distribution after the transform:

# Log transform: log(1 + x)
train["SalePrice"] = np.log1p(train["SalePrice"])

# Look at the new distribution
sns.distplot(train['SalePrice'], fit=norm);

# Parameters of the fitted normal distribution
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Distribution plot
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()


(3): Data Preprocessing

Inspect the missing values:

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
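
The same summary can be written as a small reusable helper. This is a sketch on a toy frame; the function name `missing_summary` and the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing values per column, worst first."""
    total = df.isnull().sum()
    percent = df.isnull().mean()  # same as sum() divided by the number of rows
    out = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return out.sort_values("Total", ascending=False)

toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```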


Drop the Id column

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# Check the size of the data
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

# Keep the Id columns for later; they are not used as features
train_ID = train['Id']
test_ID = test['Id']

# Drop the Id column
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

Find the outliers

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()


Remove the outliers

train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
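
The filtering rule above is a plain boolean mask. A minimal sketch on made-up rows (not the real data) shows which rows the rule removes and which it keeps:

```python
import pandas as pd

# Toy data (made-up values): the last row is very large but cheap
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000, 4500, 4700],
    "SalePrice": [180000, 250000, 700000, 160000],
})

# Same rule as in the text: very large living area AND a low price
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
cleaned = toy[~is_outlier].reset_index(drop=True)

print(cleaned)
```

The large and expensive house (4500 sq ft) stays, because it follows the overall trend; only the large and cheap one is dropped.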

Handling missing values:
Concatenate the training and test sets:

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)


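Because many values are missing, each feature is filled according to what a missing value means for that particular feature.

# Pool quality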

all_data["PoolQC"][:5]

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: PoolQC, dtype: object

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["PoolQC"][:5]

0 None
1 None
2 None
3 None
4 None
Name: PoolQC, dtype: object

# Miscellaneous feature: NaN means there is none
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
# Alley access
all_data["Alley"] = all_data["Alley"].fillna("None")
# Fence
all_data["Fence"] = all_data["Fence"].fillna("None")
# Fireplace
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
# Lot frontage: fill with the median of the same neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
# Garage features
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
# Basement features
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
# Masonry veneer
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
# General zoning classification: fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# Functional (home functionality rating)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
# Electrical system
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
# Kitchen quality
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
# Exterior covering
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
# Sale type
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
# Building class
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
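
The fills above fall into three patterns. The sketch below reproduces them on a tiny made-up frame, so the effect of each is easy to verify by hand; the values are illustrative assumptions, not taken from the competition data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage":  [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
    "PoolQC":       [np.nan, "Gd", np.nan, np.nan, np.nan, np.nan],
    "GarageArea":   [400.0, np.nan, 500.0, 0.0, 250.0, 300.0],
    "MSZoning":     ["RL", "RL", np.nan, "RM", "RL", "RL"],
})

# 1) NaN means "feature absent": use an explicit "None" category or zero
toy["PoolQC"] = toy["PoolQC"].fillna("None")
toy["GarageArea"] = toy["GarageArea"].fillna(0)

# 2) Value exists but is unknown: borrow the median of the same neighborhood
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# 3) Rarely missing categorical: use the most frequent value
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])

print(toy)
```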

After filling, check what is still missing:

all_data = all_data.drop(['Utilities'], axis=1)

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

Some features are stored as numbers but are really categorical rather than continuous, so convert them to strings.

all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

all_data['OverallCond'] = all_data['OverallCond'].astype(str)

all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

Use LabelEncoder from sklearn.preprocessing to turn these columns into integer labels:

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape
print('Shape all_data: {}'.format(all_data.shape))

Shape all_data: (2917, 78)
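
One property of LabelEncoder worth keeping in mind is that it numbers the classes in sorted order, not in any domain order. A minimal sketch with a made-up quality column:

```python
from sklearn.preprocessing import LabelEncoder

quality = ["Ex", "Gd", "TA", "Fa", "None", "Gd"]

lbl = LabelEncoder()
codes = lbl.fit_transform(quality)

print(list(lbl.classes_))  # classes are sorted alphabetically
print(list(codes))
```

Here "Ex" gets 0 and "TA" gets 4, so the integer codes do not follow the quality ranking. Whether this matters depends on the model; an explicit hand-written mapping would be one alternative if the order is important.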

# Add a new feature: total area (basement + first floor + second floor)
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

Normalizing the continuous features
First inspect the skewness:

from scipy.stats import norm, skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)


Features whose absolute skewness exceeds 0.75 are transformed with boxcox1p from scipy.special.

# Keep only the features whose absolute skewness exceeds 0.75
skewness = skewness[skewness['Skew'].abs() > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
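
For reference, `boxcox1p(x, lam)` computes the Box-Cox transform of `1 + x`, that is `((1 + x)**lam - 1) / lam` for a non-zero `lam`, and `log(1 + x)` when `lam` is 0. The short check below confirms this numerically on arbitrary sample values.

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 1000.0])
lam = 0.15

manual = ((1.0 + x) ** lam - 1.0) / lam  # Box-Cox of (1 + x) for lam != 0
print(np.allclose(boxcox1p(x, lam), manual))
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))  # lam = 0 reduces to log1p
```

With `lam = 0.15` the transform therefore behaves like a slightly milder version of the log transform used on SalePrice.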

Then convert the remaining categorical columns to one-hot encoding with pd.get_dummies.

all_data = pd.get_dummies(all_data)
train = all_data[:ntrain]
test = all_data[ntrain:]
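
Encoding the concatenated frame and splitting it afterwards is what keeps the train and test columns identical. A small sketch with made-up rows illustrates the difference from encoding each set separately:

```python
import pandas as pd

train_toy = pd.DataFrame({"RoofStyle": ["Gable", "Hip"], "LotArea": [8450, 9600]})
test_toy = pd.DataFrame({"RoofStyle": ["Gable", "Shed"], "LotArea": [11250, 9550]})

# Encoding separately yields different column sets
sep_train = pd.get_dummies(train_toy)
sep_test = pd.get_dummies(test_toy)

# Encoding the concatenation, then splitting by row count, keeps them aligned
n_train = len(train_toy)
both = pd.get_dummies(pd.concat([train_toy, test_toy]).reset_index(drop=True))
enc_train, enc_test = both[:n_train], both[n_train:]

print(list(sep_train.columns) == list(sep_test.columns))
print(list(enc_train.columns) == list(enc_test.columns))
```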

(4): Modeling and Predicting House Prices

from sklearn.linear_model import ElasticNet, Lasso
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
n_folds = 5
def rmsle_cv(model):
    # Pass the KFold object itself so that shuffle and random_state take effect
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse
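
The name `rmsle_cv` suggests that the score is meant as a root mean squared logarithmic error. Since the target was already transformed with `log1p`, an ordinary RMSE on it is the same quantity. The sketch below checks this on made-up prices; the helper `rmsle` is hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    """Root mean squared error between log1p-transformed values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([100000.0, 200000.0, 300000.0])
y_pred = np.array([110000.0, 190000.0, 330000.0])

# Plain RMSE computed on targets that are already log1p-transformed
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))
print(rmsle(y_true, y_pred), rmse_on_logs)

# A uniform 10% over-prediction costs about log(1.1), whatever the price level
print(rmsle(y_true, 1.1 * y_true), np.log(1.1))
```

This also gives a rough reading of the scores that follow: an error around 0.11 corresponds to predictions that are typically off by roughly 11 to 12 percent.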

make_pipeline chains the steps together so they run as one estimator; RobustScaler is a scaler that copes better with outliers.

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
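
To see why a robust scaler helps here, the following sketch scales one made-up feature containing a single extreme value. With its default settings RobustScaler centers on the median and divides by the interquartile range, while StandardScaler uses the mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (made-up numbers)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x)      # (x - median) / IQR
standard = StandardScaler().fit_transform(x)  # (x - mean) / std

print(robust.ravel())
print(standard.ravel())
```

With RobustScaler the four ordinary values keep a spacing of 0.5, whereas StandardScaler squeezes them almost onto the same point because the outlier inflates the standard deviation.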

ElasticNet uses the L1 and L2 penalties together:

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,
                             n_jobs=-1)
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Lasso score: 0.1115 (0.0074)

score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

ElasticNet score: 0.1116 (0.0074)

score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Kernel Ridge score: 0.1153 (0.0075)

score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Gradient Boosting score: 0.1177 (0.0080)

score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Xgboost score: 0.1151 (0.0060)

Create an averaging model:

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the original estimators are left untouched
        self.models_ = [clone(x) for x in self.models]

        for model in self.models_:
            model.fit(X, y)

        return self

    # Predict with every cloned model and average the results
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

averaged_models = AveragingModels(models = (ENet, GBoost, KRR,lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Averaged base models score: 0.1091 (0.0075)
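
The averaging idea can be checked in isolation. The sketch below is a hypothetical minimal restatement (the class name `MeanEnsemble`, the synthetic data, and the two base models are assumptions chosen for speed, not the models used above). It confirms that the ensemble prediction equals the mean of the individual predictions.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class MeanEnsemble(BaseEstimator, RegressorMixin):
    """Hypothetical minimal restatement of the averaging idea."""
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the estimators passed in stay unfitted
        self.models_ = [clone(m).fit(X, y) for m in self.models]
        return self

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models_], axis=0)

# Synthetic regression data (arbitrary coefficients and noise level)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

base = (LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0))
ens = MeanEnsemble(base).fit(X, y)

# Average the predictions of separately fitted copies by hand
singles = [clone(m).fit(X, y).predict(X) for m in base]
manual = (singles[0] + singles[1]) / 2
print(np.allclose(ens.predict(X), manual))
```

Averaging tends to help most when the base models make different kinds of errors, which is a plausible reason for mixing linear, kernel, and tree-based models in the list above.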

averaged_models.fit(train,y_train)
# Invert the log1p transform applied to SalePrice
pred = np.expm1(averaged_models.predict(test))
result=pd.DataFrame({'Id':test_ID,'SalePrice':pred})
result.to_csv('submission.csv',index=False)
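
Because the target was transformed with `np.log1p`, predictions are mapped back to prices with `np.expm1`. The quick numerical check below uses three price values taken from the describe() output earlier in the post.

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])
logged = np.log1p(prices)

print(np.expm1(logged) - prices)  # ~0: exact inverse of log1p
print(np.exp(logged) - prices)    # ~1: np.exp overshoots every price by one
```

The difference is tiny at these price levels, but using the exact inverse keeps the pipeline consistent.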

Result
