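Kaggle House Price Prediction: Model Summary
House price prediction task
Goal: predict the final sale price of each house from its attributes.
(1): Analyzing the data
- First, look at the features and the target: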
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('./data/train.csv')
df_train.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
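Features:
- MSSubClass: building class
- MSZoning: general zoning classification
- LotFrontage: linear feet of street connected to the property
- LotArea: lot size in square feet
- Street: type of road access
- Alley: type of alley access
- LotShape: general shape of the property
- LandContour: flatness of the property
- Utilities: type of utilities available (electricity, water, gas)
- LotConfig: lot configuration
- LandSlope: slope of the property
- Neighborhood: physical location within the Ames city limits
- Condition1: proximity to a main road or railroad
- Condition2: proximity to a main road or railroad (if a second is present)
- BldgType: type of dwelling
- HouseStyle: style of dwelling
- OverallQual: overall material and finish quality
- OverallCond: overall condition rating
- YearBuilt: original construction date
- YearRemodAdd: remodel date
- RoofStyle: type of roof
- RoofMatl: roof material
- Exterior1st: exterior covering on the house
- Exterior2nd: exterior covering on the house (if more than one material)
- MasVnrType: masonry veneer type
- MasVnrArea: masonry veneer area in square feet
- ExterQual: exterior material quality
- ExterCond: present condition of the exterior material
- Foundation: type of foundation
- BsmtQual: height of the basement
- BsmtCond: general condition of the basement
- BsmtExposure: walkout or garden-level basement walls
- BsmtFinType1: quality of the basement finished area
- BsmtFinSF1: type 1 finished square feet
- BsmtFinType2: quality of the second finished area (if present)
- BsmtFinSF2: type 2 finished square feet
- BsmtUnfSF: unfinished square feet of basement area
- TotalBsmtSF: total square feet of basement area
- Heating: type of heating
- HeatingQC: heating quality and condition
- CentralAir: central air conditioning
- Electrical: electrical system
- 1stFlrSF: first-floor square feet
- 2ndFlrSF: second-floor square feet
- LowQualFinSF: low-quality finished square feet (all floors)
- GrLivArea: above-grade (ground) living area in square feet
- BsmtFullBath: basement full bathrooms
- BsmtHalfBath: basement half bathrooms
- FullBath: full bathrooms above grade
- HalfBath: half bathrooms above grade
- BedroomAbvGr: number of bedrooms above basement level
- KitchenAbvGr: number of kitchens
- KitchenQual: kitchen quality
- TotRmsAbvGrd: total rooms above grade (bathrooms not included)
- Functional: home functionality rating
- Fireplaces: number of fireplaces
- FireplaceQu: fireplace quality
- GarageType: garage location
- GarageYrBlt: year the garage was built
- GarageFinish: interior finish of the garage
- GarageCars: garage size in car capacity
- GarageArea: garage size in square feet
- GarageQual: garage quality
- GarageCond: garage condition
- PavedDrive: paved driveway
- WoodDeckSF: wood deck area in square feet
- OpenPorchSF: open porch area in square feet
- EnclosedPorch: enclosed porch area in square feet
- 3SsnPorch: three-season porch area in square feet
- ScreenPorch: screen porch area in square feet
- PoolArea: pool area in square feet
- PoolQC: pool quality
- Fence: fence quality
- MiscFeature: miscellaneous feature not covered in other categories
- MiscVal: dollar value of the miscellaneous feature
- MoSold: month sold
- YrSold: year sold
- SaleType: type of sale
- SaleCondition: condition of sale
Target:
- SalePrice: sale price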
df_train['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
First, check whether the target follows a normal distribution.
sns.distplot(df_train['SalePrice']);
Compute the skewness and kurtosis:
# skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
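Analysis: the distribution is roughly bell-shaped but clearly right-skewed (skewness 1.88) and heavy-tailed (kurtosis 6.54), so the target should be transformed to bring it closer to a normal distribution.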
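To make the effect of skewness concrete, here is a minimal sketch on synthetic data (a log-normal sample, not the Kaggle file; the variable names are illustrative). It only shows that a log transform pulls a right-skewed sample toward symmetry.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# This is NOT the Kaggle data, just a stand-in with a similar shape.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()

print("skew before log1p: %.3f" % skew_before)
print("skew after  log1p: %.3f" % skew_after)
```

On a sample like this the skewness should drop from well above zero to roughly zero after the transform.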
# Above-grade living area (square feet)
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
A larger living area generally means a higher price, but a few outliers stand out here.
# Total basement area (square feet)
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
# Overall material and finish quality
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
# Original construction year
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
Analysis: newer houses generally sell for more, but construction year alone does not explain the price; historical factors also play an important role.
var = 'Neighborhood'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
# fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
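Select the 10 columns most strongly correlated with SalePrice (the target itself included) and examine how they correlate with one another.
k = 10
# numeric_only=True skips text columns, which pandas 2.x would otherwise reject
corrmat = df_train.corr(numeric_only=True)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values,
                 xticklabels=cols.values, cmap='YlGnBu')
plt.show()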
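The top-k selection behind the heatmap can be sketched on a small synthetic frame. The column names below are illustrative stand-ins, not the Kaggle columns, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; column names are illustrative only.
rng = np.random.default_rng(1)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "Area": area,
    "Noise": rng.normal(0, 1, n),
    "Zone": rng.choice(["A", "B"], n),  # text column, skipped by corr
    "Price": 100 * area + rng.normal(0, 5000, n),
})

# Correlation matrix over numeric columns only
corrmat = toy.corr(numeric_only=True)

# The k columns with the largest correlation to the target (target included)
top = corrmat.nlargest(2, "Price")["Price"].index.tolist()
print(top)
```

Because the target correlates perfectly with itself, it always appears first in the result, so k = 10 yields nine candidate features plus the target.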
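# scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
# seaborn 0.9 renamed the `size` argument to `height`
sns.pairplot(df_train[cols], height=2.5)
plt.show();
Overall, price rises as each feature's value increases, though outliers are present here as well.
(2): Checking the normality of the data
A log transform brings the data closer to a normal distribution:
Distribution before the transform: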
# `train` is the training DataFrame loaded in section (3)
sns.distplot(train['SalePrice'], fit=norm);
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Distribution plot (raw string so \mu and \sigma reach matplotlib unescaped)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
mu = 180932.92 and sigma = 79467.79
Distribution after the transform:
# Log transform: log(1 + x)
train["SalePrice"] = np.log1p(train["SalePrice"])
# Look at the new distribution
sns.distplot(train['SalePrice'], fit=norm);
# Fitted parameters
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Plot (raw string so \mu and \sigma reach matplotlib unescaped)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
(3): Data preprocessing
Inspect the missing values:
# missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Drop the Id column
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
# Check the size of the data
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
# Keep the Id columns for later; they are not used as features
train_ID = train['Id']
test_ID = test['Id']
# Drop Id
train.drop("Id", axis=1, inplace=True)
test.drop("Id", axis=1, inplace=True)
Find outliers
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Remove outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)
# Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
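The boolean-mask drop used above can be checked on a toy frame. The values below are made up for illustration; only the thresholds (4000 square feet, 300000 dollars) come from the text.

```python
import pandas as pd

# Toy data; the last row mimics a very large house sold at a low price.
toy = pd.DataFrame({
    "GrLivArea": [1200, 1800, 2500, 4600, 5000],
    "SalePrice": [150000, 210000, 320000, 745000, 180000],
})

# Rows that are both very large and cheap are treated as outliers
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
cleaned = toy.drop(toy[mask].index)

print(cleaned)
```

Note that the price threshold is expressed in raw dollars, so a filter like this only makes sense on SalePrice before any log transform is applied.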
Handling missing values:
Concatenate the training and test sets:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head(20)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Many features have missing values, so the fill strategy should follow the meaning of each feature.
# Pool quality
all_data["PoolQC"][:5]
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: PoolQC, dtype: object
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["PoolQC"][:5]
0 None
1 None
2 None
3 None
4 None
Name: PoolQC, dtype: object
# No miscellaneous feature
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
# Alley access
all_data["Alley"] = all_data["Alley"].fillna("None")
# Fence
all_data["Fence"] = all_data["Fence"].fillna("None")
# Fireplace
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
# Street frontage: fill with the median of the same neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
# Garage
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
# Basement
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
# Masonry veneer
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
# General zoning classification: fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# Functional: home functionality rating
all_data["Functional"] = all_data["Functional"].fillna("Typ")
# Electrical system
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
# Kitchen quality
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
# Exterior covering
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
# Sale type
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
# Building class
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
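The neighborhood-wise median fill for LotFrontage is the least obvious step above. A minimal sketch on a toy frame (the values are invented) shows what `groupby(...).transform(...)` does here:

```python
import numpy as np
import pandas as pd

# Toy frame: two neighborhoods, each with one missing frontage value
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# Within each neighborhood, replace NaN with that neighborhood's median
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda x: x.fillna(x.median()))
)

print(toy)
```

The missing value in neighborhood A becomes 70 (median of 60 and 80) and the one in B becomes 35, rather than both receiving a single global median.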
After filling, check the remaining missing values (the Utilities column is dropped first)
all_data = all_data.drop(['Utilities'], axis=1)
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head()
Some features are stored as numbers but are categorical rather than continuous, so convert them to strings.
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
Encode these columns as integer labels with LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)
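As a small illustration of what LabelEncoder produces, the sketch below encodes a handful of quality codes of the kind found in columns such as KitchenQual. The sample list is made up.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of quality codes, including the "None" fill value
quality = ["TA", "Gd", "Ex", "None", "Fa", "Gd"]

lbl = LabelEncoder()
codes = lbl.fit_transform(quality)

print(list(lbl.classes_))  # classes are stored in sorted order
print(list(codes))
```

The integer codes follow the sorted order of the strings, not the quality ranking, so "Ex" gets 0 and "TA" gets 4. This is worth keeping in mind when reading coefficients of linear models; an explicit ordinal mapping would be an alternative, though the original post does not use one.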
# Add a new feature: total area
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
Normalizing the continuous features
Look at the data first
from scipy.stats import norm, skew
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Select the features whose absolute skewness exceeds 0.75 and transform them with boxcox1p from scipy.special.
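# Keep only rows with |skew| > 0.75; masking the whole DataFrame with
# abs(skewness) > 0.75 would keep every row and transform all numeric features
skewness = skewness[skewness['Skew'].abs() > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)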
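For readers unfamiliar with `boxcox1p`, the sketch below checks it against its closed form, ((1 + x)^lam - 1) / lam, and against `log1p` for lam = 0. The input array is arbitrary; `lam = 0.15` is the value used in the text.

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
lam = 0.15  # same lambda as in the text

transformed = boxcox1p(x, lam)

# Closed form of the Box-Cox transform applied to 1 + x (for lam != 0)
manual = ((1.0 + x) ** lam - 1.0) / lam

print(np.allclose(transformed, manual))
# With lam = 0 the transform reduces to log(1 + x)
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
```

A small positive lambda therefore behaves like a slightly milder log transform, and zeros stay at zero, which suits area-type features where 0 means "absent".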
Then convert the remaining categorical columns to one-hot encoding with pd.get_dummies.
all_data = pd.get_dummies(all_data)
train = all_data[:ntrain]
test = all_data[ntrain:]
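One reason to encode the concatenated frame and split afterwards is that both parts end up with identical columns. A minimal sketch with invented data (the `Zone` values are placeholders):

```python
import pandas as pd

train_toy = pd.DataFrame({"Zone": ["RL", "RM", "RL"]})
test_toy = pd.DataFrame({"Zone": ["RL", "FV"]})  # "FV" never appears in train_toy

# Encode once on the combined frame, then split by position
n_train = len(train_toy)
combined = pd.concat((train_toy, test_toy)).reset_index(drop=True)
encoded = pd.get_dummies(combined)
train_enc = encoded[:n_train]
test_enc = encoded[n_train:]

print(list(encoded.columns))
# Encoding the parts separately would give mismatched columns:
print(list(pd.get_dummies(train_toy).columns))
print(list(pd.get_dummies(test_toy).columns))
```

Encoding each part on its own would produce different column sets, and a model fitted on one could not be applied to the other without extra alignment.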
(4): Modeling and predicting house prices
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# KernelRidge is needed for the KRR model
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
n_folds = 5

def rmsle_cv(model):
    # Pass the KFold object itself; get_n_splits() returns only the integer
    # fold count, which would silently drop shuffle and random_state
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
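The cross-validation helper can be tried in isolation on synthetic regression data. Everything below (data, coefficients, the choice of Lasso) is an assumption for illustration. It also shows that `get_n_splits` returns a plain integer, so handing that to `cv=` would not carry the shuffle settings along.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem with small noise (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits() only reports the number of folds
n_splits = kf.get_n_splits(X)

# Passing the KFold object keeps shuffle=True and random_state=42 in effect
scores = np.sqrt(-cross_val_score(Lasso(alpha=0.0005), X, y,
                                  scoring="neg_mean_squared_error", cv=kf))
print(n_splits, scores.mean(), scores.std())
```

When the target is log1p(price), as in this post, the RMSE computed this way corresponds to the RMSLE on the original prices.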
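make_pipeline chains the steps so they run in sequence. RobustScaler scales each feature with its median and interquartile range, which makes it better suited than standard scaling to data with outliers.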
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
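To see why RobustScaler is the safer choice with outliers, the sketch below scales a tiny made-up column containing one extreme value with both RobustScaler and StandardScaler (default settings assumed).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# RobustScaler defaults: subtract the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(robust.ravel())
print(standard.ravel())
```

With the robust scaling the four ordinary points stay spread out between -1 and 0.5, while standard scaling squeezes them into a very narrow band because the outlier inflates the mean and standard deviation.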
ElasticNet combines L1 and L2 regularization
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
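For reference, the objective that scikit-learn documents for ElasticNet can be written as follows, where $n$ is the number of samples, $\alpha$ stands for `alpha` and $\rho$ for `l1_ratio` (the symbols are chosen here for illustration):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With `l1_ratio=.9` the penalty is mostly L1, so the model behaves much like Lasso with a small ridge term added.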
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt', min_samples_leaf=15, min_samples_split=10, loss='huber', random_state =5)
# verbosity=0 and n_jobs=-1 replace the older silent=1 and nthread=-1 arguments
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0, n_jobs=-1)
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074)
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
ElasticNet score: 0.1116 (0.0074)
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Kernel Ridge score: 0.1153 (0.0075)
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Gradient Boosting score: 0.1177 (0.0080)
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Xgboost score: 0.1151 (0.0060)
Create an averaging model
# Mixins are listed before BaseEstimator, as scikit-learn recommends
class AveragingModels(RegressorMixin, TransformerMixin, BaseEstimator):
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the original estimators stay untouched
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Averaged base models score: 0.1091 (0.0075)
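The averaging idea can be verified on synthetic data with two simple base models. `AveragingSketch` is a hypothetical, trimmed-down stand-in for the class above, and the data and the choice of Lasso and Ridge are assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import Lasso, Ridge


class AveragingSketch(RegressorMixin, BaseEstimator):
    """Hypothetical, trimmed-down version of the averaging idea."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the estimators passed in are left unfitted
        self.models_ = [clone(m).fit(X, y) for m in self.models]
        return self

    def predict(self, X):
        # Element-wise mean of the base models' predictions
        return np.mean([m.predict(X) for m in self.models_], axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

base = (Lasso(alpha=0.01), Ridge(alpha=1.0))
avg = AveragingSketch(models=base).fit(X, y)

# Fit the same two models by hand and average their predictions
expected = (clone(base[0]).fit(X, y).predict(X)
            + clone(base[1]).fit(X, y).predict(X)) / 2
print(np.allclose(avg.predict(X), expected))
```

Averaging tends to help when the base models make different kinds of errors, which is a plausible reading of why the averaged score above is lower than any single model's score.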
averaged_models.fit(train, y_train)
# SalePrice was transformed with log1p, so invert it with expm1 rather than exp
pred = np.expm1(averaged_models.predict(test))
result = pd.DataFrame({'Id': test_ID, 'SalePrice': pred})
result.to_csv('submission.csv', index=False)
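Because the target was transformed with `log1p`, its exact inverse is `expm1`. The short check below uses the minimum, median and maximum prices from the `describe()` output earlier in the post.

```python
import numpy as np

# min, median and max SalePrice from the describe() output
prices = np.array([34900.0, 163000.0, 755000.0])
logged = np.log1p(prices)

back_expm1 = np.expm1(logged)  # exact inverse of log1p
back_exp = np.exp(logged)      # off by one: exp(log(1 + x)) = 1 + x

print(np.allclose(back_expm1, prices))
print(back_exp - prices)
```

Using `exp` overstates every prediction by exactly 1, which is negligible at these price levels but is still not the true inverse.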
Results