
Kaggle house price prediction

2018-07-29 22:57

Official page of the Kaggle house price competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Kaggle dataset description: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Step 1: Import the required packages

 
[code]
# coding:utf-8
# Note the difference between Windows (\\) and Linux (/) path separators when reading files

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
[/code]

Step 2: Read the data

Directory layout: the house price folder contains house_price.py and an input folder. The input folder holds the four files downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data: train.csv, test.csv, sample_submission.csv, and data_description.txt.

 
[code]
# Read the csv files into DataFrames so we can preprocess with pandas
# Uncomment the print statements to inspect the output

# Don't let pandas number the rows itself; use the Id column as the index
train_df = pd.read_csv(".\\input\\train.csv", index_col=0)
test_df = pd.read_csv('.\\input\\test.csv', index_col=0)
# print(train_df.shape)
# print(test_df.shape)
# print(train_df.head())  # shows the first five rows by default: 5 rows, 80 columns
# print(test_df.head())   # 5 rows, 79 columns
[/code]

Step 3: Merge the data (this is feature-engineering work!)

We do this mainly so that preprocessing with a single DataFrame is more convenient; once all the needed preprocessing is done, we split them apart again. (You would not do this in a real project.) First, SalePrice is our training target: it appears only in the training set, never in the test set. So we pull the SalePrice column out first to keep it out of the way.

[code]
# Look at the shape of SalePrice before and after a log1p transform
%matplotlib inline  # Jupyter magic
prices = pd.DataFrame({'price': train_df['SalePrice'],
                       'log(price+1)': np.log1p(train_df['SalePrice'])})
ps = prices.hist()
# plt.plot()
# plt.show()

# log1p is log(1+x); it smooths the label and brings it closer to a normal
# distribution, so the data better matches the assumptions we want to make,
# which allows for better statistical inference
y_train = np.log1p(train_df.pop('SalePrice'))  # pop the column test lacks, then merge train and test right away
all_df = pd.concat((train_df, test_df), axis=0)  # merge
# print(all_df.shape)    # inspect all_df: (2919, 79)
# print(y_train.head())  # inspect the transformed target
[/code]

Step 4: Variable transformation (feature engineering and data cleaning!)

Fixing variable types: MSSubClass is really a category (a class code). Although its values are numbers, they stand for distinct classes; pandas cannot know this, and when you build a DataFrame such numeric codes are read as numbers by default. That is misleading, so we convert the column back to strings.

 
[code]
print(all_df['MSSubClass'].dtypes)  # dtype('int64')
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)  # convert to string so we can inspect its distribution
print(all_df['MSSubClass'].dtypes)
print(all_df['MSSubClass'].value_counts())
[/code]

Turning categorical variables into a numerical representation: when we express categories as numbers, remember that numbers carry magnitude and order, so careless numeric encoding can mislead the model later. Instead, we can use one-hot encoding; pandas' built-in get_dummies method does one-hot in a single call.

[code]
# Variables the model cannot use directly (strings, discrete variables, etc.)
# get_dummies handles discrete variables, i.e. one-hot encoding
print(pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head())
all_dummy_df = pd.get_dummies(all_df)  # pandas picks out the discrete columns automatically, so we don't have to
print(all_dummy_df.head())
[/code]

Cleaning, step two: handle the numerical variables.

For example, some values are missing.

 
[code]
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head(11))  # inspect missing values, sorted by count
# Note: consult the data description when handling missing values. The right
# treatment depends on what the field means and how much is missing: sometimes
# missingness itself is meaningful and should become its own category; other
# times you should fill the value in or drop the feature.

# Here we fill with the mean
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)  # fillna fills the gaps
print(all_dummy_df.isnull().sum().sum())  # prints 0
[/code]

Standardize the numerical data:

This step is not strictly necessary; it depends on the model you plan to use. Generally, regression models need it: it is best to put the source data on a common scale so the gaps between features don't get too large. We should not standardize the one-hot 0/1 columns, since they only take the values 0 and 1; the targets are the features that were numerical to begin with.

 
[code]
numeric_cols = all_df.columns[all_df.dtypes != 'object']  # the columns that were numerical to begin with
print(numeric_cols)

# Standardize the numerical data to make it smoother and easier to compute,
# e.g. z-score standardization: (x - mean) / std
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()  # mean
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()     # standard deviation
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
[/code]

Step 5-1: Build models [house price prediction / Ridge / RandomForest / cross-validation]

 
[code]
# After preprocessing, split the data back into training and test sets
# (we merged train and test at the start of preprocessing)
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)  # prints (1460, 303) (1459, 303)

# Convert the DataFrames to NumPy arrays, which work better with sklearn
X_train = dummy_train_df.values
X_test = dummy_test_df.values
[/code]

Ridge Regression (a regression model; with a multi-feature data set you can feed in all the features directly without worrying about feature selection)

[code]
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score  # cross-validation to evaluate the model

# Not strictly necessary; just converts the DataFrames to NumPy arrays
X_train = dummy_train_df.values
X_test = dummy_test_df.values

# Use sklearn's built-in cross-validation to evaluate the model
alphas = np.logspace(-3, 2, 50)  # a geometric sequence: 50 values from 10^-3 to 10^2
test_scores = []  # cross-validation scores, used to find the best parameter
for alpha in alphas:
    clf = Ridge(alpha)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(alphas, test_scores)  # visualize parameter vs. score
plt.title('Alpha vs CV Error')
plt.show()
[/code]
  • Store all the CV scores and see which alpha value works best (hyperparameter tuning).

At roughly alpha = 10–20, the score reaches about 0.135.
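The plot only lets you eyeball the optimum. To read the best alpha off programmatically, a minimal sketch, assuming the alphas and test_scores lists from the loop above:

[code]
best_idx = np.argmin(test_scores)               # index of the lowest CV error
print(alphas[best_idx], test_scores[best_idx])  # best alpha and its score
[/code]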

Random Forest

[code]
from sklearn.ensemble import RandomForestRegressor

max_features = [.1, .3, .5, .7, .9, .99]
test_scores = []
for max_feat in max_features:
    clf = RandomForestRegressor(n_estimators=200, max_features=max_feat)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(max_features, test_scores)
plt.title('Max Features vs CV Error')
plt.show()
[/code]

At max_features = 0.3, the random forest reaches its best score, about 0.137.

Step 5-2: Build models [advanced: bagging / boosting / AdaBoost / XGBoost]

From the modeling side, we use three models (model frameworks): bagging, boosting (AdaBoost), and XGBoost.

Split the data set back into training/test sets (as in Step 5-1):

 

 
[code]
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)

# Convert the DataFrames to NumPy arrays, which work better with sklearn
X_train = dummy_train_df.values
X_test = dummy_test_df.values
[/code]

1. Bagging:

A single model only gets you so far, so we prefer to combine many models into one "ensemble" to get the best result. From the experiment above we know Ridge(alpha=15) gave us the best score.

 
[code]
from sklearn.ensemble import BaggingRegressor

ridge = Ridge(alpha=15)
# Bagging pools many small estimators; each trains on a random subset of the
# data, and their outputs are combined (majority vote / averaging).
# Bagging is an algorithmic framework.
params = [1, 10, 15, 20, 25, 30, 40]  # how many weak estimators
test_scores = []
for param in params:
    clf = BaggingRegressor(n_estimators=param, base_estimator=ridge)  # ridge as the weak estimator: 0.132 at params=25
    # clf = BaggingRegressor(n_estimators=param)  # bagging's default DecisionTree: best around 0.140
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params, test_scores)
plt.title('n_estimators vs CV Error')
plt.show()

br = BaggingRegressor(base_estimator=ridge, n_estimators=25)
br.fit(X_train, y_train)
y_final = np.expm1(br.predict(X_test))
[/code]

2. Boosting:

Boosting is, in theory, a step up from bagging. It also gathers a pile of estimators, but arranges them in sequence: each estimator puts higher weight on the samples the previous one classified poorly, so the next estimator learns those parts more "deeply".

 
[code]
from sklearn.ensemble import AdaBoostRegressor

params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
test_scores = []
for param in params:
    clf = AdaBoostRegressor(base_estimator=ridge, n_estimators=param)  # ~0.132 at params=25, but unstable; needs more tuning or more weak estimators
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title('n_estimators vs CV Error')
plt.show()
[/code]

3. XGBoost (a Kaggle favorite)

This is still a model in the boosting framework, but one that makes many improvements.
 

 
[code]
from xgboost import XGBRegressor

params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)  # at max_depth=5 the error reaches 0.127
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title('max_depth vs CV Error')
plt.show()

xgb = XGBRegressor(max_depth=5)
xgb.fit(X_train, y_train)
y_final = np.expm1(xgb.predict(X_test))
[/code]


Step 6: Ensemble 

Here we use a stacking-style idea to combine the strengths of two or more models.

First, we take the best parameters found above and build our final models:

 
[code]
ridge = Ridge(alpha=15)
rf = RandomForestRegressor(n_estimators=500, max_features=.3)
ridge.fit(X_train, y_train)
rf.fit(X_train, y_train)

# We applied log(1+x) to the labels earlier, so we must exp the predictions
# back and subtract that '1' (expm1)
y_ridge = np.expm1(ridge.predict(X_test))
y_rf = np.expm1(rf.predict(X_test))
# Use all the models' predictions as new input; the simplest approach is to just average them
y_final = (y_ridge + y_rf) / 2
[/code]
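The code above is really just averaging, the simplest blend. Full stacking would go one step further and train a meta-model on the base models' out-of-fold predictions. A minimal sketch of that idea, assuming the X_train, y_train, X_test and the fitted ridge/rf from above (an illustration, not part of the original solution):

[code]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions on the training set become features for the meta-model
oof_ridge = cross_val_predict(Ridge(alpha=15), X_train, y_train, cv=5)
oof_rf = cross_val_predict(RandomForestRegressor(n_estimators=500, max_features=.3), X_train, y_train, cv=5)
meta = LinearRegression().fit(np.column_stack([oof_ridge, oof_rf]), y_train)  # learns how to weight the two models

# Apply the same recipe to the test set, then undo the log1p as before
meta_X_test = np.column_stack([ridge.predict(X_test), rf.predict(X_test)])
y_final_stacked = np.expm1(meta.predict(meta_X_test))
[/code]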

Step 7: Submit the results

Mind the submission format! Small details like capitalization, the index, and the column headers all matter.

 
[code]
submission_df = pd.DataFrame(data={'Id': test_df.index, 'SalePrice': y_final})
print(submission_df.head(10))
submission_df.to_csv('.\\input\\submission.csv', columns=['Id', 'SalePrice'], index=False)
[/code]

Complete practice code, Step 5-1 version:

 
[code]
# coding:utf-8
# Note the difference between Windows (\\) and Linux (/) path separators

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Directory layout: the house price folder contains house_price.py and an input folder
# The input folder holds train.csv, test.csv, sample_submission.csv and data_description.txt,
# downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

# step1: inspect the source data, read it in, convert the csv data to DataFrames
train_df = pd.read_csv(".\\input\\train.csv", index_col=0)
test_df = pd.read_csv('.\\input\\test.csv', index_col=0)
# print(train_df.shape)
# print(test_df.shape)
# print(train_df.head())  # first five rows by default: 5 rows, 80 columns
# print(test_df.head())   # 5 rows, 79 columns

# step2: merge the data and preprocess
prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
# ps = prices.hist()
# plt.plot()
# plt.show()

y_train = np.log1p(train_df.pop('SalePrice'))
all_df = pd.concat((train_df, test_df), axis=0)
# print(all_df.shape)
# print(y_train.head())

# step3: variable transformation
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
print(all_df['MSSubClass'].dtypes)
print(all_df['MSSubClass'].value_counts())
# Turn categorical variables into a numerical representation
# get_dummies does one-hot encoding in a single call
print(pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head())
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())

# Handle the numerical variables
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head(11))
# Here we fill missing values with the mean
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())

# Standardize the numerical data
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std

# step4: build models
# After preprocessing, split the data back into training and test sets
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)

# Convert the DataFrames to NumPy arrays, which work better with sklearn
X_train = dummy_train_df.values
X_test = dummy_test_df.values

# Ridge Regression
# alphas = np.logspace(-3, 2, 50)
# test_scores = []
# for alpha in alphas:
#     clf = Ridge(alpha)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(alphas, test_scores)
# plt.title('Alpha vs CV Error')
# plt.show()

# random forest
# max_features = [.1, .3, .5, .7, .9, .99]
# test_scores = []
# for max_feat in max_features:
#     clf = RandomForestRegressor(n_estimators=200, max_features=max_feat)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(max_features, test_scores)
# plt.title('Max Features vs CV Error')
# plt.show()

# Step 5: ensemble
# Use a stacking-style idea to combine the strengths of two or more models

ridge = Ridge(alpha=15)
rf = RandomForestRegressor(n_estimators=500, max_features=.3)
ridge.fit(X_train, y_train)
rf.fit(X_train, y_train)

y_ridge = np.expm1(ridge.predict(X_test))
y_rf = np.expm1(rf.predict(X_test))

y_final = (y_ridge + y_rf) / 2

# Step 6: write out the submission
submission_df = pd.DataFrame(data={'Id': test_df.index, 'SalePrice': y_final})
print(submission_df.head(10))
submission_df.to_csv('.\\input\\submission.csv', columns=['Id', 'SalePrice'], index=False)
[/code]

Complete practice code, Step 5-2 version:

 
[code]
# coding:utf-8
# Note the difference between Windows (\\) and Linux (/) path separators

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

# Directory layout: the house price folder contains house_price.py and an input folder
# The input folder holds train.csv, test.csv, sample_submission.csv and data_description.txt,
# downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

# step1: inspect the source data, read it in, convert the csv data to DataFrames
train_df = pd.read_csv("./input/train.csv", index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
# print(train_df.shape)
# print(test_df.shape)
# print(train_df.head())  # first five rows by default: 5 rows, 80 columns
# print(test_df.head())   # 5 rows, 79 columns

# step2: merge the data and preprocess
prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
# ps = prices.hist()
# plt.plot()
# plt.show()

y_train = np.log1p(train_df.pop('SalePrice'))
all_df = pd.concat((train_df, test_df), axis=0)
# print(all_df.shape)
# print(y_train.head())

# step3: variable transformation
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
print(all_df['MSSubClass'].dtypes)
print(all_df['MSSubClass'].value_counts())
# Turn categorical variables into a numerical representation
# get_dummies does one-hot encoding in a single call
print(pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head())
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())

# Handle the numerical variables
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head(11))
# Here we fill missing values with the mean
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())

# Standardize the numerical data
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std

# step4: build models
# After preprocessing, split the data back into training and test sets
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)

# Convert the DataFrames to NumPy arrays, which work better with sklearn
X_train = dummy_train_df.values
X_test = dummy_test_df.values

# Ridge Regression
# alphas = np.logspace(-3, 2, 50)
# test_scores = []
# for alpha in alphas:
#     clf = Ridge(alpha)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(alphas, test_scores)
# plt.title('Alpha vs CV Error')
# plt.show()

# random forest
# max_features = [.1, .3, .5, .7, .9, .99]
# test_scores = []
# for max_feat in max_features:
#     clf = RandomForestRegressor(n_estimators=200, max_features=max_feat)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(max_features, test_scores)
# plt.title('Max Features vs CV Error')
# plt.show()

# ensemble
# Use a stacking-style idea to combine the strengths of two or more models
# ridge = Ridge(alpha=15)
# rf = RandomForestRegressor(n_estimators=500, max_features=.3)
# ridge.fit(X_train, y_train)
# rf.fit(X_train, y_train)
# y_ridge = np.expm1(ridge.predict(X_test))
# y_rf = np.expm1(rf.predict(X_test))
# y_final = (y_ridge + y_rf) / 2

# A more advanced ensemble
ridge = Ridge(alpha=15)
# Bagging pools many small estimators; each trains on a random subset of the
# data, and their outputs are combined (majority vote / averaging).
# Bagging is an algorithmic framework.
# params = [1, 10, 15, 20, 25, 30, 40]
# test_scores = []
# for param in params:
#     clf = BaggingRegressor(base_estimator=ridge, n_estimators=param)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(params, test_scores)
# plt.title('n_estimators vs CV Error')
# plt.show()

# br = BaggingRegressor(base_estimator=ridge, n_estimators=25)
# br.fit(X_train, y_train)
# y_final = np.expm1(br.predict(X_test))

# Boosting is a step up from bagging: it also gathers a pile of estimators, but
# arranges them in sequence; each estimator puts higher weight on the samples
# the previous one got wrong, so the next one learns those parts more deeply.
# params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
# test_scores = []
# for param in params:
#     clf = AdaBoostRegressor(base_estimator=ridge, n_estimators=param)
#     test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
#     test_scores.append(np.mean(test_score))
# plt.plot(params, test_scores)
# plt.title('n_estimators vs CV Error')
# plt.show()

# xgboost
params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title('max_depth vs CV Error')
plt.show()

xgb = XGBRegressor(max_depth=5)
xgb.fit(X_train, y_train)
y_final = np.expm1(xgb.predict(X_test))

# Write out the submission
submission_df = pd.DataFrame(data={'Id': test_df.index, 'SalePrice': y_final})
print(submission_df.head(10))
submission_df.to_csv('./input/submission_xgboosting.csv', columns=['Id', 'SalePrice'], index=False)
[/code]

Summary:

Not every data source comes neatly shaped as x = [var1, var2, var3, ...].

Approach:

[non-standard, real-world data] --> dimensionality reduction, feature extraction, numerical encoding (feature engineering) --> [high-dimensional data]

[text data] --> word counts, word frequencies, semantic networks, etc. (feature engineering) --> [data]

[image data] --> RGB bitmap --> [array]

[video data] --> split into an [audio track] and a [video track] --> [audio track: waveforms / speech recognition] [video track: a sequence of frames / image recognition]
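As a tiny illustration of the text-data row above, a sketch that turns raw sentences into word-count features with scikit-learn's CountVectorizer (the sentences are made up for the example):

[code]
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the house has three bedrooms",  # hypothetical documents
          "the garage fits two cars",
          "three bedrooms and a garage"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)       # sparse matrix: documents x vocabulary
print(vec.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                  # word counts per document
[/code]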

 

