Analyzing Data for XGBoost Feature Engineering
2018-03-05 21:58
In this exploration notebook, we shall try to uncover basic information about the dataset, which will help us build our models and features. Let us first import the necessary modules.

In [1]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

color = sns.color_palette()
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

Loading the training dataset and looking at the top few rows.

In [2]:
train_df = pd.read_json("../input/train.json")
train_df.head()

Out[2]:
| | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | longitude | manager_id | photos | price | street_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | [https://photos.renthop.com/2/7211212_1ed4542e... | 3000 | 792 Metropolitan Avenue |
| 10000 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 | | Columbus Avenue | [Doorman, Elevator, Fitness Center, Cats Allow... | low | 40.7947 | 7150865 | -73.9667 | 7533621a882f71e25173b27e3139d83d | [https://photos.renthop.com/2/7150865_be3306c5... | 5465 | 808 Columbus Avenue |
| 100004 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | [Laundry In Building, Dishwasher, Hardwood Flo... | high | 40.7388 | 6887163 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | [https://photos.renthop.com/2/6887163_de85c427... | 2850 | 241 W 13 Street |
| 100007 | 1.0 | 1 | 28d9ad350afeaab8027513a3e52ac8d5 | 2016-04-18 02:22:02 | Building Amenities - Garage - Garden - fitness... | East 49th Street | [Hardwood Floors, No Fee] | low | 40.7539 | 6888711 | -73.9677 | 1067e078446a7897d2da493d2f741316 | [https://photos.renthop.com/2/6888711_6e660cee... | 3275 | 333 East 49th Street |
| 100013 | 1.0 | 4 | 0 | 2016-04-28 01:32:41 | Beautifully renovated 3 bedroom flex 4 bedroom... | West 143rd Street | [Pre-War] | low | 40.8241 | 6934781 | -73.9493 | 98e13ad4b495b9613cef886d79a6291f | [https://photos.renthop.com/2/6934781_1fa4b41a... | 3350 | 500 West 143rd Street |
Wow, this dataset looks interesting! It has numerical features, categorical features, a date feature, text features, and image features. Let us load the test data as well and check the number of rows in train and test to start with.

In [3]:
test_df = pd.read_json("../input/test.json")
print("Train Rows : ", train_df.shape[0])
print("Test Rows : ", test_df.shape[0])
Train Rows : 49352
Test Rows : 74659

Target Variable

Before delving more into the features, let us first have a look at the target variable, 'interest_level'.

In [4]:
int_level = train_df['interest_level'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=int_level.index, y=int_level.values, alpha=0.8, color=color[1])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Interest level', fontsize=12)
plt.show()
Interest level is low for most of the cases, followed by medium and then high, which makes sense.
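Since XGBoost expects a numeric target, the three interest levels are usually mapped to integers before training. A minimal sketch, entirely our own addition (the column name interest_num and the ordering are assumptions, not part of the original notebook):

# Hypothetical numeric encoding of the target for later use with XGBoost
target_map = {'low': 0, 'medium': 1, 'high': 2}
train_df['interest_num'] = train_df['interest_level'].map(target_map)
print(train_df['interest_num'].value_counts(normalize=True))  # class proportions

Now let us start looking into the numerical features present in the dataset. The numerical features are:

- bathrooms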
- bedrooms
- price
- latitude
- longitude
The last two are not really numerical variables, but for now let us just treat them as numerical.

Bathrooms:

Let us first start with bathrooms.

In [5]:
cnt_srs = train_df['bathrooms'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[0])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bathrooms', fontsize=12)
plt.show()
In [6]:
# Cap bathrooms at 3 (.ix was removed from modern pandas; .loc does the same job)
train_df.loc[train_df['bathrooms'] > 3, 'bathrooms'] = 3

plt.figure(figsize=(8,4))
sns.violinplot(x='interest_level', y='bathrooms', data=train_df)
plt.xlabel('Interest level', fontsize=12)
plt.ylabel('bathrooms', fontsize=12)
plt.show()
Looks like it is evenly distributed across the interest levels. Now let us look at the next feature, 'bedrooms'.

Bedrooms:

In [7]:
cnt_srs = train_df['bedrooms'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[2])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bedrooms', fontsize=12)
plt.show()
In [8]:
plt.figure(figsize=(8,6))
sns.countplot(x='bedrooms', hue='interest_level', data=train_df)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bedrooms', fontsize=12)
plt.show()
Price:

Now let us look at the price variable distribution.

In [9]:
plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.price.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()
Looks like there are some outliers in this feature. So let us remove them and then plot again.

In [10]:
# Cap price at its 99th percentile to remove outliers
ulimit = np.percentile(train_df.price.values, 99)
train_df.loc[train_df['price'] > ulimit, 'price'] = ulimit

plt.figure(figsize=(8,6))
sns.histplot(train_df.price.values, bins=50, kde=True)  # distplot was removed from seaborn
plt.xlabel('price', fontsize=12)
plt.show()
The distribution is right skewed, as we can see.
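Since the capped price is still right skewed, a log1p transform is a common preprocessing step. This sketch is our own addition, not part of the original notebook (the column name log_price is hypothetical):

# log(1 + price) compresses the long right tail (our own addition)
train_df['log_price'] = np.log1p(train_df['price'])

plt.figure(figsize=(8,6))
sns.histplot(train_df['log_price'].values, bins=50, kde=True)
plt.xlabel('log(1 + price)', fontsize=12)
plt.show()

Note that tree ensembles such as XGBoost are largely invariant to monotone transforms of a single feature, so this mainly helps visualization and any linear models built alongside.

Now let us look at the latitude and longitude variables.

Latitude & Longitude:

In [11]: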
# Clip latitude to its 1st-99th percentile range
llimit = np.percentile(train_df.latitude.values, 1)
ulimit = np.percentile(train_df.latitude.values, 99)
train_df.loc[train_df['latitude'] < llimit, 'latitude'] = llimit
train_df.loc[train_df['latitude'] > ulimit, 'latitude'] = ulimit

plt.figure(figsize=(8,6))
sns.histplot(train_df.latitude.values, bins=50)
plt.xlabel('latitude', fontsize=12)
plt.show()
So the latitude values are primarily between 40.6 and 40.9. Now let us look at the longitude values.

In [12]:
# Clip longitude to its 1st-99th percentile range
llimit = np.percentile(train_df.longitude.values, 1)
ulimit = np.percentile(train_df.longitude.values, 99)
train_df.loc[train_df['longitude'] < llimit, 'longitude'] = llimit
train_df.loc[train_df['longitude'] > ulimit, 'longitude'] = ulimit

plt.figure(figsize=(8,6))
sns.histplot(train_df.longitude.values, bins=50)
plt.xlabel('longitude', fontsize=12)
plt.show()
The longitude values range between -74.02 and -73.8, so the data corresponds to New York City.
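With the listings clustered in New York City, a simple engineered feature is the distance from an assumed city-center point. A minimal sketch; the reference coordinates (Times Square) and the column name dist_from_center are our own assumptions, not part of the original notebook:

# Straight-line distance in degrees from an assumed center point (our own addition)
center_lat, center_lon = 40.7589, -73.9851  # hypothetical reference: Times Square
train_df['dist_from_center'] = np.sqrt((train_df['latitude'] - center_lat) ** 2 +
                                       (train_df['longitude'] - center_lon) ** 2)

Now let us plot the same on a map. Thanks to this kernel by Dotman.

In [13]: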
from mpl_toolkits.basemap import Basemap
from matplotlib import cm

west, south, east, north = -74.02, 40.64, -73.85, 40.86

fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111)
m = Basemap(projection='merc', llcrnrlat=south, urcrnrlat=north,
            llcrnrlon=west, urcrnrlon=east, lat_ts=south, resolution='i')
x, y = m(train_df['longitude'].values, train_df['latitude'].values)
m.hexbin(x, y, gridsize=200, bins='log', cmap=cm.YlOrRd_r)
Created:

Now let us look at the date column, 'created'.

In [14]:
train_df["created"] = pd.to_datetime(train_df["created"]) train_df["date_created"] = train_df["created"].dt.date cnt_srs = train_df['date_created'].value_counts() plt.figure(figsize=(12,4)) ax = plt.subplot(111) ax.bar(cnt_srs.index, cnt_srs.values, alpha=0.8) ax.xaxis_date() plt.xticks(rotation='vertical') plt.show()
So we have data from April to June 2016 in our train set. Now let us look at the test set as well and see if it covers the same date range.

In [15]:
test_df["created"] = pd.to_datetime(test_df["created"]) test_df["date_created"] = test_df["created"].dt.date cnt_srs = test_df['date_created'].value_counts() plt.figure(figsize=(12,4)) ax = plt.subplot(111) ax.bar(cnt_srs.index, cnt_srs.values, alpha=0.8) ax.xaxis_date() plt.xticks(rotation='vertical') plt.show()
Looks very similar to the train set dates, so we are good to go! We shall also look at the hour-wise listing trend (just for fun).

In [16]:
train_df["hour_created"] = train_df["created"].dt.hour cnt_srs = train_df['hour_created'].value_counts() plt.figure(figsize=(12,6)) sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8, color=color[3]) plt.xticks(rotation='vertical') plt.show()
Looks like listings are created during the early hours of the day (1am to 7am). Maybe that is when site traffic is low, so the batch updates happen then.
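Since 'created' is already parsed to a datetime, other date parts can be extracted the same way as the hour. A minimal sketch of two extra features; the column names are our own and not part of the original notebook:

# Hypothetical extra date-part features, mirroring hour_created above
train_df["month_created"] = train_df["created"].dt.month
train_df["weekday_created"] = train_df["created"].dt.weekday  # 0 = Monday

Now let us look at some of the categorical variables.

Display Address:

In [17]: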
cnt_srs = train_df.groupby('display_address')['display_address'].count()

for i in [2, 10, 50, 100, 500]:
    print('Display_address that appear less than {} times: {}%'.format(i, round((cnt_srs < i).mean() * 100, 2)))

plt.figure(figsize=(12, 6))
plt.hist(cnt_srs.values, bins=100, log=True, alpha=0.9)
plt.xlabel('Number of times display_address appeared', fontsize=12)
plt.ylabel('log(Count)', fontsize=12)
plt.show()
Display_address that appear less than 2 times: 63.22%
Display_address that appear less than 10 times: 89.6%
Display_address that appear less than 50 times: 97.73%
Display_address that appear less than 100 times: 99.26%
Display_address that appear less than 500 times: 100.0%
Most of the display addresses occur fewer than 100 times in the given dataset, and none occur more than 500 times.
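With such high cardinality, one-hot encoding display_address would explode the feature space, so a count (frequency) encoding is a common alternative. A minimal sketch, our own addition (the column name display_address_count is hypothetical):

# Hypothetical frequency encoding: replace each address with its occurrence count
addr_counts = train_df['display_address'].value_counts()
train_df['display_address_count'] = train_df['display_address'].map(addr_counts)

Number of Photos:

This competition also has a huge database of photos of the listings. To start with, let us look at the number of photos given for the listings.

In [18]: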
train_df["num_photos"] = train_df["photos"].apply(len) cnt_srs = train_df['num_photos'].value_counts() plt.figure(figsize=(12,6)) sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8) plt.xlabel('Number of Photos', fontsize=12) plt.ylabel('Number of Occurrences', fontsize=12) plt.show()
In [19]:
# Cap num_photos at 12
train_df.loc[train_df['num_photos'] > 12, 'num_photos'] = 12

plt.figure(figsize=(12,6))
sns.violinplot(x="num_photos", y="interest_level", data=train_df, order=['low','medium','high'])
plt.xlabel('Number of Photos', fontsize=12)
plt.ylabel('Interest Level', fontsize=12)
plt.show()
Let us now look at the 'number of features' variable and see its distribution.

Number of features:

In [20]:
train_df["num_features"] = train_df["features"].apply(len) cnt_srs = train_df['num_features'].value_counts() plt.figure(figsize=(12,6)) sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8) plt.ylabel('Number of Occurrences', fontsize=12) plt.xlabel('Number of features', fontsize=12) plt.show()
In [21]:
# Cap num_features at 17
train_df.loc[train_df['num_features'] > 17, 'num_features'] = 17

plt.figure(figsize=(12,10))
sns.violinplot(y="num_features", x="interest_level", data=train_df, order=['low','medium','high'])
plt.xlabel('Interest Level', fontsize=12)
plt.ylabel('Number of features', fontsize=12)
plt.show()
Word Clouds:

Next we shall look into some of the text features.

In [22]:
from wordcloud import WordCloud

text = ''
text_da = ''
text_desc = ''
for ind, row in train_df.iterrows():
    for feature in row['features']:
        text = " ".join([text, "_".join(feature.strip().split(" "))])
    text_da = " ".join([text_da, "_".join(row['display_address'].strip().split(" "))])
    #text_desc = " ".join([text_desc, row['description']])
text = text.strip()
text_da = text_da.strip()
text_desc = text_desc.strip()

# Wordcloud for features
plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for features", fontsize=30)
plt.axis("off")
plt.show()

# Wordcloud for display address
plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text_da)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Display Address", fontsize=30)
plt.axis("off")
plt.show()
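To connect this exploration back to the title, here is a hedged sketch of how the engineered columns above could feed a baseline XGBoost model. The feature list, hyperparameters, and use of the scikit-learn wrapper are all our own assumptions, not part of the original notebook:

import xgboost as xgb

# Assemble the numeric and engineered features explored above (assumed feature set)
feature_cols = ['bathrooms', 'bedrooms', 'price', 'latitude', 'longitude',
                'num_photos', 'num_features', 'hour_created']
X = train_df[feature_cols]
y = train_df['interest_level'].map({'low': 0, 'medium': 1, 'high': 2})  # numeric target

# Hypothetical baseline hyperparameters
model = xgb.XGBClassifier(objective='multi:softprob', n_estimators=200,
                          max_depth=6, learning_rate=0.1)
model.fit(X, y)
pred_probs = model.predict_proba(X)  # per-class probabilities (low, medium, high)

In practice one would hold out a validation split and evaluate with multi-class log loss, the competition's metric, before adding the text and image features to the mix.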