
XGBoost feature engineering: analyzing the data

2018-03-05 21:58 645 views
In this exploration notebook, we shall try to uncover basic information about the dataset that will help us build our models and features.
Let us first import the necessary modules.
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'
Loading the training dataset and looking at the top few rows.
In [2]:
train_df = pd.read_json("../input/train.json")
train_df.head()
Out[2]:
 | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | longitude | manager_id | photos | price | street_address
10 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | [https://photos.renthop.com/2/7211212_1ed4542e... | 3000 | 792 Metropolitan Avenue
10000 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 |  | Columbus Avenue | [Doorman, Elevator, Fitness Center, Cats Allow... | low | 40.7947 | 7150865 | -73.9667 | 7533621a882f71e25173b27e3139d83d | [https://photos.renthop.com/2/7150865_be3306c5... | 5465 | 808 Columbus Avenue
100004 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | [Laundry In Building, Dishwasher, Hardwood Flo... | high | 40.7388 | 6887163 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | [https://photos.renthop.com/2/6887163_de85c427... | 2850 | 241 W 13 Street
100007 | 1.0 | 1 | 28d9ad350afeaab8027513a3e52ac8d5 | 2016-04-18 02:22:02 | Building Amenities - Garage - Garden - fitness... | East 49th Street | [Hardwood Floors, No Fee] | low | 40.7539 | 6888711 | -73.9677 | 1067e078446a7897d2da493d2f741316 | [https://photos.renthop.com/2/6888711_6e660cee... | 3275 | 333 East 49th Street
100013 | 1.0 | 4 | 0 | 2016-04-28 01:32:41 | Beautifully renovated 3 bedroom flex 4 bedroom... | West 143rd Street | [Pre-War] | low | 40.8241 | 6934781 | -73.9493 | 98e13ad4b495b9613cef886d79a6291f | [https://photos.renthop.com/2/6934781_1fa4b41a... | 3350 | 500 West 143rd Street
Wow. This dataset looks interesting. It has numerical features, categorical features, a date feature, text features and image features.
Let us load the test data as well and check the number of rows in train and test to start with.
In [3]:
test_df = pd.read_json("../input/test.json")
print("Train Rows : ", train_df.shape[0])
print("Test Rows : ", test_df.shape[0])
Train Rows :  49352
Test Rows :  74659
Target Variable
Before delving further into the features, let us first have a look at the target variable, 'interest_level'.
In [4]:
int_level = train_df['interest_level'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=int_level.index, y=int_level.values, alpha=0.8, color=color[1])  # keyword x/y: recent seaborn rejects positional data
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Interest level', fontsize=12)
plt.show()
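The class balance shown in the bar chart can also be read as proportions. The following is a minimal sketch on a small made-up frame (`toy_df` is a hypothetical stand-in for `train_df`, and its values are illustrative, not the real counts):

```python
import pandas as pd

# Hypothetical toy data standing in for train_df['interest_level'].
toy_df = pd.DataFrame(
    {"interest_level": ["low", "low", "low", "medium", "medium", "high"]}
)

# normalize=True returns the fraction of rows per class instead of raw counts.
int_level_share = toy_df["interest_level"].value_counts(normalize=True)
print(int_level_share.round(3))
```

On the real data, replacing `toy_df` with `train_df` should give the share of each interest level, which is handy when judging how imbalanced the target is.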


Interest level is low for most listings, followed by medium and then high, which makes sense.
Now let us start looking into the numerical features present in the dataset. The numerical features are:
bathrooms
bedrooms
price
latitude
longitude
The last two are geographic coordinates rather than ordinary numerical quantities, but for now let us treat them as numerical.
Bathrooms:
Let us first start with bathrooms.
In [5]:
cnt_srs = train_df['bathrooms'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[0])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bathrooms', fontsize=12)
plt.show()


In [6]:
train_df.loc[train_df['bathrooms'] > 3, 'bathrooms'] = 3  # cap at 3; .loc replaces .ix, which was removed from pandas
plt.figure(figsize=(8,4))
sns.violinplot(x='interest_level', y='bathrooms', data=train_df)
plt.xlabel('Interest level', fontsize=12)
plt.ylabel('bathrooms', fontsize=12)
plt.show()


The bathroom counts look fairly evenly distributed across the interest levels. Now let us look at the next feature, 'bedrooms'.
Bedrooms:
In [7]:
cnt_srs = train_df['bedrooms'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[2])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bedrooms', fontsize=12)
plt.show()


In [8]:
plt.figure(figsize=(8,6))
sns.countplot(x='bedrooms', hue='interest_level', data=train_df)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('bedrooms', fontsize=12)
plt.show()
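Raw counts per bedroom bucket are dominated by the most common bedroom values, so it can help to look at row-wise shares as well. A minimal sketch, assuming a small hypothetical frame (`toy_df`) with the same two columns as `train_df`:

```python
import pandas as pd

# Hypothetical toy data; the real columns come from train_df.
toy_df = pd.DataFrame({
    "bedrooms": [0, 0, 1, 1, 1, 2, 2, 2],
    "interest_level": ["low", "high", "low", "low", "medium",
                       "low", "medium", "high"],
})

# normalize="index" makes each bedrooms row sum to 1, so the rows are
# comparable even when some bedroom counts are much more frequent.
share_by_bedrooms = pd.crosstab(
    toy_df["bedrooms"], toy_df["interest_level"], normalize="index"
)
print(share_by_bedrooms.round(2))
```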


Price:
Now let us look at the distribution of the price variable.
In [9]:
plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.price.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()


Looks like there are some outliers in this feature. So let us cap them at the 99th percentile and then plot again.
In [10]:
ulimit = np.percentile(train_df.price.values, 99)
train_df.loc[train_df['price'] > ulimit, 'price'] = ulimit  # cap prices at the 99th percentile

plt.figure(figsize=(8,6))
sns.histplot(train_df.price.values, bins=50, kde=True, stat='density')  # histplot replaces the deprecated distplot
plt.xlabel('price', fontsize=12)
plt.show()


The distribution is right-skewed, as we can see.
Now let us look at the latitude and longitude variables.
Latitude & Longitude:
In [11]:
llimit = np.percentile(train_df.latitude.values, 1)
ulimit = np.percentile(train_df.latitude.values, 99)
train_df.loc[train_df['latitude'] < llimit, 'latitude'] = llimit
train_df.loc[train_df['latitude'] > ulimit, 'latitude'] = ulimit

plt.figure(figsize=(8,6))
sns.histplot(train_df.latitude.values, bins=50)  # histplot replaces the deprecated distplot
plt.xlabel('latitude', fontsize=12)
plt.show()


So the latitude values are primarily between 40.6 and 40.9. Now let us look at the longitude values.
In [12]:
llimit = np.percentile(train_df.longitude.values, 1)
ulimit = np.percentile(train_df.longitude.values, 99)
train_df.loc[train_df['longitude'] < llimit, 'longitude'] = llimit
train_df.loc[train_df['longitude'] > ulimit, 'longitude'] = ulimit

plt.figure(figsize=(8,6))
sns.histplot(train_df.longitude.values, bins=50)  # histplot replaces the deprecated distplot
plt.xlabel('longitude', fontsize=12)
plt.show()


The longitude values range between -74.02 and -73.8, so the data corresponds to New York City.
Now let us plot the same on a map. Thanks to Dotman's kernel for the approach.
In [13]:
from mpl_toolkits.basemap import Basemap
from matplotlib import cm

west, south, east, north = -74.02, 40.64, -73.85, 40.86

fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111)
m = Basemap(projection='merc', llcrnrlat=south, urcrnrlat=north,
            llcrnrlon=west, urcrnrlon=east, lat_ts=south, resolution='i')
x, y = m(train_df['longitude'].values, train_df['latitude'].values)
m.hexbin(x, y, gridsize=200,
         bins='log', cmap=cm.YlOrRd_r);
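The percentile capping applied to price, latitude and longitude in the earlier cells follows one pattern, which could be sketched as a small helper. The function name `cap_to_percentiles` and the toy values are assumptions for illustration; they are not part of the original notebook:

```python
import numpy as np
import pandas as pd

def cap_to_percentiles(series, lower_pct=None, upper_pct=None):
    """Return a copy of `series` clipped to the given percentile bounds.

    Hypothetical helper: pass lower_pct and/or upper_pct (0-100);
    a bound left as None is not applied.
    """
    values = series.to_numpy()
    lower = np.percentile(values, lower_pct) if lower_pct is not None else None
    upper = np.percentile(values, upper_pct) if upper_pct is not None else None
    # Series.clip keeps every row and only replaces out-of-range values.
    return series.clip(lower=lower, upper=upper)

# Toy prices with one extreme value (illustrative numbers only).
toy_price = pd.Series([1500, 2000, 2500, 3000, 3500, 4000, 1_000_000], dtype=float)
capped = cap_to_percentiles(toy_price, upper_pct=99)
print(capped.max() < toy_price.max())
```

Note that capping keeps all rows and only limits the extreme values, which is different from dropping the outlying listings.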


Created:
Now let us look at the date column 'created'.
In [14]:
train_df["created"] = pd.to_datetime(train_df["created"])
train_df["date_created"] = train_df["created"].dt.date
cnt_srs = train_df['date_created'].value_counts()

plt.figure(figsize=(12,4))
ax = plt.subplot(111)
ax.bar(cnt_srs.index, cnt_srs.values, alpha=0.8)
ax.xaxis_date()
plt.xticks(rotation='vertical')
plt.show()


So we have data from April to June 2016 in our train set. Now let us look at the test set as well and see whether it covers the same date range.
In [15]:
test_df["created"] = pd.to_datetime(test_df["created"])
test_df["date_created"] = test_df["created"].dt.date
cnt_srs = test_df['date_created'].value_counts()

plt.figure(figsize=(12,4))
ax = plt.subplot(111)
ax.bar(cnt_srs.index, cnt_srs.values, alpha=0.8)
ax.xaxis_date()
plt.xticks(rotation='vertical')
plt.show()
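The visual comparison of the two date histograms can be backed by a quick check of the minimum and maximum dates. This is a sketch on hypothetical toy frames (`toy_train`, `toy_test`) with made-up timestamps; `date_range` is an illustrative helper, not something from the notebook:

```python
import pandas as pd

# Hypothetical stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"created": ["2016-04-01 05:10:00",
                                      "2016-05-15 02:30:00",
                                      "2016-06-29 03:00:00"]})
toy_test = pd.DataFrame({"created": ["2016-04-02 01:00:00",
                                     "2016-06-28 04:45:00"]})

def date_range(df):
    """Return (first_date, last_date) of the 'created' column."""
    created = pd.to_datetime(df["created"])
    return created.min().date(), created.max().date()

train_range = date_range(toy_train)
test_range = date_range(toy_test)

# True when every test date falls inside the span covered by train.
test_within_train = (train_range[0] <= test_range[0]
                     and test_range[1] <= train_range[1])
print(train_range, test_range, test_within_train)
```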


The test set dates look very similar to the train set dates, so we are good to go.
We shall also look at the hour-wise listing trend (just for fun).
In [16]:
train_df["hour_created"] = train_df["created"].dt.hour
cnt_srs = train_df['hour_created'].value_counts()

plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.show()


Looks like listings are mostly created during the early hours of the day (1 to 7 am). Maybe that is when traffic is low, and so the updates happen then.
Now let us look at some of the categorical variables.
Display Address:
In [17]:
cnt_srs = train_df.groupby('display_address')['display_address'].count()

for i in [2, 10, 50, 100, 500]:
    print('Display_address that appear less than {} times: {}%'.format(i, round((cnt_srs < i).mean() * 100, 2)))

plt.figure(figsize=(12, 6))
plt.hist(cnt_srs.values, bins=100, log=True, alpha=0.9)
plt.xlabel('Number of times display_address appeared', fontsize=12)
plt.ylabel('log(Count)', fontsize=12)
plt.show()
Display_address that appear less than 2 times: 63.22%
Display_address that appear less than 10 times: 89.6%
Display_address that appear less than 50 times: 97.73%
Display_address that appear less than 100 times: 99.26%
Display_address that appear less than 500 times: 100.0%


Most display addresses are rare: about 63% appear only once and over 99% appear fewer than 100 times. None of them appears 500 or more times.
Number of Photos:
This competition also has a huge database of photos of the listings. To start with, let us look at the number of photos given for each listing.
In [18]:
train_df["num_photos"] = train_df["photos"].apply(len)
cnt_srs = train_df['num_photos'].value_counts()

plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.xlabel('Number of Photos', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()


In [19]:
train_df.loc[train_df['num_photos'] > 12, 'num_photos'] = 12  # cap at 12 photos
plt.figure(figsize=(12,6))
sns.violinplot(x="num_photos", y="interest_level", data=train_df, order =['low','medium','high'])
plt.xlabel('Number of Photos', fontsize=12)
plt.ylabel('Interest Level', fontsize=12)
plt.show()


Let us now look at the number of features variable and see its distribution.
Number of features:
In [20]:
train_df["num_features"] = train_df["features"].apply(len)
cnt_srs = train_df['num_features'].value_counts()

plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Number of features', fontsize=12)
plt.show()


In [21]:
train_df.loc[train_df['num_features'] > 17, 'num_features'] = 17  # cap at 17 features
plt.figure(figsize=(12,10))
sns.violinplot(y="num_features", x="interest_level", data=train_df, order =['low','medium','high'])
plt.xlabel('Interest Level', fontsize=12)
plt.ylabel('Number of features', fontsize=12)
plt.show()
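The violin plots for photo and feature counts can be complemented by a per-class numeric summary. A minimal sketch, assuming a hypothetical toy frame (`toy_df`) shaped like `train_df`, with a list-valued 'features' column:

```python
import pandas as pd

# Hypothetical toy rows; each 'features' entry is a list of amenity strings.
toy_df = pd.DataFrame({
    "features": [[], ["Doorman", "Elevator"], ["Pre-War"],
                 ["Hardwood Floors", "No Fee", "Dishwasher"]],
    "interest_level": ["low", "low", "medium", "high"],
})

# Same derivation as in the notebook: list length per listing.
toy_df["num_features"] = toy_df["features"].apply(len)

# Mean and median count per interest level, in low/medium/high order.
summary = (toy_df.groupby("interest_level")["num_features"]
           .agg(["mean", "median"])
           .reindex(["low", "medium", "high"]))
print(summary)
```

The same pattern should apply to 'num_photos' by swapping the column names.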


Word Clouds:
Next we shall look into some of the text features.
In [22]:
from wordcloud import WordCloud

text = ''
text_da = ''
text_desc = ''
for ind, row in train_df.iterrows():
    # join multi-word features with underscores so each stays one token
    for feature in row['features']:
        text = " ".join([text, "_".join(feature.strip().split(" "))])
    text_da = " ".join([text_da,"_".join(row['display_address'].strip().split(" "))])
    #text_desc = " ".join([text_desc, row['description']])
text = text.strip()
text_da = text_da.strip()
text_desc = text_desc.strip()

plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for features", fontsize=30)
plt.axis("off")
plt.show()

# wordcloud for display address
plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text_da)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Display Address", fontsize=30)
plt.axis("off")
plt.show()
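The word-cloud cell builds one long string by repeatedly joining onto it, which can get slow on many rows. An alternative sketch collects the underscore-joined tokens in a list and also counts them, so the most frequent features can be inspected without a plot. The toy lists below are hypothetical stand-ins for the 'features' column:

```python
from collections import Counter

# Hypothetical stand-in for train_df['features'] (one list per listing).
toy_features = [
    ["Doorman", "Fitness Center"],
    ["Hardwood Floors", "No Fee"],
    ["Doorman", "No Fee"],
]

# Same tokenisation as the notebook: spaces inside a feature become
# underscores so that each feature stays a single word.
tokens = ["_".join(feature.strip().split(" "))
          for feats in toy_features
          for feature in feats]

text = " ".join(tokens)         # could be passed to WordCloud().generate(text)
token_counts = Counter(tokens)  # frequency of each feature token
print(token_counts["Doorman"], token_counts["Fitness_Center"])
```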


