
Bayesian Modelling in Python (0)

Section 0: Introduction

Welcome to "Bayesian Modelling in Python", a tutorial for learning Bayesian statistics with Python. You can find the table of contents for this tutorial on the project homepage (https://github.com/markdregan/Hangout-with-PyMC3).

Statistics is a subject I never connected with during my university years. The frequentist techniques we were taught (p-values and the like) felt contrived, and eventually I lost interest in statistics and gave up on it.

That changed when I stumbled upon Bayesian statistics, a branch of statistics quite different from the traditional frequentist statistics taught at most universities. Many publications, blogs and videos guided me along the way, and I highly recommend them as introductions to Bayesian statistics. They include:

Doing Bayesian Data Analysis by John Kruschke

Python port of John Kruschke’s examples by Osvaldo Martin


Bayesian Methods for Hackers, which strongly shaped how I learned Bayesian statistics. To reflect that influence, this tutorial adopts the same visual style as BMH.

While My MCMC Gently Samples blog by Thomas Wiecki

Healthy Algorithms blog by Abraham Flaxman

Scipy Tutorial 2014 by Chris Fonnesbeck

I hope this tutorial proves useful to others learning Bayesian methods, just as the resources above helped me. Contact, comments and contributions from the community are welcome.

Load your Google Hangouts chat data

Throughout this tutorial we will use my own Google Hangouts chat data as the dataset. I have removed the message content and replaced my friends' names with pseudonyms; the rest of the dataset is unchanged.

If you would like to follow along with your own Hangouts chat data, you can download it from Google Takeout, which provides the Hangouts data in JSON format. Once downloaded, replace the hangouts.json file in the data folder with your own.

The JSON file is deeply nested and contains a lot of redundant information.
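If you want a quick feel for that nesting before running the full parser below, here is a minimal sketch (not part of the original notebook) that prints the top-level keys and the keys of the first conversation. It assumes you have already placed your own hangouts.json under data/, as described above.

import json

# Peek at the structure of the Hangouts export (assumes data/hangouts.json exists)
with open('data/hangouts.json') as json_file:
    data = json.load(json_file)

print(list(data.keys()))                     # top-level keys of the export
first_state = data['conversation_state'][0]  # first conversation in the export
print(list(first_state['conversation_state'].keys()))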

import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns  # seaborn.apionly was removed in recent seaborn releases
from datetime import datetime

%matplotlib inline
plt.style.use('bmh')

colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00',
          '#CC79A7', '#56B4E9', '#009E73', '#F0E442', '#0072B2']

The code below loads the JSON data and parses each message into a row of a pandas DataFrame.

Note: the hangouts.json file is missing from the data/ directory; you will need to download and add your own JSON file as described above. Alternatively, you can skip ahead to the next section, where I load the anonymised dataset.

# Import json data
with open('data/hangouts.json') as json_file:
    json_data = json.load(json_file)

# Generate map from gaia_id to real name
def user_name_mapping(data):
    user_map = {'gaia_id': ''}
    for state in data['conversation_state']:
        participants = state['conversation_state']['conversation']['participant_data']
        for participant in participants:
            if 'fallback_name' in participant:
                user_map[participant['id']['gaia_id']] = participant['fallback_name']
    return user_map

user_dict = user_name_mapping(json_data)

# Parse data into flat list
def fetch_messages(data):
    messages = []
    for state in data['conversation_state']:
        conversation_state = state['conversation_state']
        conversation = conversation_state['conversation']
        conversation_id = conversation_state['conversation']['id']['id']
        participants = conversation['participant_data']

        all_participants = []
        for participant in participants:
            if 'fallback_name' in participant:
                user = participant['fallback_name']
            else:
                # Scope to call G+ API to get name
                user = participant['id']['gaia_id']
            all_participants.append(user)
        num_participants = len(all_participants)

        for event in conversation_state['event']:
            try:
                sender = user_dict[event['sender_id']['gaia_id']]
            except KeyError:
                sender = event['sender_id']['gaia_id']
            # Hangouts timestamps are in microseconds since the epoch
            timestamp = datetime.fromtimestamp(float(event['timestamp']) / 10**6)
            event_id = event['event_id']
            if 'chat_message' in event:
                content = event['chat_message']['message_content']
                if 'segment' in content:
                    segments = content['segment']
                    for segment in segments:
                        if 'text' in segment:
                            message = segment['text']
                            message_length = len(message)
                            message_type = segment['type']
                            if len(message) > 0:
                                messages.append((conversation_id,
                                                 event_id,
                                                 timestamp,
                                                 sender,
                                                 message,
                                                 message_length,
                                                 all_participants,
                                                 ', '.join(all_participants),
                                                 num_participants,
                                                 message_type))
    messages.sort(key=lambda x: x[0])
    return messages

# Parse data into data frame
cols = ['conversation_id', 'event_id', 'timestamp', 'sender',
        'message', 'message_length', 'participants', 'participants_str',
        'num_participants', 'message_type']

messages = pd.DataFrame(fetch_messages(json_data), columns=cols).sort_values(['conversation_id', 'timestamp'])

# Engineer features
messages['prev_timestamp'] = messages.groupby(['conversation_id'])['timestamp'].shift(1)
messages['prev_sender'] = messages.groupby(['conversation_id'])['sender'].shift(1)

# Exclude messages that are replies to oneself (not a first reply)
messages = messages[messages['sender'] != messages['prev_sender']]

# Time delay between a message and the previous message in the conversation
messages['time_delay_seconds'] = (messages['timestamp'] - messages['prev_timestamp']).dt.total_seconds()
messages = messages[messages['time_delay_seconds'].notnull()]
messages['time_delay_mins'] = np.ceil(messages['time_delay_seconds'].astype(int) / 60.0)

# Time attributes
messages['day_of_week'] = messages['timestamp'].apply(lambda x: x.dayofweek)
messages['year_month'] = messages['timestamp'].apply(lambda x: x.strftime("%Y-%m"))
messages['is_weekend'] = messages['day_of_week'].isin([5, 6]).astype(int)

# Limit to messages sent by me and exclude all messages between me and Alison
messages = messages[(messages['sender'] == 'Mark Regan') & (messages['participants_str'] != 'Alison Darcy, Mark Regan')]

# Remove messages not responded to within 60 seconds
# This introduces an issue by right censoring the data (might return to address this)
messages = messages[messages['time_delay_seconds'] < 60]

messages.head(1)

The original article includes a very wide table here (a single DataFrame row with many columns) that does not fit in this layout; please refer to the English original to see it.

We now have the data in a form that is much easier to work with; the table above shows a single row of the pandas DataFrame. I am interested in how long it takes me to respond to messages, so let's plot a few views of my typical response times.
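Since a full row of this DataFrame is too wide to print as a table (as noted above), one simple trick, not part of the original notebook, is to transpose a single row so each column appears on its own line, and to summarise the response-time column we are about to plot:

# View one message with each column on its own line (easier to read than the wide table)
print(messages.head(1).T)

# Quick summary of the response times (in seconds) used in the plots below
print(messages['time_delay_seconds'].describe())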

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(211)

order = np.sort(messages['year_month'].unique())
sns.boxplot(x=messages['year_month'], y=messages['time_delay_seconds'],
            order=order, orient="v", color=colors[5], linewidth=1, ax=ax)
_ = ax.set_title('Response time distribution by month')
_ = ax.set_xlabel('Month-Year')
_ = ax.set_ylabel('Response time')
_ = plt.xticks(rotation=30)

ax = fig.add_subplot(212)
plt.hist(messages['time_delay_seconds'].values, range=[0, 60], bins=60,
         histtype='stepfilled', color=colors[0])
_ = ax.set_title('Response time distribution')
_ = ax.set_xlabel('Response time (seconds)')
_ = ax.set_ylabel('Number of messages')

plt.tight_layout()