
Datawhale: Data Restructuring

2020-08-24 20:58

Data Analysis: Data Restructuring

concat
join
merge
append
DataFrame-->Series
Groupby
The agg function

concat

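concat: concatenates pandas objects (here, DataFrames) along one axis
Key parameters:
axis = 1: horizontal concatenation (columns placed side by side)
axis = 0 (default): vertical concatenation (rows stacked)
ignore_index: default False: the result keeps the index labels of the input DataFrames, so labels may repeat and are not renumbered
True: the result gets a fresh index 0, 1, ..., n-1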

import pandas as pd

# axis=1: concatenate the tables horizontally
result_up = pd.concat([text_left_up, text_right_up], axis=1)
result_down = pd.concat([text_left_down, text_right_down], axis=1)

# Default axis=0: concatenate vertically; pass the boolean True, not the string 'True'
result = pd.concat([result_up, result_down], ignore_index=True)
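To make the effect of axis and ignore_index concrete, here is a minimal sketch on toy data. The frames left and right and their values are made up for illustration and are not part of the original data set.

```python
import pandas as pd

# Hypothetical toy frames standing in for two pieces of the data set.
left = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
right = pd.DataFrame({"Fare": [7.25, 71.28], "Survived": [0, 1]})

# axis=1: columns are placed side by side, rows are aligned on the index.
wide = pd.concat([left, right], axis=1)

# axis=0 (default): rows are stacked and the original labels 0, 1 repeat.
tall_kept = pd.concat([wide, wide])

# ignore_index=True: the result is relabelled 0..n-1.
tall_reset = pd.concat([wide, wide], ignore_index=True)

print(wide.shape)              # (2, 4)
print(list(tall_kept.index))   # [0, 1, 0, 1]
print(list(tall_reset.index))  # [0, 1, 2, 3]
```

Repeated labels are legal in pandas, but they make label-based lookups return several rows, which is the usual reason to reset the index after a vertical concatenation.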

join

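Concatenates horizontally, aligning rows on the index by default.
lsuffix='_caller': if the two DataFrames share column names, append the suffix '_caller' to those columns of the calling DataFrame
rsuffix='_other': if the two DataFrames share column names, append the suffix '_other' to those columns of the DataFrame being joined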

# join places several pandas objects side by side
text_up = text_left_up.join(text_right_up)
text_down = text_left_down.join(text_right_down)
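The suffix parameters only matter when column names collide. The sketch below uses two made-up frames (caller and other are hypothetical names) that both contain a Fare column, first without suffixes and then with them.

```python
import pandas as pd

# Hypothetical frames that share the column name "Fare".
caller = pd.DataFrame({"Fare": [7.25, 71.28], "Age": [22.0, 38.0]})
other = pd.DataFrame({"Fare": [8.05, 53.10], "Pclass": [3, 1]})

# Without suffixes, join refuses to produce two columns with the same name.
try:
    caller.join(other)
    overlap_error = None
except ValueError as exc:
    overlap_error = str(exc)

# With suffixes, only the overlapping column is renamed on each side.
joined = caller.join(other, lsuffix="_caller", rsuffix="_other")
print(list(joined.columns))  # ['Fare_caller', 'Age', 'Fare_other', 'Pclass']
```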

merge

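Concatenates horizontally, in the style of a database join.
By default merge joins on the columns the two DataFrames have in common. To join on the index instead, set left_index and right_index: both default to False, and setting both to True merges the two tables on their respective indexes.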

result_up = pd.merge(text_left_up, text_right_up,left_index=True, right_index = True)
result_down = pd.merge(text_left_down,text_right_down,left_index=True, right_index = True)
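Merging on the index and merging on a key column can give different pairings when the rows are not in the same order. The following sketch, on made-up data, shows both; the frames and the choice of PassengerId as the key are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frames; the right one lists the same passengers in another order.
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": ["male", "female", "female"]})
right = pd.DataFrame({"PassengerId": [3, 1, 2], "Fare": [7.92, 7.25, 71.28]})

# Merge on a key column: rows are matched by PassengerId value.
by_column = pd.merge(left, right, on="PassengerId")

# Merge on the index: rows are matched by index label, ignoring PassengerId.
# The shared column then appears twice, with the default _x / _y suffixes.
by_index = pd.merge(left, right, left_index=True, right_index=True)

print(list(by_column["Fare"]))  # [7.25, 71.28, 7.92]
print(list(by_index["Fare"]))   # [7.92, 7.25, 71.28]
print(list(by_index.columns))   # ['PassengerId_x', 'Sex', 'PassengerId_y', 'Fare']
```

Index-based merging is safe for the split tables in this article only if the pieces were cut from the same rows in the same order; otherwise a key column is the more robust choice.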

append

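Concatenates vertically; as with concat, set ignore_index=True to reset the index. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this call only runs on older versions.

result = result_up.append(result_down, ignore_index=True)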
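On pandas 2.0 and later, where DataFrame.append no longer exists, the same vertical stacking can be written with pd.concat. A minimal sketch, with small made-up stand-ins for result_up and result_down:

```python
import pandas as pd

# Hypothetical stand-ins for the two halves built earlier.
result_up = pd.DataFrame({"Fare": [7.25, 71.28]})
result_down = pd.DataFrame({"Fare": [8.05]})

# Equivalent of result_up.append(result_down, ignore_index=True).
result = pd.concat([result_up, result_down], ignore_index=True)

print(list(result.index))    # [0, 1, 2]
print(list(result["Fare"]))  # [7.25, 71.28, 8.05]
```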

DataFrame-->Series

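The stack function pivots the columns of a DataFrame into the row index, turning the data into a Series.

# stack: each row's features are split apart and stacked on top of one another
unit_result = text.stack().head(20)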
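To see what the stacked Series looks like, here is a small sketch on a two-row toy frame. The frame reuses the name text, but its values are made up.

```python
import pandas as pd

# Hypothetical two-row frame with two features.
text = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]})

# Each (row label, column name) pair becomes one entry of the Series.
stacked = text.stack()

print(type(stacked).__name__)     # Series
print(len(stacked))               # 4
print(stacked.index.nlevels)      # 2
print(stacked.loc[(0, "Fare")])   # 7.25
```

The result carries a two-level index (row label, column name), and calling unstack() on it moves the inner level back into columns.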

Groupby

Purpose: split the data set into groups by some key, so that computations can then be run on each group separately.

# First group the rows of text by the values in the Sex column: male, female
# Then select the 'Fare' column from each group
# Finally take the mean of Fare within each group
df  = text.groupby(['Sex'])
Sex_fare = df['Fare']
Sex_fare.mean()

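# The same result can be written in two shorter forms
df = text.groupby(['Sex'])['Fare'].mean()
# A Series has no 'Sex' column to look up, so pass the grouping Series itself
df = text['Fare'].groupby(text['Sex']).mean()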

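Besides the mean, other operations can be applied to the grouped data.

# Use sum to count the male and female survivors on the Titanic
survived_sex = text['Survived'].groupby(text['Sex']).sum()

# Use sum to count the survivors in each passenger class
survived_pclass = text['Survived'].groupby(text['Pclass']).sum()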
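The grouping patterns above can be checked on a small hand-made table where the answers are easy to verify by eye. The four rows below are invented for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical four-passenger table.
text = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [8.0, 70.0, 10.0, 52.0],
    "Survived": [0, 1, 1, 0],
})

# Two equivalent spellings of "mean fare per sex".
mean_a = text.groupby("Sex")["Fare"].mean()
mean_b = text["Fare"].groupby(text["Sex"]).mean()

# Because Survived is 0/1, summing it counts the survivors in each group.
survived_sex = text["Survived"].groupby(text["Sex"]).sum()
survived_pclass = text["Survived"].groupby(text["Pclass"]).sum()

print(mean_a["female"], mean_a["male"])   # 40.0 30.0
print(survived_sex["female"])             # 2
print(survived_pclass[1], survived_pclass[3])  # 1 1
```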

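The agg function

agg can apply several aggregations to the groups in one call.

# Compute the mean fare and the number of survivors per passenger class, then rename the columns
text.groupby('Pclass').agg({'Fare':'mean','Survived':'sum'}).rename(columns = {'Fare':'mean_fare','Survived':'Survived_sum'})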
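As an alternative to aggregating and then renaming, pandas (0.25 and later) supports named aggregation, where each output column is declared as name=(input column, function). This is a sketch on invented data, not the author's original approach; the output name survived_sum is chosen here for illustration.

```python
import pandas as pd

# Hypothetical four-passenger table.
text = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [8.0, 70.0, 10.0, 52.0],
    "Survived": [0, 1, 1, 0],
})

# Named aggregation: output column = (input column, aggregation).
summary = text.groupby("Pclass").agg(
    mean_fare=("Fare", "mean"),
    survived_sum=("Survived", "sum"),
)

print(list(summary.columns))         # ['mean_fare', 'survived_sum']
print(summary.loc[1, "mean_fare"])   # 61.0
print(summary.loc[3, "mean_fare"])   # 9.0
```

This form also allows the same input column to be aggregated twice under different names, which the dict form cannot express without nested lists.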

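Find the age with the most survivors, then compute survival proportions for that age.

# Total number of survivors at each age
survived_age = text['Survived'].groupby(text['Age']).sum()
# Find the largest survivor count and the age it belongs to
survived_age.max()
survived_age[survived_age.values == survived_age.max()]  # Age 24.0  15

# Total number of survivors
sum_total = text['Survived'].sum()

# Number of passengers at each age, then at the age with the most survivors
sum_ = text.groupby(text['Age']).size()
sum_[24.0]  # 30, so half of the passengers aged 24 survived

# Survivors of that age as a share of all survivors on the ship
percent = survived_age.max() / sum_total
print('Share of all survivors: ' + str(percent))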
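The calculation above involves two different ratios that are easy to confuse: the survival rate within the age group, and that group's share of all survivors. The sketch below separates them on invented data and uses idxmax to find the age without hard-coding it; the column names survivors and passengers are chosen here for illustration.

```python
import pandas as pd

# Hypothetical passengers: four aged 24, two aged 30, one aged 2, one aged 40.
text = pd.DataFrame({
    "Age": [24.0, 24.0, 24.0, 24.0, 30.0, 30.0, 2.0, 40.0],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 1],
})

# Survivors and head count per age in one pass.
by_age = text.groupby("Age")["Survived"].agg(survivors="sum", passengers="count")

# Age with the most survivors, found by label rather than typed in by hand.
top_age = by_age["survivors"].idxmax()

# Ratio 1: survival rate among passengers of that age.
rate_within_age = by_age.loc[top_age, "survivors"] / by_age.loc[top_age, "passengers"]

# Ratio 2: that age's survivors as a share of all survivors.
share_of_survivors = by_age.loc[top_age, "survivors"] / text["Survived"].sum()

print(top_age)             # 24.0
print(rate_within_age)     # 0.5
print(share_of_survivors)  # 0.4
```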
