
pandas grouped operations: groupby

2018-01-08 11:29

groupby: the most commonly used grouping function in pandas

(1) Grouping by column

import pandas as pd
import numpy as np
# Example frame: two key columns and two columns of random numbers
df = pd.DataFrame({'key1': ['a','a','b','b','a'],
                   'key2': ['one','two','one','two','one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df


      data1     data2 key1 key2
0 -1.488061 -0.002241    a  one
1  0.707773  0.338733    a  two
2 -1.689161  0.647643    b  one
3  0.987463 -0.584322    b  two
4 -0.560973 -1.147191    a  one


Group by a single column name, 'key1':

group1 = df.groupby('key1')
[x for x in group1]


[('a',       data1     data2 key1 key2
0 -1.488061 -0.002241    a  one
1  0.707773  0.338733    a  two
4 -0.560973 -1.147191    a  one),
('b',       data1     data2 key1 key2
2 -1.689161  0.647643    b  one
3  0.987463 -0.584322    b  two)]


Group by several column names, ['key1','key2']:

group2 = df.groupby(['key1','key2'])
[x for x in group2]


[(('a', 'one'),       data1     data2 key1 key2
0 -1.488061 -0.002241    a  one
4 -0.560973 -1.147191    a  one),
(('a', 'two'),       data1     data2 key1 key2
1  0.707773  0.338733    a  two),
(('b', 'one'),       data1     data2 key1 key2
2 -1.689161  0.647643    b  one),
(('b', 'two'),       data1     data2 key1 key2
3  0.987463 -0.584322    b  two)]


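Here group1 is an intermediate grouping object of type GroupBy (a DataFrameGroupBy); no computation happens until a function is applied to it.

The list comprehension [x for x in group1] is used only to display the contents of each group: every element is a (group key, sub-DataFrame) pair.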
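The (key, sub-DataFrame) pairs described above can also be consumed directly in a loop, or looked up one at a time. A minimal sketch, assuming a small frame with made-up fixed values in place of the article's random data; the names row_counts and group_a are only illustrative:

```python
import pandas as pd

# Small fixed-value frame (values are made up so the result is reproducible)
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

group1 = df.groupby('key1')

# Iterating a GroupBy yields (group key, sub-DataFrame) pairs
row_counts = {}
for name, sub_df in group1:
    row_counts[name] = len(sub_df)

# get_group() fetches the rows of a single group by its key
group_a = group1.get_group('a')

print(row_counts)
print(group_a)
```

Looping this way avoids building the full list of groups when only one pass over them is needed.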

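(2) Group statistics

Applying statistical functions such as size(), sum() and count() to group1 and group2 gives, respectively, the number of rows in each group, the per-group sum of each column, and the per-group count of non-null values in each column.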

group1.size()


key1
a    3
b    2
dtype: int64


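# numeric_only=True sums only data1/data2; pandas 2.x would otherwise also concatenate the strings in key2
group1.sum(numeric_only=True)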


         data1     data2
key1
a    -1.341260 -0.810698
b    -0.701698  0.063321


group2.size()


key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64


group2.count()


           data1  data2
key1 key2
a    one       2      2
     two       1      1
b    one       1      1
     two       1      1
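In the outputs above size() and count() agree because the data has no missing values. A minimal sketch of where they differ, assuming a made-up frame with one NaN placed in data1 (this is not the article's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, np.nan, 3.0, 4.0, 5.0],  # one missing value in group 'a'
})
group1 = df.groupby('key1')

sizes = group1.size()              # rows per group, NaN rows included
counts = group1['data1'].count()   # non-null values per group

print(sizes)
print(counts)
```

Group 'a' has three rows but only two non-null data1 values, so size() and count() report different numbers for it.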


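(3) agg()

agg(func) applies the function func to one or more columns of the grouped data; it also extends to applying several functions to several columns at once.

Example: compute the mean of the grouped 'data1' column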

group1['data1'].agg('mean')


key1
a   -0.447087
b   -0.350849
Name: data1, dtype: float64


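Example: compute both the mean and the sum of the grouped 'data1' and 'data2' columns

# Select several columns with a list (double brackets); the tuple form group1['data1','data2'] was removed in pandas 2.0
group1[['data1','data2']].agg(['mean','sum'])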


         data1               data2
          mean       sum      mean       sum
key1
a    -0.447087 -1.341260 -0.270233 -0.810698
b    -0.350849 -0.701698  0.031660  0.063321
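agg() also accepts a per-column specification, which is one way to read "several columns and several functions". A minimal sketch, assuming made-up fixed values and a pandas version with named aggregation (0.25 or later); the output names data1_mean and data2_sum are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})
group1 = df.groupby('key1')

# Dict form: a different list of functions for each column
per_column = group1.agg({'data1': ['mean'], 'data2': ['min', 'max']})

# Named aggregation: output_name=(column, function) gives flat column names
named = group1.agg(data1_mean=('data1', 'mean'), data2_sum=('data2', 'sum'))

print(per_column)
print(named)
```

The dict form keeps two-level column labels such as ('data2', 'max'), while named aggregation produces single-level names that are easier to export.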


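(4) apply()

How it differs from agg(): apply() passes each group to the function as a whole sub-DataFrame, so the function runs over all of the group's columns, whereas agg() is applied only to the columns you select.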

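# Select the numeric columns first: current pandas cannot average the string column key2
df.groupby('key1')[['data1','data2']].apply(lambda g: g.mean())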


         data1     data2
key1
a    -0.447087 -0.270233
b    -0.350849  0.031660


df.groupby(['key1','key2'])[['data1','data2']].apply(lambda g: g.mean())


              data1     data2
key1 key2
a    one  -1.024517 -0.574716
     two   0.707773  0.338733
b    one  -1.689161  0.647643
     two   0.987463 -0.584322
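apply() is most useful when the function needs more than one column of a group at the same time, which a column-by-column aggregation cannot express. A minimal sketch, assuming made-up fixed values; spread is a hypothetical helper written for this illustration, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

def spread(sub_df):
    # Hypothetical helper: combines two columns of the same group,
    # returning the largest row-wise difference data2 - data1
    return (sub_df['data2'] - sub_df['data1']).max()

# Each group's sub-DataFrame (data1 and data2 only) is passed to spread()
result = df.groupby('key1')[['data1', 'data2']].apply(spread)

print(result)
```

Because spread() returns one scalar per group, the result is a Series indexed by key1.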


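(5) reset_index()

reset_index() moves the group keys of a groupby() result out of the index and back into ordinary columns, giving a flat DataFrame with a default integer index that is convenient to save.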

group1[['data1','data2']].agg(['mean','sum']).reset_index()


  key1     data1               data2
            mean       sum      mean       sum
0    a -0.447087 -1.341260 -0.270233 -0.810698
1    b -0.350849 -0.701698  0.031660  0.063321
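The result above still has two-level column labels, which can be awkward in a saved file. One possible way to flatten them before saving is sketched below, assuming made-up fixed values and an underscore-joined naming scheme chosen here for illustration; the CSV is written to a string rather than to a file:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})
stats = df.groupby('key1')[['data1', 'data2']].agg(['mean', 'sum'])

flat = stats.copy()
# Join the two column levels: ('data1', 'mean') becomes 'data1_mean'
flat.columns = ['_'.join(col) for col in flat.columns]
# Move key1 from the index into a regular column
flat = flat.reset_index()

# With no path argument, to_csv() returns the CSV text as a string
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv() instead would write the same table to disk.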