
pandas aggregation and group operations with groupby - 2

2018-03-14 15:42
import numpy as np
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import pandas as pd

np.random.seed(12345)  # fix the random seed so that every run produces the same random numbers
plt.rc('figure', figsize=(10, 6))

from sklearn.metrics import confusion_matrix
'''
In machine learning, a confusion matrix (also known as a contingency table or error matrix)
is a specific table layout that visualizes the performance of an algorithm, typically a
supervised one (for unsupervised learning the analogous table is usually called a matching matrix).
Each column represents a predicted class and each row an actual class. The name comes from how
easily the matrix shows whether classes are being confused, i.e. one class predicted as another.
'''
y_true=[2,1,0,1,2,0]
y_pred=[2,0,0,1,2,1]

C=confusion_matrix(y_true, y_pred)
print( C )
'''
[[1 1 0]
 [1 1 0]
 [0 0 2]]
'''

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
cc = confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
print( cc )
'''
[[2 0 0]
 [0 0 1]
 [1 0 2]]
'''
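The same matrix can be reproduced with plain pandas via `pd.crosstab`, which also makes the orientation (rows = actual, columns = predicted) explicit in the output. A minimal sketch; note that unlike `confusion_matrix` with an explicit `labels` list, `crosstab` omits the all-zero "bird" column, since "bird" never occurs in `y_pred`:

```python
import pandas as pd

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

# rows = actual class, columns = predicted class, same layout as confusion_matrix
cm = pd.crosstab(pd.Series(y_true, name='actual'),
                 pd.Series(y_pred, name='predicted'))
print(cm)
```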

df = pd.DataFrame({'key1': list('aaaab'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print( df )
'''
      data1     data2 key1 key2
0 -0.848172 -2.385541    a  one
1 -0.453870  0.156512    a  two
2 -0.336633 -0.323486    a  one
3 -1.258714  1.339105    a  two
4  0.669843  0.511622    b  one
'''

print( df.groupby('key1') ) # <pandas.core.groupby.DataFrameGroupBy object at 0x000000000DB1EC18>
print( df.groupby('key1').agg('sum') )

'''
df is a pandas DataFrame; groupby('column name') pre-partitions its rows by that column.
The first print shows only the lazy GroupBy object. agg() then applies an operation to
each group - here 'sum', so the values within each group are accumulated:
          data1     data2
key1
a     -1.094335  2.781858
b     -0.548833  1.198655

PS: trying to restrict the computation to the data1 column by writing df['data1'] does
not work - that only selects the data1 column of the original frame!
PS: df['data1'] is a Series, while df[['data1']] is a DataFrame.
'''
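To restrict the aggregation to one column, the grouped object itself is indexed, and the Series-vs-DataFrame distinction carries over. A minimal sketch on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aaaab'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

s = df.groupby('key1')['data1'].sum()    # index the *grouped* object -> aggregated Series
d = df.groupby('key1')[['data1']].sum()  # list of columns -> aggregated DataFrame
print(type(s).__name__, type(d).__name__)  # Series DataFrame
```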

#### Column-wise application of multiple functions
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})

tips = pd.read_csv('data/tips.csv')

tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

grouped = tips.groupby(['sex', 'smoker'])

grouped_pct = grouped['tip_pct']

'''
On the difference between the agg and apply functions in pandas:
"Python for Data Analysis" never states the difference explicitly; it only says that
apply is more general.

Chapter 9 of the book ("Data Aggregation and Group Operations") does hint at it:
agg is for aggregation, and aggregation means reducing. As the section opener puts it:
aggregation is merely one kind of group operation - a special case of data transformation
that accepts functions which reduce a one-dimensional array to a scalar value.

Both functions act on a groupby object, i.e. on the already-formed groups. For a given
group, if the function reduces a one-dimensional array of values to a scalar, agg can
call it. Conversely, a function such as a sort, or the top function defined on page 278
of the book, is beyond agg; apply handles it, because apply is fully general - no
reduction, no one-dimensional array, no scalar value required.
'''
grouped_pct.agg('mean')

def peak_to_peak(arr):  # as defined in the book: the range of each group
    return arr.max() - arr.min()

grouped_pct.agg(['mean', 'std', peak_to_peak])
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
print( result )
'''
               tip_pct                     total_bill
                 count      mean       max      count       mean    max
sex    smoker
Female No           54  0.156921  0.252672         54  18.105185  35.83
       Yes          33  0.182150  0.416667         33  17.977879  44.30
Male   No           97  0.160669  0.291990         97  19.791237  48.33
       Yes          60  0.152771  0.710345         60  22.284500  50.81
'''

print( result['tip_pct'] )
'''
               count      mean       max
sex    smoker
Female No         54  0.156921  0.252672
       Yes        33  0.182150  0.416667
Male   No         97  0.160669  0.291990
       Yes        60  0.152771  0.710345
'''
print( '-------------- ftuples --------------' )

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]

print( ftuples )
'''
[('Durchschnitt', 'mean'), ('Abweichung', <function var at 0x00000000035C5510>)]
'''

print( '-------------- grouped_1 --------------' )
grouped_1 = grouped[['tip_pct', 'total_bill']].agg(ftuples)

print( grouped_1 )
'''
                    tip_pct             total_bill
               Durchschnitt Abweichung Durchschnitt Abweichung
sex    smoker
Female No          0.156921   0.001327    18.105185  53.092422
       Yes         0.182150   0.005126    17.977879  84.451517
Male   No          0.160669   0.001751    19.791237  76.152961
       Yes         0.152771   0.008206    22.284500  98.244673
'''

print( '-------------- grouped_2 --------------' )
grouped_2 = grouped.agg({'tip' : np.max, 'size' : 'sum'})

print( grouped_2 )
'''
               size   tip
sex    smoker
Female No       140   5.2
       Yes       74   6.5
Male   No       263   9.0
       Yes      150  10.0
'''

print( '-------------- grouped_3 --------------' )
grouped_3 = grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],'size' : 'sum'})
print( grouped_3 )
'''
               size   tip_pct
                sum       min       max      mean       std
sex    smoker
Female No       140  0.056797  0.252672  0.156921  0.036421
       Yes       74  0.056433  0.416667  0.182150  0.071595
Male   No       263  0.071804  0.291990  0.160669  0.041849
       Yes      150  0.035638  0.710345  0.152771  0.090588
'''

### Group-wise operations and transformations
print( '-------------- df --------------' )
print( df )
'''
      data1     data2 key1 key2
0  1.007189  0.886429    a  one
1 -1.296221 -2.001637    a  two
2  0.274992 -0.371843    b  one
3  0.228913  1.669025    b  two
4  1.352917 -0.438570    a  one
'''

k1_means = df.groupby('key1').mean().add_prefix('mean_')

print( '-------------- k1_means --------------' )
print( k1_means )
'''
      mean_data1  mean_data2
key1
a       0.354628   -0.517926
b       0.251952    0.648591
'''
_merge_11 = pd.merge(df, k1_means, left_on='key1', right_index=True)

print( '-------------- _merge_11 --------------' )
print( _merge_11 )
'''
      data1     data2 key1 key2  mean_data1  mean_data2
0  1.007189  0.886429    a  one    0.354628   -0.517926
1 -1.296221 -2.001637    a  two    0.354628   -0.517926
4  1.352917 -0.438570    a  one    0.354628   -0.517926
2  0.274992 -0.371843    b  one    0.251952    0.648591
3  0.228913  1.669025    b  two    0.251952    0.648591
'''
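The merge-the-means pattern above can also be written with transform, which broadcasts each group's mean back onto the original row index, so no merge and no reordering of rows is needed. A sketch on a small frame of the same shape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# transform returns a frame aligned with df's own index: one group-mean per row
means = df.groupby('key1')[['data1', 'data2']].transform('mean').add_prefix('mean_')
out = pd.concat([df, means], axis=1)  # same columns as the merge-based version
print(out)
```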

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
print( '-------------- people --------------' )
print( people )
'''
               a         b         c         d         e
Joe    -0.539741  0.476985  3.248944 -1.021228 -0.577087
Steve   0.124121  0.302614  0.523772  0.000940  1.343810
Wes    -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Jim     0.560145 -1.265934  0.119827 -1.063512  0.332883
Travis -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
'''

key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()                # 2x5 frame of group means (result not kept here)
people.groupby(key).transform(np.mean)    # 5x5 frame: each row replaced by its group mean (not kept)

print( '-------------- people-groupby --------------' )
print( people )  # people itself is unchanged by the two calls above
'''
               a         b         c         d         e
Joe    -0.539741  0.476985  3.248944 -1.021228 -0.577087
Steve   0.124121  0.302614  0.523772  0.000940  1.343810
Wes    -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Jim     0.560145 -1.265934  0.119827 -1.063512  0.332883
Travis -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
'''

def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)

print( '-------------- demeaned --------------' )
print( demeaned )
'''
               a         b         c         d         e
Joe     0.664493  0.661556  3.470038  0.263014  0.337871
Steve  -0.218012  0.784274  0.201972  0.532226  0.505464
Wes     0.490691 -0.646583 -2.149137 -0.576519  0.054201
Jim     0.218012 -0.784274 -0.201972 -0.532226 -0.505464
Travis -1.155184 -0.014972 -1.320901  0.313505 -0.392072
'''

demeaned_2 = demeaned.groupby(key).mean()

print( '-------------- demeaned_2 --------------' )
print( demeaned_2 )
'''
                a             b             c             d             e
one -7.401487e-17  1.850372e-17 -7.401487e-17  7.401487e-17 -1.110223e-16
two  2.775558e-17 -5.551115e-17 -1.387779e-17  0.000000e+00  0.000000e+00
'''

# ### The apply method
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

tips.groupby('smoker').apply(top)

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

result = tips.groupby('smoker')['tip_pct'].describe()

print( '-------------- result --------------' )
print( result )
'''
smoker
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
Name: tip_pct, dtype: float64
'''

result_unstack = result.unstack('smoker')

print( '-------------- result_unstack --------------' )
print( result_unstack )
'''
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345
'''

#f = lambda x: x.describe()
#grouped.apply(f)

# Suppressing the group keys
tips.groupby('smoker', group_keys=False).apply(top)
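The effect of group_keys is only on the result's index. A minimal sketch on a toy frame (names illustrative, not the real tips data): with the default, the group labels are stacked on top of the original row labels; with group_keys=False only the original labels remain.

```python
import pandas as pd

df = pd.DataFrame({'smoker': ['No', 'No', 'Yes', 'Yes'],
                   'tip_pct': [0.10, 0.20, 0.15, 0.30]})

def top(frame, n=1, column='tip_pct'):
    return frame.sort_values(column)[-n:]

with_keys = df.groupby('smoker').apply(top)                  # MultiIndex: (smoker, row label)
no_keys = df.groupby('smoker', group_keys=False).apply(top)  # original row labels only
print(with_keys.index.nlevels, no_keys.index.nlevels)  # 2 1
```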

# ### Quantile and bucket analysis
frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
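The difference between the two bucketings above can be checked directly: cut makes equal-width intervals (so the counts vary with the distribution), while qcut makes equal-size buckets. A quick sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.Series(np.random.randn(1000))

by_width = pd.cut(data, 4)                  # 4 equal-width intervals: uneven counts
by_depth = pd.qcut(data, 10, labels=False)  # 10 equal-size buckets: 100 rows each
print(by_width.value_counts().sort_index())
print(by_depth.value_counts().sort_index())
```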

# ### Filling missing values with group-specific values
s = Series(np.random.randn(6))
s[::2] = np.nan
print( s )
'''
0         NaN
1   -0.438053
2         NaN
3    0.401587
4         NaN
5   -0.574654
dtype: float64
'''

s.fillna(s.mean())

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
print( data )
'''
Ohio          0.786210
New York     -1.393822
Vermont            NaN
Florida       1.170900
Oregon        0.678661
Nevada             NaN
California    0.150581
Idaho              NaN
dtype: float64
'''
data.groupby(group_key).mean()

fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)
print( data )  # data itself is unchanged; apply returned a new, filled Series
'''
Ohio          0.786210
New York     -1.393822
Vermont            NaN
Florida       1.170900
Oregon        0.678661
Nevada             NaN
California    0.150581
Idaho              NaN
dtype: float64
'''