
Python Data Analysis: Notes on pandas

2017-08-17 16:44

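1. Series basics

A Series resembles a column vector with an index attached on its left. It exposes two attributes, Series.values and Series.index. Both the Series object and its index also carry a name attribute (Series.name and Series.index.name), so each can be labeled; these names tie in closely with other pandas features.

A DataFrame resembles a two-dimensional array whose rows and columns are both indexed. To select a row by label, use frame.loc[i] (the frame.ix[i] accessor seen in older code was deprecated in pandas 0.20 and removed in pandas 1.0). To select a column, use frame[i], or frame.i when the column label is a valid Python identifier. Here i is the row or column label.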
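The attributes and selection forms described above can be tried on a small example. This is a minimal sketch; the data, the labels, and the names 'score' and 'key' are made up for illustration.

```python
import numpy as np
from pandas import DataFrame, Series

# A Series with an explicit index; both the Series and its index get a name.
s = Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'], name='score')
s.index.name = 'key'
print(s.values)               # the underlying data as a NumPy array
print(s.index)                # the labels attached to the data
print(s.name, s.index.name)

# A DataFrame indexed on both axes.
frame = DataFrame(np.arange(6).reshape(3, 2),
                  index=['x', 'y', 'z'], columns=['p', 'q'])
row = frame.loc['y']          # row selected by label
col = frame['q']              # column selected by label
same_col = frame.q            # attribute access for identifier-like labels
print(row)
print(col)
```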

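2. The apply method

In Python 2, the built-in apply(func, args) calls func with the given arguments. The built-in was removed in Python 3, where the same call is written func(*args). For example:

When func takes no arguments:

Input:      def say():
                print('say in')
            say()                                # Python 2: apply(say)
Output:     say in

When func takes arguments:

Input:      def say(a, b):
                print(a, b)
            say(*('hello', 'zhangsan'))          # Python 2: apply(say, ('hello', 'zhangsan'))
Output:     hello zhangsan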
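The Python 2 built-in also accepted a dictionary of keyword arguments as a third parameter. A minimal sketch of the Python 3 equivalent follows; the sep keyword is a hypothetical addition used only for illustration.

```python
def say(a, b, sep=' '):
    """Join two words; `sep` is a made-up keyword for this illustration."""
    return a + sep + b

args = ('hello', 'zhangsan')
kwargs = {'sep': ', '}

# Python 3 spelling of the Python 2 call apply(say, args, kwargs)
result = say(*args, **kwargs)
print(result)
```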

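To run a function over each row or each column of a DataFrame, use DataFrame.apply. In the example below, apply() uses its default axis=0, which passes each column to the function (the function runs down the rows of that column) and returns one result per column. apply(f, axis=1) passes each row to the function instead and returns one result per row.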

Input:      import numpy as np
            from pandas import DataFrame
            frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                              index=['Utah', 'Ohio', 'Texas', 'Oregon'])
            f = lambda x: x.max() - x.min()      # range of one column or row
            print(frame)
            print(frame.apply(f))
            print(frame.apply(f, axis=1))
Output:                    b         d         e
            Utah   -0.613367 -0.689123  0.001532
            Ohio    0.835977  1.377497 -0.681188
            Texas  -1.865279 -0.587092  0.057747
            Oregon -0.770581  1.244155  0.060371          #frame (random, differs on every run)

            b    2.701256
            d    2.066620
            e    0.741559
            dtype: float64                               #frame.apply(f): one value per column

            Utah      0.690655
            Ohio      2.058685
            Texas     1.923026
            Oregon    2.014736
            dtype: float64                               #frame.apply(f, axis=1): one value per row
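Because the frame above is random, the two axis settings are easier to check by hand on fixed data. The following is a minimal sketch; the values, the row labels r1 to r3, and the name spread are made up for illustration.

```python
from pandas import DataFrame

# Fixed values so the results can be checked by hand.
frame = DataFrame([[1, 10, 100],
                   [4, 20, 300],
                   [2, 50, 200]],
                  index=['r1', 'r2', 'r3'], columns=list('bde'))

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)          # axis=0: the function receives each column
per_row = frame.apply(spread, axis=1)     # axis=1: the function receives each row

print(per_column)   # b: 3, d: 40, e: 200
print(per_row)      # r1: 99, r2: 296, r3: 198
```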

3. Sorting

For a Series, sort_index() sorts by index label and sort_values() sorts by value. Each returns a new object.

Input:      obj = Series([4, 7, -3, 2], index=['d', 'a', 'b', 'c'])
            print(obj.sort_index())
            print(obj.sort_values())
Output:     a    7
            b   -3
            c    2
            d    4
            dtype: int64                        #sorted by index
            b   -3
            c    2
            d    4
            a    7
            dtype: int64                        #sorted by value

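For a DataFrame, sort_index() sorts by row index and sort_index(axis=1) sorts by column index; the values in each row or column move along with their labels. Sorting is ascending by default. For descending order, use sort_index(axis=1, ascending=False). Note that the order method used in the book has been replaced by sort_values.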

Input:      frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                              columns=['d', 'a', 'b', 'c'])
            print(frame.sort_index())
            print(frame.sort_index(axis=1))
Output:            d  a  b  c
            one    4  5  6  7
            three  0  1  2  3                 #rows sorted by index: 'one' comes first

                   a  b  c  d
            three  1  2  3  0
            one    5  6  7  4                 #columns sorted by index: 'a' comes first
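The descending form mentioned earlier can be checked on the same frame. This is a minimal sketch; the variable names desc_cols and desc_rows are made up for illustration.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])

# Columns in descending label order; each column's values travel with its label.
desc_cols = frame.sort_index(axis=1, ascending=False)
# Rows in descending label order.
desc_rows = frame.sort_index(ascending=False)

print(desc_cols)
print(desc_rows)
print(frame.columns.tolist())   # the original frame is left unchanged
```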

To sort a DataFrame by the values in one or more columns, use sort_values(by=...). For example:

Sorting by one column:

Input:      frame = DataFrame({'b': [4, 7, -3, 2], 'a': [3, 1, 0, 2]}, columns=['a', 'b'])
            print(frame.sort_values(by='b'))
Output:        a  b
            2  0 -3
            3  2  2
            0  3  4
            1  1  7

Sorting by two columns:

Input:      frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]}, columns=['a', 'b'])
            print(frame.sort_values(by=['a', 'b']))
Output:        a  b
            2  0 -3
            0  0  4
            3  1  2
            1  1  7                       #sorted by column a first; ties in a are broken by column b
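When sorting by several columns, each column can also take its own direction. The sketch below assumes a list passed to the ascending parameter, which the notes above do not cover; the variable names are made up for illustration.

```python
from pandas import DataFrame

frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})

# Both columns ascending, as in the two-column example.
both_up = frame.sort_values(by=['a', 'b'])

# Ascending on 'a', then descending on 'b' within each group of equal 'a'.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(both_up)
print(mixed)
```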

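4. Ranking

rank() assigns ranks to the values of a Series or DataFrame. When values are tied, the method option chooses how the tie is broken:

average: the default; assigns each value in a tied group the mean of the group's ranks

min: uses the lowest rank in the group for the whole group

max: uses the highest rank in the group for the whole group

first: assigns ranks in the order the values appear in the original data

Ranking a Series: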

Input:      obj = Series([7, -5, 7, 4, 2, 0, 4])
            print(obj.rank())
Output:     0    6.5
            1    1.0
            2    6.5
            3    4.5
            4    3.0
            5    2.0
            6    4.5     #ranks are on the right; the two 4s occupy ranks 4 and 5, so average gives each 4.5

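The ascending parameter defaults to True, which ranks from smallest to largest. A call such as obj.rank(ascending=False, method='max') ranks from largest to smallest and gives every tied value the highest rank in its group.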
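The tie-breaking options are easiest to compare side by side on the same Series. This is a minimal sketch; the variable names are made up for illustration.

```python
from pandas import Series

obj = Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                                     # default: method='average'
low = obj.rank(method='min')                         # ties share the lowest rank
first = obj.rank(method='first')                     # ties ranked by order of appearance
desc_max = obj.rank(ascending=False, method='max')   # largest value gets the top ranks

for name, ranks in [('average', avg), ('min', low),
                    ('first', first), ('desc/max', desc_max)]:
    print(name, ranks.tolist())
```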

Ranking a DataFrame:

Input:      frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                               'c': [-2, 5, 8, -2.5]}, columns=['a', 'b', 'c'])
            print(frame.rank(axis=1))
Output:          a    b    c
            0  2.0  3.0  1.0
            1  1.0  3.0  2.0
            2  2.0  1.0  3.0
            3  2.0  3.0  1.0
            #axis=1 ranks across the columns, i.e. within each row
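For contrast, the default axis=0 ranks the values within each column. A minimal sketch on the same data, with made-up variable names:

```python
from pandas import DataFrame

frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4.3, 7, -3, 2],
                   'c': [-2, 5, 8, -2.5]})

down_columns = frame.rank()         # axis=0 (default): rank within each column
across_rows = frame.rank(axis=1)    # axis=1: rank within each row

print(down_columns)
print(across_rows)
```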

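5. value_counts: counting how often each value occurs in a Series or DataFrame

For a Series:

Input:      obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
            print(obj.value_counts())
            print(pd.value_counts(obj.values, sort=False))

Output:     c    3
            a    3
            b    2
            d    1     #the right-hand column is the count; sorted by count in descending order by default; c precedes a because of the order in the original Series

            a    3
            c    3
            b    2
            d    1    #sort=False: counts only, no sorting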

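The output of pd.value_counts(obj.values, sort=False) is easy to misread as being ordered in some way. It is not: with sort=False the function performs no sorting and only counts occurrences, so the order of the result should not be relied on.

Usage of pandas.Series.value_counts is documented at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html.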
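One way to confirm that sort=False changes only the ordering is to compare both results after sorting them by label. The sketch below uses the Series method rather than the top-level function; the variable names are made up for illustration.

```python
from pandas import Series

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

by_count = obj.value_counts()             # sorted by count, largest first
unsorted = obj.value_counts(sort=False)   # counts only, not sorted by count

print(by_count)
print(unsorted)

# Both results hold the same counts; only the row order may differ.
print(by_count.sort_index().equals(unsorted.sort_index()))
```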

For a DataFrame:

Input:      data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                              'Qu2': [2, 3, 1, 2, 3],
                              'Qu3': [1, 5, 2, 4, 4]})
            result = data.apply(pd.value_counts).fillna(0)
            print(result)
Output:        Qu1  Qu2  Qu3
            1  1.0  1.0  1.0
            2  0.0  2.0  1.0
            3  2.0  2.0  0.0
            4  2.0  0.0  2.0
            5  0.0  0.0  1.0       #row labels are the distinct values; each cell is that value's count in the column
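As far as I know, recent pandas releases deprecate the top-level pd.value_counts function, so the same table can be built by calling the Series method on each column instead. This is a sketch under that assumption; the sort_index() and astype(int) calls are additions to make the row order explicit and the counts integers.

```python
from pandas import DataFrame

data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})

# Count values column by column; a value missing from a column becomes NaN,
# which fillna(0) turns into a count of zero.
counts = data.apply(lambda col: col.value_counts())
result = counts.fillna(0).astype(int).sort_index()
print(result)
```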

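6. Integer indexes

Take particular care when a Series or DataFrame has an integer index. For example:

Input:      ser = Series(np.arange(3))
            print(ser[-1])

This raises an error. ser has the integer index 0, 1, 2, so pandas treats -1 as a label to look up in the index, and no such label exists. With a non-integer index there is no such ambiguity, because -1 can only be a position (newer pandas releases deprecate this positional fallback in favor of iloc). For example:

Input:      ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
            print(ser2[-1])
Output:     2.0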
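The failure on an integer index can be observed without stopping the script by catching the KeyError. This is a minimal sketch; the found flag is a made-up name for illustration.

```python
import numpy as np
from pandas import Series

ser = Series(np.arange(3))            # default integer index 0, 1, 2

try:
    ser[-1]                           # -1 is looked up as a label
    found = True
except KeyError:
    found = False                     # no label -1 exists in the index

print(found)
print(ser.iloc[-1])                   # positional access: the last element
```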

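For a Series, use iloc[i] to avoid this problem (the iget_value() method from the book has been removed). iloc[i] provides reliable, position-based indexing regardless of the index type.

Input:      ser = Series(np.arange(3))
            print(ser.iloc[-1])
Output:     2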

For a DataFrame, iloc[i] selects a row and iloc[:, i] selects a column, as in the following example:

Input:      frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1], columns=[2, 4])
            print(frame)
            print(frame.iloc[1])
            print(frame.iloc[:, 1])
Output:        2  4
            2  0  1
            0  2  3
            1  4  5          #the original frame

            2    2
            4    3
            Name: 0, dtype: int32     #the second row of frame: 2, 3

            2    1
            0    3
            1    5
            Name: 4, dtype: int32      #the second column

As this shows, .iloc[i] ignores the labels defined on a Series or DataFrame and works purely by position.
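Placing position-based and label-based selection side by side on the same frame makes the difference concrete. This is a minimal sketch; .loc is the label-based counterpart of .iloc and is not used in the example above, and the variable names are made up for illustration.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1], columns=[2, 4])

by_position = frame.iloc[1]        # second row, whatever its label is
by_label = frame.loc[1]            # the row whose label is 1 (the third row here)

print(by_position.tolist())        # [2, 3]
print(by_label.tolist())           # [4, 5]
print(frame.iloc[:, 1].tolist())   # second column by position: [1, 3, 5]
print(frame.loc[:, 4].tolist())    # column labeled 4, the same column here
```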
