Pandas Study Notes (1)

2016-11-01 17:39

I. Introduction to pandas Data Structures

>>> from pandas import Series,DataFrame

>>> import pandas as pd

>>> import numpy as np

1.Series

Series: a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index).

>>> obj=Series([1,2,3,4])

# If no index is specified, an integer index running from 0 to N-1 is generated automatically

>>> obj

0 1

1 2

2 3

3 4

dtype: int64

>>> obj.values

array([1, 2, 3, 4])

>>> obj.index

RangeIndex(start=0, stop=4, step=1)
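
The default integer index is only the fallback. As a minimal sketch, assuming made-up values and labels (`labeled` and its contents are illustrative, not from the notes), an explicit index can be passed at construction time and then used for lookups:

```python
import pandas as pd

# Hypothetical data: four values with hand-picked string labels.
labeled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Look up one value by its label, or several values by a list of labels.
single = labeled["a"]
subset = labeled[["c", "a"]]

print(labeled.index.tolist())
print(single)
print(subset.tolist())
```

The labels keep the order in which they were given; pandas does not sort them.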

# NumPy array operations preserve the link between index and values

>>> obj[obj>2]

2 3

3 4

dtype: int64

>>> obj*2

0 2

1 4

2 6

3 8

dtype: int64

>>> np.exp(obj)

0 2.718282

1 7.389056

2 20.085537

3 54.598150

dtype: float64
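
The same behaviour is easier to see with string labels. The following is a small sketch with illustrative names (`temps`, `filtered`, `doubled`, `squared` are assumptions made for this example): each operation returns a new Series whose labels still point at the right values.

```python
import numpy as np
import pandas as pd

temps = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])  # illustrative data

filtered = temps[temps > 2]   # boolean mask keeps the labels "c" and "d"
doubled = temps * 2           # element-wise arithmetic, labels unchanged
squared = np.square(temps)    # a NumPy ufunc also returns a Series

print(filtered)
print(doubled)
print(squared)
```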

# If the data is stored in a Python dict, a Series can be created directly from that dict

>>> score={"Tom":99,"Lucy":90,"John":80,"Green":58}

>>> score

{'John': 80, 'Green': 58, 'Lucy': 90, 'Tom': 99}

>>> obj_score=Series(score)

>>> obj_score

Green 58

John 80

Lucy 90

Tom 99

dtype: int64

# A Series can be thought of as a fixed-length ordered dict, so it can be passed to many functions that expect a dict

>>> "Green" in obj_score

True

>>> "yaoxq" in obj_score

False
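
A short sketch of the dict-like side of a Series, reusing the `score` data from above. The variable names `has_green`, `has_99`, `missing` and `round_trip` are introduced here only for illustration.

```python
import pandas as pd

score = {"Tom": 99, "Lucy": 90, "John": 80, "Green": 58}
obj_score = pd.Series(score)

# Membership tests look at the index labels, not at the values.
has_green = "Green" in obj_score
has_99 = 99 in obj_score          # 99 is a value, not a label

# .get mirrors dict.get: it returns a default instead of raising KeyError.
missing = obj_score.get("yaoxq", default=0)

# Converting back yields an ordinary dict again.
round_trip = obj_score.to_dict()

print(has_green, has_99, missing)
```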

# Passing a list of labels as the Series index picks out the matching values; "NaN" marks a missing (NA) value.

>>> name=["A","C","B","Tom"]

>>> obj_score_new=Series(obj_score,index=name)

>>> obj_score_new

A NaN

C NaN

B NaN

Tom 99.0

dtype: float64
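
The label selection can be reproduced as a sketch, under the assumption that the labels are given as a list so their order is fixed (`labels`, `picked` and `same` are illustrative names). Building the Series with `index=` gives the same result as calling `reindex` afterwards.

```python
import numpy as np
import pandas as pd

score = {"Tom": 99, "Lucy": 90, "John": 80, "Green": 58}
labels = ["A", "B", "C", "Tom"]       # a list keeps the label order fixed

# Only "Tom" exists in score; the other labels get NaN.
picked = pd.Series(score, index=labels)

# The int values become floats because NaN cannot be stored in int64.
print(picked.dtype)

# Equivalent route: build the full Series first, then reindex it.
same = pd.Series(score).reindex(labels)
```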

# We can use isnull and notnull to detect missing data

>>> pd.isnull(obj_score)

Green False

John False

Lucy False

Tom False

dtype: bool

>>> pd.isnull(obj_score_new)

A True

C True

B True

Tom False

dtype: bool

>>> pd.notnull(obj_score)

Green True

John True

Lucy True

Tom True

dtype: bool

>>> pd.notnull(obj_score_new)

A False

C False

B False

Tom True

dtype: bool
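
The boolean results are usually fed into something else. A minimal sketch, with a hand-built Series `s` standing in for `obj_score_new`, that counts the missing entries and keeps only the present ones:

```python
import numpy as np
import pandas as pd

# Stand-in for obj_score_new: three missing values and one real one.
s = pd.Series([np.nan, np.nan, np.nan, 99.0], index=["A", "B", "C", "Tom"])

n_missing = int(s.isnull().sum())   # True counts as 1 when summed
present = s[s.notnull()]            # boolean mask keeps non-missing entries
dropped = s.dropna()                # same result through a dedicated method

print(n_missing)
print(present)
```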

# pandas automatically aligns data by label when the indexes differ

>>> obj_score+obj_score_new

A NaN

B NaN

C NaN

Green NaN

John NaN

Lucy NaN

Tom 198.0

dtype: float64
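
Alignment produces NaN wherever a label exists on only one side. As a sketch with two small made-up Series (`a` and `b` are illustrative), the `add` method with `fill_value` treats the missing side as a given number instead:

```python
import pandas as pd

a = pd.Series({"Tom": 99, "Lucy": 90})
b = pd.Series({"Tom": 1, "John": 5})

plain = a + b                      # labels missing on either side give NaN
filled = a.add(b, fill_value=0)    # a missing side is treated as 0

print(plain)
print(filled)
```

Whether filling with 0 is appropriate depends on what the data means; for scores it may be more honest to leave the NaN in place.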

# Both the Series itself and its index have a name attribute, which ties in closely with other pandas functionality

>>> obj_score.name = "score"

>>> obj_score.index.name = "name"

>>> obj_score

name

Green 58

John 80

Lucy 90

Tom 99

Name: score, dtype: int64

# The index of a Series can be changed by assigning to it

>>> obj_score.index=["","","",""]

>>> obj_score

58

80

90

99

Name: score, dtype: int64
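
One place where the two `name` attributes show up is `reset_index`, which uses them as column names. The sketch below also replaces the index on a copy; the new labels `"g"`, `"j"`, `"l"`, `"t"` are hypothetical and must match the Series in length.

```python
import pandas as pd

s = pd.Series({"Green": 58, "John": 80, "Lucy": 90, "Tom": 99})
s.name = "score"
s.index.name = "name"

# The index becomes a column called "name", the values a column called "score".
table = s.reset_index()
print(table.columns.tolist())

# Assigning a new list of labels swaps the whole index in one step.
renamed = s.copy()
renamed.index = ["g", "j", "l", "t"]   # hypothetical labels, same length as s
print(renamed)
```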

2.DataFrame

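A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type. A DataFrame has both a row index and a column index, and it can be viewed as a dict of Series that all share the same index.

Internally, the data in a DataFrame is stored as one or more two-dimensional blocks.

There are many ways to create a DataFrame. The most common is to pass in a dict of equal-length lists or NumPy arrays: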

>>> data={'name':['Tom','Tom','Tom','Lucy','Lucy','John'],'year':[2014,2015,2016,2015,2016,2016],'score':[80,85,90,86,88,83]}

>>> frame=DataFrame(data)

>>> frame

name score year

0 Tom 80 2014

1 Tom 85 2015

2 Tom 90 2016

3 Lucy 86 2015

4 Lucy 88 2016

5 John 83 2016

# The column order can be specified explicitly

>>> DataFrame(data,columns=['year','score','name'])

year score name

0 2014 80 Tom

1 2015 85 Tom

2 2016 90 Tom

3 2015 86 Lucy

4 2016 88 Lucy

5 2016 83 John
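
The `columns` argument does more than reorder. In this sketch (a shortened version of `data`, plus a hypothetical column name `"rank"` that does not exist in the dict), columns left out of the list are dropped and unknown ones are filled with NaN:

```python
import pandas as pd

data = {"name": ["Tom", "Lucy"], "year": [2015, 2016], "score": [85, 88]}

# "rank" is a hypothetical column that is not present in data.
frame2 = pd.DataFrame(data, columns=["year", "name", "rank"])

print(frame2)
```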

# A specific Series can be retrieved either by attribute access or by dict-style notation (its name attribute is already set)

>>> frame.year

0 2014

1 2015

2 2016

3 2015

4 2016

5 2016

Name: year, dtype: int64

>>> frame['score']

0 80

1 85

2 90

3 86

4 88

5 83

Name: score, dtype: int64
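
The two access styles are not fully interchangeable. A small sketch, assuming a made-up column label `"max score"` that contains a space: bracket notation works for any label, while attribute access needs the label to be a valid Python identifier that does not collide with an existing DataFrame attribute.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Tom", "Lucy"], "max score": [90, 88]})

by_key = frame["name"]         # bracket notation works for any label
by_attr = frame.name           # works because "name" is a valid identifier
spaced = frame["max score"]    # a label with a space needs brackets

print(by_key.equals(by_attr))
print(spaced.name)
```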

# Assigning to a column that does not exist creates a new column; del removes a column.

>>> frame['isgirl']= frame.name == 'Lucy'

>>> frame

name score year isgirl

0 Tom 80 2014 False

1 Tom 85 2015 False

2 Tom 90 2016 False

3 Lucy 86 2015 True

4 Lucy 88 2016 True

5 John 83 2016 False

>>> del frame['isgirl']

>>> frame.columns

Index([u'name', u'score', u'year'], dtype='object')
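
A compact sketch of adding and removing columns, with illustrative column names `"passed"` and `"bonus"`. It also shows `drop`, which returns a new DataFrame instead of changing the original as `del` does.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Tom", "Lucy", "John"], "score": [80, 86, 83]})

frame["passed"] = frame["score"] >= 85    # new column from a comparison
frame["bonus"] = 5                        # a scalar is broadcast to every row

del frame["bonus"]                        # removes the column in place
trimmed = frame.drop(columns=["passed"])  # returns a copy, frame is unchanged

print(frame.columns.tolist())
print(trimmed.columns.tolist())
```

New columns have to be created with bracket notation; an assignment such as `frame.bonus = 5` sets a plain Python attribute rather than adding a column.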

Another way is a nested dict (a dict of dicts):

>>> data={'Tom':{2000:80,2001:85,2002:90},'Lucy':{2001:90,2002:99},'John':{2002:100}}

>>> frame=DataFrame(data)

>>> frame

John Lucy Tom

2000 NaN NaN 80

2001 NaN 90.0 85

2002 100.0 99.0 90

# Use .T to transpose a DataFrame

>>> frame.T

2000 2001 2002

John NaN NaN 100.0

Lucy NaN 90.0 99.0

Tom 80.0 85.0 90.0

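In the example above, the keys of the inner dicts are merged and sorted to form the final index. If an index is specified explicitly, pandas keeps only the rows for those labels, and labels that have no data are filled with NaN: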

>>> frame1=DataFrame(data,index=[1999,2000,2001])

>>> frame1

John Lucy Tom

1999 NaN NaN NaN

2000 NaN NaN 80.0

2001 NaN 90.0 85.0
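
The nested-dict behaviour can be checked with a smaller, made-up dict (`auto` and `picked` are illustrative names). Outer keys become columns and inner keys become row labels. Depending on the pandas version, the row labels may come out sorted or in first-seen order, so the sketch sorts them before printing.

```python
import pandas as pd

data = {"Tom": {2000: 80, 2001: 85}, "Lucy": {2001: 90}}

# Without an index, the row labels are the union of the inner keys.
auto = pd.DataFrame(data)
print(sorted(auto.index.tolist()))

# With an explicit index, only the listed row labels appear.
picked = pd.DataFrame(data, index=[1999, 2001])
print(picked)
```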

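3.Index objects

pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name).

When a Series or DataFrame is constructed, any array or sequence of labels that is used is converted to an Index.

Index objects are immutable.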

>>> frame.index[1]

2001

>>> frame.index[1:]

Int64Index([2001, 2002], dtype='int64')

>>> frame.index[1]=2009

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/indexes/base.py", line 1245, in __set

raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations
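
Because an Index cannot be edited in place, a changed Index has to be built as a new object. A minimal sketch, with `idx`, `error` and `replaced` as illustrative names, that catches the error and then swaps one label using `delete` and `insert`:

```python
import pandas as pd

idx = pd.Index([2000, 2001, 2002])

error = None
try:
    idx[1] = 2009                  # item assignment is rejected
except TypeError as exc:
    error = str(exc)
print(error)

# Build a new Index instead of editing the old one.
replaced = idx.delete(1).insert(1, 2009)
print(replaced.tolist())
print(idx.tolist())                # the original is untouched
```

Immutability is what makes it safe for several Series and DataFrame objects to share one Index.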
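# Index methods and properties


Method/property	Description
append	Concatenate with another Index object, producing a new Index
difference	Compute the set difference as a new Index (named diff in old pandas releases)
intersection	Compute the set intersection
union	Compute the set union
isin	Compute a boolean array indicating whether each value is contained in the passed collection
delete	Delete the element at position i and return a new Index
drop	Delete the passed values and return a new Index
insert	Insert an element at position i and return a new Index
is_monotonic	Return True if each element is greater than or equal to the previous one
is_unique	Return True if the Index has no duplicate values
unique	Return the array of unique values in the Index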
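
Several of the Index methods are sketched below on two small, made-up Index objects (`left` and `right`). The sketch assumes a pandas release where the set difference method is named `difference` and the monotonicity check is spelled `is_monotonic_increasing`.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.intersection(right)       # labels present in both
either = left.union(right)            # labels present in at least one
only_left = left.difference(right)    # labels in left but not in right
mask = left.isin(["a", "c"])          # boolean NumPy array, one entry per label

appended = left.append(right)         # plain concatenation, duplicates are kept
deduped = appended.unique()           # duplicates removed, first-seen order
shorter = left.drop(["a"])            # remove labels by value

ordered = left.is_monotonic_increasing

print(both.tolist(), either.tolist(), only_left.tolist())
print(mask.tolist(), appended.is_unique, ordered)
```

Every call returns a new object; `left` and `right` stay unchanged throughout.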
