
Getting Started with Python pandas

2016-09-18 20:55

Reading CSV files with pandas

pandas provides a rich set of file-reading interfaces, which makes data processing very convenient.

Creating the data

import sys

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)

# create data
names = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']
ages = [11, 12, 13, 14]

# zip() returns an iterator in Python 3, so materialize it as a list
dataset = list(zip(names, ages))
dataset


Convert the data into a pandas DataFrame:

df = pd.DataFrame(data = dataset,columns = ['Names','Age'])
df
Out[47]:
      Names  Age
0  zhangsan   11
1      lisi   12
2    wangwu   13
3   zhaoliu   14


Write the data to a CSV file using the pandas interface:

df.to_csv('age.csv',index = False,header = False)


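The two parameters control whether the row index and the header row are written; with both set to False, only the values are saved. The file has now been created.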
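To make the effect of the two flags visible without touching the disk, here is a minimal sketch. It relies on to_csv returning the CSV text as a string when no path is given; the two-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
df_demo = pd.DataFrame({'Names': ['zhangsan', 'lisi'], 'Age': [11, 12]})

# Default: the header row and the row index are both written
with_both = df_demo.to_csv()

# index=False, header=False: only the values are written
values_only = df_demo.to_csv(index=False, header=False)

print(with_both)
print(values_only)
```

The first printout starts with a header line and each record is prefixed by its index; the second contains only the records.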

Reading the data

Location = 'E:\\code\\python\\pandas\\age.csv'
df = pd.read_csv(Location)
df

Out[50]:
  zhangsan  11
0     lisi  12
1   wangwu  13
2  zhaoliu  14


The problem is that read_csv treats the first record as the header, so we need to supply the column names explicitly.

df = pd.read_csv(Location,names =['Names','Age'])
df
Out[53]:
      Names  Age
0  zhangsan   11
1      lisi   12
2    wangwu   13
3   zhaoliu   14
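The header problem can be reproduced in memory. The sketch below feeds the same four records to read_csv through io.StringIO (a stand-in for the file on disk, chosen here so the example runs anywhere) and compares the two calls.

```python
import io

import pandas as pd

# Same content as age.csv, held in memory instead of on disk
raw = "zhangsan,11\nlisi,12\nwangwu,13\nzhaoliu,14\n"

# Without names=, the first record is consumed as the header
wrong = pd.read_csv(io.StringIO(raw))

# With names=, every record is kept as data
right = pd.read_csv(io.StringIO(raw), names=['Names', 'Age'])

print(len(wrong), list(wrong.columns))
print(len(right), list(right.columns))
```

Passing header=None instead of names= would also keep all four records, but the columns would then be labelled 0 and 1.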


Analyzing the data

pandas offers operations that resemble database queries, such as sorting and aggregation.

Sorted = df.sort_values(['Age'], ascending = False)
Sorted
Out[56]:
      Names  Age
3   zhaoliu   14
2    wangwu   13
1      lisi   12
0  zhangsan   11

df['Age'].max()
Out[55]: 14
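max() returns only the value. If the matching name is wanted as well, one possible approach (not part of the original walkthrough) is to look up the row label with idxmax, as in this small sketch:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Names': ['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
    'Age': [11, 12, 13, 14],
})

# idxmax() gives the index label of the largest Age
oldest_label = df_demo['Age'].idxmax()

# .loc then fetches the name stored in that row
oldest_name = df_demo.loc[oldest_label, 'Names']

print(oldest_label, oldest_name)
```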


Presenting the data

pandas also provides data visualization (plots are drawn with matplotlib):

df['Age'].plot()




Reading txt files with pandas

Reading a txt file with pandas is very similar to reading a CSV file.
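The original gives no code for this case. As a minimal sketch, assuming the text file is tab-separated and has no header row (both are assumptions, as is the sample content), read_csv can be reused with an explicit sep:

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file
txt = "zhangsan\t11\nlisi\t12\n"

# The only difference from the CSV case is the separator
df_txt = pd.read_csv(io.StringIO(txt), sep='\t', names=['Names', 'Age'])

print(df_txt)
```

For a real file, the io.StringIO object would be replaced by the file path.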

Reading Excel files with pandas

Creating the data

Wrap the data generation in a function:

import numpy as np
import pandas as pd

np.random.seed(111)

def CreateDataSet(Number):
    Output = []
    for i in range(Number):

        # Create a weekly (Monday) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

        # Create random data
        data = np.random.randint(low=25, high=1000, size=len(rng))

        # Status pool
        status = [1, 2, 3]

        # Pick a random status for every date
        random_status = [status[np.random.randint(low=0, high=len(status))] for _ in range(len(rng))]

        # State pool; 'fl' is lowercase so there is something to clean up later
        # (the source prints 'f1', assumed here to be a typo)
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

        # Pick a random state for every date
        random_states = [states[np.random.randint(low=0, high=len(states))] for _ in range(len(rng))]

        Output.extend(zip(random_states, random_status, data, rng))

    return Output

dataset = CreateDataSet(4)
print(dataset)

df = pd.DataFrame(data=dataset, columns=['state', 'status', 'CustomerCount', 'StatusDate'])

df.to_excel('Lesson3.xlsx', index=False)
print('Done')


The only thing worth noting here is how the random numbers are generated.
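Two properties of that random-number code are easy to miss: seeding makes the draws repeatable, and the high bound of randint is exclusive. A short sketch that checks both (the sample size of 5 is arbitrary):

```python
import numpy as np

# Seeding twice with the same value reproduces the same draws
np.random.seed(111)
first = np.random.randint(low=25, high=1000, size=5)

np.random.seed(111)
second = np.random.randint(low=25, high=1000, size=5)

print(first)
print(second)

# high is exclusive, so randint(0, len(pool)) is always a valid list index
pool = [1, 2, 3]
picks = [pool[np.random.randint(low=0, high=len(pool))] for _ in range(10)]
print(picks)
```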

Reading the data from the file

Location = 'E:\\code\\python\\pandas\\Lesson3.xlsx'

# parse a specific sheet (the first one) and use StatusDate as the index
df = pd.read_excel(Location, sheet_name=0, index_col='StatusDate')
df.dtypes

df.index

df.head()
Out[4]:
           state  status  CustomerCount
StatusDate
2009-01-05    GA       1            877
2009-01-12    FL       1            901
2009-02-02    GA       1            300
2009-03-09    NY       1            992


Data preprocessing

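df['state'].unique()  # list the distinct state values

# normalize all state names to upper case
df['state'] = df.state.apply(lambda x: x.upper())

# keep only the rows where status == 1
mask = df['status'] == 1

df = df[mask].copy()

# merge NJ into NY; .loc avoids chained assignment
df.loc[df.state == 'NJ', 'state'] = 'NY'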


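Some basic operations on the data:

df['CustomerCount'].plot(figsize=(15, 5))  # plot the customer counts

sortdf = df[df['state'] == 'NY'].sort_index(axis=0)  # sort by the row index (the dates)
sortdf.head(10)

# Daily is not defined in the original; it is assumed here to hold the
# per-state, per-date customer totals
Daily = df.reset_index().groupby(['state', 'StatusDate'])[['CustomerCount']].sum()
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()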


Plot the data to present it.

The data can also be broken down by month.
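The original does not show how the monthly breakdown was produced. One way it might be done, sketched on a small made-up weekly series, is to group on the year and month of the date index:

```python
import pandas as pd

# Eight made-up weekly (Monday) observations starting 2009-01-05
idx = pd.date_range(start='2009-01-05', periods=8, freq='W-MON')
demo = pd.DataFrame({'CustomerCount': [10, 20, 30, 40, 50, 60, 70, 80]}, index=idx)

# Group on the year and month of the index, then total each month
monthly = demo.groupby([demo.index.year, demo.index.month])['CustomerCount'].sum()

print(monthly)
```

The four January Mondays and the four February Mondays each collapse into a single monthly total.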



Adding and deleting rows and columns
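The session that follows starts from an existing df with a single Rev column holding 0 to 9, and it only exercises columns. As a sketch under that assumption, the frame could be built as below; the drop calls are an optional, non-mutating alternative to del and also cover row deletion, which the session itself does not show.

```python
import pandas as pd

# Assumed construction of the starting frame: one column, values 0..9
df = pd.DataFrame({'Rev': range(10)})

# Work on a copy so df itself stays as the session expects it
demo = df.copy()
demo['NewCol'] = 5

# drop() returns a new frame and leaves demo untouched
without_col = demo.drop(columns=['NewCol'])
without_rows = demo.drop(index=[0, 1])

print(without_col.columns.tolist())
print(without_rows.index.tolist())
```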

In [5]: df
Out[5]:
   Rev
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9

In [6]: df['NewCol'] = 5

In [7]: df
Out[7]:
   Rev  NewCol
0    0       5
1    1       5
2    2       5
3    3       5
4    4       5
5    5       5
6    6       5
7    7       5
8    8       5
9    9       5

In [9]: df['NewCol'] = df['NewCol'] + 1

In [10]: df
Out[10]:
   Rev  NewCol
0    0       6
1    1       6
2    2       6
3    3       6
4    4       6
5    5       6
6    6       6
7    7       6
8    8       6
9    9       6

In [11]: del df['NewCol']

In [12]: df
Out[12]:
   Rev
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9

In [13]: df['test'] = 3

In [14]: df
Out[14]:
   Rev  test
0    0     3
1    1     3
2    2     3
3    3     3
4    4     3
5    5     3
6    6     3
7    7     3
8    8     3
9    9     3

In [15]: df['col'] = df['Rev']

In [16]: df
Out[16]:
   Rev  test  col
0    0     3    0
1    1     3    1
2    2     3    2
3    3     3    3
4    4     3    4
5    5     3    5
6    6     3    6
7    7     3    7
8    8     3    8
9    9     3    9

In [17]: i = ['a','b','c','d','e','f','g','h','i','j']

In [18]: df.index = i

In [19]: df
Out[19]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4
f    5     3    5
g    6     3    6
h    7     3    7
i    8     3    8
j    9     3    9

In [21]: df.loc['a']
Out[21]:
Rev     0
test    3
col     0
Name: a, dtype: int64

In [22]: df.loc['a':'d']
Out[22]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3

In [23]: df['Rev']
Out[23]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: Rev, dtype: int32
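# df[0:3, 'Rev'] raises an error: plain [] does not accept a (rows, column) pair.
# The original used df.ix here, which was removed in pandas 1.0; iloc selects by position instead.
In [25]: df.iloc[0:3]['Rev']
Out[25]:
a    0
b    1
c    2
Name: Rev, dtype: int32

In [26]: df.iloc[5:]['col']
Out[26]:
f    5
g    6
h    7
i    8
j    9
Name: col, dtype: int32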

In [27]: df.head()
Out[27]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4

In [28]: df.head(2)
Out[28]:
   Rev  test  col
a    0     3    0
b    1     3    1

In [29]: df.tail(2)
Out[29]:
   Rev  test  col
i    8     3    8
j    9     3    9
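The selections in this session mix label-based and position-based access. A compact sketch of the difference, on a frame shaped like the one above (rebuilt here so the example is self-contained): .loc slices by label and includes the end label, while .iloc slices by position and excludes the end position.

```python
import pandas as pd

demo = pd.DataFrame({'Rev': range(10)}, index=list('abcdefghij'))

# Label slice: both 'a' and 'c' are included
by_label = demo.loc['a':'c', 'Rev']

# Position slice: rows 0, 1, 2 (position 3 is excluded)
by_position = demo.iloc[0:3]['Rev']

print(by_label)
print(by_position)
```

Both expressions select the rows a, b and c.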


groupby in pandas

First, create the DataFrame:

In [11]: df
Out[11]:
  letter  one  two
0      a    1    2
1      a    1    2
2      b    1    2
3      b    1    2
4      c    1    2
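The code that created this frame is not included in the original. A minimal sketch of one way to build it from a dictionary of columns (the variable name d is an illustrative choice):

```python
import pandas as pd

# Assumed construction matching the printed frame
d = {
    'letter': ['a', 'a', 'b', 'b', 'c'],
    'one': [1, 1, 1, 1, 1],
    'two': [2, 2, 2, 2, 2],
}
df = pd.DataFrame(d)

# Summing within each letter adds up the rows that share it
totals = df.groupby('letter').sum()

print(df)
print(totals)
```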


Now look at the effect of grouping:

In [14]: one = df.groupby('letter')

In [15]: one
Out[15]: <pandas.core.groupby.DataFrameGroupBy object at 0x00000000094BDFD0>

In [16]: one.sum()
Out[16]:
        one  two
letter
a         2    4
b         2    4
c         1    2


groupby on two columns

In [18]: letterone = df.groupby(['letter','one']).sum()

In [19]: letterone
Out[19]:
            two
letter one
a      1      4
b      1      4
c      1      2


If we do not want letter to become the index:

In [21]: letterone = df.groupby(['letter','one'],as_index = False).sum()

In [22]: letterone
Out[22]:
  letter  one  two
0      a    1    4
1      b    1    4
2      c    1    2
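as_index=False is one way to keep the grouping keys as ordinary columns. Another option, not used in the original, is to group normally and call reset_index afterwards. The sketch below rebuilds the small frame and shows both routes side by side.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'letter': ['a', 'a', 'b', 'b', 'c'],
    'one': [1, 1, 1, 1, 1],
    'two': [2, 2, 2, 2, 2],
})

# Route 1: keep the keys as columns while grouping
flat_a = df_demo.groupby(['letter', 'one'], as_index=False).sum()

# Route 2: group into a MultiIndex, then move the keys back into columns
flat_b = df_demo.groupby(['letter', 'one']).sum().reset_index()

print(flat_a)
print(flat_b)
```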


Computing outliers with pandas

1: Create the data

In [14]: df
Out[14]:
            Revenue State
2012-01-01      1.0    NY
2012-02-01      2.0    NY
2012-03-01      3.0    NY
2012-04-01      4.0    NY
2012-05-01      5.0    FL
2012-06-01      6.0    FL
2012-07-01      7.0    GA
2012-08-01      8.0    GA
2012-09-01      9.0    FL
2012-10-01     10.0    FL
2013-01-01     10.0    NY
2013-02-01     10.0    NY
2013-03-01      9.0    NY
2013-04-01      9.0    NY
2013-05-01      8.0    FL
2013-06-01      8.0    FL
2013-07-01      7.0    GA
2013-08-01      7.0    GA
2013-09-01      6.0    FL
2013-10-01      6.0    FL
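The construction of this frame is not shown in the original. A sketch of one way to reproduce it, assuming two ten-month blocks on a month-start date index that are concatenated (the names states, data1 and data2 are illustrative):

```python
import pandas as pd

# The same state pattern repeats in both years
states = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']

data1 = pd.DataFrame(
    {'Revenue': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], 'State': states},
    index=pd.date_range('2012-01-01', periods=10, freq='MS'),
)
data2 = pd.DataFrame(
    {'Revenue': [10.0, 10.0, 9.0, 9.0, 8.0, 8.0, 7.0, 7.0, 6.0, 6.0], 'State': states},
    index=pd.date_range('2013-01-01', periods=10, freq='MS'),
)

# Stack the two years into one 20-row frame
df = pd.concat([data1, data2])

print(df)
```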

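Compute the mean and standard deviation, then flag the outliers

In [15]: newdf = df.copy()
In [16]: newdf['x_Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
In [17]: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
In [19]: newdf['Outlier'] = newdf['x_Mean'] > newdf['1.96*std']

Here the mean and standard deviation are taken over the whole Revenue column, and a value is flagged when it lies more than 1.96 standard deviations from the mean.

Group by State

In [22]: newdf = df.copy()

In [23]: State = newdf.groupby('State')

Using lambda functions (Revenue is selected explicitly so that only that column is transformed):

In [27]: newdf['Outlier'] = State['Revenue'].transform(lambda x: abs(x - x.mean()) > 1.96*x.std())
In [28]: newdf['x_Mean'] = State['Revenue'].transform(lambda x: abs(x - x.mean()))
In [30]: newdf['1.96*std'] = State['Revenue'].transform(lambda x: x.std()*1.96)
newdf.head()
Out[31]:
            Revenue State  Outlier  x_Mean  1.96*std
2012-01-01      1.0    NY    False    5.00  7.554813
2012-02-01      2.0    NY    False    4.00  7.554813
2012-03-01      3.0    NY    False    3.00  7.554813
2012-04-01      4.0    NY    False    2.00  7.554813
2012-05-01      5.0    FL    False    2.25  3.434996

The x_Mean and 1.96*std values in this output are computed within each State group (for example, the NY mean is 6.0), not over the whole column.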


The same logic can be packaged in a single function:

newdf = df.copy()

# group by State and by the calendar month of the date index
StateMonth = newdf.groupby(['State', lambda x: x.month], group_keys=False)

def s(group):
    group = group.copy()  # work on a copy instead of mutating the group in place
    group['x_Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = group['x_Mean'] > group['1.96*std']
    return group

Newdf2 = StateMonth.apply(s)
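As a possible alternative to apply, the same three columns can be computed with transform on the Revenue column alone. This is a sketch, not the original author's method; the frame is rebuilt here so the block runs on its own, and the names grouped, rev_2012 and rev_2013 are illustrative.

```python
import pandas as pd

states = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
rev_2012 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
rev_2013 = [10.0, 10.0, 9.0, 9.0, 8.0, 8.0, 7.0, 7.0, 6.0, 6.0]
index = pd.date_range('2012-01-01', periods=10, freq='MS').append(
    pd.date_range('2013-01-01', periods=10, freq='MS')
)
newdf = pd.DataFrame({'Revenue': rev_2012 + rev_2013, 'State': states * 2}, index=index)

# Group the Revenue column by State and by calendar month of the index
grouped = newdf.groupby(['State', newdf.index.month])['Revenue']

# transform returns a result aligned with the original rows
newdf['x_Mean'] = (newdf['Revenue'] - grouped.transform('mean')).abs()
newdf['1.96*std'] = 1.96 * grouped.transform('std')
newdf['Outlier'] = newdf['x_Mean'] > newdf['1.96*std']

print(newdf.head())
```

With this data every (State, month) group holds only two rows, so no value can exceed 1.96 standard deviations of its own group and the Outlier column is False throughout.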