Getting Started with Python pandas
2016-09-18 20:55
Reading CSV files with pandas
pandas provides a rich set of file-reading interfaces, which makes data processing very convenient.
Creating the data
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import matplotlib

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)

# create data
names = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']
ages = [11, 12, 13, 14]
# list() is needed on Python 3, where zip returns a one-shot iterator
dataset = list(zip(names, ages))
dataset
Convert the data into a pandas DataFrame:
df = pd.DataFrame(data=dataset, columns=['Names', 'Age'])
df
Out[47]:
      Names  Age
0  zhangsan   11
1      lisi   12
2    wangwu   13
3   zhaoliu   14
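On Python 3, zip returns a one-shot iterator, so it is usually materialized with list() before being passed to DataFrame. The following is a minimal sketch of two equivalent ways to build the same table; the dict-based form is an alternative that the original post does not use.

```python
import pandas as pd

names = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']
ages = [11, 12, 13, 14]

# Row-oriented: a list of (name, age) tuples plus explicit column names
df_rows = pd.DataFrame(data=list(zip(names, ages)), columns=['Names', 'Age'])

# Column-oriented: a dict mapping each column name to its values
df_cols = pd.DataFrame({'Names': names, 'Age': ages})

print(df_rows)
print(df_cols)
```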
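Write the data to a CSV file using the pandas interface:
df.to_csv('age.csv', index=False, header=False)
The two parameters control whether the row index and the header row are written. With both set to False, only the data rows end up in the file. The file has now been generated.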
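To see what index and header change, the output can be compared as text. This is a small sketch; calling to_csv without a path returns the CSV content as a string, which avoids touching the disk.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['zhangsan', 'lisi'], 'Age': [11, 12]})

# Defaults: both the header row and the index column are written
with_both = df.to_csv()

# index=False, header=False: only the data rows are written
data_only = df.to_csv(index=False, header=False)

print(with_both)
print(data_only)
```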
Reading the data
Location = 'E:\\code\\python\\pandas\\age.csv'
df = pd.read_csv(Location)
df
Out[50]:
  zhangsan  11
0     lisi  12
1   wangwu  13
2  zhaoliu  14
One problem appears: read_csv treats the first row of data as the header, so we need to specify the column names ourselves.
df = pd.read_csv(Location, names=['Names', 'Age'])
df
Out[53]:
      Names  Age
0  zhangsan   11
1      lisi   12
2    wangwu   13
3   zhaoliu   14
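The header behavior can be reproduced without a file on disk. This sketch feeds the same four rows to read_csv through io.StringIO (an in-memory stand-in for age.csv) and compares the default with header=None and names.

```python
import io
import pandas as pd

raw = "zhangsan,11\nlisi,12\nwangwu,13\nzhaoliu,14\n"

# Default: the first line is consumed as the header, leaving 3 data rows
df_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data; columns get the integer labels 0 and 1
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: every line is data and the columns are labeled explicitly
df_named = pd.read_csv(io.StringIO(raw), names=['Names', 'Age'])

print(len(df_default), len(df_no_header), len(df_named))
```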
Analyzing the data
pandas provides interfaces that work much like database operations.
Sorted = df.sort_values(['Age'], ascending=False)
Sorted
Out[56]:
      Names  Age
3   zhaoliu   14
2    wangwu   13
1      lisi   12
0  zhangsan   11
df['Age'].max()
Out[55]: 14
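The database analogy can be made a little more concrete. The sketch below pairs each pandas call with a rough SQL counterpart in the comments; the SQL is only an analogy, and the boolean filter is an addition that the original post does not show.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
                   'Age': [11, 12, 13, 14]})

# Roughly: SELECT * FROM df ORDER BY Age DESC
by_age = df.sort_values('Age', ascending=False)

# Roughly: SELECT MAX(Age) FROM df
oldest_age = df['Age'].max()

# Roughly: SELECT * FROM df WHERE Age = <oldest_age>
oldest = df[df['Age'] == oldest_age]

print(by_age)
print(oldest['Names'].tolist())
```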
Presenting the data
pandas also provides data visualization:
df['Age'].plot()
Reading txt files with pandas
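A delimited text file can usually be read with the same read_csv function by passing the separator explicitly. This is a minimal sketch that assumes a tab-separated file; the sample content is made up for illustration and read from memory instead of from disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file
raw = "zhangsan\t11\nlisi\t12\nwangwu\t13\n"

# sep='\t' tells read_csv to split on tabs instead of commas
df = pd.read_csv(io.StringIO(raw), sep='\t', names=['Names', 'Age'])

print(df)
```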
Reading a txt file with pandas is much like reading a CSV file.
Reading Excel files with pandas
Creating the data
Wrap it in a function:
import pandas as pd
import matplotlib.pyplot as plt
import sys
import matplotlib
import numpy as np

np.random.seed(111)

def CreateDataSet(Number):
    Output = []
    for i in range(Number):
        # Create a weekly (Monday) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        # Create random data
        data = np.random.randint(low=25, high=1000, size=len(rng))
        # Status pool
        status = [1, 2, 3]
        random_status = [status[np.random.randint(low=0, high=len(status))] for _ in range(len(rng))]
        # State pool ('fl' is lowercase; it is cleaned up during preprocessing)
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']
        random_states = [states[np.random.randint(low=0, high=len(states))] for _ in range(len(rng))]
        Output.extend(zip(random_states, random_status, data, rng))
    return Output

dataset = CreateDataSet(4)
print(dataset)
df = pd.DataFrame(data=dataset, columns=['state', 'status', 'CustomerCount', 'StatusDate'])
df.to_excel('Lesson3.xlsx', index=False)
print('Done')
The only thing to note here is how the random numbers are generated.
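Two details of the random number calls are worth seeing in isolation: seeding makes the sequence reproducible, and randint excludes its upper bound, which is why it can be used directly as a list index. A small sketch follows; the helper name draw is hypothetical.

```python
import numpy as np

def draw(seed, size=5):
    # Re-seeding the global generator restarts the same sequence
    np.random.seed(seed)
    # randint draws integers from [low, high): high itself is never returned
    return np.random.randint(low=25, high=1000, size=size)

first = draw(111)
second = draw(111)
print(first, second)

# Because high is excluded, randint(0, len(pool)) is always a valid index
status = [1, 2, 3]
picks = [status[np.random.randint(low=0, high=len(status))] for _ in range(100)]
```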
Reading data from the file
Location = 'E:\\code\\python\\pandas\\Lesson3.xlsx'
# parse a specific sheet
df = pd.read_excel(Location, sheet_name=0, index_col='StatusDate')
df.dtypes
df.index
df.head()
Out[4]:
           state  status  CustomerCount
StatusDate
2009-01-05    GA       1            877
2009-01-12    FL       1            901
2009-02-02    GA       1            300
2009-03-09    NY       1            992
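Data preprocessing
df['state'].unique()  # list the distinct state values
df['state'] = df.state.apply(lambda x: x.upper())
mask = df['status'] == 1
df = df[mask].copy()  # copy so the assignment below does not act on a view
df.loc[df['state'] == 'NJ', 'state'] = 'NY'  # .loc avoids chained assignment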
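The preprocessing steps can be tried on a tiny table without the Excel file. In this sketch the six rows are invented for illustration; the steps are the same ones as above: normalize the case, keep status 1, and fold NJ into NY.

```python
import pandas as pd

# Hypothetical sample rows, not taken from Lesson3.xlsx
df = pd.DataFrame({
    'state': ['GA', 'fl', 'FL', 'NJ', 'NY', 'TX'],
    'status': [1, 1, 2, 1, 1, 3],
    'CustomerCount': [877, 901, 300, 992, 450, 120],
})

# Normalize case so 'fl' and 'FL' count as the same state
df['state'] = df['state'].str.upper()

# Keep only rows with status 1; .copy() gives an independent DataFrame
df = df[df['status'] == 1].copy()

# Fold NJ into NY with one .loc assignment (no chained indexing)
df.loc[df['state'] == 'NJ', 'state'] = 'NY'

print(sorted(df['state'].unique()))
```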
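Some basic operations on the data:
df['CustomerCount'].plot(figsize=(15, 5))  # plot the customer counts
sortdf = df[df['state'] == 'NY'].sort_index(axis=0)  # sort the rows by the date index
sortdf.head(10)
# Daily is never defined in the post; the next line is an assumed definition
# (values summed per state and date) so that the two plots can run
Daily = df.reset_index().groupby(['state', 'StatusDate']).sum()
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()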
Plotting the data
The data can also be broken down by month.
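The month-level breakdown mentioned above can be sketched with groupby. The data here is invented, and the definitions of Daily and Monthly are assumptions for illustration, since the post does not show how they were built.

```python
import pandas as pd

# Hypothetical weekly data for two states, eight Mondays from 2009-01-05
dates = pd.date_range(start='2009-01-05', periods=8, freq='W-MON')
df = pd.DataFrame({
    'state': ['FL', 'GA'] * 4,
    'CustomerCount': [10, 20, 30, 40, 50, 60, 70, 80],
}, index=dates)
df.index.name = 'StatusDate'

# Total customers per state and date
Daily = df.reset_index().groupby(['state', 'StatusDate'])['CustomerCount'].sum()

# Total customers per state and calendar month
Monthly = df.groupby(['state', df.index.year, df.index.month])['CustomerCount'].sum()
Monthly.index.names = ['state', 'year', 'month']

print(Daily.loc['FL'])
print(Monthly)
```

Calling .plot() on Daily.loc['FL'] or Monthly.loc['FL'] would then draw one line per selection.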
Adding and deleting rows and columns
In [5]: df
Out[5]:
   Rev
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
In [6]: df['NelCol'] = 5
In [7]: df
Out[7]:
   Rev  NelCol
0    0       5
1    1       5
2    2       5
3    3       5
4    4       5
5    5       5
6    6       5
7    7       5
8    8       5
9    9       5
In [9]: df['NelCol'] = df['NelCol'] + 1
In [10]: df
Out[10]:
   Rev  NelCol
0    0       6
1    1       6
2    2       6
3    3       6
4    4       6
5    5       6
6    6       6
7    7       6
8    8       6
9    9       6
In [11]: del df['NelCol']
In [12]: df
Out[12]:
   Rev
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
In [13]: df['test'] = 3
In [14]: df
Out[14]:
   Rev  test
0    0     3
1    1     3
2    2     3
3    3     3
4    4     3
5    5     3
6    6     3
7    7     3
8    8     3
9    9     3
In [15]: df['col'] = df['Rev']
In [16]: df
Out[16]:
   Rev  test  col
0    0     3    0
1    1     3    1
2    2     3    2
3    3     3    3
4    4     3    4
5    5     3    5
6    6     3    6
7    7     3    7
8    8     3    8
9    9     3    9
In [17]: i = ['a','b','c','d','e','f','g','h','i','j']
In [18]: df.index = i
In [19]: df
Out[19]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4
f    5     3    5
g    6     3    6
h    7     3    7
i    8     3    8
j    9     3    9
In [21]: df.loc['a']
Out[21]:
Rev     0
test    3
col     0
Name: a, dtype: int64
In [22]: df.loc['a':'d']
Out[22]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
In [23]: df['Rev']
Out[23]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: Rev, dtype: int32
In [24]: df[0:3,'Rev']  # raises an error: df[...] does not accept a (row slice, column) pair
In [25]: df['Rev'].iloc[0:3]  # originally df.ix[0:3,'Rev']; .ix was removed in pandas 1.0
Out[25]:
a    0
b    1
c    2
Name: Rev, dtype: int32
In [26]: df['col'].iloc[5:]  # originally df.ix[5:,'col']
Out[26]:
f    5
g    6
h    7
i    8
j    9
Name: col, dtype: int32
In [27]: df.head()
Out[27]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4
In [28]: df.head(2)
Out[28]:
   Rev  test  col
a    0     3    0
b    1     3    1
In [29]: df.tail(2)
Out[29]:
   Rev  test  col
i    8     3    8
j    9     3    9
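Since .ix has been removed from pandas (as of version 1.0), selection is now split between .loc for labels and .iloc for positions. This short sketch shows the difference in slice endpoints, and it also deletes a row with drop, which the session above does not cover; the name NewCol is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Rev': range(10)}, index=list('abcdefghij'))

# Label-based slicing: both endpoints are included
by_label = df.loc['a':'c', 'Rev']

# Position-based slicing: the end position is excluded, like a Python slice
by_position = df['Rev'].iloc[0:3]

# Add a column, then delete it again
df['NewCol'] = 5
del df['NewCol']

# drop() removes a row by label and returns a new DataFrame
without_a = df.drop('a')

print(by_label.tolist(), by_position.tolist(), len(without_a))
```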
The groupby statement in pandas
First, create the DataFrame:
In [11]: df
Out[11]:
  letter  one  two
0      a    1    2
1      a    1    2
2      b    1    2
3      b    1    2
4      c    1    2
Let's look at the effect of grouping:
In [14]: one = df.groupby('letter')
In [15]: one
Out[15]: <pandas.core.groupby.DataFrameGroupBy object at 0x00000000094BDFD0>
In [16]: one.sum()
Out[16]:
        one  two
letter
a         2    4
b         2    4
c         1    2
Grouping by two columns
In [18]: letterone = df.groupby(['letter','one']).sum()
In [19]: letterone
Out[19]:
            two
letter one
a      1      4
b      1      4
c      1      2
If we do not want letter to become the index:
In [21]: letterone = df.groupby(['letter','one'], as_index=False).sum()
In [22]: letterone
Out[22]:
  letter  one  two
0      a    1    4
1      b    1    4
2      c    1    2
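The same flat result can also be reached by grouping normally and then calling reset_index(). This sketch rebuilds the small letter table and produces both forms side by side; reset_index is an alternative that the original post does not use.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'c'],
                   'one': [1, 1, 1, 1, 1],
                   'two': [2, 2, 2, 2, 2]})

# Group keys become a two-level index
indexed = df.groupby(['letter', 'one']).sum()

# Group keys stay as ordinary columns
flat = df.groupby(['letter', 'one'], as_index=False).sum()

# reset_index() moves the index levels back into columns
flattened = indexed.reset_index()

print(flat)
print(flattened)
```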
Calculating outliers with pandas
1: Create the data
In [14]: df
Out[14]:
Revenue State
2012-01-01 1.0 NY
2012-02-01 2.0 NY
2012-03-01 3.0 NY
2012-04-01 4.0 NY
2012-05-01 5.0 FL
2012-06-01 6.0 FL
2012-07-01 7.0 GA
2012-08-01 8.0 GA
2012-09-01 9.0 FL
2012-10-01 10.0 FL
2013-01-01 10.0 NY
2013-02-01 10.0 NY
2013-03-01 9.0 NY
2013-04-01 9.0 NY
2013-05-01 8.0 FL
2013-06-01 8.0 FL
2013-07-01 7.0 GA
2013-08-01 7.0 GA
2013-09-01 6.0 FL
2013-10-01 6.0 FL
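Compute the deviation from the mean, the standard deviation, and flag the outliers
In [15]: newdf = df.copy()
In [16]: newdf['x_Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
In [17]: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
In [19]: newdf['Outlier'] = newdf['x_Mean'] > newdf['1.96*std']
newdf.head()
Out[31]:
            Revenue State  Outlier  x_Mean  1.96*std
2012-01-01      1.0    NY    False    5.00  7.554813
2012-02-01      2.0    NY    False    4.00  7.554813
2012-03-01      3.0    NY    False    3.00  7.554813
2012-04-01      4.0    NY    False    2.00  7.554813
2012-05-01      5.0    FL    False    2.25  3.434996
Note that the values in Out[31] differ between NY and FL, so this listing was produced by the per-state version that follows, not by the four statements above, which use the mean and standard deviation of the whole column.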
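The rule used here flags a value when it lies more than 1.96 standard deviations from the mean, which for normally distributed data covers about 95% of the values. The sketch below applies it to the whole Revenue column from the table above, without grouping; the State column and the dates are left out because they are not needed for this step.

```python
import pandas as pd

revenue = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6]
df = pd.DataFrame({'Revenue': [float(v) for v in revenue]})

mean = df['Revenue'].mean()
# Series.std() is the sample standard deviation (ddof=1)
threshold = 1.96 * df['Revenue'].std()

df['x_Mean'] = (df['Revenue'] - mean).abs()
df['Outlier'] = df['x_Mean'] > threshold

print(mean, threshold)
print(df[df['Outlier']])
```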
Group by state
In [22]: newdf = df.copy()
In [23]: State = newdf.groupby('State')
Using lambda functions:
In [27]: newdf['Outlier'] = State['Revenue'].transform(lambda x: abs(x - x.mean()) > 1.96*x.std())
In [28]: newdf['x_Mean'] = State['Revenue'].transform(lambda x: abs(x - x.mean()))
In [29]: newdf['1.96*std'] = State['Revenue'].transform(lambda x: x.std())  # overwritten by In [30]
In [30]: newdf['1.96*std'] = State['Revenue'].transform(lambda x: x.std()*1.96)
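The per-state computation can be reproduced from the table above. This sketch rebuilds the data and uses transform with the built-in 'mean' and 'std' aggregations instead of lambdas, which is an equivalent formulation and not the post's exact code.

```python
import pandas as pd

# Months 1-10 of 2012 and of 2013, as in the table above
dates = pd.date_range('2012-01-01', periods=10, freq='MS').append(
    pd.date_range('2013-01-01', periods=10, freq='MS'))
states = (['NY'] * 4 + ['FL'] * 2 + ['GA'] * 2 + ['FL'] * 2) * 2
revenue = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6]
df = pd.DataFrame({'Revenue': [float(v) for v in revenue], 'State': states},
                  index=dates)

by_state = df.groupby('State')['Revenue']

# transform returns one value per original row, aligned to df's index
df['x_Mean'] = (df['Revenue'] - by_state.transform('mean')).abs()
df['1.96*std'] = 1.96 * by_state.transform('std')
df['Outlier'] = df['x_Mean'] > df['1.96*std']

print(df.head())
```

Measured against its own state, no row is flagged, whereas the ungrouped rule flags the 1.0 revenue in January 2012.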
The same thing can be done with a single function:
newdf = df.copy()
StateMonth = newdf.groupby(['State', lambda x: x.month])

def s(group):
    group = group.copy()  # work on a copy rather than mutating the group in place
    group['x_Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = group['x_Mean'] > group['1.96*std']
    return group

Newdf2 = StateMonth.apply(s)
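Grouping by state and calendar month can also be written with transform, which keeps the original row order and index. In this sketch the data is rebuilt from the table above, and the grouping keys are passed as arrays (df['State'] and df.index.month) instead of a lambda; that is an assumption about an equivalent form, not the post's code.

```python
import pandas as pd

dates = pd.date_range('2012-01-01', periods=10, freq='MS').append(
    pd.date_range('2013-01-01', periods=10, freq='MS'))
states = (['NY'] * 4 + ['FL'] * 2 + ['GA'] * 2 + ['FL'] * 2) * 2
revenue = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6]
df = pd.DataFrame({'Revenue': [float(v) for v in revenue], 'State': states},
                  index=dates)

# One group per (state, month); each holds the 2012 and the 2013 value
grouped = df.groupby([df['State'], df.index.month])['Revenue']

df['x_Mean'] = (df['Revenue'] - grouped.transform('mean')).abs()
df['1.96*std'] = 1.96 * grouped.transform('std')
df['Outlier'] = df['x_Mean'] > df['1.96*std']

print(df.head())
```

With only two observations per group, each value lies half the gap d from the group mean while 1.96 times the sample standard deviation is about 1.39 d, so this grouping cannot flag anything on this data set.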