Pandas: NaN, Pivot, dropna, reset_index
2016-04-21 17:03
The data in this article describes the passengers on the Titanic. Its columns include:
pclass – the cabin class of the passenger. 1 was the best cabin class, followed by 2, then 3.
name – the name of the passenger.
sex – the gender of the passenger.
age – the age of the passenger.
boat – the lifeboat the passenger got into.
body – the body number of the passenger.
Finding The Missing Data
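Python has a None value that means "no value". pandas uses NaN ("not a number") to mark missing values.
The code below counts the missing values in the age column. The isnull function checks each element of a Series or DataFrame and returns True where the element is NaN.

    import pandas as pd

    titanic_survival = pd.read_csv("titanic_survival.csv")
    # Boolean Series: True where age is missing
    age_null = pd.isnull(titanic_survival["age"])
    # Keep only the True entries and count them
    age_null_true = age_null[age_null == True]
    age_null_count = len(age_null_true)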
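A shorter way to get the same count is to sum the boolean mask, since each True counts as 1. This is only a sketch: it uses a tiny made-up DataFrame named toy instead of the real titanic_survival.csv so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({"age": [29.0, np.nan, 2.0, np.nan, 30.0]})

# isnull() returns a boolean Series; summing it counts the True entries
age_null_count = int(toy["age"].isnull().sum())
print(age_null_count)  # 2
```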
What's The Big Deal With Missing Data?
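Now that we know the age column contains many missing values, the next step is to handle them, here by filling them with the mean of the non-missing values. Note that this mean cannot be computed from the full column: any arithmetic that involves NaN yields NaN, so the result is NaN rather than a usable number.

    import pandas as pd

    # The sum includes the NaN entries, so mean_age ends up as nan
    mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])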
The fix is to filter out the missing values first:

    import pandas as pd

    age_null = pd.isnull(titanic_survival["age"])
    # Keep only the rows where age is present
    good_ages = titanic_survival["age"][age_null == False]
    correct_mean_age = sum(good_ages) / len(good_ages)
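The difference between the two calculations is easy to reproduce on a three-element Series. The values below are invented, and the variable names are only for this sketch.

```python
import math

import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])  # invented values, one of them missing

# Summing over the full Series: the NaN propagates through the addition
naive_mean = sum(ages) / len(ages)

# Filtering first: only the two known ages take part in the calculation
good_ages = ages[ages.notnull()]
correct_mean = sum(good_ages) / len(good_ages)

print(math.isnan(naive_mean))  # True
print(correct_mean)            # 30.0
```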
Easier Ways To Do Math
pandas already has this built in: the .mean() method skips missing values when it computes the average.

    import pandas as pd

    correct_mean_age = titanic_survival["age"].mean()
    correct_mean_fare = titanic_survival["fare"].mean()
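The skipping is controlled by the skipna argument of .mean(), which defaults to True. The sketch below, again on invented values, shows both settings and then uses the mean to fill the gap with fillna, which is one possible way to carry out the filling step mentioned earlier.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [20.0, np.nan, 40.0]})  # invented values

mean_age = toy["age"].mean()                    # NaN is skipped by default
strict_mean = toy["age"].mean(skipna=False)     # NaN is kept, so the result is NaN

# Replace the missing age with the mean of the known ages
filled = toy["age"].fillna(mean_age)

print(mean_age)          # 30.0
print(filled.tolist())   # [20.0, 30.0, 40.0]
```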
Computing Summary Statistics
Passengers are split into classes 1, 2 and 3 by the pclass column. To compute the average fare for each class, the expression titanic_survival["fare"][titanic_survival["pclass"] == pclass] selects the fare values of the passengers whose class equals pclass.

    passenger_classes = [1, 2, 3]
    fares_by_class = {}
    for pclass in passenger_classes:
        # Fares of the passengers in this class
        pclass_fares = titanic_survival["fare"][titanic_survival["pclass"] == pclass]
        fare_for_class = pclass_fares.mean()
        fares_by_class[pclass] = fare_for_class
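The same per-class averages can also be obtained without an explicit loop by using groupby. This is an alternative to the loop above rather than what the article itself does, and the toy data is invented.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 200.0, 30.0, 8.0, 12.0],
})

# Group the fare column by class and average each group
mean_fares = toy.groupby("pclass")["fare"].mean()

# Convert to a plain dict with the same shape as fares_by_class
fares_by_class = {int(k): float(v) for k, v in mean_fares.items()}
print(fares_by_class)  # {1: 150.0, 2: 30.0, 3: 10.0}
```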
Making Pivot Tables
The pivot_table function computes the survival rate for each passenger class. It is an aggregation function: rows are grouped by the index column, the values column is what gets aggregated, and aggfunc is the aggregation to apply (the mean here).

    import pandas as pd
    import numpy as np

    passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc=np.mean)
    # First class passengers had a much higher survival chance
    print(passenger_survival)
    '''
    pclass
    1    0.619195
    2    0.429603
    3    0.255289
    Name: survived, dtype: float64
    '''
    passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
More Complex Pivot Tables
    import numpy as np

    # This will compute the mean survival chance and the mean age for each passenger class
    passenger_survival = titanic_survival.pivot_table(index="pclass", values=["age", "survived"], aggfunc=np.mean)
    print(passenger_survival)
    '''
                  age  survived
    pclass
    1       39.159918  0.619195
    2       29.506705  0.429603
    3       24.816367  0.255289
    '''
    port_stats = titanic_survival.pivot_table(index="embarked", values=["age", "survived", "fare"], aggfunc=np.mean)
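To check what pivot_table does with several value columns, it helps to run it on data small enough to verify by hand. The toy DataFrame below is invented, and aggfunc is passed as the string "mean", which pandas also accepts.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0],
    "age":      [40.0, 30.0, 20.0, 30.0, 25.0, 15.0],
})

# One row per class, one column per aggregated value
stats = toy.pivot_table(index="pclass", values=["age", "survived"], aggfunc="mean")
print(stats)

# Because survived holds 0/1, its mean is the share of survivors in the class
print(stats.loc[2, "survived"])  # 0.5
print(stats.loc[1, "age"])       # 35.0
```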
Drop Missing Values
    import pandas as pd

    # Drop all rows that have missing values
    new_titanic_survival = titanic_survival.dropna()
    # It looks like we have an empty dataframe now.
    # This is because every row has at least one missing value.
    print(new_titanic_survival)
    '''
    Empty DataFrame
    Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
    Index: []
    '''

    # We can also use the axis argument to drop columns that have missing values
    new_titanic_survival = titanic_survival.dropna(axis=1)
    print(new_titanic_survival)
    '''
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
            20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
            40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
            60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
            80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
    '''

    # We can use the subset argument to only drop rows if certain columns have missing values.
    # This drops all rows where "age" or "sex" is missing.
    new_titanic_survival = titanic_survival.dropna(subset=["age", "sex"])
    new_titanic_survival = titanic_survival.dropna(subset=["age", "body", "home.dest"])
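The three dropna variants are easier to compare on a DataFrame where not everything disappears. The sketch below uses three invented rows; only the column names age and body are borrowed from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age":  [29.0, np.nan, 2.0],
    "body": [np.nan, np.nan, 135.0],
})

complete_rows = toy.dropna()                 # rows with no missing value at all
complete_cols = toy.dropna(axis=1)           # columns with no missing value at all
known_age = toy.dropna(subset=["age"])       # rows where age is present

print(list(complete_rows.index))    # [2]
print(list(complete_cols.columns))  # ['name']
print(list(known_age.index))        # [0, 2]
```

Note that known_age keeps the original labels 0 and 2, so its index now has a gap.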
Reindex Rows
After rows with missing values are dropped, the remaining rows keep their original index labels, so the index has gaps. reset_index renumbers the rows starting from 0. Its drop argument controls what happens to the old index: drop=True discards it, while drop=False keeps it as a new column. If you do not call reset_index at all, the original index labels are preserved.

    # The indexes are the original numbers from titanic_survival
    new_titanic_survival = titanic_survival.dropna(subset=["body"])
    print(new_titanic_survival)

    # Reset the index to an integer sequence, starting at 0.
    # The drop keyword argument specifies whether or not to make a dataframe column with the index values.
    # If True, it won't, if False, it will.
    # We'll almost always want to set it to True.
    new_titanic_survival = new_titanic_survival.reset_index(drop=True)
    # Now we have indexes starting from 0!
    print(new_titanic_survival)

    new_titanic_survival = titanic_survival.dropna(subset=["age", "boat"])
    titanic_reindexed = new_titanic_survival.reset_index(drop=True)
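The effect of the drop argument can be seen side by side on a small example. The DataFrame below is invented; only the column name body comes from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({"body": [np.nan, 135.0, np.nan, 22.0]})

kept = toy.dropna(subset=["body"])          # original labels survive
renumbered = kept.reset_index(drop=True)    # old labels discarded
with_old = kept.reset_index(drop=False)     # old labels kept in an "index" column

print(list(kept.index))            # [1, 3]
print(list(renumbered.index))      # [0, 1]
print(list(renumbered.columns))    # ['body']
print(list(with_old.columns))      # ['index', 'body']
print(with_old["index"].tolist())  # [1, 3]
```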
apply
    import pandas as pd

    def not_null_count(column):
        column_null = pd.isnull(column)
        not_null = column[column_null == False]
        return len(not_null)  # number of non-missing elements

    # With the default axis=0, apply calls the function once per column
    column_not_null_count = titanic_survival.apply(not_null_count)
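Counting non-missing values per column is common enough that pandas has a built-in for it, DataFrame.count(). The sketch below checks on invented data that a column-wise apply and count() agree; the shortened not_null_count here is a variant written for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({
    "age":  [29.0, np.nan, 2.0],
    "fare": [211.3, 151.5, np.nan],
    "name": ["A", "B", "C"],
})

def not_null_count(column):
    # Keep the non-missing entries of the column and count them
    return len(column[pd.notnull(column)])

by_apply = toy.apply(not_null_count)  # one call per column (axis=0)
by_count = toy.count()                # built-in equivalent

print({k: int(v) for k, v in by_apply.items()})  # {'age': 2, 'fare': 2, 'name': 3}
print(by_apply.tolist() == by_count.tolist())    # True
```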
    # Check whether each passenger is a minor (under 18)
    def is_minor(row):
        return row["age"] < 18

    minors = titanic_survival.apply(is_minor, axis=1)  # axis=1 applies the function row by row

    import pandas as pd

    # Label each passenger by age
    def generate_age_label(row):
        age = row["age"]
        if pd.isnull(age):
            return "unknown"
        elif age < 18:
            return "minor"
        else:
            return "adult"

    age_labels = titanic_survival.apply(generate_age_label, axis=1)
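Row-wise apply is flexible but calls a Python function for every row. For simple rules like these, the same results can be sketched with boolean masks on the whole column. This is an alternative for illustration, not the article's method, and the ages are invented. It also shows why the "unknown" check matters: comparing NaN with 18 gives False, so a passenger with a missing age is not flagged as a minor.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({"age": [29.0, np.nan, 2.0, 17.0]})

# Vectorized version of is_minor; NaN < 18 evaluates to False
minors = toy["age"] < 18

# Vectorized version of generate_age_label: start with "adult", then overwrite
age_labels = pd.Series("adult", index=toy.index)
age_labels[toy["age"] < 18] = "minor"
age_labels[toy["age"].isnull()] = "unknown"

print(minors.tolist())      # [False, False, True, True]
print(age_labels.tolist())  # ['adult', 'unknown', 'minor', 'minor']
```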
Computing Survival Percentage By Age Group
The most sensible approach is to use the pivot_table function introduced earlier:

    import numpy as np

    age_group_survival = titanic_survival.pivot_table(index=age_labels, values=["survived"], aggfunc=np.mean)
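A variant that may be easier to read stores the labels as a column first and then pivots on the column name. The sketch below does this on invented data; storing the labels in a column named age_labels is an assumption of this example, not something the code above does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration
toy = pd.DataFrame({
    "age":      [29.0, np.nan, 2.0, 17.0, 40.0],
    "survived": [1, 0, 1, 1, 0],
})

def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Store the labels as a regular column, then group by that column's name
toy["age_labels"] = toy.apply(generate_age_label, axis=1)
age_group_survival = toy.pivot_table(index="age_labels", values=["survived"], aggfunc="mean")

print(age_group_survival.loc["adult", "survived"])    # 0.5
print(age_group_survival.loc["minor", "survived"])    # 1.0
print(age_group_survival.loc["unknown", "survived"])  # 0.0
```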