[Learning Pandas with Stack Overflow] Select rows from a DataFrame based on values in a column: filtering rows in pandas
2017-08-05 15:24
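I have recently been writing a blog series called "Learning Pandas with Stack Overflow".
Series index: http://blog.csdn.net/column/details/16726.html
The topics come from searching Stack Overflow for the pandas tag and sorting the questions by votes: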
https://stackoverflow.com/questions/tagged/pandas?sort=votes&pageSize=15
Select rows from a DataFrame based on values in a column (filtering rows in pandas)
https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
Filtering in pandas resembles the filter feature in Excel, but is more powerful.
In SQL, we would write a statement like this:
    select * from table where column_name = some_value
Boolean indexing
In a pandas DataFrame, a sequence of boolean values can be used as an index to select rows. For example:

    import pandas as pd

    # Create the data set
    d = {'bar': [333, 444, 555], 'foo': [100, 111, 222]}
    df = pd.DataFrame(d)

    # Full DataFrame
    df
    #    bar  foo
    # 0  333  100
    # 1  444  111
    # 2  555  222

    # Boolean indexing: keep the rows whose flag is True
    df[[True, False, True]]
    # or
    df.loc[[True, False, True]]
    # both return
    #    bar  foo
    # 0  333  100
    # 2  555  222

So to filter rows by value, we build a boolean mask from a column and use it to select rows of the DataFrame:
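
    df[df['column_name'] == some_value]

For numeric columns, the comparison operators > and < work the same way.

    df[df['column_name'].isin(some_values)]

some_values must be list-like, for example a list, a set, or a Series. A bare scalar such as a single string raises a TypeError, so wrap a single value in a list or use == instead.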
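The two filters above can be sketched on concrete data. The frame and its column names (`city`, `sales`) are made up for illustration and do not come from the original question:

```python
import pandas as pd

# Hypothetical data, chosen only to make the masks easy to read.
df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "sales": [10, 25, 40, 5],
})

# A comparison yields a boolean Series; that Series is the row mask.
mask = df["city"] == "Beijing"
beijing = df[mask]

# Numeric columns also support ordering comparisons.
big = df[df["sales"] > 20]

# isin takes a list-like; a single value has to be wrapped in a list.
coastal = df[df["city"].isin(["Shanghai", "Shenzhen"])]

print(mask.tolist())             # [True, False, True, False]
print(beijing.index.tolist())    # [0, 2]
print(big["sales"].tolist())     # [25, 40]
print(coastal.index.tolist())    # [1, 3]
```

Note that the selected rows keep their original index labels, which is why the results print 0, 2 and 1, 3 rather than a fresh 0, 1.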
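Combining multiple conditions

    df[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
    df[(df['column_name'] == some_value) | df['other_column'].isin(some_values)]

Note that & and | bind more tightly than comparison operators such as ==, so every comparison must be wrapped in its own parentheses.
For "not equal", negate a mask with ~ or compare with !=:

    df[~df['column_name'].isin(some_values)]
    df[df['column_name'] != some_value]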
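Here is a small sketch of combined and negated conditions, again on made-up data (the columns `kind` and `value` are hypothetical). Without the parentheses, Python would evaluate `"a" & df["value"]` first, which is not the intended comparison:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "kind": ["a", "b", "a", "c"],
    "value": [1, 2, 3, 4],
})

# AND: both conditions must hold; each comparison has its own parentheses.
both = df[(df["kind"] == "a") & (df["value"] > 1)]

# OR: at least one condition must hold.
either = df[(df["kind"] == "c") | (df["value"] < 2)]

# NOT: ~ inverts a whole mask, so the combined mask is parenthesized too.
neither = df[~((df["kind"] == "a") | (df["value"] > 3))]

print(both.index.tolist())     # [2]
print(either.index.tolist())   # [0, 3]
print(neither.index.tolist())  # [1]
```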
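np.where
Unlike the methods above, np.where returns the integer positions of the matching rows rather than a boolean mask. The result therefore should not be passed to plain df[...]; use df.iloc instead. df.loc also works, but only when the index labels coincide with the positions, as they do with the default integer index. Using the DataFrame df defined in the timing section at the end of this post:

    import numpy as np

    np.where(df.A.values == 'foo')
    # (array([0, 2, 4, 6, 7]),)
    df.iloc[np.where(df.A.values == 'foo')]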
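The difference between positions and labels only shows up when the index is not the default 0, 1, 2, and so on. A minimal sketch, assuming a made-up frame with index labels 10, 20, 30:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose labels differ from its row positions.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": [1, 2, 3]},
    index=[10, 20, 30],
)

# np.where returns a tuple of arrays; element 0 holds the row positions.
positions = np.where(df["A"].to_numpy() == "foo")[0]

# iloc selects by position, so it accepts the result directly.
by_position = df.iloc[positions]

# For loc, the positions must first be translated into labels.
by_label = df.loc[df.index[positions]]

print(positions.tolist())          # [0, 2]
print(by_position.index.tolist())  # [10, 30]
print(by_label.equals(by_position))  # True
```

Calling `df.loc[positions]` here would look up the labels 0 and 2, which do not exist in this index.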
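query
DataFrame provides a query method that lets us filter rows with an expression string. Reference:
http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

    import numpy as np
    import pandas as pd

    n = 10
    df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))
    # Sample output (the data is random, so yours will differ):
    #    b  c
    # 0  9  0
    # 1  1  2
    # 2  2  4
    # 3  7  6
    # 4  6  4
    # 5  4  7
    # 6  2  9
    # 7  4  8
    # 8  6  2
    # 9  9  0

    df.query('index > b > c')
    #    b  c
    # 8  6  2

    # Many kinds of expression are supported. The examples below assume
    # a DataFrame that has columns a, b, c or a column named color:
    df.query('(a < b) & (b < c)')
    df.query('a < b and b < c')
    df.query('color == "red"')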
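A query expression can also refer to a local Python variable by prefixing its name with `@`. A short sketch on fixed data (the values and the name `threshold` are made up for illustration), compared against the equivalent boolean mask:

```python
import pandas as pd

# Fixed hypothetical data so the results are reproducible.
df = pd.DataFrame({"b": [9, 1, 6, 4], "c": [0, 2, 2, 8]})

threshold = 5

# @threshold pulls the local variable into the expression.
above = df.query("b > @threshold")

# The same filter written as a boolean mask.
same = df[df["b"] > threshold]

# Chained comparisons work as they do in plain Python.
chained = df.query("c < b < 9")

print(above.index.tolist())    # [0, 2]
print(above.equals(same))      # True
print(chained.index.tolist())  # [2]
```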
Timing comparison

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                       'B': 'one one two three two two one three'.split()})

    df.iloc[np.where(df.A.values=='foo')]

    # %timeit is an IPython/Jupyter magic command
    %timeit df.iloc[np.where(df.A.values=='foo')]
    # 1000 loops, best of 3: 274 µs per loop
    %timeit df.loc[np.where(df.A.values=='foo')]
    # 1000 loops, best of 3: 342 µs per loop
    %timeit df.loc[df['A'] == 'foo']
    # 1000 loops, best of 3: 347 µs per loop
    %timeit df[df['A'] == 'foo']
    # 1000 loops, best of 3: 354 µs per loop
    %timeit df.loc[df['A'].isin(['foo'])]
    # 1000 loops, best of 3: 265 µs per loop
    %timeit df[df.A=='foo']
    # 1000 loops, best of 3: 357 µs per loop
    %timeit df.query('(A=="foo")')
    # 1000 loops, best of 3: 943 µs per loop
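In this run, df.iloc[np.where(df.A.values=='foo')] (274 µs) and df.loc[df['A'].isin(['foo'])] (265 µs) were the fastest, while query was clearly the slowest (943 µs). df.loc[df['A'] == 'foo'] (347 µs) was only marginally faster than df[df['A'] == 'foo'] (354 µs), a gap small enough to be measurement noise. The frame has only 8 rows, so these figures mostly reflect per-call overhead and may not carry over to large frames.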
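To check how the methods compare on your own data outside IPython, the standard-library `timeit` module can stand in for `%timeit`. This is only a sketch: the frame size, the random seed, and the repeat count are arbitrary assumptions, and no particular ranking is implied.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # arbitrary size, much larger than the 8-row frame above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice(["foo", "bar"], size=n),
    "B": np.arange(n),
})

# Each candidate selects the rows where A equals 'foo'.
candidates = {
    "boolean mask": lambda: df[df["A"] == "foo"],
    "loc + mask": lambda: df.loc[df["A"] == "foo"],
    "isin": lambda: df[df["A"].isin(["foo"])],
    "iloc + np.where": lambda: df.iloc[np.where(df["A"].to_numpy() == "foo")[0]],
    "query": lambda: df.query('A == "foo"'),
}

# Confirm that every method returns the same rows before timing them.
results = {name: fn() for name, fn in candidates.items()}
reference = results["boolean mask"]
all_equal = all(r.equals(reference) for r in results.values())
print("all methods agree:", all_equal)

# Average over a few calls; the numbers depend on machine and versions.
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{name:>16}: {seconds * 1e3:.2f} ms per call")
```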