
Learning Data Science from Scratch: Working with Data, "Manipulating Data" (Part 7)

2015-12-03 11:20

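Manipulating Data

One of the most important skills of a data scientist is manipulating data. It is more of a general approach than a specific technique, so we will work through a handful of examples to give you a feel for it.

Suppose we are working with stock price data: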

import datetime

data = [
    {'closing_price': 102.06,
     'date': datetime.datetime(2014, 8, 29, 0, 0),
     'symbol': 'AAPL'},
    # ...
]

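Conceptually, we can think of these as rows (as in a spreadsheet).

Let's start asking questions about this data. Along the way we will watch for patterns in what we are doing and abstract out some general-purpose tools.

For example, suppose we want to know the highest closing price for AAPL. Let's break this down into concrete steps:

Restrict ourselves to the AAPL rows;

Grab the closing_price from each of those rows;

Take the max of those prices.

All three steps can be done at once with a comprehension (strictly speaking, a generator expression passed to max):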

max_aapl_price = max(row["closing_price"]
                     for row in data
                     if row["symbol"] == "AAPL")
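
One thing to watch for: max raises a ValueError when no row matches the filter. The sketch below wraps the same filter-then-max pattern in a small helper that returns None in that case. The helper name max_closing_price and the sample rows are made up for illustration, and the default argument of max assumes Python 3.4 or later.

```python
import datetime

# Hypothetical sample rows; the prices are invented for illustration.
sample = [
    {"symbol": "AAPL", "date": datetime.datetime(2014, 8, 27), "closing_price": 102.13},
    {"symbol": "AAPL", "date": datetime.datetime(2014, 8, 28), "closing_price": 102.25},
    {"symbol": "MSFT", "date": datetime.datetime(2014, 8, 28), "closing_price": 44.88},
]

def max_closing_price(rows, symbol):
    """Highest closing price for `symbol`, or None if no row matches."""
    return max((row["closing_price"] for row in rows if row["symbol"] == symbol),
               default=None)

print(max_closing_price(sample, "AAPL"))  # 102.25
print(max_closing_price(sample, "GOOG"))  # None
```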

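More generally, we might want to know the highest closing price for each stock in the data set:

Group together all the rows with the same symbol;

Within each group, do the same as before.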

from collections import defaultdict

# group rows by symbol
by_symbol = defaultdict(list)
for row in data:
    by_symbol[row["symbol"]].append(row)

# use a dict comprehension to find the max for each symbol
# (.items() works in both Python 2 and 3; .iteritems() is Python 2 only)
max_price_by_symbol = {symbol: max(row["closing_price"]
                                   for row in grouped_rows)
                       for symbol, grouped_rows in by_symbol.items()}

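These two examples share some patterns, which suggest functions we will reach for again and again when manipulating data. In both we needed to pull the closing_price out of every dict, so let's create one function that picks a given field out of a dict and another that plucks the same field out of a collection of dicts: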

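def picker(field_name):
    """returns a function that picks a field out of a dict"""
    return lambda row: row[field_name]

def pluck(field_name, rows):
    """turn a list of dicts into the list of field_name values"""
    # a list comprehension returns a real list in both Python 2 and 3
    # (in Python 3, map would return a lazy iterator instead)
    return [picker(field_name)(row) for row in rows]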
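
As an aside, the standard library's operator.itemgetter builds the same kind of "pick one field" function as picker. The sketch below is only an alternative spelling, not the approach used in the rest of this article; the name pluck_with_itemgetter and the sample rows are invented for illustration.

```python
from operator import itemgetter

# Hypothetical rows for illustration only.
rows = [
    {"symbol": "AAPL", "closing_price": 10.0},
    {"symbol": "MSFT", "closing_price": 20.0},
]

def pluck_with_itemgetter(field_name, rows):
    """Same result as pluck, using operator.itemgetter instead of a lambda."""
    get_field = itemgetter(field_name)  # get_field(row) == row[field_name]
    return [get_field(row) for row in rows]

print(pluck_with_itemgetter("closing_price", rows))  # [10.0, 20.0]
print(pluck_with_itemgetter("symbol", rows))         # ['AAPL', 'MSFT']
```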

We can also create a function that groups rows by the result of a grouper function and, optionally, applies a value_transform to each group:

def group_by(grouper, rows, value_transform=None):
    # key is output of grouper, value is list of rows
    grouped = defaultdict(list)
    for row in rows:
        grouped[grouper(row)].append(row)

    if value_transform is None:
        return grouped
    else:
        # .items() works in both Python 2 and 3
        return {key: value_transform(rows)
                for key, rows in grouped.items()}

This allows us to rewrite our previous examples quickly and simply:

max_price_by_symbol = group_by(picker("symbol"),
                               data,
                               lambda rows: max(pluck("closing_price", rows)))
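
Because value_transform is just a function of a group's rows, the same group_by can answer other per-group questions. The sketch below repeats the definition so that it runs on its own, then counts rows per symbol by passing len. The sample rows are made up for illustration.

```python
from collections import defaultdict

def group_by(grouper, rows, value_transform=None):
    """Group rows by grouper(row); optionally transform each group's row list."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[grouper(row)].append(row)
    if value_transform is None:
        return grouped
    return {key: value_transform(group) for key, group in grouped.items()}

# Hypothetical rows for illustration only.
rows = [
    {"symbol": "AAPL", "closing_price": 10.0},
    {"symbol": "AAPL", "closing_price": 12.0},
    {"symbol": "MSFT", "closing_price": 20.0},
]

def by_symbol(row):
    return row["symbol"]

# Number of rows per symbol.
print(group_by(by_symbol, rows, len))  # {'AAPL': 2, 'MSFT': 1}

# Highest closing price per symbol.
max_close = group_by(by_symbol, rows,
                     lambda group: max(r["closing_price"] for r in group))
print(max_close)  # {'AAPL': 12.0, 'MSFT': 20.0}
```

With no value_transform, group_by hands back the raw groups, which is handy when several different summaries are needed from the same grouping.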

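Now we can ask more complicated questions, such as: what are the largest and smallest one-day percent changes in our data set? The percent change is price_today / price_yesterday - 1, which means we need a way to associate today's price with yesterday's. One approach is to group the prices by symbol and then, within each group:

Order the prices by date;

Use zip to get (previous, current) pairs;

Turn the pairs into new "percent change" rows.

We will run the following functions within each group: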

def percent_price_change(yesterday, today):
    return today["closing_price"] / yesterday["closing_price"] - 1

def day_over_day_changes(grouped_rows):
    # sort the rows by date
    ordered = sorted(grouped_rows, key=picker("date"))

    # zip with an offset to get pairs of consecutive days
    return [{"symbol": today["symbol"],
             "date": today["date"],
             "change": percent_price_change(yesterday, today)}
            for yesterday, today in zip(ordered, ordered[1:])]
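
The "zip with an offset" trick is easier to see on a plain list. The sketch below uses made-up prices and two hypothetical helper names, consecutive_pairs and percent_changes. Note that n prices produce n - 1 pairs, so a symbol with a single row contributes no changes at all.

```python
def consecutive_pairs(items):
    """Pair each item with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(items, items[1:]))

def percent_changes(prices):
    """Percent change between each consecutive pair of prices."""
    return [today / yesterday - 1
            for yesterday, today in consecutive_pairs(prices)]

prices = [100.0, 110.0, 99.0]  # invented closing prices, already in date order

print(consecutive_pairs(prices))  # [(100.0, 110.0), (110.0, 99.0)]
print(percent_changes(prices))    # roughly [0.10, -0.10]
print(percent_changes([100.0]))   # [] because a single day has no previous day
```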

We can then use day_over_day_changes as the value_transform argument of group_by:

# key is symbol, value is list of "change" dicts
changes_by_symbol = group_by(picker("symbol"), data, day_over_day_changes)

# collect all "change" dicts into one big list
all_changes = [change
               for changes in changes_by_symbol.values()
               for change in changes]
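
The nested comprehension above flattens a dict of lists into one list. For readers who prefer the standard library, itertools.chain.from_iterable does the same job. The sketch below uses a small made-up changes_by_symbol to show that the two spellings agree.

```python
from itertools import chain

# Hypothetical per-symbol results for illustration only.
changes_by_symbol = {
    "AAPL": [{"symbol": "AAPL", "change": 0.01}, {"symbol": "AAPL", "change": -0.02}],
    "MSFT": [{"symbol": "MSFT", "change": 0.03}],
}

# Nested comprehension, as in the article.
flattened_by_comprehension = [change
                              for changes in changes_by_symbol.values()
                              for change in changes]

# Equivalent flattening with itertools.
flattened_by_chain = list(chain.from_iterable(changes_by_symbol.values()))

print(len(flattened_by_chain))                           # 3
print(flattened_by_chain == flattened_by_comprehension)  # True
```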

Now it is easy to find the largest and smallest changes:

max(all_changes, key=picker("change"))
# {'change': 0.3283582089552237,
#  'date': datetime.datetime(1997, 8, 6, 0, 0),
#  'symbol': 'AAPL'}
# see, e.g. http://news.cnet.com/2100-1001-202143.html

min(all_changes, key=picker("change"))
# {'change': -0.5193370165745856,
#  'date': datetime.datetime(2000, 9, 29, 0, 0),
#  'symbol': 'AAPL'}
# see, e.g. http://money.cnn.com/2000/09/29/markets/techwrap/

We can now use the new all_changes data set to find which month has been the best time to invest. First we group the changes by month; then we compute the overall change within each group.

Once again, we write an appropriate value_transform and then use group_by:

from functools import reduce  # a builtin in Python 2; must be imported in Python 3

# to combine percent changes, we add 1 to each, multiply them, and subtract 1
# for instance, if we combine +10% and -20%, the overall change is
# (1 + 10%) * (1 - 20%) - 1 = 1.1 * .8 - 1 = -12%
def combine_pct_changes(pct_change1, pct_change2):
    return (1 + pct_change1) * (1 + pct_change2) - 1

def overall_change(changes):
    return reduce(combine_pct_changes, pluck("change", changes))

overall_change_by_month = group_by(lambda row: row['date'].month,
                                   all_changes,
                                   overall_change)
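
To check the arithmetic in the comment above, the sketch below compounds plain numbers rather than "change" dicts. The name compound_changes is hypothetical. It also passes 0 as the initial value to reduce, an addition that is not in the code above: combining 0% with x gives x, so an empty list yields 0 instead of raising a TypeError.

```python
import math
from functools import reduce

def combine_pct_changes(pct_change1, pct_change2):
    """Compound two percent changes expressed as fractions (0.10 means +10%)."""
    return (1 + pct_change1) * (1 + pct_change2) - 1

def compound_changes(pct_changes):
    """Compound a sequence of percent changes; an empty sequence gives 0."""
    # 0 is the identity for combine_pct_changes, so it is a safe starting value
    return reduce(combine_pct_changes, pct_changes, 0)

total = compound_changes([0.10, -0.20])
print(math.isclose(total, -0.12))  # True: +10% then -20% is about -12%
print(compound_changes([]))        # 0
```

Since the combination is just multiplication of the (1 + change) factors, the order of the changes does not matter beyond floating-point rounding.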
