
Simpler R coding with pipes > the present and future of the magrittr package

2015-02-13 09:22
This is a guest post by Stefan Milton, the author of the magrittr package, which introduces the %>% operator to R programming.

Preface (by Tal Galili)

I was first introduced to the %>% (a.k.a. pipe) operator in R thanks to Hadley Wickham’s (fascinating) dplyr tutorial (link to the workshop’s material) at useR!2014. After several discussions during the conference (including one very influential conversation with RStudio’s Joe Cheng), I became convinced that the pipe operator is one of the most important innovations (if not THE most important) introduced to the R ecosystem this year.
Soon after, I contacted Stefan Milton (the author of the magrittr package), asking him to write about his implementation of the pipe operator. Stefan generously agreed, and what follows is what he had to share with the rest of us.

magrittr: The Difficult Crossing – by Stefan Milton Bache

Background
The basics
Outlook
Ce n’est qu’un au revoir



Background

It has only been 7 months and a bit since my initial magrittr commit to GitHub on January 1st. It has had more success than I had anticipated, and it appears that I was not quite alone with the frustration which caused me to start the magrittr project. I am not easily frustrated with R, but after a few weeks working with F# at work, I felt it upon returning to R: I had gotten used to writing code in a different way, all nicely aligned with thought and order of execution. The forward pipe operator |> was so addictive that being unable to do something similar in R was more than mildly irritating. Reversing thought, deciphering nested function calls, and making excessive use of temporary variables almost became deal breakers! Surprisingly, I had never really noticed this before, but once I did, my return to R became a difficult crossing.

An amazing thing about R is that it is a very flexible language, and the problem could be solved. The |> operator in F# is indeed very simple: it is defined as let (|>) x f = f x. However, the usefulness of this simplicity relies heavily on a concept that is not natively available in R: partial application. Furthermore, functions in F# almost always adhere to certain design principles which make the simple definition sufficient. Suppose that f is a function of two arguments; then in F# you may apply f to only the first argument and obtain a new function as the result, namely a function of the second argument alone. This is partial application. It works with any number of arguments, but application is always from left to right in the argument list. This is why the most important argument (and the one most likely to be a left-hand side object in the pipeline) is almost always the last argument, which in turn makes the simple definition of |> work. To illustrate, consider the following example:

some_value |> some_function other_value

Here, some_function is partially applied to other_value, creating a new function of a single argument, and by the simple definition of |>, this is applied to some_value.
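To see why the simple definition falls short in R, here is a minimal base R sketch of it. The names %|>% and partial_last are hypothetical and exist only for this illustration; the partial application that F# provides for free has to be emulated by hand with a closure.

```r
# Naive forward pipe mirroring the F# definition `let (|>) x f = f x`.
# `%|>%` and `partial_last` are hypothetical names used only in this sketch.
`%|>%` <- function(x, f) f(x)

# Emulate partial application: fix the leading arguments now and
# supply the final argument later.
partial_last <- function(f, ...) {
  fixed <- list(...)
  function(last) do.call(f, c(fixed, list(last)))
}

# A function whose "data" argument comes last, as is conventional in F#.
scale_by <- function(factor, values) factor * values

# Equivalent to scale_by(10, c(1, 2, 3)).
result <- c(1, 2, 3) %|>% partial_last(scale_by, 10)
print(result)
```

Every right-hand side has to be wrapped explicitly, and the wrapper only helps when the piped value belongs in the last position, which R functions do not consistently arrange.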

It was clear to me that because R lacks native partial application and conventions on argument order, no simple solution would be satisfactory, although simple solutions are definitely possible and have been proposed elsewhere. I wanted to make something that would feel natural in R, and which would serve the main purpose of improving the cognitive performance of those writing the code and of those reading it.

It turned out that while I was working on magrittr’s %>% operator, Hadley Wickham and Romain Francois were implementing a similar %.% operator in their dplyr package, which they announced on January 17. However, it was not quite as flexible, and we thought that piping functionality was better placed in its own, more lightweight package. Hadley joined the magrittr project, and in dplyr 0.2 the %.% operator was deprecated; instead, %>% was imported from magrittr.

The basics

Although quite a few blogs have nice introductions to the magrittr package (there is also a vignette), I’ll provide a brief recap here to add some context to the thoughts presented above. Consider the example below (no claim of any scientific relevance, but it was a nice opportunity to try Hadley’s babynames package):

library(babynames) # data package
library(dplyr)     # provides data manipulating functions.
library(magrittr)  # ceci n'est pas un pipe
library(ggplot2)   # for graphics

babynames %>%
  filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  qplot(year, total, color = sex, data = ., geom = "line") %>%
  add(ggtitle('Names starting with "Ste"')) %>%
  print

 



First, note that even without knowing much about magrittr (or even R), reading this chunk of code is pretty easy. It reads like a recipe, and not a single temporary variable is needed. It’s almost like

1. take the baby data, then
2. filter it such that the name sub-string from character 1 to 3 equals "Ste", then
3. group it by year and sex, then
4. summarize it by computing the total sum for each group, then
5. plot the results, coloring by sex, then
6. add a title, then
7. print it to the canvas.

Maybe even easier?! The order in which you’d think of these steps is the same as the order in which they are written, and as the order in which they are executed. The alternative would be to use either a bunch of variables, or to have a nasty string of nested function calls starting with print at the very left, babynames somewhere in the middle, and the remaining arguments and values scattered around.
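To make the comparison concrete, here is a small sketch of one computation written in the three styles. The vector x and the chosen functions are placeholders of my own rather than part of the babynames example; only magrittr's %>% is assumed.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Temporary variables: readable, but the workspace fills up with names.
step1 <- sqrt(x)
step2 <- mean(step1)
with_temps <- round(step2, 1)

# Pipe chain: written in the order in which it is thought and executed.
piped <- x %>% sqrt() %>% mean() %>% round(1)
```

All three variants compute the same value; only the reading order differs.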

The example illustrates a few features of %>%. Firstly, the dplyr functions filter, group_by, and summarize all take a data object as their first argument, and by default this is where %>% will place its left-hand side. The babynames data is thus inserted as the first argument in the call to filter. When the filtering is done, the result is passed as the first argument to group_by, and similarly for summarize. However, one is not always so fortunate that a function is designed to accept the data (or whatever you might be piping along) as its first argument (the dplyr functions are designed with %>% operations in mind). This is the case with e.g. qplot, but note the data = . argument. This tells %>% to place the left-hand side there, and not as the first argument. This is a simple and natural way to accommodate the lack of consistency of function signatures, and it allows the left-hand side to go anywhere in the call on the right-hand side. You may also have noted that print is used without parentheses; this is to make the code even cleaner when only the left-hand side is needed as input. Finally, note that %>% can be used in a nested fashion (a separate chain is found within the filter call) and that magrittr has aliases for commonly used operators, such as add for + and equals for == used above. These make pipe chains more readable (not necessarily shorter).
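The features just described can be tried in isolation without dplyr or ggplot2. The following is a minimal sketch on a made-up numeric vector; the variable names are mine, while %>%, the dot placeholder, add and equals are the magrittr features discussed above.

```r
library(magrittr)

x <- c(4, 9, 16)

# Default: the left-hand side becomes the first argument of the call.
roots <- x %>% sqrt()

# A bare function name is enough when only the left-hand side is needed.
roots_bare <- x %>% sqrt

# The dot placeholder puts the left-hand side somewhere else in the call.
steps <- 2 %>% seq(10, 20, by = .)   # same as seq(10, 20, by = 2)

# Aliases: add() stands for `+` and equals() for `==`.
flags <- x %>% sqrt %>% add(1) %>% equals(4)   # same as (sqrt(x) + 1) == 4

# Nesting: a separate chain used inside another expression.
kept <- x[x %>% sqrt %>% equals(3)]
```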

Outlook

The two main places to obtain magrittr are CRAN (using install.packages) and GitHub (using devtools::install_github). As usual, the first is the stable version and the latter is the development version, and at the time of this writing the latter has quite a lot of features which have not yet made it to the CRAN version. Examples are the tee operator %T>%, which works like %>% but returns the left-hand side after applying the right-hand side; the %$% operator, which exposes the contents/variables of the left-hand side to the right-hand side expression (so one can omit the verbose dataset$ in front of each); and a compound assignment pipe operator %<>%, which pipes the left-hand side symbol as usual, but rather than returning the result of the entire chain, overwrites the original symbol (which could also be e.g. dataset$variable instead of a simple symbol).

One reason that these features have not yet appeared in the CRAN version (although really useful) is that we give a lot of thought to the more general philosophy, and to how all these pieces fit best together in a coherent framework. In particular, one concept that I think is promising is that of functional sequences (à la magrittr). Currently each right-hand side is viewed in isolation, independent of the others in the chain. But since they are all tied together in a linear fashion (one input, one output), one can view everything in the chain, except for the first argument (the initial left-hand side), as a function of a single argument: a functional sequence constructed from a sequence of magrittr-like right-hand sides. Furthermore, %>% currently serves the purpose of building values, but a functional sequence is an analogue for building functions, and ties the concepts together. In the development version there is a first attempt to implement this, but it should still be considered experimental.
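The tee, exposition and compound assignment operators mentioned above can be sketched as follows. To my knowledge all three later shipped in CRAN releases of magrittr (1.5 onward), so the sketch assumes such a version or the development version is installed; the variable names and the use of the built-in mtcars data are my own choices.

```r
library(magrittr)

# Tee: run the right-hand side for its side effect (printing here),
# then pass the original left-hand side on to the next step.
total <- c(1, 2, 3) %T>% print() %>% sum()

# Exposition: the columns of the left-hand side are visible by name
# inside the right-hand side expression.
weight_cor <- mtcars %$% cor(mpg, wt)   # same as cor(mtcars$mpg, mtcars$wt)

# Compound assignment: pipe as usual, then overwrite the left-hand side.
v <- c(9, 16)
v %<>% sqrt()
```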

I’ll illustrate the concept by an example. Consider an auction where participants submit the quantities they are willing to buy at different prices. Given all the submitted bids, our task is to aggregate the demand and supply curves and visualize the crossing
at which supply and demand meet to determine the price which clears the market.

Let’s first generate some (unrealistically uniform) artificial data:

set.seed(1) # reproducibility

# Utility function for sampling.
sample_with_replace <-
  function(v, n = 100) sample(v, size = n, replace = TRUE)

# Generate some auction data for the example.
auction.data <-
  data.frame(
    Price    = 1:100 %>% sample_with_replace,
    Quantity = 1:10  %>% sample_with_replace,
    Type     =
      0:1 %>%
      sample_with_replace %>%
      factor(labels = c("Buy", "Sell"))
  ) %T>%
  (lambda(x ~ x %>% head %>% print))

##   Price Quantity Type
## 1    27        7  Buy
## 2    38        4  Buy
## 3    58        3 Sell
## 4    91       10  Buy
## 5    21        7  Buy
## 6    90        3 Sell

Notice the use of both the tee operator and the experimental lambda syntax, which are currently only available in the development version.

The task is split into two steps. First we construct a function, using the functional sequence operator %,%, which is able to aggregate a supply (or demand) curve for sellers (buyers). The second step uses this function in a chain which takes the data all the way to a visual.

# Define a function that aggregates the bid data for a type.
# Note that the sorting direction depends on type.
# For each price level find the total volume which will be sold/bought.
aggregate_bids <-
  group_by(Type, Price) %,%
  summarize(Quantity = sum(Quantity)) %,%
  ungroup %,%
  arrange(Price*(1 - 2*(Type == "Buy"))) %,%
  mutate(Quantity = Quantity %>% cumsum)

# Group the data, aggregate the bids, and plot the supply and demand curves.
auction.data %>%
  group_by(Type) %>%
  do(aggregate_bids(.)) %>%
  qplot(Quantity, Price, col = Type, geom = "step", data = .) %>%
  print
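The sorting expression inside aggregate_bids is compact, so a small base R sketch may help to unpack it. The factor 1 - 2*(Type == "Buy") evaluates to -1 for buy bids and to 1 for sell bids, so buy bids are sorted by decreasing price and sell bids by increasing price before the quantities are accumulated. The bids data frame below is made up for illustration and is sorted as a whole, whereas the example above sorts within each type.

```r
# Made-up bids, for illustration only.
bids <- data.frame(
  Type  = c("Buy", "Buy", "Sell", "Sell"),
  Price = c(10, 30, 40, 20)
)

# -1 for "Buy" rows, +1 for "Sell" rows.
direction <- 1 - 2 * (bids$Type == "Buy")

# Sorting on Price * direction lists buy bids from high to low price,
# followed by sell bids from low to high price.
sorted <- bids[order(bids$Price * direction), ]
print(sorted)
```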

 



Note how the aggregate_bids function is built in a way completely analogous to a usual %>% chain, except that %,% is used to signal that the result is a functional sequence and not a value. Another option is to use %>% here too and have a designated first left-hand side, e.g. the dot "." (suggested by Romain Francois, R-enthusiast and R/C++ hero).
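The dot-first alternative is, as far as I know, the form that CRAN releases of magrittr (1.5 and later) ended up offering, while %,% should not be assumed to exist there. A minimal sketch under that assumption, with root_mean_square as a made-up name and raise_to_power being magrittr's alias for ^:

```r
library(magrittr)

# A chain that starts with the dot yields a function rather than a value.
root_mean_square <- . %>% raise_to_power(2) %>% mean() %>% sqrt()

# The resulting object is called like any other function of one argument.
rms <- root_mean_square(c(3, 4))
print(rms)
```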

The functional sequence view of the pipe chain also opens up possibilities for optimization. Currently the %>% pipe is built for robustness and user-friendliness in a sense similar to generic functions. It will figure out how to proceed given the structure of the right-hand side, which of course has a small overhead. In most situations this is negligible; in others (such as the one described here) one can restructure one’s code so it becomes negligible. But granted, one might encounter realistic examples where a little performance boost would be nice. It is quite possible that integrating functional sequences in a way where the pipe only needs to be clever about each step once would lead to good results. This could be particularly useful in situations like

result <-
  looong_vector %>%
  lapply(
    one_action %,%
    another_action(requiring_x) %,%
    (lambda(. ~ finalizing_actions))
  )
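
A runnable approximation of this pattern might look as follows. It assumes a CRAN release of magrittr in which a chain starting with the dot yields a function, and therefore uses %>% in place of the experimental %,% and lambda syntax. The input list and the name abs_total are invented for the example, and no claim about actual speed is made.

```r
library(magrittr)

# Build the sequence of steps once, as a function of a single argument.
abs_total <- . %>% abs() %>% sum()

# Reuse the same function for every element of the input.
result <- list(c(-1, 2), c(3, -4)) %>% lapply(abs_total)
print(result)
```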

While this post was being written, other R bloggers wrote their own posts on magrittr. Here is what they had to say:
magrittr: Simplifying R code with pipes
More Readable Code with Pipes in R
More fun with %.% and %>%