Simpler R coding with pipes > the present and future of the magrittr package
2015-02-13 09:22
This is a guest post by Stefan Milton, the author of the magrittr package, which introduces the %>% operator to R programming.
Preface (by Tal Galili)
I was first introduced to the %>% (a.k.a. pipe) operator in R thanks to Hadley Wickham’s (fascinating) dplyr tutorial (link to the workshop’s material) at useR!2014. After several discussions during the conference (including one very influential conversation with RStudio’s Joe Cheng), I became convinced that the pipe operator is one of the most important innovations (if not THE most important) introduced to the R ecosystem this year.
Soon after, I contacted Stefan Milton (the author of the magrittr package), asking him to write about his implementation of the pipe operator. Stefan generously agreed, and what follows is what he had to share with the rest of us.
magrittr: The Difficult Crossing – by Stefan Milton Bache
Background
The basics
Outlook
Ce n’est qu’un au revoir
Background
It has only been 7 months and a bit since my initial magrittr commit to GitHub on January 1st. It has had more success than I had anticipated, and it appears that I was not quite alone with a frustration which caused me to start the magrittr project. I am not easily frustrated with R, but after a few weeks working with F# at work, I felt it upon returning to R: I had gotten used to writing code in a different way, all nicely aligned with thought and order of execution. The forward pipe operator |> was so addictive that being unable to do something similar in R was more than mildly irritating. Reversing thought, deciphering nested function calls, and making excessive use of temporary variables almost became deal breakers! Surprisingly, I had never really noticed this before, but once I did, my return to R became a difficult crossing.
An amazing thing about R is that it is a very flexible language and the problem could be solved. The |> operator in F# is indeed very simple: it is defined as let (|>) x f = f x. However, the usefulness of this simplicity relies heavily on a concept that is not available in R: partial application.
Furthermore, functions in F# almost always adhere to certain design principles which make the simple definition sufficient. Suppose that f is a function of two arguments; then in F# you may apply f to only the first argument and obtain a new function as the result, a function of the second argument alone. This is partial application, and it works with any number of arguments, but application is always from left to right in the argument list. This is why the most important argument (and the one most likely to be a left-hand side object in the pipeline) is almost always the last argument, which in turn makes the simple definition of |> work. To illustrate, consider the following example:

    some_value |> some_function other_value

Here, some_function is partially applied to other_value, creating a new function of a single argument, and by the simple definition of |>, this is applied to some_value.
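To make the idea concrete, here is a minimal base R sketch of what the F# definition would look like if it were copied literally. The operator %pipe% and the helpers partial and scale_by are hypothetical names invented for this illustration; they are not part of magrittr or of base R. Partial application is imitated with a closure, which is exactly the extra machinery R would need.

```r
# Naive forward pipe, mirroring the F# definition `let (|>) x f = f x`.
# %pipe% is a made-up name for this sketch, not a magrittr operator.
`%pipe%` <- function(x, f) f(x)

# Hypothetical helper: fix the leading arguments of f and return a function
# of the remaining ones, imitating left-to-right partial application.
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(fixed, list(...)))
}

# A function whose "data" argument comes last, as is customary in F#.
scale_by <- function(factor, x) x * factor

# Reads like the F# line: some_value |> some_function other_value
result <- c(1, 2, 3) %pipe% partial(scale_by, 10)
print(result)
```

The sketch only works because scale_by takes its data last and because partial is called explicitly at every step, which is the kind of ceremony a natural-feeling R pipe has to avoid.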
It was clear to me that because R is lacking native partial application and conventions on argument order, no simple solution would be satisfactory, although definitely possible, see e.g. here or here. I wanted to make something that would feel natural in R, and which would serve the main purpose of improving cognitive performance of those writing the code, and of those reading the code.
It turned out that while I was working on magrittr’s %>% operator, Hadley Wickham and Romain Francois were implementing a similar %.% operator in their dplyr package, which they announced on January 17. However, it was not quite as flexible, and we thought that piping functionality was better placed in its own more light-weight package. Hadley joined the magrittr project, and in dplyr 0.2 the %.% operator was deprecated; instead, %>% was imported from magrittr.
The basics
Although quite a few blogs have nice introductions to the magrittr package (there is also a vignette), I’ll provide a brief recap here to add some context to the thoughts presented above. Consider the example below (no claim of any scientific relevance, but it was a nice opportunity to try Hadley’s babynames package):
    library(babynames) # data package
    library(dplyr)     # provides data manipulating functions.
    library(magrittr)  # ceci n'est pas un pipe
    library(ggplot2)   # for graphics

    babynames %>%
      filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
      group_by(year, sex) %>%
      summarize(total = sum(n)) %>%
      qplot(year, total, color = sex, data = ., geom = "line") %>%
      add(ggtitle('Names starting with "Ste"')) %>%
      print
First note that even without knowing much about magrittr (or even R), reading this chunk of code is pretty easy, like a recipe, and not a single temporary variable is needed. It’s almost like
1. take the baby data, then
2. filter it such that the name sub-string from character 1 to 3 equals "Ste", then
3. group it by year and sex, then
4. summarize it by computing total sum for each group, then
5. plot the results, coloring by sex, then
6. add a title, then
7. print it to the canvas.
Maybe even easier?! The order in which you’d think of these steps is the same as the order in which they are written, and as the order in which they are executed. The alternative would be to use either a bunch of variables, or to have a nasty string of nested function calls starting with babynames somewhere in the middle, and the remaining arguments and values scattered around.
The example illustrates a few features of %>%. Firstly, the dplyr functions filter, group_by, and summarize all take as first argument a data object, and as default this is where %>% will place its left-hand side. The babynames data is thus inserted as first argument in the call to filter. When the filtering is done, the result is passed as the first argument to group_by, and similarly for summarize.
However, one is not always so fortunate that a function is designed to accept the data (or whatever you might be piping along) as its first argument (the dplyr functions are designed with %>% operations in mind). This is the case with e.g. qplot, but note the data = . argument. This tells %>% to place the left-hand side there, and not as the first argument. This is a simple and natural way to accommodate the lack of consistency of function signatures, and allows the left-hand side to go anywhere in the call on the right-hand side. You may also have noted that print is used without parentheses; this is to make the code even cleaner when only the left-hand side is needed as input. Finally, note that %>% can be used in a nested fashion (a separate chain is found within the filter call) and that magrittr has aliases for commonly used operators, such as add for + and equals for ==, both used above. These make pipe chains more readable (not necessarily shorter).
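The placement rules described above can be tried on plain vectors, without any data packages. The following is a small sketch that assumes only that magrittr is installed; the variable names are made up for the illustration.

```r
library(magrittr)

x <- c(4, 9, 16)

# Default placement: the left-hand side becomes the first argument, so the
# chain below is round(sqrt(x), 1) written in reading order.
piped  <- x %>% sqrt %>% round(1)
nested <- round(sqrt(x), 1)

# Dot placeholder: put the left-hand side somewhere other than first.
# Equivalent to seq(from = 1, to = 10, by = 3).
steps <- 3 %>% seq(from = 1, to = 10, by = .)

# Aliases: add() stands for `+` and equals() for `==`.
flags <- x %>% add(1) %>% equals(10)
```

Note that sqrt is written without parentheses, just like print in the babynames chain, because it needs nothing but the left-hand side.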
Outlook
The two main places to obtain magrittr are CRAN (using install.packages) and GitHub (using devtools::install_github). As usual, the former is the stable version and the latter is the development version, and at the time of this writing the latter has quite a lot of features that have not yet made it to the CRAN version. Examples are the tee operator %T>%, which works like %>% but returns the left-hand side after applying the right-hand side; the %$% operator, which exposes the contents/variables of the left-hand side to the right-hand side expression (so one can omit the verbose dataset$ in front of each); and a compound assignment pipe operator %<>%, which pipes the left-hand side symbol as usual, but rather than returning the result of the entire chain, overwrites the original symbol (which could also be e.g. dataset$variable instead of a simple symbol). One reason that these features have not yet appeared in the CRAN version (although really useful) is that we give a lot of thought to the more general philosophy, and how all these pieces fit best together in a coherent framework. In particular, one interesting concept that I think is promising is that of functional sequences (à la magrittr). Currently each right-hand side is viewed in isolation, independent of the others in the chain. But since they are all tied together in a linear fashion (one input, one output), one can view everything in the chain, except for the first argument, as a function of a single argument: a functional sequence constructed from a sequence of magrittr-like right-hand sides. Furthermore, currently %>% serves the purpose of building values, but a functional sequence is an analogue for building functions, and ties the concepts together. In the development version there is a first attempt to implement this, but this should still be considered experimental.
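A minimal sketch of the tee, exposition, and compound assignment operators follows. It assumes a magrittr installation that exports %T>%, %$% and %<>% (at the time of writing they are described as development-version features); the data and variable names are invented for the example.

```r
library(magrittr)

# Tee: run the right-hand side for its side effect (here, printing), then
# pass the original left-hand side on to the next step.
total <- c(3, 1, 2) %T>% print %>% sum

# Exposition: make the columns of the left-hand side visible by name, so
# there is no need to write scores$a and scores$b.
scores <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
ab_cor <- scores %$% cor(a, b)

# Compound assignment: pipe the symbol through the chain and overwrite it
# with the result, like values <- values %>% sort.
values <- c(3, 1, 2)
values %<>% sort
```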
I’ll illustrate the concept by an example. Consider an auction where participants submit the quantities they are willing to buy at different prices. Given all the submitted bids, our task is to aggregate the demand and supply curves and visualize the crossing
at which supply and demand meet to determine the price which clears the market.
Let’s first generate some (unrealistically uniform) artificial data:
    set.seed(1) # reproducibility

    # Utility function for sampling.
    sample_with_replace <-
      function(v, n = 100) sample(v, size = n, replace = TRUE)

    # Generate some auction data for the example.
    auction.data <-
      data.frame(
        Price    = 1:100 %>% sample_with_replace,
        Quantity = 1:10  %>% sample_with_replace,
        Type     = 0:1   %>% sample_with_replace %>%
                   factor(labels = c("Buy", "Sell"))
      ) %T>%
      (lambda(x ~ x %>% head %>% print))
    ##   Price Quantity Type
    ## 1    27        7  Buy
    ## 2    38        4  Buy
    ## 3    58        3 Sell
    ## 4    91       10  Buy
    ## 5    21        7  Buy
    ## 6    90        3 Sell
Notice the use of both the tee operator and the experimental lambda syntax, which are currently only available in the development version.
The task is split into two steps. First, we use a functional sequence operator, %,%, to construct a function which is able to aggregate a supply (or demand) curve for sellers (or buyers). The second step uses this function in a chain which takes the data all the way to a visual.
    # Define a function that aggregates the bid data for a type.
    # Note that the sorting direction depends on type.
    # For each price level find the total volume which will be sold/bought.
    aggregate_bids <-
      group_by(Type, Price) %,%
      summarize(Quantity = sum(Quantity)) %,%
      ungroup %,%
      arrange(Price*(1 - 2*(Type == "Buy"))) %,%
      mutate(Quantity = Quantity %>% cumsum)

    # Group the data, aggregate the bids, and plot the supply and demand curves.
    auction.data %>%
      group_by(Type) %>%
      do(aggregate_bids(.)) %>%
      qplot(Quantity, Price, col = Type, geom = "step", data = .) %>%
      print
Note how the aggregate_bids function is built in a way completely analogous to a usual %>% chain, except that the %,% is used to signal that the result is a functional sequence and not a value. Another option is to use %>% here too and have a designated first left-hand side, e.g. . (suggested by Romain Francois, R-enthusiast and R/C++ hero).
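The second option, a chain that starts with the . placeholder, can be sketched as follows. This assumes a magrittr version in which a %>% chain beginning with . is turned into a function; the name root_of_sum and the inputs are made up for the illustration, and the experimental %,% operator is deliberately not used.

```r
library(magrittr)

# A chain whose designated first left-hand side is the placeholder .
# yields a function of one argument rather than a value.
root_of_sum <- . %>% sum %>% sqrt

# Calling it behaves like c(9, 16) %>% sum %>% sqrt.
single <- root_of_sum(c(9, 16))

# The function is built once and can then be reused, e.g. inside lapply.
many <- list(c(1, 3), c(20, 5)) %>% lapply(root_of_sum)
```

Building the function once and applying it many times is also the usage pattern that the optimization idea discussed next would benefit from.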
The functional sequence view of the pipe chain also opens up possibilities for optimization. Currently the %>% pipe is built for robustness and user-friendliness in a sense similar to generic functions. It will figure out how to proceed given the structure of the right-hand side, which of course has a small overhead. In most situations this is negligible; in others (such as the one described here) one can restructure one's code so it becomes negligible. But granted, one might encounter realistic examples where a little performance boost would be nice. It is quite possible that integrating functional sequences in a way where it only needs to be clever about each step once would lead to good results. This could be particularly useful in situations like
    result <-
      looong_vector %>%
      lapply(
        one_action %,%
        another_action(requiring_x) %,%
        (lambda(. ~ finalizing_actions))
      )

While this post was written, other R bloggers wrote their own posts on magrittr; here is what they had to say:
magrittr: Simplifying R code with pipes
More Readable Code with Pipes in R
More fun with %.% and %>%