
dplyr do: Some Tips for Using and Programming

2016-07-08 10:21
This post was originally published on the Quantide blog, where the full article can be read.
If you want to compute arbitrary operations on a data frame that return more than one number, use dplyr do()!

This post explores some basic concepts of do() and gives some advice on using it and programming with it.

do() is a verb (function) of dplyr, a powerful R package for data manipulation written and maintained by Hadley Wickham. The package lets you perform the common data manipulation tasks on data frames, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.

First of all, you have to install the dplyr package:

install.packages("dplyr")

and then load it:

library(dplyr)

We will analyze the use of do() with the following dataset, created with random data:

set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

We first transform it into a tbl_df object to get a better print method. The conversion does not change the data themselves.

ds <- tbl_df(ds)
ds


Source: local data frame [300 x 3]

    group        x           y
   (fctr)    (dbl)       (dbl)
1       a 1.995615 -1.71089045
2       a 3.263062 -0.03712943
3       a 2.842166 -0.09022217
4       a 4.773570  0.69742469
5       a 3.233943  2.76536531
6       a 3.637260  4.06379942
7       a 1.836419  2.26214995
8       a 4.429065  2.75438347
9       a 1.349481 -1.77539016
10      a 2.280276  3.04043881
..    ...      ...         ...
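
A note for readers on a recent dplyr: as far as we know, tbl_df() has been deprecated since dplyr 1.0.0 in favour of as_tibble(). The sketch below uses a small made-up data frame (df and tb are illustrative names, not taken from the post) to show that the conversion returns a new object and leaves the input data frame as it was.

```r
library(dplyr)

# A tiny illustrative data frame (not the dataset used in the post)
df <- data.frame(group = c("a", "b"), x = c(1.5, 2.5))

# as_tibble() is re-exported by dplyr; it returns a tibble copy of df
tb <- as_tibble(df)

class(df)  # still a plain data.frame
class(tb)  # a tibble, which prints more compactly
```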



Basic Concepts of do() (Non-Standard Evaluation Version)

As we already said, do() computes arbitrary operations on a data frame that return more than one number.

To use do(), you must know that:

it always returns a data frame
unlike the other data manipulation verbs of dplyr, do() needs the . placeholder inside the function to apply, referring to the data it has to work with:
# Head of ds
ds %>% do(head(.))


Source: local data frame [6 x 3]

   group        x           y
  (fctr)    (dbl)       (dbl)
1      a 1.995615 -1.71089045
2      a 3.263062 -0.03712943
3      a 2.842166 -0.09022217
4      a 4.773570  0.69742469
5      a 3.233943  2.76536531
6      a 3.637260  4.06379942


it is designed to be used with dplyr group_by() to compute operations within groups:
# Head of ds by group
ds %>% group_by(group) %>% do(head(.))


Source: local data frame [18 x 3]
Groups: group [3]

    group          x           y
   (fctr)      (dbl)       (dbl)
1       a 1.99561530 -1.71089045
2       a 3.26306233 -0.03712943
3       a 2.84216582 -0.09022217
4       a 4.77356962  0.69742469
5       a 3.23394254  2.76536531
6       a 3.63726018  4.06379942
7       b 2.33415330 -0.56965729
8       b 5.72622741  1.71643653
9       b 2.06170532  4.87756954
10      b 4.68575126 -0.08011508
11      b 0.08401255 -0.04767590
12      b 2.19938816  4.18954758
13      c 3.05634353 -0.89257491
14      c 2.28659319  2.63171152
15      c 4.70525275  1.31450497
16      c 4.02673050 -1.86270620
17      c 5.03640599  2.48564201
18      c 0.95704183  1.27446410


the argument of do() can be named or unnamed:
named arguments (more than one can be supplied) become list-columns, with one element for each group:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(out=tail(.$x, 3))


Source: local data frame [3 x 2]
Groups: <by row>

   group      out
  (fctr)    (chr)
1      a <dbl[3]>
2      b <dbl[3]>
3      c <dbl[3]>


an unnamed argument (only one can be supplied) must evaluate to a data frame, and the group labels are repeated for each row it returns:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))


Source: local data frame [9 x 2]
Groups: group [3]

   group       out
  (fctr)     (dbl)
1      a 3.8270397
2      a 0.6426337
3      a 0.6519305
4      b 3.3238824
5      b 0.8290942
6      b 4.1538746
7      c 6.5861213
8      c 4.6280643
9      c 0.3599512
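
As a sketch of why both forms are useful, the pattern below fits one linear model per group. It follows the usage shown in the dplyr documentation rather than anything in this post, and the names models, coefs, intercept and slope are illustrative. The named form keeps each fitted object in a list-column, while the unnamed form returns an ordinary data frame of coefficients.

```r
library(dplyr)

# Same dataset as in the post
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Named argument: one lm object per group, stored in the list-column "model"
models <- ds %>% group_by(group) %>% do(model = lm(y ~ x, data = .))

# Unnamed argument: the expression must return a data frame,
# here one row of coefficients per group
coefs <- ds %>%
  group_by(group) %>%
  do({
    fit <- lm(y ~ x, data = .)
    data.frame(intercept = unname(coef(fit)[1]),
               slope = unname(coef(fit)[2]))
  })

models
coefs
```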


do() works the same way with user-defined functions.

Let us define the following function, which performs two simple operations returning a data frame:

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  return(data.frame(res_x, res_y))
}

If the argument is named, the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))


Source: local data frame [3 x 2]
Groups: <by row>

   group                out
  (fctr)              (chr)
1      a <data.frame [1,2]>
2      b <data.frame [1,2]>
3      c <data.frame [1,2]>

Otherwise, if the argument is unnamed, the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))


Source: local data frame [3 x 3]
Groups: group [3]

   group    res_x     res_y
  (fctr)    (dbl)     (dbl)
1      a 5.005825  9.167546
2      b 5.022282  8.683619
3      c 5.025586 11.240558
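
The two results carry the same numbers in different shapes. The minimal sketch below (res_named, res_unnamed and unpacked are illustrative names) pulls the one-row data frames out of the list-column and stacks them with bind_rows(), which should give the same values as the unnamed call.

```r
library(dplyr)

# Same dataset and function as in the post
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  data.frame(res_x, res_y)
}

# Named: one data frame per group, stored in the list-column "out"
res_named <- ds %>% group_by(group) %>% do(out = my_fun(x = .$x, y = .$y))

# Unnamed: the per-group data frames are bound together directly
res_unnamed <- ds %>% group_by(group) %>% do(my_fun(x = .$x, y = .$y))

# Access a single element of the list-column, then stack all of them
res_named$out[[1]]
unpacked <- bind_rows(res_named$out)
all.equal(unpacked$res_x, res_unnamed$res_x)
```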



Programming with do_() (Standard Evaluation Version)

How can we enclose the previous operations inside a function? Simple! Using do_() (the SE version of do()) and the interp() function of the lazyeval package.
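
The do_() and interp() code itself is in the full article and is not reproduced here. As a rough alternative sketch, and explicitly not the approach that article describes, the wrapper below passes column names as strings and indexes the . placeholder with [[ ]]. It assumes dplyr 1.0 or later (for across() and all_of()); apply_my_fun and its argument names are hypothetical.

```r
library(dplyr)

# Same dataset and function as in the post
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  data.frame(res_x, res_y)
}

# Hypothetical wrapper: column names are given as character strings.
# Inside do(), "." is the data of the current group, so .[[x_var]]
# extracts the column whose name is stored in x_var.
apply_my_fun <- function(data, group_var, x_var, y_var) {
  data %>%
    group_by(across(all_of(group_var))) %>%
    do(my_fun(x = .[[x_var]], y = .[[y_var]]))
}

res <- apply_my_fun(ds, group_var = "group", x_var = "x", y_var = "y")
res
```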

Continue reading on Quantide blog…

The post dplyr do: Some Tips for Using and Programming appeared
first on MilanoR.
Tags: dplyr do