[Data Science] R Language Study Notes: Basic Syntax
2016-07-12 22:06
Data Types
Data Object & Vector
x <- c(0.5, 0.6)       ## numeric
x <- c(TRUE, FALSE)    ## logical
x <- c(T, F)           ## logical (T and F are shorthand for TRUE and FALSE)
x <- c("a", "b", "c")  ## character
x <- 9:29              ## integer
x <- c(1+0i, 2+4i)     ## complex
x <- vector("numeric", length = 10)  ## create a numeric vector of length 10
x <- 0.6
class(x)               ## print the class of "x"
x <- 1:10
as.character(x)        ## explicitly coerce "x" to character
List
x <- list("...", "...", ...)
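To make the difference between a list and a vector concrete, here is a minimal sketch (the values are made up for illustration). A list keeps each element's own class, while `c()` coerces everything to one common class.

```r
## A list can hold elements of different classes.
x <- list(1, "a", TRUE, 1 + 4i)
classes <- sapply(x, class)   ## one class name per element
print(classes)

## An atomic vector cannot: c() coerces everything to a common class.
v <- c(1, "a", TRUE)
print(class(v))
```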
Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).
m <- matrix(nrow = 2, ncol = 3)       ## create an empty 2 x 3 matrix (filled with NA)
n <- matrix(1:6, nrow = 2, ncol = 3)  ## create a 2 x 3 matrix, filled column by column
dim(m)             ## get (nrow, ncol) of the matrix
attributes(m)      ## get the attributes of "m" (here only $dim)
m <- 1:10          ## create a new integer vector, from 1 to 10
dim(m) <- c(2, 5)  ## turn the vector "m" into a matrix with nrow = 2, ncol = 5
m                  ## print the value of "m"
x <- 1:3
y <- 10:12
cbind(x, y)        ## bind x and y as columns: a matrix with 3 rows and 2 columns
rbind(x, y)        ## bind x and y as rows: a matrix with 2 rows and 3 columns
Factors
Factors are used to represent categorical data. One can think of a factor as an integer vector where each integer has a label.
x <- factor(c("yes", "yes", "yes", "yes", "no", "no"))  ## create a factor from a character vector
x            ## print the factor
table(x)     ## count how often each level occurs
unclass(x)   ## show the underlying integer codes and the levels attribute
x <- factor(c("yes", "yes", "no"), levels = c("yes", "no"))  ## create a factor and set the order of its levels explicitly
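The effect of the `levels` argument is easiest to see side by side. The sketch below uses made-up data: without `levels`, the levels are sorted alphabetically, and with it, the given order decides which integer code each label gets.

```r
answers <- c("yes", "yes", "no", "yes", "no")

x <- factor(answers)                            ## default: levels sorted alphabetically
print(levels(x))

y <- factor(answers, levels = c("yes", "no"))   ## explicit order: "yes" is level 1
print(levels(y))

counts <- table(y)        ## counts are reported in level order
codes <- as.integer(y)    ## the integer codes behind the labels
print(counts)
print(codes)
```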
Missing Values
Missing values are denoted by NA, or by NaN for undefined mathematical operations.
is.na() tests for NA values and is.nan() tests for NaN values.
x <- c(1, 2, NaN, NA, 4)  ## create a vector to test is.na() and is.nan()
is.na(x)   ## NA values also have a class, so there are integer NA, character NA, etc.
is.nan(x)  ## a NaN value is also NA, but the converse is not true
The full session:
> x <- c(1, 2, NA, 10, 3)
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
Data Frames
Data frames are used to store tabular data. They are represented as a special type of list where every element of the list has to have the same length.
Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.
Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class.
Data frames also have a special attribute called row.names.
Data frames are usually created by calling read.table() or read.csv().
They can be converted to a matrix by calling data.matrix().
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))  ## create a data frame with two columns and four rows
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
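As a small sketch of the points above, the same data frame can be inspected column by column and then converted with `data.matrix()`. The conversion forces every column into one type, so the logical column becomes integers.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

print(nrow(x))            ## number of rows
print(ncol(x))            ## number of columns
print(sapply(x, class))   ## each column keeps its own class

m <- data.matrix(x)       ## convert to a matrix; TRUE/FALSE become 1/0
print(m)
```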
Names
R objects can also have names, which is very useful for writing readable code and self-describing objects.
> x <- 4:6                             ## create an integer vector 'x' with three elements
> names(x) <- c("foo", "bar", "norf")  ## assign names to vector 'x'
> x                                    ## print the value of 'x'
 foo  bar norf
   4    5    6
Data Reading
Reading Data
read.table, read.csv, for reading tabular data; both return a data.frame object.
readLines, for reading lines of a text file.
source, for reading in R code files (inverse of dump).
dget, for reading in R code files (inverse of dput).
load, for reading in saved workspaces.
unserialize, for reading single R objects in binary form.
read.table
Description: Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
Main Arguments:
file, the name of a file, or a connection.
header, a logical value indicating whether the file has a header line.
sep, the column separator, e.g. ",".
colClasses, a character vector indicating the class of each column in the dataset.
nrows, the number of rows to read.
comment.char, a character string indicating the comment character.
skip, the number of lines to skip from the beginning.
stringsAsFactors, should character variables be coded as factors?
Usages:
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#", allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
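A short sketch of these arguments in use. To keep it self-contained, the data is passed through the `text` argument instead of a file; the column names and values are invented for illustration.

```r
## Hypothetical CSV content; in practice this would live in a file.
csv_text <- "id,name,score
1,ann,9.5
2,bob,NA
3,cat,7.0"

## read.csv() is read.table() with header = TRUE and sep = "," preset.
## colClasses fixes the class of each column up front.
df <- read.csv(text = csv_text,
               colClasses = c("integer", "character", "numeric"))
str(df)
```

The string "NA" in the `score` column is read as a missing value because `na.strings` defaults to "NA".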
Writing Data
Description: write.table prints its required argument x (after converting it to a data frame if it is neither a data frame nor a matrix) to a file or connection.
Main Points:
write.table
writeLines
dump
dput
save
serialize
Usages:
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
write.csv(...)
write.csv2(...)
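A minimal round trip, assuming a small made-up data frame and a temporary file: write it with `write.csv()`, read it back with `read.csv()`, and compare the two.

```r
df <- data.frame(id = 1:3,
                 name = c("ann", "bob", "cat"),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")             ## temporary file, removed below
write.csv(df, file = path, row.names = FALSE)  ## skip the row-name column

back <- read.csv(path, stringsAsFactors = FALSE)
print(all.equal(df, back))                     ## compare original and re-read data

unlink(path)
```

Passing `row.names = FALSE` avoids an extra unnamed first column that would otherwise come back as data.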
Reading Large Tables
Read the help page for read.table, which contains many hints.
Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
Set comment.char = "" if there are no commented lines in your file.
Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of each column is the following:
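> initial <- read.table("db.txt", nrows = 100, sep = "\t")              ## read only the first 100 rows
> classes <- sapply(initial, class)                                     ## get the class of each column
> tabAll <- read.table("db.txt", sep = "\t", colClasses = classes)      ## read the whole file with known classes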
Set nrows. This doesn't make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool wc to calculate the number of lines in a file.
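The rough memory calculation mentioned in the tips above can be sketched as follows. The table size (1,500,000 rows by 120 columns, all numeric) is a hypothetical example; the only fixed fact used is that R stores each numeric value in 8 bytes.

```r
rows <- 1500000   ## hypothetical number of rows
cols <- 120       ## hypothetical number of numeric columns

bytes <- rows * cols * 8   ## 8 bytes per numeric value
gb <- bytes / 2^30         ## convert bytes to GB

cat(sprintf("raw size: about %.2f GB\n", gb))
```

This is only the size of the raw data. Reading the file needs some working memory on top of that, so it is safer to leave generous headroom.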
Reading Data Formats
dput and dget
> y <- data.frame(a = 1, b = "a")  ## create a `data.frame` object for `dput`
> dput(y)                          ## `dput` the object created before
structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a",
"b"), row.names = c(NA, -1L), class = "data.frame")
> dput(y, file = 'y.R')            ## `dput` the object into a file named 'y.R'
> new.y <- dget('y.R')             ## read the object stored in the file 'y.R'
> new.y                            ## print the object read from 'y.R'
  a b
1 1 a
dump
Multiple objects can be deparsed using the dump function and read back in using source.
> x <- "foo"                          ## create the first data object
> y <- data.frame(a = 1, b = "a")     ## create the second data object
> dump(c("x", "y"), file = "data.R")  ## store both data objects in a file called 'data.R'
> rm(x, y)                            ## remove both data objects from memory
> source("data.R")                    ## read the dumped file 'data.R' back in
> y                                   ## print the data object 'y' from 'data.R'
  a b
1 1 a
> x                                   ## print the data object 'x' from 'data.R'
[1] "foo"
Connections: Interfaces to the Outside World
Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.
file, opens a connection to a file
gzfile, opens a connection to a file compressed with gzip
bzfile, opens a connection to a file compressed with bzip2
url, opens a connection to a webpage.
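> con <- file('db.txt', 'r')  ## open a connection to 'db.txt' for reading
> readLines(con)              ## read all lines through the connection
> close(con)                  ## close the connection when done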
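Connections become more useful when only part of a file is needed or the file is compressed. The sketch below writes a small gzip file to a temporary path (an assumption made so the example runs on its own) and reads back just the first two lines.

```r
path <- tempfile(fileext = ".gz")   ## temporary file for the demo

con <- gzfile(path, "w")            ## open a gzip connection for writing
writeLines(c("first line", "second line", "third line"), con)
close(con)

con <- gzfile(path, "r")            ## reopen it for reading
first_two <- readLines(con, n = 2)  ## read only the first two lines
close(con)
print(first_two)

unlink(path)
```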
Subsetting
[ always returns an object of the same class as the original; it can be used to select more than one element (there is one exception).
[[ is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame.
$ is used to extract elements of a list or data frame by name; its semantics are similar to that of [[.
Basic
> x <- c("a", "b", "c", "d", "e")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:3]
[1] "a" "b" "c"
> x[x > "a"]
[1] "b" "c" "d" "e"
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> x[u]
[1] "b" "c" "d" "e"
Lists
> x <- list(foo = 1:4, bar = 0.6)
> x[1]
$foo
[1] 1 2 3 4

> x[[1]]
[1] 1 2 3 4
> x[[2]]
[1] 0.6
> x$bar
[1] 0.6
> x$foo
[1] 1 2 3 4
> x[["bar"]]
[1] 0.6
> x["bar"]
$bar
[1] 0.6
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4

$baz
[1] "hello"

> name <- "foo"
> x[[name]]   ## [[ accepts a computed index, so the value of `name` is used
[1] 1 2 3 4
> x$name      ## $ only matches literal names; the list has no element called "name"
NULL
> x$foo
[1] 1 2 3 4
Matrices
Matrices can be subsetted in the usual way with (i,j) type indices.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4
> x[1, 2, drop = FALSE]
     [,1]
[1,]    3
> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5
Partial Matching
Partial matching of names is allowed with [[ and $.
> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 1 2 3 4 5
Removing NA Values
> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> x[!bad]
[1] 1 2 4 5
Use the built-in function complete.cases() to get a logical vector indicating which cases are complete, i.e., have no missing values.
> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y)
> good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"
From data frame
> airquality[1:6, ]  ## print the first six rows of the data frame
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5   ## this row contains NA values
6    28      NA 14.9   66     5   6   ## this row contains an NA value
> good <- complete.cases(airquality)  ## rows 5 and 6 contain NA values, so they are filtered out
> airquality[good, ][1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
Vectorized Operations
Many operations in R are vectorized, making code more efficient, concise, and easier to read.
> x <- 1:4; y <- 6:9
> x + y
[1]  7  9 11 13
> x > 2
[1] FALSE FALSE  TRUE  TRUE
> y >= 2
[1] TRUE TRUE TRUE TRUE
> y == 8
[1] FALSE FALSE  TRUE FALSE
> x * y
[1]  6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
Logic Control
if-else
> if (x > 3) {   ## the condition must be a single TRUE or FALSE, so x should be one number here
+   y <- 10
+ } else {
+   y <- 0
+ }
For
> x <- c("a", "b", "c", "d")
> for (i in 1:4) {
+   print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
> for (i in seq_along(x)) {
+   print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
> for (letter in x) {
+   print(letter)
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
> for (i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
While
> count <- 0
> while (count < 10) {
+   print(count)
+   count <- count + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
> z <- 5
> while (z >= 3 && z <= 10) {
+   print(z)
+   coin <- rbinom(1, 1, 0.5)   ## flip a fair coin
+
+   if (coin == 1) {
+     z <- z + 1
+   } else {
+     z <- z - 1
+   }
+ }
[1] 5
[1] 4
[1] 3
[1] 4
[1] 5
[1] 4
[1] 5
[1] 4
[1] 3
Repeat
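> x0 <- 1
> tol <- 1e-8
> repeat {
+   x1 <- computeEstimate()   ## computeEstimate() is a placeholder, not a real function
+   if (abs(x1 - x0) < tol) {
+     break                   ## stop once the estimate has converged
+   } else {
+     x0 <- x1
+   }
+ }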
> for (i in 1:100) {
+   if (i <= 20) {
+     next   ## skip the first 20 iterations
+   }
+ }
Function
> add2 <- function(x, y) {
+   x + y
+ }
> add2(2, 3)
[1] 5
> above <- function(x, n = 10) {
+   use <- x > n
+   x[use]
+ }
> x <- 1:20
> above(x, 10)
 [1] 11 12 13 14 15 16 17 18 19 20
> columnmean <- function(y, removeNA = TRUE) {
+   nc <- ncol(y)
+   means <- numeric(nc)
+   for (i in 1:nc) {
+     means[i] <- mean(y[, i], na.rm = removeNA)
+   }
+   means   ## return the result
+ }
> columnmean(airquality)   ## compute the mean of each column of `airquality`
[1]  42.129310 185.931507   9.957516  77.882353   6.993464  15.803922
The ... Argument
... is often used when extending another function and you don't want to copy the entire argument list of the original function.
myplot <- function(x, y, type = "l", ...) {
  plot(x, y, type = type, ...)   ## pass all extra arguments on to plot()
}
The ... argument is also necessary when the number of arguments passed to the function cannot be known in advance.
> args(paste)   ## view the arguments of function `paste`
function (..., sep = " ", collapse = NULL)
NULL
> args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
NULL
> paste("a", "b", sep = ":")
[1] "a:b"
> paste("a", "b", se = ":")   ## arguments after ... must be named in full; "se" is just pasted as a value
[1] "a b :"
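Both uses of `...` can be shown in a few lines. The function names `describe` and `my_mean` below are hypothetical helpers written for this sketch, not part of R.

```r
## Collect an unknown number of arguments through `...`.
describe <- function(label, ...) {
  values <- c(...)   ## gather everything passed after `label`
  paste0(label, ": n = ", length(values), ", mean = ", mean(values))
}
print(describe("scores", 2, 4, 6))

## Forward extra arguments to another function without listing them.
my_mean <- function(x, ...) {
  mean(x, ...)       ## e.g. trim or na.rm go straight to mean()
}
print(my_mean(c(1, 2, 3, 100), trim = 0.25))
print(my_mean(c(1, NA, 3), na.rm = TRUE))
```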
Scoping Rules
A Diversion on Binding Values to Symbol
When R tries to bind a value to a symbol, it searches through a series of environments to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order is roughly:
Search the global environment for a symbol name matching the one requested.
Search the namespaces of each of the packages on the search list.
Free Variable
> z <- 1
> lm <- function(x, y) {
+   x + y + z   ## z is a free variable
+ }
> lm(1, 1)
[1] 3
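A free variable is looked up in the environment where the function was defined. The sketch below illustrates this with a function that returns another function; `make_power`, `square` and `cube` are hypothetical names chosen for the example.

```r
make_power <- function(n) {
  ## In the inner function, n is a free variable. Its value is found
  ## in the environment created by this call to make_power().
  function(x) x^n
}

square <- make_power(2)
cube <- make_power(3)

print(square(4))
print(cube(2))

## The defining environment can be inspected directly.
print(get("n", environment(cube)))
```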
Coding Standard
Always use text files / a text editor.
Indent your code.
Limit the width of your code.
Limit the length of your function.
Dates and Times
Dates are represented by the Date class.
Times are represented by the POSIXct or the POSIXlt class.
Dates are stored internally as the number of days since 1970-01-01
Times are stored internally as the number of seconds since 1970-01-01
> Sys.time()
[1] "2016-07-13 22:22:37 CST"
> timeNow <- Sys.time()
> datestring <- c(timeNow)
> x <- strptime(datestring, "%B %d, %Y %H:%M")  ## parse the time string with the given format
> x   ## NA, because the format does not match text like "2016-07-13 22:22:37"
[1] NA
> class(x)
[1] "POSIXlt" "POSIXt"
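The `strptime` call in the session above returns NA because the format string has to describe the text being parsed. A minimal sketch with a matching format follows; the date string and the UTC time zone are assumptions chosen so the result does not depend on the local machine.

```r
## Dates: stored as the number of days since 1970-01-01.
d <- as.Date("2016-07-13")
print(unclass(d))

## Times: the format must match the layout of the string.
x <- strptime("2016-07-13 22:22:37", "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(class(x))    ## POSIXlt: a list of fields such as hour, min, sec
print(x$hour)

## POSIXct: stored as the number of seconds since 1970-01-01.
ct <- as.POSIXct(x)
print(unclass(ct))
```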
Loop Functions
lapply: Loop over a list and evaluate a function on each element.
sapply: Same as lapply but try to simplify the result.
apply: Apply a function over the margins of an array.
tapply: Apply a function over subsets of a vector.
mapply: Multivariate version of lapply.
An auxiliary function split is also useful, particularly in conjunction with lapply.
lapply
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<bytecode: 0x000000000b606e90>
<environment: namespace:base>
For example:
> x <- list(a = 1:5, b = rnorm(10))
> lapply(x, mean)
$a
[1] 3

$b
[1] -0.1931699
rnorm: generates random deviates from the normal distribution with mean equal to mean and standard deviation equal to sd. Its companions dnorm, pnorm and qnorm give the density, the distribution function and the quantile function.
runif, dunif, punif, qunif: These functions provide information about the uniform distribution on the interval from min to max. dunif gives the density, punif gives the distribution function, qunif gives the quantile function and runif generates random deviates.
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
> lapply(x, function(elt) elt[, 1])   ## extract the first column of each matrix
$a
[1] 1 2

$b
[1] 1 2 3
sapply
sapply will try to simplify the result of lapply if possible.
If the result is a list where every element is length 1, then a vector is returned.
If the result is a list where every element is a vector of the same length (>1), a matrix is returned.
If it can't figure things out, a list is returned.
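The three simplification rules above can be seen with one small list (the values are made up for illustration): a function returning one value per element gives a vector, one returning two values gives a matrix, and one returning results of different lengths leaves a list.

```r
x <- list(a = 1:4, b = 5:10, c = c(2, 4))

means <- sapply(x, mean)    ## every result has length 1: named vector
print(means)

ranges <- sapply(x, range)  ## every result has length 2: 2 x 3 matrix
print(ranges)

big <- sapply(x, function(v) v[v > 3])  ## lengths differ: stays a list
print(big)
```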
apply
apply is used to evaluate a function (often an anonymous one) over the margins of an array.
It is most often used to apply a function to the rows or columns of a matrix.
It can be used with general arrays, e.g. taking the average of an array of matrices.
It is not really faster than writing a loop, but it works in one line!
> str(apply)
function (X, MARGIN, FUN, ...)
X is an array.
MARGIN is an integer vector indicating which margins should be "retained".
FUN is a function to be applied.
... is for other arguments to be passed to FUN.
> x <- matrix(1:4, 2, 2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> apply(x, 1, mean)
[1] 2 3
> apply(x, 2, mean)
[1] 1.5 3.5
MARGIN = 1: compute the mean of every row, and return a vector as the result.
MARGIN = 2: compute the mean of every column, and return a vector as the result.
Other shortcuts.
rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)
apply also works on arrays with more than two dimensions. In the code below, a vector is passed as the MARGIN value: dimensions 1 and 2 are retained, and the mean is taken over the third dimension.
> a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
> apply(a, c(1, 2), mean)
           [,1]        [,2]
[1,]  0.6869065 -0.66529430
[2,] -0.1136978 -0.04124547
mapply
mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN is a function to apply.
... contains arguments to apply over.
MoreArgs is a list of other arguments to FUN.
SIMPLIFY indicates whether the result should be simplified.
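A short sketch of what "in parallel over a set of arguments" means: `mapply` takes the first elements of each argument together, then the second elements, and so on. The inputs are made up for illustration.

```r
## Calls rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1).
reps <- mapply(rep, 1:4, 4:1)
print(reps)   ## results differ in length, so a list is returned

## Element-wise sum of two vectors.
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
print(sums)

## MoreArgs holds arguments that stay the same for every call.
squares <- mapply(function(x, p) x^p, 1:3, MoreArgs = list(p = 2))
print(squares)
```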
tapply
tapply is used to apply a function over subsets of a vector.
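A minimal sketch of `tapply`: the second argument assigns each value to a group, and the function is applied within each group. The first example uses made-up numbers; the second uses the built-in `airquality` data set.

```r
x <- c(1, 2, 3, 10, 20, 30)
g <- factor(c("a", "a", "a", "b", "b", "b"))

group_means <- tapply(x, g, mean)   ## mean of x within each level of g
print(group_means)

## Mean wind speed per month in the built-in airquality data.
wind_by_month <- tapply(airquality$Wind, airquality$Month, mean)
print(wind_by_month)
```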
split
split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.
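> s <- split(airquality, airquality$Month)   ## split the data frame into one piece per month
> sapply(s, function(x) colMeans(x[, c("Ozone", "Wind")]))   ## Ozone is NA because the column contains missing values
             5        6        7        8     9
Ozone       NA       NA       NA       NA    NA
Wind  11.62258 10.26667 8.941935 8.793548 10.18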
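The Ozone row in the output above is NA because `colMeans` does not skip missing values by default. A small variation, passing `na.rm = TRUE`, computes the means from the values that are present.

```r
s <- split(airquality, airquality$Month)   ## one data frame per month

## na.rm = TRUE drops missing values before averaging.
monthly <- sapply(s, function(x) {
  colMeans(x[, c("Ozone", "Wind")], na.rm = TRUE)
})
print(monthly)
```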