Step-by-Step Guide to Setting Up an R-Hadoop System

2015-12-15
This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps
or settings might be different for Windows or Ubuntu.
To install Hadoop on Windows, you can find detailed instructions at

Build and Install Hadoop 2.2 or newer on Windows

Below is a list of software used for this setup.

OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0

Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17

R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0

This process should work with Hadoop 2.2 or above and newer versions of HBase as well, but I haven't tested it yet.

Homebrew is the missing package manager for Mac OS X, and it is needed for installing git, pkg-config and thrift. For other operating systems, the equivalents of Homebrew are apt-get on Ubuntu and yum on CentOS.

By the way, the two most painful steps in this process are setting up HBase on Hadoop in cluster mode and installing rhbase. If you want a quick start or are not going to use HBase, you do not need to install thrift, HBase or rhbase, and can therefore skip

step 3 - Install HBase,
step 5.4 - Install thrift 0.9.0, and
installing rhbase at step 7.3.

If you are going to set up RHadoop on Linux, see RHadoop
Installation Guide for Red Hat Enterprise Linux.

1. Set up single-node Hadoop

If you are building a Hadoop system for the first time, it is suggested to start with stand-alone mode first, and then switch to pseudo-distributed mode and cluster (fully-distributed) mode.

1.1 Download Hadoop

Download Hadoop from http://hadoop.apache.org/releases.html#Download and then unpack it.

1.2 Set up Hadoop in standalone mode

1.2.1 Set JAVA_HOME

In file conf/hadoop-env.sh, add the line below:

export JAVA_HOME=/Library/Java/Home


1.2.2 Set up remote desktop and enabling self-login

Open the “System Preferences” window, and click “Sharing” (under “Internet & Wireless”). In the list of services, check “Remote Login”. For extra security, you can select the radio button “Allow access for only these users” and select your account, which we assume is “hadoop”.

After that, generate a key pair and save the authorized keys so that you can log in to localhost without typing a password.

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
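
To confirm that self-login works, the command below should now log you in to localhost without prompting for a password.

ssh localhost
exit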


The above step to set up remote desktop and self-login was picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions on setting up Hadoop on Mac.

1.2.3 Run Hadoop

After that, run the commands below in a system console to check whether Hadoop has been installed properly in stand-alone mode.

## go to hadoop directory
cd hadoop-1.1.2

## see a list of Hadoop commands
bin/hadoop

## version of Hadoop
bin/hadoop version

## start Hadoop
bin/start-all.sh

## check Hadoop is running
jps

## stop Hadoop
bin/stop-all.sh


After running jps, you should see the list of services below.

                Hadoop 1.1.2        Hadoop 2.2.0 or above
master node     NameNode            NameNode
                SecondaryNameNode   ResourceManager
                JobTracker          JobHistoryServer
slave node      DataNode            DataNode
                TaskTracker         NodeManager

1.3 Test Hadoop

Then we test Hadoop with two examples to make sure that it works.

1.3.1 Example 1 - calculate pi

bin/hadoop jar hadoop-examples-*.jar pi 10 100


In the above code, the first argument (10) is the number of maps and the second the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, which in turn will take longer to run.
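
For example, raising the samples per map from 100 to 1,000 gives a closer estimate of pi at the cost of a longer run:

## 10 maps with 1,000 samples each
bin/hadoop jar hadoop-examples-*.jar pi 10 1000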

1.3.2 Example 2 - word count

In this example, all files in local folder hadoop-1.1.2/conf are copied to an HDFS directory input, to be used as input for pattern searching. Of course, you can use other available text files as input.

## copy files
bin/hadoop fs -put conf input

## run distributed grep, and save results in directory *output*
## The pattern to find is 'dfs[a-z.]+'.
## Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results.
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

## copy result from HDFS directory *output* to local directory *output*
bin/hadoop fs -get output output

## have a look at results
cat output/*
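
If you want to re-run this example, remove the two directories first, because Hadoop refuses to write to an output directory that already exists.

## clean up the HDFS directories (fs -rmr is the Hadoop 1.x syntax)
bin/hadoop fs -rmr input output

## remove the local copy of the results as well
rm -r output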


2 Set up Hadoop in cluster mode

If your Hadoop works in a standalone mode, you can then proceed to a cluster (full-distributed) mode.

2.1 Switching between different modes

You may want to keep settings for all three modes, because you will likely need to switch between them for trouble-shooting during HBase and RHadoop installation at later stages. Therefore, it is suggested to keep the settings for the three modes in three separate directories, conf.single, conf.pseudo and conf.cluster, and use the commands below to choose a specific setting. The same applies to HBase settings.

## choose ONE of the lines below; remove any existing link first (rm conf)
ln -s conf.single conf
ln -s conf.pseudo conf
ln -s conf.cluster conf


2.2 Set up name node (master machine)

Configure the following three files on the master machine (a minimal sketch is given below):

core-site.xml
hdfs-site.xml
mapred-site.xml

Then set the masters and slaves files:

file “masters”: IP address or hostname of the namenode (master machine)
file “slaves”: a list of IP addresses or hostnames of the datanodes (slave machines)
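
Below is a minimal sketch of the three files for Hadoop 1.x; the hostname master and the port numbers are conventional values assumed here, not settings from this guide, so adjust them to your cluster.

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>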

2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes

This is similar to step 1.2.2.

2.4 Copy public key

Copy the public key created on the master node to all slave nodes.
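
For example, assuming user account hadoop and a slave named slave1 (adjust both to your cluster), the key can be appended to each slave's authorized keys as below; repeat for every slave.

cat $HOME/.ssh/id_rsa.pub | ssh hadoop@slave1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'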

2.5 Firewall

Enable incoming connections for Java on all machines; otherwise, the slaves will not be able to receive any jobs.

2.6 Set up data nodes (slave machines)

Tar the Hadoop directory on the master node, copy it to all slaves, and then untar it.
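
A sketch of those three steps, again assuming account hadoop and a slave named slave1:

## on the master node
tar -czf hadoop-1.1.2.tar.gz hadoop-1.1.2
scp hadoop-1.1.2.tar.gz hadoop@slave1:~/

## untar on each slave
ssh hadoop@slave1 'tar -xzf hadoop-1.1.2.tar.gz'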

2.7 Format name node

Go to the Hadoop directory and run

bin/hadoop namenode -format


2.8 Run Hadoop

Start Hadoop

bin/start-all.sh

Monitor nodes and jobs with a browser:

NameNode and HDFS file system: http://IP_ADDR_OF_NAMENODE:50070
Hadoop JobTracker: http://IP_ADDR_OF_NAMENODE:50030

Stop Hadoop and MapReduce:

bin/stop-all.sh


2.9 Test Hadoop

To test Hadoop in cluster mode, use the same code given at step 1.3.

2.10 Further Information

More instructions on setting up Hadoop are available at the links below.

2.10.1 Single-node mode

Steps to install Hadoop 2.x release (Yarn or Next-Gen) on single node cluster setup

Hadoop MapReduce Next Generation - Setting up a Single Node Cluster

2.10.2 Cluster mode

Setting up Hadoop in clustered mode in Ubuntu

Steps to install Hadoop 2.x release (Yarn or Next-Gen) on multi-node cluster

Hadoop YARN Installation: The definitive guide

Hadoop MapReduce Next Generation - Cluster Setup

3. Set up HBase

3.1 Set up HBase

You can skip this step if you are not going to use HBase.

See links below for detailed instructions on setting up HBase on Hadoop.

HBase Installation : Fully Distributed Mode
How to use HBase & Hadoop Clustered
The Apache HBase Reference Guide
I used the settings given in section 2.4 - Example Configurations of The Apache HBase Reference Guide (the last link above) to set up HBase in fully distributed mode.
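
For reference, below is a minimal sketch of conf/hbase-site.xml for fully distributed mode; the hostname master is an assumption and should match your Hadoop settings, and the links above give the complete configurations.

<!-- conf/hbase-site.xml -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master</value>
  </property>
</configuration>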

3.2 Switching between different modes

As with Hadoop, it is suggested to start with stand-alone mode first. After that, you can switch to pseudo-distributed or cluster mode. However, it is suggested to keep the settings for all three modes, e.g., for possible switching between modes when you install RHadoop at a later stage. See step 2.1 for details about switching between different modes.

4. Install R

The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so this process should work with other versions of R as well, at least with R 2.15.2 and above.

It is recommended to install RStudio as well, if it is not installed yet. Although not mandatory, it makes R programming and managing R projects easier.

5. Install GCC, Homebrew, git, pkg-config and thrift

GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.

5.1 Download and install GCC

Download GCC from https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.

5.2 Install Homebrew

Homebrew is the missing package manager for Mac OS X. The current user account needs to be an administrator, or be granted administrator privileges with “su”, to install Homebrew.

su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor


Refer to the Homebrew website at http://brew.sh if you run into any errors at the above step.

5.3 Install git and pkg-config

brew install git
brew install pkg-config


5.4 Install thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation.

Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it did not work well for the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.

Do NOT run the command below, which will install the latest version of thrift (0.9.1 as of 9 May 2014).

## Do NOT run command below !!!
brew install thrift


Instead, follow steps below to install thrift 0.9.0.

$ brew versions thrift

Warning: brew-versions is unsupported and may be removed soon.
Please use the homebrew-versions tap instead: https://github.com/Homebrew/homebrew-versions

0.9.1    git checkout eccc96b Library/Formula/thrift.rb
0.9.0    git checkout c43fc30 Library/Formula/thrift.rb
0.8.0    git checkout e5475d9 Library/Formula/thrift.rb
0.7.0    git checkout 141ddb6 Library/Formula/thrift.rb

...

Find the formula for thrift 0.9.0 in the above list, and install with that formula.

## go to the Homebrew base directory
cd $( brew --prefix )

## check out thrift 0.9.0
git checkout c43fc30 Library/Formula/thrift.rb

## install thrift
brew install thrift


Then we check whether the pkg-config path is correct.

pkg-config --cflags thrift


The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include. Otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase.
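
If the path is wrong or thrift is not found at all, one possible fix (assuming Homebrew's default /usr/local/Cellar layout; this is a suggestion, not a required step) is to point pkg-config at the thrift keg's pkgconfig directory:

export PKG_CONFIG_PATH=/usr/local/Cellar/thrift/0.9.0/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags thrift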

If you have any problem with installing thrift 0.9.0, see details about how to install a specific version of formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.

5.5 More instructions

If there are problems with installing the packages above, more instructions can be found at the links below.

http://diggdata.in/post/67561846971/fetch-data-from-hbase-database-from-r-using-rhbase
https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase
Note that there are some differences between this process and the instructions at the links above. For example, on Mac there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I have not run the command below to copy the Thrift library.

sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/


6. Environment settings

Run the code below in R to set the environment variables for Hadoop.

Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")


Alternatively, add the export lines below to ~/.bashrc so that you do not need to set the variables every time.

export HADOOP_PREFIX=/Users/hadoop/hadoop-1.1.2
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar


7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr

7.1 Install relevant R packages

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",
"functional", "stringr", "plyr", "reshape2", "dplyr",
"R.methodsS3", "caTools", "Hmisc"))


The RHadoop packages depend on the above packages, which should be installed for all users, instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below: it should be in /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to it, use an administrator account.

The destination library can be set with function install.packages() using argument lib (see the example below), or, with RStudio, chosen from the drop-down list under “Install to library” in the pop-up window Install Packages.

## find your R libraries
.libPaths()
#"/Users/hadoop/Library/R/3.1/library"
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

## check which library a package was installed into
system.file(package="functional")
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional"

## install package to a specific library
install.packages("functional", lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")


In addition to the above packages, it is also suggested to install data.table. Without it, I came across an error when running an RHadoop job on a big dataset, although the same job worked fine on a smaller dataset. The reason could be that RHadoop uses data.table to handle large data.

install.packages("data.table")


7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING

Set the environment variables for Hadoop, if you have not done so at step 6.

Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")


7.3 Install RHadoop packages

Download packages rhdfs, rhbase, rmr2 and plyrmr from https://github.com/RevolutionAnalytics/RHadoop/wiki and install them. As at step 7.1, these packages need to be installed into a library for all users, instead of into a personal library. Otherwise, R-Hadoop jobs will fail on those nodes where the packages are not installed in the right library.

install.packages("<path>/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2_2.2.2.tar.gz", repos=NULL, type="source")
install.packages("<path>plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos=NULL, type="source")


7.4 Further information

If you follow the above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.

8. Run an R job on Hadoop

Below is an example to count words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop.

First, we copy some text files to HDFS folder wordcount/data.

## copy local text file to hdfs
bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/

After that, we can use R code below to run a Hadoop job for word counting.

Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

library(rmr2)

## map function: split each line into words, and emit (word, 1) pairs
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}

## reduce function: sum up the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}

## delete previous result if any
system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")

## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS
results <- from.dfs(out)

## check top 30 frequent words
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df[order(results.df$count, decreasing=T), ], 30)

If you can see a list of words and their frequencies, congratulations! You are now ready to do MapReduce work with R.

9. Setting up multiple users

Now you might want to set up accounts for other users to use Hadoop. Detailed instructions on that can be found at Setting
Up Multiple Users in Hadoop Clusters.

10. Further readings

More examples of R jobs on Hadoop with rmr2 can be found at

https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
https://github.com/RevolutionAnalytics/rmr2/archive/master.zip.
To learn MapReduce and Hadoop, below are some documents to read.

A presentation on Using R with Hadoop by Jeffrey Breen

MapReduce on Wikipedia

MapReduce: Simplified Data Processing on Large Clusters

A brief introduction to the Hadoop Distributed File System

Besides RHadoop, another way to run R jobs on Hadoop is using RHIPE.

RHIPE website

Large Complex Data: Divide and Recombine (D&R) with RHIPE

11. Contact and feedback

If you have successfully built your R-Hadoop system, could you please share your success with other R users at this thread in the RDataMining group? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to contact Yanchang Zhao at yanchang@rdatamining.com, or post your questions to my RDataMining group on LinkedIn.

Thanks.