Configuring Alluxio on HDFS
2016-05-05 10:41
Initial Steps
To run an Alluxio cluster on a set of machines, the Alluxio binaries must be deployed on each machine. You can either compile Alluxio yourself or download a binary package. Note that, by default, the pre-built Alluxio binaries work with HDFS 2.2.0; to use another version of Hadoop, recompile Alluxio from source and set the Hadoop version at build time using one of the methods below. Assume the root directory of the Alluxio source code is ${ALLUXIO_HOME}.
Modify the hadoop.version tag in the ${ALLUXIO_HOME}/pom.xml configuration file. For example, to use Hadoop 2.6.0, change <hadoop.version>2.2.0</hadoop.version> in the pom file to <hadoop.version>2.6.0</hadoop.version>, then rebuild with Maven:
$ mvn clean package
Alternatively, you can specify the Hadoop version on the command line when building with Maven. For example, for Hadoop HDFS 2.6.0:
$ mvn -Dhadoop.version=2.6.0 clean package
If everything goes well, you should see the alluxio-assemblies-1.0.1-jar-with-dependencies.jar file in the assembly/target directory; use this jar to run the Alluxio Master and Worker.
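The assembly jar name tracks the Alluxio version. A quick sanity check after the build might look like the following (the version 1.0.1 matches the one used throughout this guide):

```shell
# Construct the expected assembly jar path for the version used in this guide.
ALLUXIO_VERSION=1.0.1
JAR="assembly/target/alluxio-assemblies-${ALLUXIO_VERSION}-jar-with-dependencies.jar"
echo "$JAR"
# In a real source checkout you would then confirm the build produced it:
#   ls -lh "$JAR"
```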
Configuring Alluxio
Before running the Alluxio binaries, you must first create a configuration file. Create one from the template file:
$ cp conf/alluxio-env.sh.template conf/alluxio-env.sh
Then edit the alluxio-env.sh file and set the under storage address to the address of the HDFS namenode (for example, hdfs://localhost:9000 if your HDFS namenode is running locally on the default port):
export ALLUXIO_UNDERFS_ADDRESS=hdfs://NAMENODE:PORT
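As a sanity check on the address format (NAMENODE and PORT above are placeholders; localhost:9000 below is just the default-port example from the text):

```shell
# Split the under-storage address into its scheme and authority parts
# using plain shell parameter expansion.
ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000
scheme=${ALLUXIO_UNDERFS_ADDRESS%%://*}
authority=${ALLUXIO_UNDERFS_ADDRESS#*://}
echo "scheme=${scheme} authority=${authority}"
```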
Running Alluxio Locally with HDFS
After configuration, you can start Alluxio locally and check that everything runs correctly:
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local
This should start one Alluxio master and one Alluxio worker; you can visit http://localhost:19999 in a browser to view the master Web UI.
Next, you can run a simple example program:
$ ./bin/alluxio runTests
After it runs successfully, visit the HDFS Web UI at http://localhost:50070 and confirm that it contains the files and directories created by Alluxio. For this test, the created files should have names like:
/alluxio/data/default_tests_files/BasicFile_STORE_SYNC_PERSIST
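Besides the web UI, persistence can also be checked from the command line. A hedged sketch, with the namenode address and test-file directory taken from the examples above:

```shell
# Build the full HDFS URI of the directory that runTests persists into.
NAMENODE=hdfs://localhost:9000
TEST_DIR=/alluxio/data/default_tests_files
echo "${NAMENODE}${TEST_DIR}"
# Against a live cluster, you would then list the persisted files with:
#   hadoop fs -ls ${NAMENODE}${TEST_DIR}
```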
Run the following command to stop Alluxio:
$ ./bin/alluxio-stop.sh all
Running Spark on Alluxio
This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS, and enables frameworks like Spark to read data from or write data to any number of those systems.
Compatibility
Alluxio works together with Spark 1.1 or later out of the box.
Prerequisites
General Setup
An Alluxio cluster has been set up in accordance with these guides for either Local Mode or Cluster Mode.
The Alluxio client will need to be compiled with the Spark specific profile. Build the entire project from the top level alluxio directory with the following command:
mvn clean package -Pspark -DskipTests
Add the following line to spark/conf/spark-env.sh:
export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH
Additional Setup for HDFS
If Alluxio is run on top of a Hadoop 1.x cluster, create a new file spark/conf/core-site.xml with the following content:
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
If you are running Alluxio in fault tolerant mode with ZooKeeper and the Hadoop cluster is 1.x, add the following additional entry to the previously created spark/conf/core-site.xml:
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
</property>
and the following line to spark/conf/spark-env.sh:
export SPARK_JAVA_OPTS=" -Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181 -Dalluxio.zookeeper.enabled=true $SPARK_JAVA_OPTS "
Use Alluxio as Input and Output
This section shows how to use Alluxio as input and output sources for your Spark applications.
Use Data Already in Alluxio
First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:
$ bin/alluxio fs copyFromLocal LICENSE /LICENSE
Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE.
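The map above doubles each line; the same transformation can be imitated locally with standard tools to see what LICENSE2 should contain (an illustrative sketch in shell rather than Scala):

```shell
# Double every input line, as s.map(line => line + line) does.
printf 'abc\ndef\n' | awk '{print $0 $0}'
# prints: abcabc, then defdef
```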
Use Data from HDFS
Alluxio supports transparently fetching data from the under storage system, given the exact path. Put the file LICENSE into HDFS, assuming the namenode is running on localhost and the Alluxio project directory is /alluxio:
$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/LICENSE
Note that Alluxio has no notion of this file yet; you can verify this by going to the web UI. Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE. Also, the LICENSE file now appears in the Alluxio file system space.
NOTE: It is possible that the LICENSE file is not in Alluxio storage (Not In-Memory). This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1 as there is only 1 block.
> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Using Fault Tolerant Mode
When running Alluxio in fault tolerant mode, you can point to any Alluxio master:
> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")
Data Locality
If Spark task locality is ANY when it should be NODE_LOCAL, it is probably because Alluxio and Spark use different network address representations; maybe one of them uses hostnames while the other uses IP addresses. Please refer to this JIRA ticket for more details, where you can find solutions from the Spark community.
Note: Alluxio uses hostnames to represent network addresses, except in version 0.7.1 where IP addresses are used. Spark v1.5.x ships with Alluxio v0.7.1 by default, so in that case Spark and Alluxio both use IP addresses by default and data locality should work out of the box. But since release 0.8.0, to be consistent with HDFS, Alluxio represents network addresses by hostname. There is a workaround when launching Spark to achieve data locality: users can explicitly specify hostnames using the following script offered by Spark. Start a Spark Worker on each slave node with its hostname:
$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname> <spark master uri>
For example:
$ $SPARK_HOME/sbin/start-slave.sh -h simple30 spark://simple27:7077
You can also set SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/spark-env.sh to achieve this. For example:
SPARK_LOCAL_HOSTNAME=simple30
Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL, as shown in the Spark Web UI.
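A quick way to see which address representations a node can report (a diagnostic sketch, not an official Alluxio or Spark tool): compare the hostname with its resolved IP. If Spark workers register with one form while Alluxio workers register with the other, locality degrades to ANY.

```shell
# Print both representations of this node's network address.
HOST=$(hostname)
echo "hostname: ${HOST}"
# Resolve it to an IP; getent is assumed available (use `hostname -i` otherwise).
getent hosts "${HOST}" | awk '{print "ip:", $1}' || true
```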
Running Hadoop MapReduce on Alluxio
This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.
Initial Setup
The prerequisite for this part is that you have Java. We also assume that you have set up Alluxio and Hadoop in accordance with these guides for Local Mode or Cluster Mode. In order to run some simple map-reduce examples, we also recommend you download the map-reduce examples jar, or, if you are using Hadoop 1, this examples jar.
Compiling the Alluxio Client
In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:
$ mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests
The <YOUR_HADOOP_VERSION> placeholder supports many different distributions of Hadoop. For example,
$ mvn install -Dhadoop.version=2.7.1 -DskipTests
would compile Alluxio for Apache Hadoop version 2.7.1. Please visit the Building Alluxio Master Branch page for more information about support for other distributions.
After the compilation succeeds, the new Alluxio client jar can be found at:
core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar
This is the jar that you should use for the rest of this guide.
Configuring Hadoop
You need to add the following three properties to the core-site.xml file in your Hadoop installation's conf directory:
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
This will allow your MapReduce jobs to use Alluxio for their input and output files. If you are using HDFS as the under storage system for Alluxio, it may be necessary to add these properties to the hdfs-site.xml file as well.
In order for the Alluxio client jar to be available to the JobClient, you can modify HADOOP_CLASSPATH by changing hadoop-env.sh to include:
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}
This allows the code that creates and submits the Job to use URIs with the Alluxio scheme.
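To confirm the jar actually landed on the classpath, you can split and grep the variable. The /opt/alluxio prefix below is a hypothetical stand-in for /<PATH_TO_ALLUXIO>:

```shell
# Append the client jar (hypothetical install prefix) and count matching entries.
export HADOOP_CLASSPATH=/opt/alluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}
echo "$HADOOP_CLASSPATH" | tr ':' '\n' | grep -c alluxio-core-client
```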
Distributing the Alluxio Client Jar
In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This allows the TaskTracker and JobClient to have all the requisite executables to interface with Alluxio.
The guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way is to manually distribute the client jar to all the Hadoop nodes. Below are instructions for the 2 main alternatives:
1. Using the -libjars command line option. You can run a job by using the -libjars command line option when using hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:
$ hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar <INPUT FILES> <OUTPUT DIRECTORY>
2. Distributing the jars to all nodes manually. For installing Alluxio on each node, you must place the client jar alluxio-core-client-1.0.1-jar-with-dependencies.jar (located in the /<PATH_TO_ALLUXIO>/core/client/target/ directory) in the $HADOOP_HOME/lib directory (which may be $HADOOP_HOME/share/hadoop/common/lib for different versions of Hadoop) of every MapReduce node, and then restart all of the TaskTrackers. One caveat of this approach is that the jars must be installed again for each update to a new release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.
Running Hadoop wordcount with Alluxio Locally
First, compile Alluxio with the appropriate Hadoop version:
$ mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>
For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running:
$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh
Configure Alluxio to use the local HDFS cluster as its under storage system. You can do this by modifying conf/alluxio-env.sh to include:
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000
Start Alluxio locally:
$ ./bin/alluxio-stop.sh all
$ ./bin/alluxio-start.sh local
You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:
$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.
Now we can run a MapReduce job for wordcount:
$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:
$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000
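If you just want a feel for what the part-r-00000 contents look like without a cluster, the word counts can be imitated locally with standard tools (illustrative only; this is not how Hadoop computes them, and the real output uses word<TAB>count ordering):

```shell
# Emit "count word" pairs, analogous to wordcount's per-word totals:
# alluxio appears twice in the input, hdfs once.
printf 'alluxio hdfs alluxio\n' | tr ' ' '\n' | sort | uniq -c | sort -rn
```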