
Configuring Alluxio on HDFS



Initial Setup

To run an Alluxio cluster on a set of machines, the Alluxio binaries need to be deployed on each machine. You can either compile Alluxio from source yourself or download the pre-compiled binary package.

Note that, by default, the pre-compiled Alluxio binaries are built to work with HDFS 2.2.0. To use another version of Hadoop, you need to recompile Alluxio from source, setting the Hadoop version during the build in one of the following ways. Assume the root directory of the Alluxio source code is ${ALLUXIO_HOME}.

Modify the hadoop.version tag in the ${ALLUXIO_HOME}/pom.xml configuration file. For example, to work with Hadoop 2.6.0, change "<hadoop.version>2.2.0</hadoop.version>" in the pom file to "<hadoop.version>2.6.0</hadoop.version>", and then recompile with Maven:

$ mvn clean package


Alternatively, you can specify the corresponding Hadoop version on the command line when compiling with Maven. For example, for Hadoop HDFS 2.6.0:

$ mvn -Dhadoop.version=2.6.0 clean package


If everything succeeds, you should see the file alluxio-assemblies-1.0.1-jar-with-dependencies.jar in the assembly/target directory; this is the jar used to run the Alluxio Master and Workers.
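
As a quick sanity check, you can list the assembly directory to confirm that the jar was produced (a minimal sketch; the exact jar name depends on the Alluxio version you built):

$ ls ${ALLUXIO_HOME}/assembly/target/alluxio-assemblies-*-jar-with-dependencies.jar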


Configuring Alluxio

Before running the Alluxio binaries, you must first create a configuration file. Create one from the template file:

$ cp conf/alluxio-env.sh.template conf/alluxio-env.sh


Then edit the alluxio-env.sh file and set the under storage address to the address of the HDFS namenode (for example, hdfs://localhost:9000 if your HDFS namenode is running locally on the default port):

export ALLUXIO_UNDERFS_ADDRESS=hdfs://NAMENODE:PORT



Running Alluxio Locally with HDFS

After the configuration is complete, you can start Alluxio locally and check that everything runs correctly:

$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local


This should start one Alluxio master and one Alluxio worker. You can visit http://localhost:19999 in your browser to view the master Web UI.

Next, you can run a simple example program:

$ ./bin/alluxio runTests


After the test runs successfully, visit the HDFS Web UI at http://localhost:50070 and confirm that it contains the files and directories created by Alluxio. For this test, you should see files with names like:
/alluxio/data/default_tests_files/BasicFile_STORE_SYNC_PERSIST

Run the following command to stop Alluxio:

$ ./bin/alluxio-stop.sh all

Running Spark on Alluxio

This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS, enabling frameworks like Spark to read data from or write data to any number of those systems.

Compatibility

Alluxio works with Spark 1.1 or later out of the box.

Prerequisites

General Setup

The Alluxio cluster has been set up in accordance with these guides for either Local Mode or Cluster Mode.
The Alluxio client will need to be compiled with the Spark-specific profile. Build the entire project from the top-level alluxio directory with the following command:
mvn clean package -Pspark -DskipTests
Add the following line to spark/conf/spark-env.sh:
export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH

Additional Setup for HDFS

If Alluxio is run on top of a Hadoop 1.x cluster, create a new file spark/conf/core-site.xml with the following content:
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
If you are running Alluxio in fault tolerant mode with ZooKeeper and the Hadoop cluster is 1.x, additionally add the following entry to the previously created spark/conf/core-site.xml:
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
</property>
and the following line to spark/conf/spark-env.sh:
export SPARK_JAVA_OPTS="
-Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
-Dalluxio.zookeeper.enabled=true
$SPARK_JAVA_OPTS
"

Use Alluxio as Input and Output

This section shows how to use Alluxio as input and output sources for your Spark applications.

Use Data Already in Alluxio

First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:
$ bin/alluxio fs copyFromLocal LICENSE /LICENSE
Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE.
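
You can also confirm the result from the Alluxio command line. Note that saveAsTextFile writes a directory of part files rather than a single file (a hedged sketch, assuming you are still in the Alluxio project directory):

$ bin/alluxio fs ls /LICENSE2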

Use Data from HDFS

Alluxio supports transparently fetching the data from the under storage system, given the exact path. Put a file LICENSE into HDFS, assuming the namenode is running on localhost and the Alluxio project directory is /alluxio:
$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/LICENSE
Note that Alluxio initially has no notion of the file; you can verify this by going to the web UI. Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE. Also, the LICENSE file now appears in the Alluxio file system space.

NOTE: It is possible that the LICENSE file is not in Alluxio storage (Not In-Memory). This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1 as there is only 1 block:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Using Fault Tolerant Mode

When running Alluxio in fault tolerant mode, you can point to any Alluxio master:
> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")

Data Locality

If Spark task locality is ANY while it should be NODE_LOCAL, it is probably because Alluxio and Spark use different network address representations; one of them may use a hostname while the other uses an IP address. Please refer to this JIRA ticket for more details, where you can find solutions from the Spark community.

Note: Alluxio uses hostnames to represent network addresses, except in version 0.7.1 where IP addresses are used. Spark v1.5.x ships with Alluxio v0.7.1 by default, so in that case both Spark and Alluxio use IP addresses by default, and data locality should work out of the box. Since release 0.8.0, however, Alluxio represents network addresses by hostname to be consistent with HDFS. There is a workaround when launching Spark to achieve data locality: users can explicitly specify hostnames using the following script offered by Spark. Start the Spark Worker on each slave node with slave-hostname:
$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname> <spark master uri>
For example:
$ $SPARK_HOME/sbin/start-slave.sh -h simple30 spark://simple27:7077
You can also set the SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/spark-env.sh to achieve this. For example:
SPARK_LOCAL_HOSTNAME=simple30
Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL, which you can confirm in the Spark Web UI.



Running Hadoop MapReduce on Alluxio

This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.

Initial Setup

The prerequisite for this part is that you have Java installed. We also assume that you have set up Alluxio and Hadoop in accordance with these guides for either Local Mode or Cluster Mode. In order to run some simple MapReduce examples, we also recommend you download the MapReduce examples jar, or, if you are using Hadoop 1, this examples jar.

Compiling the Alluxio Client

In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:
$ mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests
The version <YOUR_HADOOP_VERSION> supports many different distributions of Hadoop. For example, mvn install -Dhadoop.version=2.7.1 -DskipTests would compile Alluxio for Apache Hadoop version 2.7.1. Please visit the Building Alluxio Master Branch page for more information about support for other distributions.

After the compilation succeeds, the new Alluxio client jar can be found at:
core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar
This is the jar that you should use for the rest of this guide.

Configuring Hadoop

You need to add the following three properties to the core-site.xml file in your Hadoop installation's conf directory:
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
This will allow your MapReduce jobs to use Alluxio for their input and output files. If you are using HDFS as the under storage system for Alluxio, it may be necessary to add these properties to the hdfs-site.xml file as well.

In order for the Alluxio client jar to be available to the JobClient, you can modify HADOOP_CLASSPATH by adding the following to hadoop-env.sh:
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}
This allows the code that creates and submits the job to use URIs with the Alluxio scheme.
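
Once these properties are in place and the client jar is on HADOOP_CLASSPATH, a quick way to confirm the wiring is to list an Alluxio path through the Hadoop CLI (a hedged check, assuming a local Alluxio master on the default port 19998):

$ hadoop fs -ls alluxio://localhost:19998/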

Distributing the Alluxio Client Jar

In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This allows the TaskTracker and JobClient to have all the requisite executables to interface with Alluxio.

This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way is to manually distribute the client jar to all the Hadoop nodes. Below are instructions for the two main alternatives:

1. Using the -libjars command line option. You can run a job by using the -libjars command line option when using hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:
$ hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar <INPUT FILES> <OUTPUT DIRECTORY>
2. Distributing the jar to all nodes manually. To install Alluxio on each node, place the client jar alluxio-core-client-1.0.1-jar-with-dependencies.jar (located in the /<PATH_TO_ALLUXIO>/core/client/target/ directory) in the $HADOOP_HOME/lib directory (which may be $HADOOP_HOME/share/hadoop/common/lib for other versions of Hadoop) of every MapReduce node, and then restart all of the TaskTrackers. One caveat of this approach is that the jar must be installed again after each upgrade to a new release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.

Running Hadoop wordcount with Alluxio Locally

First, compile Alluxio with the appropriate Hadoop version:
$ mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>
For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running:
$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh
Configure Alluxio to use the local HDFS cluster as its under storage system. You can do this by modifying conf/alluxio-env.sh to include:
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000
Start Alluxio locally:
$ ./bin/alluxio-stop.sh all
$ ./bin/alluxio-start.sh local
You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:
$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.

Now we can run a MapReduce job for wordcount:
$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:
$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000