$ mvn clean package

Alternatively, you can specify the Hadoop version on the command line when building with Maven. For example, to build against Hadoop HDFS 2.6.0:

$ mvn -Dhadoop.version=2.6.0 clean package




$ cp conf/alluxio-env.sh.template conf/alluxio-env.sh

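Edit the resulting file, setting the under storage system address to the address of the HDFS namenode (for example, for an HDFS namenode running locally on the default port).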




$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local

This command should start one Alluxio master and one Alluxio worker. You can view the master Web UI by visiting http://localhost:19999 in your browser.


$ ./bin/alluxio runTests

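After the tests succeed, visit the HDFS Web UI at http://localhost:50070 to confirm that it contains the files and directories created by Alluxio. For this test, the created file names should look like this: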


$ ./bin/alluxio-stop.sh all

Running Spark on Alluxio

This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS, and enables frameworks like Spark to read data from or write data to any number of those systems.


Alluxio works together with Spark 1.1 or later out-of-the-box.


General Setup

An Alluxio cluster has been set up in accordance with the guides for either Local Mode or Cluster Mode.
The Alluxio client must be compiled with the Spark-specific profile. Build the entire project from the top-level directory with the following command:
$ mvn clean package -Pspark -DskipTests
Add the following line to 
export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH
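The export above prepends the client jar unconditionally. A slightly more defensive variant is sketched below. It uses a hypothetical helper, prepend_classpath, which is not part of Spark or Alluxio. The helper avoids a trailing colon when the existing classpath is empty and does not add the same jar twice.

```shell
# prepend_classpath JAR EXISTING
# Prints JAR prepended to EXISTING, a colon-separated classpath.
# Hypothetical helper for illustration; not shipped with Spark or Alluxio.
prepend_classpath() {
  jar=$1
  existing=$2
  case ":$existing:" in
    *":$jar:"*) printf '%s\n' "$existing" ;;      # jar already listed: leave unchanged
    "::")       printf '%s\n' "$jar" ;;           # empty classpath: no trailing colon
    *)          printf '%s\n' "$jar:$existing" ;; # otherwise put the jar first
  esac
}

# Possible use in the Spark environment file (left commented out here):
# export SPARK_CLASSPATH="$(prepend_classpath /pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar "$SPARK_CLASSPATH")"
```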

Additional Setup for HDFS

If Alluxio is run on top of a Hadoop 1.x cluster, create a new file 
 with the following content:
If you are running Alluxio in fault tolerant mode with ZooKeeper and the Hadoop cluster is 1.x, add the following additional entry to the previously created 
and the following line to 

Use Alluxio as Input and Output

This section shows how to use Alluxio as input and output sources for your Spark applications.

Use Data Already in Alluxio

First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:
$ bin/alluxio fs copyFromLocal LICENSE /LICENSE
Run the following commands from the Spark shell, assuming the Alluxio master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 in which each line of the file LICENSE is doubled.

Use Data from HDFS

Alluxio can transparently fetch data from the under storage system, given the exact path. Put a file LICENSE into HDFS, assuming the namenode is running on localhost and the Alluxio project directory is /alluxio:
$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/LICENSE
Note that Alluxio does not know about the file yet. You can verify this by going to the web UI. Run the following commands from the Spark shell, assuming the Alluxio master is running on localhost:
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 in which each line of the file LICENSE is doubled. Also, the LICENSE file now appears in the Alluxio file system space.
NOTE: It is possible that the LICENSE file is not in Alluxio storage (Not In-Memory). This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1, as there is only 1 block.
> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
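The note above ties the partition count to the number of Alluxio blocks in the file. As a rough sketch, the block count is the file size divided by the block size, rounded up. The helper name block_count and the sizes in the usage comment are illustrative assumptions. Use the block size your cluster is actually configured with rather than the figure shown.

```shell
# block_count FILE_SIZE_BYTES BLOCK_SIZE_BYTES
# Prints how many blocks a file of the given size occupies (ceiling division).
# Hypothetical helper for illustration; the block size is an input that the
# caller must take from their own Alluxio configuration.
block_count() {
  size=$1
  block=$2
  if [ "$size" -eq 0 ]; then
    echo 0
  else
    echo $(( (size + block - 1) / block ))
  fi
}

# Example with assumed numbers: a small text file and a 512 MB block size.
# The file fits in one block, so a partition count of 1 lets a single task
# read that block in full.
# block_count 11358 536870912
```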

Using Fault Tolerant Mode

When running Alluxio with fault tolerant mode, you can point to any Alluxio master:
> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")

Data Locality

If Spark tasks do not reach the NODE_LOCAL locality level when they should, it is probably because Alluxio and Spark use different network address representations: one may use hostnames while the other uses IP addresses. Please refer to this jira ticket for more details, where you can find solutions from the Spark community.
Note: Alluxio uses hostnames to represent network addresses, except in version 0.7.1, where IP addresses are used. Spark v1.5.x ships with Alluxio v0.7.1 by default; in that case Spark and Alluxio both use IP addresses, so data locality should work out of the box. Since release 0.8.0, to be consistent with HDFS, Alluxio represents network addresses by hostname. There is a workaround when launching Spark to achieve data locality: users can explicitly specify hostnames by using the following script offered in Spark. Start the Spark Worker on each slave node with its slave-hostname:
$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname> <spark master uri>
For example:
$ $SPARK_HOME/sbin/start-slave.sh -h simple30 spark://simple27:7077
You can also set the 
 to achieve this. For example:
Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL in the Spark web UI.
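One quick way to spot the mismatch described in this section is to check whether the addresses reported by Spark and by Alluxio are IP addresses or hostnames. The sketch below uses two hypothetical helpers, address_kind and same_representation. They classify a string by its shape only and do not resolve names or query either system.

```shell
# address_kind ADDRESS
# Prints "ipv4" if ADDRESS has the dotted-quad shape, otherwise "hostname".
# The check is purely textual: it does not validate octet ranges.
address_kind() {
  if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo ipv4
  else
    echo hostname
  fi
}

# same_representation A B
# Succeeds when both addresses are of the same kind (both IPs or both hostnames).
same_representation() {
  [ "$(address_kind "$1")" = "$(address_kind "$2")" ]
}

# Example: compare a worker address copied from the Spark web UI with one
# copied from the Alluxio web UI (values here are placeholders).
# same_representation simple30 192.168.1.30 || echo "address representations differ"
```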

Running Hadoop MapReduce on Alluxio

This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.

Initial Setup

The prerequisite for this part is that you have Java installed. We also assume that you have set up Alluxio and Hadoop in accordance with the Local Mode or Cluster Mode guides. In order to run some simple MapReduce examples, we also recommend that you download the map-reduce examples jar or, if you are using Hadoop 1, this examples jar.

Compiling the Alluxio Client

In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:
$ mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests
The <YOUR_HADOOP_VERSION> value supports many different distributions of Hadoop. For example,
$ mvn install -Dhadoop.version=2.7.1 -DskipTests
would compile Alluxio for Apache Hadoop version 2.7.1. Please visit the Building Alluxio Master Branch page for more information about support for other distributions.
After the compilation succeeds, the new Alluxio client jar can be found at:
This is the jar that you should use for the rest of this guide.
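If you are unsure which value to pass as the Hadoop version, one option is to read it from the first line of the hadoop version output, which has the form "Hadoop 2.7.1". The sketch below uses a hypothetical helper, parse_hadoop_version. Vendor distributions may report version strings with suffixes, so treat the result as a starting point and confirm it against the build documentation.

```shell
# parse_hadoop_version
# Reads the output of `hadoop version` on stdin and prints the version taken
# from its first line ("Hadoop X.Y.Z" -> "X.Y.Z"). Prints nothing if the
# first line does not start with "Hadoop".
parse_hadoop_version() {
  sed -n '1s/^Hadoop[[:space:]]\{1,\}//p'
}

# Possible use (needs the hadoop command on PATH, so it is commented out):
# mvn install -Dhadoop.version="$(hadoop version | parse_hadoop_version)" -DskipTests
```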

Configuring Hadoop

You need to add the following three properties to 
 file in your Hadoop installation 
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
<description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
This will allow your MapReduce jobs to use Alluxio for their input and output files. If you are using HDFS as the under storage system for Alluxio, it may be necessary to add these properties to the 
 file as well.
In order for the Alluxio client jar to be available to the JobClient, you can modify 
 by changing 
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}
This allows the code that creates and submits the Job to use URIs with the Alluxio scheme.

Distributing the Alluxio Client Jar

In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This allows the TaskTracker and JobClient to have all the requisite executables to interface with Alluxio.
This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way to distribute the client jar is to manually distribute it to all the Hadoop nodes. Below are instructions for the 2 main alternatives:
1. Using the -libjars command line option. You can run a job by using the -libjars command line option when using hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:
$ hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar <INPUT FILES> <OUTPUT DIRECTORY>
2. Distributing the jars to all nodes manually. To install Alluxio on each node, you must place the client jar alluxio-core-client-1.0.1-jar-with-dependencies.jar (located in the /<PATH_TO_ALLUXIO>/core/client/target/ directory) in the library directory of the Hadoop installation on every MapReduce node (the exact directory differs between Hadoop versions), and then restart all of the TaskTrackers. One caveat of this approach is that the jars must be installed again for each update to a new release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.
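For the manual alternative, the copy step can be scripted. The sketch below is a dry run under stated assumptions: it supposes the nodes are reachable with scp, and the helper name distribute_jar, the host names, and the destination directory are all placeholders that you must replace with values for your cluster. It only prints the commands it would run, so you can review them before executing anything.

```shell
# distribute_jar JAR DEST_DIR HOST...
# Prints one scp command per host; nothing is copied.
# Hypothetical helper for illustration. DEST_DIR is the Hadoop library
# directory on the remote nodes, which depends on your Hadoop version.
distribute_jar() {
  jar=$1
  dest=$2
  shift 2
  for host in "$@"; do
    printf 'scp %s %s:%s/\n' "$jar" "$host" "$dest"
  done
}

# Example with placeholder hosts and destination (prints two scp commands):
# distribute_jar /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar /path/to/hadoop/lib node1 node2
```

After the jar has been copied, the TaskTrackers still need to be restarted as described above.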

Running Hadoop wordcount with Alluxio Locally

First, compile Alluxio with the appropriate Hadoop version:
$ mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>
For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running:
$ ./bin/stop-all.sh
$ ./bin/start-all.sh
Configure Alluxio to use the local HDFS cluster as its under storage system. You can do this by modifying conf/alluxio-env.sh to include:
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000
Start Alluxio locally:
$ ./bin/alluxio-stop.sh all
$ ./bin/alluxio-start.sh local
You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:
$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.
Now we can run a MapReduce job for wordcount:
$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:
$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000
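The wordcount example writes one line per word, with the word and its count separated by a tab. To inspect the most frequent words, the output can be piped through a small filter. The helper below, top_words, is a hypothetical name used for illustration. The usage comment assumes that the cat command prints only the file contents.

```shell
# top_words N
# Reads "word<TAB>count" lines on stdin and prints the N lines with the
# highest counts, largest first. Ties are ordered by word in byte order.
top_words() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$1"
}

# Possible use from the Alluxio directory (commented out because it needs a
# running cluster):
# ./bin/alluxio fs cat /wordcount/output/part-r-00000 | top_words 10
```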
