
Tutorial on how to install apache spark on Windows

2016-08-03 10:32
In this tutorial I will show you how I installed Apache Spark on Windows and how I set up IPython notebook to work with it.

Before installing Spark, I installed Python. I used the Anaconda Python distribution, which bundles most of the popular Python packages. You can download the distribution from the link below:
https://www.continuum.io/downloads


I downloaded the Anaconda installer for Python 3.5.

Once Anaconda was installed, I chose IPython as the environment.

To install it, open the Anaconda prompt and enter the following command:
conda install jupyter


Note: if you already have IPython installed, you may want to update it with the following command:
conda update jupyter


Next, I downloaded Spark from
http://spark.apache.org/downloads.html


I chose the Spark 1.6.0 package pre-built for Hadoop 2.6 and later.

After downloading it, I copied the archive to the C:\ drive, unzipped it, and renamed the extracted folder to spark, so that Spark lives at C:\spark. The following is a screenshot of the folder structure.

[Screenshot: C:\spark folder structure]

I renamed the file log4j.properties.template in the conf directory to log4j.properties and opened it in Notepad++, a clean and simple text editor; plain Notepad also works.

I changed the line that reads
log4j.rootCategory=INFO, console


to
log4j.rootCategory=WARN, console


Doing this reduces the flood of INFO messages printed to the console. If you want to quiet the output even further, show only ERROR messages by changing the line to
log4j.rootCategory=ERROR, console


After saving the file, I downloaded the Hadoop binary winutils.exe. Even though Spark runs independently of Hadoop, on Windows it still searches for winutils.exe (a Hadoop helper binary) and throws an error if it is missing.

I downloaded the file from the link below:
http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe


I created a folder named winutils in C:\, created a bin directory inside it, and placed the winutils.exe file there. The full path is:
C:\winutils\bin\winutils.exe
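To double-check the expected layout on the Python side, you can build and print the path with pathlib; PureWindowsPath works on any OS, so this is just a sketch of where the file should end up:

```python
from pathlib import PureWindowsPath

# Expected location of the Hadoop helper binary described above.
winutils = PureWindowsPath("C:/winutils") / "bin" / "winutils.exe"
print(winutils)  # C:\winutils\bin\winutils.exe
```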


Next I opened System Properties to add the environment variables. You can open System Properties by pressing WIN + R, which opens the Run dialog, and entering sysdm.cpl

I then clicked the Advanced tab, then Environment Variables, clicked New under user variables, and added the following:
variable name HADOOP_HOME with the value C:\winutils
variable name SPARK_HOME with the value C:\spark


I also edited the Path variable and appended %SPARK_HOME%\bin at the end
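If you would rather not edit System Properties, the same variables can be set for the current process only, from Python, before launching Spark. This is a sketch assuming the C:\spark and C:\winutils locations used above:

```python
import os

# Process-local equivalents of the user environment variables above
# (assumes the install locations described in this tutorial).
os.environ["HADOOP_HOME"] = r"C:\winutils"
os.environ["SPARK_HOME"] = r"C:\spark"

# Append the Spark bin directory to PATH, mirroring the Path edit above.
os.environ["PATH"] = (
    os.environ.get("PATH", "")
    + os.pathsep
    + os.path.join(os.environ["SPARK_HOME"], "bin")
)
print(os.environ["SPARK_HOME"])
```

Note that these settings last only for the current process and its children, which is often enough when launching pyspark from a script.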

Now, to launch Spark, I just opened the command prompt and entered pyspark to open the Spark shell.
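Once the shell is up, a quick sanity check confirms the SparkContext works; in the pyspark shell the context is predefined as sc. A REPL sketch (the exact version string depends on your download):

```
>>> sc.version                 # the predefined SparkContext
u'1.6.0'
>>> rdd = sc.parallelize(range(100))
>>> rdd.filter(lambda x: x % 2 == 0).count()
50
```

If this runs without a winutils-related error, the HADOOP_HOME setup above worked.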

Now, if you want to use IPython, add the following user variables to the environment variables in System Properties, just like HADOOP_HOME was set:
variable name PYSPARK_DRIVER_PYTHON with the value ipython
variable name PYSPARK_DRIVER_PYTHON_OPTS with the value notebook


Now, when you open a command prompt and enter pyspark, you will see a Jupyter notebook being launched.
Tags: spark