Spark: Implementing WordCount in Scala and Java
2015-01-14 17:29
This post implements WordCount in both Scala and Java. The Java version, JavaWordCount, is the example that ships with Spark ($SPARK_HOME/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java).
1. Environment
OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
Hadoop: Hadoop 2.4.1
JDK: 1.7.0_60
Spark: 1.1.0
Scala: 2.11.2
IDE: IntelliJ IDEA 13.1.3
Note: IDEA, Scala, and the JDK must be installed on the Windows client machine, and the Scala plugin must be installed in IDEA.
2. Word count in Scala
package com.hq

/**
 * User: hadoop
 * Date: 2014/10/10 0010
 * Time: 18:59
 */
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions that provide reduceByKey on pair RDDs

/**
 * Counts how many times each word occurs.
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    // Master URL and application name are supplied by spark-submit.
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val line = sc.textFile(args(0))

    // Split each line on single spaces, pair every word with 1, sum the counts per word,
    // then bring the results back to the driver and print them.
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}
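3. Word count in Java
The Java implementation is Spark's bundled JavaWordCount.java example (path given above); its listing is not reproduced here.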
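To see the counting logic without a cluster, here is a minimal plain-Java sketch of the same three steps as the Scala version in section 2: split each line on a single space, emit one count per word, and sum the counts per word. It uses only java.util collections rather than the Spark API, so it is not JavaWordCount itself. The class name LocalWordCount, the method countWords, and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {

    // Local stand-in for flatMap(split) -> map((word, 1)) -> reduceByKey(sum).
    // Hypothetical helper, not part of Spark.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            // Same delimiter as the Scala version: a single space.
            for (String word : line.split(" ")) {
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the lines of the HDFS file.
        List<String> lines = Arrays.asList("hello spark", "hello  scala");
        for (Map.Entry<String, Integer> e : countWords(lines).entrySet()) {
            System.out.println("(" + e.getKey() + "," + e.getValue() + ")");
        }
    }
}
```

Because the delimiter is a single space, two consecutive spaces produce an empty-string token, which is counted like any other word. If that matters for the input, one option is to split on a whitespace regex such as "\\s+" and skip empty tokens.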
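4. Packaging and running with IDEA
4.1 IDEA project structure
Create a Scala project in IDEA and add the Spark API jar as a dependency (spark-assembly-1.1.0-hadoop2.4.0.jar, found in $SPARK_HOME/lib/).
4.2 Building the jar
File ---> Project Structure
After the artifact is configured, choose Build->Build Artifacts... from the menu bar and package it with Build or a similar command. When packaging finishes, the status bar shows "Compilation completed successfully..."; the jar can then be found in the artifact output path.
ScalaTest1848.jar is the jar produced by this project. It contains three classes: HelloWord, WordCount, and JavaWordCount.
This jar can be used to run either the Java or the Scala word count on the Spark cluster.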
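4.3 Running the word count on a Spark cluster in standalone mode
Upload the jar to the server and place it at /home/ebupt/test/WordCount.jar.
Upload a text file to HDFS to serve as the word-count input: hdfs://eb170:8020/user/ebupt/text
Submit the job with the spark-submit command. For usage details, see spark-submit --help: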
[ebupt@eb174 bin]$ spark-submit --help
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
① Submit the Scala word count:
[ebupt@eb174 test]$ spark-submit --master spark://eb174:7077 --name WordCountByscala --class com.hq.WordCount --executor-memory 1G --total-executor-cores 2 ~/test/WordCount.jar hdfs://eb170:8020/user/ebupt/text
② Submit the Java word count:
[ebupt@eb174 test]$ spark-submit --master spark://eb174:7077 --name JavaWordCountByHQ --class com.hq.JavaWordCount --executor-memory 1G --total-executor-cores 2 ~/test/WordCount.jar hdfs://eb170:8020/user/ebupt/text
③ The two runs produce similar results, so only one was recorded.