
(Java) Spark: Counting How Many Lines of a Text File Contain a Given String

2017-02-27 11:34


I. Development environment:

JDK 8
Maven 3
IntelliJ IDEA 16
Windows 10

II. Create the Maven project, write the code, and test

1. POM configuration

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.hurz</groupId>
    <artifactId>spark.wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- Spark core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Maven compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <!-- Maven plugin that packages the jar file -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.0.2</version>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- Entry point: the class containing main -->
                            <mainClass>com.hurz.spark.wordcount.CountMain</mainClass>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

2. Write the counting method (a HelloWorld example).

Main code

package com.hurz.spark.wordcount;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.io.File;

/**
 * @author : hurz / heywakeup@gzulmu.com
 * @version : 1.0
 * @date : 2017/2/27
 */
public class CountMain {
    public static void main(String[] args) {
        if (args == null || args.length < 2) {
            throw new IllegalArgumentException("Invalid arguments: pass the file path and the word to count");
        }
        String filePath = args[0];
        // The path must point to an existing regular file, not a directory
        File temp = new File(filePath);
        if (!temp.exists() || temp.isDirectory()) {
            throw new IllegalArgumentException("Please pass a valid file path");
        }
        String countWord = args[1];
        if (countWord == null || countWord.trim().isEmpty()) {
            throw new IllegalArgumentException("Please pass the word to count");
        }

        SparkConf conf = new SparkConf().setAppName("MyApp");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(filePath).cache();
        // Count the lines that contain countWord
        long num = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) {
                return s.contains(countWord);
            }
        }).count();

        System.out.println("Lines with " + countWord + ": " + num);
        sc.stop();
    }
}
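
The filter above relies on String.contains, so it counts lines rather than occurrences, and the match is a case-sensitive substring match, not a whole-word match. The following is a minimal sketch of that behavior in plain Java without Spark; the class ContainsSemantics and the sample lines are hypothetical and are not part of the original project.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of the original project.
class ContainsSemantics {
    // Applies the same per-line test as the Spark filter: a case-sensitive substring match.
    static long countMatchingLines(List<String> lines, String word) {
        return lines.stream().filter(line -> line.contains(word)).count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "loved and loved again", // counted once, although the word appears twice
                "my beloved",            // counted: "loved" is a substring of "beloved"
                "Loved",                 // not counted: the match is case-sensitive
                "how love fled");        // not counted: "love fled" never forms "loved"
        System.out.println(countMatchingLines(lines, "loved")); // prints 2
    }
}
```

If whole-word or case-insensitive matching is wanted, the predicate inside the filter is the only part that would need to change.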

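3. Package the jar and run a test

Run mvn package to build the jar.

Upload the generated jar to the Spark server (in this example, directly into the Spark root directory).
Create the file tfile.txt in the Spark root directory:

touch tfile.txt
vi tfile.txt

Edit its contents to: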

《When You Are Old》
-- William Butler Yeats
William Butler Yeats
When you are old and grey and full of sleep,
And nodding by the fire, take down this book,
And slowly read, and dream of the soft look,
Your eyes had once, and of their shadows deep;
How many loved your moments of glad grace,
And loved your beauty with love false or true,
But one man loved the pilgrim soul in you,
And loved the sorrows of your changing face;
And bending down beside the glowing bars,
Murmur, a little sadly, how love fled,
And paced upon the mountains overhead,
And hid his face amid a crowd of stars.

Open the Spark client and run the test:

./bin/spark-submit --class com.hurz.spark.wordcount.CountMain wc.jar \
  tfile.txt \
  loved

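Note:
tfile.txt stands for the absolute path of the text file, e.g. /usr/local/aaa.txt; loved is the string to count; wc.jar stands for the absolute path of the jar, e.g. /usr/local/wc.jar. (To save effort, the jar and the text file used here were placed directly in the Spark root directory.)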
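
To cross-check the number the Spark job prints, the same count can be computed locally without Spark. This is a minimal sketch that assumes a UTF-8 text file; the class LocalLineCount is hypothetical and only mirrors the two arguments (file path, string) that CountMain takes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical local cross-check; not part of the original project.
class LocalLineCount {
    // Counts the lines of the file that contain the given string (case-sensitive substring match).
    static long countLines(Path file, String word) throws IOException {
        // try-with-resources closes the underlying file handle when the stream is done
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(word)).count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: LocalLineCount <file> <word>");
        }
        long num = countLines(Paths.get(args[0]), args[1]);
        System.out.println("Lines with " + args[1] + ": " + num);
    }
}
```

For the tfile.txt contents shown earlier and the string loved, four lines match (those beginning "How many loved", "And loved your beauty", "But one man loved", and "And loved the sorrows"); the line ending in "how love fled," does not, so the Spark job is expected to print Lines with loved: 4.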

Reference documentation: http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/submitting-applications.html
Source code download: http://download.csdn.net/detail/heywakeup/9765062
