007 - WordCount in Spark
2015-09-01 20:14
Test file contents:
[hadoop@mycluster ~]$ cat /home/hadoop/wc.txt
hello me
hello you
hello china
hello you
1. Read a local or HDFS file
When spark-shell starts, it creates a SparkContext object named sc; files are read through this context object.
Command: scala> val textFile = sc.textFile("/home/hadoop/wc.txt").collect
Result: textFile: Array[String] = Array(hello me, hello you, hello china, hello you)
2. Processing the file
2.1 flatMap: flatten each line into words, splitting on the tab character (the split pattern must match the file's actual delimiter; for space-separated files use " " or "\\s+")
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(line => line.split("\t")).collect
or, equivalently:
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).collect
Result:
textFile: Array[String] = Array(hello, me, hello, you, hello, china, hello, you)
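The flatten step can be sketched with plain Scala collections, with no Spark needed. The sample lines below stand in for the file contents; the sketch splits on \\s+ (any run of whitespace) so it works whether the file is tab- or space-delimited:

```scala
// Local stand-in for the lines of wc.txt (no Spark required).
val lines = Seq("hello me", "hello you", "hello china", "hello you")

// flatMap splits each line into words and flattens the per-line
// arrays into one sequence. "\\s+" matches any whitespace run.
val words = lines.flatMap(_.split("\\s+"))
// words: hello, me, hello, you, hello, china, hello, you
```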
2.2 map(word => (word,1)): pair each word with an initial count of 1
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(line => line.split("\t")).map(word => (word,1)).collect
or, equivalently:
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).collect
Result:
textFile: Array[(String, Int)] = Array((hello,1), (me,1), (hello,1), (you,1), (hello,1), (china,1), (hello,1), (you,1))
2.3 reduceByKey: sum the counts for each word
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( (a,b) => a + b ).collect
or, equivalently:
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).collect
Result:
textFile: Array[(String, Int)] = Array((hello,4), (me,1), (you,2), (china,1))
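reduceByKey has no direct counterpart on local Scala collections, but groupBy followed by a per-group sum computes the same per-word totals; a minimal sketch:

```scala
// (word, 1) pairs, as produced by the map step above.
val pairs = Seq("hello", "me", "hello", "you",
                "hello", "china", "hello", "you").map((_, 1))

// Group by word, then sum each group's counts -- the local
// equivalent of reduceByKey(_ + _).
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
// counts contains (hello,4), (me,1), (you,2), (china,1)
```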
2.4 sortByKey: sort by key (the word)
scala> val textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).sortByKey(true).collect
Result:
textFile: Array[(String, Int)] = Array((china,1), (hello,4), (me,1), (you,2))
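The same ordering can be reproduced on a local map with sortBy over the pairs; sorting by the word mirrors what sortByKey(true) does on the RDD:

```scala
// Word counts from the previous step, as a local map.
val counts = Map("hello" -> 4, "me" -> 1, "you" -> 2, "china" -> 1)

// Sort the (word, count) pairs alphabetically by word.
val byWord = counts.toSeq.sortBy(_._1)
// byWord: (china,1), (hello,4), (me,1), (you,2)
```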
2.5 Save the result to the local filesystem or HDFS
scala> sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).sortByKey(true).saveAsTextFile("/home/hadoop/output")
Output (one part file per partition):
[hadoop@mycluster output]$ more part-00000
(hello,4)
(me,1)
[hadoop@mycluster output]$ more part-00001
(you,2)
(china,1)
2.6 Produce a single output file
scala> sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).repartition(1).saveAsTextFile("/home/hadoop/output")
Output:
[hadoop@mycluster output]$ more part-00000
(hello,4)
(me,1)
(you,2)
(china,1)
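A side note on repartition(1): it triggers a full shuffle just to merge partitions. coalesce(1), also part of the RDD API, narrows the partition count without a full shuffle and is usually preferred here. A sketch (the path /home/hadoop/output2 is hypothetical, since saveAsTextFile fails if the target directory already exists):

```scala
// coalesce(1) merges the result into a single partition, and thus a
// single part-00000 file, without the full shuffle of repartition(1).
sc.textFile("/home/hadoop/wc.txt")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .coalesce(1)
  .saveAsTextFile("/home/hadoop/output2")  // hypothetical output path
```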
2.7 Sort with the most frequent words first
scala> val textFile = sc.textFile("hdfs://mycluster:9000/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).map(x => (x._2,x._1)).sortByKey(false).map( x => (x._2,x._1) ).collect
Result: textFile: Array[(String, Int)] = Array((hello,4), (you,2), (me,1), (china,1))
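On Spark 1.x and later, the swap-sort-swap pattern above can be replaced by RDD.sortBy, which takes a key-extractor function and an ascending flag; a sketch against the same input:

```scala
// sortBy(_._2, ascending = false) orders the pairs directly by
// count, descending, without swapping key and value twice.
val byCount = sc.textFile("hdfs://mycluster:9000/wc.txt")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .collect
```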
Note:
That completes the WordCount walkthrough. The same job can read from and write to HDFS:
scala> sc.textFile("hdfs://mycluster:9000/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).repartition(1).saveAsTextFile("hdfs://mycluster:9000/output")
Output:
[hadoop@mycluster output]$ hdfs dfs -cat hdfs://mycluster:9000/output/part-00000
(hello,4)
(me,1)
(you,2)
(china,1)