Implementing word count with Hive
2018-01-03 17:29
1 Environment
For installing Hive, see the installation reference.

2 Create a database
Create the wordcount database:

    hive> create database wordcount;
    OK
    Time taken: 0.389 seconds
    hive> show databases;
    OK
    default
    wordcount
    Time taken: 0.043 seconds, Fetched: 3 row(s)
3 Create a table
Create a table to hold the file data, using the newline character as the field delimiter so that each line of the file becomes one row:

    hive> create table file_data(context string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';
    OK
    Time taken: 1.227 seconds
    hive> show tables;
    OK
    file_data
    Time taken: 0.195 seconds, Fetched: 1 row(s)
4 Prepare the data
Prepare the data to be counted:

    [hadoop@zydatahadoop001 ~]$ pwd
    /home/hadoop
    [hadoop@zydatahadoop001 ~]$ ll
    total 12
    -rw-rw-r--. 1 hadoop hadoop    7 Dec 19 09:56 demo1.txt
    -rw-rw-r--. 1 hadoop hadoop   27 Dec 22 14:40 helloword.txt
    -rw-------. 1 hadoop hadoop 1008 Dec 19 15:00 nohup.out
    [hadoop@zydatahadoop001 ~]$ vi wordcount.txt
    hello world
    hello hadoop
    hello java
    hello mysql
    c c++
    lisi
    zhangsan wangwu
5 Load the data into the file_data table
Load the prepared file (/home/hadoop/wordcount.txt) into the file_data table:

    hive> load data local inpath '/home/hadoop/wordcount.txt' into table file_data;
    Loading data to table wordcount.file_data
    Table wordcount.file_data stats: [numFiles=1, totalSize=75]
    OK
    Time taken: 3.736 seconds

Check file_data:

    hive> select * from file_data;
    OK
    hello world
    hello hadoop
    hello java
    hello mysql
    c c++
    lisi
    zhangsan wangwu
    Time taken: 0.736 seconds, Fetched: 7 row(s)
6 Split the data on spaces, recording each resulting word as one row in a result table
Create the result table that will hold the per-word records:

    hive> create table words(word string);
    OK
    Time taken: 0.606 seconds
Split the data and insert the results into the words table.

split is Hive's string-splitting function and works like Java's String.split; here it splits on spaces, and explode turns the resulting array into one row per element, so after this HQL statement runs, the words table holds one word per row. Note that the column of file_data is named context, so that is the column to split:

    hive> insert into table words select explode(split(context, ' ')) from file_data;

Check words:

    hive> select * from words;
    OK
    hello
    world
    hello
    hadoop
    hello
    java
    hello
    mysql
    c
    c++
    lisi
    zhangsan
    wangwu
    Time taken: 0.304 seconds, Fetched: 13 row(s)
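To build intuition for what split plus explode does, the same pipeline can be sketched in plain Python (a standalone check, not part of the Hive job): splitting each line on spaces and flattening gives one word per row, and grouping with a counter reproduces count(word).

```python
from collections import Counter

# The lines of wordcount.txt, as loaded into file_data
lines = [
    "hello world",
    "hello hadoop",
    "hello java",
    "hello mysql",
    "c c++",
    "lisi",
    "zhangsan wangwu",
]

# split(context, ' ') + explode: flatten every line into one word per row
words = [w for line in lines for w in line.split(" ")]
print(len(words))  # 13 rows, matching the words table

# group by word + count(word)
counts = Counter(words)
print(counts["hello"])  # 4, matching the Hive result below
```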
7 Count with the aggregate function count
Result:

    hive> select word, count(word) from words group by word;
    Query ID = hadoop_20171222143131_4636629a-1983-4b0b-8d96-39351f3cd53b
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1513893519773_0004, Tracking URL = http://zydatahadoop001:8088/proxy/application_1513893519773_0004/
    Kill Command = /opt/software/hadoop-cdh/bin/hadoop job -kill job_1513893519773_0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2017-12-22 15:47:06,871 Stage-1 map = 0%, reduce = 0%
    2017-12-22 15:47:50,314 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 22.96 sec
    2017-12-22 15:48:09,496 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 25.59 sec
    MapReduce Total cumulative CPU time: 25 seconds 590 msec
    Ended Job = job_1513893519773_0004
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 25.59 sec  HDFS Read: 6944  HDFS Write: 77  SUCCESS
    Total MapReduce CPU Time Spent: 25 seconds 590 msec
    OK
    c         1
    c++       1
    hadoop    1
    hello     4
    java      1
    lisi      1
    mysql     1
    wangwu    1
    world     1
    zhangsan  1
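For reference, the intermediate words table is not strictly required: the split, explode, and count steps can be combined into a single query over file_data with a subquery. A sketch, assuming the same file_data table created above (the alias t is required by Hive for the subquery):

```sql
SELECT t.word, count(1) AS cnt
FROM (
    SELECT explode(split(context, ' ')) AS word
    FROM file_data
) t
GROUP BY t.word;
```

The two-table version used in this post has the advantage that the intermediate words table can be inspected directly, which is helpful when learning how explode behaves.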