Joining a Hive table with a MySQL external data source in Spark
2018-04-03 19:06
MySQL side
We use the `dept` table, whose contents are:

```
mysql> select * from dept;
+--------+------------+----------+
| deptno | dname      | loc      |
+--------+------------+----------+
|     10 | ACCOUNTING | NEW YORK |
|     20 | RESEARCH   | DALLAS   |
|     30 | SALES      | CHICAGO  |
|     40 | OPERATIONS | BOSTON   |
+--------+------------+----------+
```
Hive side
1. Create the `empspark` table and load the data:

```
hive> CREATE TABLE empspark (
    >   empno int,
    >   ename string,
    >   job string,
    >   mgr int,
    >   hiredate string,
    >   salary double,
    >   comm double,
    >   deptno int
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
OK
Time taken: 0.635 seconds
hive> LOAD DATA LOCAL INPATH "/home/hadoop/data/emp" OVERWRITE INTO TABLE empspark;
Loading data to table default.empspark
Table default.empspark stats: [numFiles=1, numRows=0, totalSize=820, rawDataSize=0]
OK
Time taken: 0.973 seconds
hive> select * from empspark;
OK
369     SMITH   CLERK      7902   1980-12-17 00:00:00   800.0    NULL     20
7499    ALLEN   SALESMAN   7698   1981-02-20 00:00:00   1600.0   300.0    30
7521    WARD    SALESMAN   7698   1981-02-22 00:00:00   1250.0   500.0    30
7566    JONES   MANAGER    7839   1981-04-02 00:00:00   2975.0   NULL     20
7654    MARTIN  SALESMAN   7698   1981-09-28 00:00:00   1250.0   1400.0   30
7698    BLAKE   MANAGER    7839   1981-05-01 00:00:00   2850.0   NULL     30
7782    CLARK   MANAGER    7839   1981-06-09 00:00:00   2450.0   NULL     10
7788    SCOTT   ANALYST    7566   1982-12-09 00:00:00   3000.0   NULL     20
7839    KING    PRESIDENT  NULL   1981-11-17 00:00:00   5000.0   NULL     10
7844    TURNER  SALESMAN   7698   1981-09-08 00:00:00   1500.0   0.0      30
7876    ADAMS   CLERK      7788   1983-01-12 00:00:00   1100.0   NULL     20
7900    JAMES   CLERK      7698   1981-12-03 00:00:00   950.0    NULL     30
7902    FORD    ANALYST    7566   1981-12-03 00:00:00   3000.0   NULL     20
7934    MILLER  CLERK      7782   1982-01-23 00:00:00   1300.0   NULL     10
Time taken: 0.679 seconds, Fetched: 14 row(s)
```
Spark side
1. With the data prepared, Spark SQL first needs access to the Hive tables. I did two things here:
- Copy Hive's conf/hive-site.xml into Spark's conf/ directory.
- Copy the mysql-connector-java-5.1.45-bin.jar JDBC driver into Spark's jars/ directory.
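As an alternative to copying the driver into jars/, the connector jar can also be supplied when launching the shell. A sketch; the /home/hadoop/lib path is an assumption, adjust it to wherever the jar actually lives:

```shell
# Put the MySQL JDBC driver on both the driver and executor classpaths at launch
# (jar location below is an assumption, not from the original post)
spark-shell \
  --jars /home/hadoop/lib/mysql-connector-java-5.1.45-bin.jar \
  --driver-class-path /home/hadoop/lib/mysql-connector-java-5.1.45-bin.jar
```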
2. Create a DataFrame for each of the two tables.
MySQL table:
```scala
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306")
  .option("dbtable", "wl.dept")
  .option("user", "username")
  .option("password", "password")
  .load()
```
Hive table:
```scala
val hivedf = spark.table("empspark")
```
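Once both DataFrames exist, the same join can also be written in SQL by registering the JDBC DataFrame as a temporary view alongside the Hive table. A minimal self-contained sketch, using small in-memory DataFrames and a `local[*]` SparkSession as stand-ins for the real MySQL and Hive sources (the session, view names, and sample rows are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Local stand-in for the session (in spark-shell, `spark` already exists)
val spark = SparkSession.builder().master("local[*]").appName("sql-join-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins for the MySQL dept table and the Hive empspark table
val deptDF = Seq((10, "ACCOUNTING", "NEW YORK"), (20, "RESEARCH", "DALLAS"))
  .toDF("deptno", "dname", "loc")
val empDF = Seq((7934, "MILLER", 10), (7566, "JONES", 20))
  .toDF("empno", "ename", "deptno")

// Expose both sides to SQL; with Hive support enabled, empspark itself
// would already be queryable by name
deptDF.createOrReplaceTempView("dept")
empDF.createOrReplaceTempView("empspark")

val joined = spark.sql(
  """SELECT e.empno, e.ename, d.dname, d.loc
    |FROM empspark e JOIN dept d ON e.deptno = d.deptno""".stripMargin)
joined.show()
```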
Perform the join
```scala
jdbcDF.join(hivedf, jdbcDF.col("deptno") === hivedf.col("deptno")).show
```
Result:
```
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
|deptno|     dname|     loc|empno| ename|      job| mgr|           hiredate|salary|  comm|deptno|
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
|    10|ACCOUNTING|NEW YORK| 7934|MILLER|    CLERK|7782|1982-01-23 00:00:00|1300.0|  null|    10|
|    10|ACCOUNTING|NEW YORK| 7839|  KING|PRESIDENT|null|1981-11-17 00:00:00|5000.0|  null|    10|
|    10|ACCOUNTING|NEW YORK| 7782| CLARK|  MANAGER|7839|1981-06-09 00:00:00|2450.0|  null|    10|
|    20|  RESEARCH|  DALLAS| 7902|  FORD|  ANALYST|7566|1981-12-03 00:00:00|3000.0|  null|    20|
|    20|  RESEARCH|  DALLAS| 7876| ADAMS|    CLERK|7788|1983-01-12 00:00:00|1100.0|  null|    20|
|    20|  RESEARCH|  DALLAS| 7788| SCOTT|  ANALYST|7566|1982-12-09 00:00:00|3000.0|  null|    20|
|    20|  RESEARCH|  DALLAS| 7566| JONES|  MANAGER|7839|1981-04-02 00:00:00|2975.0|  null|    20|
|    20|  RESEARCH|  DALLAS|  369| SMITH|    CLERK|7902|1980-12-17 00:00:00| 800.0|  null|    20|
|    30|     SALES| CHICAGO| 7900| JAMES|    CLERK|7698|1981-12-03 00:00:00| 950.0|  null|    30|
|    30|     SALES| CHICAGO| 7844|TURNER| SALESMAN|7698|1981-09-08 00:00:00|1500.0|   0.0|    30|
|    30|     SALES| CHICAGO| 7698| BLAKE|  MANAGER|7839|1981-05-01 00:00:00|2850.0|  null|    30|
|    30|     SALES| CHICAGO| 7654|MARTIN| SALESMAN|7698|1981-09-28 00:00:00|1250.0|1400.0|    30|
|    30|     SALES| CHICAGO| 7521|  WARD| SALESMAN|7698|1981-02-22 00:00:00|1250.0| 500.0|    30|
|    30|     SALES| CHICAGO| 7499| ALLEN| SALESMAN|7698|1981-02-20 00:00:00|1600.0| 300.0|    30|
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
```
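Note that the output keeps `deptno` twice, once from each side, which makes that column ambiguous to reference afterwards. Passing the join key as a `Seq` of column names instead of a column-equality expression keeps it only once. A local sketch with toy data (the `local[*]` session and sample rows are assumptions standing in for the real MySQL and Hive sources):

```scala
import org.apache.spark.sql.SparkSession

// Local stand-in for the session (in spark-shell, `spark` already exists)
val spark = SparkSession.builder().master("local[*]").appName("join-dedup-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins for the MySQL dept table and the Hive empspark table
val deptDF = Seq((10, "ACCOUNTING", "NEW YORK"), (30, "SALES", "CHICAGO"))
  .toDF("deptno", "dname", "loc")
val empDF = Seq((7499, "ALLEN", 30), (7934, "MILLER", 10))
  .toDF("empno", "ename", "deptno")

// Joining on Seq("deptno") emits a single deptno column in the result schema,
// unlike col("deptno") === col("deptno"), which keeps both copies
val joined = deptDF.join(empDF, Seq("deptno"))
joined.show()
```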