Notes on a workaround for Spark versions whose Spark SQL does not support EXISTS and IN subqueries
2017-01-07 09:21
An excellent Stack Overflow answer that solves this problem:
SparkSQL doesn't currently have EXISTS & IN (see "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features").
EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.
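As a rough illustration of the IN-to-JOIN rewrite, here is a minimal sketch that uses SQLite through Python's standard sqlite3 module as a stand-in for Spark SQL. The tables orders and vip and their columns are hypothetical, not from the original question. SQLite has no LEFT SEMI JOIN, so the sketch emulates it with an inner join against a de-duplicated key set; in Spark SQL the quoted advice would presumably be written as orders o LEFT SEMI JOIN vip v ON o.customer_id = v.customer_id instead.

```python
import sqlite3

# Hypothetical tables, used only to compare the two query shapes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE vip (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 30), (4, 10);
    INSERT INTO vip VALUES (10), (10), (30);  -- duplicate key on purpose
""")

# Original form: an IN subquery.
with_in = conn.execute("""
    SELECT id FROM orders
    WHERE customer_id IN (SELECT customer_id FROM vip)
    ORDER BY id
""").fetchall()

# Rewrite: inner join against the DISTINCT keys, emulating a semi join.
with_join = conn.execute("""
    SELECT o.id
    FROM orders o
    JOIN (SELECT DISTINCT customer_id FROM vip) v
      ON o.customer_id = v.customer_id
    ORDER BY o.id
""").fetchall()

# A plain join without DISTINCT repeats an order once per matching vip row,
# so it is not a drop-in replacement for IN.
naive_join = conn.execute("""
    SELECT o.id
    FROM orders o
    JOIN vip v ON o.customer_id = v.customer_id
    ORDER BY o.id
""").fetchall()

print(with_in)
print(with_join)
print(naive_join)
```

The point of the third query is that a semi join (or a join against distinct keys) preserves the row count of the left table, while an ordinary join can multiply rows.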
A table holds the rows that make some predicate (statement parameterized by column names) true:
- The DBA gives the predicates for each base table T with columns T.C,...: T(T.C,...)
- A JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.
- SELECT kept columns FROM T holds the rows where EXISTS dropped columns [predicate of T].
- T LEFT SEMI JOIN U holds the rows where EXISTS U-only columns [predicate of T AND predicate of U].
- T WHERE condition holds the rows where predicate of T AND condition.
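The UNION and EXCEPT correspondences above can be checked on a toy table. This is only a sketch, again using SQLite via Python's sqlite3 as a stand-in for Spark SQL; the table t and its columns a and b are made up for the example. Note that UNION and EXCEPT remove duplicates, so the equivalence holds here because the selected column id is unique.

```python
import sqlite3

# Hypothetical table: id is unique, a and b are 0/1 flags.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, a INTEGER, b INTEGER);
    INSERT INTO t VALUES (1, 1, 0), (2, 0, 1), (3, 1, 1), (4, 0, 0);
""")

def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# OR of two predicates versus UNION of the two single-predicate queries.
or_where = ids("SELECT id FROM t WHERE a = 1 OR b = 1")
or_union = ids(
    "SELECT id FROM t WHERE a = 1 UNION SELECT id FROM t WHERE b = 1"
)

# AND NOT versus EXCEPT.
and_not_where = ids("SELECT id FROM t WHERE a = 1 AND NOT (b = 1)")
and_not_except = ids(
    "SELECT id FROM t WHERE a = 1 EXCEPT SELECT id FROM t WHERE b = 1"
)

print(or_where, or_union)
print(and_not_where, and_not_except)
```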
(Re querying generally see this answer.)
So by keeping in mind the predicate expressions corresponding to SQL, you can use straightforward logic rewrite rules to compose and/or reorganize queries. E.g., using UNION here need not be "clumsy" either in terms of readability or execution.
Your original question indicated that you understood that you could use UNION and you have edited variants into your question that excise EXISTS and IN from your original queries. Here is another variant also excising OR.
select <...>
from A, B, C, (select ID from ...) as e
where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = e.id
union
select <...>
from A, B, C, (select ID from ...) as e
where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = e.ID
Your Solution 1 does not do what you think it does. If just one of the exists_clause tables is empty, i.e. even if there are ID matches available in the other, the FROM cross product of tables is empty and no rows are returned. ("An Unintuitive Consequence of SQL Semantics": Chapter 6, The Database Language SQL, sidebar on page 264 of Database Systems: The Complete Book, 2nd Edition.) A FROM does not just introduce names for rows of tables; it CROSS JOINs and/or OUTER JOINs them, after which ON (for INNER JOINs) and WHERE filter some rows out.
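The empty-table effect is easy to reproduce. The following sketch assumes a query shaped roughly like the Solution 1 described above, with hypothetical tables a, e1 and e2 (e2 left empty), and uses SQLite via Python's sqlite3 rather than Spark SQL; the cross-product semantics of a comma-separated FROM are standard SQL.

```python
import sqlite3

# Hypothetical tables: e1 has matches for a, e2 is deliberately empty.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE e1 (id INTEGER);
    CREATE TABLE e2 (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO e1 VALUES (1), (2);
""")

# Intent: ids of a that appear in e1 OR in e2.
# The comma-separated FROM builds a x e1 x e2 first; since e2 is empty,
# the cross product is empty and WHERE has nothing to filter.
comma_join = conn.execute("""
    SELECT DISTINCT a.id
    FROM a, e1, e2
    WHERE a.id = e1.id OR a.id = e2.id
""").fetchall()

# UNION of two joins: each branch only involves the table it needs,
# so the empty e2 just contributes no rows.
union_form = sorted(conn.execute("""
    SELECT a.id FROM a JOIN e1 ON a.id = e1.id
    UNION
    SELECT a.id FROM a JOIN e2 ON a.id = e2.id
""").fetchall())

print(comma_join)
print(union_form)
```

The comma-join form returns nothing even though e1 contains matches, while the UNION form still returns them.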
Performance typically differs between expressions that return the same rows, depending on DBMS optimization. Many details affect the best way to evaluate a query and the best way to write it; the DBMS and/or the programmer may or may not know them, and may or may not balance them well. But executing two ORed subselects per row in a WHERE (as in your original queries but also your later Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).
Original link: http://stackoverflow.com/questions/34861516/spark-replacement-for-exists-and-in