
ClassNotFoundException: Failed to find data source: jdbc

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages
at http://spark.apache.org/third-party-projects.html

The question is almost the same as "Why does format("kafka") fail with 'Failed to find data source: kafka.' with an uber-jar?", with the difference that the other OP used Apache Maven to create the uber-jar, while here it is about sbt (the sbt-assembly plugin's configuration, to be precise).

The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister implementation.
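Spark SQL resolves these aliases through Java's ServiceLoader mechanism, so a quick way to see which short names are actually visible on your uber-jar's classpath is a small Scala sketch like the following (assuming Spark SQL is on the classpath; this is a diagnostic aid, not part of the fix):

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print every data source alias that ServiceLoader can find on the classpath.
// If "jdbc" (or "kafka") does not show up here, its service registration was lost.
ServiceLoader.load(classOf[DataSourceRegister])
  .asScala
  .foreach(register => println(register.shortName()))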

For the jdbc alias to work, Spark SQL uses a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the following entry (there are others):

org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider

That is what ties the jdbc alias to the data source.

And you've excluded it from the uber-jar with the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}


Note case PathList("META-INF", xs @ _*), to which you simply apply MergeStrategy.discard. That is the root cause.

Just to check that the "infrastructure" is available and that you can use the jdbc data source by its fully-qualified name (not the alias), try this:
spark.read.
format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
load("jdbc:postgresql://localhost/testdb")


You will see other problems due to missing options like url, but... we're digressing.
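For completeness, once the alias resolves, a typical read through the jdbc short name looks roughly like the sketch below (the connection URL, table name and credentials are hypothetical placeholders):

// Minimal sketch of a jdbc read via the short name; the url, dbtable, user and
// password values are hypothetical placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "public.users")
  .option("user", "postgres")
  .option("password", "secret")
  .load()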

A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that would create an uber-jar with all data sources, incl. the jdbc data source):
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat


The kafka data source is an external module and is not available to Spark applications by default.

You have to define it as a dependency in your pom.xml (as you have done), but that is just the very first step to have it in your Spark application.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
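Since this post is about sbt, the equivalent build.sbt dependency would be along these lines (a sketch; adjust the Scala and Spark versions to match your build):

// sbt equivalent of the Maven dependency above; %% appends the Scala version suffix
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"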


With that dependency you have to decide whether you want to create a so-called uber-jar that has all the dependencies bundled together (which results in a fairly big jar file and makes the submission time longer), or use the --packages option (or the less flexible --jars) to add the dependency at spark-submit time.

(There are other options like storing the required jars on Hadoop HDFS or using Hadoop distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple)

I'd recommend using --packages first and only considering the other options once that works.

Use spark-submit with the --packages option to include the spark-sql-kafka-0-10 module as follows:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0


Include the other command-line options as you wish.


Uber-Jar Approach

Including all the dependencies in a so-called uber-jar may not always work due to how META-INF directories are handled.

For the kafka data source to work (and other data sources in general), you have to make sure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged (rather than handled with replace, first, or whatever other strategy you use).

The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister that registers org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
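Once that registration survives assembly (or is pulled in via --packages), the alias resolves and a streaming read can use it directly; a minimal sketch, assuming a local broker and a hypothetical topic name:

// Minimal sketch of a streaming read via the kafka alias; the broker address
// and topic name are hypothetical placeholders.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .load()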