
ClassNotFoundException: Failed to find data source: jdbc

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages
at http://spark.apache.org/third-party-projects.html

The question is almost the same as "Why does format("kafka") fail with 'Failed to find data source: kafka.' with an uber-jar?", with the difference that the other OP used Apache Maven to create the uber-jar, while here it is about sbt (the sbt-assembly plugin's configuration, to be precise).

The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister implementation.
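Spark SQL resolves these aliases through Java's ServiceLoader mechanism, so a quick way to see which short names are actually visible on your uber-jar's classpath is a small Scala sketch like the following (assuming Spark SQL is on the classpath; this is a diagnostic aid, not part of the fix):

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print every data source alias that ServiceLoader can find on the classpath.
// If "jdbc" (or "kafka") does not show up here, its service registration was lost.
ServiceLoader.load(classOf[DataSourceRegister])
  .asScala
  .foreach(register => println(register.shortName()))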

For the jdbc alias to work, Spark SQL uses a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the following entry (there are others):

org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider

That is what ties the jdbc alias to the data source.

And you've excluded it from the uber-jar with the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}


Note case PathList("META-INF", xs @ _*), to which you simply apply MergeStrategy.discard. That is the root cause.

Just to check that the "infrastructure" is available and that you can use the jdbc data source by its fully-qualified name (not the alias), try this:
spark.read.
format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
load("jdbc:postgresql://localhost/testdb")


You will see other problems due to missing options like url, but... we're digressing.
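For completeness, once the alias resolves, a typical read through the jdbc short name looks roughly like the sketch below (the connection URL, table name and credentials are hypothetical placeholders):

// Minimal sketch of a jdbc read via the short name; the url, dbtable, user and
// password values are hypothetical placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "public.users")
  .option("user", "postgres")
  .option("password", "secret")
  .load()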

A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that would create an uber-jar with all data sources, incl. the jdbc data source):
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat


The kafka data source is an external module and is not available to Spark applications by default.

You have to define it as a dependency in your pom.xml (as you have done), but that is just the very first step to have it in your Spark application.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
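Since this post is about sbt, the equivalent build.sbt dependency would be along these lines (a sketch; adjust the Scala and Spark versions to match your build):

// sbt equivalent of the Maven dependency above; %% appends the Scala version suffix
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"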


With that dependency you have to decide whether you want to create a so-called uber-jar that has all the dependencies bundled together (which results in a fairly big jar file and makes the submission time longer), or use the --packages option (or the less flexible --jars) to add the dependency at spark-submit time.

(There are other options like storing the required jars on Hadoop HDFS or using Hadoop distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple)

I'd recommend using --packages first and only considering the other options once that works.

Use spark-submit with the --packages option to include the spark-sql-kafka-0-10 module as follows:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0


Include the other command-line options as you wish.


Uber-Jar Approach

Including all the dependencies in a so-called uber-jar may not always work due to how META-INF directories are handled.

For the kafka data source to work (and other data sources in general), you have to make sure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged (rather than handled with replace, first, or whatever other strategy you use).

The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister that registers org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
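Once that registration survives assembly (or is pulled in via --packages), the alias resolves and a streaming read can use it directly; a minimal sketch, assuming a local broker and a hypothetical topic name:

// Minimal sketch of a streaming read via the kafka alias; the broker address
// and topic name are hypothetical placeholders.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .load()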