
Introduction to the New Features in Spark 2.0

2016-08-22 16:24
Spark Release 2.0.0
Official release notes: http://spark.apache.org/releases/spark-release-2-0-0.html#sparkr
Apache Spark 2.0.0 is the first release on the 2.x line. The major updates are API usability, SQL 2003 support, performance improvements, structured streaming, R UDF support, as well as operational improvements. In addition, this release includes over 2500 patches from over 300 contributors.
 
To download Apache Spark 2.0.0, visit the downloads page (http://spark.apache.org/downloads.html). You can consult JIRA (https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12329449) for the detailed changes. We have curated a list of high level changes here, grouped by major modules.
API Stability
Core and Spark SQL
    Programming APIs
    SQL
    New Features
    Performance and Runtime
MLlib
    New features
    Speed/scaling
SparkR
Streaming
Dependency, Packaging, and Operations
Removals, Behavior Changes and Deprecations
    Removals
    Behavior Changes
    Deprecations
Known Issues
Credits
API Stability
Apache Spark 2.0.0 is the first release in the 2.X major line. Spark is guaranteeing stability of its non-experimental APIs for all 2.X releases. Although the APIs have stayed largely similar to 1.X, Spark 2.0.0 does have API breaking changes. They are documented in the Removals, Behavior Changes and Deprecations section.
Core and Spark SQL
Programming APIs
One of the largest changes in Spark 2.0 is the new updated APIs:
Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
SparkSession: a new entry point that replaces the old SQLContext and HiveContext for the DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility. (See the sketch after this list.)
A new, streamlined configuration API for SparkSession.
A simpler, more performant accumulator API.
A new, improved Aggregator API for typed aggregation in Datasets.
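Below is a minimal, hedged sketch (Scala) of the unified entry point and the DataFrame/Dataset unification described above. The application name, the people.json input file, and the Person case class are illustrative assumptions, not part of the release notes.

```scala
// Sketch only: SparkSession replaces SQLContext/HiveContext, and in Scala
// DataFrame is just a type alias for Dataset[Row].
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object SparkSessionSketch {
  case class Person(name: String, age: Long) // hypothetical schema

  def main(args: Array[String]): Unit = {
    // Single entry point; .enableHiveSupport() could be added instead of HiveContext.
    val spark = SparkSession.builder()
      .appName("spark2-api-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // spark.read returns a DataFrame, i.e. Dataset[Row].
    val df: Dataset[Row] = spark.read.json("people.json") // hypothetical input file
    // Converting to a typed Dataset keeps compile-time type safety.
    val people: Dataset[Person] = df.as[Person]
    people.filter(_.age > 21).show()

    spark.stop()
  }
}
```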
SQL
Spark 2.0 substantially improved SQL functionalities with SQL2003 support. Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:
A native SQL parser that supports both ANSI-SQL as well as Hive QL.
Native DDL command implementations.
Subquery support (see the SQL sketch after this list), including:
Uncorrelated scalar subqueries
Correlated scalar subqueries

NOT IN predicate subqueries (in WHERE/HAVING clauses)
IN predicate subqueries (in WHERE/HAVING clauses)
(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)

View canonicalization support
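As referenced above, here is a hedged sketch of the new subquery forms. The tiny orders and customers tables, their column names, and their contents are assumptions made only for demonstration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("subquery-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny illustrative tables registered as temp views.
Seq((1, 10.0, 100L), (2, 25.0, 100L), (3, 5.0, 200L))
  .toDF("id", "amount", "customer_id").createOrReplaceTempView("orders")
Seq((100L, "alice"), (200L, "bob"))
  .toDF("id", "name").createOrReplaceTempView("customers")

// Uncorrelated scalar subquery: the inner query is evaluated once.
spark.sql("SELECT id, amount FROM orders WHERE amount > (SELECT avg(amount) FROM orders)").show()

// Correlated scalar subquery: the inner query references the outer row.
spark.sql("""
  SELECT c.name,
         (SELECT max(o.amount) FROM orders o WHERE o.customer_id = c.id) AS max_amount
  FROM customers c
""").show()

// NOT IN predicate subquery in the WHERE clause.
spark.sql("""
  SELECT c.name FROM customers c
  WHERE c.id NOT IN (SELECT customer_id FROM orders WHERE amount > 20)
""").show()
```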
In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
New Features
Native CSV data source, based on Databricks' spark-csv module.
Off-heap memory management for both caching and runtime execution.
Hive-style bucketing support.
Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch (see the sketch below).
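A hedged illustration of the built-in CSV reader and the sketch-based statistics listed above; the input file, column names, and parameter values are assumptions for demonstration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-stats-sketch").master("local[*]").getOrCreate()

// Native CSV data source: no external spark-csv package needed in 2.0.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("events.csv")                        // hypothetical input file

// Approximate quantiles; the last argument is the allowed relative error.
val quartiles = df.stat.approxQuantile("latency_ms", Array(0.25, 0.5, 0.75), 0.01)

// Bloom filter and count-min sketch built from a column.
val bloom = df.stat.bloomFilter("user_id", 1000000L, 0.03)
val cms   = df.stat.countMinSketch("user_id", 0.01, 0.99, 42)
println(bloom.mightContain("user-123"))     // membership test on the hypothetical column
```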
Performance and Runtime
Substantial (2 - 10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.
Improved Parquet (a columnar storage format) scan throughput through vectorization.
Improved ORC performance.
Many improvements in the Catalyst query optimizer for common workloads.
Improved window function performance via native implementations for all window functions.
Automatic file coalescing for native data sources. (A whole-stage codegen illustration follows this list.)
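One way to see whole-stage code generation at work is to inspect the physical plan. The query below is only an illustrative example; the exact explain() output varies by version, but in 2.0 operators fused into generated code are typically marked with '*'.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen-sketch").master("local[*]").getOrCreate()

spark.range(0, 1000000L)
  .selectExpr("sum(id) AS total")
  .explain()   // e.g. *HashAggregate / *Range nodes indicate generated code paths
```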
MLlib
The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.
New features
ML persistence: The DataFrame-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R. See this blog post and the following JIRAs for details: SPARK-6725, SPARK-11939, SPARK-14311. (A save/load sketch follows this list.)
MLlib in R: SparkR now offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression. See this talk to learn more.
Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.
Algorithms added to the DataFrame-based API: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer.
This talk lists many of these new features.
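As referenced in the ML persistence item above, here is a hedged sketch of saving and loading a fitted Pipeline with the DataFrame-based API. The tiny training set, column names, and output path are illustrative assumptions.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-persist-sketch").master("local[*]").getOrCreate()

// Minimal labeled text data, just enough to fit a pipeline.
val training = spark.createDataFrame(Seq(
  (0L, "spark structured streaming", 1.0),
  (1L, "completely unrelated text", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)

// Save the fitted pipeline and load it back (possibly from another language binding).
model.write.overwrite().save("/tmp/text-model")      // hypothetical path
val restored = PipelineModel.load("/tmp/text-model")
```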
Speed/scaling
Vectors and Matrices stored in DataFrames now use much more efficient serialization, reducing overhead in calling MLlib algorithms. (SPARK-14850)
SparkR
The largest improvement to SparkR in Spark 2.0 is user-defined functions. There are three user-defined functions: dapply, gapply, and lapply. The first two can be used to do partition-based UDFs using dapply and gapply, e.g. partitioned model learning. The latter can be used to do hyper-parameter tuning.
In addition, there are a number of new features:
Improved algorithm coverage for machine learning in R, including naive Bayes, k-means clustering, and survival regression.
Generalized linear models support more families and link functions.
Save and load for all ML models.
More DataFrame functionality: Window functions API, reader/writer support for JDBC and CSV, SparkSession.
Streaming
Spark 2.0 ships the initial experimental release for Structured Streaming, a high level streaming API built on top of Spark SQL and the Catalyst optimizer. Structured Streaming enables users to program against streaming sources and sinks using the same DataFrame/Dataset API as in static data sources, leveraging the Catalyst optimizer to automatically incrementalize the query plans. (A minimal sketch follows.)
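A hedged sketch of a Structured Streaming word count against the experimental 2.0 API; the socket source, host/port, and console sink are assumptions chosen for local experimentation only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A streaming DataFrame is read with the same API shape as static data.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Word count expressed with ordinary Dataset operations;
// Catalyst incrementalizes the plan under the hood.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // keep emitting the full running counts
  .format("console")
  .start()

query.awaitTermination()
```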
For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.
Dependency, Packaging, and Operations
There are a variety of changes to Spark's operations and packaging process:
Spark 2.0 no longer requires a fat assembly jar for production deployment.
The Akka dependency has been removed, and as a result, user applications can program against any version of Akka.
Support for launching multiple Mesos executors in coarse grained Mesos mode.
Kryo version is bumped to 3.0.
The default build now uses Scala 2.11 rather than Scala 2.10.
Removals, Behavior Changes and Deprecations
Removals
The following features have been removed in Spark 2.0:
Bagel
Support for Hadoop 2.1 and earlier.
The ability to configure the closure serializer.
HTTPBroadcast.
TTL-based metadata cleaning.
The semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
SparkContext.metricsSystem.
Block-oriented integration with Tachyon (subsumed by file system integration).
Methods deprecated in Spark 1.x.
Methods on Python DataFrame that returned RDDs (map, flatMap, mapPartitions, etc). They are still available via the dataframe.rdd field, e.g. dataframe.rdd.map.
Less frequently used streaming connectors, including Twitter, Akka, MQTT, ZeroMQ.
The hash-based shuffle manager.
History serving functionality from the standalone Master.
For Java and Scala, DataFrame no longer exists as a class. As a result, data sources would need to be updated.
The Spark EC2 script has been fully moved to an external repository hosted by the UC Berkeley AMPLab.
Behavior Changes
The following changes might require updating existing applications that depend on the old behavior or API.
The default build now uses Scala 2.11 rather than Scala 2.10.
In SQL, floating literals are now parsed as decimal data type rather than double data type.
Kryo version is bumped to 3.0.
Java RDD's flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java Iterator, so the functions do not need to materialize all the data.
Java RDD's countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
When writing Parquet files, the summary files are not written by default. To re-enable them, users must set "parquet.enable.summary-metadata" to true.
The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes. (An import sketch follows this list.)
For a more complete list, please see SPARK-11806 for deprecations and removals.
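As referenced in the spark.ml.linalg item above, a minimal, hedged sketch of the import switch for code written against the DataFrame-based API; the vector values are arbitrary.

```scala
// New DataFrame-based API types live in ml.linalg...
import org.apache.spark.ml.linalg.{Vector, Vectors}
// ...instead of the old RDD-based API types:
// import org.apache.spark.mllib.linalg.Vectors

val dense: Vector  = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
```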
Deprecations
The following features have been deprecated in Spark 2.0, and might be removed in future versions of Spark 2.x:
Fine-grained mode in Apache Mesos.
Support for Java 7.
Support for Python 2.6.
Known Issues
Lead and Lag's behaviors have been changed from respecting nulls (the 1.6 behavior) to ignoring nulls. These behavioral changes will be fixed in 2.0.1 (SPARK-16721).
Lead and Lag functions using constant input values do not return the default value when the offset row does not exist (SPARK-16633).
 