
Spark 1.6.0 (Scala 2.11)版本的编译与安装部署

2016-01-06 09:43
On January 4, 2016, Spark 1.6.0 was released on the official website, so I downloaded and compiled it.

With the earlier build experience and the Java dependencies already downloaded, the build completed in about a minute; after redeploying and reconfiguring, everything was up and running right away. Very efficient.

For a Scala 2.11 build, a single command still suffices: build/sbt -Dscala=2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly

Testing one of the new features in Spark 1.6: the Dataset API.

The new features in 1.6 also include:

Spark Core/SQL

API Updates

SPARK-9999
Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
SPARK-10810
Session Management - Different users can share a cluster while having different configuration and temporary tables.
SPARK-11197
SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
SPARK-11745
Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412
Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
SPARK-11329
Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
SPARK-4849
Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
SPARK-11778 - DataFrameReader.table supports specifying a database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
SPARK-10947 - With schema inference from JSON into a DataFrame, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
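As a quick illustration of the new Dataset API (SPARK-9999), here is a minimal sketch for spark-shell. The Person case class and the sample data are invented for this example, and a SQLContext named sqlContext (with its implicits) is assumed to be in scope, as it is in spark-shell.

```scala
// Sketch of the Spark 1.6 Dataset API; assumes spark-shell, where
// sqlContext and its implicits are available.
import sqlContext.implicits._

// Hypothetical domain class used only for this example.
case class Person(name: String, age: Int)

// Build a typed Dataset from a local collection.
val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// Lambdas operate on the domain objects, while execution still goes
// through the Spark SQL engine.
val adults = people.filter(_.age >= 18).map(_.name)
adults.show()
```

The appeal over plain RDDs is that the typed lambdas above still benefit from the Spark SQL optimizer and encoders rather than generic Java serialization.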

Performance

SPARK-10000
Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-11787
Parquet Performance - Improve Parquet scan performance when using flat schemas.
SPARK-9241
Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.

SPARK-9858
Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
SPARK-10978
Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
SPARK-11111
Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
SPARK-10917,
SPARK-11149
In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
SPARK-11389
SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
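The null-safe join improvement (SPARK-11111) concerns the <=> operator, which treats two NULLs as equal. A hedged sketch, assuming a sqlContext with implicits imported; the column names and toy data are invented for the example:

```scala
import sqlContext.implicits._

// Two toy DataFrames with nullable keys (invented data).
val left  = Seq((Some(1), "a"), (None, "b")).toDF("k", "v1")
val right = Seq((Some(1), "x"), (None, "y")).toDF("k", "v2")

// <=> is null-safe equality: NULL <=> NULL evaluates to true.
// In 1.6 this join is planned as a SortMergeJoin rather than a
// cartesian product followed by a filter.
val joined = left.join(right, left("k") <=> right("k"))
joined.show()
```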

Spark Streaming

API Updates

SPARK-2629
New improved state management - mapWithState, a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
SPARK-11198
Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
SPARK-10891
Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver, to customize what data is stored in memory.
SPARK-6328
Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
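A rough sketch of the new mapWithState transformation (SPARK-2629) as a stateful word count. The socket source, port, checkpoint path, and the choice of a running Int sum as the state type are all illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Minimal streaming setup (app name, port and paths are illustrative).
val ssc = new StreamingContext(
  new SparkConf().setAppName("mapWithStateDemo"), Seconds(1))
ssc.checkpoint("/tmp/checkpoint") // mapWithState requires checkpointing

val lines = ssc.socketTextStream("localhost", 9999)
val wordOnes = lines.flatMap(_.split(" ")).map((_, 1))

// Merge each new count into the running per-key state and emit (word, sum).
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val stateDStream = wordOnes.mapWithState(StateSpec.function(mappingFunc))
stateDStream.print()
```

Unlike updateStateByKey, mapWithState only touches keys that actually received new data in a batch, which is where most of the performance win comes from.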

UI Improvements

Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.

MLlib

New algorithms/models

SPARK-8518
Survival analysis - Log-linear model for survival analysis
SPARK-9834
Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147
Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930
New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517
Bisecting K-Means clustering - Fast top-down clustering variant of K-Means

API improvements

ML Pipelines

SPARK-6725
Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565
LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
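Pipeline persistence (SPARK-6725) can be sketched as follows; the stages and the save path are hypothetical, and as the release notes say, only part of spark.ml supports save/load in 1.6:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A small text-classification pipeline (stages chosen for illustration).
val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression()))

// Save the (unfitted) pipeline definition and load it back.
pipeline.write.overwrite().save("/tmp/spark-1.6-pipeline")
val restored = Pipeline.load("/tmp/spark-1.6-pipeline")
```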

R API

SPARK-9836
R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681
Feature interactions in R formula - Interaction operator “:” in R formula

Python API - Many improvements to Python API to approach feature parity

Misc improvements

SPARK-7685,
SPARK-9642
Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384,
SPARK-10385
Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117
LIBSVM data source - LIBSVM as a SQL data source
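The LIBSVM data source (SPARK-10117) makes LIBSVM files readable through the DataFrame reader; a brief sketch, with a path that mirrors the sample data shipped with Spark but is an assumption here:

```scala
// Reads a LIBSVM file into a DataFrame with "label" and "features"
// columns; the path is an assumption for this example.
val df = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
df.printSchema()
```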

Documentation improvements

SPARK-7751
@since versions - Documentation includes initial version when classes and methods were added
SPARK-11337
Testable example code - Automated testing for code in user guide examples