Apache Spark 2.0 Technical Preview

Two years after the release of Apache Spark 1.0, Databricks announced a technical preview of Apache Spark 2.0, based on the upstream branch 2.0.0-preview. The preview is not ready for production use, in terms of either stability or API; it is intended to gather feedback from the community ahead of the general availability release.

The new release focuses on feature improvements driven by community feedback, in two main areas: SQL support and the programming APIs.

SQL is one of the most widely used interfaces for Apache Spark applications. Spark 2.0 can run all 99 TPC-DS queries, which rely largely on features of the SQL:2003 specification. This alone can help port existing data workloads to a Spark backend with minimal rewriting of the application stack.
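As an illustration of the kind of SQL involved, the sketch below uses SparkSession.sql to run a query that combines an IN subquery with an uncorrelated scalar subquery. It assumes a local installation of PySpark 2.0 or later; the tables, columns, and rows are made up for the example and are not taken from TPC-DS.

```python
# Hypothetical sample data; the tables and columns are illustrative only.
SALES = [(1, 10.0), (1, 30.0), (2, 5.0), (3, 50.0)]  # (customer_id, amount)
PREFERRED = [(1,), (3,)]                              # (customer_id,)

# Total of above-average purchases per preferred customer.
QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM sales
WHERE customer_id IN (SELECT customer_id FROM preferred)
  AND amount > (SELECT AVG(amount) FROM sales)
GROUP BY customer_id
ORDER BY total DESC
"""


def run_query(spark):
    """Register the sample rows as temporary views and run QUERY on `spark`."""
    sales = spark.createDataFrame(SALES, ["customer_id", "amount"])
    preferred = spark.createDataFrame(PREFERRED, ["customer_id"])
    sales.createOrReplaceTempView("sales")
    preferred.createOrReplaceTempView("preferred")
    return [(row.customer_id, row.total) for row in spark.sql(QUERY).collect()]


def main():
    # Imported here so the module can be loaded on machines without Spark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("sql-sketch")
             .getOrCreate())
    try:
        for customer_id, total in run_query(spark):
            print(customer_id, total)
    finally:
        spark.stop()
```

Calling main() starts a local session, prints one line per matching customer, and shuts the session down again.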

The second area is the programming APIs, where machine learning receives particular emphasis. The original spark.mllib package remains available, but development now centers on the DataFrame-based spark.ml package and its pipeline-oriented APIs. Machine learning pipelines and models can now be persisted across all languages supported by Spark.
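A minimal PySpark sketch of pipeline persistence follows, assuming Spark 2.0 or later with the pyspark.ml package. The function name, training rows, column names, and save path are hypothetical; the point is only that a fitted pipeline is written once and can then be loaded back.

```python
def fit_save_reload(spark, path):
    """Fit a small text-classification pipeline, save it to `path`, reload it."""
    # Imported here so the module can be loaded on machines without Spark.
    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    # Hypothetical labelled documents: 1.0 marks text that mentions Spark.
    training = spark.createDataFrame(
        [
            (0, "spark runs sql queries", 1.0),
            (1, "the weather is nice", 0.0),
            (2, "spark streaming job", 1.0),
            (3, "lunch at noon", 0.0),
        ],
        ["id", "text", "label"],
    )

    # Tokenize the text, hash the words into feature vectors, then classify.
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(training)

    # Persist the fitted pipeline, replacing anything already stored at `path`.
    model.write().overwrite().save(path)
    return PipelineModel.load(path)
```

The returned PipelineModel can be applied with transform() to any DataFrame that has a text column.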

K-Means, Generalized Linear Models (GLM), Naive Bayes and Survival Regression are now supported in R.

DataFrames and Datasets are now unified for Scala and Java under the Dataset class, which also serves as the abstraction for structured streaming. In languages without compile-time type safety this unification does not apply, and DataFrame remains the primary abstraction. SQLContext and HiveContext are replaced by the unified SparkSession. Finally, the new Accumulator API has a simpler type hierarchy and supports specialization for primitive types. The old APIs are deprecated but remain for backward compatibility.
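The sketch below shows the single entry point from Python, assuming PySpark 2.0 or later running locally. The application name, view name, and helper functions are hypothetical. One session covers both the DataFrame calls and the SQL calls that previously went through a SQLContext; Hive support, formerly a separate HiveContext, is requested with enableHiveSupport() on the builder.

```python
def build_session(app_name="spark2-preview-demo"):
    """Create, or reuse, the SparkSession for this process."""
    # Imported here so the module can be loaded on machines without Spark.
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .master("local[*]")   # assumption: run locally on all cores
            .appName(app_name)
            .getOrCreate())


def squares(spark, n):
    """Return the squares of 0..n-1, computed through the DataFrame and SQL APIs."""
    # DataFrame API: spark.range yields a single column named `id`.
    df = spark.range(n).selectExpr("id", "id * id AS square")
    # SQL API on the same session: no separate SQLContext is needed.
    df.createOrReplaceTempView("squares")
    rows = spark.sql("SELECT square FROM squares ORDER BY id").collect()
    return [row.square for row in rows]
```

A typical call sequence is spark = build_session(), then squares(spark, 5), then spark.stop().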

The new Structured Streaming API aims to let programmers, and existing machine learning algorithms, handle streaming data sets in the same way as batch-loaded data sets, without added complexity. Performance has also improved with the second-generation Tungsten engine, which allows for up to 10 times faster execution.
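To make the batch and streaming symmetry concrete, the following sketch applies one word-count function to a static DataFrame and to a streaming one. It assumes PySpark with the readStream and writeStream names used by the 2.0 general availability release; the preview build may expose these differently. The function names, file path, host, and port are placeholders.

```python
def word_counts(lines):
    """Count words in a DataFrame with a string column named `value`.

    The same code works on a static DataFrame and on a streaming one.
    """
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    return words.groupBy("word").count()


def batch_counts(spark, path):
    # Static input: each line of the text file becomes one row in `value`.
    return word_counts(spark.read.text(path))


def start_streaming_counts(spark, host="localhost", port=9999):
    # Streaming input: lines arriving on a TCP socket, also in column `value`.
    lines = (spark.readStream
             .format("socket")
             .option("host", host)
             .option("port", port)
             .load())
    # Print the full, continuously updated result table to the console.
    return (word_counts(lines)
            .writeStream
            .outputMode("complete")
            .format("console")
            .start())
```

start_streaming_counts returns a query handle; calling awaitTermination() on it blocks until the stream stops.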

The technical preview release is available on Databricks.
