
Apache Spark powers live SQL analytics in SnappyData

The team behind Pivotal’s GemFire in-memory transactional data store recently unveiled SnappyData, a new database solution powered by GemFire and Apache Spark.

SnappyData is another example of the way Spark has recently been employed as a component in a larger database solution, with or without other components from Apache Hadoop.

Snap and spark

SnappyData (the name of both the new database and the organization producing it) was built to span two worlds. It uses the Apache Spark in-memory data-analytics engine so it can perform live SQL analytics on both static datasets and streams. Queries against SnappyData can be written as conventional SQL or as Spark abstractions, so existing work done in either paradigm can be reused, alone or together, on the same data.

To store and retrieve the data, SnappyData has a distributed data store called Snappy-Store, which is derived from a variant of Pivotal’s GemFire technology. It works as either its own data store or as a sort of asynchronous write-back cache to other data sources, such as Hadoop/HDFS. This implies that existing datasets could be accessed through SnappyData without having to be formally migrated.
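To make the write-back idea concrete, here is a minimal sketch in Python. It is not SnappyData's or GemFire's API: the `WriteBehindCache` class, its method names, and the use of a plain dict as the "other data source" are all assumptions made for illustration. Writes land in memory first and reach the backing store only when `flush()` runs, which a real system would trigger asynchronously in the background.

```python
class WriteBehindCache:
    """Hypothetical write-back (write-behind) cache over a dict-like store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stand-in for HDFS or another source
        self.cache = {}                     # in-memory copies of rows
        self.dirty = set()                  # keys changed since the last flush

    def get(self, key):
        # Read through: load from the backing store on a cache miss.
        if key not in self.cache:
            self.cache[key] = self.backing_store[key]
        return self.cache[key]

    def put(self, key, value):
        # Writes are acknowledged from memory; the backing store is
        # updated later, not on this call.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # Push every pending write to the backing store. A real system
        # would run this periodically on a background thread.
        flushed = len(self.dirty)
        for key in self.dirty:
            self.backing_store[key] = self.cache[key]
        self.dirty.clear()
        return flushed


store = {"a": 1}
cache = WriteBehindCache(store)
cache.put("a", 2)            # store still holds the old value here
pending = cache.flush()      # now the store catches up
```

Because reads fall through to the backing store on a miss, data that already lives there is visible without a separate import step, which is the point the article makes about not having to migrate existing datasets.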

SnappyData also tries to offer novel solutions to problems that can arise when using streaming data. For instance, if too much data is coming through for a query to be answered in real time, SnappyData can fall back on approximate query processing (AQP), or on sampling the streaming data, to generate an answer.

The results are less exact than operating on the entire dataset, and AQP isn’t available for every kind of query. That said, AQP queries are intended to be faster to run and are less demanding of CPU and memory than working on the full dataset.
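The trade-off between accuracy and cost can be illustrated with reservoir sampling, one common way to keep a fixed-size uniform sample of a stream. The article does not say which sampling method SnappyData uses, so treat the technique, the `ReservoirSample` class, and its method names as assumptions; this is a sketch in plain Python, not the product's implementation.

```python
import random


class ReservoirSample:
    """Fixed-size uniform sample of a stream (Algorithm R), illustrative only."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, value):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill the reservoir with the first `capacity` values.
            self.items.append(value)
        else:
            # Keep the new value with probability capacity / seen,
            # replacing a uniformly chosen existing slot.
            slot = self._rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = value

    def approx_mean(self):
        # Answer an AVG-style query from the sample alone.
        if not self.items:
            raise ValueError("no data seen yet")
        return sum(self.items) / len(self.items)


sample = ReservoirSample(capacity=1000, seed=42)
for reading in range(100_000):   # stand-in for a stream of measurements
    sample.add(reading)

estimate = sample.approx_mean()  # close to, but not exactly, 49999.5
```

The query touches 1,000 values instead of 100,000, which is where the savings in CPU and memory come from. It also shows why AQP does not suit every query: an average tolerates sampling well, while something like an exact maximum or a lookup of one specific row does not.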

One among many

This isn’t the first time Spark has been used at the heart of a data analysis solution that covers both OLTP and OLAP workloads. In-memory database system Splice Machine was originally built on top of Hadoop components and leveraged them to scale out and run both OLTP and OLAP jobs under one hood. Version 2.0 of that product added Spark as an OLAP processing engine.

Where SnappyData diverges from Splice Machine, though, is in how Spark is leveraged. SnappyData claims it’s extending Spark Streaming in various ways, such as allowing streams to be manipulated and queried as though they were tables, including operations like joins.
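As a rough picture of what treating a stream like a table means, the sketch below joins each incoming micro-batch of events against an in-memory lookup table. It is plain Python under stated assumptions: the function names, the dict-based table, and the inner-join semantics are hypothetical and do not reflect SnappyData's or Spark Streaming's actual interfaces.

```python
def join_batch(batch, table, key):
    """Inner-join one micro-batch of event dicts against a keyed table."""
    joined = []
    for event in batch:
        row = table.get(event[key])
        if row is not None:
            # Merge table columns with event columns, as a SQL join would.
            joined.append({**row, **event})
    return joined


def join_stream(batches, table, key):
    """Lazily apply the same join to every micro-batch in the stream."""
    for batch in batches:
        yield join_batch(batch, table, key)


# Hypothetical reference table keyed by user_id.
users = {1: {"user_id": 1, "name": "ann"}}

# Hypothetical stream, already cut into micro-batches.
stream = [
    [{"user_id": 1, "amount": 5}, {"user_id": 2, "amount": 7}],
    [{"user_id": 1, "amount": 3}],
]

results = list(join_stream(stream, users, "user_id"))
```

Events with no matching table row are dropped, which matches inner-join behavior; an outer join would keep them with empty table columns.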

SnappyData also seems like a good environment to leverage changes that are slated for Apache Spark in the near term. For instance, Spark 2.0, scheduled to come out later this year, will heavily rework how Spark handles memory management and introduce changes to its streaming system that make it easier to pull down streaming data.
