March 28th, 2016
During the seven-week Insight Data Engineering Fellows Program, experienced software engineers and recent grads learn the latest open source technologies by building a data platform to handle large, real-time datasets. Andy Chu (now a Data Engineer at Netflix) discusses his project of implementing Lambda Architecture to track real-time updates.
Today’s traders on the stock market have a vast wealth of information available to them. From financial news to traditional newspapers and magazines to blogs and social media posts, there is more data than any individual can manually filter for news relevant to specific stocks of interest. For my project at Insight, I wanted to build a news service that can hook into a trader’s portfolio and customize news specific to the stocks a trader is holding.
This news service, named AutoNews Financial, reads in financial news from various sources, as well as individual trades that users are making in real time. Whenever a piece of news arrives that relates to a company making up a significant percentage of a user's portfolio, it is shown on the user's dashboard. With many users making trades and news constantly arriving, it is ideal to have a big data system holding the history of trades for all users as a source of truth. However, processing that large source of truth is too slow to maintain real-time updates of a user's holdings. Both requirements, real-time tracking and keeping results accurately up to date, can be satisfied by building a lambda architecture.
Advantages of Lambda Architecture

In a traditional SQL system, updates to a table overwrite the existing value of the field. This works well for datasets that fit on a small number of servers, which can be vertically scaled, with slave servers and backups for redundancy. However, as the dataset scales to more servers, restoring the data to the failure point after a hardware failure becomes more and more difficult, and will likely require downtime. In addition, since history is kept only in logs, not in the database itself, data corruptions that result in incorrect values may go unnoticed.
A big data system with a distributed, replicated messaging queue ensures that once data enters the system, it will not be lost even in the case of hardware or network failures. Storing the entire history of updates allows the results to be recalculated from this source of truth, and results are guaranteed to be correct after each batch processing run. However, reprocessing the entire historical dataset takes too long for results to be presented in real time. To bridge the gap, the lambda architecture design adds a real-time component that works like traditional SQL systems: by storing only the current value after each update, results are fast enough to be presented in real time. Errors in the real-time layer's data are resolved by overwriting its results with each run of batch processing. This yields accurate results in a highly available, eventually consistent system: any errors in the current value reported by the real-time layer, whether caused by hardware or network failure, data corruption, or software bugs, will be corrected by the next automated batch processing run, which processes the entire dataset from the beginning of time.
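The correction guarantee described above can be sketched in a few lines of plain Python. This is a toy simulation, not the project's code: an incremental real-time counter drifts because of a simulated bug, and a batch recompute over the immutable event log restores the correct value.

```python
# Append-only event log: the source of truth (user, signed share quantity).
events = [("alice", 100), ("alice", 250), ("alice", -50)]

# Real-time layer: incremental counter; simulate a bug that drops one update.
realtime_total = 0
for i, (user, qty) in enumerate(events):
    if i == 1:
        continue  # simulated software bug: one update is lost
    realtime_total += qty
print(realtime_total)  # 50 -- temporarily wrong

# Batch layer: recompute from the full history, then overwrite the
# real-time value, correcting the error.
batch_total = sum(qty for _, qty in events)
realtime_total = batch_total
print(realtime_total)  # 300 -- corrected
```

The key property is that the batch result depends only on the log, so no error in the incremental path can survive past the next batch run.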
Data Pipeline Behind AutoNews Financial

The diagram below illustrates the data pipeline.
The input data comes in as JSON messages, with trades synthesized from a normal distribution, and news coming from the Twitter API. These JSON messages are pushed into Kafka, and consumed by both the batch layer and the real-time layer.
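A minimal sketch of how a synthetic trade message might be built is shown below. The field names and distribution parameters are assumptions for illustration, not the project's actual schema; the Kafka produce step is shown commented out since it requires a running broker.

```python
import json
import random

def synthesize_trade(user_id, ticker):
    """Build one JSON trade message; quantity drawn from a normal distribution."""
    quantity = int(random.gauss(mu=0, sigma=500))  # assumed parameters
    return json.dumps({"user": user_id, "ticker": ticker, "quantity": quantity})

msg = synthesize_trade("user_42", "MMM")
trade = json.loads(msg)
print(sorted(trade.keys()))  # ['quantity', 'ticker', 'user']

# Pushing into Kafka would then look roughly like (requires a broker):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("trades", msg.encode("utf-8"))
```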
Using Kafka at the head of the pipeline to accept inputs guarantees that once messages enter the system, they will be delivered, regardless of hardware or network failure.
In the batch layer, Camus is used to consume all messages from Kafka and save them into HDFS; Spark then sums over the transaction history to get an accurate count of shares held by each user. The aggregate results are written to a Cassandra database table.
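The batch aggregation is conceptually simple; the plain-Python sketch below shows the computation the Spark job performs (summing signed trade quantities per user and ticker over the full history). Field names are assumptions, not the project's actual schema.

```python
from collections import defaultdict

# Full trade history, as read back from HDFS in the real pipeline.
history = [
    {"user": "alice", "ticker": "MMM", "quantity": 5000},
    {"user": "alice", "ticker": "MMM", "quantity": -1000},
    {"user": "bob", "ticker": "AAPL", "quantity": 200},
]

# Sum signed quantities per (user, ticker) to get current holdings.
holdings = defaultdict(int)
for trade in history:
    holdings[(trade["user"], trade["ticker"])] += trade["quantity"]

print(holdings[("alice", "MMM")])  # 4000
```

In Spark this is the classic key-by-then-reduce pattern, e.g. mapping each trade to a `((user, ticker), quantity)` pair and applying `reduceByKey` with addition.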
In the streaming layer, Kafka messages are consumed in real time using Spark Streaming. Unlike Storm, Spark Streaming is not fully real-time: it micro-batches stream data into RDDs with a batch interval as small as 500 ms. However, it allows the Spark code from the batch layer to be reused, and the micro-batch latency is small enough to be insignificant for this use case of updating news articles.
Results from both the batch and real-time layers are written out to Cassandra, and the data is served to a web interface via Flask. With the high volume of trades being written to the system, Cassandra's write-optimized design makes it well suited to this use case.
Coordinating Real-Time and Batch Processing

The results served on the web interface are always up to date with the latest messages coming into the system; this is achieved by combining the results from the batch and real-time layers. How the two sets of results are combined is most easily illustrated with an example.
In the diagrams below, there are three database tables: one that stores the result of the batch processing, one that stores only the trades made since the last batch processing run finished, and one that stores the correct, up-to-date values, which is the highlighted table.
By using a separate database table to record only deltas, and replacing the full counts with the results of the batch processing, any incorrect results caused by software, hardware, or network problems are resolved after a successful run of the batch processing, ensuring accurate results after each run.
In this example, assume that before a round of batch processing starts at time t0, a user makes a trade that results in him holding 5000 shares of 3M.
At t0, batch processing begins; when it finishes, the result will reflect the totals as of t0. The real-time table, Real Time 1, which holds the up-to-date values, also correctly shows the current value of 5000.
During the batch processing, the user sells 1000 shares of 3M. The Real Time 1 table will correctly update itself to hold 4000 shares, while the Real Time 2 table stores -1000, the delta since t0, as shown below.
When the batch processing finishes, the database tables hold these values: 5000 in the batch result, 4000 in Real Time 1, and -1000 in Real Time 2.
At this time, I swap the active database table to Real Time 2, and sum the batch result with the delta to get a consistent, up-to-date count: 5000 + (-1000) = 4000. I then reset table Real Time 1 to 0 so it can start recording deltas from t1.
The next round then begins with table Real Time 2 storing the consistent, up-to-date values.
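The example above can be sketched in plain Python. The dictionaries stand in for the Cassandra tables, and the trade values follow the worked example; this is an illustration of the bookkeeping, not the project's actual code.

```python
batch_result = {"MMM": 0}  # result of the last completed batch run
active = {"MMM": 0}        # Real Time 1: full up-to-date counts
delta = {"MMM": 0}         # Real Time 2: deltas since the batch run started

def record_trade(ticker, qty):
    """Apply one trade to both real-time tables."""
    active[ticker] += qty
    delta[ticker] += qty

# Before t0: the user buys 5000 shares of 3M.
record_trade("MMM", 5000)
delta["MMM"] = 0  # batch run starts at t0; deltas are recorded from t0 on

# During the batch run: the user sells 1000 shares.
record_trade("MMM", -1000)

# The batch run finishes, reflecting totals as of t0.
batch_result["MMM"] = 5000

# Swap: sum the batch result with the delta table to get the consistent,
# up-to-date count, and make Real Time 2 the active table.
combined = {t: batch_result[t] + delta[t] for t in batch_result}
print(combined["MMM"])  # 4000
```

Note that the combined value (5000 - 1000 = 4000) agrees with the incrementally maintained Real Time 1 table, which is exactly the consistency check the architecture relies on.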
Correctness of the data displayed to the user requires that each of the above steps is performed only after the previous step has fully completed and its results have been written out. Scheduling the steps sequentially, rather than by time of day, allows the system to scale as processing time grows with dataset size. Workflows in a system like this can become very complex as more features are added, since the scheduled jobs have dependencies and share resources in a production system. Hence, platforms that can programmatically schedule and monitor workflows are used to manage these systems.

In this project, I used Airflow, a scheduling and monitoring platform by Airbnb, which models a sequence of tasks and their upstream dependencies as a directed acyclic graph. Airflow is implemented in Python and can model individual jobs as Bash operators. This makes Airflow simple to use, since anything callable from Bash can be called directly by Airflow. The programmatic interface is less verbose than XML-configuration-based tools such as Oozie, and the use of the Bash operator requires less coding than a tool like Luigi, which models each job as a Python object. By wrapping each task as a Bash command, each step in the process of combining the real-time and batch results is performed only after the previous step exits successfully.
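A DAG for this kind of pipeline might look roughly like the sketch below. The task names and bash commands are placeholders I invented for illustration; only the structure (BashOperator tasks chained so each runs after the previous one succeeds) reflects the setup described above. This is a workflow configuration fragment, not runnable without an Airflow installation.

```python
# Hypothetical Airflow DAG sketch: each step runs only after the
# previous step's Bash command exits successfully.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("lambda_pipeline", start_date=datetime(2016, 1, 1),
          schedule_interval=None)

# Placeholder commands; the real scripts are not shown in the article.
ingest = BashOperator(task_id="camus_ingest",
                      bash_command="run_camus.sh", dag=dag)
batch = BashOperator(task_id="spark_batch",
                     bash_command="run_spark_batch.sh", dag=dag)
swap = BashOperator(task_id="swap_tables",
                    bash_command="swap_and_reset_tables.sh", dag=dag)

# Upstream dependencies: ingest -> batch -> swap.
ingest.set_downstream(batch)
batch.set_downstream(swap)
```

Because the dependencies are declared programmatically, each downstream task is triggered by the success of its upstream task rather than by a clock, which is what keeps the table swap safe as batch runtimes grow.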
In summary, Lambda Architecture involves having both a batch layer and a real-time layer, processing historical data and real-time updates respectively, to get robust, real-time results on top of a horizontally scalable hardware platform where failures are expected. For Lambda Architecture to be a viable solution, the data processing in the two layers must be designed so that the results of the batch layer can be combined with the live data streams read into the real-time layer. For this project, a second database table holding only the sums of inputs not yet processed by the batch layer allows a simple addition to aggregate the batch and real-time counts. This is how I track real-time updates in a highly available, eventually consistent system using Lambda Architecture.