Source: ericsson.com
In our latest post, we investigate the impact of today's data pipelining challenges and explore how increased automation in stream processing frameworks such as Spark and Flink can help telecom operators achieve better performance.
It has been a while since we wrote our blog series (apache-storm-vs-spark-streaming and apache-storm-performance-tuners) on stream processing frameworks, so we thought we would share our recent views on the current stream processing engines and their applicability to 5G and IoT use cases.
As Ericsson prepares to expand into 5G and IoT use cases, we have studied various scalable and flexible stream processing frameworks to solve data pipelining issues and understand their impact on overall performance. There is also a rising need for automation in every domain, which is achieved by applying machine learning to streaming data to support adaptive learning and intelligent decision making. Incrementally learning models and extracting information from streaming data with machine learning algorithms is indeed a great challenge.
In this post, we will talk about the challenges of applying AI to streaming data and how stream processing frameworks, primarily the Spark Streaming framework, can be used to solve those problems.
Spark Streaming framework
The contents of this blog are organized into Input, Process (ETL and ML) and Sink phases. We also talk about various machine learning and data analysis techniques used in stream processing frameworks to enable efficient control and optimization.
Input Phase:
Even though there are different input sources such as files, databases and various end-points, the interesting development in the current setup is how efficiently we can use Apache Kafka with the Spark Streaming platform. In addition to the default receiver-based approach, a “direct” technique has been introduced that resolves the performance and duplication issues. In our telco domain, where we need to handle data rates of 1 TB/sec from our network probes, this “direct” approach has been exactly the right technique to apply. Beyond performance efficiency, we also need a simple way to maintain the distribution technique in our complex telco systems. The telecom domain also requires 99.9999% accuracy, which places tremendous demands on how we handle failure scenarios. The “direct” technique reduces the complexity of handling those failures and keeps the amount of replicated data across the system low.
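As a minimal sketch of the receiver-less “direct” approach, the snippet below wires a Kafka topic into a Spark DStream using the spark-streaming-kafka integration that shipped with Spark 2.x. The broker address, topic name and batch interval are illustrative placeholders, not values from our deployment.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8 integration (Spark 2.x)

sc = SparkContext(appName="ProbeIngest")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# "Direct" (receiver-less) stream: Spark tracks the Kafka offsets itself,
# so each record is consumed exactly once per batch without a write-ahead log.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["network-probes"],                             # hypothetical topic
    kafkaParams={"metadata.broker.list": "broker1:9092"},  # hypothetical broker
)

# Each element is a (key, value) pair; count the records per micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```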
Process Phase:
Extraction, transformation and loading (ETL):
In the early days of stream processing, we usually talked about Bolts running simultaneously on executors, and our main task was to design the deployment topology for good distribution and maximum usage of the available resources. Then we started talking about micro-batches and how effective and fault-tolerant they were compared to a pure stream processing setup. We also used to talk about the Lambda architecture, which combines batch and stream processing in a single query. At present, due to the increased popularity of the Spark Streaming framework, industries have started shifting towards Structured Streaming, where even flat tables are treated as streaming data and processed incrementally. Structured Streaming allows us to give newly arrived data higher priority when answering a streaming query, compared to the processing of historic data.
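To make the incremental-processing idea concrete, here is a small Structured Streaming sketch that keeps a running count per event type; the socket source, host/port and column naming are illustrative assumptions, not our production pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("IncrementalCounts").getOrCreate()

# An unbounded input treated as a continuously growing table.
lines = (spark.readStream
         .format("socket")             # illustrative source; a Kafka source works the same way
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Each incoming line is assumed to carry an event-type label (illustrative schema).
counts = lines.groupBy(F.col("value").alias("event_type")).count()

# In "update" mode only the aggregate rows changed by the new data are emitted
# each trigger, i.e. the query is answered incrementally as data arrives.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())

query.awaitTermination()
```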
In our telecom world, we have various transformations such as number mapping, cleaning, replacing null values, variable transformations, etc. For operations that can be handled in a purely streaming manner, we use Apache Flink, as it has no micro-batch concept. For operations such as replacing missing values, the mean of the last N values, etc. – anything which requires historic data – we use Spark Streaming with Structured Streaming queries as our preferred approach.
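As an illustration of the kind of per-record transformations listed above, the sketch below reads probe events from Kafka with Structured Streaming, applies a simple number-mapping rule and replaces null values. The topic, schema, prefix rule and broker address are hypothetical, and the Kafka source requires the spark-sql-kafka package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("TelcoETL").getOrCreate()

# Hypothetical record schema for a network probe event.
schema = (StructType()
          .add("msisdn", StringType())
          .add("cell_id", StringType())
          .add("throughput_mbps", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "probe-events")                # hypothetical topic
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

cleaned = (events
           # Number mapping: normalise national numbers to an E.164-style prefix (illustrative rule).
           .withColumn("msisdn",
                       F.when(F.col("msisdn").startswith("0"),
                              F.concat(F.lit("+46"),
                                       F.expr("substring(msisdn, 2, length(msisdn))")))
                        .otherwise(F.col("msisdn")))
           # Replace null measurements with a neutral default.
           .fillna({"throughput_mbps": 0.0}))

query = cleaned.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```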
Machine Learning (ML):
For our telecom domain, we need to create both trained models and test data in a streaming manner. We have tried various approaches to updating the model as new data points stream in, and found that hierarchical models were much easier to update incrementally. These hierarchical models can be easily deployed on the Spark Streaming framework, as its micro-batch processing fits this kind of model preparation. We also found that, with the flexibility and pure streaming nature of Apache Flink, reinforcement learning can be realized easily, and the performance metrics of these implementations are quite competitive compared with other frameworks.
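As a minimal sketch of updating a model incrementally as data streams in (not the hierarchical models described above), Spark MLlib ships streaming learners such as StreamingKMeans, which refines its cluster centres on every micro-batch. The queue-based input, dimensions and parameters below are illustrative only.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="StreamingModelUpdate")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Illustrative training stream: in practice this would come from Kafka or HDFS.
training = ssc.queueStream([
    sc.parallelize([Vectors.dense([1.0, 2.0]), Vectors.dense([8.0, 9.0])]),
    sc.parallelize([Vectors.dense([1.5, 2.5]), Vectors.dense([7.5, 8.5])]),
])

# The model is updated on each micro-batch rather than retrained from scratch;
# decayFactor controls how quickly old batches are forgotten.
model = StreamingKMeans(k=2, decayFactor=0.8).setRandomCenters(dim=2, weight=1.0, seed=42)
model.trainOn(training)

# Predictions always use the latest cluster centres.
model.predictOn(training).pprint()

ssc.start()
ssc.awaitTermination()
```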
Sink Phase:
After the data processing layer, we can store data in various ways: in a permanent data store, in distributed memory, back onto a message bus, or simply as visualized data points. In our internal study, we stored the processed data in Cassandra, a NoSQL data store that prioritises availability when a network partition occurs. Having worked with Apache Cassandra in telecom applications for some years, we have found it very amenable to fine-tuning the balance between consistency and availability. Even though I get my hands dirty in the Hadoop ecosystem on a regular basis, I was inspired by the fact that we can play with consistency levels in Cassandra, whereas I was not able to change availability levels in HBase.
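One way to land a processed stream in Cassandra is the DataStax spark-cassandra-connector driven from Structured Streaming via foreachBatch; the connection host, keyspace, table and checkpoint path below are illustrative assumptions about the deployment, and the connector package must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CassandraSink")
         .config("spark.cassandra.connection.host", "cassandra1")  # hypothetical cluster address
         .getOrCreate())

def write_to_cassandra(batch_df, batch_id):
    """Write each micro-batch to Cassandra using the spark-cassandra-connector."""
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="telco", table="probe_metrics")  # hypothetical keyspace/table
     .mode("append")
     .save())

# `cleaned` is the streaming DataFrame produced in the earlier ETL sketch.
query = (cleaned.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/probe_metrics")
         .start())
```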
We also need to store the data at the “best” site. A resource may be created by an executor on site A and written to storage A, while the client application always queries it from site B; this requires us to determine whether it is better to store the resource at site B to ensure data locality. An internal optimization at the sink level takes care of this data locality.
In this post, we have hopefully addressed some of the concerns around stream processing frameworks and the best ways of working with them. This introduction to data pipelining in streaming systems should help you tune your systems for good performance. How streaming engines can support 5G Network Slicing and IoT data will be covered in our next blog. Stay tuned!
In the meantime, recap by reading our other blogs about Apache Storm Performance Tuners and a comparison of Apache Storm vs. Spark Streaming.