Thursday, May 24, 2018

Streaming Moving Averages with Apache Spark

Target audience: Intermediate
Estimated reading time: 5'

Spark Streaming offers an abstraction called Discretized Stream, or DStream, which signifies a continuous flow of data. This can either stem from an initial source or emerge as a result of transforming the input data. Essentially, a DStream is characterized by a consistent sequence of Resilient Distributed Datasets (RDD) [1].


Follow me on LinkedIn
Important notes
  • This post describes the streaming features as implemented in Apache Spark 2.0.x 
  • Stream should not confused with structured streaming [2].

Use case

The creation of models through supervised learning from a given training sets usually requires the application of some pre-processing algorithm such as smoothing, filtering or extrapolating existing data for missing values. 

Computing the moving average of a time series is a simple and convenient technique used to filter out noise and reduce the impact of outliers on trend lines. For this post, we consider the simplest form of moving average over a period of p values, as defined in the following formula \[\tilde{x_{t}}= \frac{1}{p}\sum_{j=t-p+1}^{t}x_{j}\] 
The iterative (or recursive) form is defined as \[\tilde{x_{t}}=\tilde{x_{t-1}}+\frac{1}{p}(x_{t}-x_{t-p})\] 

Spark DStreams

This section refers to discretized streams that were introduced in Spark 1.x. Structured streaming will be discussed in a future post.



Thank you for reading this article. For more information ...

References



---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3