This problem primarily affects sliding window operations in Spark Streaming.
Consider the following scenario:
- A DStream is created from any source. (I've verified this with file and socket sources.)
- No actions are applied directly to this DStream.
- A sliding window operation is applied to this DStream, and an action is applied to the windowed DStream.
In this case, Spark will not even read the input stream in batches whose batch time is not a multiple of the slide interval. Put another way, it skips reading the input whenever it does not need to apply the window function. This happens because all transformations in Spark are lazy: with no action on the original DStream, nothing forces those batches to be materialized.
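The lazy-evaluation behaviour described above is not specific to Spark. As a minimal, self-contained analogy (the class and counter below are illustrative, not Spark API), plain java.util.stream shows the same pattern: intermediate operations like map() correspond to Spark transformations and execute nothing until a terminal operation, the analogue of a Spark action, consumes the stream.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: mimics Spark's transformation/action laziness
// using Java streams. "reads" counts how many elements the pipeline
// has actually processed.
public class LazyDemo {
    public static int reads = 0;

    // map() is an intermediate (lazy) operation: building this pipeline
    // does not touch the data, so the counter stays at zero until a
    // terminal operation runs.
    public static Stream<String> transformedOnly() {
        return Stream.of("a", "b", "c")
                     .map(s -> { reads++; return s.toUpperCase(); });
    }

    public static void main(String[] args) {
        Stream<String> s = transformedOnly();
        System.out.println("reads before terminal op: " + reads); // still 0
        List<String> out = s.collect(Collectors.toList());        // terminal op
        System.out.println("reads after terminal op: " + reads);  // now 3
    }
}
```

In the same way, applying only window() to a DStream builds a lazy pipeline; the upstream batches are read only when the windowed action actually fires.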
How to fix or work around it (note the print() call on the third line of the code below):
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
inputStream.print(); // This is the workaround: an action on the pre-window DStream forces every batch to be read
JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen * 60 * 1000), new Duration(slideInt * 60 * 1000));
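For concreteness, here is a small sketch (the class and helper name are hypothetical, not Spark API) of which batches the windowed action fires in, assuming a 1-minute batch interval and a slide interval expressed in batch-interval units. Only batch times that are multiples of the slide interval trigger the window computation, so without an action such as print() on inputStream itself, the other batches never force a read.

```java
// Hypothetical illustration of the windowed DStream's schedule: with
// slideInt = 2 (i.e. a 2-minute slide on a 1-minute batch interval),
// the window action fires only at even batch indices.
public class WindowSchedule {
    // True if the window computation fires at batch index t
    // (1-based, in batch-interval units): t must be a multiple
    // of the slide interval.
    public static boolean windowFires(int t, int slideInt) {
        return t % slideInt == 0;
    }

    public static void main(String[] args) {
        int slideInt = 2;
        for (int t = 1; t <= 6; t++) {
            System.out.println("batch " + t + " -> window fires: "
                               + windowFires(t, slideInt));
        }
    }
}
```

Under this sketch, batches 1, 3, and 5 trigger no window output, and without the print() workaround those batches' input would never be read at all.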
The "Window operations" example in the streaming programming guide implies that Spark reads the stream in every batch, which is not what happens, because of the lazy transformations.
In most cases where a sliding window is used, no action will be taken on the pre-window batches, so my expectation was that Streaming would still read every batch as long as an action is taken on the windowed stream.
As per Tathagata,
"Ideally every batch should read based on the batch interval provided in the StreamingContext."
Refer to the original thread at http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html for more details, including Tathagata's conclusion.