Note: this JIRA has changed since its inception. It does not describe a bug, but behavior that can be hard to infer from the existing docs, so the attached patch is a documentation improvement.
Below is the original JIRA which was filed:
Please note that I'm somewhat new to Spark Streaming's API and am not a Spark expert, so I've done my best to write up and reproduce this "bug". If it isn't a bug, I hope an expert will explain why and promptly close this issue. That said, after discussing it with R J Nowling, who is a Spark contributor, it appears it could be a bug.
It appears that, in a DStream context, a call to MappedRDD.count() blocks progress and prevents further RDDs from being emitted by the stream.
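The original snippet is not reproduced here; the following is a minimal sketch of the scenario as described, assuming a socket-based DStream source and local execution. All names (the app name, host, port, and the interval counter) are illustrative, not taken from the original report.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountBlocksSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CountBlocksSketch")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Hypothetical input source; the original report does not say which was used.
    val lines = ssc.socketTextStream("localhost", 9999)

    var intervals = 0
    lines.foreachRDD { rdd =>
      // Uncommenting the next line is what reportedly stalls the stream:
      // rdd.count()
      intervals += 1
      if (intervals >= 20) ssc.stop()  // expected to be reached after 20 intervals
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```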
The above code block should eventually halt after 20 intervals of RDDs. However, if we uncomment the call to rdd.count(), we instead get an infinite stream that emits no RDDs, and the program runs forever (ssc.stop is unreachable) because foreach no longer receives any entries.
I suspect this is actually because the foreach block never completes: count() winds up calling compute(), which ultimately just reads from the stream.
I haven't put together a minimal reproducer or unit test yet, but I can work on one if more information is needed.
I guess this could be seen as an application bug, but I think Spark could be made smarter about detecting blocking calls inside a stream processor and failing loudly when it sees one.