Note: this JIRA has changed since its inception. It does not describe a bug, but behavior that can be hard to infer from the existing docs, so the attached patch is a documentation improvement.
Below is the original JIRA which was filed:
Please note that I'm somewhat new to Spark Streaming's API and am not a Spark expert, so I've done my best to write up and reproduce this "bug". If it isn't a bug, I hope an expert will explain why and promptly close this issue. That said, after discussing it with R J Nowling, who is a Spark contributor, it appears it could be a bug.
It appears that, in a DStream context, a call to MappedRDD.count() blocks progress and prevents further RDDs from being emitted by the stream.
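The original snippet is not reproduced here; the following is a minimal sketch of the scenario as described, assuming a socket-based DStream source and local execution. All names (the app name, host, port, and the interval counter) are illustrative, not taken from the original report.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountBlocksSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CountBlocksSketch")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Hypothetical input source; the original report does not say which was used.
    val lines = ssc.socketTextStream("localhost", 9999)

    var intervals = 0
    lines.foreachRDD { rdd =>
      // Uncommenting the next line is what reportedly stalls the stream:
      // rdd.count()
      intervals += 1
      if (intervals >= 20) ssc.stop()  // expected to be reached after 20 intervals
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```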
The above code block should eventually halt after 20 intervals of RDDs. However, if we uncomment the call to rdd.count(), we instead get an infinite stream that emits no RDDs, and the program runs forever (ssc.stop is unreachable) because foreach no longer receives any entries.
I suspect this is actually because the foreach block never completes: count() winds up calling compute(), which ultimately just reads from the stream.
I haven't put together a minimal reproducer or unit test yet, but I can work on one if more information is needed.
I guess this could be seen as an application bug, but I think Spark could be made smarter about detecting blocking calls inside a stream processor and failing loudly when it sees one.