Description
In PySpark, when certain kinds of jobs fail, Spark hangs instead of returning an error. This is partly a scheduler problem: the scheduler sometimes marks failed tasks as successful even though they produced a stack trace and an exception.
You can reproduce this problem with:
ardd = sc.parallelize([(1,2,3), (4,5,6)])
brdd = sc.parallelize([(1,2,6), (4,5,9)])
ardd.join(brdd).count()
The last line runs forever. The bug in this snippet is that the RDD entries are 3-tuples, while join expects each element to be a (key, value) 2-tuple. I haven't verified whether this affects 1.0 as well as 0.9.
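To illustrate why the 3-tuples break, here is a pure-Python sketch of the pair assumption that a hash join makes (this is an illustration of the semantics, not PySpark's actual implementation):

```python
from collections import defaultdict

def pair_join(a, b):
    """Sketch of join semantics: every element must unpack as (key, value)."""
    left = defaultdict(list)
    for k, v in a:  # a 3-tuple raises ValueError here: too many values to unpack
        left[k].append(v)
    out = []
    for k, w in b:
        for v in left[k]:
            out.append((k, (v, w)))
    return out

# With proper (key, value) 2-tuples the join works:
print(pair_join([(1, 2), (4, 5)], [(1, 6), (4, 9)]))
# → [(1, (2, 6)), (4, (5, 9))]

# With the 3-tuples from the reproduction above, unpacking fails:
try:
    pair_join([(1, 2, 3), (4, 5, 6)], [(1, 2, 6), (4, 5, 9)])
except ValueError as e:
    print("unpack error:", e)
```

In a plain Python process this raises ValueError immediately; the bug reported here is that when the same unpacking failure happens inside a Spark task, the job hangs instead of surfacing the error.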
Thanks to Shivaram for helping diagnose this issue!