Description
When running reduceByKey over a cached RDD, the Python worker fails with an exception, but the failure is not detected by the task runner. Spark and the pyspark shell hang waiting for the task to finish.
The error is:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in dump_stream
    self._write_with_length(obj, stream)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in _write_with_length
    stream.write(serialized)
IOError: [Errno 104] Connection reset by peer

14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms

Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in launch_worker
    worker(listen_sock)
  File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker
    outfile.flush()
IOError: [Errno 32] Broken pipe
I can reproduce the error by running take(10) on the cached RDD before running reduceByKey (which scans the whole input file); see the sketch below.
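For reference, a minimal reproduction sketch along those lines. The input path, key/value layout, and master URL are assumptions for illustration; the original report does not include the exact script:

from pyspark import SparkContext

sc = SparkContext("local", "repro")  # assumed master; original ran on a cluster

# A large text file, mapped to (key, value) pairs and cached.
rdd = sc.textFile("/path/to/large/input.txt") \
        .map(lambda line: (line.split("\t")[0], 1)) \
        .cache()

rdd.take(10)  # materializes only the first partition

# reduceByKey scans the whole input; the worker dies with the exception
# above, but collect() hangs instead of surfacing the failure.
counts = rdd.reduceByKey(lambda a, b: a + b)
result = counts.collect()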
Affects Version: 1.0.0-SNAPSHOT (commit 4d88030486)