Details
Description
Zipping two RDDs that use AutoBatchedSerializer fails; this bug was introduced by SPARK-4841.
>> a.zip(b).count()
15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func
    return f(iterator)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in load_stream
    " in pair: (%d, %d)" % (len(keys), len(vals)))
ValueError: Can not deserialize RDD with different number of items in pair: (123, 64)
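The failure mode can be illustrated without Spark. Below is a minimal plain-Python sketch (the helper names `batch` and `zip_batched` are hypothetical, not PySpark APIs) of how zipping two batched streams goes wrong when the two sides were serialized with different batch sizes, which AutoBatchedSerializer can produce: the deserializer pairs up one batch from each stream and expects them to have the same length.

```python
def batch(items, size):
    """Yield successive batches (lists) of at most `size` items,
    mimicking a batched serializer's output stream."""
    buf = []
    for x in items:
        buf.append(x)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def zip_batched(stream_a, stream_b):
    """Pair corresponding batches from two streams, element for element,
    in the spirit of PySpark's pair deserializer."""
    for keys, vals in zip(stream_a, stream_b):
        if len(keys) != len(vals):
            raise ValueError(
                "Can not deserialize RDD with different number of items"
                " in pair: (%d, %d)" % (len(keys), len(vals)))
        for pair in zip(keys, vals):
            yield pair

a = range(200)
b = range(200)

# Same batch size on both sides: batches line up and zipping succeeds.
ok = list(zip_batched(batch(a, 64), batch(b, 64)))
assert len(ok) == 200

# Different batch sizes (as when AutoBatchedSerializer settles on 123
# items per batch on one side and 64 on the other): the very first pair
# of batches mismatches, reproducing the error in the traceback above.
try:
    list(zip_batched(batch(a, 123), batch(b, 64)))
except ValueError as e:
    print(e)
```

The fix direction implied by the report is to make both sides of the zip use the same (fixed) batch size before pairing batches, rather than letting each side auto-tune independently.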
Issue Links
- duplicates: SPARK-6008 zip two rdds derived from pickleFile fails (Resolved)
- relates to: SPARK-4841 Batch serializer bug in PySpark's RDD.zip (Resolved)