Description
PySpark fails to deserialize double-zipped RDDs. For example, the following used to work in Spark 2.0.2:
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip(b.zip(c))
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
But in Spark >= 2.1.0, the same code fails (on both Python 2 and Python 3):
>>> a_bc.collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", line 329, in _load_stream_without_unbatching
    if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
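For context, the failing objects are apparently ordinary zip iterators (presumably the batches yielded by the inner zip's deserializer), and those simply don't support len(). The same limitation is easy to reproduce outside Spark (plain Python, no Spark needed):

# Illustration only: zip/izip results are iterators, so len() on them
# raises the same TypeError seen in the traceback above.
try:
    from itertools import izip as zip_iter   # Python 2
except ImportError:
    zip_iter = zip                            # Python 3

batch = zip_iter('bbb', 'ccc')
print(hasattr(batch, '__len__'))              # False
print(len(list(batch)))                       # 3 -- materializing it first works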
As you can see, the error seems to be caused by a check in the PairDeserializer class:
if len(key_batch) != len(val_batch):
    raise ValueError("Can not deserialize PairRDD with different number of items"
                     " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
If that check is removed, then the example above works without error. Can the check simply be removed?
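If dropping the check entirely seems too aggressive, a less invasive option might be to keep the sanity check but materialize iterator batches before comparing lengths. A sketch of that idea (not an actual patch, just the lines around the existing check, using the key_batch/val_batch locals from the code above):

# Sketch only: if a batch comes from a nested zip it may be an iterator
# (e.g. itertools.izip) with no __len__, so convert it to a list first
# instead of removing the length check.
key_batch = key_batch if hasattr(key_batch, '__len__') else list(key_batch)
val_batch = val_batch if hasattr(val_batch, '__len__') else list(val_batch)
if len(key_batch) != len(val_batch):
    raise ValueError("Can not deserialize PairRDD with different number of items"
                     " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))

That would preserve the protection against mismatched batch sizes while still handling double-zipped RDDs.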