Spark / SPARK-21985

PySpark PairDeserializer is broken for double-zipped RDDs


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.1.2, 2.2.1, 2.3.0
    • Component/s: PySpark

      Description

      PySpark fails to deserialize double-zipped RDDs. For example, the following worked in Spark 2.0.2:

      >>> a = sc.parallelize('aaa')
      >>> b = sc.parallelize('bbb')
      >>> c = sc.parallelize('ccc')
      >>> a_bc = a.zip(b.zip(c))
      >>> a_bc.collect()
      [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
      

      But in Spark >= 2.1.0 it fails, under both Python 2 and Python 3:

      >>> a_bc.collect()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", line 329, in _load_stream_without_unbatching
          if len(key_batch) != len(val_batch):
      TypeError: object of type 'itertools.izip' has no len()
      
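      The failure is easy to reproduce outside Spark: when the value side of a pair is itself a zipped stream, the deserializer receives a lazy zip iterator (itertools.izip on Python 2, the builtin zip on Python 3), and such iterators do not support len(). A minimal standalone sketch of the underlying problem:

```python
# A zipped batch is a lazy iterator, not a sequence: calling len() on it
# raises the same kind of TypeError as in the traceback above.
# (Python 3 builtin zip shown here; Python 2 PySpark used itertools.izip.)
val_batch = zip(['b', 'b', 'b'], ['c', 'c', 'c'])

try:
    len(val_batch)
except TypeError as exc:
    print(exc)  # e.g. "object of type 'zip' has no len()"
```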

      As the traceback shows, the error is raised by a length check in the PairDeserializer class:

      if len(key_batch) != len(val_batch):
          raise ValueError("Can not deserialize PairRDD with different number of items"
                           " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
      

      If that check is removed, then the example above works without error. Can the check simply be removed?
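      An alternative to deleting the check outright would be to keep the sanity check but materialize any batch that is a lazy iterator before comparing lengths. The sketch below illustrates that idea only; the helper name check_and_zip_batches is hypothetical and not PySpark's actual API:

```python
def check_and_zip_batches(key_batch, val_batch):
    # Hypothetical helper sketching one possible fix: batches produced by
    # a nested zip are lazy iterators without __len__, so materialize
    # them into lists first rather than dropping the length check.
    if not hasattr(key_batch, '__len__'):
        key_batch = list(key_batch)
    if not hasattr(val_batch, '__len__'):
        val_batch = list(val_batch)
    if len(key_batch) != len(val_batch):
        raise ValueError("Can not deserialize PairRDD with different number of items"
                         " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
    return zip(key_batch, val_batch)

# The double-zipped case from the description now passes the check:
pairs = list(check_and_zip_batches(['a', 'a', 'a'],
                                   zip(['b', 'b', 'b'], ['c', 'c', 'c'])))
print(pairs)  # [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
```

      Mismatched batch lengths still raise the original ValueError, so the safety net is preserved.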

            People

            • Assignee: Andrew Ray (a1ray)
            • Reporter: Stuart Berg (stuarteberg)
