SPARK-21985: PySpark PairDeserializer is broken for double-zipped RDDs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.1.2, 2.2.1, 2.3.0
    • Component/s: PySpark

    Description

      PySpark fails to deserialize double-zipped RDDs. For example, the following code used to work in Spark 2.0.2:

      >>> a = sc.parallelize('aaa')
      >>> b = sc.parallelize('bbb')
      >>> c = sc.parallelize('ccc')
      >>> a_bc = a.zip( b.zip(c) )
      >>> a_bc.collect()
      [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
      

      But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):

      >>> a_bc.collect()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", line 329, in _load_stream_without_unbatching
          if len(key_batch) != len(val_batch):
      TypeError: object of type 'itertools.izip' has no len()
      

      As you can see, the error seems to be caused by a check in the PairDeserializer class:

      if len(key_batch) != len(val_batch):
          raise ValueError("Can not deserialize PairRDD with different number of items"
                           " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
      
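      The check trips because, with a double-zipped RDD, the inner PairDeserializer hands its batches to the outer one as iterator objects (itertools.izip on Python 2, zip on Python 3), and iterators do not support len(). The TypeError itself is easy to reproduce outside of Spark:

      >>> import itertools
      >>> pairs = itertools.izip('bbb', 'ccc')  # plain zip() on Python 3
      >>> len(pairs)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: object of type 'itertools.izip' has no len()
      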

      If that check is removed, then the example above works without error. Can the check simply be removed?
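
      Alternatively, the check could be kept if iterator batches are materialized into lists before their lengths are compared. A minimal sketch of that approach, assuming it replaces the length comparison shown above inside PairDeserializer._load_stream_without_unbatching (not necessarily the exact patch that was merged):

      # Batches produced by a nested PairDeserializer are izip/zip iterators
      # with no __len__; materialize those so the sanity check still applies.
      key_batch = key_batch if hasattr(key_batch, "__len__") else list(key_batch)
      val_batch = val_batch if hasattr(val_batch, "__len__") else list(val_batch)
      if len(key_batch) != len(val_batch):
          raise ValueError("Can not deserialize PairRDD with different number of items"
                           " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
      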

Attachments

Activity

People

    Assignee: Andrew Ray (a1ray)
    Reporter: Stuart Berg (stuarteberg)
    Votes: 0
    Watchers: 3

Dates

    Created:
    Updated:
    Resolved: