Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27810

PySpark breaks Cloudpickle serialization of collections.namedtuple objects

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 2.4.3
    • None
    • PySpark
    • None

    Description

      After importing pyspark, cloudpickle is no longer able to properly serialize objects inheriting from collections.namedtuple, and drops all other class data such that calls to isinstance will fail.

      Here's a minimal reproduction of the issue:

      import collections
      import cloudpickle
      import pyspark

      class A(object):
          pass

      class B(object):
          pass

      class C(A, B, collections.namedtuple('C', ['field'])):
          pass

      c = C(1)

      def print_bases(obj):
          bases = obj._class.bases_
          for base in bases:
              print(base)

      print('original objects:')
      print_bases(c)

      print('\ncloudpickled objects:')
      c2 = cloudpickle.loads(cloudpickle.dumps(c))
      print_bases(c2)

      This prints:

      original objects:
      <class '_main_.A'>
      <class '_main_.B'>
      <class 'collections.C'>

      cloudpickled objects:
      <class 'tuple'>

      Effectively dropping all other types.  It appears this issue is being caused by the _hijack_namedtuple function, which replaces the namedtuple class with another one.

      Note that I can workaround this issue by setting collections.namedtuple.__hijack = 1 before importing pyspark, so I feel pretty confident this is what's causing the issue.

      This issue comes up when working with TensorFlow feature columns, which derive from collections.namedtuple among other classes.

      Cloudpickle also supports collections.namedtuple serialization, but doesn't appear to need to replace the class. Possibly PySpark can do something similar?

       

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              tgaddair Travis Addair
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: