Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.4.3
- Fix Version/s: None
- Component/s: None
Description
After importing pyspark, cloudpickle is no longer able to properly serialize objects that inherit from collections.namedtuple: all other base classes are dropped, so subsequent isinstance checks against them fail.
Here's a minimal reproduction of the issue:
import collections
import cloudpickle
import pyspark

class A(object):
    pass

class B(object):
    pass

class C(A, B, collections.namedtuple('C', ['field'])):
    pass

c = C(1)

def print_bases(obj):
    bases = obj.__class__.__bases__
    for base in bases:
        print(base)

print('original objects:')
print_bases(c)

print('\ncloudpickled objects:')
c2 = cloudpickle.loads(cloudpickle.dumps(c))
print_bases(c2)
This prints:
original objects:
<class '__main__.A'>
<class '__main__.B'>
<class 'collections.C'>
cloudpickled objects:
<class 'tuple'>
This effectively drops all of the other base types. The issue appears to be caused by PySpark's _hijack_namedtuple function, which replaces the class generated by collections.namedtuple with a different one.
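A minimal stdlib-only sketch (independent of PySpark internals) of why replacing the generated class loses the other bases:

```python
import collections

class A(object):
    pass

# The original class mixes A into a namedtuple subclass.
class C(A, collections.namedtuple('C', ['field'])):
    pass

# If serialization substitutes a freshly generated namedtuple for C
# (as a _hijack_namedtuple-style rewrite effectively does), the
# rebuilt class knows nothing about A.
Rebuilt = collections.namedtuple('C', ['field'])

assert A in C.__mro__            # the original class keeps A
assert A not in Rebuilt.__mro__  # the rebuilt class has lost it
assert not isinstance(Rebuilt(1), A)
```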
Note that I can work around this issue by setting collections.namedtuple.__hijack = 1 before importing pyspark, so I feel pretty confident this is what's causing the issue.
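For reference, a sketch of that workaround (the attribute name is taken from this report; pyspark itself is not imported here):

```python
import collections

# Per the report, setting this flag before importing pyspark causes
# its _hijack_namedtuple patching to be skipped. Functions accept
# arbitrary attributes, so this is a plain attribute assignment.
collections.namedtuple.__hijack = 1

# import pyspark  # must happen only after the flag is set
```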
This issue comes up when working with TensorFlow feature columns, which derive from collections.namedtuple among other classes.
Cloudpickle itself supports serializing collections.namedtuple subclasses, but doesn't appear to need to replace the class. Possibly PySpark could take a similar approach?
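One possible shape for such an approach, sketched with a hypothetical helper (_rebuild_namedtuple_subclass is not a real PySpark or cloudpickle API): record the field names together with the other bases, and rebuild the class with its full MRO instead of substituting a bare namedtuple.

```python
import collections

def _rebuild_namedtuple_subclass(name, extra_bases, fields):
    """Hypothetical reducer target: recreate a namedtuple subclass
    while keeping its non-namedtuple bases intact."""
    nt = collections.namedtuple(name, fields)
    return type(name, tuple(extra_bases) + (nt,), {})

class A(object):
    pass

# Simulate unpickling: rebuild 'C' from its recorded pieces.
C2 = _rebuild_namedtuple_subclass('C', [A], ['field'])
c2 = C2(1)

assert isinstance(c2, A)  # the extra base survives
assert c2.field == 1      # namedtuple behavior survives
```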
Attachments
Issue Links
- duplicates
  - SPARK-32079 PySpark <> Beam pickling issues for collections.namedtuple (Resolved)
- relates to
  - SPARK-22674 PySpark breaks serialization of namedtuple subclasses (Resolved)