SPARK-10544

Serialization of Python namedtuple subclasses in functions / closures is broken


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: PySpark
    • Labels: None

      Description

      The following example works on Spark 1.4.1 but not in 1.5:

      from collections import namedtuple
      Person = namedtuple("Person", "id firstName lastName")
      rdd = sc.parallelize([1]).map(lambda x: Person(1, "Jon", "Doe"))
      rdd.count()
      

      In 1.5, this fails with "AttributeError: 'builtin_function_or_method' object has no attribute '__code__'".

      Digging a bit deeper, it seems that the problem is the serialization of the Person class itself, since serializing instances of the class in the closure seems to work properly:

      from collections import namedtuple
      Person = namedtuple("Person", "id firstName lastName")
      jon = Person(1, "Jon", "Doe")
      rdd = sc.parallelize([1]).map(lambda x: jon)
      rdd.count()
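
      The distinction between the two paths can be sketched locally with the standard-library pickle module (a stand-in for PySpark's bundled cloudpickle; note pickle serializes a module-level class by reference, whereas cloudpickle serializes it by value, so this only illustrates the shape of the two code paths, not the 1.5 failure itself):

```python
import pickle
from collections import namedtuple

Person = namedtuple("Person", "id firstName lastName")

# Serializing an instance pickles only the field values plus a
# reference to the class, so this path keeps working.
jon = Person(1, "Jon", "Doe")
assert pickle.loads(pickle.dumps(jon)) == jon

# Serializing the class object itself is the separate code path that
# the first example exercises via the lambda's closure.
PersonClass = pickle.loads(pickle.dumps(Person))
assert PersonClass(1, "Jon", "Doe") == jon
```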
      

      It looks like PySpark has unit tests for serializing individual namedtuples with cloudpickle.dumps and for serializing RDDs of namedtuples, but I don't think that we have any tests for namedtuple classes in closures.
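
      A minimal regression test for that gap might look like the following sketch. Standard-library pickle stands in for cloudpickle here (pickle handles a top-level function by reference rather than serializing its closure by value, so this only mirrors the structure such a test would have, not cloudpickle's code path):

```python
import pickle
from collections import namedtuple

Person = namedtuple("Person", "id firstName lastName")

def make_person(x):
    # Mirrors the failing map() lambda: the namedtuple *class* is a
    # free variable of the function being serialized.
    return Person(x, "Jon", "Doe")

# Round-trip the function, then call the restored copy.
restored = pickle.loads(pickle.dumps(make_person))
assert restored(1) == Person(1, "Jon", "Doe")
```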


              People

              • Assignee: Unassigned
              • Reporter: joshrosen (Josh Rosen)
              • Votes: 0
              • Watchers: 2
