The following example works on Spark 1.4.1 but not in 1.5:
    from collections import namedtuple
    Person = namedtuple("Person", "id firstName lastName")
    rdd = sc.parallelize([1]).map(lambda x: Person(1, "Jon", "Doe"))
    rdd.count()
In 1.5, this fails with "AttributeError: 'builtin_function_or_method' object has no attribute '__code__'".
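For context, only functions written in Python carry a `__code__` attribute; built-in callables do not. The sketch below reproduces the same message outside Spark. It only illustrates where such a message can come from; whether this is the exact path taken by the 1.5 serializer is an assumption, not something this report confirms. The helper name `has_code_object` is made up for illustration.

```python
def has_code_object(obj):
    """Return True if obj exposes a Python code object via __code__."""
    return hasattr(obj, "__code__")


python_function = lambda x: x  # defined in Python, so it has __code__
builtin_function = len         # implemented in C, so it has no __code__

print(has_code_object(python_function))
print(has_code_object(builtin_function))

# Accessing __code__ on a built-in raises the same kind of AttributeError
# that appears in the report.
try:
    builtin_function.__code__
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)
```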
Digging a bit deeper, it seems that the problem is the serialization of the Person class itself, since serializing instances of the class in the closure seems to work properly:
    from collections import namedtuple
    Person = namedtuple("Person", "id firstName lastName")
    jon = Person(1, "Jon", "Doe")
    rdd = sc.parallelize([1]).map(lambda x: jon)
    rdd.count()
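The difference between the two examples can be seen without Spark by inspecting which names each lambda refers to: the first lambda references the class `Person`, while the second references only the instance `jon`. This is a minimal sketch; the helper `referenced_names` is hypothetical and simply combines the global and free variable names recorded on the function's code object.

```python
from collections import namedtuple


def referenced_names(func):
    """Names a function looks up outside its own local scope."""
    code = func.__code__
    return set(code.co_names) | set(code.co_freevars)


Person = namedtuple("Person", "id firstName lastName")
jon = Person(1, "Jon", "Doe")

uses_class = lambda x: Person(1, "Jon", "Doe")  # closure needs the class itself
uses_instance = lambda x: jon                   # closure needs only an instance

print(referenced_names(uses_class))
print(referenced_names(uses_instance))
```

A closure serializer has to ship whatever these names point to, so the first case requires serializing the namedtuple class and the second requires serializing only an instance of it.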
It looks like PySpark has unit tests for serializing individual namedtuples with cloudpickle.dumps and for serializing RDDs of namedtuples, but I don't think that we have any tests for namedtuple classes in closures.
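A regression check for the missing case might look like the sketch below. It assumes the standalone `cloudpickle` package is importable (PySpark ships its own copy, which is what an actual PySpark test would exercise) and returns `None` when it is not. The function names are hypothetical, and the namedtuple class is created inside the function rather than at module level, which differs slightly from the examples in this report.

```python
import pickle
from collections import namedtuple


def load_cloudpickle():
    """Return the cloudpickle module, or None if it is not installed."""
    try:
        import cloudpickle  # assumption: standalone package is available
        return cloudpickle
    except ImportError:
        return None


def roundtrip_namedtuple_class_closure():
    """Serialize a lambda that references a namedtuple class, then call the copy."""
    cp = load_cloudpickle()
    if cp is None:
        return None

    Person = namedtuple("Person", "id firstName lastName")
    func = lambda x: Person(1, "Jon", "Doe")

    # cloudpickle output is loadable with the standard pickle module.
    restored = pickle.loads(cp.dumps(func))
    return tuple(restored(None))


result = roundtrip_namedtuple_class_closure()
print(result)
```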
Issue Links
- is duplicated by SPARK-10542 (Resolved): The PySpark 1.5 closure serializer can't serialize a namedtuple instance.