[SPARK-32079] PySpark <> Beam pickling issues for collections.namedtuple - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.3.0
Component/s: PySpark
Labels:
- release-notes

Description

PySpark monkeypatching namedtuple makes it difficult/impossible to depickle collections.namedtuple instances from outside of a pyspark environment.

When PySpark has been loaded into the environment, any time that you try to pickle a namedtuple, you are only able to unpickle it from an environment where the hijack has been applied.

This conflicts directly when trying to use Beam from a non-Spark environment (namingly Flink or Dataflow) making it impossible to use the pipeline if it has a namedtuple loaded somewhere.

import collections
import dill
ColumnInfo = collections.namedtuple(
    "ColumnInfo",
    [
        "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
        "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
    ])
dill.dumps(ColumnInfo('test', int))

b'\x80\x03cdill.dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00main_q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'

import pyspark
import collections
import dill
ColumnInfo = collections.namedtuple(
    "ColumnInfo",
    [
        "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
        "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
    ])
dill.dumps(ColumnInfo('test', int))

b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'

Second pickled object can only be used from an environment with PySpark.

Attachments

Issue Links

is duplicated by

SPARK-22674 PySpark breaks serialization of namedtuple subclasses

Resolved

SPARK-27810 PySpark breaks Cloudpickle serialization of collections.namedtuple objects

Resolved

links to

[Github] Pull Request #34688 (HyukjinKwon)

Activity

People

Assignee:: Hyukjin Kwon

Reporter:: Gerard Casas Saez

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 24/Jun/20 01:30

Updated:: 12/Dec/22 18:11

Resolved:: 25/Nov/21 23:21