[SPARK-26566] Upgrade apache/arrow to 0.12.0 - ASF JIRA

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

Safe cast fails from numpy float64 array with nans to integer, ~~ARROW-4258~~
Java, Reduce heap usage for variable width vectors, ~~ARROW-4147~~
Binary identity cast not implemented, ~~ARROW-4101~~
pyarrow open_stream deprecated, use ipc.open_stream, ~~ARROW-4098~~
conversion to date object no longer needed, ~~ARROW-3910~~
Error reading IPC file with no record batches, ~~ARROW-3894~~
Signed to unsigned integer cast yields incorrect results when type sizes are the same, ~~ARROW-3790~~
from_pandas gives incorrect results when converting floating point to bool, ~~ARROW-3428~~
Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ~~ARROW-3048~~
Java update to official Flatbuffers version 1.9.0, ~~ARROW-3175~~

complete list here

PySpark requires the following fixes to work with PyArrow 0.12.0:

Encrypted PySpark worker fails because ChunkedStream is missing the closed property
pyarrow now converts dates to objects by default, which causes an error because the type is assumed to be datetime64
ArrowTests fail due to a difference in the raised error message
pyarrow.open_stream is deprecated
Tests fail because groupby adds an index column with a duplicate name
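
The first item in this list can be illustrated with a small sketch. It assumes the problem is that a stream wrapper is asked for a file-like `closed` attribute, and that one plausible fix is to delegate that attribute to the wrapped stream. The class below is a simplified, hypothetical stand-in written for this example, not PySpark's actual `ChunkedStream` implementation.

```python
import io


class ChunkedStream:
    """Hypothetical, simplified stream wrapper (not PySpark's real class)."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def write(self, data):
        # Forward writes to the underlying stream.
        return self.wrapped.write(data)

    def close(self):
        self.wrapped.close()

    @property
    def closed(self):
        # Delegate to the wrapped stream so callers that check
        # `stream.closed` see the underlying state instead of
        # raising AttributeError.
        return self.wrapped.closed


stream = ChunkedStream(io.BytesIO())
stream.write(b"abc")
was_open = not stream.closed
stream.close()
is_closed = stream.closed
```

Without the `closed` property, any consumer that inspects `stream.closed` would raise `AttributeError`, which matches the kind of failure the item describes.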

is a clone of

SPARK-23874 Upgrade apache/arrow to 0.10.0

is related to

SPARK-29875 Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x

relates to

SPARK-29376 Upgrade Apache Arrow to 0.15.1

links to

GitHub Pull Request #23657