Description
Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:
- Safe cast fails from numpy float64 array with NaNs to integer, ARROW-4258
- Java, reduce heap usage for variable width vectors, ARROW-4147
- Binary identity cast not implemented, ARROW-4101
- pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
- Conversion to date object no longer needed, ARROW-3910
- Error reading IPC file with no record batches, ARROW-3894
- Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
- from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
- Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
- Java update to official Flatbuffers version 1.9.0, ARROW-3175
- complete list here
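The open_stream rename (ARROW-4098) can be bridged with a small compatibility shim; this is a sketch, not Spark's actual code, and the helper name `open_arrow_stream` is hypothetical. It assumes only that `pyarrow.ipc.open_stream` exists from 0.12.0 onward and that the top-level `pyarrow.open_stream` exists (deprecated) on earlier releases:

```python
def open_arrow_stream(pa, source):
    """Open an Arrow record batch stream across pyarrow versions.

    `pa` is the imported pyarrow module, passed in so this sketch does not
    require pyarrow at import time; `source` is a readable file object or
    buffer. Hypothetical helper, for illustration only.
    """
    ipc = getattr(pa, "ipc", None)
    if ipc is not None and hasattr(ipc, "open_stream"):
        # pyarrow >= 0.12.0: the supported location of the API
        return ipc.open_stream(source)
    # Older releases: top-level pyarrow.open_stream (deprecated in 0.12.0)
    return pa.open_stream(source)
```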
PySpark requires the following fixes to work with PyArrow 0.12.0:
- Encrypted PySpark worker fails because ChunkedStream is missing a closed property
- PyArrow now converts dates to Python date objects by default, which causes an error where the type was assumed to be datetime64
- ArrowTests fail due to a difference in the raised error message
- pyarrow.open_stream is deprecated; use pyarrow.ipc.open_stream instead
- Tests fail because groupby adds an index column with a duplicate name
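The date-conversion change means columns that previously came back as datetime64-style values now arrive as datetime.date objects. A minimal sketch of the kind of coercion involved (the helper `coerce_dates_to_datetimes` is hypothetical, not Spark's actual fix):

```python
import datetime


def coerce_dates_to_datetimes(values):
    """Coerce plain datetime.date objects to datetime.datetime.

    PyArrow 0.12.0 returns datetime.date objects by default when
    converting date columns; callers that assumed datetime-like values
    can coerce them explicitly. Hypothetical helper, for illustration.
    """
    out = []
    for v in values:
        # datetime.datetime is a subclass of datetime.date, so exclude it
        if isinstance(v, datetime.date) and not isinstance(v, datetime.datetime):
            out.append(datetime.datetime(v.year, v.month, v.day))
        else:
            out.append(v)
    return out
```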
Issue Links
- is a clone of SPARK-23874 Upgrade apache/arrow to 0.10.0 (Closed)
- is related to SPARK-29875 Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x (Resolved)
- relates to SPARK-29376 Upgrade Apache Arrow to 0.15.1 (Resolved)