SPARK-26566

Upgrade apache/arrow to 0.12.0

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

      • Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
      • Java, Reduce heap usage for variable width vectors, ARROW-4147
      • Binary identity cast not implemented, ARROW-4101
      • pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
      • conversion to date object no longer needed, ARROW-3910
      • Error reading IPC file with no record batches, ARROW-3894
      • Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
      • from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
      • Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
      • Java update to official Flatbuffers version 1.9.0, ARROW-3175

      The complete list of changes is in the Arrow 0.12.0 release notes.
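For the open_stream deprecation (ARROW-4098), calling code can prefer the ipc namespace and fall back to the top-level function on older releases. The following is a minimal sketch, not Spark's actual change: `open_stream_compat` is a hypothetical helper, and the pyarrow module is passed in explicitly (an assumption made so the logic can be exercised without pyarrow installed).

```python
def open_stream_compat(pa, source):
    """Open an Arrow IPC stream reader, preferring pa.ipc.open_stream.

    `pa` is the imported pyarrow module (or a stand-in with the same
    attributes); `source` is a buffer or file-like object.
    """
    ipc = getattr(pa, "ipc", None)
    if ipc is not None and hasattr(ipc, "open_stream"):
        # Preferred spelling as of pyarrow 0.12.0 (ARROW-4098)
        return ipc.open_stream(source)
    # Fall back to the deprecated top-level function
    return pa.open_stream(source)
```

With pyarrow installed, this would be called as `open_stream_compat(pyarrow, buf)`.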

      PySpark requires the following fixes to work with PyArrow 0.12.0:

      • Encrypted PySpark worker fails because ChunkedStream is missing the closed property
      • pyarrow now converts dates to objects by default, which causes an error because the type is assumed to be datetime64
      • ArrowTests fails because the raised error message differs
      • pyarrow.open_stream is deprecated
      • Tests fail because groupby adds an index column with a duplicate name
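The first item can be illustrated with a stream wrapper that forwards a `closed` property to the object it wraps, so that a consumer querying that attribute does not raise AttributeError. This is a sketch under stated assumptions: `ChunkedWriter`, its length-prefixed framing, and the -1 end marker are hypothetical and are not claimed to match Spark's ChunkedStream.

```python
import io
import struct


class ChunkedWriter(object):
    """Hypothetical wrapper that writes length-prefixed chunks to `wrapped`."""

    def __init__(self, wrapped, buffer_size):
        self.wrapped = wrapped
        self.buffer_size = buffer_size
        self.buffer = bytearray()

    def write(self, data):
        self.buffer.extend(data)
        # Emit every full chunk that has accumulated
        while len(self.buffer) >= self.buffer_size:
            self._emit(self.buffer[:self.buffer_size])
            del self.buffer[:self.buffer_size]
        return len(data)

    def _emit(self, chunk):
        # 4-byte big-endian length, then the chunk payload
        self.wrapped.write(struct.pack("!i", len(chunk)))
        self.wrapped.write(bytes(chunk))

    def close(self):
        # Flush any partial chunk, then write -1 as an end-of-stream marker
        if self.buffer:
            self._emit(self.buffer)
            self.buffer = bytearray()
        self.wrapped.write(struct.pack("!i", -1))
        self.wrapped.close()

    @property
    def closed(self):
        # Delegate to the wrapped stream so file-like consumers can query it
        return self.wrapped.closed
```

Without the `closed` property, any library that treats the wrapper as a file-like object and reads `sink.closed` would fail, which is the kind of failure the item describes.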

       

            People

              Assignee: Bryan Cutler (bryanc)
              Reporter: Bryan Cutler (bryanc)