Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11463

[Python] Allow configuration of IpcWriterOptions 64Bit from PyArrow

    XMLWordPrintableJSON

Details

    Description

      For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to the `pyarrow.Table.take` on the table with combined chunks (1 chunk). Unfortunately, if such table contains large list data type, it's easy for the flattened table to contain more than 2**31 rows and serialization of the table with combined chunks (eg for Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`

      I couldn't find a way to enable 64bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64 bit setting; further the Python serialization APIs do not allow specification of IpcWriteOptions)

      I was able to serialize successfully after changing the default and rebuilding

      modified   cpp/src/arrow/ipc/options.h
      @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
         /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
         ///
         /// Some implementations may not be able to parse streams created with this option.
      -  bool allow_64bit = false;
      +  bool allow_64bit = true;
       
         /// \brief The maximum permitted schema nesting depth.
         int max_recursion_depth = kMaxNestingDepth;
      

      Attachments

        Issue Links

          Activity

            People

              sighingnow Tao He
              lausen Leonard Lausen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h
                  1h