Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
Description
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` is around 1000x slower than calling `pyarrow.Table.take` on the same table with its chunks combined into a single chunk. Unfortunately, if such a table contains a large list data type, it is easy for the flattened table to exceed 2**31 rows, and serialization of the combined-chunk table (e.g. for the Plasma store) then fails with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.
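A minimal sketch of the `take` slowdown, using illustrative sizes rather than the exact reproduction (the real case also involves a large list column, omitted here):

```python
import pyarrow as pa

# Illustrative sizes only, not the exact 2M-row / 20k-chunk, large-list
# reproduction: take() on a heavily chunked table is far slower than
# take() on the same data combined into a single chunk.
chunks = [pa.array(range(i * 100, (i + 1) * 100)) for i in range(20_000)]
table = pa.Table.from_arrays([pa.chunked_array(chunks)], names=["x"])
indices = pa.array([0, 1_000_000, 1_999_999])

table.take(indices)                    # many chunks: slow
table.combine_chunks().take(indices)   # single chunk: fast, but see below
```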
I couldn't find a way to enable 64-bit support for the serialization as called from Python (`IpcWriteOptions` in Python does not expose the `CIpcWriteOptions` 64-bit setting; further, the Python serialization APIs do not allow specifying `IpcWriteOptions`).
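For context, a sketch of where the capacity error surfaces. The table below is a tiny stand-in; with the report's combined-chunk, large-list data (over 2^31 - 1 flattened elements) the write raises `ArrowCapacityError`:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Stand-in table: with the report's data (a combined-chunk table whose
# flattened large-list column exceeds 2^31 - 1 elements) the write below
# raises pyarrow.lib.ArrowCapacityError, because the default
# IpcWriteOptions has allow_64bit = false and (per the report) the
# 64-bit setting is not reachable from Python.
table = pa.table({"x": pa.array([[1, 2], [3, 4, 5]])})

sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)  # fails at scale with ArrowCapacityError
buf = sink.getvalue()
```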
I was able to serialize successfully after changing the default and rebuilding:

modified: cpp/src/arrow/ipc/options.h

```diff
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```