Apache Arrow / ARROW-4099

[Python] Pretty printing very large ChunkedArray objects can use unbounded memory


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Labels: None

      Description

      In working on ARROW-2970, I have the following dataset:

      import numpy as np
      import pyarrow as pa

      # One small value followed by 2 * 1024 = 2048 values of 1 MiB each (~2 GB total)
      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      arr = np.array(values)

      arrow_arr = pa.array(arr)
      

      The resulting arrow_arr is a ChunkedArray with 129 chunks, each element of which is 1 MB of binary data. The repr of this object is over 600 MB:

      In [10]: rep = repr(arrow_arr)
      
      In [11]: len(rep)
      Out[11]: 637536258
      

      There are probably a number of failsafes we can implement to avoid badness in these pathological cases. Such cases may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this. One possible approach is sketched below.
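
      As a rough sketch only (not pyarrow's actual repr implementation), a bounded repr could cap the number of chunks shown, the number of elements shown per chunk, and the number of bytes shown per binary value. The truncated_repr helper and its three limits below are hypothetical names invented for this sketch, which assumes binary values like those in the example above:

      import pyarrow as pa

      # Hypothetical limits for this sketch; real failsafes could be tunable.
      MAX_CHUNKS = 3        # chunks to render before eliding the rest
      MAX_ELEMENTS = 10     # elements to render per chunk
      MAX_VALUE_BYTES = 32  # bytes of each binary value to render

      def truncated_repr(chunked: pa.ChunkedArray) -> str:
          """Render a bounded-size preview of a ChunkedArray."""
          lines = [f"ChunkedArray: {chunked.num_chunks} chunks, "
                   f"{len(chunked)} elements"]
          for chunk in chunked.chunks[:MAX_CHUNKS]:
              lines.append("[")
              # Slicing an Arrow Array is zero-copy, so at most
              # MAX_ELEMENTS scalars per chunk are converted to Python.
              for scalar in chunk[:MAX_ELEMENTS]:
                  value = scalar.as_py()
                  head = value[:MAX_VALUE_BYTES]
                  tail = (f" ... ({len(value)} bytes)"
                          if len(value) > MAX_VALUE_BYTES else "")
                  lines.append(f"  {head!r}{tail},")
              if len(chunk) > MAX_ELEMENTS:
                  lines.append(f"  ... {len(chunk) - MAX_ELEMENTS} more elements")
              lines.append("]")
          if chunked.num_chunks > MAX_CHUNKS:
              lines.append(f"... {chunked.num_chunks - MAX_CHUNKS} more chunks")
          return "\n".join(lines)

      With limits like these, truncated_repr(arrow_arr) for the ~2 GB array above would render on the order of a kilobyte of text rather than 600+ MB.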

            People

            • Assignee: Unassigned
            • Reporter: Wes McKinney (wesm)
