Apache Arrow / ARROW-4099

[Python] Pretty printing very large ChunkedArray objects can use unbounded memory


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Labels: None

      Description

      In working on ARROW-2970, I have the following dataset:

      import numpy as np
      import pyarrow as pa

      # One small value followed by 2 * 1024 = 2048 values of 1 MiB each (~2 GB total)
      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      arr = np.array(values)

      arrow_arr = pa.array(arr)
      

      The resulting arrow_arr is a ChunkedArray with 129 chunks, each element of which is 1 MB of binary data. The repr of this object is over 600 MB:

      In [10]: rep = repr(arrow_arr)
      
      In [11]: len(rep)
      Out[11]: 637536258
      

      There are probably a number of failsafes we can implement to avoid badness in these pathological cases. Such cases may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this. One possible approach is sketched below.
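
      As a rough sketch only (not pyarrow's actual repr implementation), a bounded repr could cap the number of chunks shown, the number of elements shown per chunk, and the number of bytes shown per binary value. The truncated_repr helper and its three limits below are hypothetical names invented for this sketch, which assumes binary values like those in the example above:

      import pyarrow as pa

      # Hypothetical limits for this sketch; real failsafes could be tunable.
      MAX_CHUNKS = 3        # chunks to render before eliding the rest
      MAX_ELEMENTS = 10     # elements to render per chunk
      MAX_VALUE_BYTES = 32  # bytes of each binary value to render

      def truncated_repr(chunked: pa.ChunkedArray) -> str:
          """Render a bounded-size preview of a ChunkedArray."""
          lines = [f"ChunkedArray: {chunked.num_chunks} chunks, "
                   f"{len(chunked)} elements"]
          for chunk in chunked.chunks[:MAX_CHUNKS]:
              lines.append("[")
              # Slicing an Arrow Array is zero-copy, so at most
              # MAX_ELEMENTS scalars per chunk are converted to Python.
              for scalar in chunk[:MAX_ELEMENTS]:
                  value = scalar.as_py()
                  head = value[:MAX_VALUE_BYTES]
                  tail = (f" ... ({len(value)} bytes)"
                          if len(value) > MAX_VALUE_BYTES else "")
                  lines.append(f"  {head!r}{tail},")
              if len(chunk) > MAX_ELEMENTS:
                  lines.append(f"  ... {len(chunk) - MAX_ELEMENTS} more elements")
              lines.append("]")
          if chunked.num_chunks > MAX_CHUNKS:
              lines.append(f"... {chunked.num_chunks - MAX_CHUNKS} more chunks")
          return "\n".join(lines)

      With limits like these, truncated_repr(arrow_arr) for the ~2 GB array above would render on the order of a kilobyte of text rather than 600+ MB.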

            People

            • Assignee: Unassigned
            • Reporter: Wes McKinney (wesm)
