ARROW-4099

[Python] Pretty printing very large ChunkedArray objects can use unbounded memory


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Python

    Description

      In working on ARROW-2970, I have the following dataset:

      import numpy as np
      import pyarrow as pa

      # One 1-byte value followed by 2048 values of 1 MiB each (~2 GiB total)
      values = [b'x'] + [
          b'x' * (1 << 20)
      ] * 2 * (1 << 10)

      # numpy promotes this to a fixed-width bytestring array
      arr = np.array(values)

      arrow_arr = pa.array(arr)

      The resulting arrow_arr is a ChunkedArray with 129 chunks, each of whose elements is 1 MB of binary data. The repr for this object is over 600 MB:

      In [10]: rep = repr(arrow_arr)
      
      In [11]: len(rep)
      Out[11]: 637536258
      

      There are probably a number of failsafes we can implement to avoid badness in these pathological cases (which may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this).
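
      One possible failsafe is to cap how much data the repr materializes rather than formatting every value. A minimal sketch of the idea follows; bounded_repr and its parameters are hypothetical names, not existing pyarrow API, and only public ChunkedArray methods (num_chunks, chunk, slice, to_pylist) are used:

      import pyarrow as pa

      def bounded_repr(chunked, max_chunks=3, max_elements=5, max_value_bytes=32):
          # Format at most a few elements per chunk and truncate each value,
          # so the string (and the memory used to build it) stays small no
          # matter how large the ChunkedArray is.
          lines = ["ChunkedArray: %d chunks, %d values"
                   % (chunked.num_chunks, len(chunked))]
          for i in range(min(chunked.num_chunks, max_chunks)):
              chunk = chunked.chunk(i)
              shown = []
              for value in chunk.slice(0, max_elements).to_pylist():
                  if value is None:
                      shown.append("None")
                      continue
                  text = repr(value[:max_value_bytes])
                  if len(value) > max_value_bytes:
                      text += " (+%d bytes)" % (len(value) - max_value_bytes)
                  shown.append(text)
              suffix = ", ..." if len(chunk) > max_elements else ""
              lines.append("  chunk %d: [%s%s]" % (i, ", ".join(shown), suffix))
          if chunked.num_chunks > max_chunks:
              lines.append("  ... and %d more chunks"
                           % (chunked.num_chunks - max_chunks))
          return "\n".join(lines)

      # Example: a small chunked binary array formats to a few short lines
      chunked = pa.chunked_array([[b'abc' * 100] * 20] * 10)
      print(bounded_repr(chunked))

      For a dataset like the one above, this keeps the formatted string to a few hundred bytes instead of 600+ MB.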

    People

      Assignee: Unassigned
      Reporter: Wes McKinney (wesm)
