Apache Arrow / ARROW-1382

[Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python

      Description

      If the same Python object appears multiple times within a list, tuple, or dictionary, pyarrow.serialize writes a separate copy of it for every occurrence. This can greatly inflate the serialized size (e.g., the serialized version of 100 * [np.zeros(10 ** 6)] will be about 100 times bigger than it needs to be). It also means the deserialized result no longer preserves object identity, as the following example shows.

      import pyarrow as pa
      
      l = [0]
      original_object = [l, l]
      
      # Serialize and deserialize the object.
      buf = pa.serialize(original_object).to_buffer()
      new_object = pa.deserialize(buf)
      
      # This passes: both elements of the original are the same object.
      assert original_object[0] is original_object[1]
      
      # This fails: deserialization produced two separate copies.
      assert new_object[0] is new_object[1]
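
For comparison, here is a small sketch of the behavior being requested, using the standard-library pickle module rather than pyarrow. pickle keeps a memo of objects already written during a single dumps() call, so a repeated object is stored once and later occurrences become back-references. The variable names are illustrative only, and nothing here reflects how pyarrow itself works.

```python
import pickle

shared = list(range(1000))

# One hundred references to the same list versus one hundred separate copies.
same_object = [shared] * 100
separate_copies = [list(shared) for _ in range(100)]

shared_size = len(pickle.dumps(same_object))
copied_size = len(pickle.dumps(separate_copies))

# The shared list is written once, so the payload is far smaller.
print(shared_size, copied_size)

# Identity between the repeated elements survives the round trip.
restored = pickle.loads(pickle.dumps(same_object))
assert restored[0] is restored[1]
assert restored[0] is not shared  # still a fresh copy of the original
```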
      

      One potential way to address this is to use the Arrow dictionary encoding, which stores each distinct value once and refers to it by index.
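
The idea can be sketched in plain Python, without any pyarrow calls: keep a table holding each distinct container once (playing the role of the dictionary) and replace every occurrence with an index into that table. This is only an illustration under simplifying assumptions. The encode and decode helpers are hypothetical, they handle only lists and dicts (tuples and NumPy arrays are left out for brevity), and they say nothing about how this would map onto Arrow's actual dictionary-encoded layout.

```python
def encode(obj):
    """Split obj into (values, root), storing each distinct list/dict once."""
    values = []    # one encoded entry per distinct container (the "dictionary")
    index_of = {}  # id(container) -> its position in values

    def visit(o):
        if not isinstance(o, (list, dict)):
            return ("leaf", o)
        key = id(o)
        if key not in index_of:
            index_of[key] = len(values)
            values.append(None)  # reserve the slot before recursing
            if isinstance(o, list):
                body = ("list", [visit(item) for item in o])
            else:
                body = ("dict", [(k, visit(v)) for k, v in o.items()])
            values[index_of[key]] = body
        return ("ref", index_of[key])

    return values, visit(obj)


def decode(values, root):
    """Rebuild the object; equal indices resolve to the same container."""
    # Create every container empty first, then fill them in place.
    containers = [[] if kind == "list" else {} for kind, _ in values]

    def resolve(node):
        tag, payload = node
        return containers[payload] if tag == "ref" else payload

    for container, (kind, body) in zip(containers, values):
        if kind == "list":
            container.extend(resolve(n) for n in body)
        else:
            container.update((k, resolve(n)) for k, n in body)
    return resolve(root)


shared = [0]
values, root = encode([shared, shared])
new_object = decode(values, root)

assert len(values) == 2                # outer list plus one copy of the shared list
assert new_object[0] is new_object[1]  # identity is preserved
```

Keying the table on id() deduplicates by identity rather than by equality, so two distinct but equal lists stay distinct after a round trip.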

    People

    • Assignee: Unassigned
    • Reporter: Robert Nishihara (robertnishihara)
    • Votes: 1
    • Watchers: 6

    Time Tracking

    • Estimated: Not Specified
    • Remaining: 0h
    • Logged: 4h 20m