Apache Arrow / ARROW-1382

[Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python

    Description

      If the same Python object appears multiple times within a list, tuple, or dictionary, pyarrow.serialize writes a separate copy of it for each occurrence. This can greatly inflate the serialized size (e.g., the serialized version of 100 * [np.zeros(10 ** 6)] will be 100 times bigger than it needs to be), and the deserialized result no longer preserves object identity, as the following example shows.

      import pyarrow as pa
      
      l = [0]
      original_object = [l, l]
      
      # Serialize and deserialize the object.
      buf = pa.serialize(original_object).to_buffer()
      new_object = pa.deserialize(buf)
      
      # This holds: both entries refer to the same list object.
      assert original_object[0] is original_object[1]
      
      # This fails: deserialization yields two separate copies.
      assert new_object[0] is new_object[1]
      

      One potential way to address this is to use the Arrow dictionary encoding.
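
      The idea behind dictionary encoding can be sketched in plain Python: store each distinct object once and replace every occurrence with an index into that store. The sketch below is only an illustration of the concept, not pyarrow's implementation or API; the helper names dictionary_encode and dictionary_decode are hypothetical, and objects are treated as "the same" by identity (id), which is an assumption made to match the example above.

```python
def dictionary_encode(items):
    """Split items into (uniques, indices), deduplicating by object identity.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    uniques = []    # each distinct object is stored exactly once
    index_of = {}   # id(obj) -> position of obj in uniques
    indices = []    # one index per original occurrence
    for obj in items:
        key = id(obj)  # safe here: items keeps every object alive
        if key not in index_of:
            index_of[key] = len(uniques)
            uniques.append(obj)
        indices.append(index_of[key])
    return uniques, indices


def dictionary_decode(uniques, indices):
    """Rebuild the original sequence; repeated indices share one object."""
    return [uniques[i] for i in indices]


l = [0]
original_object = [l, l]

uniques, indices = dictionary_encode(original_object)
new_object = dictionary_decode(uniques, indices)

# The shared list is stored once and identity survives the round trip.
assert len(uniques) == 1
assert new_object[0] is new_object[1]
```

      In a real serializer only uniques and indices would need to be written out, so a payload such as 100 * [np.zeros(10 ** 6)] would carry the array once plus 100 small indices.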

            People

              Assignee: Unassigned
              Reporter: Robert Nishihara (robertnishihara)
              Votes: 1
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 20m