Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Description
If the same Python object appears multiple times within a list, tuple, or dictionary, pyarrow serializes a separate copy of it for each occurrence. This can greatly inflate the serialized size (e.g., the serialized version of 100 * [np.zeros(10 ** 6)] will be 100 times bigger than it needs to be), and the deserialized result no longer shares a single object between those positions.
import pyarrow as pa

l = [0]
original_object = [l, l]

# Serialize and deserialize the object.
buf = pa.serialize(original_object).to_buffer()
new_object = pa.deserialize(buf)

# This works.
assert original_object[0] is original_object[1]

# This fails.
assert new_object[0] is new_object[1]
One potential way to address this is to use the Arrow dictionary encoding.
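To illustrate the idea behind that suggestion, here is a minimal pure-Python sketch of dictionary-style encoding: each distinct object is stored once, and every occurrence is replaced by an index into that table. This is not pyarrow's implementation and does not use the Arrow format; the helper names `encode_shared` and `decode_shared` are hypothetical and exist only for this example.

```python
def encode_shared(items):
    """Split a list into (unique objects, indices), deduplicating by identity."""
    uniques = []    # each distinct object is stored exactly once
    index_of = {}   # id(obj) -> position of obj in `uniques`
    indices = []    # one entry per element of `items`
    for obj in items:
        key = id(obj)
        if key not in index_of:
            index_of[key] = len(uniques)
            uniques.append(obj)
        indices.append(index_of[key])
    return uniques, indices


def decode_shared(uniques, indices):
    """Rebuild the list; repeated indices refer to the same object."""
    return [uniques[i] for i in indices]


l = [0]
original_object = [l, l]

uniques, indices = encode_shared(original_object)
restored = decode_shared(uniques, indices)

# Only one copy of `l` would need to be serialized.
assert len(uniques) == 1
assert indices == [0, 0]

# Identity between the two positions is preserved.
assert restored[0] is restored[1]
```

In a real serializer only `uniques` and `indices` would be written out, so the payload for 100 * [np.zeros(10 ** 6)] would contain the array once plus 100 small integers.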