[ARROW-2041] [Python] pyarrow.serialize has high overhead for list of NumPy arrays - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: Python
Labels:
- Performance

External issue URL:
https://github.com/apache/arrow/issues/18020

Description

Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa, numpy as np
>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(10000)]
>>> with open('test.pyarrow', 'wb') as f:
...     f.write(pa.serialize(arrays).to_buffer().to_pybytes())
...
>>> import cPickle as pickle
>>> with open('test.pkl', 'wb') as f:
...     pickle.dump(arrays, f, pickle.HIGHEST_PROTOCOL)
...

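test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB. The raw array data is 10,000 × 100 × 4 bytes = 4.0 MB, so pa.serialize adds roughly 2.2 MB of overhead, against roughly 0.2 MB for pickle.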
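To put the two file sizes on a per-array footing, the overhead can be estimated by subtracting the raw payload from each file size and dividing by the number of arrays. The sketch below is only an illustration: `per_array_overhead` is a hypothetical helper, not part of pyarrow or pickle, and it assumes the reported sizes are decimal megabytes rounded to one decimal place, so the results are rough estimates rather than measurements.

```python
def per_array_overhead(file_size_bytes, n_arrays, items_per_array, itemsize):
    """Estimate serialization overhead per array, in bytes.

    Hypothetical helper for illustration: subtracts the raw element data
    (n_arrays * items_per_array * itemsize) from the file size and spreads
    the remainder evenly over the arrays.
    """
    payload_bytes = n_arrays * items_per_array * itemsize
    return (file_size_bytes - payload_bytes) / n_arrays


# Shape of the data in the report: 10000 arrays of 100 int32 values each.
N_ARRAYS, ITEMS, ITEMSIZE = 10_000, 100, 4

# Assumption: "6.2 MB" and "4.2 MB" are read as decimal megabytes.
pyarrow_overhead = per_array_overhead(6_200_000, N_ARRAYS, ITEMS, ITEMSIZE)
pickle_overhead = per_array_overhead(4_200_000, N_ARRAYS, ITEMS, ITEMSIZE)

print(pyarrow_overhead)  # 220.0 bytes per array
print(pickle_overhead)   # 20.0 bytes per array
```

Under these assumptions the pyarrow file carries on the order of 200 bytes of extra data per 400-byte array, which is why the overhead is so visible for many small arrays and would matter far less for a few large ones.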

People

Assignee: Unassigned

Reporter: Richard Shin

Votes: 1

Watchers: 5

Dates

Created: 26/Jan/18 05:25

Updated: 11/Jan/23 07:18

Resolved: 28/Aug/19 21:44