[ARROW-1854] [Python] Improve performance of serializing object dtype ndarrays - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/17848

Description

I haven't looked carefully at the hot path for this, but I would expect the two timed statements below to have roughly the same performance, since the ndarray serialization could be offloaded to pickle:

In [1]: import pickle

In [2]: import numpy as np

In [3]: import pyarrow as pa

In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

In [5]: timeit serialized = pa.serialize(arr).to_buffer()
10 loops, best of 3: 27.1 ms per loop

In [6]: timeit pickled = pickle.dumps(arr)
100 loops, best of 3: 6.03 ms per loop

robertnishihara, pcmoritz: I encountered this while working on ARROW-1783, but it can likely be resolved independently.
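
For anyone who wants to reproduce the comparison outside IPython, here is a minimal standalone sketch. The helper names (`make_array`, `best_ms`, `run_benchmark`) are made up for illustration and are not part of pyarrow. It assumes numpy is installed and treats pyarrow as optional: `pa.serialize` is the API available at the time of this report and may be missing from newer pyarrow releases, so that measurement is skipped when the function cannot be found. Timings will differ by machine and library version.

```python
import pickle
import timeit

import numpy as np

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional for this sketch
    pa = None


def make_array(repeats=100_000):
    """Build the object-dtype array used in the report."""
    return np.array(['foo', 'bar', None] * repeats, dtype=object)


def best_ms(func, number=10, repeat=3):
    """Best per-call time in milliseconds over `repeat` timing runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e3


def run_benchmark(arr):
    """Time pickle.dumps and, if available, pa.serialize on `arr`."""
    results = {"pickle.dumps": best_ms(lambda: pickle.dumps(arr))}
    # Assumption: pa.serialize may not exist in the installed pyarrow.
    serialize = getattr(pa, "serialize", None) if pa is not None else None
    if serialize is not None:
        results["pa.serialize"] = best_ms(lambda: serialize(arr).to_buffer())
    return results


if __name__ == "__main__":
    for name, ms in run_benchmark(make_array()).items():
        print(f"{name}: {ms:.2f} ms per call")
```

Taking the minimum over several runs mirrors the "best of 3" figure that `timeit` prints in the session above.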

Attachments

text.html, 2 kB, 15/Dec/17 22:15, Brian Bowman

Issue Links

is related to: ARROW-1784 [Python] Read and write pandas.DataFrame in pyarrow.serialize by decomposing the BlockManager rather than coercing to Arrow format (Resolved)
links to: GitHub Pull Request #1360

People

Assignee: Wes McKinney
Reporter: Wes McKinney
Votes: 0
Watchers: 5

Dates

Created: 24/Nov/17 20:15
Updated: 11/Jan/23 07:17
Resolved: 29/Nov/17 01:06