Apache Arrow / ARROW-1854

[Python] Improve performance of serializing object dtype ndarrays

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Python

      Description

      I haven't looked carefully at the hot path for this, but I would expect these two statements to have roughly the same performance (offloading the ndarray serialization to pickle):

      In [1]: import pickle
      
      In [2]: import numpy as np
      
      In [3]: import pyarrow as pa
      
      In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
      
      In [5]: timeit serialized = pa.serialize(arr).to_buffer()
      10 loops, best of 3: 27.1 ms per loop
      
      In [6]: timeit pickled = pickle.dumps(arr)
      100 loops, best of 3: 6.03 ms per loop
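
For anyone who wants to repeat the comparison outside IPython, the session above can be approximated with the standard-library `timeit` module. This is only a minimal sketch: the helper names `best_ms` and `run_benchmark` are made up for illustration, the repeat and loop counts are arbitrary, and the `pa.serialize` timing is skipped when the installed pyarrow does not expose that function (an assumption about newer releases). Absolute numbers will differ from the ones reported here.

```python
import pickle
import timeit


def best_ms(fn, repeat=3, number=10):
    """Return the best per-call time of fn() in milliseconds."""
    # timeit.repeat returns one total time per repetition, each covering
    # `number` calls; take the fastest repetition, as %timeit does.
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number * 1000.0


def run_benchmark(copies=100_000, repeat=3, number=10):
    """Time both statements from the session on the same object-dtype array."""
    # Third-party imports are local so best_ms stays usable on its own.
    import numpy as np

    arr = np.array(['foo', 'bar', None] * copies, dtype=object)
    results = {
        'pickle.dumps': best_ms(lambda: pickle.dumps(arr), repeat, number),
    }

    try:
        import pyarrow as pa
    except ImportError:
        pa = None
    # Assumption: pa.serialize may be missing in newer pyarrow releases,
    # so that timing is skipped instead of raising AttributeError.
    if pa is not None and hasattr(pa, 'serialize'):
        results['pa.serialize'] = best_ms(
            lambda: pa.serialize(arr).to_buffer(), repeat, number)
    return results
```

Calling `run_benchmark()` returns a dict that maps each statement to its best time in milliseconds, so the two figures can be compared directly.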
      

      Robert Nishihara, Philipp Moritz: I encountered this while working on ARROW-1783, but it can likely be resolved independently.

        Attachments

        1. text.html (2 kB, Brian Bowman)

              People

              • Assignee: Wes McKinney (wesmckinn)
              • Reporter: Wes McKinney (wesmckinn)
              • Votes: 0
              • Watchers: 4
