Apache Arrow / ARROW-1854

[Python] Improve performance of serializing object dtype ndarrays


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Python

    Description

      I haven't looked carefully at the hot path for this, but I would expect these two statements to have roughly the same performance (by offloading the ndarray serialization to pickle):

      In [1]: import pickle
      
      In [2]: import numpy as np
      
      In [3]: import pyarrow as pa
      
      In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
      
      In [5]: timeit serialized = pa.serialize(arr).to_buffer()
      10 loops, best of 3: 27.1 ms per loop
      
      In [6]: timeit pickled = pickle.dumps(arr)
      100 loops, best of 3: 6.03 ms per loop
      

      robertnishihara pcmoritz: I encountered this while working on ARROW-1783, but it can likely be resolved independently.
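The idea of offloading object-dtype arrays to pickle can be sketched as below. This is only an illustration and not the change made in Arrow: `serialize_ndarray` and `deserialize_ndarray` are hypothetical helper names, and the "raw" branch assumes plain (non-structured) dtypes.

```python
import pickle

import numpy as np


def serialize_ndarray(arr):
    """Hypothetical helper: return a (kind, payload) pair for an ndarray.

    Arrays whose dtype contains Python objects are handed to pickle as a
    whole instead of being walked element by element. Other arrays are
    stored as dtype string, shape, and raw bytes.
    """
    if arr.dtype.hasobject:
        return ("pickle", pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL))
    # Assumption: a plain dtype that round-trips through dtype.str.
    return ("raw", (arr.dtype.str, arr.shape, arr.tobytes()))


def deserialize_ndarray(item):
    """Hypothetical inverse of serialize_ndarray."""
    kind, payload = item
    if kind == "pickle":
        return pickle.loads(payload)
    dtype, shape, data = payload
    # np.frombuffer returns a read-only view over the bytes object.
    return np.frombuffer(data, dtype=dtype).reshape(shape)


arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
restored = deserialize_ndarray(serialize_ndarray(arr))
```

With this split, the cost of the object-dtype path should be close to that of a direct `pickle.dumps(arr)` call, which is the expectation stated in the description.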

      Attachments

        1. text.html
          2 kB
          Brian Bowman

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Wes McKinney (wesm)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: