  1. Apache Arrow
  2. ARROW-9623

[Python] Performance difference between pc.multiply vs pd.multiply

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None
    • Environment: Windows, Pyarrow 1.0.0

    Description

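      I want to report a performance difference observed between pandas and PyArrow. Multiplying a column of 100,000,000 random floats by itself took roughly twice as long on average with pc.multiply as with DataFrame.multiply, and the PyArrow timings varied much more between runs.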

       

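      # %timeit is an IPython magic, so run this in IPython or Jupyter
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.compute as pc

      # 100,000,000 random float64 values in a single-column DataFrame
      df = pd.DataFrame(np.random.randn(100000000))
      %timeit -n 5 -r 5 df.multiply(df)

      # Same data as an Arrow table; table[0] is its only column
      table = pa.Table.from_pandas(df)
      %timeit -n 5 -r 5 pc.multiply(table[0], table[0])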
      

      Results:

      %timeit -n 5 -r 5 df.multiply(df)
      374 ms ± 15.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

      %timeit -n 5 -r 5 pc.multiply(table[0], table[0])
      698 ms ± 297 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

    People

      Assignee: Unassigned
      Reporter: zacqed H G