Apache Arrow / ARROW-10569

[C++][Python] Poor Table filtering performance

Details

    Description

      From the mailing list

       

      import pandas as pd
      import pyarrow as pa
      import pyarrow.compute as pc
      import numpy as np
      
      num_rows = 10_000_000
      data = np.random.randn(num_rows)
      
      df = pd.DataFrame({'data{}'.format(i): data
                         for i in range(100)})
      
      df['key'] = np.random.randint(0, 100, size=num_rows)
      
      rb = pa.record_batch(df)
      t = pa.table(df)
      
      I found that filtering a record batch performs very similarly to filtering the pandas DataFrame:
      
      In [22]: %timeit df[df.key == 5]
      71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
      
      In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
      75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
      
      Whereas the performance of filtering a table is absolutely abysmal (no
      idea what's going on here)
      
      In [23]: %timeit t.filter(pc.equal(t[-1], 5))
      961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
       

       

      https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E

            People

              apitrou Antoine Pitrou
              wesm Wes McKinney
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h