Apache Arrow / ARROW-11568

[C++][Compute] Mode kernel performance is bad in some conditions

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0
    • Component/s: C++

    Description

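      Compared with scipy.stats.mode, the Arrow mode kernel is much slower under some conditions. In the example below, pc.mode takes 8.77 s on 12,345,678 random doubles versus 1.25 s for scipy.stats.mode. On the same number of integers drawn from [0, 1234567) the gap is smaller: 1.57 s versus 1.03 s.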

      In [1]: import numpy as np
      
      In [2]: import scipy.stats
      
      In [3]: import pyarrow.compute as pc
      
      In [4]: f = np.random.rand(12345678)
      
      In [5]: time scipy.stats.mode(f)
      CPU times: user 1.14 s, sys: 111 ms, total: 1.25 s
      Wall time: 1.25 s
      Out[5]: ModeResult(mode=array([2.25710692e-08]), count=array([1]))
      
      In [6]: time pc.mode(f)[0]
      CPU times: user 8.44 s, sys: 338 ms, total: 8.77 s
      Wall time: 8.77 s
      Out[6]: <pyarrow.StructScalar: {'mode': 2.2571069235866048e-08, 'count': 1}>
      
      In [7]: i = np.random.randint(0, 1234567, 12345678)
      
      In [8]: time scipy.stats.mode(i)
      CPU times: user 1.03 s, sys: 3.11 ms, total: 1.03 s
      Wall time: 1.03 s
      Out[8]: ModeResult(mode=array([607002]), count=array([28]))
      
      In [9]: time pc.mode(i)[0]
      CPU times: user 1.57 s, sys: 0 ns, total: 1.57 s
      Wall time: 1.57 s
      Out[9]: <pyarrow.StructScalar: {'mode': 607002, 'count': 28}>
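
      The description does not say why the float case is so much slower. As a point of comparison only, the following is a minimal pure-Python sketch of two common ways to compute a mode: counting in a hash table, and sorting followed by a scan over runs of equal values. The function names are hypothetical, and neither function is claimed to be what Arrow or SciPy actually does. Both break ties toward the smallest value, which is consistent with the outputs shown above.

```python
from collections import Counter
from itertools import groupby


def mode_by_hashing(values):
    """Count occurrences in a hash table, then pick the most frequent value."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode of empty input is undefined")
    # Highest count wins; ties are broken toward the smallest value.
    value, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value, count


def mode_by_sorting(values):
    """Sort the input, then scan runs of equal values for the longest run."""
    best_value, best_count = None, 0
    for value, run in groupby(sorted(values)):
        run_length = sum(1 for _ in run)
        # Strict '>' keeps the earliest (smallest) value when counts tie.
        if run_length > best_count:
            best_value, best_count = value, run_length
    if best_count == 0:
        raise ValueError("mode of empty input is undefined")
    return best_value, best_count


if __name__ == "__main__":
    data = [3, 1, 3, 2, 1]
    print(mode_by_hashing(data))
    print(mode_by_sorting(data))
```

      When nearly every value is distinct, as with the random doubles above, the hash table ends up with about one entry per input element, whereas the sorting variant only needs the sorted copy and two counters. Whether that accounts for the timings in this issue is an assumption, not something the description states.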
      

            People

              yibocai Yibo Cai

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 40m