Apache Arrow / ARROW-9723

[C++] Expected behaviour of "mode" kernel with NaNs?

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: C++

    Description

      ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining discussion on how NaNs should be handled.

      The merged PR made the kernel skip NaNs, in the same way that it skips nulls. For example:

      [NaN, NaN, 1] -> mode:1, count:1
      [null, null, 1] -> mode:1, count:1
      [null, null, null] -> null
      [NaN, NaN, NaN] -> null  # should this be NaN instead?
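
The four cases above can be reproduced with a small pure-Python sketch. This is only an illustration of the "skip nulls and NaNs" semantics, not the arrow::compute implementation: the function name `mode_skipping_nan` is hypothetical, `None` stands in for null, and breaking ties by taking the smallest value is an assumption made for this sketch.

```python
import math
from collections import Counter


def mode_skipping_nan(values):
    """Return (mode, count), ignoring None (null) and NaN.

    Returns None when no valid value remains after skipping.
    """
    # Drop nulls first, then NaNs, mirroring the "skip" behaviour described above.
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None
    counts = Counter(kept)
    best = max(counts.values())
    # Tie-breaking rule (assumption for this sketch): smallest value wins.
    mode = min(v for v, c in counts.items() if c == best)
    return mode, best


nan = float("nan")
print(mode_skipping_nan([nan, nan, 1]))
print(mode_skipping_nan([None, None, 1]))
print(mode_skipping_nan([None, None, None]))
print(mode_skipping_nan([nan, nan, nan]))
```

Under these assumptions the first two inputs give a mode of 1 with a count of 1, and the last two give no result at all, which is the case the question in this issue is about.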
      

      By contrast, scipy.stats does not skip NaNs: for the last line above it would return mode:NaN, count:1 (NaNs do not compare equal to each other, so each NaN is counted separately, giving a count of 1).
      Also, other aggregations such as sum skip nulls but not NaNs (so sum([NaN, NaN, 1]) would be NaN).
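
The alternative semantics can be sketched in pure Python as well. The helpers `mode_by_equality` and `sum_skipping_nulls` are hypothetical names for this illustration; they are not the scipy.stats or arrow::compute code. They only show what follows from NaN never comparing equal to itself, and from a sum that skips nulls but not NaNs.

```python
import math


def mode_by_equality(values):
    """Count values by `==`; every NaN forms its own group of size 1."""
    groups = []  # each entry is [representative, count]
    for v in values:
        for group in groups:
            if group[0] == v:  # always False when v is NaN
                group[1] += 1
                break
        else:
            groups.append([v, 1])
    if not groups:
        return None
    # max() keeps the first group seen among those with the highest count.
    best = max(groups, key=lambda g: g[1])
    return best[0], best[1]


def sum_skipping_nulls(values):
    """Skip None (null) but let NaN propagate into the result."""
    return sum(v for v in values if v is not None)


nan = float("nan")
print(nan == nan)                           # NaN is not equal to itself
print(mode_by_equality([nan, nan, nan]))    # a NaN mode with a count of 1
print(sum_skipping_nulls([nan, nan, 1]))    # NaN propagates through the sum
print(sum_skipping_nulls([None, None, 1]))  # nulls are skipped
```

With equality-based counting, three NaNs yield a NaN mode with a count of 1, which is the result attributed to scipy.stats above.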

      On the other hand, as apitrou argued in the PR, for sum it is more straightforward and informative to propagate NaN to the result (it at least indicates that the data contains NaNs), whereas for mode a count of 1 can be surprising or misleading.

            People

              Assignee: Yibo Cai (yibocai)
              Reporter: Joris Van den Bossche (jorisvandenbossche)
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h