[ARROW-9723] [C++] Expected behaviour of "mode" kernel with NaNs?

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25778

Description

ARROW-9638 added a "mode" kernel to arrow::compute. Some discussion remained on how NaNs should be handled.

The merged PR made the kernel skip NaNs, in the same way that it skips nulls. For example:

[NaN, NaN, 1] -> mode:1, count:1
[null, null, 1] -> mode:1, count:1
[null, null, null] -> null
[NaN, NaN, NaN] -> null  # should this be NaN instead?
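
To make the "skip" semantics above concrete, here is a minimal pure-Python sketch. It is not the Arrow C++ kernel: the function name `mode_skipping_nan` is hypothetical, and breaking ties by the smallest value is an assumption of this sketch only.

```python
import math
from collections import Counter

def mode_skipping_nan(values):
    """Return (mode, count), ignoring None and NaN; None if nothing is left."""
    # Drop nulls (None) and NaNs before counting, mirroring the "skip" behaviour.
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None
    counts = Counter(kept)
    best = max(counts.values())
    # Tie-break on the smallest value (assumption made for this sketch).
    mode = min(v for v, c in counts.items() if c == best)
    return mode, best

nan = float("nan")
print(mode_skipping_nan([nan, nan, 1]))      # (1, 1)
print(mode_skipping_nan([None, None, 1]))    # (1, 1)
print(mode_skipping_nan([None, None, None])) # None
print(mode_skipping_nan([nan, nan, nan]))    # None
```

With this rule, an all-NaN input is indistinguishable from an all-null input, which is exactly the case questioned in the last example line.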

By contrast, scipy.stats does not skip NaNs: for the last line above it would return mode:NaN, count:1, because NaNs do not compare equal to each other and each one is therefore counted separately.
Other aggregations such as sum also skip nulls but not NaNs, so sum([NaN, NaN, 1]) is NaN.
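
The count of 1 follows directly from NaN never comparing equal to itself. The sketch below illustrates that reasoning in plain Python; it is not scipy's implementation, and `count_by_equality` is a hypothetical helper that groups values using `==` only.

```python
import math

def count_by_equality(values):
    """Group values using == only, so a NaN never joins another NaN's group."""
    groups = []  # list of [value, count] pairs
    for v in values:
        for g in groups:
            if g[0] == v:
                g[1] += 1
                break
        else:
            # No existing group compared equal: start a new one.
            groups.append([v, 1])
    return groups

nan = float("nan")
groups = count_by_equality([nan, nan, nan])
print(len(groups), [c for _, c in groups])  # 3 [1, 1, 1]

# A plain sum does not skip NaN: the NaN propagates to the result.
total = sum([nan, nan, 1.0])
print(math.isnan(total))  # True
```

Each NaN ends up in its own group with a count of 1, so any of them is a valid "mode", while the sum simply becomes NaN.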

On the other hand, as apitrou argued in the PR, propagating NaN to the result is more straightforward and informative for sum (it at least indicates that there are NaNs in the data), whereas for mode a result of NaN with a count of 1 can be surprising or misleading.

Issue Links

links to

GitHub Pull Request #8061

People

Assignee: Yibo Cai

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 4

Dates

Created: 13/Aug/20 12:44

Updated: 11/Jan/23 08:08

Resolved: 27/Aug/20 11:11

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h