Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
ARROW-9638 added a "mode" kernel to arrow::compute. Some discussion remained about how NaNs should be handled.
The merged PR "skips" NaNs, in the same way that it skips nulls. For example:
[NaN, NaN, 1] -> mode:1, count:1
[null, null, 1] -> mode:1, count:1
[null, null, null] -> null
[NaN, NaN, NaN] -> null # should this be NaN instead?
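The "skip" behaviour in the examples above can be sketched in plain Python. This is only an illustration, not the Arrow kernel: `mode_skipping_nan` is a hypothetical helper, `None` stands in for null, and the tie-break on the smallest value is an assumption of the sketch.

```python
import math
from collections import Counter

def mode_skipping_nan(values):
    """Hypothetical helper: return (mode, count), or None if nothing is left.

    None stands in for null; both None and NaN are dropped before counting.
    """
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not kept:
        return None
    counts = Counter(kept)
    best = max(counts.values())
    # Tie-break on the smallest value (an assumption of this sketch).
    mode = min(v for v, c in counts.items() if c == best)
    return mode, best

nan = float("nan")
print(mode_skipping_nan([nan, nan, 1.0]))     # (1.0, 1)
print(mode_skipping_nan([None, None, 1.0]))   # (1.0, 1)
print(mode_skipping_nan([None, None, None]))  # None
print(mode_skipping_nan([nan, nan, nan]))     # None
```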
However, scipy.stats, for example, does not skip NaNs: for the all-NaN input [NaN, NaN, NaN] it would return mode:NaN, count:1 (NaNs are not equal to each other, so each NaN is counted separately, giving a count of 1).
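The equality argument can be illustrated with a small sketch that groups values using `==` only. It is not scipy's implementation; `group_counts_by_equality` is a hypothetical helper written just to show why every NaN ends up in its own group.

```python
def group_counts_by_equality(values):
    """Hypothetical helper: group values by ==, returning [value, count] pairs."""
    groups = []
    for v in values:
        for group in groups:
            if group[0] == v:  # NaN == NaN is False, so a NaN never matches
                group[1] += 1
                break
        else:
            # No equal value seen so far: start a new group with count 1.
            groups.append([v, 1])
    return groups

nan = float("nan")
groups = group_counts_by_equality([nan, nan, nan])
print(groups)                            # [[nan, 1], [nan, 1], [nan, 1]]
print(max(groups, key=lambda g: g[1]))   # [nan, 1]
```

Under this counting rule the all-NaN input yields a "mode" of NaN with a count of 1, which matches the result described above.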
Also, other aggregations such as sum skip nulls but not NaNs (so sum([NaN, NaN, 1]) would be NaN).
On the other hand, as apitrou argued in the PR, propagating NaN to the result is more straightforward and informative for sum (at least it indicates that there are NaNs in the data), whereas for mode a count of 1 can be surprising or misleading.
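The sum behaviour mentioned above follows directly from floating-point arithmetic and can be checked with plain Python floats. In this sketch `sum_skipping_null` is a hypothetical helper, with `None` standing in for null; it is not the Arrow sum kernel.

```python
import math

def sum_skipping_null(values):
    """Hypothetical helper: skip None (null) but let NaN propagate."""
    return sum(v for v in values if v is not None)

nan = float("nan")
total = sum_skipping_null([nan, nan, 1.0])
print(math.isnan(total))                      # True
print(sum_skipping_null([None, None, 1.0]))   # 1.0
```

A NaN result here signals that the input contained NaNs, whereas a mode of NaN with a count of 1 gives no comparable hint about how many NaNs were present.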