Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
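If the first value of a column chunk is NaN, then min_value = max_value = NaN.
If the first value of a column chunk is not NaN, i.e. it is an ordinary number or +/-infinity, then in the end neither min_value nor max_value is NaN, even if NaNs appear later in the chunk. The statistics therefore depend on where the NaNs happen to be in the chunk.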
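The inconsistency can be reproduced with a minimal sketch. It assumes the writer seeds the statistics with the first value and then updates them with plain `<` and `>` comparisons; `NaiveStats` and `ComputeNaive` are hypothetical names for illustration, not the actual writer code.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for the min/max statistics of one column chunk.
struct NaiveStats {
  bool has_values = false;
  double min_value = 0.0;
  double max_value = 0.0;

  // Assumed update rule: seed with the first value, then compare with < and >.
  void Update(double v) {
    if (!has_values) {
      min_value = max_value = v;
      has_values = true;
      return;
    }
    // Any ordered comparison involving NaN is false, so a NaN seed is never
    // replaced and a later NaN never replaces a numeric min/max.
    if (v < min_value) min_value = v;
    if (v > max_value) max_value = v;
  }
};

NaiveStats ComputeNaive(const std::vector<double>& values) {
  NaiveStats stats;
  for (double v : values) stats.Update(v);
  return stats;
}
```

With this rule, the values {NaN, 1.0, 5.0} produce NaN for both statistics, while {1.0, NaN, 5.0} produce min_value = 1.0 and max_value = 5.0, although both chunks hold the same set of values.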
Until the Parquet community agrees on an ordering of floating-point numbers, we can at least make our write path consistent.
A quick fix is to ignore NaNs when calculating min/max statistics, except when all the values are NaN. This matches the behavior of the fmax()/fmin() functions in the standard C/C++ math library.
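A minimal sketch of the proposed rule, built on `std::fmin`/`std::fmax`, which return the non-NaN argument when exactly one argument is NaN. `NanIgnoringStats` and `ComputeNanIgnoring` are hypothetical names for illustration and are not taken from the actual patch.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical statistics holder that ignores NaNs unless every value is NaN.
struct NanIgnoringStats {
  bool has_values = false;
  double min_value = 0.0;
  double max_value = 0.0;

  void Update(double v) {
    if (!has_values) {
      min_value = max_value = v;
      has_values = true;
      return;
    }
    // fmin/fmax drop a NaN operand when the other operand is a number, so a
    // NaN seed is replaced by the first non-NaN value and later NaNs have no
    // effect. The result stays NaN only if all values are NaN.
    min_value = std::fmin(min_value, v);
    max_value = std::fmax(max_value, v);
  }
};

NanIgnoringStats ComputeNanIgnoring(const std::vector<double>& values) {
  NanIgnoringStats stats;
  for (double v : values) stats.Update(v);
  return stats;
}
```

Here both {NaN, 1.0, 5.0} and {1.0, NaN, 5.0} yield min_value = 1.0 and max_value = 5.0, so the result no longer depends on the position of the NaNs.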
This way we can use min/max statistics and the results still remain correct, because only binary predicates that contain constants are tested against min/max statistics. In other words, if a predicate is meant to return NaNs (e.g. 'NOT x < 3', 'x != x'), min/max statistics won't be used, i.e. the NaNs will be returned as well.
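To illustrate why NaN-ignoring statistics are safe for such predicates, here is a hedged sketch of a skip check for the predicate 'x < c'. The function `CanSkipForLessThan` is a hypothetical helper and does not describe the actual read path.

```cpp
#include <cmath>

// Hypothetical check: may a column chunk be skipped for the predicate `x < c`?
// The chunk can be skipped only if no value in it can satisfy the predicate.
bool CanSkipForLessThan(double min_value, double c) {
  // NaN rows never satisfy `x < c`, so leaving them out of min_value cannot
  // cause a matching row to be skipped. If min_value is NaN (an all-NaN
  // chunk), the comparison is false and the chunk is read, which is
  // conservative but still correct.
  return min_value >= c;
}
```

For example, a chunk with min_value = 5.0 can be skipped for 'x < 3', while a chunk with min_value = 1.0 or min_value = NaN is read.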