Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
Description
(I'll write only min and max below, but everything also applies to min_value and max_value.)
When both min and max are NaN:
- Written by Impala:
  - the first element in the row group is NaN, but not all elements are (Impala writer bug)
  - all elements are NaN
- Written by Hive/Parquet-mr:
  - all elements are NaN
When either min or max is NaN, but not both:
- Written by Impala:
  - this cannot currently happen
- Written by Hive/Parquet-mr:
  - only the max can be NaN (needs to be checked)
Therefore, if both min and max are NaN, we can't use the statistics for filtering.
If only the max is NaN, we still have a valid lower bound.
A workaround could be to replace the NaNs with infinities, i.e. max => Inf, min => -Inf
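A minimal sketch of that workaround in Python, for illustration only. The function names (usable_bounds, may_match_less_than, may_match_greater_than) are hypothetical and are not taken from Impala or Parquet-mr; the sketch assumes the statistics arrive as plain floats.

```python
import math
from typing import Tuple


def usable_bounds(min_value: float, max_value: float) -> Tuple[float, float]:
    """Replace NaN statistics with infinities so they can never prune a row group."""
    lower = -math.inf if math.isnan(min_value) else min_value
    upper = math.inf if math.isnan(max_value) else max_value
    return lower, upper


def may_match_less_than(min_value: float, max_value: float, threshold: float) -> bool:
    """True if the row group may contain a value x with x < threshold."""
    lower, _ = usable_bounds(min_value, max_value)
    return lower < threshold


def may_match_greater_than(min_value: float, max_value: float, threshold: float) -> bool:
    """True if the row group may contain a value x with x > threshold."""
    _, upper = usable_bounds(min_value, max_value)
    return upper > threshold
```

With both statistics NaN the bounds become (-Inf, Inf), so no row group is skipped. With only the max NaN, the lower bound can still rule out predicates such as x < 3 when min is 5.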
Based on my experiments, min/max statistics are not applied to predicates that can be true for NaN, e.g. 'NOT x < 3'
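The following Python sketch shows why such predicates cannot safely use min/max statistics. It assumes a hypothetical writer that skips NaN when computing the statistics, and relies on the IEEE 754 rule that any ordered comparison with NaN is false; it does not reproduce Impala's actual filtering code.

```python
import math

# A row group containing one NaN value.
values = [1.0, 2.0, math.nan]

# Assumption: the writer ignores NaN when computing min/max.
non_nan = [v for v in values if not math.isnan(v)]
stat_min, stat_max = min(non_nan), max(non_nan)

# Rows that actually satisfy NOT x < 3: NaN < 3 is false, so NOT (NaN < 3) is true.
matches = [v for v in values if not (v < 3)]

# Naive pruning that rewrites NOT x < 3 as x >= 3 and checks only the max.
naive_prune = stat_max < 3
```

Here the naive check would skip the row group although the NaN row satisfies the predicate, so skipping statistics for these predicates is the safe choice.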