If the first number in a row group written by Impala is NaN, Impala writes incorrect statistics into the Parquet metadata. This causes incorrect query results when the data is later filtered.
First, create a Parquet table with a double column:
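For example (the table and column names are illustrative, not from the original report):

```sql
CREATE TABLE test_nan (val DOUBLE) STORED AS PARQUET;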
Insert two values in a single statement, the first of which is a NaN:
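Assuming the illustrative table test_nan with a DOUBLE column val, and using a string cast to produce the NaN value:

```sql
-- NaN first, then a regular number, in a single statement
INSERT INTO test_nan VALUES (CAST('nan' AS DOUBLE)), (42);
```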
Check that both values are actually present in the table:
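A full scan (no filter, so the row-group statistics are not consulted) returns both rows:

```sql
SELECT * FROM test_nan;
-- NaN and 42 are both returned
```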
Filter using a condition that should match the regular number:
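For instance:

```sql
SELECT * FROM test_nan WHERE val = 42;
```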
Expectation: The row with the regular number should be returned.
Actual result: No rows are returned.
Parquet files contain statistics metadata, including the fields min and max, or min_value and max_value (depending on the Impala version). If the first number in a row group is a NaN, the minimum and maximum values that Impala writes in the metadata are both NaN. According to this metadata, the row group cannot contain any value that matches the condition, so Impala discards its contents without checking the individual entries. The pruning itself is working as designed; the problem is that the statistics were written incorrectly in the first place. (This can be, and has been, verified by running parquet-tools meta on the Parquet file.)
What follows is just my assumption; I have not checked the actual code. While writing data, Impala keeps track of the smallest and largest values encountered so far. Let's call them min_so_far and max_so_far, respectively.
Initially, the first (non-NULL) value is set as both min_so_far and max_so_far. Each subsequent value is then compared against min_so_far and max_so_far, updating each one if necessary.
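In pseudocode, the tracking logic I am assuming (a sketch, not Impala's actual implementation) looks roughly like this:

```python
import math

def track_min_max(values):
    """Track running min/max the way I assume Impala's Parquet writer does."""
    # The first value seeds both the minimum and the maximum.
    min_so_far = max_so_far = values[0]
    for v in values[1:]:
        if v < min_so_far:   # always False when either operand is NaN
            min_so_far = v
        if v > max_so_far:   # always False when either operand is NaN
            max_so_far = v
    return min_so_far, max_so_far

# If NaN comes first, it seeds min/max and no comparison can dislodge it:
lo, hi = track_min_max([float('nan'), 42.0, 1.0])
print(math.isnan(lo), math.isnan(hi))  # → True True

# If a regular number comes first, NaN never overwrites the statistics:
print(track_min_max([1.0, float('nan'), 42.0]))  # → (1.0, 42.0)
```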
The problem is that any comparison involving NaN returns false, so once NaN ends up in min_so_far, no value can ever replace it and NaN is stuck there. The same applies to max_so_far.
On the positive side, min_so_far can only become NaN if the first value in the row group is NaN. If the first value is not NaN, then NaN can never replace min_so_far, since the comparison always returns false when one operand is a NaN. This matches the observed behavior: the bug only triggers when NaN is the first value in the row group.