Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-6527

NaN values lead to incorrect statistics filtering under certain circumstances

    XMLWordPrintableJSON

Details

    • ghx-label-7

    Description

      Summary

      If the first number in a row group written by Impala is NaN, then Impala writes incorrect statistics in the metadata. This will result in incorrect results when filtering the data.

      Reproduction

      First, create a Parquet table with a double column:

      create table test_nan(val double) stored as parquet;
      

      Insert two values in a single statement, the first of which is a NaN:

      insert into test_nan values (cast('NaN' as double)), (42);
      

      Check that both values are actually present in the table:

      select * from test_nan;
      +-----+
      | val |
      +-----+
      | NaN |
      | 42  |
      +-----+
      Fetched 2 row(s) in 0.13s
      

      Filter using a condition that should match the regular number:

      select * from test_nan where val > 0;
      Fetched 0 row(s) in 0.13s
      

      Expectation: The row with the regular number should be returned.
      Actual result: No rows are returned.

      Explanation

      Parquet files contain statistics metadata including the fields min and max or min_value and max_value (depending on the Impala version). If the first number is a NaN, the minimum and maximum values that Impala writes in the metadata are NaN. Based on this metadata, the row group can not contain any value that matches the condition, thereby Impala discards its contents without checking the individual entries. The problem is that the statistics were incorrectly written in the first place. (This can be and has been checked by using parquet-tools meta on the Parquet file.)

      What follows are just my assumptions without checking the actual code: While writing data, Impala keeps track of the smallest and largest value encountered so far. Let's call them min_so_far and max_so_far, respectively.

      Initially, the first (non_NULL) value is set as both the min_so_far and max_so_far. Then each new value is compared against min_so_far and max_so_far, updating each one if necessary. In pseudo_code:

      if (new_value < min_so_far) {
        min_so_far = new_value;
      }
      

      The problem is that any comparison involving NaN returns false, thereby if NaN is already in min_so_far, then no value can ever replace it and NaN will be stuck there.

      On the positive side, min_so_far can only become NaN if the first value in the row group is NaN. If the first value is not NaN, then NaN can never replace min_so_far, since the comparison will always return false when it involves a NaN.

      Attachments

        Issue Links

          Activity

            People

              boroknagyz Zoltán Borók-Nagy
              zi Zoltan Ivanfi
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: