XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.11.0
Fix Version/s: Impala 3.0, Impala 2.12.0
Component/s: Backend
Labels:
- correctness
- parquet

Epic Color:
ghx-label-7

Description

Summary

If the first number in a row group written by Impala is NaN, then Impala writes incorrect statistics in the metadata. This will result in incorrect results when filtering the data.

Reproduction

First, create a Parquet table with a double column:

create table test_nan(val double) stored as parquet;

Insert two values in a single statement, the first of which is a NaN:

insert into test_nan values (cast('NaN' as double)), (42);

Check that both values are actually present in the table:

select * from test_nan;
+-----+
| val |
+-----+
| NaN |
| 42  |
+-----+
Fetched 2 row(s) in 0.13s

Filter using a condition that should match the regular number:

select * from test_nan where val > 0;
Fetched 0 row(s) in 0.13s

Expectation: The row with the regular number should be returned.
Actual result: No rows are returned.

Explanation

Parquet files contain statistics metadata including the fields min and max or min_value and max_value (depending on the Impala version). If the first number is a NaN, the minimum and maximum values that Impala writes in the metadata are NaN. Based on this metadata, the row group can not contain any value that matches the condition, thereby Impala discards its contents without checking the individual entries. The problem is that the statistics were incorrectly written in the first place. (This can be and has been checked by using parquet-tools meta on the Parquet file.)

What follows are just my assumptions without checking the actual code: While writing data, Impala keeps track of the smallest and largest value encountered so far. Let's call them min_so_far and max_so_far, respectively.

Initially, the first (non_NULL) value is set as both the min_so_far and max_so_far. Then each new value is compared against min_so_far and max_so_far, updating each one if necessary. In pseudo_code:

if (new_value < min_so_far) {
  min_so_far = new_value;
}

The problem is that any comparison involving NaN returns false, thereby if NaN is already in min_so_far, then no value can ever replace it and NaN will be stuck there.

On the positive side, min_so_far can only become NaN if the first value in the row group is NaN. If the first value is not NaN, then NaN can never replace min_so_far, since the comparison will always return false when it involves a NaN.

Attachments

Issue Links

is related to

IMPALA-6552 Add tests for Parquet stats filtering with +0/-0 edge cases

Open

Sub-Tasks

1.	Fix read path when Parquet min(_value)/max(_value) statistics contain NaN		Resolved	Zoltán Borók-Nagy
2.	Fix inconsistent write path of Parquet min/max statistics		Resolved	Zoltán Borók-Nagy

Activity

People

Assignee:: Zoltán Borók-Nagy

Reporter:: Zoltan Ivanfi

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 15/Feb/18 14:16

Updated:: 24/May/18 03:07

Resolved:: 08/Mar/18 17:28