IMPALA-6538 (sub-task of IMPALA-6527: NaN values lead to incorrect statistics filtering under certain circumstances)

Fix read path when Parquet min(_value)/max(_value) statistics contain NaN


Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 3.0, Impala 2.12.0

    Description

      (Below I write only min and max, but everything also applies to min_value and max_value.)

      When both min and max are NaN:

      • Written by Impala:
        • the first element in the row group is NaN, but not all elements are (Impala writer bug)
        • all elements are NaN
      • Written by Hive/Parquet-mr:
        • all elements are NaN

      Either min or max is NaN, but not both:

      • Written by Impala:
        • this cannot happen currently
      • Written by Hive/Parquet-mr:
        • only the max can be NaN (needs to be checked)

      Therefore, if both min and max are NaN, we can't use the statistics for filtering.

      If only the max is NaN, we still have a valid lower bound.
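
The two rules above can be sketched as a small decision function. This is only an illustration in Python, not Impala's actual reader code; the name `usable_bounds` is hypothetical. The case where only the min is NaN is not covered by the description, so the sketch assumes the conservative choice of discarding both bounds there.

```python
import math
from typing import Optional, Tuple


def usable_bounds(min_stat: float, max_stat: float) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe to filter on; None means 'no bound'."""
    if math.isnan(min_stat):
        # Covers "both are NaN". A NaN min with a non-NaN max is not described
        # in the issue, so (assumption) it is treated as unusable as well.
        return (None, None)
    if math.isnan(max_stat):
        # Only the max is NaN: the min is still a valid lower bound.
        return (min_stat, None)
    return (min_stat, max_stat)
```

For example, `usable_bounds(1.0, float("nan"))` keeps `1.0` as the lower bound and drops the upper bound.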

       

      A possible workaround is to replace the NaNs with infinities, i.e. max => Inf, min => -Inf.
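
A minimal sketch of this workaround, again in Python rather than Impala's C++; `sanitize_stats` and `can_skip_for_less_than` are hypothetical names. Widening a NaN bound to an infinity makes that bound never exclude a row group, while a non-NaN bound keeps working.

```python
import math
from typing import Tuple


def sanitize_stats(min_stat: float, max_stat: float) -> Tuple[float, float]:
    """Replace a NaN min with -Inf and a NaN max with +Inf."""
    lo = -math.inf if math.isnan(min_stat) else min_stat
    hi = math.inf if math.isnan(max_stat) else max_stat
    return lo, hi


def can_skip_for_less_than(min_stat: float, max_stat: float, c: float) -> bool:
    """For the predicate `x < c`: the row group can be skipped if its lower bound is >= c."""
    lo, _ = sanitize_stats(min_stat, max_stat)
    return lo >= c
```

With both statistics NaN the lower bound becomes -Inf, so `can_skip_for_less_than` never skips; with min = 5.0 and max = NaN the row group can still be skipped for `x < 3`.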

      Based on my experiments, min/max statistics are not applied to predicates that can be true for NaN, e.g. 'NOT x < 3'.
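
The following Python sketch illustrates why such predicates would be unsafe to filter with min/max. It relies on IEEE 754 comparison rules as Python implements them, and it assumes (hypothetically) a row group whose statistics were computed while ignoring NaN values; it is not taken from Impala.

```python
import math

# Hypothetical row group containing one NaN.
values = [1.0, math.nan, 2.0]

# Assumption: the writer ignored NaN when computing the statistics.
non_nan = [v for v in values if not math.isnan(v)]
stat_min, stat_max = min(non_nan), max(non_nan)


def matches(x: float) -> bool:
    """The predicate NOT x < 3. Any comparison with NaN is false, so NaN matches."""
    return not (x < 3)


# Rows that really satisfy the predicate: only the NaN row.
matching = [v for v in values if matches(v)]

# A naive rule that reads `NOT x < 3` as `x >= 3` would skip the row group
# because stat_max < 3, and would therefore lose the NaN row.
naive_skip = stat_max < 3
```

Here `naive_skip` is true even though `matching` is non-empty, which is the kind of wrong result that applying the statistics to this predicate would produce.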

          People

            boroknagyz Zoltán Borók-Nagy