Parquet / PARQUET-1225

NaN values may lead to incorrect filtering under certain circumstances

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: cpp-1.4.0
    • Component/s: parquet-cpp
    • Labels: None

    Description

      This JIRA describes a generic problem with floating-point comparisons that most probably affects parquet-cpp. The problem is known to affect Impala, and a quick look at the parquet-cpp code suggests that it affects parquet-cpp as well, although this has not yet been confirmed in practice.

      To compare float and double values for min/max stats, parquet-cpp uses the C++ less-than operator (<), which returns false for any comparison involving a NaN. This means that while gathering statistics, if a NaN is the smallest value encountered so far (which is the case after reading the first value if that value is NaN), no other value can ever replace it, since < will always be false. On the other hand, if NaN is not the first value, it never replaces the current min, because the comparison is false in that direction too. So the min value depends on the order of the elements.
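
The order dependence can be reproduced outside parquet-cpp. The following is a minimal sketch, not the parquet-cpp implementation: NaiveMin and NanAwareMin are hypothetical helpers written for this illustration, and skipping NaNs is only one possible mitigation, not necessarily the fix that was adopted.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper, not parquet-cpp code: tracks the minimum the naive way,
// seeding it with the first value and replacing it only when `v < min` holds.
double NaiveMin(const std::vector<double>& values) {
  assert(!values.empty());
  double min = values.front();
  for (double v : values) {
    if (v < min) min = v;  // always false when either operand is NaN
  }
  return min;
}

// One possible mitigation (an assumption for illustration): ignore NaNs while
// gathering the minimum. Returns NaN only if every value is NaN.
double NanAwareMin(const std::vector<double>& values) {
  assert(!values.empty());
  double min = std::numeric_limits<double>::quiet_NaN();
  for (double v : values) {
    if (std::isnan(v)) continue;              // NaN never becomes the min
    if (std::isnan(min) || v < min) min = v;  // first real value seeds the min
  }
  return min;
}
```

With these helpers, NaiveMin({NaN, 1.0, 2.0}) stays NaN, while NaiveMin({1.0, NaN, 2.0}) yields 1.0 for the same set of values. NanAwareMin yields 1.0 in both cases.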

      When a reader filters for specific values while reading the data back, a NaN in the min/max stats may cause row groups to be discarded incorrectly even though they contain matching rows. For details, please see the Impala bug IMPALA-6527.
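
How a NaN statistic can cause a row group to be dropped depends on how the reader phrases its range check. The sketch below assumes a reader that keeps a row group for an equality filter only when min <= target <= max holds. The MinMax struct and both functions are hypothetical names for this illustration, not parquet-cpp or Impala APIs.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stats holder for illustration; not the parquet-cpp type.
struct MinMax {
  double min;
  double max;
};

// Assumed pruning rule for `col = target`: keep the row group only if
// min <= target <= max. A NaN bound makes the test false, so the group
// is skipped even when it contains matching rows.
bool CanSkipRowGroup(const MinMax& stats, double target) {
  assert(!std::isnan(target));
  return !(stats.min <= target && target <= stats.max);
}

// Defensive variant (an assumption, not a documented fix): treat stats with
// a NaN bound as unusable and never skip based on them.
bool CanSkipRowGroupSafe(const MinMax& stats, double target) {
  assert(!std::isnan(target));
  if (std::isnan(stats.min) || std::isnan(stats.max)) return false;
  return !(stats.min <= target && target <= stats.max);
}
```

For a row group holding {NaN, 1.0, 2.0} whose recorded min is NaN and max is 2.0, CanSkipRowGroup reports that a search for 1.0 may skip the group, which is wrong. CanSkipRowGroupSafe falls back to reading it.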

            People

              Assignee: mdeepak Deepak Majeti
              Reporter: zi Zoltan Ivanfi