IMPALA-3909

Parquet file writer should populate the min/max statistics per block per column to be used by the reader

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:

      Description

      The Parquet file writer should populate the min/max statistics for each column in each block while writing data, so that the reader can use them.
      Today, data written by Hive carries those statistics, but data written by Impala does not.
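
To illustrate why a reader benefits from these statistics, here is a minimal sketch of block skipping based on per-column min/max values. The names `ColumnStats`, `compute_stats`, and `can_skip_block` are hypothetical and are not Impala's or Parquet's actual API; the sketch only assumes integer columns and a closed range predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnStats:
    """Hypothetical per-block, per-column statistics (illustration only)."""
    min_value: int
    max_value: int


def compute_stats(values: List[int]) -> ColumnStats:
    """What a writer would record for one column of one block."""
    return ColumnStats(min(values), max(values))


def can_skip_block(stats: ColumnStats, lower: int, upper: int) -> bool:
    """True if no value in the block can satisfy lower <= x <= upper."""
    return stats.max_value < lower or stats.min_value > upper


# Three blocks of one integer column; the writer stores stats per block.
blocks = [[1, 5, 9], [20, 25, 31], [40, 42, 58]]
block_stats = [compute_stats(b) for b in blocks]

# The reader evaluates "x BETWEEN 22 AND 30" against the stats only.
to_read = [i for i, s in enumerate(block_stats) if not can_skip_block(s, 22, 30)]
print(to_read)
```

Only blocks whose [min, max] range overlaps the predicate need to be decoded; the others can be skipped without reading their data.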

          Activity

          tarmstrong Tim Armstrong added a comment -

          There are a bunch of issues with the Parquet spec and parquet-mr implementation here that make it difficult to implement for some of our data types. Essentially only bool, integer and float work consistently. The issues that I'm aware of are:

          • Binary/string types should use unsigned byte comparison: https://issues.apache.org/jira/browse/PARQUET-686
          • Decimals should be ordered based on logical type: https://issues.apache.org/jira/browse/PARQUET-839
          • timestamp/int96 should be ordered in a sensible way: https://issues.apache.org/jira/browse/PARQUET-840

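
The binary/string ordering problem can be shown with a small sketch. It assumes, for illustration only, one writer that orders bytes as signed 8-bit values and one reader that orders them as unsigned values; `signed_key` is a hypothetical helper and not part of any Parquet library.

```python
def signed_key(b: bytes):
    """Sort key that treats each byte as a signed 8-bit value."""
    return [x - 256 if x >= 128 else x for x in b]


# UTF-8 encodings; "é" starts with byte 0xC3, which is negative when signed.
values = [s.encode("utf-8") for s in ["a", "é", "z"]]

# Python compares bytes objects as unsigned values, lexicographically.
unsigned_min, unsigned_max = min(values), max(values)

# A writer using signed comparison would record different statistics.
signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)

print(unsigned_min, unsigned_max)
print(signed_min, signed_max)

# A reader using unsigned comparison against the signed-written stats
# would conclude that "é" lies outside [signed_min, signed_max].
target = "é".encode("utf-8")
wrongly_skipped = target < signed_min or target > signed_max
print(wrongly_skipped)
```

When writer and reader disagree on the ordering, a block that contains the value can be skipped, which is why statistics for these types cannot be trusted until the ordering is pinned down.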
          lv Lars Volker added a comment -

          IMPALA-3909: Populate min/max statistics in Parquet writer

          Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
          Reviewed-on: http://gerrit.cloudera.org:8080/5611
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>

          lv Lars Volker added a comment -

          This feature had broken tests for non-default filesystems, which were fixed in IMPALA-4887

          jrussell John Russell added a comment -

          There are separate JIRAs to do the min/max stats for DECIMAL, STRING, and TIMESTAMP. Are those the only types covered by this Parquet improvement in 2.9, or are BOOLEAN and all the floating-point and integer types covered by this JIRA? (Initially I was thinking of this JIRA as the umbrella one and IMPALA-4815 et al as subtasks, but perhaps that's a misperception.)

          lv Lars Volker added a comment -

          We have support for writing statistics for all types in 2.9. Originally
          this JIRA was meant to cover all types, but then we took out the more
          complicated ones and addressed them in IMPALA-4815 et al.

          (In reply to John Russell's comment above, Jun 2, 2017 15:43.)

            People

            • Assignee: Lars Volker (lv)
            • Reporter: Mostafa Mokhtar (mmokhtar)