IMPALA-2328

Parquet scan should use min/max statistics to skip blocks based on predicate

    Details

      Description

      Parquet stores min/max statistics, which can be used to skip reading blocks that cannot satisfy a given predicate.

      The query below ends up scanning all rows, which is unnecessary.

      select count(*) from tpch_parquet.lineitem where l_orderkey = -1;
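
To illustrate the idea behind the request, here is a minimal sketch of how min/max statistics could let a scan skip blocks for an equality predicate such as `l_orderkey = -1`. It is not Impala's code: the `RowGroupStats` class, the `may_contain_eq` helper, and the sample values are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one column of one block (row group)."""
    min_value: int
    max_value: int
    num_rows: int


def may_contain_eq(stats: RowGroupStats, constant: int) -> bool:
    """Return True if some row in the block could equal `constant`.

    A value outside [min_value, max_value] cannot occur in the block,
    so the block can be skipped without reading any of its rows.
    """
    return stats.min_value <= constant <= stats.max_value


# Made-up statistics for three blocks of an order-key column.
row_groups = [
    RowGroupStats(min_value=1, max_value=1000, num_rows=500),
    RowGroupStats(min_value=1001, max_value=2000, num_rows=500),
    RowGroupStats(min_value=2001, max_value=3000, num_rows=500),
]

# Predicate: column = -1. No block's range contains -1.
to_scan = [rg for rg in row_groups if may_contain_eq(rg, -1)]
rows_scanned = sum(rg.num_rows for rg in to_scan)
```

Under these assumed statistics no block qualifies, so the count can be answered without reading any rows. The check is conservative: a block whose range does contain the constant still has to be scanned, because min/max alone cannot prove that a matching row exists.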
      

          Activity

          jan.chou.wu@gmail.com Jian Wu added a comment -

          I have worked on this. Do you mind if I contribute our code?

          mmokhtar Mostafa Mokhtar added a comment -

          Jian Wu
          Yes, that would be great.
          Can you post a code review for your work?

          jan.chou.wu@gmail.com Jian Wu added a comment -

          Sure, I'll post a code review, then we can have a further discussion about this.

          jan.chou.wu@gmail.com Jian Wu added a comment -

          I have posted a code review of my implementation. Could you take a look at it?
          https://gerrit.cloudera.org/#/c/3623/

          mmokhtar Mostafa Mokhtar added a comment -

          Jian Wu
          Are you able to continue working on this JIRA?

          abiao.chen Biao Chen added a comment -

          I noticed that the code posted by Jian only covers numeric-valued columns.
          Will this feature eventually work for non-numeric columns as well?

          lv Lars Volker added a comment -

          IMPALA-2328: Read support for min/max Parquet statistics

          This change adds support for skipping row groups based on Parquet row
          group statistics. With this change we only support reading statistics
          from Parquet files for numerical types (bool, integer, floating point)
          and for simple predicates of the forms <slot> <op> <constant> or
          <constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ.

          Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
          Reviewed-on: http://gerrit.cloudera.org:8080/6032
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins
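
The predicate forms named in the commit message can be sketched as follows. This is only an illustration of the general technique, not the code from the patch: the function names, the `FLIP` table, and the `constant_on_left` parameter are assumptions made for this example. A predicate written as `<constant> <op> <slot>` is first rewritten to the equivalent `<slot> <op'> <constant>` by flipping the operator, then checked against the row group's min/max values.

```python
# Rewriting "<constant> <op> <slot>" as "<slot> <op'> <constant>" flips the operator.
FLIP = {"LT": "GT", "LE": "GE", "GT": "LT", "GE": "LE", "EQ": "EQ"}


def may_match(op, constant, min_value, max_value):
    """Return True if "<slot> <op> <constant>" could hold for some row
    whose slot value lies in [min_value, max_value]."""
    if op == "LT":
        return min_value < constant
    if op == "LE":
        return min_value <= constant
    if op == "GT":
        return max_value > constant
    if op == "GE":
        return max_value >= constant
    if op == "EQ":
        return min_value <= constant <= max_value
    raise ValueError(f"unsupported operator: {op}")


def can_skip_row_group(op, constant, min_value, max_value, constant_on_left=False):
    """Return True if no row in the row group can satisfy the predicate."""
    if constant_on_left:
        op = FLIP[op]
    return not may_match(op, constant, min_value, max_value)
```

For example, with an assumed row group whose column ranges from 1 to 100, `can_skip_row_group("EQ", -1, 1, 100)` reports that the row group can be skipped, while `can_skip_row_group("LE", 1, 1, 100)` does not, because a row with value 1 may exist.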

          lv Lars Volker added a comment -

          We submitted a follow-up patch to address questions that had come up after the first patch had been merged.

          IMPALA-2328: Address additional comments

          • test_parquet_stats.py was missing and the tests weren't run during
            GVO.
          • The tests in parquet_stats.test assume that the queries were executed
            in a single fragment, so they now run with 'num_nodes = 1'.
          • Parquet columns are now resolved correctly.
          • Parquet files with missing columns are now handled correctly.
          • Predicates with implicit casts can now be evaluated against
            parquet::Statistics.
          • This change also cleans up some old friend declarations I came across.

          Change-Id: I54c205fad7afc4a0b0a7d0f654859de76db29a02
          Reviewed-on: http://gerrit.cloudera.org:8080/6147
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins


            People

            • Assignee:
              lv Lars Volker
            • Reporter:
              mmokhtar Mostafa Mokhtar
            • Votes:
              3
            • Watchers:
              14
