Hive / HIVE-5483

use metastore statistics to optimize max/min/etc. queries

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      We have discussed this a little bit.
      Hive can answer queries such as select max(c1) from t purely from the metastore, using partition statistics, provided that we know the statistics are up to date.
      All data changes (e.g. adding new partitions) currently go through the metastore, so we can track whether the statistics are up to date. If they are not, the queries will have to read data (at least for the outdated partitions) until someone runs analyze table. We can also analyze new partitions after they are added, if that is configured or specified in the command.
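
The idea in the description can be sketched roughly as follows. This is only an illustration, not the code in the attached patches: `PartitionStats`, its `up_to_date` flag, `max_from_stats` and the `scan` callback are hypothetical names, and the in-memory `data` dictionary stands in for real partition files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class PartitionStats:
    """Hypothetical per-partition column statistics, as a metastore might hold them."""
    partition: str
    max_value: Optional[int]  # None if the partition holds no non-null values
    up_to_date: bool          # False once data changed after the last analyze


def max_from_stats(stats: Iterable[PartitionStats],
                   scan_partition: Callable[[str], Optional[int]]) -> Optional[int]:
    """Compute max(c1), reading data only for partitions whose stats are stale."""
    candidates = []
    for s in stats:
        # Trust the stored statistic when it is fresh; otherwise read the data.
        value = s.max_value if s.up_to_date else scan_partition(s.partition)
        if value is not None:
            candidates.append(value)
    return max(candidates) if candidates else None


# Toy data: ds=3 received new rows after its statistics were computed.
data: Dict[str, List[int]] = {"ds=1": [3, 9], "ds=2": [12, 4], "ds=3": [7, 40]}
scanned: List[str] = []


def scan(partition: str) -> Optional[int]:
    """Stand-in for reading a partition's data; records which partitions were read."""
    scanned.append(partition)
    rows = data[partition]
    return max(rows) if rows else None


stats = [
    PartitionStats("ds=1", 9, True),
    PartitionStats("ds=2", 12, True),
    PartitionStats("ds=3", 7, False),  # stale: stored max no longer matches the data
]
result = max_from_stats(stats, scan)
print(result, scanned)
```

In this toy run only the stale partition is read; the two fresh partitions are answered from their stored statistics.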

      Attachments

      1. HIVE-5483.3.patch (173 kB, Ashutosh Chauhan)
      2. HIVE-5483.2.patch (165 kB, Ashutosh Chauhan)
      3. HIVE-5483.patch (105 kB, Ashutosh Chauhan)

          Activity

          Sergey Shelukhin added a comment -

          Ashutosh Chauhan, Prasanth J, Arun C Murthy: FYI

          Ashutosh Chauhan added a comment -

          Initial implementation. Ready for review.

          Prasanth J added a comment -

          Ashutosh Chauhan: In scenarios where metastore column stats are not available, do you think we can fall back to the file format and see if it exposes column-level statistics? The ORC reader provides an interface for column statistics. To make it more generic, I think we can add a new interface like StatsProvidingRecordReader, implementations of which should expose file/column statistics. We can fall back to this record reader if the metastore stats are not available or are stale. Since there are two sources of truth (file and metastore), there are two possibilities.
          1) Check the metastore; if column stats are not available, fall back to the file format.
          2) Keep the metastore as the only source of truth and make sure it is always consistent with the underlying file format. (Currently we don't make sure this is always consistent.)

          Another thing that can be fixed: there is some redundancy between stats computed by the file format and stats computed by the analyze command. If the file format gathers file-level and column-level statistics, then the analyze command should get them from the file format instead of computing them, which is much cheaper.
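
A minimal sketch of option 1 above, for illustration only. `StatsProvidingReader` is a hypothetical Python analogue of the proposed StatsProvidingRecordReader, and `FooterStatsReader`, `lookup_max` and the dictionary-based stats are assumptions made for the example, not Hive or ORC APIs.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class StatsProvidingReader(ABC):
    """Hypothetical reader interface that exposes column statistics kept by a file format."""

    @abstractmethod
    def column_max(self, column: str) -> Optional[int]:
        """Return the stored max for a column, or None if the format has no such stat."""


class FooterStatsReader(StatsProvidingReader):
    """Toy implementation backed by a dict, standing in for statistics stored in a file."""

    def __init__(self, footer_stats: Dict[str, int]) -> None:
        self._footer_stats = footer_stats

    def column_max(self, column: str) -> Optional[int]:
        return self._footer_stats.get(column)


def lookup_max(column: str,
               metastore_stats: Dict[str, int],
               reader: Optional[StatsProvidingReader]) -> Tuple[Optional[int], str]:
    """Check the metastore first, then fall back to file-level statistics."""
    value = metastore_stats.get(column)
    if value is not None:
        return value, "metastore"
    if reader is not None:
        value = reader.column_max(column)
        if value is not None:
            return value, "file"
    # Neither source has the statistic, so the query has to read the data.
    return None, "scan-required"


reader = FooterStatsReader({"c2": 99})
print(lookup_max("c1", {"c1": 10}, reader))
print(lookup_max("c2", {}, reader))
print(lookup_max("c3", {}, None))
```

The second element of the returned pair records which source answered, which makes the "two sources of truth" concern visible: a caller can tell when the answer did not come from the metastore.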

          Sergey Shelukhin added a comment -

          can you post rb/fb? thanks

          Hive QA added a comment -

          Overall: +1 all checks pass

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12609177/HIVE-5483.patch

          SUCCESS: +1 4428 tests passed

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1169/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1169/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          

          This message is automatically generated.

          Ashutosh Chauhan added a comment -

          Review request at https://reviews.facebook.net/D13605

          Ashutosh Chauhan added a comment -

          Fair points, Prasanth. I think option 2) is better for two reasons. First, not all file formats have this capability, so tying this kind of optimization to a particular format should be avoided whenever possible. Second, we would anyway want the stats in the metastore to be as fresh as possible for query planning purposes, so we are already down the path of keeping stats fresh. By the way, there is already a way to collect stats quickly without a full scan for RC (via HIVE-3958). We can do the same for ORC via HIVE-4177.

          I also agree we need to streamline our stats collection, stats storage and stats access API.

          Ashutosh Chauhan added a comment -

          Rewrote most of the patch. Addressed comments. Added more checks and tests.

          Thejas M Nair added a comment -

          Thanks for the updated patch. I have added some minor comments in Review Board. Can you please address those? +1 once those are addressed.

          Sergey Shelukhin added a comment -

          some comments also on RB

          Ashutosh Chauhan added a comment -

          Addressed comments from Sergey & Thejas. Added more tests.

          Sergey Shelukhin added a comment -

          +1

          Thejas M Nair added a comment -

          Patch committed to trunk. Thanks Ashutosh!


            People

            • Assignee: Ashutosh Chauhan
            • Reporter: Sergey Shelukhin
            • Votes: 0
            • Watchers: 2
