Apache Drill / DRILL-6557

Use size in bytes during Hive statistics calculation if present


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.0
    • Fix Version/s: 1.14.0
    • None

    Description

      Drill considers Hive statistics valid only if they contain both the number of rows and the size in bytes. If at least one of them is absent, statistics are calculated from the size in bytes of the input splits. This means that we fetch all input splits even though we might not need some of them after planning optimizations (e.g., partition pruning). However, if the number of rows is missing but the size in bytes is present, there is no need to fetch all input splits, since their total size in bytes will be the same as the value in the statistics. Skipping the fetch would improve planning time, since fetching input splits is a rather costly operation.

      This Jira aims to:
      1. check whether size in bytes is present in the statistics before fetching input splits, and use it if present;
      2. add a trace log message suggesting to run the ANALYZE command before running queries if statistics are unavailable and Drill had to fetch all input splits;
      3. minor refactoring / cleanup in the HiveMetadataProvider class.
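
      The decision described above can be sketched roughly as follows. This is an illustration only, not the actual HiveMetadataProvider code: the class name HiveStatsSketch, the method names, the 1024-byte record size used to estimate a missing row count, and the choice to treat non-positive values as absent are all assumptions made for the example. The keys "numRows" and "totalSize" are the table parameter names Hive uses for these statistics.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.logging.Logger;

// Hypothetical sketch; names and constants are illustrative assumptions.
final class HiveStatsSketch {
    private static final Logger LOG = Logger.getLogger("HiveStatsSketch");

    static final String NUM_ROWS = "numRows";     // Hive row-count parameter
    static final String TOTAL_SIZE = "totalSize"; // Hive size-in-bytes parameter
    static final long ASSUMED_RECORD_SIZE = 1024; // assumption: bytes per row

    static final class Stats {
        final long numRows;
        final long sizeInBytes;

        Stats(long numRows, long sizeInBytes) {
            this.numRows = numRows;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Returns the positive value stored under key, or -1 if it is missing,
    // unparsable, or non-positive (treated as "absent" in this sketch).
    static long readPositive(Map<String, String> params, String key) {
        String raw = params.get(key);
        if (raw == null) {
            return -1;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value > 0 ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // splitSizeFetcher stands in for the costly "fetch all input splits" step.
    static Stats estimate(Map<String, String> params, LongSupplier splitSizeFetcher) {
        long rows = readPositive(params, NUM_ROWS);
        long size = readPositive(params, TOTAL_SIZE);

        if (size < 0) {
            // Size is unknown, so the splits have to be fetched after all.
            LOG.fine("Hive statistics unavailable; consider running ANALYZE first.");
            size = splitSizeFetcher.getAsLong();
        }
        if (rows < 0) {
            // Row count is unknown: derive a rough estimate from the size.
            rows = Math.max(1, size / ASSUMED_RECORD_SIZE);
        }
        return new Stats(rows, size);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(TOTAL_SIZE, "4096"); // size present, row count missing
        Stats stats = estimate(params, () -> {
            throw new IllegalStateException("splits should not be fetched");
        });
        System.out.println(stats.numRows + " rows, " + stats.sizeInBytes + " bytes");
    }
}
```

      With only the size present, the supplier is never invoked, which is the planning-time saving this issue is after; the fetch happens only when the size itself is missing.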


          People

            arina Arina Ielchiieva
            arina Arina Ielchiieva
            Vova Vysotskyi Vova Vysotskyi