Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7038

Queries on partitioned columns scan the entire datasets

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.16.0
    • Component/s: None

      Description

      For tables with hive-style partitions like

      /table/2018/Q1
      /table/2018/Q2
      /table/2019/Q1
      etc.
      

      if any of the following queries is run:

      select distinct dir0 from dfs.`/table`
      
      select dir0 from dfs.`/table` group by dir0
      

      it will actually scan every single record in the table rather than just getting a list of directories at the dir0 level. This applies even when cached metadata is available. This is a big penalty especially as the datasets grow.

      To avoid such situations, a logical prune rule can be used to collect partition columns (`dir0`), either from metadata cache (if available) or group scan, and drop unnecessary files from being read. The rule will be applied on following conditions:
      1) all queried columns are partitoin columns, and
      2) either DISTINCT or GROUP BY operations are performed.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                bohdan Bohdan Kazydub
                Reporter:
                bohdan Bohdan Kazydub
                Reviewer:
                Gautam Parai
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: