Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7038

Queries on partitioned columns scan the entire datasets

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.16.0
    • None

    Description

      For tables with hive-style partitions like

      /table/2018/Q1
      /table/2018/Q2
      /table/2019/Q1
      etc.
      

      if any of the following queries is run:

      select distinct dir0 from dfs.`/table`
      
      select dir0 from dfs.`/table` group by dir0
      

      it will actually scan every single record in the table rather than just getting a list of directories at the dir0 level. This applies even when cached metadata is available. This is a big penalty especially as the datasets grow.

      To avoid such situations, a logical prune rule can be used to collect partition columns (`dir0`), either from metadata cache (if available) or group scan, and drop unnecessary files from being read. The rule will be applied on following conditions:
      1) all queried columns are partitoin columns, and
      2) either DISTINCT or GROUP BY operations are performed.

      Attachments

        Issue Links

          Activity

            People

              bohdan Bohdan Kazydub
              bohdan Bohdan Kazydub
              Gautam Parai Gautam Parai
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: