  1. Apache Drill
  2. DRILL-5773

Project pushdown into a subquery with select *


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.16.0
    • Component/s: None
    • Labels: None

    Description

      If a subquery, table expression, or view uses `select *` and the outer query requests only a subset of columns/fields, Drill currently does not push the projection down into the subquery. As a result, the scan operator returns every column/field in the table, which can significantly hurt query performance, especially when the number of columns/fields is large.

      For instance,

      SELECT n_regionkey, count(*) AS cnt 
      FROM (SELECT * FROM cp.`tpch/nation.parquet`) AS n 
      GROUP BY n_regionkey;
      

      Here is the plan

      00-00    Screen
      00-01      Project(n_regionkey=[$0], cnt=[$1])
      00-02        Project(n_regionkey=[$0], cnt=[$1])
      00-03          HashAgg(group=[{0}], cnt=[COUNT()])
      00-04            Project(n_regionkey=[ITEM($0, 'n_regionkey')])
      00-05              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
      

      Notice that the Scan operator has columns=[`*`], indicating that it will read every column.

      For performance reasons, Drill should push the project down into a subquery that uses `select *`.
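
      For comparison, here is a hedged sketch of a manual workaround: name the needed column in the subquery instead of using `select *`, and prefix the query with EXPLAIN PLAN FOR to inspect the resulting plan. This rewrite is an assumption for illustration and has not been verified as part of this issue; the expectation is that the Scan operator would then list only `n_regionkey` in its columns, which is the behavior the requested pushdown should produce automatically.

      ```sql
      -- Illustrative workaround (not from the original report):
      -- project the single required column inside the subquery,
      -- then check the Scan operator's columns=[...] in the plan output.
      EXPLAIN PLAN FOR
      SELECT n_regionkey, count(*) AS cnt
      FROM (SELECT n_regionkey FROM cp.`tpch/nation.parquet`) AS n
      GROUP BY n_regionkey;
      ```

      This only helps when the subquery text can be edited; for views defined with `select *`, the planner-side pushdown described above is still needed.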

          People

            hanu.ncr Hanumath Rao Maduri
            jni Jinfeng Ni