  1. Apache Drill
  2. DRILL-4250

File system directory-based partition pruning does not work when a directory contains both subdirectories and files.

    Details

      Description

      When a directory contains both subdirectories and files, directory-based partition pruning does not work.

      For example, I have the following directory structure, where each nation.parquet is copied from the TPC-H sample dataset.

      .//2001/Q1/nation.parquet
      .//2001/Q2/nation.parquet

      For the following query, directory-based partition pruning works correctly: the plan scans only the Q1 file.

      explain plan for select * from dfs.tmp.fileAndDir where dir0 = 2001 and dir1 = 'Q1';
      00-00    Screen
      00-01      Project(*=[$0])
      00-02        Project(*=[$0])
      00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q1/nation.parquet]], selectionRoot=file:/tmp/fileAndDir, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
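
      The dir0 and dir1 columns in the query come from the directory levels between selectionRoot and each file. The following is a minimal Python sketch of that mapping, not Drill's actual implementation; the helper name dir_columns is hypothetical, and the values are kept as strings, whereas the query compares dir0 with the numeric literal 2001.

```python
from pathlib import PurePosixPath


def dir_columns(selection_root, file_path):
    """Map each directory level below selection_root to dir0, dir1, ...

    Hypothetical helper for illustration only; this is not Drill code.
    """
    relative = PurePosixPath(file_path).relative_to(selection_root)
    # Drop the last component (the file name); keep only directory levels.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


columns = dir_columns(
    "/tmp/fileAndDir", "/tmp/fileAndDir/2001/Q1/nation.parquet"
)
print(columns)
```

      With this mapping, the filter on dir0 and dir1 can be answered from the path alone, which is what lets the planner drop the Q2 file before the scan.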
      

      However, if I add a nation.parquet file to the 2001 directory, like the following:

      .//2001/nation.parquet
      .//2001/Q1/nation.parquet
      .//2001/Q2/nation.parquet
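
      Both layouts can be recreated with a small script. This is a sketch under assumptions: build_layout is a hypothetical helper, and the caller supplies the path to a nation.parquet file and the target root (the plans in this report use /tmp/fileAndDir).

```python
import shutil
from pathlib import Path


def build_layout(root: Path, source: Path, mixed: bool) -> list:
    """Copy `source` into the directory layout described in this report.

    Hypothetical helper for reproduction only. With mixed=True, the 2001
    directory contains both a file and the Q1/Q2 subdirectories.
    """
    targets = ["2001/Q1/nation.parquet", "2001/Q2/nation.parquet"]
    if mixed:
        targets.append("2001/nation.parquet")
    for rel in targets:
        dest = root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, dest)
    # Return the resulting files relative to root, for a quick sanity check.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))
```

      Calling build_layout with mixed=False gives the layout for which pruning works; calling it again with mixed=True adds the extra file under 2001.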

      Then the same query no longer has partition pruning applied: the plan scans all three files and filters on dir0 and dir1 afterwards.

      explain plan for select * from dfs.tmp.fileAndDir where dir0 = 2001 and dir1 = 'Q1';
      00-00    Screen
      00-01      Project(*=[$0])
      00-02        Project(T0¦¦*=[$0])
      00-03          SelectionVectorRemover
      00-04            Filter(condition=[AND(=($1, 2001), =($2, 'Q1'))])
      00-05              Project(T0¦¦*=[$0], dir0=[$1], dir1=[$2])
      00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/nation.parquet], ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q1/nation.parquet], ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q2/nation.parquet]], selectionRoot=file:/tmp/fileAndDir, numFiles=3, usedMetadataFile=false, columns=[`*`]]])
      

      I should note that in the second case, where partition pruning did not work, the query still returned the correct result. Therefore, this issue only impacts query performance, not the query result.
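
      The difference between the two plans can be sketched as follows. This is an illustration, not Drill code: the function names are hypothetical, and it assumes that a file sitting directly under 2001 has no value for dir1 (modeled as None), so the dir1 = 'Q1' filter rejects it.

```python
from pathlib import PurePosixPath

ROOT = "/tmp/fileAndDir"
FILES = [
    "/tmp/fileAndDir/2001/nation.parquet",
    "/tmp/fileAndDir/2001/Q1/nation.parquet",
    "/tmp/fileAndDir/2001/Q2/nation.parquet",
]


def dir_values(path, depth=2):
    """Return (dir0, dir1) for a file; missing levels become None (assumption)."""
    parts = PurePosixPath(path).relative_to(ROOT).parts[:-1]
    return tuple(parts[i] if i < len(parts) else None for i in range(depth))


def matches(path):
    # Mirrors the WHERE clause: dir0 = 2001 AND dir1 = 'Q1'.
    return dir_values(path) == ("2001", "Q1")


def scan_then_filter(files):
    """Unpruned plan: read every file, then apply the filter."""
    read = list(files)
    return [f for f in read if matches(f)], len(read)


def prune_then_scan(files):
    """Pruned plan: apply the filter to paths first, then read what is left."""
    read = [f for f in files if matches(f)]
    return read, len(read)


print(scan_then_filter(FILES))  # same surviving file, 3 files read
print(prune_then_scan(FILES))   # same surviving file, 1 file read
```

      Both functions return the same surviving file; only the number of files read differs, which corresponds to numFiles=3 versus numFiles=1 in the plans above.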

            People

            • Assignee:
              Jinfeng Ni
              Reporter:
              Jinfeng Ni
              Reviewer:
              Rahul Kumar Challapalli