  Project: Apache Drill
  Parent: DRILL-7028 Reduce the planning time of queries on large Parquet tables with large metadata cache files
  Issue: DRILL-7064

Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.16.0
    • Component/s: Metadata

    Description

      This sub-task leverages the summary statistics in the Parquet metadata cache, namely totalRowCount and the per-column totalNullCount (both aggregated across all files and row groups), to answer plain COUNT aggregation queries that have no GROUP BY. Such queries are currently converted to a DirectScan by ConvertCountToDirectScanRule, which uses the row group metadata. However, that rule is applied to Drill logical rels and converts the logical plan to a physical plan with a DirectScanPrel. By then it is too late: the DrillScanRel created during logical planning has already read the entire metadata cache file, including its full list of row group entries. The metadata cache file can grow quite large, so this approach does not scale.

      The solution is to use the metadata summary file created in DRILL-7063 and to add a new rule that is applied early, so that it operates on the Calcite logical rels instead of the Drill logical rels and prevents eager expansion of the list of files and row groups.

      The existing rule will not be removed. It will continue to operate as before, because after some transformations we may still want to apply the COUNT optimization.
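
      The arithmetic behind answering COUNT from the summary can be sketched as follows. This is only an illustration under stated assumptions, not Drill's implementation: the class SummarySketch and its methods are hypothetical, and only the field names totalRowCount and totalNullCount come from the description above. It also assumes that a missing or negative value means the statistic is unknown, in which case the planner would have to fall back to another path.

```java
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical stand-in for the metadata summary; not Drill's actual class. */
final class SummarySketch {
  private final long totalRowCount;                  // rows across all files and row groups
  private final Map<String, Long> totalNullCount;    // per-column nulls across all files and row groups

  SummarySketch(long totalRowCount, Map<String, Long> totalNullCount) {
    this.totalRowCount = totalRowCount;
    this.totalNullCount = totalNullCount;
  }

  /** COUNT(*): the total row count, if it is known. */
  OptionalLong countStar() {
    // Assumption: a negative value marks an unknown row count.
    return totalRowCount < 0 ? OptionalLong.empty() : OptionalLong.of(totalRowCount);
  }

  /** COUNT(column): total rows minus the column's nulls, if both are known. */
  OptionalLong countColumn(String column) {
    Long nulls = totalNullCount.get(column);
    if (totalRowCount < 0 || nulls == null || nulls < 0) {
      // Unknown statistics: the caller must fall back to another plan.
      return OptionalLong.empty();
    }
    return OptionalLong.of(totalRowCount - nulls);
  }
}
```

      An empty result signals that the summary cannot answer the query, which is one reason the later, row-group-based rule remains useful.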

            People

              amansinha100 Aman Sinha
              vdonapati Venkata Jyothsna Donapati
              Vova Vysotskyi Vova Vysotskyi

                Time Tracking

                  Estimated: 336h (original estimate)
                  Remaining: 336h
                  Logged: Not Specified