Apache Drill / DRILL-684

Use the Parquet row count in cost-based optimization. Use the Parquet row count and column value count to optimize the count() aggregate function.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      The Parquet group scan provides the exact row count and the exact value count for each individual column. This information could be leveraged in the following two ways:

      1. Use the row count in cost estimation when a query refers to Parquet files.

      2. Use the row count or the column value count to optimize the count() aggregate function.

      For instance: select count(*) from parquet_file;
      select count(column_a) from parquet_file;

      The first query could be transformed to return the row count directly, and the second could return the column value count for 'column_a'. In both cases the query avoids scanning the whole Parquet file, which improves query performance.
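
      The transformation described above can be sketched roughly as follows. This is an illustration only, not Drill's implementation: ParquetFileStats and answer_count_from_stats are hypothetical names, and the sketch assumes that the column value count is the number of non-null values in the column, which is what count(column) has to return.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ParquetFileStats:
    """Hypothetical stand-in for the counts a Parquet group scan exposes."""
    row_count: int
    # Assumption: number of non-null values per column.
    column_value_counts: Dict[str, int] = field(default_factory=dict)


def answer_count_from_stats(
    stats: ParquetFileStats, column: Optional[str] = None
) -> Optional[int]:
    """Return the count() result from metadata, or None if a scan is needed."""
    if column is None:
        # count(*) counts every row, so the row count is the answer.
        return stats.row_count
    # count(column) skips NULLs, so it maps to the column value count.
    # A column without a recorded count yields None: fall back to a scan.
    return stats.column_value_counts.get(column)


stats = ParquetFileStats(row_count=1000, column_value_counts={"column_a": 940})
count_star = answer_count_from_stats(stats)                 # select count(*)
count_column_a = answer_count_from_stats(stats, "column_a")  # select count(column_a)
count_unknown = answer_count_from_stats(stats, "column_b")   # no metadata available
```

      Returning None when no count is recorded keeps the rewrite safe: the planner would only replace the scan when the metadata can answer the query exactly.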

            People

            • Assignee:
              jni Jinfeng Ni
              Reporter:
              jni Jinfeng Ni