  1. Apache Drill
  2. DRILL-684

Use parquet row count in cost-based optimization. Use parquet row count, column value count to optimize count() aggregate function.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels: None

    Description

      Parquet group scan provides the exact row count and the exact value count for each individual column. Such information could be leveraged in the following two ways:

      1. Use the counts in cost estimation when a query refers to parquet files.

      2. Use the row count or column value count to optimize count() aggregate function.

      For instance, select count(*) from parquet_file;
      select count(column_a) from parquet_file;

      The first query could be transformed to return the row count directly, and the second could return the column value count for 'column_a'. Both cases avoid scanning the whole parquet files and thus improve query performance.
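
The idea above can be illustrated with a minimal sketch. It is not Drill's implementation or the attached patch: `RowGroupStats`, `count_star`, `count_column`, and `estimated_scan_rows` are hypothetical names, and the per-column counts are assumed to be non-null value counts already read from the file metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the metadata of one parquet row group."""
    num_rows: int
    # Assumed: non-null value count per column name.
    value_counts: Dict[str, int]


def count_star(row_groups: List[RowGroupStats]) -> int:
    """Answer count(*) from metadata by summing the row counts."""
    return sum(rg.num_rows for rg in row_groups)


def count_column(row_groups: List[RowGroupStats], column: str) -> Optional[int]:
    """Answer count(column) from metadata.

    Returns None when any row group has no count for the column, which
    signals that the caller has to fall back to a real scan.
    """
    total = 0
    for rg in row_groups:
        if column not in rg.value_counts:
            return None
        total += rg.value_counts[column]
    return total


def estimated_scan_rows(row_groups: List[RowGroupStats]) -> int:
    """Row estimate for costing; with exact metadata it equals count(*)."""
    return count_star(row_groups)


# Two row groups; column_a has 10 nulls in the first one.
row_groups = [
    RowGroupStats(num_rows=1000, value_counts={"column_a": 990}),
    RowGroupStats(num_rows=500, value_counts={"column_a": 500}),
]
print(count_star(row_groups))                # 1500
print(count_column(row_groups, "column_a"))  # 1490
print(count_column(row_groups, "column_b"))  # None
```

Because count(column) ignores nulls, the per-column figure has to be a non-null count for the shortcut to be correct. When it is unavailable, returning None keeps the regular scan path as the fallback.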

      Attachments

        1. DRILL-684.1.patch
          37 kB
          Jinfeng Ni

          People

            Assignee: Jinfeng Ni (jni)
            Reporter: Jinfeng Ni (jni)
            Votes: 0
            Watchers: 2
