  1. IMPALA
  2. IMPALA-2373

Extrapolate the number of rows in a scan based on the rows/byte ratio

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0, Impala 2.4.0, Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Frontend

      Description

      This JIRA addresses the following problems:

      • Some partitions may be missing the #rows stat
      • Some partitions may have the #rows stat, but it is stale because files were added or dropped since the stat was computed

      The main idea is to use the available #rows stats to extrapolate the missing or stale ones:

      • Store an additional statistic, rows/byte, in the TBLPROPERTIES of the table (the unit could also be rows/kbyte or whatever seems most suitable)
      • That statistic is computed as part of COMPUTE [INCREMENTAL] STATS on the impalad side and then shipped to the catalogd, which stores it in the Metastore
      • During query planning we use the rows/byte statistic to estimate the number of rows scanned for all partitions, regardless of whether a partition has #rows or not. The rationale is that the stored #rows of a partition may be outdated, whereas an estimate based on the rows/byte ratio and the current file sizes is more robust to data changes.
      • We should augment SHOW TABLE STATS to display the stored #rows as well as the extrapolated #rows.
      • We should have some way of reporting the stored rows/byte ratio for debugging purposes (maybe SHOW TABLE STATS or EXPLAIN?)
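
      The proposal above can be sketched roughly as follows. This is a minimal illustration in Python, not Impala's implementation: the names PartitionStats, compute_rows_per_byte and extrapolate_rows, the field names, and the convention that -1 marks a missing #rows stat are all assumptions made for this example, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PartitionStats:
    """Per-partition inputs. All field names are hypothetical."""
    name: str
    stored_rows: int    # #rows from the last COMPUTE STATS, or -1 if missing
    stats_bytes: int    # total file size when #rows was computed
    current_bytes: int  # total file size at planning time


def compute_rows_per_byte(partitions: Iterable[PartitionStats]) -> Optional[float]:
    """Table-level rows/byte ratio over partitions that have a #rows stat.

    Returns None when no partition has usable stats, so nothing is
    estimated from file sizes alone.
    """
    total_rows = 0
    total_bytes = 0
    for p in partitions:
        if p.stored_rows >= 0 and p.stats_bytes > 0:
            total_rows += p.stored_rows
            total_bytes += p.stats_bytes
    return total_rows / total_bytes if total_bytes > 0 else None


def extrapolate_rows(p: PartitionStats, rows_per_byte: float) -> int:
    """Estimate #rows from the current file size, ignoring the stored #rows."""
    return round(p.current_bytes * rows_per_byte)


# Invented example data: one fresh, one stale, one partition without stats.
partitions = [
    PartitionStats("p1", stored_rows=1000, stats_bytes=10_000, current_bytes=10_000),
    PartitionStats("p2", stored_rows=3000, stats_bytes=30_000, current_bytes=60_000),  # stale
    PartitionStats("p3", stored_rows=-1, stats_bytes=0, current_bytes=20_000),  # missing
]
ratio = compute_rows_per_byte(partitions)
if ratio is not None:
    for p in partitions:
        print(p.name, "stored:", p.stored_rows, "extrapolated:", extrapolate_rows(p, ratio))
```

      In this sketch the ratio is derived once from the partitions that have stats and is then applied to every partition's current size, so the stale partition p2 and the stats-less partition p3 both receive an estimate.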

      Additional considerations

      • A table could have mixed formats
      • Even if a table has the same format, files could be compressed differently
      • It seems reasonable to ignore these issues in the first cut
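
      To give a feel for what ignoring these issues costs, here is a small worked example in Python. The two partitions and their sizes are hypothetical numbers chosen only for illustration; they assume the same row count stored once uncompressed and once compressed to a fifth of the size.

```python
# Hypothetical partitions: same number of rows, very different file sizes.
partitions = {
    "uncompressed": {"rows": 1_000_000, "bytes": 100_000_000},
    "compressed": {"rows": 1_000_000, "bytes": 20_000_000},
}

total_rows = sum(p["rows"] for p in partitions.values())
total_bytes = sum(p["bytes"] for p in partitions.values())
table_ratio = total_rows / total_bytes  # single table-wide rows/byte ratio

# Per-partition estimate using the single table-wide ratio.
estimates = {name: p["bytes"] * table_ratio for name, p in partitions.items()}

# Relative error of each estimate against the true row count.
errors = {
    name: (estimates[name] - p["rows"]) / p["rows"]
    for name, p in partitions.items()
}

for name in partitions:
    print(f"{name}: estimated {estimates[name]:.0f} rows, relative error {errors[name]:+.0%}")
print(f"full scan: estimated {sum(estimates.values()):.0f} rows, actual {total_rows}")
```

      With these numbers the uncompressed partition is overestimated and the compressed one underestimated by about two thirds each, while the estimate for a scan of the whole table still matches the true total. A single ratio is therefore least accurate for queries that prune down to partitions whose format or compression differs from the table average.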

      Non-Goals

      • Estimate statistics if there are no stats at all, e.g. purely based on file size without knowing any #rows
      • Extrapolate column stats like NDV in a similar fashion. That is a much more invasive change with a smaller impact.

              People

              • Assignee: Alexander Behm (alex.behm)
              • Reporter: Alan Choi (alan@cloudera.com)