IMPALA-2373: Extrapolate the number of rows in a scan based on the rows/byte ratio

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0, Impala 2.4.0, Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Frontend

    Description

      This JIRA addresses the following problems:

      • Some partitions may be missing the #rows stat.
      • Some partitions may have the #rows stat, but it is stale because files were added or dropped after the stat was computed.

      The main idea is to use the available #rows stats to extrapolate the missing ones:

      • Store an additional statistic, rows/byte, in the TBLPROPERTIES of the table (the unit could also be rows/KB or whatever proves most suitable).
      • Compute that statistic on the impalad side as part of COMPUTE [INCREMENTAL] STATS, then ship it to the catalogd to be stored in the Metastore.
      • During query planning, use the rows/byte statistic to estimate the number of rows scanned for all partitions, regardless of whether a partition has #rows or not. The rationale is that a partition's #rows may be outdated, and the rows/byte ratio is more robust to data changes.
      • Augment SHOW TABLE STATS to display both the stored #rows and the extrapolated #rows.
      • Provide some way of reporting the stored rows/byte ratio for debugging purposes (maybe SHOW TABLE STATS or EXPLAIN?).
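
The proposal above can be sketched in a few lines. This is only an illustration of the arithmetic, not Impala's implementation: the `Partition` class and the `rows_per_byte` and `extrapolate_rows` helpers are hypothetical names, and the byte and row counts are made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Partition:
    """Hypothetical per-partition metadata, not Impala's actual classes."""
    name: str
    size_bytes: int
    num_rows: Optional[int] = None  # None means the #rows stat is missing


def rows_per_byte(partitions: Sequence[Partition]) -> Optional[float]:
    """Ratio derived at COMPUTE STATS time from partitions that have #rows."""
    with_stats = [p for p in partitions
                  if p.num_rows is not None and p.size_bytes > 0]
    total_bytes = sum(p.size_bytes for p in with_stats)
    if total_bytes == 0:
        # No usable stats at all; estimating from file size alone is a non-goal.
        return None
    return sum(p.num_rows for p in with_stats) / total_bytes


def extrapolate_rows(partitions: Sequence[Partition],
                     ratio: Optional[float]) -> Optional[int]:
    """Planning-time estimate: apply the stored ratio to the current bytes
    of every scanned partition, whether or not it has a #rows stat."""
    if ratio is None:
        return None
    return round(ratio * sum(p.size_bytes for p in partitions))


# Snapshot when stats were computed (made-up numbers).
at_stats_time = [
    Partition("p1", size_bytes=1000, num_rows=100),
    Partition("p2", size_bytes=3000, num_rows=300),
]
ratio = rows_per_byte(at_stats_time)

# Later: p1 has grown (its stored #rows is stale) and p3 was added without stats.
at_plan_time = [
    Partition("p1", size_bytes=2000, num_rows=100),
    Partition("p2", size_bytes=3000, num_rows=300),
    Partition("p3", size_bytes=500),
]
estimate = extrapolate_rows(at_plan_time, ratio)
```

Here the stored #rows would sum to 400, while the extrapolated estimate follows the current file sizes, which is the robustness to data changes that the description argues for.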

      Additional considerations

      • A table could have mixed formats
      • Even if a table has the same format, files could be compressed differently
      • It seems reasonable to ignore these issues in the first cut
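
To see why mixed formats matter, consider a worked example with made-up numbers: one text partition and one Parquet partition of equal size but very different row density. The sizes, row counts, and variable names below are assumptions chosen purely for illustration.

```python
# Hypothetical table with two partitions of equal size but different formats.
text_bytes, text_rows = 1_000_000, 10_000        # about 100 bytes per row
parquet_bytes, parquet_rows = 1_000_000, 50_000  # about 20 bytes per row

# A single table-level rows/byte ratio blends both formats.
table_ratio = (text_rows + parquet_rows) / (text_bytes + parquet_bytes)

# A scan that touches only the text partition is then overestimated.
estimated_text_rows = round(table_ratio * text_bytes)
overestimate_factor = estimated_text_rows / text_rows
```

With these numbers the blended ratio estimates three times as many rows as the text partition actually holds, which is the kind of error the first cut accepts.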

      Non-Goals

      • Estimating statistics when there are no stats at all, e.g. purely from file size without knowing any #rows.
      • Extrapolating column stats like NDV in a similar fashion. That is a much more invasive change with a smaller impact.

            People

              Assignee: alex.behm Alexander Behm
              Reporter: alan@cloudera.com Alan Choi