IMPALA-10083

Improve row count estimates when stats are not available

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:
      None

      Description

      Impala can make several improvements to how it estimates row counts when stats are not available for a table.

      Factors to consider:

      • Handling for partitioned vs. non-partitioned tables
        • Handling for partitioned tables can be tricky if the table is in a mixed state, where some partitions have row counts while others don't
      • Interoperability with other systems such as Hive and Spark
      • Users can run alter table statements to manually set the value of the row count
      • Handling of corrupt stats vs. missing stats
        • Corrupt stats are defined as a stats value less than -1, or a value of 0 when the underlying table has non-empty files
        • Missing stats are stats that have just not been computed, and are marked as such with the value -1
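
      The missing/corrupt definitions above, together with the mixed-state partition case, can be sketched as follows. This is an illustrative sketch only, not Impala's frontend code: the names `StatsState`, `classify_row_count`, and `estimate_table_rows` are hypothetical, and extrapolating unknown partitions from the rows-per-byte ratio of partitions with valid stats is an assumed strategy rather than one proposed in this issue.

      ```python
      from enum import Enum
      from typing import Optional, Sequence, Tuple


      class StatsState(Enum):
          VALID = "valid"
          MISSING = "missing"
          CORRUPT = "corrupt"


      def classify_row_count(num_rows: int, total_file_bytes: int) -> StatsState:
          """Classify a stored row count using the definitions in this issue."""
          if num_rows == -1:
              # -1 marks stats that have not been computed.
              return StatsState.MISSING
          if num_rows < -1:
              return StatsState.CORRUPT
          if num_rows == 0 and total_file_bytes > 0:
              # Zero rows recorded, but the files are non-empty.
              return StatsState.CORRUPT
          return StatsState.VALID


      def estimate_table_rows(
          partitions: Sequence[Tuple[int, int]],
      ) -> Optional[int]:
          """Estimate a table row count from (num_rows, total_file_bytes) pairs.

          Assumption (hypothetical strategy): partitions with missing or corrupt
          stats are extrapolated from the rows-per-byte ratio of valid partitions.
          Returns None when there is no valid partition to extrapolate from.
          """
          known_rows = 0
          known_bytes = 0
          unknown_bytes = 0
          for num_rows, file_bytes in partitions:
              if classify_row_count(num_rows, file_bytes) is StatsState.VALID:
                  known_rows += num_rows
                  known_bytes += file_bytes
              else:
                  unknown_bytes += file_bytes
          if unknown_bytes == 0:
              # Every partition either has valid stats or has no data.
              return known_rows
          if known_bytes == 0:
              return None
          return known_rows + round(known_rows * unknown_bytes / known_bytes)
      ```

      For example, a table with one partition of 100 rows in 1000 bytes and one partition with missing stats and 500 bytes of files would be estimated at 150 rows under this assumption. A table whose only partition has missing stats and non-empty files yields no estimate.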

      This JIRA will be used to track the individual improvements via sub-tasks.

            People

            • Assignee: Unassigned
            • Reporter: Sahil Takiar (stakiar)
            • Votes: 0
            • Watchers: 1