IMPALA-10083

Improve row count estimates when stats are not available

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Frontend

    Description

      There are several ways to improve row count estimates for a table even when its stats are not available.

      Factors to consider:

      • Handling for partitioned vs. non-partitioned tables
        • Handling for partitioned tables can be tricky if the table is in a mixed state, where some partitions have row counts while others don't
      • Interoperability with other systems such as Hive and Spark
      • Users can run ALTER TABLE statements to manually set the value of the row count
      • Handling of corrupt stats vs. missing stats
        • Corrupt stats are defined as a stats value less than -1, or a value of 0 when the underlying table has nonempty files
        • Missing stats are stats that have not been computed yet; they are marked with the value -1
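
      The corrupt vs. missing distinction above, and the mixed state of a partitioned table, can be sketched as follows. This is only an illustration of the definitions in this description, not Impala's frontend code: the names StatsState, classify_row_count and is_mixed_state are hypothetical, and "nonempty files" is assumed to mean a total file size greater than zero bytes.

```python
from enum import Enum
from typing import Iterable, Tuple


class StatsState(Enum):
    VALID = "valid"
    MISSING = "missing"
    CORRUPT = "corrupt"


def classify_row_count(num_rows: int, total_file_bytes: int) -> StatsState:
    """Classify a stored row count using the definitions in this issue.

    total_file_bytes is an assumed stand-in for "the underlying table
    has nonempty files".
    """
    if num_rows == -1:
        # -1 marks stats that have not been computed.
        return StatsState.MISSING
    if num_rows < -1:
        # Any value below -1 is corrupt.
        return StatsState.CORRUPT
    if num_rows == 0 and total_file_bytes > 0:
        # Zero rows cannot be right if the files contain data.
        return StatsState.CORRUPT
    return StatsState.VALID


def is_mixed_state(partitions: Iterable[Tuple[int, int]]) -> bool:
    """Return True when some partitions have usable row counts and others do not.

    Each partition is a (num_rows, total_file_bytes) pair.
    """
    states = {classify_row_count(rows, size) for rows, size in partitions}
    return StatsState.VALID in states and len(states) > 1
```

      For example, a table whose partitions are (100, 1024) and (-1, 2048) is in a mixed state under this sketch, because the first partition has a usable row count and the second has missing stats.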

      This JIRA will be used to track the individual improvements via sub-tasks.

          People

            Assignee: Unassigned
            Reporter: Sahil Takiar (stakiar)
            Votes: 0
            Watchers: 2
