IMPALA-8032

Gather minimum, maximum values to better estimate inequality selectivity

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: None
    • Component/s: Catalog
    • Labels: None

    Description

      Queries often contain inequality predicates. TPC-H has many, such as l_shipdate <= '1998-09-02'.

      The planner must know the selectivity of each predicate that filters a table. The selectivity of an inequality cannot be estimated from just the NDV (number of distinct values) available in the catalog. As a result, most systems assume a fixed value around 0.3 or 0.4 (textbooks recommend 0.3).

      The query optimization literature notes that histograms are the best way to estimate the selectivity of an inequality. It also describes a cheaper alternative:

      • Assume uniform value distribution, and
      • Gather the minimum and maximum column values.

      Given these, the selectivity of an inequality is easy to estimate:

      sel(c < x) = (x - min(c)) / (max(c) - min(c))
      
      sel(c > x) = (max(c) - x) / (max(c) - min(c))
      

      The cost is just two extra values per column rather than the full cost of a histogram.
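
      The estimate can be sketched in a few lines of Python. This is only an illustration of the formulas above, not Impala planner code: the function name, the fallback constant, and the handling of missing or degenerate statistics are assumptions made for the sketch. Dates are mapped to day numbers so that the same arithmetic applies, and the result is clamped to [0, 1] for literals that fall outside the observed range. Under the uniform assumption, < and <= (and likewise > and >=) get the same estimate.

```python
from datetime import date

# Fallback guess when min/max are unavailable (the 0.3 mentioned above).
DEFAULT_INEQUALITY_SELECTIVITY = 0.3


def _to_number(v):
    """Map a value onto a number line; dates become day ordinals."""
    return v.toordinal() if isinstance(v, date) else float(v)


def inequality_selectivity(op, x, col_min, col_max):
    """Estimate sel(c op x), assuming c is uniform on [col_min, col_max].

    Hypothetical helper: op is one of '<', '<=', '>', '>='.
    """
    if op not in ("<", "<=", ">", ">="):
        raise ValueError(f"unsupported operator: {op}")
    if col_min is None or col_max is None:
        return DEFAULT_INEQUALITY_SELECTIVITY  # no stats gathered

    lo, hi, v = _to_number(col_min), _to_number(col_max), _to_number(x)
    if hi <= lo:
        # Degenerate range (single value or bad stats): formula would divide by zero.
        return DEFAULT_INEQUALITY_SELECTIVITY

    if op in ("<", "<="):
        frac = (v - lo) / (hi - lo)   # sel(c < x) = (x - min) / (max - min)
    else:
        frac = (hi - v) / (hi - lo)   # sel(c > x) = (max - x) / (max - min)

    # Literals outside [min, max] select none or all of the rows.
    return min(1.0, max(0.0, frac))


if __name__ == "__main__":
    # The min/max dates here are illustrative values, not catalog output.
    sel = inequality_selectivity(
        "<=", date(1998, 9, 2), date(1992, 1, 2), date(1998, 12, 1)
    )
    print(f"estimated selectivity: {sel:.3f}")
```

      With the illustrative range used above, the l_shipdate predicate is estimated to keep about 96% of the rows, far from the 0.3 default. The estimate is only as good as the uniformity assumption, so skewed columns would still benefit from histograms.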

            People

              Assignee: Unassigned
              Reporter: Paul Rogers
              Votes: 0
              Watchers: 2
