Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4792

NDV estimates for case expressions with limited number of output values could be improved

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Frontend
    • Labels:

      Description

      In the below example, the case expression only has two possible values but the planner estimates 7300 distinct values. This is a fairly common pattern in real-world queries and we could get a better estimate by analysing the case expression.

      [localhost:21000] > set explain_level=2;
      EXPLAIN_LEVEL set to 2
      [localhost:21000] > explain select distinct case when id = 1 then 'yes' else 'no' end from functional.alltypes;
      Query: explain select distinct case when id = 1 then 'yes' else 'no' end from functional.alltypes
      +---------------------------------------------------------------+
      | Explain String                                                |
      +---------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=180.00MB VCores=2     |
      |                                                               |
      | PLAN-ROOT SINK                                                |
      | |                                                             |
      | 04:EXCHANGE [UNPARTITIONED]                                   |
      | |  hosts=3 per-host-mem=unavailable                           |
      | |  tuple-ids=1 row-size=16B cardinality=7300                  |
      | |                                                             |
      | 03:AGGREGATE [FINALIZE]                                       |
      | |  group by: CASE WHEN id = 1 THEN 'yes' ELSE 'no' END        |
      | |  hosts=3 per-host-mem=10.00MB                               |
      | |  tuple-ids=1 row-size=16B cardinality=7300                  |
      | |                                                             |
      | 02:EXCHANGE [HASH(CASE WHEN id = 1 THEN 'yes' ELSE 'no' END)] |
      | |  hosts=3 per-host-mem=0B                                    |
      | |  tuple-ids=1 row-size=16B cardinality=7300                  |
      | |                                                             |
      | 01:AGGREGATE [STREAMING]                                      |
      | |  group by: CASE WHEN id = 1 THEN 'yes' ELSE 'no' END        |
      | |  hosts=3 per-host-mem=10.00MB                               |
      | |  tuple-ids=1 row-size=16B cardinality=7300                  |
      | |                                                             |
      | 00:SCAN HDFS [functional.alltypes, RANDOM]                    |
      |    partitions=24/24 files=24 size=478.45KB                    |
      |    table stats: 7300 rows total                               |
      |    column stats: all                                          |
      |    hosts=3 per-host-mem=160.00MB                              |
      |    tuple-ids=0 row-size=4B cardinality=7300                   |
      +---------------------------------------------------------------+
      Fetched 28 row(s) in 0.01s
      [localhost:21000] > 
      

        Activity

        Hide
        joemcdonnell Joe McDonnell added a comment -

        commit 59cdf6b8f2a6180b727bcb9ee336a65381377ace
        Author: Joe McDonnell <joemcdonnell@cloudera.com>
        Date: Mon Jan 23 11:11:25 2017 -0800

        IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

        If all the return values of a Case expression have a known number of
        distinct values (i.e. they are constant or statistics exist), then
        the number of distinct values for the Case can be computed using this
        information.

        In order for the value from Case to be used at higher levels in the tree,
        the implementation of computeNumDistinctValues for Expr needed to change.
        Previously, Expr calculated the number of distinct values by finding any
        SlotRefs in its tree and taking the maximum of the distinct values from
        those SlotRefs. This would ignore the value from CaseExpr. To fix this,
        Expr now takes the maximum number of distinct values across all of its
        children.

        – explaining this statement shows cardinality = 2
        explain select distinct case when id = 1 then 'yes' else 'no' end
        from functional.alltypes;

        – explaining this statement shows cardinality = 2
        explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
        from functional.alltypes;

        – explaining this statement shows cardinality = 7300
        explain select distinct case when id = 1 then 0 else id end
        from functional.alltypes;

        – explaining this statement shows cardinality = 737 (date_string_col has lower
        – cardinality than id)
        explain select distinct case when id = 1 then 'yes' else date_string_col end
        from functional.alltypes;

        For cases when the number of distinct values is not known for all the outputs,
        this will return -1, indicating that the number of distinct values is not
        known. The inputs (whens) are not used for calculating the number of distinct
        values.

        Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
        Reviewed-on: http://gerrit.cloudera.org:8080/5768
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        joemcdonnell Joe McDonnell added a comment - commit 59cdf6b8f2a6180b727bcb9ee336a65381377ace Author: Joe McDonnell <joemcdonnell@cloudera.com> Date: Mon Jan 23 11:11:25 2017 -0800 IMPALA-4792 : Fix number of distinct values for a CASE with constant outputs If all the return values of a Case expression have a known number of distinct values (i.e. they are constant or statistics exist), then the number of distinct values for the Case can be computed using this information. In order for the value from Case to be used at higher levels in the tree, the implementation of computeNumDistinctValues for Expr needed to change. Previously, Expr calculated the number of distinct values by finding any SlotRefs in its tree and taking the maximum of the distinct values from those SlotRefs. This would ignore the value from CaseExpr. To fix this, Expr now takes the maximum number of distinct values across all of its children. – explaining this statement shows cardinality = 2 explain select distinct case when id = 1 then 'yes' else 'no' end from functional.alltypes; – explaining this statement shows cardinality = 2 explain select distinct char_length(case when id = 1 then 'yes' else 'no' end) from functional.alltypes; – explaining this statement shows cardinality = 7300 explain select distinct case when id = 1 then 0 else id end from functional.alltypes; – explaining this statement shows cardinality = 737 (date_string_col has lower – cardinality than id) explain select distinct case when id = 1 then 'yes' else date_string_col end from functional.alltypes; For cases when the number of distinct values is not known for all the outputs, this will return -1, indicating that the number of distinct values is not known. The inputs (whens) are not used for calculating the number of distinct values. Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2 Reviewed-on: http://gerrit.cloudera.org:8080/5768 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            joemcdonnell Joe McDonnell
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development