Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9911

IS [NOT] NULL predicate selectivity estimate is wrong if #nulls is 0

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 3.4.0
    • Impala 4.0.0
    • Frontend
    • None
    • ghx-label-13

    Description

      Consider the tpcds customer table .. its c_current_addr_sk column has #Nulls = 0 as shown below.

       tpcds> show column stats customer;
      +------------------------+--------+------------------+--------+----------+-------------------+
      | Column                 | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
      +------------------------+--------+------------------+--------+----------+-------------------+
      ....
      | c_current_cdemo_sk     | INT    | 91558            | 3438   | 4        | 4                 |
      | c_current_hdemo_sk     | INT    | 7376             | 3431   | 4        | 4                 |
      | c_current_addr_sk      | INT    | 42003            | 0      | 4        | 4                 |
      ....
      

      The cardinality estimate for the following predicates shows a default selectivity of 10% being applied which is not correct:

      explain select c_current_addr_sk from customer where c_current_addr_sk is not null;
      | 00:SCAN HDFS [tpcds.customer]                              |
      |    HDFS partitions=1/1 files=1 size=12.60MB                |
      |    predicates: c_current_addr_sk IS NOT NULL               |
      |    row-size=4B cardinality=10.00K                          |
      +------------------------------------------------------------+
      
      explain select c_current_addr_sk from customer where c_current_addr_sk is null;
      | 00:SCAN HDFS [tpcds.customer]                              |
      |    HDFS partitions=1/1 files=1 size=12.60MB                |
      |    predicates: c_current_addr_sk IS NULL                   |
      |    row-size=4B cardinality=10.00K                          |
      

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            amansinha Aman Sinha
            amansinha Aman Sinha
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment