Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-12006

Outer/inner join cardinality highly overestimated

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 4.2.0
    • Impala 4.3.0
    • Frontend
    • None
    • ghx-label-12

    Description

      In one of the use cases, we have seen the cardinality estimate for left outer join highly overestimated. The plan is complex and only a partial output is shown below (with the column names anonymized):

        57:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
        |  hash-table-id=121
        |  hash predicates: a.id = a.id
        |  fk/pk conjuncts: none
        |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
        |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N,29N,31N row-size=2.63KB cardinality=3.90T
        |  in pipelines: 06(GETNEXT), 26(OPEN)
        |
        |--F1253:PLAN FRAGMENT hosts=13 instances=13
        |  |  Per-Instance Resources: mem-estimate=1.10GB mem-reservation=204.00MB thread-reservation=1
        |  JOIN BUILD
        |  |  join-table-id=121 plan-id=122 cohort-id=25
        |  |  build expressions: a.id
        |  |  mem-estimate=1.08GB mem-reservation=204.00MB spill-buffer=2.00MB thread-reservation=0
        |  |
        |  1758:EXCHANGE [BROADCAST]
        |  |  mem-estimate=20.87MB mem-reservation=0B thread-reservation=0
        |  |  tuple-ids=29,31 row-size=85B cardinality=9.56M
        |  |  in pipelines: 26(GETNEXT)
        |  |
        ...
        ...
        ...
        56:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
        |  hash predicates: ifnull(a.id, a.id) = a.id
        |  fk/pk conjuncts: assumed fk/pk
        |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
        |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N row-size=2.55KB cardinality=14.97G
        |  in pipelines: 06(GETNEXT), 22(OPEN)
      

      Note that the left input of the join is estimated as 14.97G rows, right input as 9.56M rows but the LOJ estimate is 3.9T rows. We need to investigate why that is so and fix it. The NDV of the based column involved in the join is 36661 but in the lower join there are functions involved in the join condition.

      Attachments

        Activity

          People

            amansinha Aman Sinha
            amansinha Aman Sinha
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: