Actual row counts for nested loop join are way too high while the query is executing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: Impala 3.2.0
    • Component/s: Backend
    • Labels: None

    Description

      Consider this extract from a query plan:

      Operator                      #Rows  Est. #Rows
      --------------------------------------------------------------
      …
      |  10:HASH JOIN               9.53M      18.14K 
      |  |--19:EXCHANGE                 1           1
      |  |  00:SCAN HDFS                1           1
      |  06:NESTED LOOP JOIN        4.88B     863.84K 
      |  |--18:EXCHANGE                 1           1
      |  |  04:SCAN HDFS                1           1
      |  05:HASH JOIN               9.53M     863.84K
      

      If the above is to be believed, the 06 nested loop join produced about 4.88 billion rows. That number is implausibly large: joining 1 row with roughly 9.5 million rows cannot produce about 500 times that many rows.
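      The arithmetic behind that claim can be sketched as follows. This is a minimal illustration, not Impala code: `parse_rows` and `max_join_rows` are hypothetical helpers, and the upper bound assumes an inner or cross join, since the plan extract does not state the join type.

```python
# Hypothetical sanity check, not part of Impala: compare the row count
# reported for 06:NESTED LOOP JOIN against the largest count its inputs allow.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_rows(text: str) -> float:
    """Convert an abbreviated count such as '9.53M' into a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def max_join_rows(probe_rows: float, build_rows: float) -> float:
    """Upper bound assuming an inner or cross join: every pair matches."""
    return probe_rows * build_rows


probe = parse_rows("9.53M")     # actual rows from 05:HASH JOIN
build = parse_rows("1")         # actual rows from 18:EXCHANGE
reported = parse_rows("4.88B")  # actual rows shown for 06:NESTED LOOP JOIN

upper_bound = max_join_rows(probe, build)
ratio = reported / upper_bound

print(f"upper bound: {upper_bound:,.0f} rows")
print(f"reported is about {ratio:.0f} times the upper bound")
```

      With the values from the plan extract, the reported count exceeds the upper bound by a factor of roughly 500, which is what marks it as bogus.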

      It appears that the nested loop join actually processed and returned 9.53 million rows: that is the number it received from the 05 hash join, and it is also the number produced by the 10 hash join, which joins a single row with the output of the nested loop join.

      Because the same bogus result appears across multiple plans, it is likely that the reported actual row count for the nested loop join is simply wrong and bears no relation to the number of rows actually returned.

            People

              Assignee: Tim Armstrong (tarmstrong)
              Reporter: Paul Rogers (Paul.Rogers)