IMPALA-3495: Stress test: codegened spilling joins return incorrect results

      Description

      The stress test started reporting that some queries return incorrect results. The failure rate is very small: tens of failures out of 10,000 queries. The symptoms look a lot like IMPALA-2987, so I tried to provide the same kind of information that's included in that bug.

      See, for example, this build:
      http://sandbox.jenkins.cloudera.com/job/Impala-Stress-Test-Physical/486/

      This is the console log containing one of the failed queries:

      00:38:57 Process Process-88:
      00:38:57 Traceback (most recent call last):
      00:38:57   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
      00:38:57     self.run()
      00:38:57   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
      00:38:57     self._target(*self._args, **self._kwargs)
      00:38:57   File "tests/stress/concurrent_select.py", line 625, in _start_single_runner
      00:38:57     % (query.result_hash, report.result_hash, query.sql))
      00:38:57 Exception: Result hash mismatch; expected 3440873535184108466, got 3440873535183883670
      00:38:57 Query: select
      00:38:57   p_brand,
      00:38:57   p_type,
      00:38:57   p_size,
      00:38:57   count(distinct ps_suppkey) as supplier_cnt
      00:38:57 from
      00:38:57   partsupp,
      00:38:57   part
      00:38:57 where
      00:38:57   p_partkey = ps_partkey
      00:38:57   and p_brand <> 'Brand#45'
      00:38:57   and p_type not like 'MEDIUM POLISHED%'
      00:38:57   and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
      00:38:57   and ps_suppkey not in (
      00:38:57     select
      00:38:57       s_suppkey
      00:38:57     from
      00:38:57       supplier
      00:38:57     where
      00:38:57       s_comment like '%Customer%Complaints%'
      00:38:57   )
      00:38:57 group by
      00:38:57   p_brand,
      00:38:57   p_type,
      00:38:57   p_size
      00:38:57 order by
      00:38:57   supplier_cnt desc,
      00:38:57   p_brand,
      00:38:57   p_type,
      00:38:57   p_size
      00:38:57 00:38:57 3153 139801718388480 INFO:concurrent_select[374]:Checking for crashes
      00:38:57 00:38:57 3153 139801718388480 INFO:concurrent_select[377]:No crashes detected
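
      For context, the mismatch check in the traceback above is conceptually simple. Here is a minimal sketch of what such a check looks like (illustrative only; the function names and hashing scheme are my assumptions, not the actual concurrent_select.py code):

      ```python
      import hashlib

      def result_hash(rows):
          """Order-independent hash of a result set: sort rows, then hash them."""
          h = hashlib.sha1()
          for row in sorted(rows):
              h.update(repr(row).encode("utf-8"))
          # Truncate to a 64-bit integer, similar in spirit to the hashes above.
          return int(h.hexdigest()[:16], 16)

      def check_result(reference_rows, actual_rows, sql):
          """Raise if the actual result's hash differs from the reference run's."""
          expected = result_hash(reference_rows)
          actual = result_hash(actual_rows)
          if expected != actual:
              raise Exception("Result hash mismatch; expected %s, got %s\nQuery: %s"
                              % (expected, actual, sql))
      ```

      The point is that a hash mismatch proves the two runs produced different result sets, but says nothing about which rows differ; for that we have to go back to the saved results and profiles.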
      

      Using the saved artifacts from the Jenkins job (results.zip), we can see that this failure corresponds to query ID 447bd5594405d60:f8b5fcc30b7b2eb6. We find it by grepping for the reported hash in the directory containing all the queries and results; the name of the matching file is the query ID.
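
      The lookup described above can be sketched as follows (a hypothetical helper; the filename-is-query-ID layout is as described in this report, everything else is illustrative):

      ```python
      import os

      def find_query_ids(results_dir, bad_hash):
          """Return query IDs (i.e. filenames) whose saved result contains bad_hash."""
          matches = []
          for name in os.listdir(results_dir):
              path = os.path.join(results_dir, name)
              if os.path.isfile(path):
                  with open(path) as f:
                      if bad_hash in f.read():
                          matches.append(name)  # the filename is the query ID
          return matches
      ```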

      I grabbed the logs and query profiles off the cluster and put them on impala-desktop in a directory named after this bug ID. The query profile can then be printed with a small script in the dev user's home directory (~dev/print_profile).

      The profile of the incorrect run is:

      Operator              #Hosts   Avg Time   Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
      ---------------------------------------------------------------------------------------------------------------------------------
      12:MERGING-EXCHANGE        1   51.893ms   51.893ms  27.84K     188.75K          0        -1.00 B  UNPARTITIONED
      07:SORT                    8   49.124ms  216.409ms  27.84K     188.75K   24.02 MB       16.00 MB    
      06:AGGREGATE               8    1s876ms    2s426ms  27.84K     188.75K    2.52 MB       10.00 MB  FINALIZE
      11:AGGREGATE               8    9s396ms   12s008ms  11.82M       3.20M  266.11 MB      257.14 MB    
      10:EXCHANGE                8  373.182ms    1s089ms  11.83M       3.20M          0              0  HASH(p_brand,p_type,p_size)
      05:AGGREGATE               8    3s818ms    4s954ms  11.83M       3.20M   12.58 MB      257.14 MB  STREAMING
      04:HASH JOIN               8    4s531ms    5s881ms  11.83M       3.20M   10.20 MB        9.07 MB  NULL AWARE LEFT ANTI JOIN, ...
      |--09:EXCHANGE             8  406.771us    2.317ms     479     100.00K          0              0  BROADCAST
      |  02:SCAN HDFS            7    4s091ms    5s725ms     479     100.00K   16.15 MB       48.00 MB  tpch_100_parquet.supplier
      03:HASH JOIN               8   23s811ms   27s344ms  11.83M       3.20M  266.17 MB       64.29 MB  INNER JOIN, BROADCAST
      |--08:EXCHANGE             8    3s183ms    3s883ms   2.97M     800.00K          0              0  BROADCAST
      |  01:SCAN HDFS            7    3s915ms    4s865ms   2.97M     800.00K   96.45 MB      352.00 MB  tpch_100_parquet.part
      00:SCAN HDFS               8    4s051ms    6s761ms  80.00M      80.00M   39.94 MB      176.00 MB  tpch_100_parquet.partsupp
      

      Compare this to a profile of the same query from a run that returned correct results (there are many such runs, since the queries are re-run). Here is 134425b2e619c627:48b806a7130653b4:

      Operator              #Hosts   Avg Time   Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
      ---------------------------------------------------------------------------------------------------------------------------------
      12:MERGING-EXCHANGE        1   27.808ms   27.808ms  27.84K     188.75K          0        -1.00 B  UNPARTITIONED
      07:SORT                    8   13.794ms   21.751ms  27.84K     188.75K   24.02 MB       16.00 MB    
      06:AGGREGATE               8    1s321ms    1s714ms  27.84K     188.75K    2.52 MB       10.00 MB  FINALIZE
      11:AGGREGATE               8    1s860ms    2s663ms  11.88M       3.20M  171.10 MB      257.14 MB    
      10:EXCHANGE                8  717.939ms    1s620ms  11.88M       3.20M          0              0  HASH(p_brand,p_type,p_size)
      05:AGGREGATE               8    3s784ms    5s524ms  11.88M       3.20M   12.71 MB      257.14 MB  STREAMING
      04:HASH JOIN               8  234.658ms  325.862ms  11.88M       3.20M    2.69 MB        9.07 MB  NULL AWARE LEFT ANTI JOIN, ...
      |--09:EXCHANGE             8  158.408us  333.299us     479     100.00K          0              0  BROADCAST
      |  02:SCAN HDFS            7    1s807ms    2s859ms     479     100.00K   16.15 MB       48.00 MB  tpch_100_parquet.supplier
      03:HASH JOIN               8    3s331ms    3s921ms  11.89M       3.20M  394.08 MB       64.29 MB  INNER JOIN, BROADCAST
      |--08:EXCHANGE             8    2s387ms    6s663ms   2.97M     800.00K          0              0  BROADCAST
      |  01:SCAN HDFS            7    1s452ms    2s243ms   2.97M     800.00K   96.28 MB      352.00 MB  tpch_100_parquet.part
      00:SCAN HDFS               8    1s084ms    2s400ms  80.00M      80.00M   39.99 MB      176.00 MB  tpch_100_parquet.partsupp 
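
      To spot where the two runs diverge, one can compare the #Rows column per operator. A rough sketch, assuming the fixed-width exec-summary format shown above (the regex and helper names are mine, not an Impala tool):

      ```python
      import re

      # One summary row: "04:HASH JOIN   8   4s531ms   5s881ms   11.83M   3.20M ..."
      ROW = re.compile(r"(\d+):([A-Z -]+?)\s+\d+\s+\S+\s+\S+\s+([\d.]+[KM]?)\s")

      def rows_by_operator(summary):
          """Map operator id -> (name, actual #Rows) parsed from an exec summary."""
          out = {}
          for line in summary.splitlines():
              m = ROW.search(line)
              if m:
                  op_id, name, rows = m.groups()
                  mult = {"K": 1e3, "M": 1e6}.get(rows[-1], 1.0)
                  out[op_id] = (name.strip(),
                                int(round(float(rows.rstrip("KM")) * mult)))
          return out

      def diff_rows(bad_summary, good_summary):
          """Operator ids whose actual row counts differ between the two runs."""
          b, g = rows_by_operator(bad_summary), rows_by_operator(good_summary)
          return {op: (b[op][1], g[op][1])
                  for op in b if op in g and b[op][1] != g[op][1]}
      ```

      Run over the two summaries above, this flags the joins and aggregates (e.g. 11.83M vs 11.88M rows at the anti join) while the scans agree, which is consistent with rows going missing inside the join/aggregation pipeline rather than at the scans.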
      

      What follows is the full list of query IDs that were reported as having hash mismatches:

      447bd5594405d60:f8b5fcc30b7b2eb6
      5642c03e84fec850:e4c4e7d4b57e5291
      bf4ada8ece00f3b4:31e11d41904adfa4
      2d4e8ec89dc9f072:67cf3c6843ed7bd
      a46cf7d01964dff:9e9eabb44ca2ca9a
      af4490619ef34246:f08c4695612c4c85
      7443d268a0ba7201:3893ffcc845a78d
      e34dfe2fb3ff9f33:c0d18ac4f45b978b
      f4ac95f838ad22e:1f9a0a66dafa46a1
      7d4713f6b51241b1:a8c7c8d00ea6a49c
      

      It's interesting to note that these 10 failures all come from the same 2 queries. Some are the query shown above; the other query is this:

      select
        c_custkey,
        c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,
        n_name,
        c_address,
        c_phone,
        c_comment
      from
        customer,
        orders,
        lineitem,
        nation
      where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= '1993-10-01'
        and o_orderdate < '1994-01-01'
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
      group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
      order by
        revenue desc
      limit 20
      

      I'll note that this started on April 21, but the test framework didn't fail the job; IMPALA-3487 has been filed to change that. What brought this to my attention was a crash that happened yesterday in the same job, tracked as IMPALA-3485.

      I think everything someone needs to look into this further is in the appropriate directory on impala-desktop.

      People

      • Assignee: tarmstrong Tim Armstrong
      • Reporter: mikeb Michael Brown