IMPALA-2987

Incorrect results without error in stress test

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.5.0
    • Component/s: Backend

    Description

      In a stress test run on the physical cluster, 5 of ~300 TPC-H queries against Parquet tables returned incorrect results without reporting an error. The 5 failures were not all the same query.

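      The archived profile for one of the failing queries shows row counts that differ from a correct run, most visibly at two of the exchanges: 12:EXCHANGE reports 25 rows instead of 175, and 09:EXCHANGE reports 5.73M instead of 40.11M.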

      Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
      ----------------------------------------------------------------------------------------------------------------------------------
      15:MERGING-EXCHANGE        1  134.504us  134.504us      140          20          0        -1.00 B  UNPARTITIONED
      08:TOP-N                   7  243.026ms  269.440ms      140          20   68.00 KB        4.88 KB
      14:AGGREGATE               7      1m10s       8m4s    3.33M      20.00M  154.78 MB        5.11 GB  FINALIZE
      13:EXCHANGE                7  171.997ms  212.488ms    3.33M      20.00M          0              0  HASH(c_custkey,c_name,c_acc...
      07:AGGREGATE               7      1m10s       8m3s    3.33M      20.00M  186.28 MB        5.11 GB
      06:HASH JOIN               7  142.926ms  182.497ms    9.83M      20.00M    2.04 MB       855.00 B  INNER JOIN, BROADCAST
      |--12:EXCHANGE             7    6.539us    7.819us       25          25          0              0  BROADCAST
      |  03:SCAN HDFS            1    6.666ms    6.666ms       25          25   58.00 KB       32.00 MB  tpch_100_parquet.nation
      05:HASH JOIN               7    4s179ms    4s900ms   11.46M      20.00M  586.03 MB      491.14 MB  INNER JOIN, PARTITIONED
      |--11:EXCHANGE             7      1m39s     11m27s   15.00M      15.00M          0              0  HASH(c_custkey)
      |  00:SCAN HDFS            7  857.603ms    1s963ms   15.00M      15.00M  189.95 MB      616.00 MB  tpch_100_parquet.customer
      10:EXCHANGE                7  124.665ms  214.146ms   11.46M      20.00M          0              0  HASH(o_custkey)
      04:HASH JOIN               7      1m21s      8m14s   11.46M      20.00M  650.02 MB      660.90 MB  INNER JOIN, BROADCAST
      |--09:EXCHANGE             7    1s092ms    1s890ms    5.73M      15.00M          0              0  BROADCAST
      |  01:SCAN HDFS            7      2m19s      16m6s    5.73M      15.00M  141.39 MB      264.00 MB  tpch_100_parquet.orders
      02:SCAN HDFS               7    1s190ms    1s563ms  148.07M     200.01M   72.87 MB      352.00 MB  tpch_100_parquet.lineitem
      

      The summary from a correct run of the same query is:

      +---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
      | Operator            | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                                                              |
      +---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
      | 15:MERGING-EXCHANGE | 1      | 124.17us | 124.17us | 140     | 20         | 0 B       | -1 B          | UNPARTITIONED                                                       |
      | 08:TOP-N            | 7      | 164.38ms | 183.72ms | 140     | 20         | 68.00 KB  | 4.88 KB       |                                                                     |
      | 14:AGGREGATE        | 7      | 727.67ms | 750.61ms | 3.88M   | 20.00M     | 194.13 MB | 5.11 GB       | FINALIZE                                                            |
      | 13:EXCHANGE         | 7      | 81.16ms  | 84.50ms  | 3.88M   | 20.00M     | 0 B       | 0 B           | HASH(c_custkey,c_name,c_acctbal,c_phone,n_name,c_address,c_comment) |
      | 07:AGGREGATE        | 7      | 958.25ms | 1.01s    | 3.88M   | 20.00M     | 186.28 MB | 5.11 GB       |                                                                     |
      | 06:HASH JOIN        | 7      | 67.95ms  | 70.91ms  | 11.46M  | 20.00M     | 2.04 MB   | 855 B         | INNER JOIN, BROADCAST                                               |
      | |--12:EXCHANGE      | 7      | 8.83us   | 10.81us  | 175     | 25         | 0 B       | 0 B           | BROADCAST                                                           |
      | |  03:SCAN HDFS     | 1      | 167.55ms | 167.55ms | 25      | 25         | 58.00 KB  | 32.00 MB      | tpch_100_parquet.nation                                             |
      | 05:HASH JOIN        | 7      | 714.31ms | 750.06ms | 11.46M  | 20.00M     | 586.03 MB | 491.14 MB     | INNER JOIN, PARTITIONED                                             |
      | |--11:EXCHANGE      | 7      | 277.96ms | 286.59ms | 15.00M  | 15.00M     | 0 B       | 0 B           | HASH(c_custkey)                                                     |
      | |  00:SCAN HDFS     | 7      | 449.21ms | 1.05s    | 15.00M  | 15.00M     | 317.50 MB | 616.00 MB     | tpch_100_parquet.customer                                           |
      | 10:EXCHANGE         | 7      | 45.12ms  | 49.42ms  | 11.46M  | 20.00M     | 0 B       | 0 B           | HASH(o_custkey)                                                     |
      | 04:HASH JOIN        | 7      | 2.90s    | 3.01s    | 11.46M  | 20.00M     | 650.02 MB | 660.90 MB     | INNER JOIN, BROADCAST                                               |
      | |--09:EXCHANGE      | 7      | 503.27ms | 560.62ms | 40.11M  | 15.00M     | 0 B       | 0 B           | BROADCAST                                                           |
      | |  01:SCAN HDFS     | 7      | 730.01ms | 858.07ms | 5.73M   | 15.00M     | 228.06 MB | 264.00 MB     | tpch_100_parquet.orders                                             |
      | 02:SCAN HDFS        | 7      | 136.49ms | 153.02ms | 148.07M | 200.01M    | 702.55 MB | 352.00 MB     | tpch_100_parquet.lineitem                                           |
      +---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
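
The two summaries above use different layouts, which makes the row-count differences easy to miss by eye. The following is a minimal sketch, not part of the original report, of how the #Rows column could be compared per operator. The function names (`parse_count`, `rows_by_operator`, `diff_rows`) are hypothetical, the K/M/B suffix handling is an assumption about how counts are abbreviated, and the sample rows are copied from the summaries above with padding trimmed.

```python
import re

# One operator row: "<id>:<NAME>  <#Hosts>  <Avg Time>  <Max Time>  <#Rows>".
_ROW = re.compile(r"(\d+):([A-Z][A-Z\- ]*?)\s{2,}(\d+)\s+(\S+)\s+(\S+)\s+(\S+)")
# Assumed abbreviations for displayed counts (thousand, million, billion).
_SCALE = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}


def parse_count(token):
    """Convert a displayed count such as '140' or '3.33M' to an integer."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB]?)", token)
    if m is None:
        raise ValueError(f"unrecognized count: {token!r}")
    return round(float(m.group(1)) * _SCALE[m.group(2)])


def rows_by_operator(summary):
    """Map 'id:NAME' to #Rows for every operator line in a summary."""
    rows = {}
    for line in summary.splitlines():
        # Treat table borders and plan-tree pipes as plain spacing.
        m = _ROW.search(line.replace("|", " "))
        if m:
            rows[f"{m.group(1)}:{m.group(2)}"] = parse_count(m.group(6))
    return rows


def diff_rows(suspect, reference):
    """Return {operator: (suspect_rows, reference_rows)} where they differ."""
    ref = rows_by_operator(reference)
    return {op: (n, ref[op])
            for op, n in rows_by_operator(suspect).items()
            if op in ref and n != ref[op]}


SUSPECT = """\
06:HASH JOIN               7  142.926ms  182.497ms    9.83M      20.00M    2.04 MB       855.00 B  INNER JOIN, BROADCAST
|--12:EXCHANGE             7    6.539us    7.819us       25          25          0              0  BROADCAST
|  03:SCAN HDFS            1    6.666ms    6.666ms       25          25   58.00 KB       32.00 MB  tpch_100_parquet.nation
"""

REFERENCE = """\
| 06:HASH JOIN        | 7      | 67.95ms  | 70.91ms  | 11.46M  | 20.00M     | 2.04 MB   | 855 B         | INNER JOIN, BROADCAST   |
| |--12:EXCHANGE      | 7      | 8.83us   | 10.81us  | 175     | 25         | 0 B       | 0 B           | BROADCAST               |
| |  03:SCAN HDFS     | 1      | 167.55ms | 167.55ms | 25      | 25         | 58.00 KB  | 32.00 MB      | tpch_100_parquet.nation |
"""

if __name__ == "__main__":
    for op, (got, expected) in diff_rows(SUSPECT, REFERENCE).items():
        print(f"{op}: suspect={got} reference={expected}")
```

Because the displayed counts are rounded, two runs that differ by only a few rows would compare as equal here; the raw counters in the attached profiles would be needed for an exact comparison.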
      

      The logs have messages like

      data-stream-mgr.cc:111] Datastream sender timed-out waiting for recvr for fragment instance: 72486d469653b918:e02b80cc64dfe98d (time-out was: 1m). If query was cancelled, this is not an error.
      

      However, many other queries produced the same messages. (I haven't checked whether those queries returned results; they could have failed because of memory limits.)
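
To see which fragment instances hit the sender time-out, the log lines can be scanned for the message quoted above. This is a rough sketch under stated assumptions: the pattern is built only from that one quoted message, so other Impala versions may word it differently, and `timed_out_instances` is a hypothetical helper name. The resulting IDs would still need to be matched against the query profiles by hand.

```python
import re
from collections import Counter

# Pattern derived from the single log message quoted in this report.
_TIMEOUT = re.compile(
    r"Datastream sender timed-out waiting for recvr for fragment instance: "
    r"([0-9a-f]+:[0-9a-f]+)"
)


def timed_out_instances(log_lines):
    """Count sender time-out messages per fragment instance ID."""
    counts = Counter()
    for line in log_lines:
        m = _TIMEOUT.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


SAMPLE_LOG = [
    "data-stream-mgr.cc:111] Datastream sender timed-out waiting for recvr "
    "for fragment instance: 72486d469653b918:e02b80cc64dfe98d "
    "(time-out was: 1m). If query was cancelled, this is not an error.",
    "an unrelated log line",
]

if __name__ == "__main__":
    for instance_id, n in timed_out_instances(SAMPLE_LOG).items():
        print(instance_id, n)
```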

      Attachments

        1. c441d45ac6132d78_profile_reference.txt
          262 kB
          Skye Wanderman-Milne
        2. c441d45ac6132d78_profile.txt
          161 kB
          Skye Wanderman-Milne
        3. impala-2987-ec2-stress-profile.txt
          81 kB
          Tim Armstrong
        4. profile.txt
          47 kB
          Tim Armstrong
        5. profile-incorrect-results-no-filters.txt
          102 kB
          Tim Armstrong
        6. profile-incorrect-results-no-filters-2.txt
          102 kB
          Tim Armstrong

          People

            henryr Henry Robinson
            caseyc casey