Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5198

Error messages are sometimes dropped before reaching client

    Details

    • Epic Color:
      ghx-label-6

      Description

      On the nightly core s3 tests, the custom cluster test test_exchange_delays failed. It looks like the query failed for an unexplained reason. It says 'aborted' but we need to look at it more closely.

      07:56:58 =================================== FAILURES ===================================
      07:56:58  TestExchangeDelays.test_exchange_small_delay[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none] 
      07:56:58 
      07:56:58 self = <test_exchange_delays.TestExchangeDelays object at 0x3f054d0>
      07:56:58 vector = <tests.common.test_vector.ImpalaTestVector object at 0x5a32750>
      07:56:58 
      07:56:58     @pytest.mark.execute_serially
      07:56:58     @CustomClusterTestSuite.with_args("--stress_datastream_recvr_delay_ms=10000"
      07:56:58           " --datastream_sender_timeout_ms=5000")
      07:56:58     def test_exchange_small_delay(self, vector):
      07:56:58       """Test delays in registering data stream receivers where the first one or two
      07:56:58         batches will time out before the receiver registers, but subsequent batches will
      07:56:58         arrive after the receiver registers. Before IMPALA-2987, this scenario resulted in
      07:56:58         incorrect results.
      07:56:58         """
      07:56:58 >     self.run_test_case('QueryTest/exchange-delays', vector)
      07:56:58 
      07:56:58 custom_cluster/test_exchange_delays.py:39: 
      07:56:58 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      07:56:58 common/impala_test_suite.py:362: in run_test_case
      07:56:58     self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
      07:56:58 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      07:56:58 
      07:56:58 self = <test_exchange_delays.TestExchangeDelays object at 0x3f054d0>
      07:56:58 expected_strs = ['Sender timed out waiting for receiver fragment instance\n']
      07:56:58 actual_str = 'ImpalaBeeswaxException: Query aborted: (1 of 3 similar)'
      07:56:58 use_db = None
      07:56:58 
      07:56:58     def __verify_exceptions(self, expected_strs, actual_str, use_db):
      07:56:58       """
      07:56:58         Verifies that at least one of the strings in 'expected_str' is a substring of the
      07:56:58         actual exception string 'actual_str'.
      07:56:58         """
      07:56:58       actual_str = actual_str.replace('\n', '')
      07:56:58       for expected_str in expected_strs:
      07:56:58         # In error messages, some paths are always qualified and some are not.
      07:56:58         # So, allow both $NAMENODE and $FILESYSTEM_PREFIX to be used in CATCH.
      07:56:58         expected_str = expected_str.strip() \
      07:56:58             .replace('$FILESYSTEM_PREFIX', FILESYSTEM_PREFIX) \
      07:56:58             .replace('$NAMENODE', NAMENODE) \
      07:56:58             .replace('$IMPALA_HOME', IMPALA_HOME)
      07:56:58         if use_db: expected_str = expected_str.replace('$DATABASE', use_db)
      07:56:58         # Strip newlines so we can split error message into multiple lines
      07:56:58         expected_str = expected_str.replace('\n', '')
      07:56:58         if expected_str in actual_str: return
      07:56:58       assert False, 'Unexpected exception string. Expected: %s\nNot found in actual: %s' % \
      07:56:58 >       (expected_str, actual_str)
      07:56:58 E     AssertionError: Unexpected exception string. Expected: Sender timed out waiting for receiver fragment instance
      07:56:58 E     Not found in actual: ImpalaBeeswaxException: Query aborted: (1 of 3 similar)
      07:56:58 
      07:56:58 common/impala_test_suite.py:253: AssertionError
      
      07:56:58 ----------------------------- Captured stderr call -----------------------------
      07:56:58 -- executing against localhost:21000
      07:56:58 use functional;
      07:56:58 
      07:56:58 SET disable_codegen=False;
      07:56:58 SET abort_on_error=1;
      07:56:58 SET exec_single_node_rows_threshold=0;
      07:56:58 SET batch_size=0;
      07:56:58 SET num_nodes=0;
      07:56:58 -- executing against localhost:21000
      07:56:58 select count(*)
      07:56:58 from tpch.lineitem
      07:56:58   inner join tpch.orders on l_orderkey = o_orderkey;
      07:56:58 
      07:56:58 ======== 1 failed, 43 passed, 20 skipped, 8 xfailed in 3213.63 seconds =========
      

        Issue Links

          Activity

          Hide
          mjacobs Matthew Jacobs added a comment -

          looks to me like a warning got in the way of the test error handling. Attaching a profile of the failed query from the profile directory that shows the query error is correct, but there is some ambiguous warning mentioned:

              Errors:  (1 of 3 similar)
          
          

          No detail, just an empty string

          Show
          mjacobs Matthew Jacobs added a comment - looks to me like a warning got in the way of the test error handling. Attaching a profile of the failed query from the profile directory that shows the query error is correct, but there is some ambiguous warning mentioned: Errors: (1 of 3 similar) No detail, just an empty string
          Hide
          mjacobs Matthew Jacobs added a comment -

          Looks like there is something funny with the error message handling, the log shows:

          E0411 07:11:48.006916  4793 data-stream-sender.cc:278] channel send status: 
          Sender timed out waiting for receiver fragment instance: 294eccb4ce838db9:225a9f8b00000006
          I0411 07:11:48.006999  4793 runtime-state.cc:197] Error from query 294eccb4ce838db9:225a9f8b00000000: 
          I0411 07:11:48.008085  4780 impala-internal-service.cc:77] ReportExecStatus(): instance_id=294eccb4ce838db9:225a9f8b00000003
          

          and runtime-state.cc:197 should have a message printed there

          bool RuntimeState::LogError(const ErrorMsg& message, int vlog_level) {
            lock_guard<SpinLock> l(error_log_lock_);
            // All errors go to the log, unreported_error_count_ is counted independently of the
            // size of the error_log to account for errors that were already reported to the
            // coordinator
            VLOG(vlog_level) << "Error from query " << query_id() << ": " << message.msg();
            if (ErrorCount(error_log_) < query_options().max_errors) {
              AppendError(&error_log_, message);
              return true;
            }
            return false;
          }
          
          Show
          mjacobs Matthew Jacobs added a comment - Looks like there is something funny with the error message handling, the log shows: E0411 07:11:48.006916 4793 data-stream-sender.cc:278] channel send status: Sender timed out waiting for receiver fragment instance: 294eccb4ce838db9:225a9f8b00000006 I0411 07:11:48.006999 4793 runtime-state.cc:197] Error from query 294eccb4ce838db9:225a9f8b00000000: I0411 07:11:48.008085 4780 impala-internal-service.cc:77] ReportExecStatus(): instance_id=294eccb4ce838db9:225a9f8b00000003 and runtime-state.cc:197 should have a message printed there bool RuntimeState::LogError( const ErrorMsg& message, int vlog_level) { lock_guard<SpinLock> l(error_log_lock_); // All errors go to the log, unreported_error_count_ is counted independently of the // size of the error_log to account for errors that were already reported to the // coordinator VLOG(vlog_level) << "Error from query " << query_id() << ": " << message.msg(); if (ErrorCount(error_log_) < query_options().max_errors) { AppendError(&error_log_, message); return true ; } return false ; }
          Hide
          mjacobs Matthew Jacobs added a comment -

          Sailesh, can you take a look at this one?

          Show
          mjacobs Matthew Jacobs added a comment - Sailesh, can you take a look at this one?
          Hide
          sailesh Sailesh Mukil added a comment -

          This looks like the outcome of a race and also a bug in the ErrorMsg code:

          __execute_query() calls wait_for_completion() in tests/beeswax/impala_beeswax.py. This gets the state of the query as 'EXCEPTION':
          https://github.com/apache/incubator-impala/blob/master/tests/beeswax/impala_beeswax.py#L351

          This query_state is set to EXCEPTION over here:
          https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/query-exec-state.cc#L731

          However, before it goes on to the next line (L732) and sets the query_status_ as well, the beeswax test client calls get_log():
          https://github.com/apache/incubator-impala/blob/master/tests/beeswax/impala_beeswax.py#L358

          Which ends up here:
          https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/impala-beeswax-server.cc#L289

          which will return Status::OK(), since this races with the UpdateQueryStatus() call above and checks the status before that thread can set the status to an error status.

          We get the "(1 of 3 similar)" messages because the above get_log() call ultimately calls exec_state->coord()->GetErrorLog():
          https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/impala-beeswax-server.cc#L299

          GetErrorLog() merges errors from the different fragment instances (from their respective ErrorLogMap objects)and only prints the merged log which just prints how many times the error occurred. The ErrorLogMap has ErrorMsg objects that contains 'message_' and 'details_'. It looks like none of the fragment instances have 'message_' populated, but rather have had 'details_' populated with "Sender timed out waiting for receiver fragment instance". GetErrorLog() ultimately only looks at 'message_' of each frag instance (as that's what gets populated in RuntimeState::LogError()), which is empty. So it finally just returns "(1 of 3 similar)", which really just means "1 of 3 similar empty strings".

          I'm not sure why the passing around of error strings needs to be this complicated. I'm pretty sure some modification can be done to make this simpler.

          So as a consequence of the above race and the way GetErrorLog() works, the final error seen by the client is only "(1 of 3 similar)" which causes the test to fail.

          The fix for this needs to be thought through a little bit more before going forward with it.

          Show
          sailesh Sailesh Mukil added a comment - This looks like the outcome of a race and also a bug in the ErrorMsg code: __execute_query() calls wait_for_completion() in tests/beeswax/impala_beeswax.py. This gets the state of the query as 'EXCEPTION': https://github.com/apache/incubator-impala/blob/master/tests/beeswax/impala_beeswax.py#L351 This query_state is set to EXCEPTION over here: https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/query-exec-state.cc#L731 However, before it goes on to the next line (L732) and sets the query_status_ as well, the beeswax test client calls get_log(): https://github.com/apache/incubator-impala/blob/master/tests/beeswax/impala_beeswax.py#L358 Which ends up here: https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/impala-beeswax-server.cc#L289 which will return Status::OK(), since this races with the UpdateQueryStatus() call above and checks the status before that thread can set the status to an error status. We get the "(1 of 3 similar)" messages because the above get_log() call ultimately calls exec_state->coord()->GetErrorLog(): https://github.com/apache/incubator-impala/blob/23100102c0a9a8f3a8a7ff069cbfaa7a56628238/be/src/service/impala-beeswax-server.cc#L299 GetErrorLog() merges errors from the different fragment instances (from their respective ErrorLogMap objects)and only prints the merged log which just prints how many times the error occurred. The ErrorLogMap has ErrorMsg objects that contains 'message_' and 'details_'. It looks like none of the fragment instances have 'message_' populated, but rather have had 'details_' populated with "Sender timed out waiting for receiver fragment instance". GetErrorLog() ultimately only looks at 'message_' of each frag instance (as that's what gets populated in RuntimeState::LogError()), which is empty. So it finally just returns "(1 of 3 similar)", which really just means "1 of 3 similar empty strings". I'm not sure why the passing around of error strings needs to be this complicated. I'm pretty sure some modification can be done to make this simpler. So as a consequence of the above race and the way GetErrorLog() works, the final error seen by the client is only "(1 of 3 similar)" which causes the test to fail. The fix for this needs to be thought through a little bit more before going forward with it.
          Hide
          mjacobs Matthew Jacobs added a comment -

          So the main issue is that the query state is set to exception but the query status isn't set yet?

          It looks like none of the fragment instances have 'message_' populated

          ... and this is the second bug? Makes sense, though I couldn't find where this was being not set.

          Show
          mjacobs Matthew Jacobs added a comment - So the main issue is that the query state is set to exception but the query status isn't set yet? It looks like none of the fragment instances have 'message_' populated ... and this is the second bug? Makes sense, though I couldn't find where this was being not set.
          Hide
          sailesh Sailesh Mukil added a comment -

          Matthew Jacobs Here's the culprit:
          https://github.com/apache/incubator-impala/blob/0d0c93ec8c4949940ec113192731f2adb66a0c5e/be/src/util/error-util.h#L85

          I guess fixing that should make all of this go away, even though there is that race.

          Show
          sailesh Sailesh Mukil added a comment - Matthew Jacobs Here's the culprit: https://github.com/apache/incubator-impala/blob/0d0c93ec8c4949940ec113192731f2adb66a0c5e/be/src/util/error-util.h#L85 I guess fixing that should make all of this go away, even though there is that race.
          Hide
          mjacobs Matthew Jacobs added a comment -

          nice catch! feel free to separate these out if you'd like? we can resolve the build issue in the short term and file a separate jira for the get_log() race, which seems worth fixing but hopefully we'll kill beeswax soon anyway.

          Show
          mjacobs Matthew Jacobs added a comment - nice catch! feel free to separate these out if you'd like? we can resolve the build issue in the short term and file a separate jira for the get_log() race, which seems worth fixing but hopefully we'll kill beeswax soon anyway.
          Hide
          tarmstrong Tim Armstrong added a comment -

          I don't think this is the same bug but looks like there is some connection in that the error details are dropped.

          Show
          tarmstrong Tim Armstrong added a comment - I don't think this is the same bug but looks like there is some connection in that the error details are dropped.
          Show
          sailesh Sailesh Mukil added a comment - https://github.com/apache/incubator-impala/commit/e0d1db51ed4627cbe47fcec0d9a2909d14abbe4a

            People

            • Assignee:
              sailesh Sailesh Mukil
              Reporter:
              mjacobs Matthew Jacobs
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development