IMPALA-4348

TestRPCTimeout.test_execplanfragment_timeout fails in exhaustive

    Details

      Description

      The following is a test failure I saw when running the tests in exhaustive mode:

       TestRPCTimeout.test_execplanfragment_timeout[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none] 
      
      self = <test_rpc_timeout.TestRPCTimeout object at 0x5f4ffd0>
      vector = <tests.common.test_vector.TestVector object at 0x6424a10>
      
          @pytest.mark.execute_serially
          @CustomClusterTestSuite.with_args("--backend_client_rpc_timeout_ms=1000"
              " --fault_injection_rpc_delay_ms=3000 --fault_injection_rpc_type=1"
              " --datastream_sender_timeout_ms=30000")
          def test_execplanfragment_timeout(self, vector):
            for i in range(3):
              ex= self.execute_query_expect_failure(self.client, self.TEST_QUERY)
              assert "RPC recv timed out" in str(ex)
            verifiers = [ MetricVerifier(i.service) for i in ImpalaCluster().impalads ]
          
            for v in verifiers:
      >       v.wait_for_metric("impala-server.num-fragments-in-flight", 0)
      
      custom_cluster/test_rpc_timeout.py:90: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      verifiers/metric_verifier.py:59: in wait_for_metric
          self.impalad_service.wait_for_metric_value(metric_name, expected_value, timeout)
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      
      self = <tests.common.impala_service.ImpaladService object at 0x7ffb24029350>
      metric_name = 'impala-server.num-fragments-in-flight', expected_value = 0
      timeout = 60, interval = 1
      
          def wait_for_metric_value(self, metric_name, expected_value, timeout=10, interval=1):
            start_time = time()
            while (time() - start_time < timeout):
              LOG.info("Getting metric: %s from %s:%s" %
                  (metric_name, self.hostname, self.webserver_port))
              value = None
              try:
                value = self.get_metric_value(metric_name)
              except Exception, e:
                LOG.error(e)
          
              if value == expected_value:
                LOG.info("Metric '%s' has reach desired value: %s" % (metric_name, value))
                return value
              else:
                LOG.info("Waiting for metric value '%s'=%s. Current value: %s" %
                    (metric_name, expected_value, value))
              LOG.info("Sleeping %ds before next retry." % interval)
              sleep(interval)
            assert 0, 'Metric value %s did not reach value %s in %ss' %\
      >         (metric_name, expected_value, timeout)
      E     AssertionError: Metric value impala-server.num-fragments-in-flight did not reach value 0 in 60s
      
      common/impala_service.py:81: AssertionError
      

        Activity

        Juan Yu added a comment -

        Jim Apple, could you send me the job link? Thanks.

        Tim Armstrong added a comment -

        I also hit this here: http://sandbox.jenkins.cloudera.com/job/impala-private-build-and-test/4492/consoleFull
        Juan Yu added a comment -

        This only happens after change IMPALA-2905.
        It looks like the coordinator fragment gets stuck at https://github.com/apache/incubator-impala/blob/master/be/src/exec/plan-root-sink.cc#L93
        and cannot be cancelled.
        This only happens when the RPC timeout occurs in the ExecPlanFragment() RPC.

        Please note that the test only runs in exhaustive mode.
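
        For context, the hand-off at that spot is roughly of the following shape (a simplified sketch, not the actual Impala source; class and member names are approximations). The fragment's sender thread blocks until the coordinator either asks for rows or calls CloseConsumer(), so any cancellation path that skips CloseConsumer() leaves the fragment "in flight" indefinitely, which is exactly what the metric check above is seeing:

        #include <condition_variable>
        #include <mutex>

        // Minimal sketch (not the Impala source) of the producer/consumer
        // hand-off that PlanRootSink implements between the root fragment
        // (producer) and the coordinator (consumer).
        class PlanRootSinkSketch {
         public:
          // Called by the fragment thread; blocks until the coordinator wants
          // rows or has gone away. If no cancellation path ever calls
          // CloseConsumer(), this wait never returns and the fragment stays
          // "in flight".
          bool Send() {
            std::unique_lock<std::mutex> l(lock_);
            while (!consumer_ready_ && !consumer_done_) sender_cv_.wait(l);
            return !consumer_done_;  // false => treat as cancelled
          }

          // Called by the coordinator when it wants the next batch
          // (hypothetical name for the consumer-side request).
          void RequestBatch() {
            std::lock_guard<std::mutex> l(lock_);
            consumer_ready_ = true;
            sender_cv_.notify_all();
          }

          // Must be called on every coordinator teardown/cancellation path so
          // that the producer unblocks.
          void CloseConsumer() {
            std::lock_guard<std::mutex> l(lock_);
            consumer_done_ = true;
            sender_cv_.notify_all();
          }

         private:
          std::mutex lock_;
          std::condition_variable sender_cv_;
          bool consumer_ready_ = false;
          bool consumer_done_ = false;
        };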

        Juan Yu added a comment -

        Henry Robinson, could you take a look? Thanks.

        Dan Hecht added a comment -

        Henry Robinson, have you had a chance to look? This is causing exhaustive builds to fail regularly. Should we back out the change or xfail the test temporarily while you investigate?

        Henry Robinson added a comment -

        Haven't looked in detail yet, but it looks like there's a cancellation path that doesn't call CloseConsumer. I can look more closely this afternoon, but can't give an ETA on a fix. Happy to xfail the test as long as this stays open and a blocker.

        Henry Robinson added a comment -

        The problem is that Coordinator::Exec() returns early on a cancellation error, before it sets root_sink_, so when we call TearDown() the sink pointer is null and presumed to be invalid.

        I see three possible fixes:

        • In TearDown(), don't check root_sink_ but ask the FragmentMgr for the fragment instance directly.
        • Don't return from Exec() until root_sink_ is set, even if there's an error.
        • Make cancellation force tear-down of the PlanRootSink - whether or not CloseConsumer() has been called. That means making it safe for the PlanRootSink to be torn down concurrently to the coordinator using it. (This is the same issue as IMPALA-4333).

        Although options 1 and 2 are the easiest, we should probably try option 3, because we need to fix IMPALA-4333 as well. In fact, if we keep the fragment instance shared_ptr around until the coordinator is destroyed, the PlanRootSink will still be valid; that might be the most straightforward route to avoiding concurrent destruction of the PlanRootSink object.
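
        A rough sketch of that last idea, with hypothetical names (this is an illustration of the shape, not the committed patch): the coordinator holds a shared_ptr to the root fragment instance for its entire lifetime, so the PlanRootSink owned by that instance cannot be destroyed underneath it, and teardown tolerates Exec() having failed before the sink pointer was ever set:

        #include <memory>

        // Illustration only; the *Stub/*Sketch types and member names are
        // hypothetical stand-ins, not the actual Impala classes.
        struct PlanRootSinkStub {
          void CloseConsumer() { /* wakes any fragment thread blocked in Send() */ }
        };

        // Stands in for the fragment instance state that owns the sink.
        struct FragmentInstanceStub {
          PlanRootSinkStub sink;
        };

        class CoordinatorSketch {
         public:
          void TearDown() {
            // Exec() may have returned on a cancellation error before the sink
            // pointer was ever set, so a null sink must be tolerated here.
            if (root_sink_ != nullptr) root_sink_->CloseConsumer();
            // Release the instance only after the consumer side is closed.
            // Because the coordinator held this shared_ptr the whole time, the
            // sink cannot have been destroyed while fragment threads (or the
            // coordinator itself) were still using it.
            root_instance_.reset();
            root_sink_ = nullptr;
          }

         private:
          // Keeps the root fragment instance (and its PlanRootSink) alive for
          // the coordinator's lifetime.
          std::shared_ptr<FragmentInstanceStub> root_instance_;
          PlanRootSinkStub* root_sink_ = nullptr;  // points into *root_instance_
        };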

        Henry Robinson added a comment -

        Fixing this revealed a couple of other bugs that I think have been latent for a while:

        • If ExecRemoteFragment() times out, the coordinator will not try to cancel the fragment instance - even though it could have been started.
        • PlanFragmentExecutor::ReportProfile() has a bug where report_thread_active_ may not be set before the method starts, meaning that profiles are never sent. Critically, this means the fragment instance may never be cancelled (because we use ReportProfile() to determine if the coordinator is still there). See the sketch after this list.

        Both are easy fixes; it's good in a way that the unification of the execution paths has started to reveal these issues.
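
        The second item is a classic thread start-up race: if report_thread_active_ is only set from inside the new thread, ReportProfile() can observe it as false and return before sending a single report. A minimal sketch of the fix under that assumption, using std::thread in place of Impala's own thread wrapper (names follow the comment above, not necessarily the committed change), sets the flag under the lock before the thread is started:

        #include <chrono>
        #include <mutex>
        #include <thread>

        // Sketch only: the flag is set *before* the reporting thread starts,
        // so ReportProfile() cannot observe it as false and exit early.
        class ReportingSketch {
         public:
          void StartReportThread() {
            {
              std::lock_guard<std::mutex> l(lock_);
              report_thread_active_ = true;  // set before the thread can run
            }
            report_thread_ = std::thread([this] { ReportProfile(); });
          }

          void StopReportThread() {
            {
              std::lock_guard<std::mutex> l(lock_);
              report_thread_active_ = false;
            }
            if (report_thread_.joinable()) report_thread_.join();
          }

         private:
          void ReportProfile() {
            // Periodically sends status/profile reports while active. Without
            // the early flag-set in StartReportThread(), this loop could exit
            // immediately and the coordinator would never learn the fragment
            // is still running (and so would never cancel it).
            while (true) {
              {
                std::lock_guard<std::mutex> l(lock_);
                if (!report_thread_active_) break;
              }
              // ... build and send a report here ...
              std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
          }

          std::mutex lock_;
          bool report_thread_active_ = false;
          std::thread report_thread_;
        };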

        Henry Robinson added a comment -

        Fixed in https://github.com/apache/incubator-impala/commit/3b8a0891d9eebec00fc2654bb4a705752edc6759

          People

          • Assignee: Henry Robinson
          • Reporter: Jim Apple