IMPALA-13167

Impala's coordinator could not be reached after a restart in a custom cluster test in the ASAN build

    Description

      In an internal Jenkins run, we found that Impala's coordinator could become unreachable after a restart that followed the coordinator hitting a DCHECK during the custom cluster tests in the ASAN build on ARM.

      Specifically, in that Jenkins run, Impala's coordinator hit the DCHECK in RuntimeProfile::EventSequence::Start(int64_t start_time_ns) while running a query from ranger_column_masking_complex_types.test, which is run by test_column_masking(). This is a known issue described in IMPALA-4631.

      Since the Impala daemons and the catalog server are restarted for each test in test_ranger.py, the tests that run after test_column_masking() would be expected to pass. That was not the case, however. For the next few tests in test_ranger.py (e.g., test_block_metadata_update()), Impala's pytest framework could not connect to the coordinator, so those tests failed with the following error.

      -- 2024-06-18 08:49:43,350 INFO     MainThread: Starting cluster with command: /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/bin/start-impala-cluster.py '--state_store_args=--statestore_update_frequency_ms=50     --statestore_priority_update_frequency_ms=50     --statestore_heartbeat_frequency_ms=50' --cluster_size=3 --num_coordinators=3 --log_dir=/data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests --log_level=1 '--impalad_args=--server-name=server1 --ranger_service_type=hive --ranger_app_id=impala --authorization_provider=ranger ' '--state_store_args=None ' '--catalogd_args=--server-name=server1 --ranger_service_type=hive --ranger_app_id=impala --authorization_provider=ranger ' --impalad_args=--default_query_options=
      08:49:43 MainThread: Found 0 impalad/0 statestored/0 catalogd process(es)
      08:49:43 MainThread: Starting State Store logging to /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests/statestored.INFO
      08:49:43 MainThread: Starting Catalog Service logging to /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests/catalogd.INFO
      08:49:44 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests/impalad.INFO
      08:49:44 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests/impalad_node1.INFO
      08:49:44 MainThread: Starting Impala Daemon logging to /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/logs/custom_cluster_tests/impalad_node2.INFO
      08:49:47 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:47 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:47 MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25000
      08:49:47 MainThread: Debug webpage not yet available: HTTPConnectionPool(host='impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com', port=25000): Max retries exceeded with url: /backends?json (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff8d176750>: Failed to establish a new connection: [Errno 111] Connection refused',))
      08:49:49 MainThread: Debug webpage did not become available in expected time.
      08:49:49 MainThread: Waiting for num_known_live_backends=3. Current value: None
      08:49:50 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:50 MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25000
      08:49:50 MainThread: Waiting for num_known_live_backends=3. Current value: 0
      08:49:51 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:51 MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25000
      08:49:51 MainThread: num_known_live_backends has reached value: 3
      08:49:51 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:51 MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25001
      08:49:51 MainThread: num_known_live_backends has reached value: 3
      08:49:52 MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      08:49:52 MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25002
      08:49:52 MainThread: num_known_live_backends has reached value: 3
      08:49:52 MainThread: Impala Cluster Running with 3 nodes (3 coordinators, 3 executors).
      -- 2024-06-18 08:49:52,811 DEBUG    MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
      -- 2024-06-18 08:49:52,811 INFO     MainThread: Getting metric: statestore.live-backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25010
      -- 2024-06-18 08:49:52,814 INFO     MainThread: Metric 'statestore.live-backends' has reached desired value: 4
      -- 2024-06-18 08:49:52,814 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25000
      -- 2024-06-18 08:49:52,816 INFO     MainThread: num_known_live_backends has reached value: 3
      -- 2024-06-18 08:49:52,816 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25001
      -- 2024-06-18 08:49:52,818 INFO     MainThread: num_known_live_backends has reached value: 3
      -- 2024-06-18 08:49:52,818 DEBUG    MainThread: Getting num_known_live_backends from impala-ec2-rhel88-m7g-4xlarge-ondemand-1d18.vpc.cloudera.com:25002
      -- 2024-06-18 08:49:52,820 INFO     MainThread: num_known_live_backends has reached value: 3
      SET client_identifier=authorization/test_ranger.py::TestRanger::()::test_block_metadata_update[protocol:beeswax|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;'exec_single_node_rows_thresh;
      -- connecting to: localhost:21000
      -- 2024-06-18 08:49:52,821 INFO     MainThread: Could not connect to ('::1', 21000, 0, 0)
      Traceback (most recent call last):
        File "/data0/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open
          handle.connect(sockaddr)
        File "/data/jenkins/workspace/impala-asf-master-core-asan-arm/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth
          return getattr(self._sock,name)(*args)
      error: [Errno 111] Connection refused
      -- connecting to localhost:21050 with impyla
      -- 2024-06-18 08:49:52,821 INFO     MainThread: Could not connect to ('::1', 21050, 0, 0)
      Traceback (most recent call last):
        File "/data0/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open
          handle.connect(sockaddr)
        File "/data/jenkins/workspace/impala-asf-master-core-asan-arm/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth
          return getattr(self._sock,name)(*args)
      error: [Errno 111] Connection refused
      -- 2024-06-18 08:49:53,036 INFO     MainThread: Closing active operation
      -- connecting to localhost:28000 with impyla
      -- 2024-06-18 08:49:53,058 INFO     MainThread: Closing active operation
      SET client_identifier=authorization/test_ranger.py::TestRanger::()::test_block_metadata_update[protocol:beeswax|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;'exec_single_node_rows_thresh;
      -- connecting to: localhost:21000
      -- 2024-06-18 08:49:53,061 INFO     MainThread: Could not connect to ('::1', 21000, 0, 0)
      Traceback (most recent call last):
        File "/data0/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open
          handle.connect(sockaddr)
        File "/data/jenkins/workspace/impala-asf-master-core-asan-arm/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth
          return getattr(self._sock,name)(*args)
      error: [Errno 111] Connection refused
      SET client_identifier=authorization/test_ranger.py::TestRanger::()::test_block_metadata_update[protocol:beeswax|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;'exec_single_node_rows_thresh;
      -- connecting to: localhost:21000
      -- 2024-06-18 08:49:53,062 INFO     MainThread: Could not connect to ('::1', 21000, 0, 0)
      Traceback (most recent call last):
        File "/data0/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open
          handle.connect(sockaddr)
        File "/data/jenkins/workspace/impala-asf-master-core-asan-arm/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth
          return getattr(self._sock,name)(*args)
      error: [Errno 111] Connection refused
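
      The tracebacks above only show refusals on the IPv6 loopback address ('::1') for ports 21000 and 21050, while the debug webpage on port 25000 was reached through the host name. To help narrow down whether the coordinator's client ports were listening on any address at that moment, something like the following sketch could be run against the host. It is not part of Impala's test framework: probe_port is a hypothetical helper, the 1-second timeout is an arbitrary choice, and it is written for Python 3 even though the failing run used Python 2.7.

```python
import socket


def probe_port(host, port, timeout=1.0):
    """Try a TCP connect to every address `host` resolves to.

    Returns a list of (sockaddr, error) tuples, where error is None if the
    connection was accepted and the raised OSError otherwise.
    """
    results = []
    # AF_UNSPEC asks for both IPv4 and IPv6 candidates, e.g. 127.0.0.1 and ::1.
    candidates = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in candidates:
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            results.append((sockaddr, None))
        except OSError as e:
            # Covers "Connection refused" (errno 111 on Linux) and timeouts.
            results.append((sockaddr, e))
        finally:
            if sock is not None:
                sock.close()
    return results


if __name__ == "__main__":
    # Ports taken from the log above: 21000 (beeswax) and 21050 (impyla).
    for port in (21000, 21050):
        for sockaddr, err in probe_port("localhost", port):
            status = "open" if err is None else "failed: %s" % err
            print("%s -> %s" % (sockaddr, status))
```

      If every resolved address is refused, the daemon was most likely not yet listening on that port. If only one address family is refused, the problem may lie in how "localhost" resolves on the test host.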
      

            People

              jasonmfehr Jason Fehr
              fangyurao Fang-Yu Rao