IMPALA-2799

Query hangs if remote impalad hosts are shut down

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: Impala 2.2, Impala 2.3.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Distributed Exec
    • Environment:
      impala version 2.3.0-cdh5.5.1 RELEASE
      Linux 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux (VMware)

      Description

      I tested Impala 2.3 on a 5-host cluster, 3 of which run impalad. Sometimes, when I shut down 2 of the impalad hosts, the query hangs. This situation is rarely seen. By checking the impalad log and TCP connection information (via lsof), I found that after I shut down the 2 remote impalad hosts, the local impalad, i.e. the impalad accepting the query request, dropped its TCP connection to one of the 2 remote impalads but still held a TCP connection to the other, and the query hung. Every time the query hangs, the execution state is 'STARTED', the last event is 'Ready to start remote fragments', and I cannot cancel the query.

      BTW, I modified the default TCP keepalive parameters, setting net.ipv4.tcp_keepalive_time=30, net.ipv4.tcp_keepalive_probes=3 and net.ipv4.tcp_keepalive_intvl=10. This means that if the TCP server becomes unreachable, the keepalive settings should guarantee that the TCP client actively closes the connection after 30 + 3*10 = 60 seconds, but that does not seem to happen.

      The following log lines relate to the hung query.

      I1223 19:15:36.448956 23603 coordinator.cc:315] Exec() query_id=1542be5811b01f41:4624e416aa592b8c
      I1223 19:15:36.449033 23603 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8d
      I1223 19:15:36.449177 23603 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8d
      tuples:
      Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
      I1223 19:15:36.449282 23603 coordinator.cc:391] starting 3 backends for query 1542be5811b01f41:4624e416aa592b8c
      I1223 19:15:36.450311 24554 fragment-mgr.cc:36] ExecPlanFragment() instance_id=1542be5811b01f41:4624e416aa592b8f coord=vm3:22000 backend#=1
      I1223 19:15:36.450402 24554 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8f
      I1223 19:15:36.450562 24554 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8f
      tuples:
      Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
      I1223 19:15:36.700852 21520 plan-fragment-executor.cc:303] Open(): instance_id=1542be5811b01f41:4624e416aa592b8f
      I1223 19:16:15.860250 20878 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.115 Port: 45152>Connection reset by peer
      I1223 19:16:15.860384 20878 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
      I1223 19:16:16.463649 20879 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.114 Port: 49091>Connection reset by peer
      I1223 19:16:16.463825 20879 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
      I1223 19:19:35.522938 22979 status.cc:112] Cancelled from Impala's debug web interface
          @           0x788a33  impala::Status::Status()
          @           0x9e34ea  impala::ImpalaServer::CancelQueryUrlCallback()
          @           0xae1bd1  impala::Webserver::RenderUrlWithTemplate()
          @           0xae2a61  impala::Webserver::BeginRequestCallback()
          @           0xaf2e03  handle_request
          @           0xaf45a7  process_new_connection
          @           0xaf4dd8  worker_thread
          @       0x31aa6079d1  (unknown)
          @       0x31a9ee8b6d  (unknown)
      I1223 19:19:35.522980 22979 impala-server.cc:862] UnregisterQuery(): query_id=1542be5811b01f41:4624e416aa592b8c
      I1223 19:19:35.523000 22979 impala-server.cc:943] Cancel(): query_id=1542be5811b01f41:4624e416aa592b8c
      I1223 19:19:35.575162 22979 status.cc:112] Query not yet running
          @           0x788a33  impala::Status::Status()
          @           0x9ba69f  impala::ImpalaServer::CancelInternal()
          @           0x9c2a17  impala::ImpalaServer::UnregisterQuery()
          @           0x9e3510  impala::ImpalaServer::CancelQueryUrlCallback()
          @           0xae1bd1  impala::Webserver::RenderUrlWithTemplate()
          @           0xae2a61  impala::Webserver::BeginRequestCallback()
          @           0xaf2e03  handle_request
          @           0xaf45a7  process_new_connection
          @           0xaf4dd8  worker_thread
          @       0x31aa6079d1  (unknown)
          @       0x31a9ee8b6d  (unknown)
      


          Activity

          bresso zharui added a comment -

          I added a method called setKeepAliveInterval to the Thrift TSocket class and called it from the CreateClient method, so that a timeout is triggered after TCP keepalive probes are sent. This fixed the issue for me.

          sailesh Sailesh Mukil added a comment -

          Dup of IMPALA-3875


            People

             • Assignee:
               sailesh Sailesh Mukil
             • Reporter:
               bresso zharui
             • Votes:
               0
             • Watchers:
               4
