IMPALA-4950

Frequent "RPC recv timed out" errors when participating nodes are in different subnets


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.7.0
    • Fix Version/s: None
    • Component/s: Backend
    • Environment:
      CentOS 2.6.32-642.11.1.el6.x86_64
      CDH 5.9

      Description

      We are experiencing strange RPC timeouts when datanodes are on different subnets.
      No matter what the query is, we receive the following messages and the queries are terminated:

      Sender timed out waiting for receiver fragment instance: e64f0518e53c9825:db4db2b000000005
      I0215 05:06:54.952647 11814 runtime-state.cc:215] Error from query e64f0518e53c9825:db4db2b000000000:
      I0215 05:09:54.281371 11199 status.cc:47] RPC recv timed out: Client nodename-dn02:22000 timed-out during recv call.
          @           0x84db0a  (unknown)
          @           0xdec2ca  (unknown)
          @           0xde6cac  (unknown)
          @           0xde8b9e  (unknown)
          @           0xa24c49  (unknown)
          @           0xa26375  (unknown)
          @           0xbf5b09  (unknown)
          @           0xbf64a4  (unknown)
          @           0xe5c7aa  (unknown)
          @     0x7f412a4f2aa1  start_thread
          @     0x7f412a23faad  clone
      I0215 05:09:54.281416 11199 coordinator.cc:1406] ExecPlanRequest rpc query_id=e64f0518e53c9825:db4db2b000000000 instance_id=e64f0518e53c9825:db4db2b000000002 failed: RPC recv timed out: Client ph-hdp-tst-dn02:22000 timed-out during recv call.
      I0215 05:09:54.282639 11203 status.cc:47] RPC recv timed out: Client nodename-dn01:22000 timed-out during recv call.
          @           0x84db0a  (unknown)
          @           0xdec2ca  (unknown)
          @           0xde6cac  (unknown)
          @           0xde8b9e  (unknown)
          @           0xa24c49  (unknown)
          @           0xa26375  (unknown)
          @           0xbf5b09  (unknown)
          @           0xbf64a4  (unknown)
          @           0xe5c7aa  (unknown)
          @     0x7f412a4f2aa1  start_thread
          @     0x7f412a23faad  clone
      I0215 05:09:54.282663 11203 coordinator.cc:1406] ExecPlanRequest rpc query_id=e64f0518e53c9825:db4db2b000000000 instance_id=e64f0518e53c9825:db4db2b000000003 failed: RPC recv timed out: Client ph-hdp-tst-dn01:22000 timed-out during recv call.
      
      

      There are 2 hops, in the form of virtual switches, between the two subnets. We have thoroughly tested both the bandwidth and the latency, and both look adequate (bandwidth is about 300 Mbit/s).

      We are able to replicate the issue both on our production and testing environments.
      We do not experience it when all nodes are in the same subnet, only when we activate the nodes in the other subnet. However, we need to expand, and the current subnet does not have enough resources.
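
      To complement the bandwidth and latency tests, a check along the following lines could be run from each node against the backend port (22000) that appears in the log above. This is only a minimal sketch: the function names, the number of attempts, and the timeout are illustrative assumptions, and it measures plain TCP connection setup time, not the Impala RPC layer itself.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return None


def probe(targets, attempts=5, timeout=5.0):
    """Connect to every (host, port) pair several times and summarize the results."""
    results = {}
    for host, port in targets:
        samples = [tcp_connect_latency(host, port, timeout) for _ in range(attempts)]
        ok = [s for s in samples if s is not None]
        results[(host, port)] = {
            "failures": attempts - len(ok),
            "avg_s": sum(ok) / len(ok) if ok else None,
            "max_s": max(ok) if ok else None,
        }
    return results


# Example usage (host names taken from the log excerpt above):
# for target, stats in probe([("nodename-dn01", 22000), ("nodename-dn02", 22000)]).items():
#     print(target, stats)
```

      Comparing the failure count and the maximum connect time between same-subnet and cross-subnet pairs might show whether the problem is already visible at the TCP level.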

      Has anyone faced a similar issue, or does anyone have a clue about which direction to investigate?
      Thanks and regards,
      Stan

        Attachments

        1. smokeping.JPG
          67 kB
          Stan
        2. smokeping2.JPG
          67 kB
          Stan

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              Blagoev_impala_1121 Stan
            • Votes:
              1
              Watchers:
              6
