Details
Bug
Status: Open
Major
Resolution: Unresolved
Impala 2.7.0
None
CentOS 2.6.32-642.11.1.el6.x86_64
CDH 5.9
Description
We are experiencing unexplained RPC timeouts when datanodes are on different subnets.
Regardless of the query, we receive the following messages and the query is terminated:
Sender timed out waiting for receiver fragment instance: e64f0518e53c9825:db4db2b000000005
I0215 05:06:54.952647 11814 runtime-state.cc:215] Error from query e64f0518e53c9825:db4db2b000000000:
I0215 05:09:54.281371 11199 status.cc:47] RPC recv timed out: Client nodename-dn02:22000 timed-out during recv call.
@ 0x84db0a (unknown)
@ 0xdec2ca (unknown)
@ 0xde6cac (unknown)
@ 0xde8b9e (unknown)
@ 0xa24c49 (unknown)
@ 0xa26375 (unknown)
@ 0xbf5b09 (unknown)
@ 0xbf64a4 (unknown)
@ 0xe5c7aa (unknown)
@ 0x7f412a4f2aa1 start_thread
@ 0x7f412a23faad clone
I0215 05:09:54.281416 11199 coordinator.cc:1406] ExecPlanRequest rpc query_id=e64f0518e53c9825:db4db2b000000000 instance_id=e64f0518e53c9825:db4db2b000000002 failed: RPC recv timed out: Client ph-hdp-tst-dn02:22000 timed-out during recv call.
I0215 05:09:54.282639 11203 status.cc:47] RPC recv timed out: Client nodename-dn01:22000 timed-out during recv call.
@ 0x84db0a (unknown)
@ 0xdec2ca (unknown)
@ 0xde6cac (unknown)
@ 0xde8b9e (unknown)
@ 0xa24c49 (unknown)
@ 0xa26375 (unknown)
@ 0xbf5b09 (unknown)
@ 0xbf64a4 (unknown)
@ 0xe5c7aa (unknown)
@ 0x7f412a4f2aa1 start_thread
@ 0x7f412a23faad clone
I0215 05:09:54.282663 11203 coordinator.cc:1406] ExecPlanRequest rpc query_id=e64f0518e53c9825:db4db2b000000000 instance_id=e64f0518e53c9825:db4db2b000000003 failed: RPC recv timed out: Client ph-hdp-tst-dn01:22000 timed-out during recv call.
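To see at a glance which backends are timing out, the log lines above can be summarized per host. This is only a minimal sketch: the function name `count_rpc_timeouts` is hypothetical, and the regular expression assumes the message wording stays exactly as it appears in the excerpt above.

```python
import re
from collections import Counter

# Matches the "RPC recv timed out: Client <host>:<port> timed-out" fragment
# as it appears in the log excerpt; other wordings are not handled.
TIMEOUT_RE = re.compile(
    r"RPC recv timed out: Client (?P<host>[\w.-]+):(?P<port>\d+) timed-out"
)


def count_rpc_timeouts(lines):
    """Count RPC recv timeouts per (host, port) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts
```

Feeding it the lines of an impalad log file (for example via `open(path)`) gives a quick picture of whether the timeouts are limited to the nodes in the other subnet.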
There are two hops, in the form of virtual switches, between the two subnets. We have thoroughly tested both the bandwidth and the latency, and both look adequate (bandwidth is about 300 Mbit/s).
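Bandwidth and ping-style latency tests do not necessarily exercise the TCP port that the failing RPCs use, so it may also be worth timing plain TCP connections to the backend port shown in the log (22000) from each node in one subnet to each node in the other. The following is a rough sketch of such a probe, not a full diagnosis; the function name `tcp_connect_times` and its parameters are hypothetical.

```python
import socket
import time


def tcp_connect_times(host, port, attempts=5, timeout=5.0):
    """Time repeated TCP connects to host:port.

    Returns one entry per attempt: elapsed seconds on success,
    or None if the connect failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            results.append(None)
    return results
```

For example, `tcp_connect_times("nodename-dn02", 22000)` run from a node in the other subnet would show whether connects are slow or failing intermittently. A successful connect only confirms the handshake, so it cannot rule out problems that appear once larger payloads are sent.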
We are able to reproduce the issue in both our production and testing environments.
We do not experience it when all nodes are in the same subnet, only when we activate the nodes in the other subnet. However, we need to expand, and the current subnet does not have enough resources.
Has anyone faced a similar issue, or does anyone have any clues as to which direction to investigate?
Thanks and regards,
Stan