HBASE-11306

Client connection starvation issues under high load on Amazon EC2

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Amazon EC2

      Description

      I am using YCSB 0.1.4 with Hadoop 2.2.0 and HBase 0.98.3 RC2 on an EC2 testbed (c3.8xlarge instances, SSD backed, 10 GigE networking). There are five slaves and five separate clients. I start with a prepopulated table of 100M rows over ~20 regions and run 5 YCSB clients concurrently, running workload A and targeting 250,000 ops/sec in aggregate. (I can reproduce this, less effectively, at 100K ops/sec aggregate as well.) Due to how I set up the test, the data is all in one HFile per region and very likely in cache, and all writes fit in the aggregate memstore. No flushes or compactions are observed on any server during the test, only the occasional log roll.

      Despite these favorable conditions, developed over time to isolate this issue, a few of the clients will stop making progress until socket timeouts fire after 60 seconds, leading to very large op latency outliers. With the above detail plus some added extra logging we can rule out storage layer effects. Turning to the network is where things get interesting.

      I used the following loop (8120 is the configured RS data port) to watch receive and send socket queues and TCP-level timers on all of the clients and servers simultaneously during the run:

      while true ; do clear ; ss -a -o | grep ESTAB | grep 8120 ; sleep 5 ; done

      I have Nagle disabled on the clients and servers and JVM networking set up to use IPv4 only. The YCSB clients are configured to use 20 threads. These threads are expected to share 5 active connections, one to each RegionServer. When the test starts we see exactly what we'd expect: 5 established TCPv4 connections.
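      As a sketch of that usage pattern (0.98-era client API; the table name and the literal thread count here are only illustrative, not taken from YCSB itself): each worker thread builds its own HTable from one shared Configuration, and all of those HTables share a single managed RPC connection per RegionServer underneath.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;

      public class SharedConnectionWorkers {
          public static void main(String[] args) throws Exception {
              // One Configuration shared by all worker threads.
              final Configuration conf = HBaseConfiguration.create();
              for (int i = 0; i < 20; i++) {
                  new Thread(new Runnable() {
                      public void run() {
                          try {
                              // Each thread has its own HTable instance, but HTables built
                              // from the same Configuration share the managed connection,
                              // i.e. one multiplexed RPC socket per RegionServer.
                              HTable table = new HTable(conf, "usertable");
                              // ... issue reads/updates here ...
                              table.close();
                          } catch (Exception e) {
                              e.printStackTrace();
                          }
                      }
                  }).start();
              }
          }
      }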

      On all servers the recv and send queues were usually empty when sampled; I never saw more than 10K waiting. The servers occasionally retransmitted, but with timers around 200ms and retry counts around 0.

      The client side is another story. We see serious problems like:

      tcp    ESTAB      0      8733   10.220.15.45:41428   10.220.2.115:8120     timer:(on,38sec,7)
      

      That is about 9K of data still waiting to be sent after 7 TCP-level retransmissions.

      There is some unfair queueing and packet drops happening at the network level, but we should be handling this better.

      During the periods when YCSB is not making progress, there is only that one connection to one RS in the established state. There should be 5 established connections, one to each RS, but the other 4 have been dropped somehow. The one distressed connection remains established for the duration of the problem, while the retransmission timer count on the connection ticks upward. It is dropped once the socket times out at the app level.

      Why are the connections to the other RegionServers dropped? Why are all threads blocked waiting on the one connection for the socket timeout interval (60 seconds)? After the socket timeout we see the stuck connection dropped and 5 new connections immediately established. YCSB doesn't do anything that would lead to this behavior; it uses separate HTable instances for each client thread and does not close the table references until test cleanup. These behaviors seem internal to the HBase client.

      Is maintaining only a single multiplexed connection to each RegionServer the best approach?

      A related issue is that we collect zombie sockets in the ESTABLISHED state on the server. This is also likely not our fault per se. Keepalives are enabled, so these sockets will eventually be garbage collected by the OS, but on Linux that takes 2 hours by default. We might want to drop connections where we don't see activity sooner than that. Before HBASE-11277 we were spinning indefinitely on a core for each connection in this state.
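      For reference, a minimal sketch (plain java.net API, not the actual RpcClient code) of the client socket options in play here: Nagle off, TCP keepalive on, and a 60-second read timeout so the application gives up long before the OS keepalive does.

      import java.net.InetSocketAddress;
      import java.net.Socket;

      public class ClientSocketSetup {
          public static Socket connect(String host, int port) throws Exception {
              Socket socket = new Socket();
              socket.setTcpNoDelay(true);   // Nagle disabled, as in this test setup
              socket.setKeepAlive(true);    // OS keepalive; ~2 hours by default on Linux
              socket.setSoTimeout(60000);   // reads time out after 60s instead of hanging
              socket.connect(new InetSocketAddress(host, port), 10000);
              return socket;
          }
      }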

      I have tried this with a narrow range of recent Java 7 and Java 8 runtimes and they all produce the same results. I have also launched several separate EC2-based test clusters and they all produce the same results, so this appears to be a generic platform issue.

        Activity

        Qiang Tian added a comment -

        Thanks Andrew Purtell! I missed some important info in HBASE-11277.

        As you mentioned, in your test the 10,000 limit hit more exceptions than the 100,000 limit (~20 per server vs 2-3 per server), so is the problem just network congestion when the client sends the RPC request?

        ...The handler does not come back to process the connection again unless it is selectable for reads again.

        Yeah, with your test this is the best. Cool!

        ...Not sure why the client gets stuck with only one stalled connection.

        Based on what we know so far, I guess it is caused by the shared connection on the client side.
        The HBASE-11277 fix relieves the RPC server reader, but on the client side the IPCUtil.write call in RpcClient.Connection#writeRequest (0.98.3 RC2) is blocking IO, right? When multiple threads issue RPCs over the same connection concurrently and one thread gets stuck there, will all of them be blocked? The socket output stream has a 60-second timeout if the socket has an associated SocketChannel (but I do not see where we open the channel?), so after that timeout all the threads should be released? Using different connections in different threads might fix it.
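        A simplified sketch of that concern (hypothetical class, not the actual RpcClient code): every caller funnels through one lock around the shared connection's output stream, so a single stalled write holds up all of them.

        import java.io.IOException;
        import java.io.OutputStream;

        public class SharedRpcConnection {
            // One output stream per RegionServer connection, shared by all threads.
            private final OutputStream out;
            private final Object writeLock = new Object();

            public SharedRpcConnection(OutputStream out) {
                this.out = out;
            }

            public void writeRequest(byte[] request) throws IOException {
                synchronized (writeLock) {
                    // If the TCP send queue backs up and the peer stops ACKing, this
                    // blocking write stalls; every other thread then queues behind the
                    // lock until the socket-level timeout finally fires.
                    out.write(request);
                    out.flush();
                }
            }
        }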

        ...Not sure why it drops the connections to the 4 other RegionServers.

        still not sure about this...

        Andrew Purtell added a comment -

        This patch effectively reverts the change in HBASE-11277 by going around the loop again if count == 0. That would not be correct. After HBASE-11277 we don't loop around and around after zero length reads. Instead we return from the loop. See the comments on HBASE-11277. I also used an approach where we would count the number of times around the loop after zero length reads, but the patch we decided to commit was a simpler change: The handler does not come back to process the connection again unless it is selectable for reads again.

        Qiang Tian added a comment -

        Hi stack, Andrew Purtell,
        could you take a look at the patch? Hopefully we could see where it gets stuck.
        Based on my limited reading of the client and server RPC code, it is excellent but complex, which makes a problem like this hard to analyze and locate. Using different connections for different YCSB threads could help isolate it, since the operations on a client connection are serialized.
        Thanks.

        Qiang Tian added a comment -

        ...Do we know for sure how the above scenario comes about?

        We do not know how at the moment. We are stuck reading the RPC request data or the connection header.

        ... It is connection from client kept-alive?

        Looks like it is not; there is code to handle that, see PING_CALL_ID.

        ...OS is keeping it alive for two hours?

        It looks like the default keepalive timeout is 2 hours on most systems, but the HBase client socket has a much smaller timeout, which will be triggered first.

        ...Have you done any debug

        Mostly code analysis based on the two occurrences we have seen on the mailing list. I thought it would be hard to reproduce; now that Andrew can reproduce it, I will see if I can as well (I know nothing about YCSB and EC2 at the moment).

        ...Not sure why the client gets stuck with only one stalled connection.

        I suspect it is related to the original HBASE-11277 issue. It looks like the 11277 fix leaves the bad connection there; at that point the client is probably waiting for an RPC response and might not send data again, and the server-side cleanupConnections will not clean up the connection either.

        I read more of the RPC client code. It looks like saying the client shares a single connection per RegionServer is not accurate: a connection is identified by a 3-tuple of user, RPC method name, and server address (see getConnection and tracedWriteRequest), so different method calls use different connections (although the same RpcClient instance and the same HConnectionImplementation instance). Please correct me if I am wrong; I will dig more.

        Debug ideas:
        1) From the client concurrency aspect, could we try using different connections for different threads? Create new config objects in each YCSB thread, as Jonathan mentioned in a recent thread (http://search-hadoop.com/m/DHED4zrOq61/HBase+with+multiple+threads&subj=+Discuss+HBase+with+multiple+threads); see the sketch after this list.

        2) From the server side, I plan to add the timeout (hopefully tomorrow). We can dump diagnostics (such as the phase we are in and some members of the Connection class) when the read count is 0 and the timeout fires.

        3) Another suspect from searching today: Reader#readSelector. Should it be volatile? Is it possible that we read stale data and fall into the read incorrectly?
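        A minimal sketch of idea 1) above (0.98 client API; the table name is illustrative): each worker thread builds its own unmanaged connection with HConnectionManager.createConnection, so the threads stop sharing one RPC connection per RegionServer.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HConnection;
        import org.apache.hadoop.hbase.client.HConnectionManager;
        import org.apache.hadoop.hbase.client.HTableInterface;

        public class PerThreadConnectionWorker implements Runnable {
            public void run() {
                try {
                    // Fresh config and unmanaged connection for this thread only.
                    Configuration conf = HBaseConfiguration.create();
                    HConnection connection = HConnectionManager.createConnection(conf);
                    HTableInterface table = connection.getTable("usertable");
                    // ... run this thread's share of the workload ...
                    table.close();
                    connection.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }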

        Andrew Purtell added a comment -

        Not sure why the client gets stuck with only one stalled connection. Not sure why it drops the connections to the 4 other RegionServers. I'm sure I can reproduce this again if you have suggestions for debugging what's going on there further.

        Andrew Purtell added a comment -

        The xennet issue is leaving the TCP FSM in a bad state. On the server this manifests as established connections with no data pending where the client has gone away. They will sit there consuming a file table slot. The TCP keepalive probe will eventually detect that the remote has gone away, but on Linux at least the probe interval defaults to 2 hours. An application keepalive, i.e. an idle ping, might catch it sooner. I'm on the phone now so can't check code. Do we still have idle pings in our RPC?

        stack added a comment -

        Is maintaining only a single multiplexed connection to each RegionServer the best approach?

        The argument runs something like: one shared connection scales better. In the issues I could dig up, it's reported that a connection per HTable instance has better throughput.

        ...worth looking at if the client can get completely stuck on one stalled connection in other situations I think.

        Yeah, what is that about? Stuck trying to do a lookup on meta and all other threads blocked waiting till this resolves?

        ...when the read count is zero, which is not a normal case for RPC data on a non-blocking socket, and close the connection after the timeout.

        Do we know for sure how the above scenario comes about? It is connection from client kept-alive? No data being sent. OS is keeping it alive for two hours? Have you done any debug in here Qiang Tian?

        so first time cleanup is better?

        Your argument is that we are not doing cleanup of clients that have not sent data in a good while? Thanks for looking in here Qiang.

        Qiang Tian added a comment -

        It looks like RPC server connection cleanup (cleanupConnections) is quite strict (does the fix for HBASE-11277 fall into this?):
        1) The connection-count threshold for cleanup is 4000 by default.
        2) It does a random range scan of the connection list in most cases.
        3) rpcCount must be 0 (in our case we are at the 3rd channelRead call, so rpcCount is not 0).

        so first time cleanup is better?
        thanks.

        Qiang Tian added a comment - edited

        Hi Andrew Purtell,
        we have also seen users report this issue.
        A recent case on the user mailing list, which I am looking at: http://mail-archives.apache.org/mod_mbox/hbase-user/201406.mbox/%3C001001cf863e$c81a60e0$584f22a0$@com%3E

        another case:
        http://mail-archives.apache.org/mod_mbox/hbase-dev/201402.mbox/%3CCAHkeaHXAaJFyC6XNV+cBwupqAKoTmkJ349CPQuk02FGS8qcF1w@mail.gmail.com%3E

        PS: this is really interesting... I was just about to propose a fix, and while changing the code I saw it had already been changed (HBASE-11277), which is how I got here.

        The 11277 fix returns directly; what happens next?
        One reader can be shared by many connections, and it looks like readAndProcess is designed to service a complete RPC request (e.g. after the processOneRpc call the data buffer is nulled). So my proposal is to start a timeout (with a user-configurable value) when the read count is zero, which is not a normal case for RPC data on a non-blocking socket, and close the connection after the timeout. This could serve as an HBase server-side keep-alive (as you mentioned, the OS-level keep-alive time is 2 hours, and from what I can find it cannot be changed from Java, so it is effectively useless here).
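        A rough sketch of that proposal (illustrative names, not the actual RpcServer code): remember when a connection last made read progress and close it once it has been idle past a configurable timeout.

        import java.io.IOException;
        import java.nio.channels.SocketChannel;
        import java.util.concurrent.ConcurrentHashMap;

        public class IdleReadMonitor {
            private final long timeoutMs;
            private final ConcurrentHashMap<SocketChannel, Long> lastProgress =
                new ConcurrentHashMap<SocketChannel, Long>();

            public IdleReadMonitor(long timeoutMs) {
                this.timeoutMs = timeoutMs;
            }

            /** Call after every read attempt; count > 0 means the peer sent data. */
            public void onRead(SocketChannel channel, int count) throws IOException {
                long now = System.currentTimeMillis();
                Long last = lastProgress.get(channel);
                if (count > 0 || last == null) {
                    lastProgress.put(channel, now);   // made progress, reset the clock
                } else if (now - last > timeoutMs) {
                    lastProgress.remove(channel);     // idle too long: drop the connection
                    channel.close();
                }
            }
        }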

        I was thinking the issue was caused by the network layer and had not considered the HBase client side. I will look at the client code.

        Update on Nagle: it looks like neither the IPC client nor the IPC server follows a 1-read/1-write pattern, so it looks like Nagle is needed in that case.
        cc stack, Ted Yu

        thanks.

        Andrew Purtell added a comment -

        Disabling offload prevents connections from getting into a bad state, yes. Worth looking at if the client can get completely stuck on one stalled connection in other situations I think.

        stack added a comment -

        Something about my setup manages to trigger a xenfront bug, that's for sure

        So, disabling TCP offload fixed the starvation, Andrew Purtell?

        Andrew Purtell added a comment -

        Kernel is "Linux 3.10.40-50.136.amzn1.x86_64 #1 SMP Tue May 13 21:35:08 UTC 2014"

        Andrew Purtell added a comment -

        Something about my setup manages to trigger a xenfront bug, that's for sure

        Vladimir Rodionov added a comment -

        Sorry, I have not run workload A, so maybe this is workload-specific?

        Vladimir Rodionov added a comment -

        I have never run 250K ops/sec on EC2, but 100K has always been successful (5 RS, 5 YCSB clients with a target of 20K each); that was mostly on 0.94.x. I have done just a few 0.96 test runs (all of them successful). This may be a 0.98 issue. I am going to run 0.98 tests on EC2 and will try 200-250K insert rates.

        Andrew Purtell added a comment -

        Server side dmesg is clean but clients show some stress:

        [ 3898.001528] xennet: skb rides the rocket: 19 slots
        [ 3898.005126] xennet: skb rides the rocket: 22 slots
        [ 3951.460940] xennet: skb rides the rocket: 19 slots
        [ 3951.786580] xennet: skb rides the rocket: 20 slots
        [ 4004.851271] xennet: skb rides the rocket: 20 slots
        [ 4005.643567] xennet: skb rides the rocket: 29 slots
        [ 4059.344717] xennet: skb rides the rocket: 23 slots
        ...
        

        Some Googling turns up reported Linux issues on EC2 where this leaves the connection in a stalled state until timeout. Disabling TCP offload functionality using

        ethtool -K eth0 rx off tx off sg off tso off ufo off gso off gro off lro off
        

        will work around the problem.

        Despite this I've opened this issue for (further) discussion.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Andrew Purtell
          • Votes:
            0
          • Watchers:
            10
