HBase / HBASE-3382

Make HBase client work better under concurrent clients

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Performance
    • Labels: None

      Description

      The HBase client uses 1 socket per regionserver for communication. This is good for socket control but potentially bad for latency. How bad? I did a simple YCSB test that had this config:

      readproportion=0
      updateproportion=0
      scanproportion=1
      insertproportion=0

      fieldlength=10
      fieldcount=100

      requestdistribution=zipfian
      scanlength=300
      scanlengthdistribution=zipfian

      I ran this with 1 and 10 threads. The summary is as follows:

      1 thread:
      [SCAN] Operations 1000
      [SCAN] AverageLatency(ms) 35.871

      10 threads:
      [SCAN] Operations 1000
      [SCAN] AverageLatency(ms) 228.576

      We are taking a roughly 6.4x latency hit in our client. But why?

      The first step was to move the deserialization out of the Connection thread; this seemed like it could be a big win, since an analogous change on the server side got a 20% performance improvement (already committed as HBASE-2941). I did this and got about a 20% improvement again, with that 228 ms number dropping to about 190 ms.
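      Below is a minimal, self-contained sketch of that idea (illustrative only, not the HBASE-2941 change or the actual HBaseClient code; the Parser interface and class names are hypothetical): the Connection reader thread only copies the raw response bytes and wakes the waiting caller, and the calling thread does the deserialization itself.

      import java.util.concurrent.ConcurrentHashMap;

      // Minimal sketch (not the actual HBase client code) of moving deserialization
      // off the Connection reader thread: the reader stores raw bytes and notifies;
      // the application thread that issued the RPC does the expensive parse itself.
      public class OffReaderDeserializationSketch {

        // Hypothetical stand-in for the real response parsing (Writable.readFields, etc.).
        interface Parser<T> { T parse(byte[] raw); }

        static final class Call<T> {
          private byte[] raw;            // filled in by the Connection reader thread
          private boolean done = false;

          // Reader thread: store the bytes only; no parsing happens here.
          synchronized void setRawResponse(byte[] bytes) {
            this.raw = bytes;
            this.done = true;
            notifyAll();
          }

          // Caller thread: wait for the bytes, then deserialize on this thread.
          synchronized T get(Parser<T> parser) throws InterruptedException {
            while (!done) {
              wait();
            }
            return parser.parse(raw);
          }
        }

        // Calls in flight, keyed by RPC id, shared between caller and reader threads.
        static final ConcurrentHashMap<Integer, Call<?>> calls =
            new ConcurrentHashMap<Integer, Call<?>>();
      }

      The point of the split is that the single reader thread spends as little time as possible per response, so one expensive deserialization cannot delay every other caller on that connection.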

      So I then wrote a high-performance, nanosecond-resolution tracing utility. Clients can flag an API call, and we get tracing and numbers through the client pipeline. What I found is that a lot of time is being spent in receiving the response from the network. The code block looks like this:

      NanoProfiler.split(id, "receiveResponse");
      if (LOG.isDebugEnabled()) {
        LOG.debug(getName() + " got value #" + id);
      }

      Call call = calls.get(id);

      size -= 4;  // subtract 4 bytes for the id, because we already read it

      ByteBuffer buf = ByteBuffer.allocate(size);
      IOUtils.readFully(in, buf.array(), buf.arrayOffset(), size);
      buf.limit(size);
      buf.rewind();

      NanoProfiler.split(id, "setResponse", "Data size: " + size);

      I came up with some numbers:
      11726 (receiveResponse) split: 64991689 overall: 133562895 Data size: 4288937
      12163 (receiveResponse) split: 32743954 overall: 103787420 Data size: 1606273
      12561 (receiveResponse) split: 3517940 overall: 83346740 Data size: 4
      12136 (receiveResponse) split: 64448701 overall: 203872573 Data size: 3570569

      The first number is the internal call id used to keep requests unique from HTable on down; the split and overall times are in nanoseconds, and the data size is in bytes.
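      For reference, here is a minimal sketch of a split-timer that would produce lines in that format (illustrative only; the real NanoProfiler is in the attached patch and may well differ). It is included just to make the split/overall semantics concrete.

      import java.util.concurrent.ConcurrentHashMap;

      // Sketch of a nanosecond split-timer in the spirit of the NanoProfiler described
      // above. Each call id gets a start time, and split() prints the delta since the
      // previous split plus the overall time since the call was flagged.
      public class NanoProfilerSketch {
        // index 0 = start time of the call, index 1 = time of the last split
        private static final ConcurrentHashMap<Long, long[]> TIMES =
            new ConcurrentHashMap<Long, long[]>();

        // Flag an API call so that subsequent splits are recorded for it.
        public static void start(long id) {
          long now = System.nanoTime();
          TIMES.put(id, new long[] { now, now });
        }

        // Record a named split: prints "<id> (<label>) split: <ns> overall: <ns> <extra>".
        public static void split(long id, String label, String extra) {
          long[] t = TIMES.get(id);
          if (t == null) return;                    // the call was never flagged
          long now = System.nanoTime();
          System.out.println(id + " (" + label + ") split: " + (now - t[1])
              + " overall: " + (now - t[0]) + (extra == null ? "" : " " + extra));
          t[1] = now;
        }

        public static void split(long id, String label) {
          split(id, label, null);
        }
      }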

      Doing some simple calculations, the first line works out to reading at about 31 MB/sec, and the second one is even worse. Other calls look like:

      26 (receiveResponse) split: 7985400 overall: 21546226 Data size: 850429

      which is about 107 MB/sec, pretty close to the maximum of gigabit Ethernet. In my setup, the YCSB client ran on the master node and HAD to use the network to talk to regionservers.
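      As a back-of-the-envelope check of those rates (this arithmetic is not from the issue itself; note that the ~31 MB/sec figure lines up with the overall time of the first trace line, while the ~107 MB/sec figure lines up with the split time of the 850429-byte call):

      // Back-of-the-envelope throughput check for the quoted rates (illustrative only).
      public class ThroughputCheck {
        // bytes over nanoseconds, expressed in MB/sec (1 MB = 1024*1024 bytes here)
        static double mbPerSec(long bytes, long nanos) {
          return (bytes / (1024.0 * 1024.0)) / (nanos / 1e9);
        }

        public static void main(String[] args) {
          // First trace line, using the overall time: about 30.6 MB/sec (~"31 MB/sec")
          System.out.println(mbPerSec(4288937L, 133562895L));
          // The 850429-byte call, using the split time: about 101.6 MB/sec
          // (about 107 MB/sec if decimal megabytes are used)
          System.out.println(mbPerSec(850429L, 7985400L));
        }
      }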

      Even at full line rate, we could still see unacceptable hold ups of unrelated calls that just happen to need to talk to the same regionserver.

      This issue is for tracking these findings and discussing what to do and how to improve.

      Attachments

      1. HBASE-3382.txt
        26 kB
        ryan rawson
      2. HBASE-3382-nio.txt
        14 kB
        ryan rawson


          Activity

          ryan rawson added a comment -

          Doing more testing with HBASE-2939, I ran some tests using YCSB. It was very confusing at first because I wasn't getting the performance boost I was hoping for.

          So with a configuration that does a scan-only load, I am seeing a baseline performance of about 50-60ms for 1 thread. Upping this to 10 threads, the performance gets much worse, up to 400ms or so. Some custom tracing in our client code revealed that the source of the slowness was waiting for other responses to be streamed to the client. That is, thread1 asks for a big fat reply that takes 100ms to read off the wire, and thread2, which made a little-itty-bitty request (close scanner, for example), must wait for that 100ms and is thus unnecessarily slowed down.

          So I tried this patch with ThreadLocal, and while I see improvement, it is not as much as I hoped, with lines like this:
          [SCAN], AverageLatency(ms), 363.44
          [SCAN], AverageLatency(ms), 448.31
          [SCAN], AverageLatency(ms), 426.53

          The data size is small enough and fully cached, and I added logging that verifies that we are CREATING multiple connections (1 per thread it seems).
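          For context, this is roughly what one-connection-per-thread looks like (a simplified sketch, not the actual HBASE-2939 patch; the Connection class here is a hypothetical stand-in for the per-regionserver socket wrapper):

          import java.util.HashMap;
          import java.util.Map;

          // Simplified sketch of per-thread connections: each application thread gets
          // its own Connection to a regionserver, so a big response for one thread no
          // longer blocks a small request from another.
          public class ThreadLocalConnectionsSketch {

            // Hypothetical stand-in for the per-regionserver socket + reader thread.
            static class Connection {
              Connection(String regionServer) { /* open socket, start reader thread */ }
            }

            // Each thread keeps its own regionserver -> Connection map.
            private static final ThreadLocal<Map<String, Connection>> PER_THREAD =
                new ThreadLocal<Map<String, Connection>>() {
                  @Override protected Map<String, Connection> initialValue() {
                    return new HashMap<String, Connection>();
                  }
                };

            // Returns this thread's private connection to the given regionserver,
            // creating it on first use. Cost: one socket and one reader thread per
            // (application thread x regionserver) pair.
            static Connection getConnection(String regionServer) {
              Map<String, Connection> map = PER_THREAD.get();
              Connection c = map.get(regionServer);
              if (c == null) {
                c = new Connection(regionServer);
                map.put(regionServer, c);
              }
              return c;
            }
          }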

          But the "call_wait" profile time (the time spent between sending the request and when the connection code starts to receive our response) is pretty high, in previous tests I saw something like this;
          cat new_1thr.txt | perl -ne 'if (/\(call_wait\) split: (\d+?) /) { print $1/1000000, "\n"; }' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          3.86964071428571

          cat new_10thr.txt | perl -ne 'if (/\(call_wait\) split: (\d+?) /) { print $1/1000000, "\n"; }' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          56.1530285016722

          As you can see going from an average wait time of 3ms to an average wait time of 56ms is pretty hurting!

          But using the work to add ThreadLocal connections, I did not get as much of a boost as I hoped for; instead I saw call_wait times like:

          cat 10_thr.txt | perl -ne 'if (/\(call_wait\) split: (\d+?) /) { print $1/1000000, "\n"; }' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          19.9225164798658

          While 19ms < 56ms, that is still a lot of wait.

          At this point we might also be seeing server-side slowness. I think the next step is to extend the NanoProfiler code into the server side so we can have extensive tracing across both the server and the client.

          This result suggests we are seeing server-side slowness under concurrency, which is plausible, but it is not something I was seeing in previous profiler runs; then again, a lot of performance code has been committed in the meantime.

          stack added a comment -

          Can you get Karthik's patch over in HBASE-2939 working to show improved throughput with multiple sockets? Is there any way of 'proving' contention on the single socket? If bandwidth is already saturated, will multiple sockets make any difference?

          stack added a comment -

          Can you explain this line more @Ryan:

          11726 (receiveResponse) split: 64991689 overall: 133562895 Data size: 4288937
          

          What do the numbers mean? I'm being a bit thick.

          ryan rawson added a comment -

          So it's pretty clear that to improve performance under load, we should be using multiple sockets.

          Here is a rough block diagram of how the client works:

          HTable -- calls --> HConnectionImplementation -- calls --> HBaseRPC.waitForProxy()

          In waitForProxy, an HBaseClient object is grabbed and associated with the proxy via the embedded Invoker object. Let's call this 'client' (as the code does).

          HCI -- calls --> ProxyObject (anonymous) --> client.call()

          Now a few notes:

          • The HCI will reuse the same proxy object a few times, if not a LOT of times.
          • The proxy object has 1 reference to 1 HBaseClient object.
          • The HBaseClient object has 1 socket/connection per Regionserver. Multiple threads will interleave their requests & replies (in any order, out of order replies ok) on the 1 socket.

          So there are a few different approaches. In HBASE-2939, a patch allows every new call to grab a different connection off a pool, with different pool types; this has the disadvantage of needing 1 thread per extra socket to a regionserver. Another solution is to change the Connection object & thread to do async I/O on multiple sockets, allowing 1 thread per regionserver but multiple sockets under it all.

          Another solution is to use an NIO framework to implement this instead of doing raw NIO programming.
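          For concreteness, here is a rough sketch of the pooled-connection approach mentioned above (illustrative only, not the HBASE-2939 code; Connection is again a hypothetical stand-in): each call picks a socket from a small fixed pool, e.g. round-robin, at the price of one reader thread per pooled socket.

          import java.util.concurrent.atomic.AtomicInteger;

          // Rough sketch of a per-regionserver connection pool with round-robin
          // selection: concurrent calls spread across several sockets instead of
          // queuing behind one.
          public class ConnectionPoolSketch {

            // Hypothetical stand-in for a socket plus its reader thread.
            static class Connection {
              Connection(String regionServer) { /* connect, start reader thread */ }
            }

            private final Connection[] pool;
            private final AtomicInteger next = new AtomicInteger();

            ConnectionPoolSketch(String regionServer, int poolSize) {
              pool = new Connection[poolSize];
              for (int i = 0; i < poolSize; i++) {
                pool[i] = new Connection(regionServer);
              }
            }

            // Round-robin: each new call lands on the next socket in the pool.
            Connection nextConnection() {
              int idx = (next.getAndIncrement() & 0x7fffffff) % pool.length;
              return pool[idx];
            }
          }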

          ryan rawson added a comment -

          A non-async, but NIO, port of the HBase client.

          ryan rawson added a comment -

          I did some work to minimally rewrite the HBase client to use NIO instead of OIO: no actual architectural changes, just changes to see if we can improve the performance of straight-line reading of bytes off the network socket.

          I used the aforementioned tracing system to get the time for receiving the response, then used the data log to come up with a few different timings. Here is the semi-raw data:

          nio:

          hadoop@sv4borg235:/homes/hadoop/borg221/hbase$ grep receiveResponse prof_1.txt | perl -n -e '/split: (\d+?) .*size: (\d+)/; if ($2 > 100) { print ((int((1000000000/$1)*$2))/(1024*1024)); print " $2 \n"; }' | sort -n | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          124.836689844131

          hadoop@sv4borg235:/homes/hadoop/borg221/hbase$ grep receiveResponse prof_10.txt | perl -n -e '/split: (\d+?) .*size: (\d+)/; if ($2 > 100) { print ((int((1000000000/$1)*$2))/(1024*1024)); print " $2 \n"; }' | sort -n | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          112.391825942993

          OIO:

          hadoop@sv4borg235:/homes/hadoop/borg221/hbase$ grep receiveResponse new_1thr.txt | perl -n -e '/split: (\d+?) .*size: (\d+)/; if ($2 > 100) { print ((int((1000000000/$1)*$2))/(1024*1024)); print " $2 \n"; }' | sort -n | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          135.158706989288

          hadoop@sv4borg235:/homes/hadoop/borg221/hbase$ grep receiveResponse new_10thr.txt | perl -n -e '/split: (\d+?) .*size: (\d+)/; if ($2 > 100) { print ((int((1000000000/$1)*$2))/(1024*1024)); print " $2 \n"; }' | sort -n | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END { print $sum/$count, "\n" }'
          120.16641916275

          As you can see, the OIO client actually performed a bit better than the NIO client, under the same YCSB config (listed above), cached workload, etc. This is probably due to needing to cycle through a select() call to wait for more data, rather than just letting the OS handle it.
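          To make the suspected overhead concrete, here is an illustrative contrast between the two read styles (a sketch, not the attached patch; it assumes the channel is non-blocking and already registered with the selector for OP_READ):

          import java.io.DataInputStream;
          import java.io.EOFException;
          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.Selector;
          import java.nio.channels.SocketChannel;

          // Illustrative contrast: blocking OIO read versus a non-blocking NIO read
          // loop. The NIO path has to go back through select() whenever the socket
          // buffer runs dry, which is the extra bookkeeping suspected of costing a
          // little straight-line throughput.
          public class ReadStylesSketch {

            // OIO: a single blocking call; the kernel wakes us as data arrives.
            static void readOio(DataInputStream in, byte[] dst) throws IOException {
              in.readFully(dst);
            }

            // NIO: non-blocking reads interleaved with select() until the buffer fills.
            // Assumes 'ch' is non-blocking and registered with 'selector' for OP_READ.
            static void readNio(Selector selector, SocketChannel ch, ByteBuffer dst)
                throws IOException {
              while (dst.hasRemaining()) {
                int n = ch.read(dst);
                if (n < 0) {
                  throw new EOFException("connection closed mid-response");
                }
                if (n == 0) {
                  selector.select();                 // extra pass through the selector
                  selector.selectedKeys().clear();
                }
              }
            }
          }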

          ryan rawson added a comment -

          There are a number of issues going on here, and several solutions for them:

          • During scans we need at least 3 RPCs just to do work. If a lighter-weight RPC (e.g. scanner close) gets stuck behind a big response, that will increase latency.
          • The reads from the network seem really slow. Ping time is 0.1 ms (100 microseconds!). I tried to set a 256k send/recv buffer on both sides but that was ineffective.
          • Even at the theoretical maximum of 110-120 MB/sec for gigabit Ethernet, we would have to wait 41 ms to receive 5 MB in the best case. Without chunking and interleaving of responses, we will be held up behind the big responses.
          • We are using old I/O; it's possible the new I/O APIs are more efficient at reading chunks of data. I am not sure though, and it's hard to find info about this online.

          So there are a few avenues to investigate:

          • reduction of RPC calls for scans
          • NIO instead of OIO for better read efficiency
          • some way of interleaving responses, either via multiple sockets or by chunking on the wire (a rough framing sketch follows this list)
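          One possible shape of the chunking idea, as a sketch (a purely hypothetical framing, not an implemented or proposed wire format): the server writes each response as fixed-size chunks tagged with the call id, so a small response can be interleaved between the chunks of a large one on the same socket.

          import java.io.DataOutputStream;
          import java.io.IOException;

          // Hypothetical framing sketch for interleaving responses on one socket:
          // each frame is [callId, chunkLength, lastFlag, bytes]. The writer only
          // needs to keep individual frames atomic, not whole responses, so a short
          // reply can slip in between the chunks of a big one.
          public class ChunkedResponseWriterSketch {
            static final int CHUNK_SIZE = 64 * 1024;   // illustrative chunk size

            static void writeChunked(DataOutputStream out, int callId, byte[] response)
                throws IOException {
              if (response.length == 0) {              // still emit one (empty) last frame
                out.writeInt(callId);
                out.writeInt(0);
                out.writeBoolean(true);
                return;
              }
              int offset = 0;
              while (offset < response.length) {
                int len = Math.min(CHUNK_SIZE, response.length - offset);
                boolean last = (offset + len == response.length);
                out.writeInt(callId);
                out.writeInt(len);
                out.writeBoolean(last);
                out.write(response, offset, len);
                offset += len;
              }
            }
          }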

            People

            • Assignee: ryan rawson
            • Reporter: ryan rawson
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
                Updated:

                Development