Hadoop HDFS
HDFS-2080

Speed up DFS read path by lessening checksum overhead

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.24.0, 0.23.1
    • Component/s: hdfs-client, performance
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I've developed a series of patches that speed up the HDFS read path by a factor of about 2.5x (~300MB/sec to ~800MB/sec for localhost reading from the buffer cache) and will also make it easier for advanced users (eg HBase) to skip a buffer copy.

      Attachments

      1. hdfs-2080.txt
        134 kB
        Todd Lipcon
      2. hdfs-2080.txt
        142 kB
        Todd Lipcon

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Todd Lipcon added a comment -

          The improvements are the following:

          • Simplify the code for BlockReader by no longer inheriting from FSInputChecker. Read entire 64KB packets at a time into a direct byte buffer with a single read() syscall [slight speed improvement]
          • Once the entire 64K buffer is ready, bulk-verify all of the CRCs with a single call (currently there's a small semantic change associated with this, but it could be fixed without hurting performance much if necessary) [15% improvement]
          • Implement the bulk verification of CRCs via JNI [60% improvement]
          • On processors supporting SSE4.2 (eg Nehalem/Westmere) use the crc32c assembly instruction to calculate checksums [~2.5x improvement]
          • There's one more optimization I haven't done here yet to improve the pipelining of the SSE instructions

          Unfortunately the last improvement requires introducing a new Checksum implementation, since the hardware implements the iSCSI polynomial instead of the zlib polynomial. Fortunately we have a header field everywhere we use checksums, so introducing a new polynomial can be done in a backwards-compatible way.
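
          To make the backwards-compatibility point concrete, here is a minimal sketch of dispatching on a checksum-type byte in the header. The type ids and header layout are assumptions for illustration only, not the constants the patch adds to DataChecksum, and java.util.zip.CRC32C only exists in Java 9+, whereas the patch ships its own CRC32C implementations.

          import java.nio.ByteBuffer;
          import java.util.zip.CRC32;
          import java.util.zip.CRC32C;
          import java.util.zip.Checksum;

          class ChecksumHeaderSketch {
            // Hypothetical type ids; the real constants live in DataChecksum.
            static final byte TYPE_CRC32  = 1;  // zlib polynomial
            static final byte TYPE_CRC32C = 2;  // Castagnoli/iSCSI polynomial (what SSE4.2 crc32 computes)

            /** Read a (type, bytesPerChecksum) header and return a matching Checksum. */
            static Checksum fromHeader(ByteBuffer header) {
              byte type = header.get();            // written alongside the data, so old and new readers agree
              int bytesPerChecksum = header.getInt();  // a real implementation would carry this along
              switch (type) {
                case TYPE_CRC32:  return new CRC32();
                case TYPE_CRC32C: return new CRC32C();
                default: throw new IllegalArgumentException("unknown checksum type " + type);
              }
            }
          }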

          With these optimizations, performance is within 15% of non-checksummed reads.
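
          To illustrate the bulk-verification step described in the list above, here is a minimal sketch that checks every chunk of a packet in one pass over a ByteBuffer. It uses the JDK's CRC32C (Java 9+) as a stand-in; the patch itself adds PureJavaCrc32C and native code, and its DataChecksum method has a different signature.

          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.util.zip.CRC32C;   // stand-in for the patch's CRC32C implementations

          class BulkVerifySketch {
            /**
             * data holds N chunks of up to bytesPerChecksum bytes; sums holds one
             * 4-byte CRC per chunk, in the same order.
             */
            static void verifyChunkedSums(ByteBuffer data, ByteBuffer sums,
                                          int bytesPerChecksum) throws IOException {
              CRC32C crc = new CRC32C();
              ByteBuffer d = data.duplicate();     // leave the caller's position alone
              while (d.hasRemaining()) {
                int chunkLen = Math.min(bytesPerChecksum, d.remaining());
                ByteBuffer chunk = d.slice();
                chunk.limit(chunkLen);
                crc.reset();
                crc.update(chunk);                 // no copy when data is a direct buffer
                if ((int) crc.getValue() != sums.getInt()) {
                  throw new IOException("checksum error at offset " + d.position());
                }
                d.position(d.position() + chunkLen);
              }
            }
          }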

          Nathan Roberts added a comment -

          Very nice. And I'll bet CPU usage (CPU_SECONDS/MB) went down significantly as well.

          Kihwal Lee added a comment -

          The checksum code was worked on previously in HADOOP-6148 and HADOOP-6166. Due to implicit data copying when crossing the JNI boundary, it was reimplemented in Java. This work will solve that problem and get us back to faster native code. I imagine there are other places where we could apply the NIO direct buffer + JNI + native code combination.

          In my experiment, zlib was already capable of doing more than most clients can ingest. On a system with 2GHz E5335 (Clovertown) processors running CentOS 5, the CRC32 in zlib could do 2.5 GB/s if everything comes from cache (on a 64KB buffer). So in my opinion, although they look seriously cool, items 4 and 5 can wait.
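
          As a sketch of the "NIO direct buffer + JNI + native code" combination (the class name, library name, and signature below are illustrative assumptions, not the NativeCrc32 API the patch actually adds): the Java side only declares a native method that takes direct ByteBuffers, so the C side can obtain their addresses with GetDirectBufferAddress() and verify in place, avoiding the copy across the JNI boundary that the old array-based code paid for.

          import java.io.IOException;
          import java.nio.ByteBuffer;

          final class NativeChecksumSketch {
            static {
              System.loadLibrary("nativechecksumsketch");   // assumed native library name
            }

            /**
             * Both buffers must be direct. The native implementation reads them via
             * GetDirectBufferAddress() and throws on the first mismatched chunk.
             */
            static native void verifyChunkedSums(int bytesPerChecksum, int checksumType,
                                                 ByteBuffer sums, ByteBuffer data)
                throws IOException;
          }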

          Todd Lipcon added a comment -

          Nathan: yeah, both CPU time and sys time improve with these optimizations.

          Kihwal: using zlib instead of the hardware CRC gives only about a 40% improvement. It's true that a disk won't pump out data at rates approaching 1GB/sec, but Nathan's metric of CPUsecs/MB is still very important, eg on multitenant clusters. Another important case is HBase serving, where the majority of the data being read from HDFS will actually be in the Linux buffer cache. I've benchmarked that 3/4 of the latency of such reads comes from CPU time rather than context switching (try TestHFileSeek from HBase on RawLocalFS vs LocalFS).

          Kihwal Lee added a comment -

          I just found a machine with an E5530 (Gainestown), which also supports SSE 4.2. I will do more experiments there. Putting in 4 and 5 won't do any harm, but I want to make sure the jump from 60% to 2.5x was not because of something else that we could improve for boxes without SSE 4.2 support.

          Todd Lipcon added a comment -

          I'll be pretty surprised if you can approach the performance of SSE4.2 in software. See http://download.intel.com/design/intarch/papers/323405.pdf – we're talking 0.14 cycles/byte fully optimized!

          I'll try to post code this weekend so you can experiment as well.

          Kihwal Lee added a comment -

          I don't doubt the performance of the SSE 4.2 CRC32. I just want to verify that it was the sole contributor of the jump from 60% to 2.5x. If you disable checksum, how much improvement do you get?

          Todd Lipcon added a comment -

          Disabling checksum completely is about 16% faster than the hardware CRC.

          My test to switch between them was to comment out the one line that calls the hardware code and replace it with the one line that calls a fairly fast C function. So there shouldn't be other factors in play that I can think of. Of course more experimentation is quite welcome!

          Kihwal Lee added a comment -

          Sorry, the 2.5 GB/s number is wrong. With zlib 1.2.3 it's 840 MB/s, and with the CRC32 instruction I am getting over 6 GB/s. The theoretical limit is supposed to be 6.4 GB/s on my machine (8 bytes / 3 cycles).

          Todd Lipcon added a comment -

          > The theoretical limit is supposed to be 6.4 GB/s on my machine (8 bytes / 3 cycles)

          With the "naive" implementation it's 8 bytes/3 cycles, but from what I've read the instruction takes 1 cycle with a latency of 3 cycles – ie you can pipeline three at the same time per core if you order the instructions right, so long as there isn't a data dependency between them. That's the pipelining optimization I was referencing above.

          Kihwal Lee added a comment -

          Yes, I read the paper you linked to. Even if it doesn't approach 1 cycle, something better than 3 cycles will be nice.

          Todd Lipcon added a comment -

          Here's a combined patch to demonstrate the speed improvements. It will need to be split up into a few separate changes to be committed. Here's a summary of the changes in this somewhat large patch:

          common/LICENSE.txt | 20 +

          • I borrowed some code from some BSD-licensed projects (hstore and "Slicing-by-8")

          common/bin/hadoop-config.sh | 8 +-

          • Fixes a bug introduced with RPMs where it wouldn't find native code properly from within the build dir.

          common/build.xml | 9 +

          • adds javah for the new NativeCrc32 class

          .../java/org/apache/hadoop/util/DataChecksum.java | 64 +++-

          • Adds new CHECKSUM_CRC32C type for the "CRC32C" polynomial which has hardware support.
          • Adds a copyOf() to create a new DataChecksum given an existing instance.
          • Generalizes some checks for CRC32 to now apply to all size-4 checksums.
          • Adds new verifySums function, basically borrowed from FSInputChecker.java but operating on ByteBuffer instead, and calling out to native code when available

          .../java/org/apache/hadoop/util/NativeCrc32.java | 68 +++

          • Small wrapper around the new native code

          .../org/apache/hadoop/util/PureJavaCrc32C.java | 454 ++++++++++++++++

          • copy of PureJavaCrc32 but for the new polynomial. Identical code but different tables.

          common/src/native/Makefile.am | 6 +-

          • adds new C code for crc32

          .../src/org/apache/hadoop/util/NativeCrc32.c | 149 ++++++

          • implementation of verifySums using the native code

          .../src/native/src/org/apache/hadoop/util/crc32.h | 133 +++++

          • C implementations of "slicing-by-8" for CRC32 and CRC32C

          .../hadoop/util/crc32_zlib_polynomial_tables.h | 552 ++++++++++++++++++++

          .../src/org/apache/hadoop/util/crc32c_tables.h | 313 +++++++++++

          • codegenned tables for the above algorithms

          .../native/src/org/apache/hadoop/util/x86_crc32c.h | 181 +++++++

          • cpu-detection code to determine if SSE4.2 extensions are available
          • implementations using the hardware crc32 operation for 32-bit and 64-bit

          .../org/apache/hadoop/util/TestPureJavaCrc32.java | 14 +-

          • improvements to generate Table for arbitrary polynomials

          hdfs/build.xml | 1 +

          • a change which might be a bad idea: it lets HDFS pick up the native libraries from the built common

          .../java/org/apache/hadoop/hdfs/BlockReader.java | 330 +++++--------

          • rewrites BlockReader to not inherit from FSInputChecker, thus making it much simpler
          • now calls the "bulk verify" method in DataChecksum

          .../org/apache/hadoop/hdfs/DFSOutputStream.java | 5 +-

          • change default checksum to CRC32C

          .../org/apache/hadoop/hdfs/DFSInputStream.java | 8 +-

          .../hadoop/hdfs/server/common/JspHelper.java | 7 +-

          • use IOUtils.readFully instead of the duplicate code from BlockReader

          .../hadoop/hdfs/server/datanode/BlockReceiver.java | 3 +-

          • don't assume checksum is always CRC32 for partial chunk append
          Todd Lipcon added a comment -

          One bug I'm aware of in this patch is that the append code path doesn't currently deal with the case where the writer wants to use a different checksum type than the original block file. We probably need to change the append response so the server tells the client which checksum to use.

          Kihwal Lee added a comment -

          This is awesome! I will review the patch carefully, but I have a couple of questions for now.

          • Did you compare the performance of the "software" version with zlib? Just to make sure we fall back to a better one. If zlib's crc32 doesn't perform significantly better, using what we have will be simpler for supporting different polynomials.
          • I did a bit of experimenting with filling up the pipeline. When there is no data dependency, I get 1.17 cycles/Qword. By dividing the buffer into three chunks, I get about 1.6 - 1.7 cycles/Qword. This is before combining results and processing the remainder. I didn't tweak it too much, so it might be possible to make it a bit better. Although it's not in the patch, I am sure you have played with it. Is there anything you found useful in making this work?
          Todd Lipcon added a comment -

          > Did you compare the performance of the "software" version with zlib?

          zlib's implementation IIRC is the straightforward byte-by-byte algorithm, whereas the "software" implementation here is the "slicing-by-8" algorithm, which generally performs much better. I didn't do a rigorous comparison, though I think I did notice a speedup when I switched from zlib to this implementation.
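
          For reference, the byte-at-a-time, table-driven approach (roughly what a zlib-style CRC does per input byte) looks like the sketch below, shown here for the CRC-32C polynomial; slicing-by-8 instead uses eight tables and consumes eight bytes per table step. This is an editorial illustration, not the PureJavaCrc32C code from the patch.

          class Crc32cByteAtATime {
            private static final int[] TABLE = new int[256];
            static {
              // Reflected CRC-32C (Castagnoli / iSCSI) polynomial.
              for (int i = 0; i < 256; i++) {
                int c = i;
                for (int k = 0; k < 8; k++) {
                  c = (c & 1) != 0 ? (c >>> 1) ^ 0x82F63B78 : c >>> 1;
                }
                TABLE[i] = c;
              }
            }

            /** CRC-32C of b[off..off+len), one table lookup per byte. */
            static int crc32c(byte[] b, int off, int len) {
              int crc = 0xFFFFFFFF;
              for (int i = off; i < off + len; i++) {
                crc = (crc >>> 8) ^ TABLE[(crc ^ b[i]) & 0xFF];
              }
              return ~crc;
            }
          }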

          > Although it's not in the patch, I am sure you have played with it. Is there anything you found useful in making this work?

          I did some hacking here: https://github.com/toddlipcon/cpp-dfsclient/blob/master/test_readblock.cc
          See the read_packet() function and the crc32cHardware64_3parallel(...) code. This code does run faster than the "naive" non-pipelined implementation, though I didn't do a rigorous benchmark here either.

          I figure it would be best to post the patch above before going all-out on optimization.

          A few other notes on the patch:

          • a few unit tests are failing because of bugs in the tests (eg not creating a socket with an associated Channel, or assuming read() will always return the requested size)
          • the use of native byte buffers could cause a leak - we need some kind of pooling/buffer reuse here to avoid the native memory leak

          Sadly this project is "for fun" for me at the moment, so I probably won't be able to circle back for a little while. I will try to post a patch tonight that addresses some of the above bugs, though.

          Todd Lipcon added a comment -

          Updated patch to fix a couple of the issues mentioned above:

          • fix a couple of tests which used new Socket directly instead of the SocketFactory – thus they didn't have associated Channels and BlockReader failed
          • fix blockreader to handle EOF correctly (fixes TestClientBlockVerification)
          • fix TestSeekBug to use readFully where necessary

          The append-related bug still exists, but this patch should be useful enough for people to play around with if interested.

          Kihwal Lee added a comment -

          > See the read_packet() function and the crc32cHardware64_3parallel(...) code. This code does run faster than the "naive" non-pipelined implementation, though I didn't do a rigorous benchmark here either.

          Calling individual asm(crc32q ....) three times in a row is not quite enough to tightly fill the pipeline, because of the extra instructions for copying input/output and calculating addresses. GCC at this point doesn't know the same registers can be used for the next crc32q instructions. I put three crc32q instructions in one asm() and it can do 1.1~1.2 cycles/Qword.

           /* data: base buffer ptr, advanced one qword per iteration
            * len: size of chunk / 8 (number of qwords per chunk)
            * offset: chunk size in bytes (the three chunks are laid out back to back)
            * c1, c2, c3: running checksums, one per chunk
            */
           while (len) {
                   __asm__ __volatile__(
                           "crc32q (%7), %0;\n\t"        /* chunk 0: *data            */
                           "crc32q (%7,%6,1), %1;\n\t"   /* chunk 1: data + offset    */
                           "crc32q (%7,%6,2), %2;\n\t"   /* chunk 2: data + 2*offset  */
                           : "=r"(c1), "=r"(c2), "=r"(c3)
                           /* matching constraints tie the inputs to the outputs so the
                              running crcs accumulate across iterations */
                           : "0"(c1), "1"(c2), "2"(c3), "r"(offset), "r"(data)
                   );
                   data++;
                   len--;
           }
          

          This gives a very tight loop when compiled.

            400780:       f2 48 0f 38 f1 30       crc32q (%rax),%rsi
            400786:       f2 48 0f 38 f1 0c 38    crc32q (%rax,%rdi,1),%rcx
            40078d:       f2 48 0f 38 f1 14 78    crc32q (%rax,%rdi,2),%rdx
            400794:       48 83 c0 08             add    $0x8,%rax
            400798:       49 83 e8 01             sub    $0x1,%r8
            40079c:       75 e2                   jne    400780 <main+0xf0>
          

          Use of the Castagnoli polynomial seems harmless, if not beneficial, in terms of error detection properties. I will look into safer ways of managing the direct buffer.

          Todd Lipcon added a comment -

          I don't think your inline asm is doing quite what you want. I don't know the asm syntax well, but it seems you're calculating the CRC at data[offset], data[offset+1], and data[offset+2]. But those have data dependencies between them. The crc32cHardware64_3parallel code calculates three separate chunks (at 512-byte offsets from each other) pipelined together.

          Kihwal Lee added a comment -

          It's one of the indirect addressing modes. For example, (%rax,%rdi,2) means memory[rax + rdi*2]. So if the chunks in the buffer are back to back, three chunks will get processed, with rdi being the size of a chunk. It can be made to access three independent memory locations with a bit of performance loss.

          Kihwal Lee added a comment -

          Hmm it seems I am suffering from random word omission errors in my brain.

          Todd Lipcon added a comment -

          Ah, I wasn't aware of that x86 addressing mode. My PowerPC background shows itself.

          Todd Lipcon added a comment -

          I've filed subtasks and linked issues for the various optimizations investigated above. Please watch the subtasks for the latest progress.

          Todd Lipcon added a comment -

          Li Pi has been doing some investigation of read performance in HBase, and has seen that RawLocalFileSystem performs over 2x better than the checksummed local filesystem for random reads out of the Linux FS cache. So this JIRA should provide significant speed improvements for HBase.

          Kihwal Lee added a comment -

          Do you know the status of his slab allocator? Is there any chance of it being used for managing the direct byte buffer pool?

          Todd Lipcon added a comment -

          Haven't code reviewed it yet, but it's working within HBase. It could potentially be used for a direct buffer pool, but that seems like more complexity than necessary, since the direct buffers on the read path are all the same size (the 64KB "packet size").
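
          A minimal sketch of the kind of fixed-size pool this suggests, assuming a single buffer size and a bounded queue (the class name and sizes are illustrative; this is not the pooling that was eventually committed):

          import java.nio.ByteBuffer;
          import java.util.concurrent.ArrayBlockingQueue;

          final class PacketBufferPool {
            private static final int PACKET_SIZE = 64 * 1024;        // the 64KB "packet size"
            private final ArrayBlockingQueue<ByteBuffer> pool;

            PacketBufferPool(int maxPooled) {
              pool = new ArrayBlockingQueue<>(maxPooled);
            }

            /** Reuse a pooled buffer if one is available, otherwise allocate a new one. */
            ByteBuffer acquire() {
              ByteBuffer buf = pool.poll();
              return buf != null ? buf : ByteBuffer.allocateDirect(PACKET_SIZE);
            }

            /** Return a buffer to the pool; if the pool is already full it is simply dropped. */
            void release(ByteBuffer buf) {
              buf.clear();
              pool.offer(buf);
            }
          }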

          Kihwal Lee added a comment -

          I agree. We can have something simpler. Are we doing anything in compressor/decompressor? Do we also need something there?

          stack added a comment -

          @Kihwal Do you mean the hbase compressor/decompressor?

          Kihwal Lee added a comment -

          I meant the ones in HDFS. They also use direct byte buffers. I thought they might have a similar issue/solution.

          Kihwal Lee added a comment -

          Well, since it's in common, it might be used by others as well. Does HBase use direct byte buffers for comp/decomp?

          stack added a comment -

          No (not yet at least).

          Todd Lipcon added a comment -

          Well, not directly, which means that when we call the codecs, they end up copying in/out of direct buffers when calling the native functions.

          Todd Lipcon added a comment -

          All of the subtasks have now been completed and committed for 0.23.1 and 0.24. Thanks to those that helped, especially Nathan, Kihwal, Eli, and Nicholas for the many reviews.

          Kihwal Lee added a comment -

          Good work, Todd!


            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
             • Votes:
               0
               Watchers:
               41
