Hadoop Common / HADOOP-11660

Add support for hardware crc of HDFS checksums on ARM aarch64 architecture

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: native
    • Environment: ARM aarch64 development platform
    • Release Note: Add support for aarch64 CRC instructions

      Description

      This patch adds support for hardware CRC for ARM's new 64-bit architecture.

      The patch is completely conditionalized on __aarch64__.

      I have only added support for the non-pipelined version, as I benchmarked the pipelined version on aarch64 and it showed no performance improvement.

      The aarch64 version supports both Castagnoli and Zlib CRCs, as both are supported by ARM aarch64 hardware.
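
      For illustration, a minimal sketch of the kind of hardware CRC32C update loop this implies on aarch64 (not the patch itself; the function name and structure are assumptions, and it needs -march=armv8-a+crc):

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Sketch only: consume 8 bytes per crc32cx instruction, then a byte
         tail. The zlib-polynomial variant is identical except that it uses
         crc32x/crc32b. */
      static uint32_t crc32c_hw_sketch(uint32_t crc, const uint8_t *data, size_t length)
      {
        while (length >= 8) {
          uint64_t chunk;
          memcpy(&chunk, data, 8);               /* safe unaligned load */
          __asm__("crc32cx %w[c], %w[c], %x[v]"
                  : [c] "+r" (crc)
                  : [v] "r" (chunk));
          data += 8;
          length -= 8;
        }
        while (length--) {                        /* remaining tail bytes */
          __asm__("crc32cb %w[c], %w[c], %w[v]"
                  : [c] "+r" (crc)
                  : [v] "r" ((uint32_t)*data++));
        }
        return crc;
      }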

      To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times.
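
      A hedged sketch of such a timing harness (bulk_crc_stub is a hypothetical stand-in for the routine under test, not the real bulk_crc32 API):

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      extern void bulk_crc_stub(const uint8_t *data, size_t len,
                                uint32_t *sums, int bytes_per_checksum);

      int main(void)
      {
        const size_t len = 1024 * 1024;          /* 1MB dataset */
        const int bytes_per_checksum = 512, iters = 1000;
        uint8_t *data = calloc(len, 1);
        uint32_t *sums = calloc(len / bytes_per_checksum, sizeof(uint32_t));
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
          bulk_crc_stub(data, len, sums, bytes_per_checksum);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("CRC %zu bytes @ %d bytes per checksum X %d iterations = %.2f\n",
               len, bytes_per_checksum, iters,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        free(data);
        free(sums);
        return 0;
      }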

      Before:

      CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
      CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55

      After:

      CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
      CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57

      So this represents a roughly 4.5x performance improvement on raw CRC calculation.

      Attachments: jira-11660.patch (49 kB), Edward Nevill

        Activity

        Colin P. McCabe added a comment -

        Hi Edward Nevill, I don't see the patch? Also, it seems like this should be in branch-2 as well as trunk.

        Edward Nevill added a comment -

        Hi Colin,

        The lab went down just as I was preparing the patch. I'll attach it as soon
        as the lab is back up.

        All the best,
        Ed.

        Edward Nevill added a comment -

        This patch adds support for hw CRC instructions on aarch64

        Colin P. McCabe added a comment -
        -#if (!defined(__FreeBSD__) && !defined(WINDOWS))
        +#if (!defined(__FreeBSD__) && !defined(WINDOWS)) && !defined(__aarch64__)
        

        I don't understand the logic here, can you explain? What does pipelining have to do with whether we are on ARM? At minimum this needs a comment.

               crc_update_func = crc32_zlib_sb8;
        +#ifdef __aarch64__
        +      if (likely(cached_cpu_supports_crc32))
        +        crc_update_func = crc32_zlib_hardware;
        +#endif
        

        Can you just change the definition of crc32_zlib_sb8 when __aarch64__ is defined? We should minimize the number of skid marks in the code.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12702205/jira-11660.patch
        against trunk revision 4228de9.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//console

        This message is automatically generated.

        Edward Nevill added a comment -

        The USE_PIPELINED code optimises CRC on x86 because x86 has 3 CRC units
        and can therefore essentially calculate 3 CRCs in parallel. The x86 loop
        does:

        while (likely(counter)) {
          __asm__ __volatile__(
            "crc32q (%7), %0;\n\t"
            "crc32q (%7,%6,1), %1;\n\t"
            "crc32q (%7,%6,2), %2;\n\t"
            : "=r"(c1), "=r"(c2), "=r"(c3)
            : "0"(c1), "1"(c2), "2"(c3), "r"(block_size), "r"(data)
          );
          data++;
          counter--;
        }

        So, the 2nd crc32q executes in the shadow of the 1st (i.e. it executes in
        what would be a result delay from the 1st crc32q in any case). Similarly
        the 3rd crc32q executes in the shadow of the 1st and 2nd.

        On aarch64 a crc32 takes 3 execution cycles, but there is only one CRC
        unit, so if we did the same on aarch64 and had, say:

        crc32 w0, w0, x3
        crc32 w1, w1, x4
        crc32 w2, w2, x5

        The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete
        and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so
        there is no benefit to pipelining.

        I did some tests with pipelining on aarch64 and it worked out marginally
        slower than the non pipelined version.

        Note that the pipelined_crc32 is within an #ifdef x86 section of code, so
        from that point of view it is architecture-specific, and if I enabled it I
        would then have to redundantly implement the pipelined version on aarch64.
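
        To make the interleaving idea concrete, a sketch of the three-lane loop
        under discussion, written with the ACLE intrinsic __crc32cd from
        arm_acle.h (requires -march=armv8-a+crc; the helper names are
        illustrative, not from the patch):

        #include <arm_acle.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        static inline uint64_t load64(const uint8_t *p)
        {
          uint64_t v;
          memcpy(&v, p, 8);
          return v;
        }

        /* Three independent CRC chains, one 8-byte word each per step.
           With enough CRC issue bandwidth they overlap; with a single
           3-cycle CRC unit they serialize, as described above. */
        static void crc3_step(uint32_t *c1, uint32_t *c2, uint32_t *c3,
                              const uint8_t *p, size_t block_size)
        {
          *c1 = __crc32cd(*c1, load64(p));
          *c2 = __crc32cd(*c2, load64(p + block_size));
          *c3 = __crc32cd(*c3, load64(p + 2 * block_size));
        }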

        No, because not all aarch64 HW has the CRC instructions. So we need all
        three:

        crc32c_hardware - called to do a Castagnoli CRC on a CPU with HW support
        crc32_zlib_hardware - called to do a zlib CRC on a CPU with HW support
        crc32_zlib_sb8 - called to do a CRC on a CPU without HW support

        The problem is that the existing code assumes that HW support means only HW
        support for the Castagnoli CRC.

        The way to get around the #ifdef __aarch64__ is to add support for zlib
        CRC on all architectures. This would mean adding code to the x86 side to
        say that zlib CRC is not available.

        So what this would involve is:

        • On the x86 side, add a dummy definition of crc32_zlib_hardware which will
          never get called but is just there to satisfy the reference. It would have
          an assert to ensure it is never called. This is much like the existing
          dummy crc32c_hardware at the very end.

        static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t length)
        {
          // never called!
          assert(0 && "hardware crc called on an unsupported platform");
          return 0;
        }
        • Add an additional variable cached_cpu_supports_crc32_zlib to complement
          the existing cached_cpu_supports_crc32 variable.
        • In the init section for x86

        void __attribute__ ((constructor)) init_cpu_support_flag(void)
        {
          uint32_t ecx = cpuid(CPUID_FEATURES);
          cached_cpu_supports_crc32 = ecx & SSE42_FEATURE_BIT;
        }

        add an initialisation

        cached_cpu_supports_crc32_zlib = 0;

        (alternatively this could be left out, relying on the static initialisation
        to 0).

        • In the init section for aarch64

        void __attribute__ ((constructor)) init_cpu_support_flag(void)
        {
          unsigned long auxv = getauxval(AT_HWCAP);
          cached_cpu_supports_crc32 = auxv & HWCAP_CRC32;
        }

        add an initialisation

        cached_cpu_supports_crc32_zlib = cached_cpu_supports_crc32;

        (if CRC is supported on aarch64 both variants must be supported).

        • Then instead of

        case CRC32_ZLIB_POLYNOMIAL:
        crc_update_func = crc32_zlib_sb8;
        #ifdef __aarch64__
        if (likely(cached_cpu_supports_crc32))
        crc_update_func = crc32_zlib_hardware;
        #endif

        we do

        case CRC32_ZLIB_POLYNOMIAL:
        crc_update_func = crc32_zlib_sb8;
        if (likely(cached_cpu_supports_crc32_zlib))
        crc_update_func = crc32_zlib_hardware;

        This would have the advantage of providing a generic framework for
        architectures to add HW support for one, both, or neither CRC without
        additional #ifdefs.

        So this would get rid of 2 #ifdef __aarch64__ statements, leaving just the
        #ifdef around USE_PIPELINED, and I do feel that adding a significant
        amount of code to the aarch64 side to implement a pipelined version which
        offers no performance advantage would be undesirable.

        But I'm happy to do whichever you feel is best. Just let me know.

        All the best,
        Ed.

        Allen Wittenauer added a comment -

        Please stop setting the fix version. That is set at commit time to reflect which version this change was committed to.

        Thanks.

        Edward Nevill added a comment -

        Hi,

        I did some additional benchmarking of pipelining on aarch64 and it would appear that I was mistaken. AArch64 is in fact capable of pipelining. The reason I was not seeing any improvement was that I chose too large a buffer size. In my test I was doing a CRC of 3 x 1MB buffers x 5000 times. The pipelined version shows worse performance because it works on all three 1MB buffers in parallel, whereas the non-pipelined version processes the 1MB buffers individually, which is more cache efficient.

        I reduced the buffer size to 16KB and increased the number of iterations from 5000 to 500000. This generated the following results:

        NON PIPELINED crc1 = 783797200, crc2 = 610683550, crc3 = -1644088667
        time = 6.16
        PIPELINED crc1 = -2031343782, crc2 = -2043588942, crc3 = 554161471
        time = 4.61
        

        I then replaced the CRC instruction with an ADD instruction (which always completes in 1 cycle) and got the following result:

        NON PIPELINED crc1 = -1928994468, crc2 = -1747836272, crc3 = -674545616
        time = 4.13
        PIPELINED crc1 = -2096826240, crc2 = -334553600, crc3 = 1911147008
        time = 4.19
        

        This clearly shows that the CRC is pipelined, because it is effectively able to complete each CRC in a single cycle (the pipelined version gets the same performance as using ADD).
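
        A sketch of that control experiment (illustrative, not the exact test
        code): the CRC ops are swapped for single-cycle ADDs, so any remaining
        time is pure loop overhead.

        #include <stdint.h>

        /* If the pipelined CRC loop matches this loop's time, the three
           crc32 instructions must be overlapping completely. */
        static void add3_step(uint32_t *c1, uint32_t *c2, uint32_t *c3,
                              uint32_t v1, uint32_t v2, uint32_t v3)
        {
          __asm__("add %w0, %w0, %w3\n\t"
                  "add %w1, %w1, %w4\n\t"
                  "add %w2, %w2, %w5"
                  : "+r"(*c1), "+r"(*c2), "+r"(*c3)
                  : "r"(v1), "r"(v2), "r"(v3));
        }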

        My bad. I will submit another patch within the next couple of days which includes pipelining. Thanks for your patience!

        Ed.

        Edward Nevill added a comment -

        Patch to be replaced with a version which does pipelining.

        Colin P. McCabe added a comment -

        OK. Thanks, Edward.

        Edward Nevill added a comment -

        This revised patch adds support for hw crc on aarch64 with pipelining.

        I have also added support for the zlib polynomial in HW, and done so in a generic way so that it can easily be added for other architectures without the use of #ifdef.

        There is now a single #elif __aarch64__ which surrounds all the aarch64 assembler code.

        The following are the results I get on generic A57 aarch64 HW:

        -bash-4.2$ ./test_bulk_crc32_orig 
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 3.98
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 3.97
        ./test_bulk_crc32_orig: SUCCESS.
        -bash-4.2$ ./test_bulk_crc32_nopipe 
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 0.87
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 0.87
        ./test_bulk_crc32_nopipe: SUCCESS.
        -bash-4.2$ ./test_bulk_crc32_pipe 
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 0.35
        CRC 16384 bytes @ 512 bytes per checksum X 100000 iterations = 0.36
        ./test_bulk_crc32_pipe: SUCCESS.
        -bash-4.2$ 
        

        In the above, _orig is the original unaccelerated code, _nopipe is the accelerated code without pipelining, and _pipe is the accelerated code with pipelining. The net result is that the non-pipelined case is about 4.5 times faster than the original and the pipelined case is about 11 times faster.

        Thanks to everyone for the help and patience developing this patch.

        All the best,
        Ed.

        Edward Nevill added a comment -

        Revised patch to add support for pipelining.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch
        against trunk revision 95bfd08.

        -1 patch. Trunk compilation may be broken.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5867//console

        This message is automatically generated.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch
        against trunk revision 95bfd08.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//console

        This message is automatically generated.

        Colin P. McCabe added a comment -

        Thanks for working on this, Edward Nevill. I think the #ifdefs are getting kind of out of control in this file. This isn't really your fault... it's a mess even now. But adding another dimension of #ifdefing just makes my brain hurt.

        How about we split the hardware-specific code for x86 into bulk_crc32_x86.c, and the ARM-specific code off into bulk_crc32_aarch64.c. CMake will compile in the relevant file based on the architecture. I would say both of the platform specific files should implement the same functions, so that the generic part can be compiled against either.
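
        For example, a sketch of the shared contract this implies (the header
        name and signatures are assumptions, not taken from the final patch):

        /* bulk_crc32_arch.h: implemented identically by bulk_crc32_x86.c and
           bulk_crc32_aarch64.c, so the generic code compiles against either. */
        #ifndef BULK_CRC32_ARCH_H
        #define BULK_CRC32_ARCH_H

        #include <stddef.h>
        #include <stdint.h>

        void pipelined_crc32c(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                              const uint8_t *p_buf, size_t block_size,
                              int num_blocks);
        void pipelined_crc32_zlib(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                                  const uint8_t *p_buf, size_t block_size,
                                  int num_blocks);

        #endif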

        I suppose we'd still have the USE_PIPELINED ifdefs, but one dimension of ifdefs is possible to understand (for me, at least).

        Do you think that's possible?

        Edward Nevill added a comment -

        Hi,

        I have added a new patch which reworks the code and (hopefully) makes it more readable. I have also attached the resultant bulk_crc32.c as the patch is growing rather large and difficult to understand.

        I have made the following changes

        1) I have removed conditionalisation on USE_PIPELINED. I really do not understand the condition that reads

        #if (!defined(__FreeBSD__) && !defined(WINDOWS))
        #define USE_PIPELINED
        #endif
        

        USE_PIPELINED is a feature of the architecture (i.e. whether there is any benefit to pipelining on your particular architecture, at the moment x86 or aarch64), not of the OS. So I really don't understand this ifdef.

        If some architecture does not benefit from pipelining, it must still implement a pipelined version, but this can be a basic shell like the one I have done for the sb8 versions.

        static void pipelined_crc32c_sb8(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                                         const uint8_t *p_buf, size_t block_size, int num_blocks) {
          assert(num_blocks >= 1 && num_blocks <=3 && "invalid num_blocks");
          *crc1 = crc32c_sb8(*crc1, p_buf, block_size);
          if (num_blocks >= 2)
            *crc2 = crc32c_sb8(*crc2, p_buf+block_size, block_size);
          if (num_blocks >= 3)
            *crc3 = crc32c_sb8(*crc3, p_buf+2*block_size, block_size);
        }
        

        So, in this case, if there is no support for pipelining, just call the non-pipelined version three times, once for each buffer in turn.

        Doing this makes the code a whole lot easier to read, IMHO.

        2) The existing code initialised a variable

        cached_cpu_supports_crc32

        and then did

              if (likely(cached_cpu_supports_crc32)) {
                crc_update_func = crc32c_hardware;
        

        The problem with this is that it creates a reference to crc32c_hardware even when there is no hardware support. This means it is necessary to create a dummy crc32c_hardware function when we are only doing SW CRC, i.e.:

        static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t length) {
          // never called!
          assert(0 && "hardware crc called on an unsupported platform");
          return 0;
        }
        

        Instead, what I do is initialise static function pointers in the initialisation function. So for example, for aarch64 I do

        void __attribute__ ((constructor)) init_cpu_support_flag(void) {
          unsigned long auxv = getauxval(AT_HWCAP);
          if (auxv & HWCAP_CRC32) {
            pipelined_crc32c_func = pipelined_crc32c;
            pipelined_crc32_zlib_func = pipelined_crc32_zlib;
          }
        }
        

        By default these pointers are statically initialised to the SW routines, i.e.:

        // Statically initialise the function pointers to the software versions.
        // If HW is available these will be overridden by the initialisation functions
        static crc_pipelined_func_t pipelined_crc32c_func = pipelined_crc32c_sb8;
        static crc_pipelined_func_t pipelined_crc32_zlib_func = pipelined_crc32_zlib_sb8;
        

        So if nobody initialises them they fall back to SW.
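
        For reference, a sketch of the resulting dispatch pattern (the typedef
        matches the initialisers quoted above; the call site is illustrative):

        /* Function-pointer type matching the sb8 shells and HW versions. */
        typedef void (*crc_pipelined_func_t)(uint32_t *crc1, uint32_t *crc2,
                                             uint32_t *crc3, const uint8_t *p_buf,
                                             size_t block_size, int num_blocks);

        /* A call site needs no #ifdef and no runtime capability check: */
        pipelined_crc32c_func(&crc1, &crc2, &crc3, data, bytes_per_checksum,
                              n_blocks);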

        3) I have removed the implementation for CRC acceleration on 32-bit x86. 32-bit x86 now falls back to SW.

        Does anyone really build Hadoop for 32-bit x86? If yes, I will gladly put it back.

        The result of these changes is to vastly reduce the amount of conditionalisation in the code. The conditionalisation is now reduced to

        #if defined(__amd64__) && defined(__GNUC__) && !defined(__FreeBSD__)
        .....
        #elif defined(__aarch64__) // Start ARM 64 architecture
        .....
        #endif
        

        I hope the above changes will make the code a whole lot more readable,

        All the best,
        Ed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12703755/bulk_crc32.c
        against trunk revision 64eb068.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5910//console

        This message is automatically generated.

        Colin P. McCabe added a comment -

        Hi Edward,

        What's the rationale for not splitting the file into multiple files per architecture as I asked in this comment: https://issues.apache.org/jira/browse/HADOOP-11660?focusedCommentId=14350788&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350788 ? Also, I would prefer not to drop support for architectures, even if they are 32-bit x86.

        Edward Nevill added a comment -

        Add support for aarch64 CRC instructions

        Edward Nevill added a comment -

        Hi Colin,

        My understanding was that the issue was the number of #ifdefs in the file, so I reworked it to reduce to a single #ifdef X86 ... #elif AARCH64 ... #endif.

        I have prepared a new patch which separates x86 and aarch64 into two files bulk_crc32_x86.c and bulk_crc32_aarch64.c. bulk_crc32.c now contains just the fallback software implementation.

        The CMake file includes both bulk_crc32_x86.c and bulk_crc32_aarch64.c, and the two files are then protected by:

        bulk_crc32_x86.c

        #ifdef <X86>
        ....
        #endif
        

        bulk_crc32_aarch64.c

        #ifdef <AARCH64>
        ...
        #endif
        

        I am not clear how to make CMake correctly select the _x86 or _aarch64 version. If you could help me with this then we can get rid of the final conditionalisation above.

        I have also added back in support for the optimised 32-bit x86 implementation.

        All the best,
        Ed.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12705102/jira-11660.patch
        against trunk revision 7179f94.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//console

        This message is automatically generated.

        Colin P. McCabe added a comment -

        Sorry that I have been slow to review this. It is difficult code to review.

        If you want an example of CMake conditional compilation, look at ./hadoop-hdfs-project/hadoop-hdfs/src/CMakeLists.txt and its if (WIN32). In your case you would be checking the built-in CMake variables, something like ARCH? It should be in the CMake docs.

        Edward Nevill added a comment -

        Revised patch to select _x86 or _aarch64 in cmake

        Edward Nevill added a comment -

        Hi,

        There does not seem to be a variable ARCH. The closest match seems to be CMAKE_SYSTEM_PROCESSOR.

        I have added a condition in the CMakeLists.txt which does

        IF (CMAKE_SYSTEM_PROCESSOR MATCHES "^i.86$" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "amd64")
          set(BULK_CRC_ARCH_SOURCE_FIlE "${D}/util/bulk_crc32_x86.c")
        ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
          set(BULK_CRC_ARCH_SOURCE_FIlE "${D}/util/bulk_crc32_aarch64.c")
        ENDIF()
        

        It then uses BULK_CRC_ARCH_SOURCE_FIlE in the two places where the source file is included later.

        Thanks for your help and apologies for the amount of time this patch is taking,

        Ed.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12705611/jira-11660.patch
        against trunk revision 522c897.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//console

        This message is automatically generated.

        Colin P. McCabe added a comment -

        Thank you for your patience, Edward Nevill. I think this is almost ready to commit.

        ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
          set(BULK_CRC_ARCH_SOURCE_FIlE "${D}/util/bulk_crc32_aarch64.c")
        ENDIF()
        

        Can you put in a MESSAGE here that explains that the architecture is unsupported in the ELSE case? We certainly don't want to be losing hardware acceleration without being aware of it.

        bulk_crc32_aarch64.c: you should include stdint.h here for uint8_t, etc. Even though some other header is probably pulling it in now by accident.

        bulk_crc32_x86.c: I would really prefer not to wrap this in a giant #if defined(__GNUC__) && !defined(__FreeBSD__), especially since we're not wrapping the ARM version like that. If people want this to be compiler- and OS-specific, it would be better to do it at the CMake level. I would say just take that out and let people fix it if it becomes a problem for them.

        Can you post before / after performance numbers for x86_64? Maybe you can instrument test_bulk_crc32.c to produce those numbers.

        It looks like when this was done previously, the test code was not checked in. See:
        https://issues.apache.org/jira/browse/HADOOP-7446?focusedCommentId=13084519&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13084519

        I'm sorry to dump another task on you, but my co-workers will kill me if I regress checksum performance.

        Thanks again for working on this. As soon as we verify that we haven't regressed perf, and made those minor changes, we should be good to go.

        Edward Nevill added a comment -

        Revised patch to include requested changes

        Edward Nevill added a comment -

        Hi,

        I have revised the patch to include the changes requested above. I have also updated test_bulk_crc32.c so it prints out the times for 16384 bytes @ 512 bytes per checksum X 1000000 iterations for both the Castagnoli and Zlib polynomials.

        The following are the results I get for x86_64 before and after. I have done 5 runs of each.

        BEFORE

        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.8
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.84
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.1
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.85
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.94
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.81
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        

        AFTER

        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.99
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.9
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        

        Looking at the average times over 5 runs gives:

        BEFORE

        Castagnoli average = 1.116 sec
        Zlib average = 13.848 sec

        AFTER

        Castagnoli average = 1.116 sec
        Zlib average = 13.93 sec

        So the performance for the Castagnoli polynomial is the same. For the Zlib polynomial there seems to be a performance degradation of 0.6%. This may be due to experimental error; in any case the Zlib CRC is unaccelerated on x86, because it is not supported by x86 HW, and it is not used for HDFS.

        For comparison, on aarch64 partner HW I get the following averages

        Castagnoli = 3.586
        Zlib = 3.580

        Many thanks for your help with this,
        Ed.

        enevill Edward Nevill added a comment -

        Hi, I have revised the patch to include the changes requested above.

        I have also updated test_bulk_crc32.c so that it prints out the times for 16384 bytes @ 512 bytes per checksum X 1000000 iterations, for both the Castagnoli and Zlib polynomials.

        The following are the results I get for x86_64 before and after. I have done 5 runs of each.

        BEFORE

        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.8
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.84
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.1
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.85
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.94
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.81
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.

        AFTER

        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.99
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.9
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
        [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
        CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
        ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.

        Looking at the average time over 5 runs gives:

        BEFORE: Castagnoli average = 1.116 sec, Zlib average = 13.848 sec
        AFTER: Castagnoli average = 1.116 sec, Zlib average = 13.93 sec

        So the performance for the Castagnoli polynomial is the same. For the Zlib polynomial there seems to be a performance degradation of 0.6%. This may be due to experimental error; in any case the Zlib polynomial is unaccelerated on x86, because it is not supported by x86 hardware CRC instructions, and it is not used for HDFS.

        For comparison, on aarch64 partner HW I get the following averages: Castagnoli = 3.586 sec, Zlib = 3.580 sec.

        Many thanks for your help with this,
        Ed.
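
        A minimal timing harness in the spirit of the modified test is sketched below, for anyone who wants to reproduce the shape of the measurement. The crc32_sw stub, the constants, and the buffer contents are illustrative stand-ins (the real test drives the library's bulk CRC routines from bulk_crc32.c); only the printout mirrors the output format above.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        /* Placeholder software CRC32 (zlib polynomial 0xEDB88320, bit-at-a-time).
         * The real test exercises the library's bulk CRC routines; this stub
         * only makes the harness self-contained. */
        static unsigned int crc32_sw(unsigned int crc, const unsigned char *p,
                                     size_t n) {
          crc = ~crc;
          while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
              crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
          }
          return ~crc;
        }

        int main(void) {
          const size_t len = 16384;       /* bytes CRC'd per iteration */
          const size_t bpc = 512;         /* bytes per checksum */
          const int iterations = 1000000;
          unsigned char *data = malloc(len);
          if (data == NULL) return 1;
          memset(data, 0x5a, len);

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < iterations; i++)
            for (size_t off = 0; off < len; off += bpc)
              (void)crc32_sw(0, data + off, bpc);  /* one CRC per 512-byte chunk */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("CRC %zu bytes @ %zu bytes per checksum X %d iterations = %.2f\n",
                 len, bpc, iterations, secs);
          free(data);
          return 0;
        }
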
        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708163/jira-11660.patch
        against trunk revision ae3e8c6.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -

        +1. Thanks, Edward Nevill.

        Just a quick note for the future, we usually leave around the old patch when attaching a new patch. Then we have files like jira-11660.001.patch, jira-11660.002.patch, etc. attached so we can see the evolution of the patch over time.

        Committed to 2.8.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #7469 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7469/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #149 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/149/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #883 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/883/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        • hadoop-common-project/hadoop-common/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2081 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2081/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #140 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/140/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #149 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/149/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        • hadoop-common-project/hadoop-common/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2099 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2099/)
        HADOOP-11660. Add support for hardware crc of HDFS checksums on ARM aarch64 architecture (Edward Nevill via Colin P. McCabe) (cmccabe: rev d9ac5ee2c4dcd4a108ca892af501618caaea450c)

        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestNativeCrc32.java
        • hadoop-common-project/hadoop-common/src/main/native/src/test/org/apache/hadoop/util/test_bulk_crc32.c
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/CMakeLists.txt
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_aarch64.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c
        • hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
        apinski-cavium Andrew Pinski added a comment -

        > On aarch64 a crc32 takes 3 exec cycles, but it only has one crc unit, so if
        > we did the same on aarch64 and had say
        > crc32 w0, w0, x3
        > crc32 w1, w1, x4
        > crc32 w2, w2, x5
        > The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete
        > and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so
        > there is no benefit to pipelining.

        This is not true on some aarch64 processors. For ThunderX it is definitely not the case: crc32 (32-bit) is fully pipelined, so the next instruction issues right away.
        The latency of those instructions is 4 cycles, though, so the three-instruction sequence above takes only 6 cycles on ThunderX.

        Thanks,
        Andrew Pinski
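
        To make the dependency-chain argument concrete, the serial (non-pipelined) inner loop can be written with the ACLE intrinsics from arm_acle.h as in the sketch below. This is illustrative only, not the code from the committed patch: crc32c_serial is a hypothetical name, and it assumes a toolchain targeting armv8-a+crc.

        #include <arm_acle.h>   /* __crc32cd/__crc32cb; build with -march=armv8-a+crc */
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Serial CRC32C: every __crc32cd consumes the previous result, so the
         * loop is a single dependency chain and throughput is bounded by the
         * instruction latency -- the behaviour discussed in this thread. */
        static uint32_t crc32c_serial(uint32_t crc, const uint8_t *p, size_t len) {
          while (len >= 8) {
            uint64_t word;
            memcpy(&word, p, sizeof word);   /* unaligned-safe 8-byte load */
            crc = __crc32cd(crc, word);      /* next iteration stalls on this */
            p += 8;
            len -= 8;
          }
          while (len--)
            crc = __crc32cb(crc, *p++);      /* byte-at-a-time tail */
          return crc;
        }

        A pipelined scheme, by contrast, runs several independent chains over interleaved slices of the buffer and combines the partial results, so consecutive CRC instructions no longer depend on one another. That pays off both on cores with a single multi-cycle CRC unit and, per the ThunderX numbers above, on fully pipelined implementations.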

        enevill Edward Nevill added a comment -

        Hi Andrew,

        Yes. Please see my comments of 4/Mar/15.

        The version of the CRC patch checked into Hadoop does support pipelining.

        On A57 without pipelining the raw CRC speed was 4.5X better. With pipelining it was 11X better.

        Regards,
        Ed.


          People

          • Assignee: enevill Edward Nevill
          • Reporter: enevill Edward Nevill
          • Votes: 0
          • Watchers: 8

          Time Tracking

          • Original Estimate: 48h
          • Remaining Estimate: 48h
          • Time Spent: Not Specified