Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.0
    • Fix Version/s: None
    • Component/s: performance, task
    • Labels:
      None
    • Target Version/s: 2.8.0

      Description

      Currently, the IFile format used by the MR shuffle checksums all data using the zlib CRC32 polynomial. If we allow use of CRC32C instead, we can get a large reduction in CPU usage by leveraging the native hardware CRC32C implementation (approx half a second of CPU time savings per GB checksummed).
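To make the difference between the two polynomials concrete, here is a minimal sketch that computes both checksums over the same bytes. It uses only the JDK's java.util.zip classes (CRC32C requires Java 9 or later), not Hadoop's own checksum code or its native library; the class and method names are illustrative assumptions and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class ChecksumPolynomials {
    // Feed the whole buffer to the given Checksum and return its value.
    static long checksum(Checksum c, byte[] data) {
        c.reset();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    // zlib CRC32 polynomial (what the description says IFile uses today).
    static long crc32(byte[] data) {
        return checksum(new CRC32(), data);
    }

    // Castagnoli CRC32C polynomial (the proposed alternative).
    static long crc32c(byte[] data) {
        return checksum(new CRC32C(), data);
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32  = %08x%n", crc32(data));
        System.out.printf("CRC32C = %08x%n", crc32c(data));
    }
}
```

The two values differ for the same input, so a reader and writer must agree on which polynomial is in use.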

      1. mapreduce-5962.txt
        14 kB
        Todd Lipcon
      2. mapreduce-5962.txt
        14 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon created issue -
          Todd Lipcon made changes -
          Field Original Value New Value
          Link This issue is related to HDFS-3528 [ HDFS-3528 ]
          Todd Lipcon made changes -
          Link This issue relates to MAPREDUCE-2841 [ MAPREDUCE-2841 ]
          James Thomas made changes -
          Assignee James Thomas [ james.thomas ]
          Todd Lipcon made changes -
          Assignee James Thomas [ james.thomas ] Todd Lipcon [ tlipcon ]
          Todd Lipcon added a comment -

          Attached patch adds a new configuration to set the IFile checksum type. I changed the default to CRC32C since it's much faster if you have the native libraries available.

          I don't believe this is an incompatible change, since IFiles are only used internal to a single job (written by map, read by reduce). So, one would never have a different version reader compared to writer. That said, if anyone has any issues with this, they can configure the default back to CRC32 cluster-wide.
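As a rough illustration of a configurable checksum type with a CRC32C default, the sketch below selects an implementation from a string setting. It is an assumption-laden stand-in, not the patch itself: java.util.Properties substitutes for the Hadoop configuration object, the key name example.ifile.checksum.type is hypothetical (the real key is not quoted in this issue), and the JDK's CRC32/CRC32C classes substitute for Hadoop's checksum classes.

```java
import java.util.Locale;
import java.util.Properties;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class IFileChecksumConfig {
    // Hypothetical key name, used here only for illustration.
    static final String CHECKSUM_TYPE_KEY = "example.ifile.checksum.type";
    // Default mirrors the comment above: CRC32C unless configured otherwise.
    static final String DEFAULT_TYPE = "CRC32C";

    // Pick a Checksum implementation based on the configured type.
    static Checksum newChecksum(Properties conf) {
        String type = conf.getProperty(CHECKSUM_TYPE_KEY, DEFAULT_TYPE)
                .trim()
                .toUpperCase(Locale.ROOT);
        switch (type) {
            case "CRC32":
                return new CRC32();
            case "CRC32C":
                return new CRC32C();
            default:
                throw new IllegalArgumentException("Unknown checksum type: " + type);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(newChecksum(conf).getClass().getSimpleName());

        // Configure the old polynomial back, as the comment suggests is possible.
        conf.setProperty(CHECKSUM_TYPE_KEY, "CRC32");
        System.out.println(newChecksum(conf).getClass().getSimpleName());
    }
}
```

Because writer and reader in the same job would read the same setting, both sides end up with the same polynomial, which is the compatibility argument made in the comment.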

          Todd Lipcon made changes -
          Attachment mapreduce-5962.txt [ 12656314 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon added a comment -

          (fwiw this depends on James Thomas's work to enable native checksumming on byte arrays. So we won't see an immediate benefit, but will once that patch is done)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12656314/mapreduce-5962.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapred.TestReduceFetch
          org.apache.hadoop.mapred.TestReduceFetchFromPartialMem

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4750//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4750//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4750//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The RawKVIteratorReader used in the shuffle wasn't properly passing the jobconf into the IFileReader. This was causing an NPE when we tried to get the checksum type out of the conf. I changed it to pass the jobConf, which may actually have a slight performance advantage too, since it avoids the "new Configuration()" call in IFileInputStream's ctor. Verified that the two unit tests that failed before now pass on my machine.

          Todd Lipcon made changes -
          Attachment mapreduce-5962.txt [ 12656376 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12656376/mapreduce-5962.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4752//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4752//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4752//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Realized for this to be effective we also need to implement the Checksum interface with the native code. Currently the native code only supports the "chunked sums" verification used by HDFS, and doesn't implement the java Checksum.update interface that IFile uses. Will hold off on this patch for the time being.
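The distinction between the two usage styles can be sketched as follows. This is only an illustration using the JDK's java.util.zip.CRC32C (Java 9 or later), not Hadoop's native code; the class name, method names, and the bytesPerChecksum parameter are assumptions made for the example.

```java
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class StreamingVsChunked {
    // Streaming style: one running checksum, fed through Checksum.update
    // in arbitrary-sized pieces as the bytes arrive. step must be positive.
    static long streaming(byte[] data, int step) {
        Checksum c = new CRC32C();
        for (int off = 0; off < data.length; off += step) {
            c.update(data, off, Math.min(step, data.length - off));
        }
        return c.getValue();
    }

    // Chunked style: an independent checksum for every fixed-size chunk,
    // producing one value per chunk. bytesPerChecksum must be positive.
    static long[] chunked(byte[] data, int bytesPerChecksum) {
        int n = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[n];
        Checksum c = new CRC32C();
        for (int i = 0; i < n; i++) {
            int off = i * bytesPerChecksum;
            c.reset();
            c.update(data, off, Math.min(bytesPerChecksum, data.length - off));
            sums[i] = c.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        System.out.printf("streaming: %08x%n", streaming(data, 64));
        System.out.println("chunks of 512 bytes: " + chunked(data, 512).length);
    }
}
```

The streaming result does not depend on how the input is split across update calls, whereas the chunked form yields a separate value per chunk, so code written for one style cannot simply be dropped in for the other.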

          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Todd Lipcon made changes -
          Link This issue relates to HADOOP-10859 [ HADOOP-10859 ]
          Vinod Kumar Vavilapalli added a comment -

          Moving features/enhancements out of previously closed releases into the next minor release 2.8.0.

          Vinod Kumar Vavilapalli made changes -
          Target Version/s 2.5.0 [ 12326265 ] 2.8.0 [ 12329060 ]
          Transition               | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> Patch Available  | 10d 1h 22m            | 1               | Todd Lipcon   | 17/Jul/14 19:19
          Patch Available -> Open  | 20h 56m               | 1               | Todd Lipcon   | 18/Jul/14 16:16

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0
              Watchers:
              9

              Dates

              • Created:
                Updated:

                Development