Hadoop HDFS / HDFS-4782

Backport edit log corruption toleration to branch-1-win

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1-win
    • Fix Version/s: 1-win
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      HDFS-3521 made changes to management of the edits log to prevent certain cases of corruption. This issue tracks backporting those changes to branch-1-win.

      1. HDFS-4782-branch-1-win.2.patch
        48 kB
        Chris Nauroth
      2. HDFS-4782-branch-1-win.1.patch
        38 kB
        Chris Nauroth

          Activity

          cnauroth Chris Nauroth added a comment -

          Attaching the patch. This was a fairly straightforward backport of the HDFS-3521 patch with some manual resolution in FSEditLog.

          We have observed some cases on Windows of an edits log with ~1 MB of trailing 0-bytes due to a race condition between preallocation and namenode shutdown. On the next namenode startup, the 0-bytes get misinterpreted as an OP_ADD, and the load fails. This patch will prevent those issues by changing preallocation to use OP_INVALID.
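          The failure mode described above can be illustrated with a small, self-contained sketch. It is not the FSEditLog code: the class and method names are hypothetical, the record format is reduced to an opcode byte plus a 4-byte payload, and the opcode values (OP_ADD = 0, OP_INVALID = -1) are assumptions made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

public class EditLogPaddingSketch {
  static final byte OP_ADD = 0;       // assumed opcode value
  static final byte OP_INVALID = -1;  // assumed end-of-log marker (0xFF)

  /** Writes one simplified OP_ADD record, then padLen bytes of padding. */
  static byte[] writeLog(byte padByte, int padLen) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeByte(OP_ADD);
    out.writeInt(42);                 // stand-in payload
    byte[] pad = new byte[padLen];
    Arrays.fill(pad, padByte);        // simulates the preallocated region
    out.write(pad);
    return bytes.toByteArray();
  }

  /** Counts records decoded before OP_INVALID or a clean end of file. */
  static int countOps(byte[] log) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
    int ops = 0;
    while (true) {
      byte opcode;
      try {
        opcode = in.readByte();
      } catch (EOFException e) {
        return ops;                   // no more bytes: clean end of file
      }
      if (opcode == OP_INVALID) {
        return ops;                   // padding reached: stop cleanly
      }
      if (opcode != OP_ADD) {
        throw new IOException("unknown opcode " + opcode);
      }
      in.readInt();                   // throws EOFException on a truncated record
      ops++;
    }
  }
}
```

          In this sketch, a log padded with 10 zero bytes yields two phantom OP_ADD records after the real one, while the same log padded with OP_INVALID bytes stops after the single real record. In a real edits log the phantom records would not parse as valid operations, which is consistent with the load failure described above.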

          I tested this on Mac and Windows.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          For backporting, let's try to keep the code the same in both branches so that it is easier to maintain.

          • add FSEditLog.PositionTrackingInputStream instead of org.apache.hadoop.io.PositionInputStream, although I like the latter more.
          • In FSEditLog.loadFSEdits(..),
            • initialize PositionTrackingInputStream even if isTolerationEnabled is false. It is useful for printing log messages.
            • catch Throwable instead of Exception. Then, it will catch Error(s) like OutOfMemoryError. (If an edit log is corrupted, the code may try to create a huge array, so OutOfMemoryError is possible.)
          • use the class name Padding instead of PaddingCorruption in TestEditLogToleration.
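          A rough sketch of the two loadFSEdits suggestions above, for illustration only. The wrapper below is a simplified stand-in for FSEditLog.PositionTrackingInputStream, not the committed class; the length-prefixed record format and the loadRecords method are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PositionTrackingSketch {
  /** Counts bytes consumed so that log messages can report an offset. */
  static class PositionTrackingInputStream extends FilterInputStream {
    private long pos = 0;

    PositionTrackingInputStream(InputStream in) {
      super(in);
    }

    @Override public int read() throws IOException {
      int b = super.read();
      if (b >= 0) pos++;
      return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) pos += n;
      return n;
    }

    @Override public long skip(long n) throws IOException {
      long skipped = super.skip(n);
      pos += skipped;
      return skipped;
    }

    // mark/reset would invalidate the counter, so this sketch disables it.
    @Override public boolean markSupported() {
      return false;
    }

    long getPos() {
      return pos;
    }
  }

  /** Returns the offset just past the last fully read record. */
  static long loadRecords(byte[] log) {
    // The tracker is created unconditionally so the offset is always loggable.
    PositionTrackingInputStream tracker =
        new PositionTrackingInputStream(new ByteArrayInputStream(log));
    DataInputStream in = new DataInputStream(tracker);
    long lastGood = 0;
    try {
      while (in.available() > 0) {
        int len = in.readInt();
        byte[] payload = new byte[len];  // a corrupt length can fail here
        in.readFully(payload);
        lastGood = tracker.getPos();
      }
    } catch (Throwable t) {              // Exceptions and Errors alike
      System.err.println("Failed at offset " + tracker.getPos()
          + ", last good offset " + lastGood + ": " + t);
    }
    return lastGood;
  }
}
```

          A corrupt negative length raises NegativeArraySizeException, which catch (Exception e) would also handle. A corrupt, very large positive length could instead raise OutOfMemoryError, which only catch (Throwable t) covers. This sketch simply logs and stops; real code would still need to decide whether the failure is tolerable or must be rethrown.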
          szetszwo Tsz Wo Nicholas Sze added a comment -

          BTW, should we also backport HDFS-3540?

          cnauroth Chris Nauroth added a comment -

          Thanks, Nicholas! I'm attaching a new patch to address your feedback. I made the files in this patch match their branch-1 versions more closely. I backported the HDFS-3540 changes in hdfs-default.xml and DFS_NAMENODE_EDITS_TOLERATION_LENGTH_DEFAULT. Additionally, the latest patch includes a backport of HDFS-3596 to bring FSEditLog.java even closer to the branch-1 code.

          I think this makes the code match between the two branches as closely as possible without folding in additional features, like recovery mode and concat. We can handle those as separate backports later if needed. For now, we're most interested in mitigating edit log corruption.

          I retested on Mac and Windows.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          The patch also includes the changes from the following JIRAs:

          • HDFS-3838. Fix the typo in FSEditLog.java: isToterationEnabled should be isTolerationEnabled.
          • HDFS-3961. FSEditLog preallocate() preallocated more than 1MB.
          • FSEditLog part of HDFS-4122. Cleanup HDFS logs and reduce the size of logged messages.
          • HDFS-4252. Improve confusing log message that prints exception when editlog read is completed.
          • HDFS-4444. Add space between total transaction time and number of transactions in FSEditLog#printStatistics.
          szetszwo Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Chris!

          cnauroth Chris Nauroth added a comment -

          Nicholas, thanks for the review and commit, and thank you also for enumerating all of the historical JIRAs that were folded into this patch.


            People

            • Assignee: Chris Nauroth (cnauroth)
            • Reporter: Chris Nauroth (cnauroth)
            • Votes: 0
            • Watchers: 3
