Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

DiskChecker can fail to detect total disk/controller failures indefinitely; we have seen this in real clusters. DiskChecker performs simple permissions-based checks on directories, which do not guarantee that any disk IO will be attempted.

      A simple improvement is to write some data and flush it to the disk.
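
A minimal sketch of such a probe, assuming a hypothetical checkDirWithDiskIo helper and probe file name (the names and constants in the actual patch may differ): create a file in the directory under test, write one byte, and force it to the device with FileDescriptor.sync().

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public final class DiskIoCheckSketch {
      /**
       * Probe a directory with real disk IO: create a file, write one
       * byte, and sync it to the device. Throws IOException on failure.
       */
      static void checkDirWithDiskIo(File dir) throws IOException {
        File probe = new File(dir, "DiskChecker.OK"); // hypothetical probe file name
        try (FileOutputStream fos = new FileOutputStream(probe)) {
          fos.write(1);       // force at least one byte of real IO
          fos.getFD().sync(); // flush the file data all the way to the device
        } finally {
          if (probe.exists() && !probe.delete()) {
            throw new IOException("Failed to delete probe file " + probe);
          }
        }
      }
    }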

      1. HADOOP-13738.01.patch
        10 kB
        Arpit Agarwal
      2. HADOOP-13738.02.patch
        11 kB
        Arpit Agarwal
      3. HADOOP-13738.03.patch
        12 kB
        Arpit Agarwal
      4. HADOOP-13738.04.patch
        16 kB
        Arpit Agarwal
      5. HADOOP-13738.05.patch
        16 kB
        Arpit Agarwal

          Activity

Arpit Agarwal added a comment -

The v01 patch attempts to create a file in the target directory, write one byte to it, and flush the file data to disk.

          Not hitting "Submit Patch" yet as this depends on HADOOP-13737.

Kihwal Lee added a comment -

The existing implementation mainly detects read-only file systems (mkdir fails with EROFS) and unmounted storage (mkdir fails with EPERM).

We have seen cases where written data is lost after closing because delayed block allocation failed in the kernel. Since this failure is asynchronous to the file write/close, no user process receives an error. I think enabling syncOnClose will make such writes fail with EIO. The write-sync test is more likely to detect these kinds of conditions, so I think this approach has merit.

Another common disk failure mode involves read errors. Writes go through fine, but reading the data back can cause an unrecoverable error or hang. Unless the affected sector is used for file system metadata, no action will be taken at the file system level. This is partially dealt with by adding the affected block to the volume scanner queue. The write-sync check will still catch many bad disks.

          Any particular reason why it retries on FNFE? When do you think that will happen?

Arpit Agarwal added a comment -

          Thanks for the feedback Kihwal Lee.

          Any particular reason why it retries on FNFE? When do you think that will happen?

The retry on FNFE handles the very unlikely situation of a file name collision while creating the FileOutputStream, e.g. due to simultaneous checks or a previously existing file that cannot be deleted.

Arpit Agarwal added a comment -

Another common disk failure mode involves read errors. Writes go through fine, but reading the data back can cause an unrecoverable error or hang. Unless the affected sector is used for file system metadata, no action will be taken at the file system level.

          I don't remember seeing this one yet. Do you have a theory on what causes it?

Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 5s HADOOP-13738 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HADOOP-13738
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12834279/HADOOP-13738.01.patch
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10843/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

Kihwal Lee added a comment -

          I don't remember seeing this one yet. Do you have a theory on what causes it?

I think the cause is mainly latent sector errors. As drives get larger and larger, these can go unnoticed for a long time. Even with the "data_err=abort" mount option, a delayed block allocation error detected at the EXT4 level doesn't normally cause the journal to be aborted (and the file system to become read-only), let alone trigger a reaction to read errors. SMART data (e.g. the remapping count) sometimes correlates with such read errors, but not always. I think there is large variance across manufacturers/models.

Arpit Agarwal added a comment -

The v02 patch is rebased to trunk. It also makes the doDiskIo method slightly more conservative (it retries up to three times on any IOException, not just FNFE), and adds two more tests for the changed behavior.
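
For illustration, a sketch of the retry loop described here, with an assumed retry constant and the probe helper from the earlier sketch (the actual doDiskIo in the patch may differ in detail):

    import java.io.File;
    import java.io.IOException;

    final class DoDiskIoSketch {
      private static final int NUM_DISK_IO_RETRIES = 3; // assumed retry count

      /** Retry the disk probe a few times; rethrow the last failure. */
      static void doDiskIo(File dir) throws IOException {
        IOException last = null;
        for (int i = 0; i < NUM_DISK_IO_RETRIES; ++i) {
          try {
            DiskIoCheckSketch.checkDirWithDiskIo(dir); // probe from the earlier sketch
            return;                                    // success
          } catch (IOException ioe) {
            last = ioe;                                // remember the failure and retry
          }
        }
        throw last; // all attempts failed
      }
    }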

Arpit Agarwal added a comment -

v03: fix a bad reference in the javadocs.

Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 9m 4s trunk passed
          +1 compile 8m 30s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 1m 4s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 38s trunk passed
          +1 javadoc 0m 51s trunk passed
          +1 mvninstall 0m 47s the patch passed
          +1 compile 7m 59s the patch passed
          +1 javac 7m 59s the patch passed
          -0 checkstyle 0m 23s hadoop-common-project/hadoop-common: The patch generated 7 new + 30 unchanged - 2 fixed = 37 total (was 32)
          +1 mvnsite 0m 54s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          -1 findbugs 1m 29s hadoop-common-project/hadoop-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
          +1 javadoc 0m 42s the patch passed
          -1 unit 20m 1s hadoop-common in the patch failed.
          +1 asflicense 0m 24s The patch does not generate ASF License warnings.
          56m 19s



          Reason Tests
          FindBugs module:hadoop-common-project/hadoop-common
  Bad attempt to compute absolute value of signed random integer in org.apache.hadoop.util.DiskChecker.makeRandomFile(File) at DiskChecker.java:[line 258]
          Timed out junit tests org.apache.hadoop.http.TestHttpServerLifecycle



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-13738
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835033/HADOOP-13738.02.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 8156dfa8c3d0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9d17585
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10881/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt
          findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/10881/artifact/patchprocess/new-findbugs-hadoop-common-project_hadoop-common.html
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10881/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10881/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10881/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.
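
The FindBugs -1 above flags a well-known Java pitfall rather than a style nit: Integer.MIN_VALUE has no positive counterpart in two's complement, so Math.abs(random.nextInt()) can return a negative number. A generic illustration of the problem and one common fix (this is not the patch's actual code):

    import java.util.Random;

    class AbsOfRandomSketch {
      public static void main(String[] args) {
        // The pitfall: Math.abs cannot represent +2147483648,
        // so it returns Integer.MIN_VALUE unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE)); // prints -2147483648

        // One common fix: clear the sign bit instead of calling Math.abs.
        Random rng = new Random();
        int nonNegative = rng.nextInt() & Integer.MAX_VALUE; // always >= 0
        System.out.println(nonNegative >= 0);                // prints true
      }
    }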

Kihwal Lee added a comment -

          Retrying is fine in general, but what if it took minutes to fail in the previous attempt?

Anu Engineer added a comment - edited

Arpit Agarwal Thank you for these improvements. I am sure these are going to make detecting errors on the datanode much easier.

          Had some minor comments / questions.

1. Not sure what diskchecker.random buys us versus diskchecker.1, diskchecker.2, or diskchecker.3. These files are always deleted after use, and in a failure case seeing the sequence number of the diskchecker files might be helpful, so I am not sure why we need random names at all here.
2. As Kihwal Lee said, I just wanted to think through the failure modes. I can think of three distinct failure cases:
  1. Not able to create a file at all – you can try three times and bail out, though as Kihwal Lee said it may take us six minutes to get out.
  2. Creation works, but I/O and delete fail – in this case the disk I/O failure is propagated but the junk files remain. Since DiskChecker will flag the disk as having an issue, this case is not problematic.
  3. File creation and I/O work, but delete fails. We seem to be using FileUtils.deleteQuietly; shouldn't DiskChecker be able to detect that the delete operation failed? Also, in this scenario, if we have both random file names and delete failures, we might create far too many junk files. If we use dc.1, dc.2, dc.3, we might be able to restrict the junk files to three.

Arpit Agarwal added a comment -

Kihwal Lee, that's a good point. The caller should deal with slow checks, since even a single check can hang indefinitely. The DN currently serializes disk checks, so one slow disk can delay checking the rest; it also doesn't handle stalled checks. I am testing some changes to fix this in the DataNode and plan to post patches in the next few days.

Xiaoyu Yao added a comment -

          Thanks Arpit Agarwal for working on this, Kihwal Lee and Anu Engineer for the discussion.

I can see some benefit to using a random file name. The disk checker may run multiple times, and a random file name will not be impacted by a failed deletion from a previous run. If we want to use a fixed pattern for test file naming, we should clean up files from previous runs before the disk check, like Arpit Agarwal has already done in the unit test.

Can we have a timer/threshold (at the millisecond level) on the expected execution time of each diskIoCheckWithoutNativeIo() test, to break out of the retry loop? That way, we won't have to wait forever even with the current serialized disk checks in the datanode.

Arpit Agarwal added a comment -

          Thanks for the feedback all! I've incorporated most comments.

so I am not sure why we need random names at all here

Changed the file naming scheme to use fixed names. If we hit two successive failures, then we'll try once more with a randomized file name.
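
A sketch of that naming scheme, assuming illustrative fixed names and a UUID-based fallback (the constants in the actual patch may differ):

    import java.io.File;
    import java.util.UUID;

    final class ProbeFileNamingSketch {
      // Assumed fixed names; the patch's real names may differ.
      static final String[] FIXED_NAMES = { "DiskChecker.OK.0", "DiskChecker.OK.1" };

      /** Pick a probe file: fixed names first, then a random fallback. */
      static File probeFile(File dir, int failedAttempts) {
        if (failedAttempts < FIXED_NAMES.length) {
          return new File(dir, FIXED_NAMES[failedAttempts]);
        }
        // After two successive failures, fall back to a randomized name so
        // a stale, undeletable file from a previous run cannot wedge the check.
        return new File(dir, "DiskChecker.OK." + UUID.randomUUID());
      }
    }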

shouldn't DiskChecker be able to detect that the delete operation failed

          Fixed.

Can we have a timer/threshold (at the millisecond level) on the expected execution time of each diskIoCheckWithoutNativeIo() test, to break out of the retry loop

Hi Xiaoyu Yao, implementing that would require spawning a thread, and DiskChecker would have to maintain a thread pool. We could end up with many threads stalled on a slow disk, and checks of healthy disks waiting for thread availability. It is easier to solve this in the caller. Let me know if you're okay with deferring this particular problem for now.
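
For context, putting a time limit on a blocking file operation in Java does require handing the work to another thread, e.g. via an ExecutorService, which is where the thread pool concern comes from. A generic sketch of what such a timed check would look like (not part of this patch):

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    final class TimedCheckSketch {
      private static final ExecutorService POOL = Executors.newCachedThreadPool();

      /** Run the probe with a deadline; a stalled probe still occupies a thread. */
      static void checkWithTimeout(File dir, long timeoutMs) throws Exception {
        Future<?> f = POOL.submit(() -> {
          DiskIoCheckSketch.checkDirWithDiskIo(dir); // probe from the earlier sketch
          return null;
        });
        try {
          f.get(timeoutMs, TimeUnit.MILLISECONDS); // fail fast if the disk stalls
        } catch (TimeoutException te) {
          f.cancel(true); // interrupts the worker; it may still block in the kernel
          throw new IOException("Disk check timed out after " + timeoutMs + " ms", te);
        }
      }
    }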

Anu Engineer added a comment -

+1, pending Jenkins. Thanks for updating the patch and fixing the issues. I will leave this JIRA unresolved till Monday evening (Oct 31) in case Kihwal Lee or Xiaoyu Yao have any further comments.
As for I/O getting stuck on a disk, I am hoping the normal I/O on that datanode would already have caught the issue and propagated the errors, so I am okay with not solving that in DiskChecker.

Arpit Agarwal added a comment -

          Thanks Anu Engineer. Yeah we should hold off committing for a few days to let Kihwal and Xiaoyu comment.

As for I/O getting stuck on a disk, I am hoping the normal I/O on that datanode would already have caught the issue and propagated the errors, so I am okay with not solving that in DiskChecker.

          Many IO failures in the DataNode trigger DiskChecker. So if the disk is failing it should eventually affect DiskChecker.

Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 20s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 4s trunk passed
          +1 compile 7m 35s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 59s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 28s trunk passed
          +1 javadoc 0m 43s trunk passed
          +1 mvninstall 0m 41s the patch passed
          +1 compile 7m 27s the patch passed
          +1 javac 7m 27s the patch passed
          -0 checkstyle 0m 25s hadoop-common-project/hadoop-common: The patch generated 1 new + 26 unchanged - 6 fixed = 27 total (was 32)
          +1 mvnsite 0m 56s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 33s the patch passed
          +1 javadoc 0m 43s the patch passed
          -1 unit 8m 12s hadoop-common in the patch failed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          40m 37s



          Reason Tests
          Failed junit tests hadoop.ha.TestZKFailoverController



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-13738
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835898/HADOOP-13738.04.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux c2c595d92d94 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8a9388e
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10921/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10921/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10921/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10921/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

Arpit Agarwal added a comment -

The v05 patch fixes the checkstyle issue. The test failure looks unrelated.

Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 11s trunk passed
          +1 compile 7m 22s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 1m 9s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 1m 35s trunk passed
          +1 javadoc 0m 41s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 7m 2s the patch passed
          +1 javac 7m 2s the patch passed
          +1 checkstyle 0m 28s hadoop-common-project/hadoop-common: The patch generated 0 new + 26 unchanged - 6 fixed = 26 total (was 32)
          +1 mvnsite 0m 59s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 37s the patch passed
          +1 javadoc 0m 42s the patch passed
          +1 unit 8m 18s hadoop-common in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          41m 42s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-13738
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835915/HADOOP-13738.05.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux df1b853c0f21 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 112f04e
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10922/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10922/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

Xiaoyu Yao added a comment -

          Thanks Arpit Agarwal for updating the patch. Patch v5 LGTM. +1.

Arpit Agarwal added a comment -

          Thanks for the review Xiaoyu Yao.

          Kihwal Lee, it didn't sound like you had any objections to the proposed approach but I'll wait until the end of this week before committing.

Filed HDFS-11086 to separately address some further improvements in the DN's use of DiskChecker. That will cover the "check taking minutes" failure case you brought up.

Kihwal Lee added a comment -

          +1

Arpit Agarwal added a comment -

          Thank you all for the reviews and discussion.

Pushed for 2.9.0. The branch-2 commit required the following delta relative to trunk to compile:

             private static AtomicReference<FileIoProvider> fileIoProvider =
          -      new AtomicReference<>(new DefaultFileIoProvider());
          +      new AtomicReference<FileIoProvider>(new DefaultFileIoProvider());
          
Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10750 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10750/)
          HADOOP-13738. DiskChecker should perform some disk IO. (arp: rev 1b6ecaf016aaf7f6a09a4d576294b5e0a6850a1f)

          • (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestDiskChecker.java
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DiskChecker.java

            People

• Assignee:
  Arpit Agarwal
• Reporter:
  Arpit Agarwal
• Votes:
  0
• Watchers:
  11
