Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a bug that made fsck -list-corruptfileblocks count corrupt erasure-coded files incorrectly.

      Description

      HDFS-10826 fixed fsck for corrupt EC files when no parameters are specified.

      However, if I change the test case added in HDFS-10826 (TestFsck#testFsckCorruptECFile) to run "fsck -list-corruptfileblocks", the same test fails because fsck reports no corrupt files.

      Interestingly, if I run "fsck -files -blocks -replicaDetails", the test passes and shows the corrupt file.

      Need to fix the discrepancy.

      1. HDFS-10975.1.patch
        3 kB
        Takanobu Asanuma

          Activity

          jojochuang Wei-Chiu Chuang added a comment -

          It is also worth noting that without any parameters, the corrupt EC file is also reported as having an under-replicated block. An EC block should not be called under-replicated if it is missing 4 "replicas".
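
          To make the arithmetic concrete, here is a minimal sketch (an illustration only, not code from NamenodeFsck). It assumes a Reed-Solomon layout with 6 data and 3 parity units, as in the RS-DEFAULT-6-3-64k policy shown in the fsck output later in this thread, where a block group can be rebuilt only while at least 6 of its 9 internal blocks survive. The class and method names are hypothetical.

```java
/** Illustrative model only; the names here are hypothetical and not part of HDFS. */
final class EcGroupHealth {
  private EcGroupHealth() {}

  /** A Reed-Solomon group can be rebuilt from any dataUnits live internal blocks. */
  static boolean isRecoverable(int dataUnits, int liveInternalBlocks) {
    return liveInternalBlocks >= dataUnits;
  }

  /** Classifies a block group by how many of its internal blocks are missing. */
  static String describe(int dataUnits, int parityUnits, int missing) {
    int total = dataUnits + parityUnits;
    if (missing < 0 || missing > total) {
      throw new IllegalArgumentException("missing must be between 0 and " + total);
    }
    if (missing == 0) {
      return "healthy";
    }
    int live = total - missing;
    // Losing more internal blocks than there are parity units cannot be repaired.
    return isRecoverable(dataUnits, live) ? "under-erasure-coded" : "unrecoverable";
  }
}
```

          Under these assumptions, describe(6, 3, 4) yields "unrecoverable" because only 5 of 9 internal blocks remain, whereas describe(6, 3, 3) yields "under-erasure-coded".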

          tasanuma0829 Takanobu Asanuma added a comment -

          Wei-Chiu Chuang
          Thank you for filing the issue! If you haven't started working on this jira, can I take it?

          jojochuang Wei-Chiu Chuang added a comment -

          Sure, please go ahead and take it. I have been looking for all kinds of EC bugs recently, but I do not have time to work on all of them.

          tasanuma0829 Takanobu Asanuma added a comment -

          Thanks, I will!

          tasanuma0829 Takanobu Asanuma added a comment -

          Wei-Chiu Chuang
          I would appreciate it if you could review HDFS-10933. It will affect this jira.

          tasanuma0829 Takanobu Asanuma added a comment -

          "fsck -list-corruptfileblocks" calls FSNamesystem.listCorruptFileBlocks, and HDFS-10827 will solve the bug for EC files. In this jira, I will fix the code that counts "Under-erasure-coded block groups" and add some unit tests.

          tasanuma0829 Takanobu Asanuma added a comment -

          I confirmed that HDFS-10827 solved the "fsck -list-corruptfileblocks" problem for EC files.

          tasanuma0829 Takanobu Asanuma added a comment -

          I uploaded a new patch. It fixes the metric for EC blocks whose state is unrecoverable, and it also includes some unit tests.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 24s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 7s trunk passed
          +1 compile 0m 49s trunk passed
          +1 checkstyle 0m 27s trunk passed
          +1 mvnsite 0m 54s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 2m 3s trunk passed
          +1 javadoc 1m 35s trunk passed
          +1 mvninstall 1m 58s the patch passed
          +1 compile 1m 50s the patch passed
          +1 javac 1m 50s the patch passed
          -0 checkstyle 1m 4s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 98 unchanged - 2 fixed = 99 total (was 100)
          +1 mvnsite 2m 3s the patch passed
          +1 mvneclipse 0m 35s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 58s the patch passed
          +1 javadoc 0m 41s the patch passed
          -1 unit 76m 9s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          105m 10s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestRollingUpgrade
            hadoop.hdfs.TestDFSClientRetries



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HDFS-10975
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12833679/HDFS-10975.1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 67c38bc23d10 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 1f304b0
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/17181/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/17181/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/17181/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/17181/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          tasanuma0829 Takanobu Asanuma added a comment -

          The failed tests don't seem to be related.

          jojochuang Wei-Chiu Chuang added a comment -

          Hi Takanobu Asanuma, thanks for making the patch available. I think your patch fixed most of the problems, but there still appears to be one minor issue.

          If I look at the output of testFsckMissingECFile, which intentionally removes 4 out of 9 datanodes:

          Erasure Coded Block Groups:
           Total size:	393216 B
           Total files:	1
           Total block groups (validated):	1 (avg. block group size 393216 B)
            ********************************
            UNRECOVERABLE BLOCK GROUPS:	1 (100.0 %)
            CORRUPT FILES:	1
            MISSING BLOCK GROUPS:	1
            MISSING SIZE:		393216 B
            ********************************
           Minimally erasure-coded block groups:	0 (0.0 %)
           Over-erasure-coded block groups:	0 (0.0 %)
           Under-erasure-coded block groups:	0 (0.0 %)
           Unsatisfactory placement block groups:	0 (0.0 %)
           Default ecPolicy:		RS-DEFAULT-6-3-64k
           Average block group size:	5.0
           Missing block groups:		1
           Corrupt block groups:		0
           Missing internal blocks:	0 (0.0 %)
          FSCK ended at Tue Oct 18 09:43:23 PDT 2016 in 2 milliseconds
          
          
          The filesystem under path '/' is CORRUPT
          

          The output says missing 0 internal blocks. Shouldn't it say missing 4 internal blocks?
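
          For anyone who wants to check such counters programmatically, the following is a rough sketch of pulling a numeric value out of an fsck summary like the one above. It is not taken from TestFsck; the class name, method name, and regular expression are assumptions made for illustration, and the summary is assumed to use the "label: value" layout shown in this comment.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading counters from fsck summary text. */
final class FsckSummary {
  private FsckSummary() {}

  /**
   * Returns the first integer that follows "label:" at the start of a line,
   * or an empty result if no such line exists. Matching is case-sensitive,
   * so "Missing block groups" does not match "MISSING BLOCK GROUPS".
   */
  static OptionalLong count(String fsckOutput, String label) {
    Pattern p = Pattern.compile(
        "^[ \\t]*" + Pattern.quote(label) + ":[ \\t]*(\\d+)", Pattern.MULTILINE);
    Matcher m = p.matcher(fsckOutput);
    return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
  }
}
```

          A test could then compare FsckSummary.count(output, "Missing internal blocks") against the expected value instead of searching for a fixed substring.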

          tasanuma0829 Takanobu Asanuma added a comment -

          Thank you for reviewing, Wei-Chiu Chuang!

          In the case of a replicated block, if the state of the block is low-redundancy (not missing), the missing replicas are counted as "Missing replicas".

          /replication/missing:  Under replicated BP-442454012-172.16.165.209-1476839633883:blk_1073741825_1001. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
          Status: HEALTHY
           Number of data-nodes:	1
           Number of racks:		1
           Total dirs:			2
           Total symlinks:		0
          
          Replicated Blocks:
           Total size:	1024 B
           Total files:	1
           Total blocks (validated):	1 (avg. block size 1024 B)
           Minimally replicated blocks:	1 (100.0 %)
           Over-replicated blocks:	0 (0.0 %)
           Under-replicated blocks:	1 (100.0 %)
           Mis-replicated blocks:		0 (0.0 %)
           Default replication factor:	3
           Average block replication:	1.0
           Missing blocks:		0
           Corrupt blocks:		0
           Missing replicas:		2 (66.666664 %)
           

          However, if the block is missing, which means all of the replicas of the block are missing, they are not counted as "Missing replicas".

          /replication/missing: MISSING 1 blocks of total size 1024 B.
          Status: CORRUPT
           Number of data-nodes:	0
           Number of racks:		0
           Total dirs:			2
           Total symlinks:		0
          
          Replicated Blocks:
           Total size:	1024 B
           Total files:	1
           Total blocks (validated):	1 (avg. block size 1024 B)
            ********************************
            UNDER MIN REPL'D BLOCKS:	1 (100.0 %)
            dfs.namenode.replication.min:	1
            CORRUPT FILES:	1
            MISSING BLOCKS:	1
            MISSING SIZE:		1024 B
            ********************************
           Minimally replicated blocks:	0 (0.0 %)
           Over-replicated blocks:	0 (0.0 %)
           Under-replicated blocks:	0 (0.0 %)
           Mis-replicated blocks:		0 (0.0 %)
           Default replication factor:	3
           Average block replication:	0.0
           Missing blocks:		1
           Corrupt blocks:		0
           Missing replicas:		0
           

          If we synchronize the fsck results for EC and replication, "Missing internal blocks" should not be counted when the state is unrecoverable. What do you think?
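
          The counting rule described in this comment can be sketched as follows. This is a simplified model written for illustration, not the logic in NamenodeFsck; the class and method names are hypothetical, and the EC case assumes a group is unrecoverable once fewer internal blocks than data units are live.

```java
/** Simplified model of the fsck counting rule discussed above; names are hypothetical. */
final class MissingCountModel {
  private MissingCountModel() {}

  /** Replicated block: a fully missing block contributes nothing to "Missing replicas". */
  static int missingReplicas(int targetReplicas, int liveReplicas) {
    if (liveReplicas == 0) {
      return 0; // reported under "Missing blocks" instead
    }
    return Math.max(0, targetReplicas - liveReplicas);
  }

  /** EC block group: an unrecoverable group contributes nothing to "Missing internal blocks". */
  static int missingInternalBlocks(int dataUnits, int parityUnits, int liveInternalBlocks) {
    if (liveInternalBlocks < dataUnits) {
      return 0; // assumed unrecoverable; reported as a missing or corrupt block group instead
    }
    return Math.max(0, dataUnits + parityUnits - liveInternalBlocks);
  }
}
```

          This reproduces the numbers quoted above: missingReplicas(3, 1) is 2, missingReplicas(3, 0) is 0, and missingInternalBlocks(6, 3, 5) is 0.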

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks very much for the explanation. But I do find it counter-intuitive.
          Also, for verification, I ran the test case testFsckCorruptECFile.
          As you can see below, in this case the EC block is also unrecoverable, but it says missing 4 internal blocks.

          Erasure Coded Block Groups:
           Total size:	393216 B
           Total files:	1
           Total block groups (validated):	1 (avg. block group size 393216 B)
            ********************************
            UNRECOVERABLE BLOCK GROUPS:	1 (100.0 %)
            CORRUPT FILES:	1
            CORRUPT BLOCK GROUPS: 	1
            CORRUPT SIZE:		393216 B
            ********************************
           Minimally erasure-coded block groups:	0 (0.0 %)
           Over-erasure-coded block groups:	0 (0.0 %)
           Under-erasure-coded block groups:	1 (100.0 %)
           Unsatisfactory placement block groups:	0 (0.0 %)
           Default ecPolicy:		RS-DEFAULT-6-3-64k
           Average block group size:	5.0
           Missing block groups:		0
           Corrupt block groups:		1
           Missing internal blocks:	4 (44.444443 %)
          FSCK ended at Wed Oct 19 11:43:07 PDT 2016 in 2 milliseconds
          
          tasanuma0829 Takanobu Asanuma added a comment -

          I agree that it is counter-intuitive. But as Andrew Wang commented here, I also think this specification is required. It would be good if we added more documentation about fsck.

          Thanks for the verification, but I could not reproduce it. The latest patch includes the test in testFsckCorruptECFile. Could you check it again?

          jojochuang Wei-Chiu Chuang added a comment -

          Ah I see. I didn't rebase my local tree. +1.

          andrew.wang Andrew Wang added a comment -

          I'm +1 too. Wei-Chiu Chuang, do you want to handle the commit since you did most of the review?

          jojochuang Wei-Chiu Chuang added a comment -

          Sure I'll do it.

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks to Takanobu Asanuma for contributing the patch and to Andrew Wang for the review. Committed this to trunk.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10659 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10659/)
          HDFS-10975. fsck -list-corruptfileblocks does not report corrupt EC (weichiu: rev df857f0d10bda9fbb9c3f6ec77aba0cf46fe3631)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFsck.java
          tasanuma0829 Takanobu Asanuma added a comment -

          Thanks for committing and creating this issue, Wei-Chiu Chuang. Thanks for reviewing, Andrew Wang.


            People

            • Assignee:
              tasanuma0829 Takanobu Asanuma
              Reporter:
              jojochuang Wei-Chiu Chuang
            • Votes:
              0
              Watchers:
              8
