Hadoop HDFS / HDFS-10512

VolumeScanner may terminate due to NPE in DataNode.reportBadBlocks

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha1
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      VolumeScanner may terminate due to an unexpected NullPointerException thrown in DataNode.reportBadBlocks(). This is different from HDFS-8850/HDFS-9190.

      I observed this bug in a production CDH 5.5.1 cluster, and the same bug still persists in upstream trunk.

      2016-04-07 20:30:53,830 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad BP-1800173197-10.204.68.5-1444425156296:blk_1170134484_96468685 on /dfs/dn
      2016-04-07 20:30:53,831 ERROR org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/dfs/dn, DS-89b72832-2a8c-48f3-8235-48e6c5eb5ab3) exiting because of exception
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.datanode.DataNode.reportBadBlocks(DataNode.java:1018)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner$ScanResultHandler.handle(VolumeScanner.java:287)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:443)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:547)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:621)
      2016-04-07 20:30:53,832 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/dfs/dn, DS-89b72832-2a8c-48f3-8235-48e6c5eb5ab3) exiting.
      

      I think the NPE comes from the volume variable in the following code snippet. Somehow the volume scanner knows the volume, but the datanode cannot look up the volume using the block.

      public void reportBadBlocks(ExtendedBlock block) throws IOException {
        BPOfferService bpos = getBPOSForBlock(block);
        FsVolumeSpi volume = getFSDataset().getVolume(block);
        bpos.reportBadBlocks(
            block, volume.getStorageID(), volume.getStorageType());
      }
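As a minimal sketch of the hardening direction (hypothetical stand-in types, not actual HDFS code), the idea is to guard against a null volume before dereferencing it, so a missed lookup does not kill the scanner:

```java
/** Hypothetical minimal stand-ins illustrating the null guard. */
public class ReportBadBlocksGuard {
  interface FsVolumeSpi { String getStorageID(); }

  // Stand-in for FsDatasetImpl#getVolume: returns null once the replica is gone.
  static FsVolumeSpi getVolume(boolean replicaPresent) {
    return replicaPresent ? () -> "DS-1234" : null;
  }

  static String reportBadBlocks(boolean replicaPresent) {
    FsVolumeSpi volume = getVolume(replicaPresent);
    if (volume == null) {
      // Guard: log and return instead of letting an NPE terminate the scanner.
      return "skipped: no volume for block";
    }
    return "reported on " + volume.getStorageID();
  }

  public static void main(String[] args) {
    System.out.println(reportBadBlocks(true));
    System.out.println(reportBadBlocks(false));
  }
}
```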
      
      1. HDFS-10512.001.patch
        0.9 kB
        Yiqun Lin
      2. HDFS-10512.002.patch
        2 kB
        Yiqun Lin
      3. HDFS-10512.004.patch
        4 kB
        Wei-Chiu Chuang
      4. HDFS-10512.005.patch
        7 kB
        Yiqun Lin
      5. HDFS-10512.006.patch
        7 kB
        Yiqun Lin

        Issue Links

          Activity

          jojochuang Wei-Chiu Chuang added a comment -

          This is different from HDFS-8850/HDFS-9190.

          linyiqun Yiqun Lin added a comment -

          The block has been checked before datanode.reportBadBlocks in ScanResultHandler#handle. So I suspect that the block was removed after that check and before datanode.reportBadBlocks. Otherwise, if the block no longer existed on the datanode, the handler would simply have returned.

              public void handle(ExtendedBlock block, IOException e) {
                FsVolumeSpi volume = scanner.volume;
                if (e == null) {
                  LOG.trace("Successfully scanned {} on {}", block, volume.getBasePath());
                  return;
                }
                // If the block does not exist anymore, then it's not an error.
                if (!volume.getDataset().contains(block)) {
                  LOG.debug("Volume {}: block {} is no longer in the dataset.",
                      volume.getBasePath(), block);
                  return;
                }
                ...
                LOG.warn("Reporting bad {} on {}", block, volume.getBasePath());
                try {
                  scanner.datanode.reportBadBlocks(block);
                } catch (IOException ie) {
                  // This is bad, but not bad enough to shut down the scanner.
                  LOG.warn("Cannot report bad " + block.getBlockId(), e);
                }
              }
          

          Attach a patch for this.
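The suspected check-then-act window can be sketched with simplified stand-ins (hypothetical names; the real map is FsDatasetImpl's volumeMap, keyed by block pool id and block):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified sketch of the suspected race: the contains() check passes,
 *  the replica is removed concurrently, and the later volume lookup
 *  returns null, the value the NPE was then thrown on. */
public class CheckThenActRace {
  // Stand-in for volumeMap: block id -> volume base path.
  static final Map<Long, String> volumeMap = new ConcurrentHashMap<>();

  static boolean contains(long blockId) {
    return volumeMap.containsKey(blockId);
  }

  static String getVolume(long blockId) {
    return volumeMap.get(blockId); // null once the replica is gone
  }

  public static void main(String[] args) {
    volumeMap.put(42L, "/dfs/dn");
    boolean checked = contains(42L); // handler's check passes
    volumeMap.remove(42L);           // e.g. block invalidation in between
    String volume = getVolume(42L);  // reportBadBlocks' lookup now misses
    System.out.println(checked && volume == null); // prints true
  }
}
```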

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 32s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 8m 3s trunk passed
          +1 compile 0m 56s trunk passed
          +1 checkstyle 0m 33s trunk passed
          +1 mvnsite 1m 3s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 1m 53s trunk passed
          +1 javadoc 1m 14s trunk passed
          +1 mvninstall 0m 55s the patch passed
          +1 compile 0m 50s the patch passed
          +1 javac 0m 50s the patch passed
          +1 checkstyle 0m 27s the patch passed
          +1 mvnsite 0m 57s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          -1 whitespace 0m 0s The patch has 20 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 findbugs 2m 1s the patch passed
          +1 javadoc 1m 9s the patch passed
          -1 unit 73m 25s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          96m 10s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestEditLog



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12809345/HDFS-10512.001.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 40a5f2c7d65a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9581fb7
          Default Java 1.8.0_91
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/15729/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15729/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15729/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15729/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15729/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks Yiqun Lin. The check in the patch looks good to me.
          I think that if the volume is null for some reason and the bad block can't be reported to the NN, it should throw an IOException so that this is not silently ignored. At this point, I am not sure if it's a race condition or a bug somewhere.

          xyao Xiaoyu Yao added a comment -

          Thanks Wei-Chiu Chuang for reporting the issue and Yiqun Lin for posting the patch.
          There is a similar usage in DataNode#reportBadBlock that needs to check for a null volume as well.
          For both cases, I would suggest we log an ERROR as follows.

              if (volume != null) {
                bpos.reportBadBlocks(
                    block, volume.getStorageID(), volume.getStorageType());
              } else {
                LOG.error("Cannot find FsVolumeSpi to report bad block id: "
                    + block.getBlockId() + " bpid: " + block.getBlockPoolId());
              }
          
          linyiqun Yiqun Lin added a comment -

          Thanks Wei-Chiu Chuang and Xiaoyu Yao for the review. Posting a new patch to address the comments.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 9m 11s Docker mode activated.
          +1 @author 0m 1s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 8m 13s trunk passed
          +1 compile 0m 51s trunk passed
          +1 checkstyle 0m 29s trunk passed
          +1 mvnsite 1m 3s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 56s trunk passed
          +1 javadoc 1m 0s trunk passed
          +1 mvninstall 0m 57s the patch passed
          +1 compile 0m 49s the patch passed
          +1 javac 0m 49s the patch passed
          +1 checkstyle 0m 25s the patch passed
          +1 mvnsite 0m 54s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          -1 whitespace 0m 0s The patch has 20 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 findbugs 2m 1s the patch passed
          +1 javadoc 0m 57s the patch passed
          +1 unit 66m 0s hadoop-hdfs in the patch passed.
          +1 asflicense 0m 28s The patch does not generate ASF License warnings.
          97m 6s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12809594/HDFS-10512.002.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 4d77ed9c5525 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8a1dcce
          Default Java 1.8.0_91
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/15741/artifact/patchprocess/whitespace-eol.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15741/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15741/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 24s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 6m 12s trunk passed
          +1 compile 0m 47s trunk passed
          +1 checkstyle 0m 27s trunk passed
          +1 mvnsite 0m 51s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 39s trunk passed
          +1 javadoc 0m 55s trunk passed
          +1 mvninstall 0m 46s the patch passed
          +1 compile 0m 42s the patch passed
          +1 javac 0m 42s the patch passed
          +1 checkstyle 0m 25s the patch passed
          +1 mvnsite 0m 49s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          -1 whitespace 0m 0s The patch has 20 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 findbugs 1m 59s the patch passed
          +1 javadoc 0m 53s the patch passed
          -1 unit 85m 48s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          104m 34s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock
          Timed out junit tests org.apache.hadoop.hdfs.TestLeaseRecovery2



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12809604/HDFS-10512.002.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux d19930ffbfb6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8a1dcce
          Default Java 1.8.0_91
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/15742/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15742/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15742/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15742/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15742/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          linyiqun Yiqun Lin added a comment - - edited

          The failed test TestPendingInvalidateBlock is tracked by HDFS-10426; the other failed unit tests are not related. Thanks for the review.

          jojochuang Wei-Chiu Chuang added a comment -

          Hi Yiqun Lin, much appreciated. The patch itself looks good to me.

          However, I have been hesitant to give my non-binding +1, because when this method is called, a block is corrupt. After this patch, VolumeScanner will not terminate prematurely, which is good, but it still won't tell the NameNode to mark the replica corrupt. And that's still a really bad thing to have.

          Any comments, Xiaoyu Yao or other watchers?
          Do you think this patch should go in even though we do not know the root cause of the NPE?

          This is a really bad bug: it causes the pipeline to abort, because the datanode will never transmit a correct block to the downstream pipeline, and the pipeline cannot construct three good replicas.

          BTW, I have found the root cause of the corrupt replica, and I'll file another jira today, but I still think it would be nice to know what causes the NPE.

          jojochuang Wei-Chiu Chuang added a comment -

          HDFS-10587 was filed for the root cause of the bad block that I observed.

          In addition, HDFS-6937 will be obsolete if we can get a good fix for this bug.

          linyiqun Yiqun Lin added a comment -

          Looking at the code in FsDatasetImpl, I see that FsDatasetImpl#getVolume returning null is what causes the NPE:

            @Override
            public synchronized FsVolumeImpl getVolume(final ExtendedBlock b) {
              final ReplicaInfo r =  volumeMap.get(b.getBlockPoolId(), b.getLocalBlock());
              return r != null? (FsVolumeImpl)r.getVolume(): null;
            }
          

          So it means that the ReplicaInfo of the corrupt block had been removed from volumeMap, and there are many cases that trigger volumeMap.remove in FsDatasetImpl. So I would say the case mentioned in HDFS-10587 could lead to this. Can you confirm, Wei-Chiu Chuang?

          jojochuang Wei-Chiu Chuang added a comment - - edited

          Yes, that is correct. To add more color, here's what happened before the NPE:

          A datanode's replica had corruption. It transferred the block to a destination for block recovery. The destination verified the block's checksum, found it corrupt, and terminated the socket. The source datanode got a socket reset exception and invoked the following code snippet:

          BlockSender#sendPacket
          datanode.getBlockScanner().markSuspectBlock(
                        volumeRef.getVolume().getStorageID(),
                        block);
          

          VolumeScanner has an asynchronous thread that checks the integrity of the block. When it finds that the block is bad, it invokes DataNode.reportBadBlocks().

          How about we create another overloaded version of DataNode.reportBadBlocks() that takes two parameters, volume and block, and let VolumeScanner invoke this new method?

          jojochuang Wei-Chiu Chuang added a comment -

          We actually saw the same NPE happen after HDFS-10587 on two independent clusters.

          jojochuang Wei-Chiu Chuang added a comment -

          I think I know why FsDatasetImpl#getVolume() returned null:
          Basically, the block was still there, but its generation stamp had been updated.

          1. The sender got a connection reset and invoked BlockScanner#markSuspectBlock.
          2. Then the block's generation stamp was updated (due to pipeline recovery) while the scanner was running.
          3. The scanner detected a checksum error, but since the block's generation stamp had been updated, FsDatasetImpl couldn't find an exact match for the block.

          So I think we should consider adding an overloaded version of DataNode#reportBadBlocks that uses the volume object that was known when the VolumeScanner object was instantiated.

          Comments?
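The generation-stamp mismatch described above can be sketched the same way (a hypothetical BlockKey record, Java 16+, standing in for the (blockId, genStamp) part of a replica key; not actual HDFS code):

```java
/** Sketch: replica lookups match on the generation stamp too, so after a
 *  pipeline recovery bumps the stamp, a lookup using the scanner's older
 *  ExtendedBlock misses and the getVolume() stand-in returns null. */
public class GenStampMismatch {
  record BlockKey(long blockId, long genStamp) {}

  static final java.util.Map<BlockKey, String> replicaMap =
      new java.util.HashMap<>();

  static String getVolume(long blockId, long genStamp) {
    return replicaMap.get(new BlockKey(blockId, genStamp));
  }

  public static void main(String[] args) {
    replicaMap.put(new BlockKey(1170134484L, 96468685L), "/dfs/dn");

    // Pipeline recovery: same block id, new generation stamp.
    replicaMap.remove(new BlockKey(1170134484L, 96468685L));
    replicaMap.put(new BlockKey(1170134484L, 96468686L), "/dfs/dn");

    // The scanner still holds the old stamp, so the lookup misses.
    System.out.println(getVolume(1170134484L, 96468685L)); // prints null
  }
}
```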

          jojochuang Wei-Chiu Chuang added a comment -

          In short, this is a race condition between VolumeScanner and FsDatasetImpl.

          jojochuang Wei-Chiu Chuang added a comment -

          Hi Yiqun Lin and Xiaoyu Yao,
          what do you think about this patch:

          diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          index d782e85..0b3738a 100644
          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          @@ -1152,6 +1152,12 @@ public void reportBadBlocks(ExtendedBlock block) throws IOException{
                   block, volume.getStorageID(), volume.getStorageType());
             }
          
          +  public void reportBadBlocks(ExtendedBlock block, FsVolumeSpi volume ) throws IOException{
          +    BPOfferService bpos = getBPOSForBlock(block);
          +    bpos.reportBadBlocks(
          +        block, volume.getStorageID(), volume.getStorageType());
          +  }
          +
             /**
              * Report a bad block on another DN (eg if we received a corrupt replica
              * from a remote host).
          diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java
          index d0dc9ed..7a9ecf2 100644
          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java
          @@ -283,7 +283,7 @@ public void handle(ExtendedBlock block, IOException e) {
                 }
                 LOG.warn("Reporting bad {} on {}", block, volume.getBasePath());
                 try {
          -        scanner.datanode.reportBadBlocks(block);
          +        scanner.datanode.reportBadBlocks(block, volume);
                 } catch (IOException ie) {
                   // This is bad, but not bad enough to shut down the scanner.
                   LOG.warn("Cannot report bad " + block.getBlockId(), e);
          
          

          Thanks

          jojochuang Wei-Chiu Chuang added a comment -

          Hi Yiqun Lin I just realized this might not be cool for me to post a patch, but we've been seeing a few block corruption issues and would like to push forward a quick fix. I hope you don't mind.

          ajisakaa Akira Ajisaka added a comment -

          The fix looks good. Can we reuse the code so that reportBadBlocks(ExtendedBlock) calls reportBadBlocks(ExtendedBlock, FsVolumeSpi)?
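          A minimal compilable sketch of that delegation, with stub types standing in for the real classes (the `*Stub` names and method bodies are illustrative only, not the HDFS API; the real methods also throw IOException and resolve the volume via the dataset):

```java
// Stub standing in for FsVolumeSpi: just enough surface for the example.
interface FsVolumeSpiStub {
    String getStorageID();
    String getStorageType();
}

// Stub standing in for BPOfferService: records the last report it received.
class BPOfferServiceStub {
    String lastReport;
    void reportBadBlocks(long blockId, String storageId, String storageType) {
        lastReport = "block " + blockId + " on " + storageId + " (" + storageType + ")";
    }
}

public class ReportBadBlocksSketch {
    private final BPOfferServiceStub bpos = new BPOfferServiceStub();

    // One-arg form resolves the volume itself, guards against a null result,
    // then delegates to the two-arg form -- the reuse Akira suggests.
    public void reportBadBlocks(long blockId) {
        FsVolumeSpiStub volume = lookupVolume(blockId);
        if (volume == null) {
            // Lookup can miss (e.g. the genstamp changed); skip instead of NPE.
            System.out.println("Cannot find volume for block " + blockId);
            return;
        }
        reportBadBlocks(blockId, volume);
    }

    // Two-arg form trusts the caller-supplied volume (VolumeScanner's case).
    public void reportBadBlocks(long blockId, FsVolumeSpiStub volume) {
        bpos.reportBadBlocks(blockId, volume.getStorageID(), volume.getStorageType());
    }

    private FsVolumeSpiStub lookupVolume(long blockId) {
        return null; // simulate the stale-genstamp miss from this issue
    }

    public String lastReport() { return bpos.lastReport; }

    public static void main(String[] args) {
        ReportBadBlocksSketch dn = new ReportBadBlocksSketch();
        dn.reportBadBlocks(42L); // one-arg path: lookup misses, no NPE
        FsVolumeSpiStub vol = new FsVolumeSpiStub() {
            public String getStorageID() { return "DS-89b72832"; }
            public String getStorageType() { return "DISK"; }
        };
        dn.reportBadBlocks(42L, vol); // two-arg path: report goes through
        System.out.println(dn.lastReport());
    }
}
```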

          linyiqun Yiqun Lin added a comment -

          Thanks Wei-Chiu for providing the patch; it also looks good to me. You can assign this jira to yourself and do a quick fix, I don't mind. Thanks again, Wei-Chiu, for all the work on this issue.

          jojochuang Wei-Chiu Chuang added a comment -

          Attached a patch based on Yiqun's original patch. Also updated the original reportBadBlocks to invoke the overloaded reportBadBlocks.

          Additionally, changed the caller of reportBadBlocks(ExtendedBlock block) to use reportBadBlocks(ExtendedBlock block, FsVolumeSpi volume) to avoid the potential race condition.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 36s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 8m 44s trunk passed
          +1 compile 0m 45s trunk passed
          +1 checkstyle 0m 30s trunk passed
          +1 mvnsite 0m 54s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 47s trunk passed
          +1 javadoc 1m 1s trunk passed
          +1 mvninstall 0m 51s the patch passed
          +1 compile 0m 44s the patch passed
          +1 javac 0m 44s the patch passed
          +1 checkstyle 0m 30s the patch passed
          +1 mvnsite 0m 54s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 51s the patch passed
          +1 javadoc 0m 55s the patch passed
          -1 unit 72m 29s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          94m 37s



          Reason Tests
          Failed junit tests hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12815775/HDFS-10512.004.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5406e35f7b21 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c25021f
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15961/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15961/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15961/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          ajisakaa Akira Ajisaka added a comment -

          Thanks Wei-Chiu Chuang for updating the patch! Would you document that volume must not be null in reportBadBlocks(ExtendedBlock block, FsVolumeSpi volume)?
          In addition, can we add a regression test for this issue?
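          The requested note could be a short Javadoc constraint on the new overload. A sketch of possible wording (not the committed text; the method body is elided):

```java
/**
 * Report a bad block which is hosted on the local DataNode.
 *
 * @param block  the bad block
 * @param volume the volume that the block is stored in; must not be null
 */
public void reportBadBlocks(ExtendedBlock block, FsVolumeSpi volume)
    throws IOException {
  ...
}
```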

          linyiqun Yiqun Lin added a comment -

          Attached a patch based on Wei-Chiu Chuang's v04 patch. The new patch addresses the comment that Akira Ajisaka mentioned. I am looking forward to seeing your responses, Wei-Chiu Chuang and Akira Ajisaka.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 36s trunk passed
          +1 compile 0m 44s trunk passed
          +1 checkstyle 0m 31s trunk passed
          +1 mvnsite 0m 51s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 40s trunk passed
          +1 javadoc 0m 56s trunk passed
          +1 mvninstall 0m 48s the patch passed
          +1 compile 0m 42s the patch passed
          +1 javac 0m 42s the patch passed
          +1 checkstyle 0m 27s the patch passed
          +1 mvnsite 0m 49s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 48s the patch passed
          +1 javadoc 0m 56s the patch passed
          -1 unit 75m 3s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          94m 19s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestLeaseRecovery2
            hadoop.hdfs.server.balancer.TestBalancer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12816387/HDFS-10512.005.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a268c8b2d39b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d792a90
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15990/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15990/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15990/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          linyiqun Yiqun Lin added a comment -

          The two failed tests are unrelated to the patch. The TestBalancer failure is tracked by HDFS-10336.

          ajisakaa Akira Ajisaka added a comment -

          +1 pending Wei-Chiu Chuang's response. Thanks Yiqun Lin for updating the patch.

          yzhangal Yongjun Zhang added a comment -

          Thanks Wei-Chiu Chuang and Yiqun Lin for working on the issue and Akira Ajisaka for the review.

          I looked at the patch, and have one question:

          The patch changed from

                  datanode.reportBadBlocks(new ExtendedBlock(bpid, corruptBlock));
          

          to

                  datanode.reportBadBlocks(new ExtendedBlock(bpid, memBlockInfo),
                      memBlockInfo.getVolume());
          

          where the second parameter of constructor ExtendedBlock was changed from corruptBlock to memBlockInfo. As we know, the block size recorded in corruptBlock and memBlockInfo are different per the following code:

                // Compare block size
                if (memBlockInfo.getNumBytes() != memFile.length()) {
                  // Update the length based on the block file
                  corruptBlock = new Block(memBlockInfo);
                  LOG.warn("Updating size of block " + blockId + " from "
                      + memBlockInfo.getNumBytes() + " to " + memFile.length());
                  memBlockInfo.setNumBytes(memFile.length());
                }
          

          When reporting the bad block, do we intend to report the new length or the old length back to NN (the old code reported the old length, the patch reported the new length)?

          This might not be a real issue; I just want to point it out. Is it an intended change? I guess passing either corruptBlock or memBlockInfo as the second parameter of the ExtendedBlock constructor is fine.

          Would you guys please comment?

          Other than that, the patch looks good to me.

          Thanks.

          linyiqun Yiqun Lin added a comment -

          Thanks Yongjun Zhang for the review. A few comments from me:

          where the second parameter of constructor ExtendedBlock was changed from corruptBlock to memBlockInfo

          The reason why we use memBlockInfo is that we want to get the corrupt block's volume via memBlockInfo.getVolume(), while corruptBlock does not have that method.

          But the code datanode.reportBadBlocks(new ExtendedBlock(bpid, memBlockInfo), ..) was not an intended change in this issue. I think it will be better to report the old length for a bad block and not change the current logic. That change was based on Wei-Chiu Chuang's v04 patch, so I'd like to wait for Wei-Chiu Chuang's response.

          Finally, I posted a new patch addressing this change.
          From

          datanode.reportBadBlocks(new ExtendedBlock(bpid, memBlockInfo),
                      memBlockInfo.getVolume());
          

          To

          datanode.reportBadBlocks(new ExtendedBlock(bpid, corruptBlock),
                      memBlockInfo.getVolume());
          
          yzhangal Yongjun Zhang added a comment -

          Thanks Yiqun Lin for the updated rev; I agree with this assessment. +1 pending jenkins.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 15s trunk passed
          +1 compile 0m 49s trunk passed
          +1 checkstyle 0m 31s trunk passed
          +1 mvnsite 0m 55s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 1m 44s trunk passed
          +1 javadoc 1m 0s trunk passed
          +1 mvninstall 0m 50s the patch passed
          +1 compile 0m 45s the patch passed
          +1 javac 0m 45s the patch passed
          +1 checkstyle 0m 28s the patch passed
          +1 mvnsite 0m 54s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 48s the patch passed
          +1 javadoc 0m 59s the patch passed
          -1 unit 74m 2s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          94m 35s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12816777/HDFS-10512.006.patch
          JIRA Issue HDFS-10512
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3f4ea5af3b72 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 673e5e0
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/16005/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16005/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16005/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

          I reran the failed test several times with and without the patch here,

          org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot

          and it sometimes fails and sometimes succeeds. I reported HDFS-10603 for that.

          Hi Akira Ajisaka, do you have further comment on the patch? Otherwise I hope we can commit this by end of today.

          Thanks.

          ajisakaa Akira Ajisaka added a comment -

          +1 for the latest patch. Thanks Yiqun and Yongjun.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10072 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10072/)
          HDFS-10512. VolumeScanner may terminate due to NPE in (yzhang: rev da6f1b88dd47e22b24d44f6fc8bbee73e85746f7)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java
          yzhangal Yongjun Zhang added a comment -

          Committed to trunk, branch-2 and branch-2.8.

          Thanks Wei-Chiu Chuang and Yiqun Lin for working on the issue and Akira Ajisaka for the review.

          linyiqun Yiqun Lin added a comment -

          Thanks a lot Yongjun Zhang for review and commit!

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks Yiqun Lin, Yongjun Zhang and Akira Ajisaka for the collaboration!

          jojochuang Wei-Chiu Chuang added a comment -

          I think this is a pretty critical fix, so cherry-picking this to 2.7.x.


            People

            • Assignee: linyiqun Yiqun Lin
            • Reporter: jojochuang Wei-Chiu Chuang
            • Votes: 0
            • Watchers: 12