Hadoop HDFS
HDFS-1371

One bad node can incorrectly flag many files as corrupt

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: hdfs-client, namenode
    • Labels:
      None
    • Environment:

      yahoo internal version
      [knoguchi@gwgd4003 ~]$ hadoop version
      Hadoop 0.20.104.3.1007030707

    • Hadoop Flags:
      Reviewed

      Description

      On our cluster, 12 files were reported as corrupt by fsck even though the replicas on the datanodes were healthy.
      Turns out that all the replicas (12 files x 3 replicas per file) were reported corrupt from one node.

      Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat) without any problems.

      1. HDFS-1371.04252011.patch
        25 kB
        Tanping Wang
      2. HDFS-1371.0503.patch
        21 kB
        Tanping Wang
      3. HDFS-1371.0513.patch
        27 kB
        Tanping Wang
      4. HDFS-1371.0515.patch
        27 kB
        Tanping Wang
      5. HDFS-1371.0517.patch
        27 kB
        Tanping Wang
      6. HDFS-1371.0517.2.patch
        27 kB
        Tanping Wang
      7. HDFS-1371.0518.patch
        13 kB
        Tanping Wang
      8. HDFS-1371.0518.2.patch
        27 kB
        Tanping Wang

        Activity

        Todd Lipcon added a comment -

        Just to clarify, when you say "reported corrupt" you mean from client nodes, right?

        Tsz Wo Nicholas Sze added a comment -

        I have checked the code and discussed it with Koji.

        (a) When the DFSClient detects a corrupt replica, it reports it to the NN. The NN then blindly marks the replica as corrupt in the BlocksMap.

        (b) When the NN receives a ClientProtocol.getBlockLocations(..) RPC call, it gets all the replicas from the BlocksMap. If there are one or more good replicas, the NN returns only the good replicas. If all replicas are corrupt, it returns all (corrupt) replicas and sets LocatedBlock.corrupt = true.

        (c) When the DFSClient gets a LocatedBlock from the NN, it does not care whether LocatedBlock.corrupt is true or false.

        The flaws are in (a) and (c). The problem is that if the DFSClient in (a) is bad (e.g. a bad machine), the NN may incorrectly mark the replicas as corrupt. Then, when another DFSClient tries to read the block, it receives a LocatedBlock with LocatedBlock.corrupt = true but still uses the replicas because of (c). Luckily, the double negative cancels out, so the read succeeds. However, the NN BlocksMap information is incorrect and will not be fixed until the NN restarts.
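
        A minimal sketch of the read path described in (a)-(c), assuming simplified, hypothetical helpers (onChecksumError, locate, replicasNotMarkedCorrupt, allReplicas, chooseReplica) and an assumed setter for the corrupt flag; only reportBadBlocks, LocatedBlock, and its corrupt flag come from the discussion above:

          // (a) Client side: a checksum failure is reported straight to the NN,
          // which blindly records the replica as corrupt in its BlocksMap.
          void onChecksumError(Block blk, DatanodeInfo dn) throws IOException {
            namenode.reportBadBlocks(new LocatedBlock[] {
                new LocatedBlock(blk, new DatanodeInfo[] { dn }) });
          }

          // (b) NN side: corrupt replicas are filtered out of getBlockLocations(..)
          // unless every replica is marked corrupt.
          LocatedBlock locate(Block blk) {
            DatanodeInfo[] good = replicasNotMarkedCorrupt(blk);       // hypothetical helper
            if (good.length > 0) {
              return new LocatedBlock(blk, good);
            }
            LocatedBlock lb = new LocatedBlock(blk, allReplicas(blk)); // hypothetical helper
            lb.setCorrupt(true);  // assumed setter for the corrupt flag
            return lb;
          }

          // (c) Client side: the corrupt flag is never consulted, so the read
          // proceeds (and, in this incident, succeeded) anyway.
          DatanodeInfo chooseReplica(LocatedBlock lb) {
            return lb.getLocations()[0];
          }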

        Koji Noguchi added a comment -

        (you guys are too fast. I wanted the description to be short and was going to paste the logs afterwards... )

        Picking one such file: /myfile/part-00145.gz blk_-1426587446408804113_970819282

        Namenode log showing

        2010-08-31 10:47:56,258 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX..220:1004 by /ZZ.YY.XX.246
        2010-08-31 10:47:56,290 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX..252:1004 by /ZZ.YY.XX.246
        2010-08-31 10:47:56,489 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX..107:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:00,508 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:00,554 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:03,934 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:03,949 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:03,971 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:07,986 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:08,257 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
        2010-08-31 10:49:08,895 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
        

        User Tasklogs on ZZ.YY.XX.246 showing

        [root@ZZ.YY.XX.246 ~]# find /my/mapred/userlogs/ -type f -exec grep 1426587446408804113 \{\} \; -print
        org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 222720
        2010-08-31 10:47:56,256 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.220:1004 at 222720
        org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 103936
        2010-08-31 10:47:56,284 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.252:1004 at 103936
        org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 250368
        2010-08-31 10:47:56,464 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.107:1004 at 250368
        2010-08-31 10:47:56,490 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-1426587446408804113_970819282 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
        

        This was consistent across all 12 files reported as corrupt; all reports came from the same node, ZZ.YY.XX.246.

        When trying to pull this file from another, healthy node, to my surprise it didn't fail.

        [knoguchi@gwgd4003 ~]$ hadoop dfs -ls /myfile/part-00145.gz
        Found 1 items
        -rw-r--r--   3 user1 users   67771377 2010-08-31 06:46 /myfile/part-00145.gz
        
        [knoguchi@gwgd4003 ~]$ hadoop fsck /myfile/part-00145.gz
        .
        /myfile/part-00145.gz: CORRUPT block blk_-1426587446408804113
        Status: CORRUPT
         Total size:    67771377 B
         Total dirs:    0
         Total files:   1
         Total blocks (validated):      1 (avg. block size 67771377 B)
          ********************************
          CORRUPT FILES:        1
          CORRUPT BLOCKS:       1
          ********************************
         Minimally replicated blocks:   1 (100.0 %)
         Over-replicated blocks:        0 (0.0 %)
         Under-replicated blocks:       0 (0.0 %)
         Mis-replicated blocks:         0 (0.0 %)
         Default replication factor:    3
         Average block replication:     3.0
         Corrupt blocks:                1
         Missing replicas:              0 (0.0 %)
        
        
        The filesystem under path '/myfile/part-00145.gz' is CORRUPT
        [knoguchi@gwgd4003 ~]$
        [knoguchi@gwgd4003 ~]$ hadoop dfs -get /myfile/part-00145.gz /tmp
        [knoguchi@gwgd4003 ~]$ echo $?
        0
        [knoguchi@gwgd4003 ~]$ ls -l /tmp/part-00145.gz
        -rw-r--r-- 1 knoguchi users 67771377 Sep  2 21:04 /tmp/part-00145.gz
        [knoguchi@gwgd4003 ~]$
        
        Allen Wittenauer added a comment -

        Is this actually apache hadoop 0.20 or 0.21 or trunk or what?

        Yahoo!'s internal hadoop versioning doesn't really match up to what everyone else is using...

        Tsz Wo Nicholas Sze added a comment -

        > Is this actually apache hadoop 0.20 or 0.21 or trunk or what?

        I have not checked the code in detail. I believe this problem exists in Apache Hadoop 0.20, 0.21, and trunk, and even in some earlier versions.

        Hairong Kuang added a comment -

        I think the fact that readers do not check the corrupt flag is done on purpose, and I like it.

        What's problematic is letting the client report corruption to the NameNode. I think a better solution could be for the client to notify the datanode of the possible corruption and let the DN double-check and report to the NN.

        Tsz Wo Nicholas Sze added a comment -

        > I think the fact that readers do not check the corrupt flag is done on purpose, and I like it.

        I agree that the behavior is okay. It would probably be better to add a warning message when LocatedBlock.corrupt == true. Or the client should report a "non-corrupted block" to the NN after it successfully reads the block.

        If this was really done on purpose and is not a bug, do we have documentation about this "feature"? I have not found any javadoc mentioning it.

        Todd Lipcon added a comment -

        I agree that "c" seems like an "incorrect feature". The DFSClient should have to have
        a configuration set to say "allow reading corrupt blocks" in my opinion.

        Also I think Hairong's solution makes sense - the client should send
        OP_STATUS_ERROR_CHECKSUM back to the DN, and the DN could then add it
        to the front of the DatanodeBlockScanner queue.

        It would be even better if the client reported the offset of the supposed checksum error
        so we could verify it immediately rather than scanning the full 64M block, but that's
        more of a protocol change.
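
        A minimal, hypothetical sketch of the DN-side handling suggested here; scheduleAtHead and the offsetHint parameter are assumed names, not existing APIs, while ERROR_CHECKSUM is the data transfer status mentioned later in this thread:

          // Hypothetical DN-side handler: a client that hit a checksum error tells
          // the DN directly, and the DN bumps the block to the front of its scan queue.
          void handleClientChecksumReport(DataTransferProtocol.Status status,
                                          Block blk, long offsetHint) {
            if (status == DataTransferProtocol.Status.ERROR_CHECKSUM) {
              LOG.info("Client reported checksum error for " + blk
                  + " near offset " + offsetHint);
              blockScanner.scheduleAtHead(blk);   // assumed method, not an existing API
              // With offsetHint, only the suspect chunk would need re-verification
              // instead of the full 64 MB block, but that needs a protocol change.
            }
          }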

        Konstantin Shvachko added a comment -

        We discussed this with Rob and Nicholas.

        • We should rather leave the notification logic in place as is.
          I think that if there is any suspicion that some data is bad, it is better that the NN knows about it right away,
          rather than first going through an additional verification procedure with the data-nodes.
          Bad clients are rare and we should optimize for the regular case.
        • We should still do something about bad clients marking good blocks as corrupt.
          The proposal is to add the verification logic to the name-node (a sketch of this idea follows below).
          When the NN finds that all replicas of a block are corrupt, it requests the respective data-nodes
          to verify their replicas. The DNs verify and either confirm the corruption or repair the replica state on the name-node.
        • The NN should not worry until all replicas are corrupt, as the general replication logic should recover the block.
        • This will minimize changes and utilize the existing replica verification and restoration procedures.
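
        A minimal, hypothetical sketch of the verification idea above; countLiveReplicas, datanodesWithReplica, and validateRequests are assumed names, and DNA_VALIDATE is the command name proposed later in this thread:

          // NN side: record the reported corruption, but when it would leave no
          // healthy replica, ask the data-nodes to re-verify instead of trusting
          // a single client for total loss of the block.
          void markReplicaCorrupt(Block blk, DatanodeDescriptor dn) {
            corruptReplicas.addToCorruptReplicasMap(blk, dn);
            if (countLiveReplicas(blk) == 0) {                          // hypothetical helper
              for (DatanodeDescriptor d : datanodesWithReplica(blk)) {  // hypothetical helper
                // The queued request is sent as a DNA_VALIDATE command on the next
                // heartbeat; the DN either confirms the corruption or corrects the
                // NN's replica state.
                validateRequests.add(d, blk);
              }
            }
          }
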
        Allen Wittenauer added a comment -

        Is it possible to DoS the NN by declaring all blocks and their replicas as corrupt?

        dhruba borthakur added a comment -

        > I think a better solution could be for the client to notify the datanode of the possible corruption and let the DN double-check and report to the NN.

        Any drawbacks to this approach? It keeps namenode complexity down, prevents DoS, etc.

        Tanping Wang added a comment -

        I am working on this bug. I will take Konstantin's last comment as the approach to fix it:

        1) Add a new validate-block DatanodeCommand for the data node.
        2) The namenode sends the validation request to the data node via the heartbeat response:

        public DatanodeCommand[] sendHeartbeat(...)
        

        3) The data node receives the validate-block command, processes it, and asks the block scanner to validate the block:

          private boolean processCommand(DatanodeCommand cmd) {
            switch (cmd.getAction()) {
            case DatanodeProtocol.DNA_VALIDATE:
              // Hand the suspect block(s) to the block scanner for verification.
              if (blockScanner != null) {
                blockScanner.verifyBlock(toVerify);
              }
              break;
            }
          }
        

        4) If the block is corrupted, repair it. If the block is good, tell the name node to mark the block as good.

        Any comments?

        Jitendra Nath Pandey added a comment -

        > Any drawbacks to this approach? It keeps namenode complexity down, prevents DoS, etc.

        The only drawback I can think of is that the namenode will continue to serve this replica to clients until the datanode reports that it is corrupt. But there is already a window of time during which multiple clients can try to read a corrupt replica. This approach will increase that window (probably by a few seconds). It seems to be a justified cost to avoid DoS and additional complexity in the namenode.

        dhruba borthakur added a comment -

        I agree with Jitendra. I think a good solution could be for the client to directly notify the datanode of the possible corruption and let the DN double-check and report to the NN.

        Tanping Wang added a comment -

        I also agree that having the client directly notify the data node is a good solution. I will implement this accordingly. The client will send OP_STATUS_CHECKSUM_ERROR to the datanode to indicate that there was a potential checksum error.

        Tanping Wang added a comment -

        Just to correct myself: the client will send DataTransferProtocol.Status.ERROR_CHECKSUM to the data node to indicate that there was a potential checksum error.

        Koji Noguchi added a comment -

        Can we take an approach similar to mapred blacklisting?
        The JobTracker only considers blacklisting a TaskTracker when the retried attempt succeeds on another node.
        Here, can we change the logic on the dfsclient side so that it only reports a corrupted block when a retried read from another datanode succeeds?

        Tanping Wang added a comment -

        Just to summarize, we have multiple options to solve this problem now:
        1) After the DFSClient detects a bad replica on one data node, DN1, it reads from the next data node, DN2. If the client can read at least one good replica of the block, it reports to the NN that DN1 has a bad replica of the block. If the client cannot successfully read any replica of the block, it does not report anything to the NN. If all replicas of the block are bad, there is nothing we can do to recover anyway. This is a simple change that only happens in the DFSClient.

        2) After the DFSClient detects a bad block replica, it reports back to that DN directly by sending an OP_STATUS_CHECKSUM_ERROR message. The DN puts these blocks at the head of the block scanner queue to verify. If the replica is bad, it is repaired the way the block scanner does today. This way no traffic is driven to the NN. The logic changes are in the block scanner, plus adding an acknowledgement message between the DN and the client.

        3) The client reports to the NN; once the NN finds that ALL replicas are bad, the NN asks the DNs to verify. One drawback is that a bad client can keep reporting to the NN, and the NN can be overworked.

        Tanping Wang added a comment -

        Had a discussion with multiple people; after a thoughtful debate, the conclusion we reached is that we will take the solution Koji suggested:

        1) With a replication factor greater than 1, if the DFS client detects corruption of a block replica, it continues trying to read another replica from another data node. If the DFS client can read at least one good replica of the block, it reports all the bad block replicas, along with their data node information, to the name node. If the DFS client cannot read even one replica, it does not report anything to the name node.

        2) With a replication factor of 1, the DFS client reports back to the name node if it detects block corruption.

        This change only happens on the client side. The existing logic remains the same on the server end.

        We take this approach because:
        1) We consider a bad client to be a client with "good intentions" that is handicapped by some physical problem (such as bad memory), not a malicious client.
        2) If a client cannot read even one good replica, it could be a handicapped client. In the worst-case scenario, where all replicas of the block are corrupted, no repair work can be done even if this is reported back to the name node. Moreover, based on Koji's experience, it has never been the case that all replicas of a block are corrupted in our production environment in the past years.
        3) A handicapped client is extremely rare. We do not want to put heavy verification logic on the name-node/data-node end, nor do we want a protocol change just to verify blocks for this extremely rare case of a handicapped client.

        With that said, having the change only in the client read logic takes care of handicapped clients. It is a lightweight solution with no need for a protocol change.
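
        A minimal, hypothetical sketch of this client-side rule; badReplicas, readFromReplica, and reportBadReplicas are illustrative names, not the ones used in the patch:

          byte[] readBlock(LocatedBlock lb) throws IOException {
            Map<Block, Set<DatanodeInfo>> badReplicas = new HashMap<Block, Set<DatanodeInfo>>();
            for (DatanodeInfo dn : lb.getLocations()) {
              try {
                byte[] data = readFromReplica(lb.getBlock(), dn);   // hypothetical helper
                // At least one good replica was read, so the earlier checksum
                // failures can now be reported to the NN (rule 1).
                if (!badReplicas.isEmpty()) {
                  reportBadReplicas(badReplicas);                   // hypothetical helper
                }
                return data;
              } catch (ChecksumException ce) {
                // Remember the suspect replica, but do not report it yet.
                Set<DatanodeInfo> dns = badReplicas.get(lb.getBlock());
                if (dns == null) {
                  dns = new HashSet<DatanodeInfo>();
                  badReplicas.put(lb.getBlock(), dns);
                }
                dns.add(dn);
              }
            }
            // Rule 2: with a single replica there is nothing to cross-check against,
            // so report the corruption to the NN right away.
            if (lb.getLocations().length == 1 && !badReplicas.isEmpty()) {
              reportBadReplicas(badReplicas);
            }
            throw new IOException("Could not read block " + lb.getBlock() + " from any node");
          }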

        Tanping Wang added a comment -

        Disable the block scanner in TestFsck#testCorruptBlock by adding
        conf.setInt(DFSConfigKeys.DFS_DATANODE_SCAN_PERIOD_HOURS_KEY, -1);
        so that the block scanner does not interfere by reporting the corrupted blocks itself.

        Jitendra Nath Pandey added a comment -

        TestClientReportBadBlock
        1. Is there a specific reason for creating a dfsClient instead of using dfs.dfs? dfs.dfs is also used in one place, though.
        2. No need to separately call cluster.shutdownDataNodes, cluster.shutdown calls it.
        3. Please don't use System.out.println. You can use LOG instead.

        TestFsck.java
        1. testCorruptBlock has become kind of useless as it doesn't test whether fsck can catch corrupt blocks or not. Please try to see if it is possible to actually update the namenode with corrupt blocks which fsck can detect.

        DFSInputStream.java
        1. Instead of storing a new field dataNodeCount, we could just store currentLocatedBlock. We can also remove currentBlock and get both
        currentBlock and dataNodeCount from currentLocatedBlock. Does that make sense?
        2. Is it possible to avoid declaring corruptedBlockMap as a field in DFSInputStream? We are checking corrupt replicas for each block and reporting them, but this map has a lifecycle spanning the entire DFSInputStream.

        Konstantin Shvachko added a comment -

        Projecting the JT/TT blacklisting logic onto replicas doesn't work very well when all replicas are corrupt. The client will not report in this case, although it is critical. The argument that it has never happened in practice is not very strong, as it can happen at any time.

        Tanping Wang added a comment -

        In the case where all replicas are corrupted, there is nothing the cluster can do to recover. Reporting this to the name node or not would not make a difference. We can argue that if an operator knows about the block corruption, he can choose to physically copy the file from somewhere else. However, in real life, based on Koji's experience over the past couple of years, it has never been the case on Yahoo's clusters that all replicas of a block are corrupted. Beyond this point, we do not want to rely too much on a client to report block corruption, and want to restrict the solution to just dealing with a handicapped client.

        Konstantin Shvachko added a comment -

        I understand you are trying to avoid writing verification logic on the NN, and you don't want to trust clients, as they can be wrong. I agree with the first part of your design, which reports a bad replica if the client can read at least one. I disagree that the failure of all replicas should be concealed from the NN.
        Can we do this: if a client reports all replicas corrupt, then the NN chooses one of the remaining replicas, which is presumed healthy, and marks it as corrupt. In this case one bad client cannot immediately corrupt the entire block, but if the block is really corrupt then eventually all replicas will be marked corrupt by different clients reading it.
        As I said, not seeing something in the past doesn't mean you should not plan for it. In real life things may change quickly. You can get a shipment of faulty drives or buggy software (not necessarily in Hadoop). With your solution you will not even know there is a problem.
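
        One possible reading of this suggestion, as a minimal hypothetical sketch; isReplicaCorrupt is assumed to exist on the corrupt-replicas map, and the surrounding flow is illustrative only:

          // When a client claims that every replica is corrupt, accept the claim for
          // at most one additional replica per report, so a single bad client cannot
          // poison the whole block at once, while repeated reports from independent
          // clients will eventually mark all replicas of a genuinely corrupt block.
          void onAllReplicasReportedCorrupt(Block blk, DatanodeDescriptor[] reported) {
            for (DatanodeDescriptor dn : reported) {
              if (!corruptReplicas.isReplicaCorrupt(blk, dn)) {   // assumed check
                corruptReplicas.addToCorruptReplicasMap(blk, dn);
                break;
              }
            }
          }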

        Tanping Wang added a comment -

        Regarding the scenario mentioned, where a client reports all replicas corrupted and the NN marks one of the healthy replicas as corrupted: if a client retries reading the same file, it can easily get all block replicas marked as corrupted. It is a good idea to completely solve the problem by either adding NN verification logic or letting the client notify the DN and having the DN verify. What about solving that in a separate jira, but keeping this solution as is for now?

        Tanping Wang added a comment -

        Upload a patch to address the review comments.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479061/HDFS-1371.0513.patch
        against trunk revision 1102513.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.cli.TestHDFSCLI
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.tools.TestJMXGet

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/514//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/514//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/514//console

        This message is automatically generated.

        Jitendra Nath Pandey added a comment -

        DFSInputStream.java :
        1. Please don't remove the public method getCurrentBlock. Change in DFSClient won't be required if we don't change the method. Also, there is no need to introduce the public method getCurrentLocatedBlock().
        2. javadoc comment for reportCheckSumFailure and addIntoCorruptedBlockMap doesn't list all parameters.
        3. The corruptedBlockMap is created outside the loop in the read methods; therefore, after reporting the checksum failures you should still clear the map.
        4. TestClientReportBadBlock : The comment "The order of data nodes..." should be moved before the loop.

        Minor:
        5. TestFsck.java: Indentation starting at comment "// corrupt replicas" has two extra spaces.

        Tanping Wang added a comment -

        Upload a new patch to address review comments.

        Jitendra Nath Pandey added a comment -

        +1 for the patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479526/HDFS-1371.0517.patch
        against trunk revision 1104579.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.tools.TestJMXGet

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/548//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/548//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/548//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479533/HDFS-1371.0517.2.patch
        against trunk revision 1104579.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.tools.TestJMXGet

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/550//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/550//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/550//console

        This message is automatically generated.

        Tanping Wang added a comment -

        Fix the warning.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479660/HDFS-1371.0518.patch
        against trunk revision 1124364.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.cli.TestHDFSCLI
        org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
        org.apache.hadoop.hdfs.server.namenode.TestMetaSave
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.hdfs.TestInjectionForSimulatedStorage
        org.apache.hadoop.tools.TestJMXGet

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/566//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/566//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/566//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479689/HDFS-1371.0518.2.patch
        against trunk revision 1125057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader
        org.apache.hadoop.hdfs.TestHDFSTrash

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/588//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/588//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/588//console

        This message is automatically generated.

        Tanping Wang added a comment -

        These three tests are already failing on trunk.

        Jitendra Nath Pandey added a comment -

        +1

        Jitendra Nath Pandey added a comment -

        I have committed this. Thanks to Tanping!

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #677 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/677/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)


          People

          • Assignee:
            Tanping Wang
            Reporter:
            Koji Noguchi
          • Votes:
            0
            Watchers:
            9
