Hadoop HDFS / HDFS-1592

Datanode startup doesn't honor volumes.tolerated

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: 0.20.204.0, 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Datanode startup doesn't honor volumes.tolerated in the Hadoop 0.20 release.
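
      For reference, the setting under discussion is configured in hdfs-site.xml. The property name dfs.datanode.failed.volumes.tolerated is the one cited later in this issue; the fragment below is only an illustrative sketch:

```xml
<!-- Illustrative hdfs-site.xml fragment: tolerate one failed data volume
     before the Datanode refuses to offer service. -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```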

      1. HDFS-1592-rel20.patch
        4 kB
        Bharath Mundlapudi
      2. HDFS-1592-1.patch
        4 kB
        Bharath Mundlapudi
      3. HDFS-1592-2.patch
        5 kB
        Bharath Mundlapudi
      4. HDFS-1592-3.patch
        6 kB
        Bharath Mundlapudi
      5. HDFS-1592-4.patch
        9 kB
        Bharath Mundlapudi
      6. HDFS-1592-5.patch
        8 kB
        Bharath Mundlapudi

          Activity

          Allen Wittenauer added a comment -

          The feature was added in 0.21, so it isn't too surprising that it doesn't work in 0.20....

          Bharath Mundlapudi added a comment -

          Attached the patch for release 0.20.

          This patch takes care of two things:
          1. When the Datanode is started, it checks that the number of tolerated volume failures is honored.
          2. The number of required volumes is calculated correctly at startup.

          Eli Collins added a comment -

          Closed HDFS-1849 which is a dupe.

          Bharath, this needs to be fixed on trunk as well. Are you submitting a patch for trunk too?

          Bharath Mundlapudi added a comment -

          Yes, I will be submitting for the trunk too.

          Bharath Mundlapudi added a comment -

          Attaching the patch for the 0.23 version.

          Jitendra Nath Pandey added a comment -

          1. There seems to be a redundancy in the following conditions:
          (volsFailed > volFailuresTolerated)
          and (validVolsRequired > storage.getNumStorageDirs()).

          Since both checks throw the same exception, I recommend combining them into one condition.

          2. Please don't remove the DataNode.LOG.error.
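
          The review point can be sketched with a standalone method. The names (volsFailed, volFailuresTolerated, validVolsRequired) follow this discussion and the exception message quoted later in the issue, and the sketch assumes validVolsRequired = volsConfigured - volFailuresTolerated; it is illustrative only, not the actual FSDataset code.

```java
// Illustrative sketch of the Datanode startup volume check. The two original
// conditions, (volsFailed > volFailuresTolerated) and
// (validVolsRequired > validVols), are equivalent under the assumption
// validVolsRequired = volsConfigured - volFailuresTolerated, so a single
// condition suffices, which is the redundancy noted in the review.
public class VolumeCheckSketch {
    /** Returns true when the Datanode may start with the given volume counts. */
    static boolean hasSufficientVolumes(int volsConfigured,
                                        int volsFailed,
                                        int volFailuresTolerated) {
        int validVols = volsConfigured - volsFailed;
        int validVolsRequired = volsConfigured - volFailuresTolerated;
        // Equivalent to: volsFailed <= volFailuresTolerated
        return validVols >= validVolsRequired;
    }

    public static void main(String[] args) {
        // Mirrors the manual test cases reported later in this issue:
        // 4 volumes configured, 1 failed, 0 tolerated -> refuse to start.
        System.out.println(hasSufficientVolumes(4, 1, 0)); // false
        // 4 volumes configured, 1 failed, 1 tolerated -> start.
        System.out.println(hasSufficientVolumes(4, 1, 1)); // true
    }
}
```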

          Bharath Mundlapudi added a comment -

          Thanks for the review, Jitendra.

          1. The conditions are there for better readability. Yes, we can change this into one condition.

          2. Error is logged where the exception is caught.

          Bharath Mundlapudi added a comment -

          Attaching a patch which addresses the review comments. I have added some more tests.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479026/HDFS-1592-2.patch
          against trunk revision 1102833.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/518//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/518//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/518//console

          This message is automatically generated.

          Bharath Mundlapudi added a comment -

          These failing tests are not related to this patch.

          Jitendra Nath Pandey added a comment -

          +1 for the patch.

          Eli Collins added a comment - edited

          The intent of this jira (as I understand it, see HDFS-1849) is that the DN should start even if there are failed volumes, as long as the number of failed volumes is <= dfs.datanode.failed.volumes.tolerated. The use case is that an admin configures n volume failures to tolerate; then, when the cluster is restarted, all the nodes with at most n failed volumes should start up. I.e., restarting the DN should respect the dfs.datanode.failed.volumes.tolerated value so you don't end up with a cluster of DNs that were running successfully but fail to restart.

          With the current patch the DN will refuse to come up if any of the volumes have failed, no matter how dfs.datanode.failed.volumes.tolerated is configured. We need tests that verify:

          • A DN will successfully start with a failed volume as long as it's configured to tolerate a failed volume
          • A DN will fail to start if more than the number of tolerated volumes are failed

          Make sense?

          Bharath Mundlapudi added a comment -

          Yes, what you mentioned w.r.t. the use cases is right.

          • A DN will successfully start with a failed volume as long as it's configured to tolerate a failed volume
          • A DN will fail to start if more than the number of tolerated volumes are failed

          This is the expected behavior with this patch.

          I had some difficulty failing the disks through the unit tests. If we set the directory permissions to non-writable, then once we run the Datanode, it resets the directory permissions and the test will always succeed.

          These tests were done outside of the unit tests, through umount -l etc. All of the above-mentioned cases were manually tested.

          Eli Collins added a comment -

          Did you test the patch on trunk?

          Currently, if a storage directory has failed, the BPOfferService daemon will fail to start. This patch only throws an exception if there is an insufficient number of valid volumes; it doesn't do anything to ensure that the BP actually comes up when there is a failed storage directory. I.e., it doesn't implement the expected behavior.

          You should be able to write a test that fails a volume using Mockito (see examples in other tests), the fault injection framework, or by having the test manage the data dirs itself (e.g., pass false for the 3rd argument in startDataNodes) and fail them individually yourself.

          Bharath Mundlapudi added a comment -

          Eli, thanks for your review and comments.

          Yes, I have tested against trunk. How did you test this? Did you configure volumes tolerated correctly?
          The expected behavior is: if more volumes have failed than are tolerated, the BPOfferService daemon will fail to start.

          Also, note that I have filed another Jira for the case where, if all BPOfferService threads exit for some reason, the Datanode should exit. This is a bug in the current code.

          Please see the following four tests I performed and their outcomes on trunk.

          Case 1: One disk failure (/grid/2) and Vol Tolerated = 0. Outcome: BP Service should exit.

          11/05/18 07:48:56 WARN common.Util: Path /grid/0/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 07:48:56 WARN common.Util: Path /grid/1/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 07:48:56 WARN common.Util: Path /grid/2/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 07:48:56 WARN common.Util: Path /grid/3/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 07:48:56 WARN datanode.DataNode: Invalid directory in: dfs.datanode.data.dir:
          java.io.FileNotFoundException: File file:/grid/2/testing/hadoop-logs/dfs/data does not exist.
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:424)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:315)
          at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:131)
          at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:148)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.getDataDirsFromURIs(DataNode.java:2154)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2133)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2074)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2097)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2240)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2250)
          11/05/18 07:48:56 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          11/05/18 07:48:56 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          11/05/18 07:48:56 INFO impl.MetricsSystemImpl: DataNode metrics system started
          11/05/18 07:48:56 INFO impl.MetricsSystemImpl: Registered source UgiMetrics
          11/05/18 07:48:56 INFO datanode.DataNode: Opened info server at 50010
          11/05/18 07:48:56 INFO datanode.DataNode: Balancing bandwith is 1048576 bytes/s
          11/05/18 07:48:56 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
          11/05/18 07:48:56 INFO http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
          11/05/18 07:48:56 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075
          11/05/18 07:48:56 INFO http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075
          11/05/18 07:48:56 INFO http.HttpServer: Jetty bound to port 50075
          11/05/18 07:48:56 INFO mortbay.log: jetty-6.1.14
          11/05/18 07:48:56 WARN mortbay.log: Can't reuse /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq, using /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq_6441176730816569391
          11/05/18 07:49:01 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
          11/05/18 07:49:01 INFO ipc.Server: Starting Socket Reader #1 for port 50020
          11/05/18 07:49:01 INFO ipc.Server: Starting Socket Reader #2 for port 50020
          11/05/18 07:49:01 INFO ipc.Server: Starting Socket Reader #3 for port 50020
          11/05/18 07:49:01 INFO ipc.Server: Starting Socket Reader #4 for port 50020
          11/05/18 07:49:01 INFO ipc.Server: Starting Socket Reader #5 for port 50020
          11/05/18 07:49:01 INFO impl.MetricsSystemImpl: Registered source RpcActivityForPort50020
          11/05/18 07:49:01 INFO impl.MetricsSystemImpl: Registered source RpcDetailedActivityForPort50020
          11/05/18 07:49:01 INFO impl.MetricsSystemImpl: Registered source JvmMetrics
          11/05/18 07:49:01 INFO impl.MetricsSystemImpl: Registered source DataNodeActivity-hadooplab40.yst.corp.yahoo.com-50010
          11/05/18 07:49:01 INFO datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0)In BPOfferService.run, data = null;bp=null
          11/05/18 07:49:01 INFO ipc.Server: IPC Server Responder: starting
          11/05/18 07:49:01 INFO ipc.Server: IPC Server listener on 50020: starting
          11/05/18 07:49:01 INFO ipc.Server: IPC Server handler 0 on 50020: starting
          11/05/18 07:49:01 INFO ipc.Server: IPC Server handler 1 on 50020: starting
          11/05/18 07:49:01 INFO ipc.Server: IPC Server handler 2 on 50020: starting
          11/05/18 07:49:01 INFO datanode.DataNode: handshake: namespace info = lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 07:49:01 INFO common.Storage: Locking is disabled
          11/05/18 07:49:01 INFO common.Storage: Locking is disabled
          11/05/18 07:49:01 INFO common.Storage: Locking is disabled
          11/05/18 07:49:01 INFO datanode.DataNode: setting up storage: nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 07:49:01 FATAL datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) initialization failed for block pool BP-1694914230-10.72.86.55-1305704227822
          org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value for volumes required - validVolsRequired: 4, Current valid volumes: 3, volsConfigured: 4, volFailuresTolerated: 0
          at org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1160)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.initFsDataSet(DataNode.java:1420)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.access$1100(DataNode.java:169)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:804)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
          at java.lang.Thread.run(Thread.java:619)
          11/05/18 07:49:01 WARN datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) ending block pool service for: BP-1694914230-10.72.86.55-1305704227822

          Case 2: One disk failure (/grid/2) and Vol Tolerated = 1. Outcome: BP Service should not exit

          11/05/18 08:48:39 WARN common.Util: Path /grid/0/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 08:48:39 WARN common.Util: Path /grid/1/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 08:48:39 WARN common.Util: Path /grid/2/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 08:48:39 WARN common.Util: Path /grid/3/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 08:48:39 WARN datanode.DataNode: Invalid directory in: dfs.datanode.data.dir:
          java.io.FileNotFoundException: File file:/grid/2/testing/hadoop-logs/dfs/data does not exist.
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:424)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:315)
          at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:131)
          at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:148)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.getDataDirsFromURIs(DataNode.java:2154)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2133)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2074)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2097)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2240)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2250)
          11/05/18 08:48:40 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: DataNode metrics system started
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Registered source UgiMetrics
          11/05/18 08:48:40 INFO datanode.DataNode: Opened info server at 50010
          11/05/18 08:48:40 INFO datanode.DataNode: Balancing bandwith is 1048576 bytes/s
          11/05/18 08:48:40 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
          11/05/18 08:48:40 INFO http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
          11/05/18 08:48:40 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075
          11/05/18 08:48:40 INFO http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075
          11/05/18 08:48:40 INFO http.HttpServer: Jetty bound to port 50075
          11/05/18 08:48:40 INFO mortbay.log: jetty-6.1.14
          11/05/18 08:48:40 WARN mortbay.log: Can't reuse /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq, using /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq_4334063446071982759
          11/05/18 08:48:40 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
          11/05/18 08:48:40 INFO ipc.Server: Starting Socket Reader #1 for port 50020
          11/05/18 08:48:40 INFO ipc.Server: Starting Socket Reader #2 for port 50020
          11/05/18 08:48:40 INFO ipc.Server: Starting Socket Reader #3 for port 50020
          11/05/18 08:48:40 INFO ipc.Server: Starting Socket Reader #4 for port 50020
          11/05/18 08:48:40 INFO ipc.Server: Starting Socket Reader #5 for port 50020
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Registered source RpcActivityForPort50020
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Registered source RpcDetailedActivityForPort50020
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Registered source JvmMetrics
          11/05/18 08:48:40 INFO impl.MetricsSystemImpl: Registered source DataNodeActivity-hadooplab40.yst.corp.yahoo.com-50010
          11/05/18 08:48:40 INFO datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0)In BPOfferService.run, data = null;bp=null
          11/05/18 08:48:40 INFO ipc.Server: IPC Server Responder: starting
          11/05/18 08:48:40 INFO ipc.Server: IPC Server listener on 50020: starting
          11/05/18 08:48:40 INFO ipc.Server: IPC Server handler 0 on 50020: starting
          11/05/18 08:48:40 INFO ipc.Server: IPC Server handler 1 on 50020: starting
          11/05/18 08:48:40 INFO ipc.Server: IPC Server handler 2 on 50020: starting
          11/05/18 08:48:40 INFO datanode.DataNode: handshake: namespace info = lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 08:48:40 INFO common.Storage: Locking is disabled
          11/05/18 08:48:40 INFO common.Storage: Locking is disabled
          11/05/18 08:48:40 INFO common.Storage: Locking is disabled
          11/05/18 08:48:40 INFO datanode.DataNode: setting up storage: nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 08:48:40 INFO datanode.DataNode: FSDataset added volume - /grid/0/testing/hadoop-logs/dfs/data/current
          11/05/18 08:48:40 INFO datanode.DataNode: FSDataset added volume - /grid/1/testing/hadoop-logs/dfs/data/current
          11/05/18 08:48:40 INFO datanode.DataNode: FSDataset added volume - /grid/3/testing/hadoop-logs/dfs/data/current
          11/05/18 08:48:40 INFO datanode.DataNode: Registered FSDatasetState MBean
          11/05/18 08:48:40 INFO datanode.DataNode: Adding block pool BP-1694914230-10.72.86.55-1305704227822
          11/05/18 08:48:40 INFO datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1305719925918 with interval 21600000
          11/05/18 08:48:40 INFO datanode.DataNode: in register: sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 08:48:40 INFO datanode.DataNode: bpReg after =lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
          11/05/18 08:48:40 INFO datanode.DataNode: in register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 08:48:40 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
          11/05/18 08:48:40 INFO datanode.DataNode: BlockReport of 0 blocks got processed in 4 msecs
          11/05/18 08:48:40 INFO datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@135ae7e
          11/05/18 08:48:40 INFO datanode.BlockPoolSliceScanner: Periodic Block Verification scan initialized with interval 1814400000.
          11/05/18 08:48:40 INFO datanode.DataBlockScanner: Added bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
          11/05/18 08:48:45 INFO datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%

          Case 3: All good volumes and Vol Tolerated = 1. Outcome: BP Service should not exit.

          11/05/18 09:18:56 WARN common.Util: Path /grid/0/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:18:56 WARN common.Util: Path /grid/1/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:18:56 WARN common.Util: Path /grid/2/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:18:56 WARN common.Util: Path /grid/3/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:18:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
          11/05/18 09:18:56 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          11/05/18 09:18:56 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          11/05/18 09:18:56 INFO impl.MetricsSystemImpl: DataNode metrics system started
          11/05/18 09:18:56 INFO impl.MetricsSystemImpl: Registered source UgiMetrics
          11/05/18 09:18:56 INFO datanode.DataNode: Opened info server at 50010
          11/05/18 09:18:56 INFO datanode.DataNode: Balancing bandwith is 1048576 bytes/s
          11/05/18 09:18:56 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
          11/05/18 09:18:56 INFO http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
          11/05/18 09:18:56 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075
          11/05/18 09:18:56 INFO http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075
          11/05/18 09:18:56 INFO http.HttpServer: Jetty bound to port 50075
          11/05/18 09:18:56 INFO mortbay.log: jetty-6.1.14
          11/05/18 09:18:56 WARN mortbay.log: Can't reuse /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq, using /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq_5832726280495656689
          11/05/18 09:18:56 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
          11/05/18 09:18:57 INFO ipc.Server: Starting Socket Reader #1 for port 50020
          11/05/18 09:18:57 INFO ipc.Server: Starting Socket Reader #2 for port 50020
          11/05/18 09:18:57 INFO ipc.Server: Starting Socket Reader #3 for port 50020
          11/05/18 09:18:57 INFO ipc.Server: Starting Socket Reader #4 for port 50020
          11/05/18 09:18:57 INFO ipc.Server: Starting Socket Reader #5 for port 50020
          11/05/18 09:18:57 INFO impl.MetricsSystemImpl: Registered source RpcActivityForPort50020
          11/05/18 09:18:57 INFO impl.MetricsSystemImpl: Registered source RpcDetailedActivityForPort50020
          11/05/18 09:18:57 INFO impl.MetricsSystemImpl: Registered source JvmMetrics
          11/05/18 09:18:57 INFO impl.MetricsSystemImpl: Registered source DataNodeActivity-hadooplab40.yst.corp.yahoo.com-50010
          11/05/18 09:18:57 INFO datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0)In BPOfferService.run, data = null;bp=null
          11/05/18 09:18:57 INFO ipc.Server: IPC Server Responder: starting
          11/05/18 09:18:57 INFO ipc.Server: IPC Server listener on 50020: starting
          11/05/18 09:18:57 INFO ipc.Server: IPC Server handler 1 on 50020: starting
          11/05/18 09:18:57 INFO ipc.Server: IPC Server handler 0 on 50020: starting
          11/05/18 09:18:57 INFO ipc.Server: IPC Server handler 2 on 50020: starting
          11/05/18 09:18:57 INFO datanode.DataNode: handshake: namespace info = lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:18:57 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data is not formatted.
          11/05/18 09:18:57 INFO common.Storage: Formatting ...
          11/05/18 09:18:57 INFO common.Storage: Locking is disabled
          11/05/18 09:18:57 INFO common.Storage: Locking is disabled
          11/05/18 09:18:57 INFO common.Storage: Locking is disabled
          11/05/18 09:18:57 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 is not formatted.
          11/05/18 09:18:57 INFO common.Storage: Formatting ...
          11/05/18 09:18:57 INFO common.Storage: Formatting block pool BP-1694914230-10.72.86.55-1305704227822 directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822/current
          11/05/18 09:18:57 INFO common.Storage: Locking is disabled
          11/05/18 09:18:57 INFO datanode.DataNode: setting up storage: nsid=413952175;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:18:57 INFO datanode.DataNode: FSDataset added volume - /grid/0/testing/hadoop-logs/dfs/data/current
          11/05/18 09:18:57 INFO datanode.DataNode: FSDataset added volume - /grid/1/testing/hadoop-logs/dfs/data/current
          11/05/18 09:18:57 INFO datanode.DataNode: FSDataset added volume - /grid/2/testing/hadoop-logs/dfs/data/current
          11/05/18 09:18:57 INFO datanode.DataNode: FSDataset added volume - /grid/3/testing/hadoop-logs/dfs/data/current
          11/05/18 09:18:57 INFO datanode.DataNode: Registered FSDatasetState MBean
          11/05/18 09:18:57 INFO datanode.DataNode: Adding block pool BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:18:57 INFO datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1305728372371 with interval 21600000
          11/05/18 09:18:57 INFO datanode.DataNode: in register: sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 09:18:57 INFO datanode.DataNode: bpReg after =lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
          11/05/18 09:18:57 INFO datanode.DataNode: in register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 09:18:57 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
          11/05/18 09:18:57 INFO datanode.DataNode: BlockReport of 0 blocks got processed in 4 msecs
          11/05/18 09:18:57 INFO datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@8de972
          11/05/18 09:18:57 INFO datanode.BlockPoolSliceScanner: Periodic Block Verification scan initialized with interval 1814400000.
          11/05/18 09:18:57 INFO datanode.DataBlockScanner: Added bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
          11/05/18 09:19:02 INFO datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%

          Case 4: All good volumes and Vol Tolerated = 0. Outcome: BP Service should not exit.

          11/05/18 09:24:16 WARN common.Util: Path /grid/0/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:24:16 WARN common.Util: Path /grid/1/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:24:16 WARN common.Util: Path /grid/2/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:24:16 WARN common.Util: Path /grid/3/testing/hadoop-logs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
          11/05/18 09:24:16 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          11/05/18 09:24:16 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          11/05/18 09:24:16 INFO impl.MetricsSystemImpl: DataNode metrics system started
          11/05/18 09:24:16 INFO impl.MetricsSystemImpl: Registered source UgiMetrics
          11/05/18 09:24:16 INFO datanode.DataNode: Opened info server at 50010
          11/05/18 09:24:16 INFO datanode.DataNode: Balancing bandwith is 1048576 bytes/s
          11/05/18 09:24:16 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
          11/05/18 09:24:16 INFO http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
          11/05/18 09:24:16 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075
          11/05/18 09:24:16 INFO http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075
          11/05/18 09:24:16 INFO http.HttpServer: Jetty bound to port 50075
          11/05/18 09:24:16 INFO mortbay.log: jetty-6.1.14
          11/05/18 09:24:16 WARN mortbay.log: Can't reuse /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq, using /tmp/Jetty_0_0_0_0_50075_datanode___hwtdwq_5258458250806180443
          11/05/18 09:24:17 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
          11/05/18 09:24:17 INFO ipc.Server: Starting Socket Reader #1 for port 50020
          11/05/18 09:24:17 INFO ipc.Server: Starting Socket Reader #2 for port 50020
          11/05/18 09:24:17 INFO ipc.Server: Starting Socket Reader #3 for port 50020
          11/05/18 09:24:17 INFO ipc.Server: Starting Socket Reader #4 for port 50020
          11/05/18 09:24:17 INFO ipc.Server: Starting Socket Reader #5 for port 50020
          11/05/18 09:24:17 INFO impl.MetricsSystemImpl: Registered source RpcActivityForPort50020
          11/05/18 09:24:17 INFO impl.MetricsSystemImpl: Registered source RpcDetailedActivityForPort50020
          11/05/18 09:24:17 INFO impl.MetricsSystemImpl: Registered source JvmMetrics
          11/05/18 09:24:17 INFO impl.MetricsSystemImpl: Registered source DataNodeActivity-hadooplab40.yst.corp.yahoo.com-50010
          11/05/18 09:24:17 INFO datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0)In BPOfferService.run, data = null;bp=null
          11/05/18 09:24:17 INFO ipc.Server: IPC Server Responder: starting
          11/05/18 09:24:17 INFO ipc.Server: IPC Server listener on 50020: starting
          11/05/18 09:24:17 INFO ipc.Server: IPC Server handler 0 on 50020: starting
          11/05/18 09:24:17 INFO ipc.Server: IPC Server handler 1 on 50020: starting
          11/05/18 09:24:17 INFO ipc.Server: IPC Server handler 2 on 50020: starting
          11/05/18 09:24:17 INFO datanode.DataNode: handshake: namespace info = lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:24:17 INFO common.Storage: Locking is disabled
          11/05/18 09:24:17 INFO common.Storage: Locking is disabled
          11/05/18 09:24:17 INFO common.Storage: Locking is disabled
          11/05/18 09:24:17 INFO common.Storage: Locking is disabled
          11/05/18 09:24:17 INFO datanode.DataNode: setting up storage: nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:24:17 INFO datanode.DataNode: FSDataset added volume - /grid/0/testing/hadoop-logs/dfs/data/current
          11/05/18 09:24:17 INFO datanode.DataNode: FSDataset added volume - /grid/1/testing/hadoop-logs/dfs/data/current
          11/05/18 09:24:17 INFO datanode.DataNode: FSDataset added volume - /grid/2/testing/hadoop-logs/dfs/data/current
          11/05/18 09:24:17 INFO datanode.DataNode: FSDataset added volume - /grid/3/testing/hadoop-logs/dfs/data/current
          11/05/18 09:24:17 INFO datanode.DataNode: Registered FSDatasetState MBean
          11/05/18 09:24:17 INFO datanode.DataNode: Adding block pool BP-1694914230-10.72.86.55-1305704227822
          11/05/18 09:24:17 INFO datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1305719970633 with interval 21600000
          11/05/18 09:24:17 INFO datanode.DataNode: in register: sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 09:24:17 INFO datanode.DataNode: bpReg after =lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
          11/05/18 09:24:17 INFO datanode.DataNode: in register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
          11/05/18 09:24:17 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
          11/05/18 09:24:17 INFO datanode.DataNode: BlockReport of 0 blocks got processed in 4 msecs
          11/05/18 09:24:17 INFO datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@18c5e67
          11/05/18 09:24:17 INFO datanode.BlockPoolSliceScanner: Periodic Block Verification scan initialized with interval 1814400000.
          11/05/18 09:24:17 INFO datanode.DataBlockScanner: Added bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
          11/05/18 09:24:22 INFO datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
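The startup check these four cases exercise can be sketched as a minimal Python model (the function name `check_volumes` and its parameter names are illustrative, not Hadoop's actual identifiers; the real check lives in `FSDataset` and throws a `DiskChecker$DiskErrorException`, configured via `dfs.datanode.failed.volumes.tolerated`):

```python
def check_volumes(volumes_configured: int, valid_volumes: int,
                  failures_tolerated: int) -> None:
    """Model of the DataNode startup check: the node may lose at most
    `failures_tolerated` of its configured data volumes before the
    block pool service refuses to start."""
    required = volumes_configured - failures_tolerated
    if valid_volumes < required:
        # Mirrors the FATAL in Case 1 (4 configured, 3 valid, 0 tolerated):
        # initialization fails and the BP service exits.
        raise IOError(
            "Invalid value for volumes required - "
            f"validVolsRequired: {required}, "
            f"Current valid volumes: {valid_volumes}, "
            f"volsConfigured: {volumes_configured}, "
            f"volFailuresTolerated: {failures_tolerated}")
```

Under this model, Cases 2–4 pass the check (valid volumes ≥ configured − tolerated) and the BP service starts, while Case 1 fails it and the service exits, matching the logs above.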

          Eli Collins added a comment - edited

          Thanks for the info, Bharath. I tested on trunk; also, when I filed HDFS-1849 I already knew the current code wouldn't tolerate a failed volume. There's an issue with the 2nd test case:

          Case 2: One disk failure (/grid/2) and Vol Tolerated = 1. Outcome: BP Service should not exit
          ...
          11/05/18 08:48:39 WARN datanode.DataNode: Invalid directory in: dfs.datanode.data.dir:
          java.io.FileNotFoundException: File file:/grid/2/testing/hadoop-logs/dfs/data does not exist.

          A missing data directory is not a disk failure; the datanode will happily notice it and recreate the directory successfully.

          If you swap out a disk from a host, or just make part of the data directory inaccessible, e.g. by changing the perms on the host file system, you'll see that this is a fatal error for the DN, e.g.:

          11/05/18 15:57:23 FATAL datanode.DataNode: DatanodeRegistration(localhost.localdomain:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) initialization failed for block pool BP-1288327361-127.0.0.1-1305593076974
          java.io.IOException: Cannot remove current directory: /home/eli/hadoop-dirs1/dfs/data1/current
          	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:332)
          	at org.apache.hadoop.hdfs.server.datanode.DataStorage.format(DataStorage.java:264)
          	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:166)
          	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:216)
          	at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:797)
          	at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
          	at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1186)
          	at java.lang.Thread.run(Thread.java:662)
          

          Your four test cases are great. Please write a unit test for each. This way we can make sure the patch works for each and that future changes don't break this feature.

          Bharath Mundlapudi added a comment -

          Attaching a patch which addresses Eli's comments.

          Bharath Mundlapudi added a comment -

          First, thank you for identifying this issue, Eli. Great job!

          A couple of comments:
          1. We tested a couple of things, like masking permissions at the dfs level. That didn't catch this issue. Your suggestion to change the permissions on the specific directory helped us reproduce this case. Thanks again.
          2. We also tested by unmounting disks.
          3. Then we tested by injecting failures at the kernel level.

          Regarding test cases:
          I agree with you that we need more tests, but I think we should do that in another JIRA, since we have already spent a lot of effort on manual testing. Can we file another JIRA to track this?

          With this new patch, I have tested the following new cases. Can you please review and provide your feedback?

          case 1: All four good volumes, Vol Tolerated=1, expected outcome = BPService should start

          11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - /grid/0/testing/hadoop-logs/dfs/data/current
          11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - /grid/1/testing/hadoop-logs/dfs/data/current
          11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - /grid/2/testing/hadoop-logs/dfs/data/current
          11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - /grid/3/testing/hadoop-logs/dfs/data/current
          11/05/19 04:57:51 INFO datanode.DataNode: Registered FSDatasetState MBean
          11/05/19 04:57:51 INFO datanode.DataNode: Adding block pool BP-1694914230-10.72.86.55-1305704227822
          11/05/19 04:57:51 INFO datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1305782678947 with interval 21600000
          11/05/19 04:57:51 INFO datanode.DataNode: in register: sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
          11/05/19 04:57:51 INFO datanode.DataNode: bpReg after =lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
          11/05/19 04:57:51 INFO datanode.DataNode: in register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
          11/05/19 04:57:51 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
          11/05/19 04:57:51 INFO datanode.DataNode: BlockReport of 0 blocks got processed in 3 msecs
          11/05/19 04:57:51 INFO datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@3e5a91
          11/05/19 04:57:51 INFO datanode.BlockPoolSliceScanner: Periodic Block Verification scan initialized with interval 1814400000.
          11/05/19 04:57:51 INFO datanode.DataBlockScanner: Added bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
          11/05/19 04:57:56 INFO datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%

          case 2: One failed volume(/grid/2), three good volumes, Vol Tolerated=1, expected outcome = BPService should start

          11/05/19 05:01:27 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:01:27 INFO common.Storage: Formatting ...
          11/05/19 05:01:27 WARN common.Storage: Invalid directory in: /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822: File file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:01:27 INFO common.Storage: Locking is disabled
          11/05/19 05:01:27 INFO common.Storage: Locking is disabled
          11/05/19 05:01:27 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:01:27 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:01:27 INFO common.Storage: Locking is disabled
          11/05/19 05:01:27 INFO datanode.DataNode: setting up storage: nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - /grid/0/testing/hadoop-logs/dfs/data/current
          11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - /grid/1/testing/hadoop-logs/dfs/data/current
          11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - /grid/3/testing/hadoop-logs/dfs/data/current
          11/05/19 05:01:27 INFO datanode.DataNode: Registered FSDatasetState MBean
          11/05/19 05:01:27 INFO datanode.DataNode: Adding block pool BP-1694914230-10.72.86.55-1305704227822
          11/05/19 05:01:27 INFO datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1305789604425 with interval 21600000
          11/05/19 05:01:27 INFO datanode.DataNode: in register: sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
          11/05/19 05:01:27 INFO datanode.DataNode: bpReg after =lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
          11/05/19 05:01:27 INFO datanode.DataNode: in register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
          11/05/19 05:01:27 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
          11/05/19 05:01:27 INFO datanode.DataNode: BlockReport of 0 blocks got processed in 4 msecs
          11/05/19 05:01:27 INFO datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@1adb7b8
          11/05/19 05:01:27 INFO datanode.BlockPoolSliceScanner: Periodic Block Verification scan initialized with interval 1814400000.
          11/05/19 05:01:27 INFO datanode.DataBlockScanner: Added bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
          11/05/19 05:01:32 INFO datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%

          case 3: Two failed volumes(/grid/1,/grid/2), two good volumes, Vol Tolerated=1, expected outcome = BPService should NOT start

          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/1/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:04:06 INFO common.Storage: Formatting ...
          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:04:06 INFO common.Storage: Formatting ...
          11/05/19 05:04:06 WARN common.Storage: Invalid directory in: /grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822: File file:/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 WARN common.Storage: Invalid directory in: /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822: File file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 INFO common.Storage: Locking is disabled
          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822 does not exist.
          11/05/19 05:04:06 INFO common.Storage: Locking is disabled
          11/05/19 05:04:06 INFO datanode.DataNode: setting up storage: nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
          11/05/19 05:04:06 FATAL datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) initialization failed for block pool BP-1694914230-10.72.86.55-1305704227822
          org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value for volumes required - validVolsRequired: 3, Current valid volumes: 2, volsConfigured: 4, volFailuresTolerated: 1
          at org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1160)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.initFsDataSet(DataNode.java:1420)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.access$1100(DataNode.java:169)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:804)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
          at java.lang.Thread.run(Thread.java:619)
          11/05/19 05:04:06 WARN datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) ending block pool service for: BP-1694914230-10.72.86.55-1305704227822
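
          The DiskErrorException above spells out the startup arithmetic: validVolsRequired = volsConfigured - volFailuresTolerated, and the block pool service refuses to start when fewer valid volumes remain. A minimal sketch of that check (hypothetical class and method names, not the actual FSDataset code):

          ```java
          // Hypothetical sketch of the startup check implied by the exception
          // message above; class and method names are invented for illustration.
          public class VolumeToleranceCheck {
              /** Returns true when datanode startup should proceed. */
              static boolean shouldStart(int volsConfigured, int volFailuresTolerated,
                                         int validVolumes) {
                  int validVolsRequired = volsConfigured - volFailuresTolerated;
                  return validVolumes >= validVolsRequired;
              }

              public static void main(String[] args) {
                  // Mirrors case 2: 4 configured, 1 tolerated, 3 valid -> start.
                  System.out.println(shouldStart(4, 1, 3)); // true
                  // Mirrors case 3: 4 configured, 1 tolerated, only 2 valid -> fatal.
                  System.out.println(shouldStart(4, 1, 2)); // false
              }
          }
          ```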

          case 4: All failed volumes, Vol Tolerated=1, expected outcome = BPService should NOT start

          11/05/19 05:07:51 INFO common.Storage: Storage directory /grid/0/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:07:51 INFO common.Storage: Formatting ...
          11/05/19 05:07:51 INFO common.Storage: Storage directory /grid/1/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:07:51 INFO common.Storage: Formatting ...
          11/05/19 05:07:51 INFO common.Storage: Storage directory /grid/2/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:07:51 INFO common.Storage: Formatting ...
          11/05/19 05:07:51 INFO common.Storage: Storage directory /grid/3/testing/hadoop-logs/dfs/data is not formatted.
          11/05/19 05:07:51 INFO common.Storage: Formatting ...
          11/05/19 05:07:51 FATAL datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) initialization failed for block pool BP-1694914230-10.72.86.55-1305704227822
          java.io.IOException: All specified directories are not accessible or do not exist.
          at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:182)
          at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:217)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:797)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
          at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
          at java.lang.Thread.run(Thread.java:619)
          11/05/19 05:07:51 WARN datanode.DataNode: DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) ending block pool service for: BP-1694914230-10.72.86.55-1305704227822

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479721/HDFS-1592-3.patch
          against trunk revision 1124459.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSRemove
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/581//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/581//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/581//console

          This message is automatically generated.

          Bharath Mundlapudi added a comment -

          Attaching a patch with more unit tests.

          Bharath Mundlapudi added a comment -

          Eli,

I have added more unit tests as mentioned above. Also note that the case you pointed out is a rare condition: in our tests, making the file system read-only through mount, unmounting disks, or even setting permissions one level above does not trigger this issue. We only hit it when we set the permissions on this particular directory. Anyway, I have fixed the case you pointed out as well.

          Thanks for spotting this though.
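The failure simulation described above can be sketched as a small standalone program (the helper name `isUsableVolume` is illustrative, not the actual test code): dropping read/write/execute permissions on the data directory itself — rather than on a parent or at the mount level — is what makes a volume count as failed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SimulateVolumeFailure {
    // Hypothetical helper: a data directory is usable only if it can be
    // read, written, and traversed. This mirrors the condition described
    // in the comment above.
    static boolean isUsableVolume(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    public static void main(String[] args) throws IOException {
        Path vol = Files.createTempDirectory("dfs-data");
        File dir = vol.toFile();
        System.out.println("before: " + isUsableVolume(dir));

        // Drop all permissions on the directory itself (not its parent).
        // Note: this has no visible effect when running as root.
        dir.setReadable(false);
        dir.setWritable(false);
        dir.setExecutable(false);
        System.out.println("after: " + isUsableVolume(dir));

        // Restore permissions so the temp directory can be cleaned up.
        dir.setReadable(true);
        dir.setWritable(true);
        dir.setExecutable(true);
    }
}
```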

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479883/HDFS-1592-4.patch
          against trunk revision 1125217.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestHDFSTrash

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/600//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/600//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/600//console

          This message is automatically generated.

          Bharath Mundlapudi added a comment -

          These failing tests are not related to this patch.

Eli, if you don't have any comments, we will commit this patch today.

          Jitendra Nath Pandey added a comment -

The new test cases look good and cover most of the use-case scenarios. However, please see if these test cases can be refactored to re-use code for failure simulation, cluster startup, etc.

          Eli Collins added a comment -

          Hey Bharath,

          Apologies for the slow response, I didn't see the jira updates for some reason. Thanks for adding the tests, the new test cases look good. I think other types of faults can be handled in a separate jira.

          Some minor comments, otherwise looks great!

• DataStorage#recoverTransitionRead should log the IOE now that it's swallowed instead of thrown
• It would be clearer to users if the message in the exception thrown by FSDataset were something like "Too many failed volumes". Since "volumes required" is an internal variable rather than a config option, users won't know what "Invalid value for volumes required" means.
• Nit: volsConfigured can be initialized in one line: int volsConfigured = (dataDirs == null) ? 0 : dataDirs.length;
• startCluster should probably be renamed restartCluster since it's always called on active clusters
• Jitendra's comment wrt refactoring makes sense

          Thanks,
          Eli
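The startup check under review can be sketched roughly as follows (class, method, and message names here are illustrative, not the committed patch code). It folds in both the one-line volsConfigured initialization and the clearer "Too many failed volumes" message suggested above:

```java
public class VolumeToleranceCheck {
    // Eli's suggested one-liner: a null dir list counts as zero volumes.
    static int volsConfigured(String[] dataDirs) {
        return (dataDirs == null) ? 0 : dataDirs.length;
    }

    // Reject a misconfigured tolerance, then reject too many failed volumes.
    static void checkVolumes(String[] dataDirs, int volsInUse, int volFailuresTolerated) {
        int configured = volsConfigured(dataDirs);
        if (volFailuresTolerated < 0 || volFailuresTolerated >= configured) {
            throw new IllegalArgumentException(
                "Invalid value for volFailuresTolerated: " + volFailuresTolerated);
        }
        int volsFailed = configured - volsInUse;
        if (volsFailed > volFailuresTolerated) {
            throw new IllegalStateException("Too many failed volumes: " + volsFailed);
        }
    }

    public static void main(String[] args) {
        String[] dirs = {"/grid/1", "/grid/2", "/grid/3"};
        checkVolumes(dirs, 3, 1); // all volumes healthy: ok
        checkVolumes(dirs, 2, 1); // one failed volume, within tolerance: ok
        boolean threw = false;
        try {
            checkVolumes(dirs, 1, 1); // two failed volumes, tolerance is 1
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println(threw ? "rejected" : "accepted"); // prints "rejected"
    }
}
```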

          Bharath Mundlapudi added a comment -

Thanks for the review, Eli and Jitendra. I am attaching a patch that incorporates your comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480337/HDFS-1592-5.patch
          against trunk revision 1127317.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/619//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/619//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/619//console

          This message is automatically generated.

          Jitendra Nath Pandey added a comment -

          +1 for the patch.

          Jitendra Nath Pandey added a comment -

          Eli,
          Do you have any more concerns? I intend to commit this patch by tomorrow.

          Eli Collins added a comment -

          +1 lgtm. Thanks Bharath

Nit: "due an exception" should be "due to exception"

          Jitendra Nath Pandey added a comment -

          I have committed this. Thanks to Bharath.

          Jitendra Nath Pandey added a comment -

          > Nit: "due an exception" should "due to exception"
          I fixed it before the commit.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #689 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/689/)
          HDFS-1592. Datanode startup doesn't honor volumes.tolerated. Contributed by Bharath Mundlapudi.

          jitendra : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127995
          Files :

          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
          • /hadoop/hdfs/trunk/src/test/hdfs/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailureToleration.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/hdfs/trunk/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #679 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/679/)
          HDFS-1592. Datanode startup doesn't honor volumes.tolerated. Contributed by Bharath Mundlapudi.

          jitendra : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127995
          Files :

          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
          • /hadoop/hdfs/trunk/src/test/hdfs/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailureToleration.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/hdfs/trunk/CHANGES.txt
          Owen O'Malley added a comment -

          Hadoop 0.20.204.0 was released.


People

• Assignee: Bharath Mundlapudi
• Reporter: Bharath Mundlapudi