Description
Steps to reproduce
1. Start namenode
./bin/hdfs --daemon start namenode
2. Start datanode
./bin/hdfs datanode
The datanode log shows the following connection retries:
17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
This is expected because SCM has not been started yet.
3. Start scm
./bin/hdfs scm
Expected: the datanode registers with this SCM, and the SCM log shows:
17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
However, this log line did NOT appear. (Debugging showed that the datanode state had unexpectedly transitioned to SHUTDOWN because of leaked threads: each leaked thread advanced the state machine to its next state, so together they pushed it all the way to SHUTDOWN.)
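The effect of the leaked threads can be illustrated with a small standalone sketch. It is not the Ozone datanode code: the class name, the three-state enum, and the next() method below are hypothetical stand-ins, assuming only that each worker thread advances a shared state machine by one step when it finishes. With one worker the state ends at RUNNING; with several leaked workers the extra transitions drive it to SHUTDOWN.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; names do not come from the Hadoop source tree.
class LeakedTransitionDemo {
    // Assumed linear lifecycle: INIT -> RUNNING -> SHUTDOWN (terminal).
    enum State {
        INIT, RUNNING, SHUTDOWN;

        State next() {
            return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1];
        }
    }

    // Starts `workers` threads; each one advances the shared state exactly once.
    static State runWith(int workers) throws InterruptedException {
        AtomicReference<State> state = new AtomicReference<>(State.INIT);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> state.updateAndGet(State::next));
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return state.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 worker:  " + runWith(1)); // prints "1 worker:  RUNNING"
        System.out.println("3 workers: " + runWith(3)); // prints "3 workers: SHUTDOWN"
    }
}
```

In this sketch the transition itself is atomic, so the problem is not a data race but the number of transitions: every extra thread contributes one more step than intended.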
4. Create a container from scm CLI
./bin/hdfs scm -container -create -c 20170714c0
This fails with the following exception:
Creating container : 20170714c0.
Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
    at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
    at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
    at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
The datanode never registered with SCM, so SCM is still in chill mode.
Note: if SCM is started first, the issue does not occur, and I can create a container from the CLI without any problem.