  1. Hadoop HDFS
  2. HDFS-7240 Scaling HDFS
  3. HDFS-12098

Ozone: Datanode is unable to register with scm if scm starts later


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: HDFS-7240
    • Fix Version/s: HDFS-7240
    • Component/s: datanode, ozone, scm

    Description

      Steps to reproduce
      1. Start namenode

      ./bin/hdfs --daemon start namenode

      2. Start datanode

      ./bin/hdfs datanode

      The datanode log shows the following connection retries:

      17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
      17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
      17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
      17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
      

      This is expected because scm has not been started yet.
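
For readers unfamiliar with the retry policy named in the log lines above, here is a minimal sketch of what "retry up to a maximum count with a fixed sleep" means. The class `FixedSleepRetrySketch` and the method `callWithRetries` are hypothetical names made up for illustration; this is not Hadoop's implementation, and the assumption that the first call is followed by at most `maxRetries` further attempts is mine.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not Hadoop's retry code.
class FixedSleepRetrySketch {

    /**
     * Calls the action once, then retries it up to maxRetries more times,
     * sleeping the same fixed interval between consecutive attempts.
     * Rethrows the last failure once the retries are exhausted.
     */
    static <T> T callWithRetries(Callable<T> action, int maxRetries,
                                 long sleepTime, TimeUnit unit) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxRetries) {
                    unit.sleep(sleepTime);
                }
            }
        }
        throw last;
    }
}
```

With `maxRetries=10` and a one-second sleep, as in the log, a caller in this sketch would give up after roughly ten seconds if the server never comes up.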

      3. Start scm

      ./bin/hdfs scm

      The datanode is expected to register with this scm, and the scm log should contain:

      17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
      

      However, this log entry never appears. Debugging into the code shows that the datanode state transitioned to SHUTDOWN unexpectedly because of a thread leak: each leaked thread counted toward setting the next state, and together they drove the state to SHUTDOWN.
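
The failure mode described in step 3 can be illustrated with a small sketch. It assumes a three-state machine (`INIT`, `RUNNING`, `SHUTDOWN`) and simulates the leaked threads sequentially as repeated calls; all names here (`StateTransitionSketch`, `advanceUnguarded`, `advanceFrom`) are hypothetical and do not come from the Ozone datanode code. The point is only that if every leftover task advances a shared state by one step, the state overshoots to `SHUTDOWN`, whereas a transition guarded by the expected current state lets only one task win.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names throughout; this is not the Ozone datanode state machine.
class StateTransitionSketch {

    enum State {
        INIT, RUNNING, SHUTDOWN;

        // SHUTDOWN is terminal; every other state moves one step forward.
        State next() {
            return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1];
        }
    }

    /** Unguarded: every caller pushes the shared state one step forward. */
    static State advanceUnguarded(AtomicReference<State> state) {
        return state.updateAndGet(State::next);
    }

    /** Guarded: the transition succeeds only if the state is still `expected`. */
    static boolean advanceFrom(AtomicReference<State> state, State expected) {
        return state.compareAndSet(expected, expected.next());
    }

    public static void main(String[] args) {
        // Three leftover tasks each advance the state without checking it.
        AtomicReference<State> leaky = new AtomicReference<>(State.INIT);
        for (int task = 0; task < 3; task++) {
            advanceUnguarded(leaky);
        }
        System.out.println("unguarded: " + leaky.get()); // unguarded: SHUTDOWN

        // The same three tasks, but each may only move INIT -> RUNNING.
        AtomicReference<State> guarded = new AtomicReference<>(State.INIT);
        for (int task = 0; task < 3; task++) {
            advanceFrom(guarded, State.INIT);
        }
        System.out.println("guarded: " + guarded.get()); // guarded: RUNNING
    }
}
```

Whether the actual fix guards the transition this way or removes the leaked threads instead is not stated in this report; the guarded variant is shown only for contrast.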

      4. Create a container from scm CLI

      ./bin/hdfs scm -container -create -c 20170714c0

      This fails with the following exception:

      Creating container : 20170714c0.
      Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
      	at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
      	at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
      	at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
      

      The datanode never registered with scm, so scm is still in chill mode.
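
The dependency between datanode registration and container creation can be sketched as a simple gate. This is an illustration under assumptions: the class `ChillModeSketch`, its methods, and the rule "leave chill mode once a minimum number of nodes has registered" are hypothetical, since the report does not describe the actual exit condition scm uses.

```java
// Hypothetical names; the real chill-mode exit rule is not described in this report.
class ChillModeSketch {
    private final int minRegisteredNodes;
    private int registeredNodes = 0;

    ChillModeSketch(int minRegisteredNodes) {
        this.minRegisteredNodes = minRegisteredNodes;
    }

    /** Called when a datanode registers successfully. */
    synchronized void registerNode() {
        registeredNodes++;
    }

    /** Assumed rule: stay in chill mode until enough nodes have registered. */
    synchronized boolean inChillMode() {
        return registeredNodes < minRegisteredNodes;
    }

    /** Refuses to allocate while in chill mode, mirroring the error in step 4. */
    synchronized String allocateContainer(String name) {
        if (inChillMode()) {
            throw new IllegalStateException(
                "Unable to create container while in chill mode");
        }
        return name;
    }
}
```

In this sketch, a gate built with `new ChillModeSketch(1)` rejects `allocateContainer("20170714c0")` until `registerNode()` has been called once, which matches the observed behavior: no registration, no container.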

      Note: if scm is started first, the issue does not occur, and the container can be created from the CLI without any problem.

      Attachments

        1. thread_dump.log
          57 kB
          Weiwei Yang
        2. HDFS-12098-HDFS-7240.001.patch
          15 kB
          Weiwei Yang
        3. HDFS-12098-HDFS-7240.002.patch
          15 kB
          Weiwei Yang
        4. Screen Shot 2017-07-11 at 4.58.08 PM.png
          90 kB
          Anu Engineer
        5. disabled-scm-test.patch
          5 kB
          Anu Engineer
        6. HDFS-12098-HDFS-7240.testcase.patch
          19 kB
          Weiwei Yang
        7. HDFS-12098-HDFS-7240.testcase-1.patch
          18 kB
          Weiwei Yang

          People

            Assignee: Weiwei Yang (cheersyang)
            Reporter: Weiwei Yang (cheersyang)