Hadoop HDFS / HDFS-2182

Exceptions in DataXceiver#run can result in a zombie datanode

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None

    Description

      DataXceiver#run currently swallows all exceptions. It should instead propagate them up to DataXceiverServer#run, which can then decide whether the exception should be tolerated or the daemon should exit. An IOException should be tolerated, as it is today, because it is likely just an issue with a particular thread or an intermittent failure; a java.lang.Error, for example, should not.

      This came up in a bug I'm seeing on a test cluster: if, for example, a NoClassDefFoundError is thrown in DataXceiver#run (because the host's jars were replaced out from underneath it, it ran out of file descriptors, etc.), we end up with a datanode that is alive but always fails because it can't create any DataXceiver threads. In this case the datanode should shut itself down rather than continue to run.
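
      The proposed behavior could look roughly like the sketch below. It is a minimal, self-contained illustration and not the attached patch: XceiverServerSketch, XceiverTask, reportFailure and isFatal are hypothetical names, and no Hadoop classes are used. It assumes the policy described above (java.lang.Error is fatal, IOException is tolerated) and, as a further assumption, treats every other exception type as tolerated.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical unit of work, standing in for the body of DataXceiver#run. */
interface XceiverTask {
    void run() throws Exception;
}

/** Hypothetical stand-in for DataXceiverServer; this is not the Hadoop class. */
class XceiverServerSketch {
    // Stand-in for the daemon's "keep running" flag.
    private final AtomicBoolean shouldRun = new AtomicBoolean(true);

    /** Assumed policy: only java.lang.Error (e.g. NoClassDefFoundError) is fatal. */
    static boolean isFatal(Throwable t) {
        return t instanceof Error;
    }

    /** Called when a worker fails; the server, not the worker, decides what to do. */
    void reportFailure(Throwable t) {
        if (isFatal(t)) {
            // A real daemon would begin an orderly shutdown here.
            shouldRun.set(false);
        }
        // Tolerated failures (e.g. IOException) would be logged; omitted here.
    }

    /** Runs one task and reports any failure to the server instead of swallowing it. */
    void runWorker(XceiverTask task) {
        try {
            task.run();
        } catch (Throwable t) {
            reportFailure(t);
        }
    }

    boolean isRunning() {
        return shouldRun.get();
    }
}
```

      With this shape, a task that throws an IOException leaves isRunning() true, while a task that throws a NoClassDefFoundError flips it to false, so the process can exit instead of lingering as a zombie.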

      Attachments

        1. hdfs-2182-2.patch
          1 kB
          Eli Collins
        2. hdfs-2182-1.patch
          2 kB
          Eli Collins

            People

              Assignee: Eli Collins
              Reporter: Eli Collins
              Votes: 0
              Watchers: 5