Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Later
Description
DataXceiver#run currently swallows all exceptions. It should instead propagate them to DataXceiverServer#run, which can then decide whether the exception should be tolerated or the daemon should exit. An IOException should be tolerated, as it is today, because it most likely reflects a problem with one particular thread or an intermittent failure. A java.lang.Error, for example, should not be tolerated.
This came up in a bug I'm seeing on a test cluster: if, for example, a NoClassDefFoundError is thrown in DataXceiver#run (because the host jars were replaced out from underneath it, it ran out of descriptors, etc.), we end up with a datanode that is alive but always fails because it can't create any DataXceiver threads. In this case the datanode should shut itself down rather than continue to run.
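To make the proposed behavior concrete, here is a minimal sketch of the policy, not the actual Hadoop code. The names XceiverPolicySketch, Decision, Work, Server, and classify are hypothetical and exist only for illustration. It also assumes that exceptions that are neither IOException nor Error continue to be tolerated, which the description above does not specify.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class XceiverPolicySketch {
    /** What the server should do with a failure reported by a worker. */
    public enum Decision { TOLERATE, SHUT_DOWN }

    /** Work that may fail with any exception (stand-in for DataXceiver#run). */
    public interface Work {
        void run() throws Exception;
    }

    /** Hypothetical policy: an IOException is per-thread, an Error is fatal. */
    public static Decision classify(Throwable t) {
        if (t instanceof IOException) {
            return Decision.TOLERATE;
        }
        if (t instanceof Error) {
            return Decision.SHUT_DOWN;
        }
        // Assumption: everything else is tolerated, as it is today.
        return Decision.TOLERATE;
    }

    /** Stand-in for DataXceiverServer: runs workers, records the first fatal failure. */
    public static final class Server {
        private final AtomicReference<Throwable> fatal = new AtomicReference<>();

        /** Runs one worker on its own thread; returns false once the server should exit. */
        public boolean runWorker(Work work) throws InterruptedException {
            Thread t = new Thread(() -> {
                try {
                    work.run();
                } catch (Throwable e) {
                    // Report the failure instead of swallowing it.
                    if (classify(e) == Decision.SHUT_DOWN) {
                        fatal.compareAndSet(null, e);
                    }
                }
            });
            t.start();
            t.join();
            return fatal.get() == null;
        }

        public Throwable fatalCause() {
            return fatal.get();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Server server = new Server();

        boolean alive = server.runWorker(() -> { throw new IOException("connection reset"); });
        System.out.println("after IOException, keep running: " + alive);

        alive = server.runWorker(() -> { throw new NoClassDefFoundError("SomeClass"); });
        System.out.println("after NoClassDefFoundError, keep running: " + alive);
    }
}
```

In this sketch the worker never decides on its own to stop the process. It only reports the failure, and the server-side check is the single place where the tolerate-or-exit decision is made, which matches the division of responsibility proposed above.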
Attachments
Issue Links
- relates to HDFS-9684: DataNode stopped sending heartbeat after getting OutOfMemoryError from DataTransfer thread. (Resolved)