Hadoop HDFS / HDFS-9409

DataNode shutdown does not guarantee full shutdown of all threads due to race condition.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None

      Description

      DataNode#shutdown is documented to return "only after shutdown is complete". Even after completion of this method, it's possible that threads started by the DataNode are still running. Race conditions in the shutdown sequence may cause it to skip stopping and joining the BPServiceActor threads.

      This is likely not a big problem in normal operations, because these are daemon threads that won't block overall process exit. It is more of a problem for tests, because it makes it impossible to write reliable assertions that these threads exited cleanly. For large test suites, it can also cause an accumulation of unneeded threads, which might harm test performance.
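
      As a rough illustration of the kind of assertion referred to above, the sketch below scans the JVM's live threads for a name prefix after shutdown and fails if any match. The "BP-" prefix and the bare AssertionError are illustrative assumptions, not the actual BPServiceActor thread name or an existing Hadoop test helper.

      {code:java}
      import java.util.Map;

      public class LingeringThreadCheck {
        // Fail if any live thread's name starts with the given prefix.
        static void assertNoThreadsMatching(String namePrefix) {
          for (Map.Entry<Thread, StackTraceElement[]> e
              : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (t.isAlive() && t.getName().startsWith(namePrefix)) {
              throw new AssertionError(
                  "Thread still running after shutdown: " + t.getName());
            }
          }
        }

        public static void main(String[] args) {
          // In a real test this would run after DataNode#shutdown returns.
          assertNoThreadsMatching("BP-");
          System.out.println("No matching threads found.");
        }
      }
      {code}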

        Activity

        cnauroth Chris Nauroth added a comment -

        DataNode#shutdown calls BlockPoolManager#getAllNamenodeThreads to get every BPOfferService. Then, later in shutdown, these are passed to BlockPoolManager#shutDownAll, which eventually stops and joins each BPServiceActor thread. There are a few problems:

        1. BlockPoolManager#getAllNamenodeThreads returns an unmodifiable wrapper over its underlying list, so callers can't mutate the list, but it's still the same shared backing list. Later during shutdown, the BPServiceActor is told that it can exit its main loop. Part of that is a call on the BPServiceActor thread to BlockPoolManager#remove. This effectively removes it from the backing list returned by BlockPoolManager#getAllNamenodeThreads, so it will appear to vanish from the list before the call to BlockPoolManager#shutDownAll.
        2. Even if point 1 is fixed by changing BlockPoolManager#getAllNamenodeThreads to return a deep copy, there is a similar problem in that BPOfferService#shutdownActor will remove the actor from its internal tracking list.

        Because of these 2 problems, DataNode#shutdown might no longer have a reference to the BPServiceActor threads when it tries to stop and join on them. Therefore, those threads might still be alive even after completion of DataNode#shutdown. I noticed this while trying to write a test that asserts a particular thread has exited after DataNode shutdown.
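
        A minimal standalone sketch (plain JDK collections, not the actual BlockPoolManager code) of why the unmodifiable wrapper in point 1 does not protect shutdown: the wrapper reads through to the same backing list, so a concurrent remove makes the entry vanish from the caller's "snapshot".

        {code:java}
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;

        public class UnmodifiableViewSketch {
          public static void main(String[] args) {
            List<String> backing = new ArrayList<>();
            backing.add("actor-1");
            backing.add("actor-2");

            // Analogous to getAllNamenodeThreads returning an unmodifiable view
            // over the manager's internal list: read-only, but not a copy.
            List<String> view = Collections.unmodifiableList(backing);
            // Defensive copy taken up front, i.e. the fix suggested for point 1.
            List<String> copy = new ArrayList<>(backing);

            // Analogous to the actor removing itself via BlockPoolManager#remove
            // while it exits its main loop.
            backing.remove("actor-2");

            // The entry has vanished from the read-through view, so a later
            // shutDownAll-style loop over it would never stop or join actor-2 ...
            System.out.println("View after removal: " + view);  // [actor-1]
            // ... while the copy taken up front still holds the reference.
            System.out.println("Copy after removal: " + copy);  // [actor-1, actor-2]
          }
        }
        {code}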

        kihwal Kihwal Lee added a comment -

        BlockScanner behaves the same way. It removes a thread from its tracking data structure after a timed join. This is probably less problematic because the timeout in each join() is 5 minutes. Although that wait may be fine for unit tests, waiting up to 5 minutes per volume scanner thread is unreasonable when the datanode needs to be restarted quickly. We internally reduced it to 100ms and rolling upgrades work a lot better.

        Perhaps we need a flag in the datanode that tells whether it should wait until everything has terminated, or shut down quickly without waiting a long time for daemon threads to terminate. The shutdown code of each subsystem would then check this flag and behave accordingly. We would turn it on for unit testing and off for production. How does that sound?
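
        A minimal sketch of the proposed flag, assuming a purely hypothetical boolean switch and illustrative timeouts rather than any existing Hadoop configuration key: the same join helper waits a long time when a strict shutdown is requested (unit tests) and only briefly otherwise (production restarts).

        {code:java}
        public class ShutdownJoinSketch {
          // Illustrative values only; not real Hadoop defaults.
          static final long STRICT_JOIN_MILLIS = 5 * 60 * 1000L; // "wait for everything" mode
          static final long FAST_JOIN_MILLIS = 100L;             // quick-restart mode

          // Join a worker thread, choosing the timeout from the proposed flag.
          static void joinQuietly(Thread t, boolean strictShutdown) {
            try {
              t.join(strictShutdown ? STRICT_JOIN_MILLIS : FAST_JOIN_MILLIS);
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
            }
          }

          public static void main(String[] args) {
            Thread daemon = new Thread(() -> {
              try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
            });
            daemon.setDaemon(true);
            daemon.start();

            // strictShutdown=false mimics production: return quickly even if the
            // daemon thread has not exited yet.
            joinQuietly(daemon, false);
            System.out.println("Still alive after fast join: " + daemon.isAlive());
          }
        }
        {code}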

        cnauroth Chris Nauroth added a comment -

        Using a hidden configuration flag for this sounds appropriate to me. I agree that there is no need for a strict long wait on all threads in production operations if correctness doesn't depend on it.


          People

          • Assignee: Unassigned
          • Reporter: cnauroth Chris Nauroth
          • Votes: 0
          • Watchers: 6
