
Details

    • Sub-task
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • None
    • None
    • nodemanager

    Description

      The disk health checker verifies a disk by executing mkdir and rmdir periodically.
      If these operations do not return within a reasonable timeout, the disk should be marked bad, and nodeInfo.nodeHealthy should therefore flip to false.
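      To illustrate the expected behavior, here is a minimal sketch of a mkdir/rmdir probe bounded by a timeout. It is not the YARN implementation: the class TimedDiskProbe, the method probe, and the timeoutMs parameter are hypothetical names chosen for this example, and only standard JDK calls are used.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of YARN. */
final class TimedDiskProbe {
    /** Returns true only if mkdir + rmdir under dir complete within timeoutMs. */
    static boolean probe(Path dir, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-probe");
            t.setDaemon(true); // a hung probe must not keep the JVM alive
            return t;
        });
        Future<Boolean> result = executor.submit(() -> {
            Path probeDir = dir.resolve("probe-" + System.nanoTime());
            Files.createDirectory(probeDir); // mkdir
            Files.delete(probeDir);          // rmdir
            return true;
        });
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in file I/O may not respond to interruption.
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // mkdir or rmdir failed with an I/O error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

      Running the probe on a separate daemon thread means the caller regains control after the timeout even when the underlying system call never returns; the stuck thread itself may linger until the I/O completes.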

      Using Earthquake, our fault injector for distributed systems, I confirmed that current YARN does not have an implicit timeout (on JDK7, Linux 4.2, ext4).
      (I'll post the reproduction script shortly.)

      I think we can fix this issue by making NodeHealthCheckerServer.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
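      The proposed staleness check could look roughly like the sketch below. It is a standalone stand-in, not the actual patch: the class StaleReportCheck and the maxAgeMs threshold are assumptions made for this example, and the report time and current time are passed in explicitly so the logic can be exercised without a running NodeManager.

```java
/** Hypothetical stand-in for the proposed check; not the YARN class itself. */
final class StaleReportCheck {
    /**
     * @param lastReportedHealthy    result of the most recent completed health check
     * @param lastHealthReportTimeMs time of that check, e.g. from getLastHealthReportTime()
     * @param nowMs                  current time in milliseconds
     * @param maxAgeMs               assumed threshold after which a report counts as too old
     */
    static boolean isHealthy(boolean lastReportedHealthy,
                             long lastHealthReportTimeMs,
                             long nowMs,
                             long maxAgeMs) {
        // A checker stuck in mkdir/rmdir never refreshes its report time,
        // so an old timestamp is treated as evidence of a bad disk.
        boolean stale = nowMs - lastHealthReportTimeMs > maxAgeMs;
        return lastReportedHealthy && !stale;
    }
}
```

      With this shape, a disk check that hangs forever eventually makes the node report unhealthy, since the last report time stops advancing while the clock keeps moving.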

      Attachments

        1. concept-async-diskchecker.txt
          3 kB
          Akihiro Suda
        2. YARN-4301-1.patch
          7 kB
          Akihiro Suda
        3. YARN-4301-2.patch
          10 kB
          Akihiro Suda
        4. YARN-4301-3-fail.patch
          13 kB
          Akihiro Suda

        Issue Links

        Activity


          People

            suda Akihiro Suda
            suda Akihiro Suda

            Dates

              Created:
              Updated:
