
    Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels:

      Description

      The disk health checker verifies a disk by periodically executing mkdir and rmdir.
      If these operations do not return within a reasonable timeout, the disk should be marked bad, and nodeInfo.nodeHealthy should therefore flip to false.
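
      A minimal sketch of the expected behavior, not the actual NodeManager code: the mkdir/rmdir probe runs on a worker thread and the caller waits at most a fixed time. The class name TimedDiskCheck, the method checkDir, and the timeout parameter are hypothetical, and the sketch assumes Java 8 or later for the lambda syntax.

      ```java
      import java.io.File;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Hypothetical helper for illustration; not part of YARN.
      public class TimedDiskCheck {

          /** Returns true only if mkdir and rmdir both succeed within timeoutMs. */
          public static boolean checkDir(File parent, long timeoutMs) {
              // Daemon thread so that a probe stuck in the kernel cannot block JVM exit.
              ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
                  Thread t = new Thread(r, "disk-check");
                  t.setDaemon(true);
                  return t;
              });
              Future<Boolean> result = executor.submit(() -> {
                  File probe = new File(parent, "probe-" + System.nanoTime());
                  // mkdir, then remove the empty directory again (rmdir).
                  return probe.mkdir() && probe.delete();
              });
              try {
                  return result.get(timeoutMs, TimeUnit.MILLISECONDS);
              } catch (TimeoutException e) {
                  // The disk did not answer in time: treat it as bad.
                  result.cancel(true);
                  return false;
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return false;
              } catch (ExecutionException e) {
                  return false;
              } finally {
                  executor.shutdownNow();
              }
          }
      }
      ```

      Note that a thread blocked in a hung filesystem call may not respond to interruption, so the sketch only guarantees that the caller returns on time, not that the worker thread is reclaimed.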

      Using Earthquake, our fault injector for distributed systems, I confirmed that current YARN does not have an implicit timeout (on JDK7, Linux 4.2, ext4).
      (I will post the reproduction script shortly.)

      I think we can fix this issue by making NodeHealthCheckerService.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
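
      The staleness rule proposed here can be sketched as a pure function. This is an illustration under stated assumptions rather than the attached patch: the class StaleReportCheck, its method, and the maxAgeMs threshold are hypothetical names, and the time values stand in for what getLastHealthReportTime() and the current clock would supply.

      ```java
      // Hypothetical illustration of the proposed rule; not the actual YARN code.
      public class StaleReportCheck {

          /**
           * A node counts as healthy only if the last report said so AND that
           * report is no older than maxAgeMs. A hung disk checker stops
           * refreshing the report time, so the report eventually goes stale.
           */
          public static boolean isHealthy(boolean lastReportHealthy,
                                          long lastReportTimeMs,
                                          long nowMs,
                                          long maxAgeMs) {
              boolean fresh = (nowMs - lastReportTimeMs) <= maxAgeMs;
              return lastReportHealthy && fresh;
          }
      }
      ```

      For example, with maxAgeMs set to 60000, a healthy report from 30 seconds ago still counts as healthy, while the same report from two minutes ago does not.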

        Attachments

        1. YARN-4301-3-fail.patch
          13 kB
          Akihiro Suda
        2. YARN-4301-2.patch
          10 kB
          Akihiro Suda
        3. YARN-4301-1.patch
          7 kB
          Akihiro Suda
        4. concept-async-diskchecker.txt
          3 kB
          Akihiro Suda

              People

              • Assignee:
                suda Akihiro Suda
                Reporter:
                suda Akihiro Suda
              • Votes:
                0
                Watchers:
                10
