Hadoop YARN / YARN-3850

NM fails to read files from full disks which can lead to container logs being lost and other issues

    Details

    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      Container logs can be lost if a disk has become full (~90% full).
      When an application finishes, the NM uploads logs after aggregation by calling AppLogAggregatorImpl#uploadLogsForContainers. That call in turn looks up the eligible directories through LocalDirsHandlerService#getLogDirs, which returns nothing when the disk is full. As a result, none of the container logs are aggregated and uploaded.
      On application finish, the NM also calls AppLogAggregatorImpl#doAppLogAggregationPostCleanUp(), which deletes the application directory containing the container logs. The deletion goes ahead because this method calls LocalDirsHandlerService#getLogDirsForCleanup, which returns the full disks as well.
      So we are left with neither the aggregated logs nor the individual container logs for the app.
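
The mismatch between the two directory lookups can be illustrated with a small standalone model. This is only a sketch under assumptions, not NodeManager code: the class LogDirLossDemo, its method names and the disk paths are hypothetical stand-ins for the behaviour attributed above to getLogDirs and getLogDirsForCleanup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the behaviour described in this issue.
// None of these names exist in Hadoop; they only mirror the description.
class LogDirLossDemo {

    // Stand-in for the upload lookup: only directories on non-full disks.
    static List<String> dirsForUpload(List<String> goodDirs, List<String> fullDirs) {
        return new ArrayList<>(goodDirs);
    }

    // Stand-in for the cleanup lookup: good directories plus full ones.
    static List<String> dirsForCleanup(List<String> goodDirs, List<String> fullDirs) {
        List<String> all = new ArrayList<>(goodDirs);
        all.addAll(fullDirs);
        return all;
    }

    // Directories that get deleted without ever being uploaded.
    static List<String> lostDirs(List<String> goodDirs, List<String> fullDirs) {
        List<String> lost = dirsForCleanup(goodDirs, fullDirs);
        lost.removeAll(dirsForUpload(goodDirs, fullDirs));
        return lost;
    }

    public static void main(String[] args) {
        // Assumed scenario: every log disk has crossed the "full" threshold.
        List<String> good = Collections.emptyList();
        List<String> full = Arrays.asList("/disk1/logs", "/disk2/logs");

        System.out.println("uploaded from: " + dirsForUpload(good, full));
        System.out.println("deleted from:  " + dirsForCleanup(good, full));
        System.out.println("logs lost in:  " + lostDirs(good, full));
    }
}
```

In this model, any directory that appears in the cleanup list but not in the upload list loses its logs, which is the situation described above when all log disks are full.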

      In addition to this, there are two more issues:

      1. ContainerLogsUtil#getContainerLogDirs does not consider full disks, so the NM fails to serve logs from full disks through its web interface.
      2. RecoveredContainerLaunch#locatePidFile also does not consider full disks, so on container recovery the PID file may not be found.
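
Both of these are read paths, where a full disk is still perfectly readable. A minimal sketch of a lookup that searches full directories as well as good ones is shown below. It assumes nothing about the attached patches: ReadableDirLookup, dirsForRead, locate and the file name container.pid are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not Hadoop code: reads should consider full disks too.
class ReadableDirLookup {

    // Directories to search when reading: good ones first, then full ones.
    static List<Path> dirsForRead(List<Path> goodDirs, List<Path> fullDirs) {
        List<Path> all = new ArrayList<>(goodDirs);
        all.addAll(fullDirs);
        return all;
    }

    // Returns the first existing regular file matching relativePath, if any.
    static Optional<Path> locate(List<Path> dirs, String relativePath) {
        for (Path dir : dirs) {
            Path candidate = dir.resolve(relativePath);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for a good disk and a full disk.
        Path good = Files.createTempDirectory("good-dir");
        Path full = Files.createTempDirectory("full-dir");
        Files.createFile(full.resolve("container.pid"));

        List<Path> goodOnly = Collections.singletonList(good);
        List<Path> forRead = dirsForRead(goodOnly, Collections.singletonList(full));

        System.out.println("good dirs only:   " + locate(goodOnly, "container.pid").isPresent());
        System.out.println("good + full dirs: " + locate(forRead, "container.pid").isPresent());
    }
}
```

Searching only the good directories misses a file that lives on the full disk, while the combined list finds it. Writes would still need to be restricted to directories that have free space.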

        Attachments

        1. YARN-3850.01.patch
          7 kB
          Varun Saxena
        2. YARN-3850.02.patch
          13 kB
          Varun Saxena

            People

            • Assignee: Varun Saxena (varun_saxena)
            • Reporter: Varun Saxena (varun_saxena)
            • Votes: 0
            • Watchers: 12